**MACHINE LEARNING / **COMPUTATIONAL DATA ANALYSIS

**GEORGIA INSTITUTE OF TECHNOLOGY**

**FALL 2011: CS 7641, CSE 6740, ISyE 6740**

**Time and Place:** TR 13:35-14:55, Clough 152, extra hour review T 11:00-12:00 Klaus 2443.

**Staff:**

- Instructor: Guy Lebanon (office hours: T 3-4, Klaus 1308)
- TA: Long Tran (office hours: R 3-4, Klaus 1315)
- TA: Parikshit Ram (office hour: M 4-5, Klaus 1315. Update: out of town until exam week)

**Important Announcement:**

- Errata published
- course moving to Clough 152 which can house additional students (overrides should be approved)
- Stanford is offering this semester a free online ML course with streaming video.

**Textbook and other Material****: **

- G. Lebanon.
*The Analysis of Data – First Incomplete Draft,*2011 (online chapters). - Lecture notes and papers distributed in the second part of the course

**Grade Composition:**

- 31% homework
- 31% exam 1
- 31% exam 2
- 7% service (choose one from: insightful online discussion, scribe notes, tutorial preparation, demo, etc.)

**Homeworks:**

- Due 9/1/2011. Chapter 1, exercises 1-5. No need to submit anything for ex 1 but please do it! In ex 4, 5 choose interesting values of N. use the command options(expressions=500000) to increase the number of nested recursions allowed. Compare the timing of the recursion implementation as much as possible, and continue beyond that for the two other implementations. Also – please implement the log factorial as log(A)+log(B)+… rather than log(A*B*…) to avoid overflow (try it out to see what happens).
- Due 9/8/2011. (1) Using the mpg data describe the relationship between highway mpg and car manufacturer. Describe which companies produce the most and least fuel efficient cars and display a graph supporting your conclusion. (2) Using the mpg data explore the three-way relationship between highway mpg, city mpg, and model class. What are your conclusions? Display a graph supporting your conclusion. (3) Generate two sets of N random points using the function runif and display a corresponding scatter plot. If you save the file to disk what is the resulting file size for the following file formats: ps, pdf, jpeg, png. How do these values scale with increasing N. (4) The diamond dataset within ggplot2 contains 10 columns (price, carat, cut, color, etc.) for 53940 different diamonds. Type help(diamonds) for more information. Plot histograms for color, carat, and price and comment on their shapes. Investigate the three-way relationships among price, carat, and cut. What are your conclusions? Provide graphs that support your conclusions. If you encounter computational difficulties consider using a smaller dataframe whose rows are sampled from the original diamonds dataframe. Use the function sample to create a subset of indices that may be used to create the smaller dataframe.
- Due 9/20/2011. (1) Some functionality can be accomplished by both the reshapes and plyr packages. Show two distinct examples and provide code that accomplishes the same results with both packages. For such tasks, which of the two packages would you prefer? What are the pros and cons if the two packages? (2) Consider a continuous RV X whose density is f(x) = 2x for x ∈ (0,1) and 0 otherwise. Compute its expectation and variance. (3) We define the support size of a discrete RV X as the number of values it can achieve with non-zero probability (that number may be infinity). Assuming that Y is a function of a discrete RV X, what is the relationship between the support sizes of X and Y ? (4) Consider the uniform distribution over two disjoint intervals [a, b]∪[c, d]. Write down the pdf and cdf, and derive the expectation and variance.
- Due 9/29/2011. (1) Assuming that and (both in probability) prove that in probability. (2) Comment on whether the WLLN implies the CLT and on whether the CLT implies the WLLN. Motivate your answer. (3) Consider a multinomial random vector . What is the distribution of where ? (4) Show a specific pair of dependent random variables whose covariance is 0.
- Due 11/08/2011. (1) Write and simplify as much as possible Fisher’s QDA Bayes classifier in the binary case (assuming a general loss function). (2) Repeat (1) but assume now that both classes have the same covariance matrix. (3) Repeat (1) but assume now both covariance matrices are the identity. (4) Apply Fisher’s LDA to the iris dataframe (use all 4 dimensions and the binary 0/1 loss – use only the first two classes: setosa and versicolor). Implement the classifier rule yourself without using a package. Assume the covariance matrix is (a) full (b) diagonal (c) spherical. If you get non-invertible covariance matrices add a small value to the diagonal to make it invertible. Repeat 10 times: split the dataset into two equal parts randomly, use one half for training and the second for testing or evaluation. Report the error rate (frequency of making wrong classification) over the training set and over the testing set averaged over the 10 random train/test splits for the three covariance matrix cases. Which of the three covariance models performs the best?
- Due 11/22/2011. Read http://rss.acs.unt.edu/Rdoc/library/kernlab/html/spam.html, install the kernlab package in R and load the spam data. (1) Program a simple gradient descent procedure in R for computing the MLE for binary logistic regression with L2 regularization. (2) Repeat for 10 times: partition the spam data to 50% train and 50% test, train the MLE program you computed in (1) on the train and evaluate it on the test set. Report the average train set accuracy and the average test set accuracy and investigate how these values depend on regularization parameter. Use sufficient values of the regularization parameter in your investigation to obtain meaningful conclusions.

Please submit the homework in class (hardcopy) or by uploading it to tsquare (by midnight of due date). Homework will involve both theoretical work and programming. Please work alone unless explicitly mentioned otherwise. Violations of this policy will be reported to the Dean of Students.

**Pre-requisites:**

- Undergraduate multivariate Calculus
- Undergraduate linear algebra
- Undergraduate calculus-based probability
- Basic programming in C, C++, Java, Matlab, or R

The pre-requisites above are essential. And yes-this includes the probability part. The course will be fast paced and rigorous and if you are not prepared at the outset you will trail behind. There are several other outstanding ML courses in CoC that may be more appropriate for your background.

**Overview:**

Machine learning is a subfield of computer science involved with drawing conclusions from data. In contrast to statistics it emphasizes computing, algorithms, and large scale data. In contrast to artificial intelligence it emphasizes automatically analyzing data rather than human knowledge. The increase in data availability and computing power makes machine learning more relevant now than ever. It is an essential component in a wide range of application areas including web technology, computational biology, business analytics, information visualization, finance, image and signal processing, networking, system analysis, and robotics. Arguably, it is the fastest growing field in computer science.

This course is a rigorous and fast paced graduate level introduction to machine learning. It will emphasizes both theory and computing. The first part of the course will follow the textbook above and will cover introduction to R, visualizing and processing data, probability (self study), stochastic convergence, estimation theory and maximum likelihood theory, confidence intervals and bootstrap, regression and classification. The second part of the course will follow additional lecture notes provided by the instructor and recent papers. Topics include Bayesian estimation, clustering and the EM algorithm, dimensionality reduction, semisupervised learning, graphical models, and kernel machines.

This course is a pre-requisite to the graduate course ML II offered next semester. That course will cover advanced topics.

**Schedule:**

Exam 2: December 15, 2011: 14:50-17:40

Exam 1: October 11, 2011: in class

Date | Topic | Reading |

12/08/2011 | Markov decision processes, generalization error bounds | MDP |

12/06/2011 | Dimensionality reduction: MDS, PCA, and NMF | tutorial |

12/01/2011 | classification and regression trees | book-pdf |

11/29/2011 | classification and regression trees | book-pdf |

11/22/2011 | feature selection: subset-selection, step-wise, stage-wise, entropy, mutual information, and information gain | cover and thomas chapter 2 |

11/17/2011 | linear regression | chapter 12 |

11/15/2011 | linear regression | chapter 12 |

11/10/2011 | Mercer kernels and non-linear SVM, exponential loss and boosting | chapter 13 |

11/08/2011 | k-nearest neighbor classification, clustering: k-means, hierarchical clustering (Guy is out of town, Pari is teaching) | |

11/03/2011 | support vector machines | chapter 12 |

11/01/2011 | support vector machines | chapter 12 |

10/27/2011 | discriminative classification, logistic regression | gchapter 12 |

10/25/2011 | generative classification, Fisher’s quadratic discriminant and Fisher’s linear discriminant | chapter 12 |

10/20/2011 | classification: loss and risk, bayes classifier, two bag of words document representations | chapter 12 |

10/18/2011 | fall break | |

10/13/2011 | bootstrap | chapter 11 |

10/11/2011 | midterm exam | |

10/06/2011 | confidence intervals: pivots, small sample CI, large sample CI, CI for the MLE | chapter 11 |

10/04/2011 | maximum likelihood estimation | chapter 10 |

09/27/2011 | maximum likelihood estimation | Sections 10.4-10.6 |

09/22/2011 | consistency and the Cramer Rao lower bound | Sections 10.3 |

09/20/2011 | point estimation, bias, variance, and mse | Sections 10-10.2 |

09/15/2011 | stochastic convergence | chapter 9 |

09/13/2011 | review of probability 2 | chapters 7-8 |

09/08/2011 | review of probability 1 | chapters 4-6 |

09/06/2011 | skewed data and power transformations, cast/melt with the reshape package and split apply combine with the plyr package | Sections 3.3-3.4 |

09/01/2011 | qq-plots, handling missing data and outliers | Sections 2.9-2.11, 3.1-3.2 |

08/30/2011 | scatter plots, smoothed scatter plots, facets, line plots, contour plots, quantiles, box-plots | Sections 2.5-2.8 |

08/25/2011 | R graphics, strip plots, histograms, smoothed histograms | Sections 2.1-2.4 |

08/23/2011 | Course overview, introduction to R | chapter 1 |

The schedule above contains pointers to both pdf notes, blog posts, and textbook chapters (identifiable via a missing hyperlink). Please read all of them! If you have questions or comments please feel free to post messages on the discussion board.

** **