**DATA AND VISUAL ANALYTICS**

**GEORGIA INSTITUTE OF TECHNOLOGY**

**SPRING 2011: CSE8803-DVA, CS4803-DVA**

(Data Visualization using Grapheur)

**Time and Place:** MW 13:05-14:25, Cherry Emerson 204

**Instructor:** Guy Lebanon (office hours: M 14:30-15:30, Klaus 1308)

**Grader:** Bharathi Ravishanker (office hours: F 10:00-11:00 @ CCB commons, bravishanker3@gatech.edu)

**Grade Composition:**

- 30% homework
- 30% project
- 40% final exam


**Pre-requisites:**

- Undergraduate multivariate calculus
- Undergraduate linear algebra
- Undergraduate calculus-based probability (e.g., first 13 of my probability notes)
- Programming in one of the following languages: C, C++, Python, Java, Matlab, R, Perl

**Homework and Project Policy:** Homework and project work will involve both theoretical work and programming. Please work alone unless explicitly mentioned otherwise. Violations of this policy will be reported to the Dean of Students.

**Computing with R:** We will make extensive use of R in this course. R is a language specifically designed for data analysis and visualization, and it is significantly more powerful than Matlab in this regard. Usage of R is growing very fast, in part due to its open source nature and platform independence, and learning it is a worthy time investment. Relevant links are: R’s website, What is R?, official R manuals. Tutorial notes by Guy Lebanon are available here.

**Books:**

- R. Battiti and M. Brunato. *Reactive Business Intelligence* (online).
- H. Wickham. *ggplot2: Elegant Graphics for Data Analysis*. Springer 2009 (sample chapter and code).
- J. Adler. *R in a Nutshell*. O’Reilly Press 2010.

**Schedule:**

| Date | Topic | Reading |
|---|---|---|
| 03/14/2011 | Project Discussion | |
| 03/09/2011 | Entropy and mutual information, classification and regression trees, Fisher’s LDA, Naive Bayes | 1 |
| 03/07/2011 | Low rank approximation using SVD, latent semantic analysis | 1 |
| 03/02/2011 | Low rank approximation using SVD, principal component analysis, non-negative matrix factorization | 1 |
| 02/28/2011 | Association Rule Mining | 1 |
| 02/23/2011 | Non-metric multidimensional scaling, local multidimensional scaling | 1 |
| 02/21/2011 | Dimensionality reduction, multidimensional scaling | 1 |
| 02/16/2011 | Split-apply-combine and the plyr package, case study: baseball batting data | 1 |
| 02/14/2011 | Linear regression formulas in R, the reshape package with cast() and melt() | 1 |
| 02/09/2011 | Linear Regression, residual plots, regression in R with lm() | 1 |
| 02/07/2011 | Graphing multivariate data, power transformations | 1 |
| 02/02/2011 | Maximum likelihood estimation | 1 |
| 01/31/2011 | Graphing 1-D Numeric Data | 1, 2 |
| 01/26/2011 | Plotting densities in R with qplot() and ggplot() | 1 |
| 01/24/2011 | A taxonomy of data, review of univariate distributions | 1, 2 |
| 01/19/2011 | Course Overview, Data and Visual Analytics, Introduction to R | 1, 2 |

The schedule above contains pointers to PDF notes, blog posts, and textbook chapters (identifiable via a missing hyperlink). Please read all of them! If you have questions or comments, please feel free to post them on the blog posts.

**Assignments:**

- Project
- Assignment 3 (due 2/16/2011): (1) Provide a full and detailed derivation of the MLE for theta=(mu,sigma^2) in the case of normally distributed data. (2) Derive the MLE for the parameter of a Poisson distribution. (3) Sample 10 points from a Poisson distribution with parameter 20, and compute the MLE (either numerically or using the MLE formula you derived). Repeat this procedure 1000 times, plot a histogram of the 1000 MLE values, and superimpose on it the true parameter value. (4) Repeat (3) but replace 10 with 100 and then with 500. (5) Reason about the shapes of the three histograms of MLE values and connect them to the theoretical MLE properties discussed in class (consistency, asymptotic normality).
- Assignment 2 (due 2/7/2011): (1) Read the assigned reading including [2] (no submission needed) (2) The diamond dataframe has 10 columns. For each column, explain what it corresponds to, what type of data it is (refer to the data taxonomy), graph it using one of the methods in the lecture notes [1], and comment on what distribution would be a good model for it (if no standard distribution applies explain why and try to come up with your own distribution). (3) Sample 200 points from a mixture of Gaussians 0.3 N(0,1) + 0.7 N(3,1), plot the corresponding histogram in R and overlay it with the density.
- Assignment 1 (due 1/26/2011): (1) Read the note “A Quick Introduction to R” distributed in class and the first 6 chapters of the official R tutorial http://cran.r-project.org/doc/manuals/R-intro.pdf (no need to submit anything; post any comments or questions on the course website). (2) Download and install R. Execute the sample session in Appendix A of the official R tutorial (see (1) above). Print and submit the command line output (use the function sink(file='echoFile.txt', split=TRUE)).
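
The simulation procedure in Assignment 3, part (3), could be set up roughly as follows. This is only a minimal sketch, not a reference solution: the helper name `poisson_mle`, the seed, the search interval, and the use of base R graphics (rather than ggplot2) are all illustrative assumptions. It computes each MLE numerically by maximizing the Poisson log-likelihood with `optimize()`, which is one of the two options the assignment allows.

```r
# Illustrative sketch for Assignment 3, part (3); names and seed are assumptions.
set.seed(2011)

# Numerically maximize the Poisson log-likelihood for a sample x.
poisson_mle <- function(x) {
  loglik <- function(lambda) sum(dpois(x, lambda, log = TRUE))
  optimize(loglik, interval = c(1e-6, max(x) + 1), maximum = TRUE)$maximum
}

true_lambda <- 20
n <- 10       # sample size; part (4) changes this to 100 and then 500
reps <- 1000  # number of repetitions

# One MLE per simulated sample of size n.
mle_values <- replicate(reps, poisson_mle(rpois(n, true_lambda)))

# Histogram of the MLE values with the true parameter superimposed.
hist(mle_values, breaks = 30,
     main = paste("Poisson MLE, n =", n), xlab = "MLE of lambda")
abline(v = true_lambda, col = "red", lwd = 2)
```

Changing `n` to 100 and 500 and rerunning gives the other two histograms needed for parts (4) and (5).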

Please contribute any interesting discussion or questions to the blog posts: those concerning data and visual analytics here, and those concerning R programming here.
