DATA AND VISUAL ANALYTICS

GEORGIA INSTITUTE OF TECHNOLOGY

SPRING 2011: CSE8803-DVA, CS4803-DVA

Grapheur

(Data Visualization using Grapheur)

Time and Place: MW 13:05-14:25, Cherry Emerson 204

Instructor: Guy Lebanon (office hours: M 14:30-15:30, Klaus 1308)

Grader: Bharathi Ravishanker (office hours: F 10:00-11:00 @ CCB commons, bravishanker3@gatech.edu)

Grade Composition:

  • 30% homework
  • 30% project
  • 40% final exam

Pre-requisites:

  • Undergraduate multivariate calculus
  • Undergraduate linear algebra
  • Undergraduate calculus-based probability (e.g., first 13 of my probability notes)
  • Programming in one of the following languages: C, C++, Python, Java, Matlab, R, Perl

Homework and Project Policy: Homework and project work will involve both theoretical work and programming. Please work alone unless explicitly mentioned otherwise. Violations of this policy will be reported to the Dean of Students.

Computing with R: We will make extensive use of R in this course. R is a language designed specifically for data analysis and visualization, and it is significantly more powerful than Matlab in this regard. Usage of R is growing very fast, in part due to its open-source nature and platform independence, and learning it is a worthwhile time investment. Relevant links are: R’s website, What is R?, official R manuals. Tutorial notes by Guy Lebanon are available here.

Books:

  1. R. Battiti and M. Brunato. Reactive Business Intelligence (online).
  2. H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer 2009 (sample chapter and code)
  3. J. Adler. R in a Nutshell. O’Reilly Press 2010.

Schedule:

Date | Topic | Reading
03/14/2011 | Project Discussion |
03/09/2011 | Entropy and mutual information, classification and regression trees, Fisher’s LDA, Naive Bayes | 1
03/07/2011 | Low rank approximation using SVD, latent semantic analysis | 1
03/02/2011 | Low rank approximation using SVD, principal component analysis, non-negative matrix factorization | 1
02/28/2011 | Association Rule Mining | 1
02/23/2011 | Non-metric multidimensional scaling, local multidimensional scaling | 1
02/21/2011 | Dimensionality reduction, multidimensional scaling | 1
02/16/2011 | Split-apply-combine and the plyr package, case study: baseball batting data | 1
02/14/2011 | Linear regression formulas in R, the reshape package with cast() and melt() | 1
02/09/2011 | Linear Regression, residual plots, regression in R with lm() | 1
02/07/2011 | Graphing multivariate data, power transformations | 1
02/02/2011 | Maximum likelihood estimation | 1
01/31/2011 | Graphing 1-D Numeric Data | 1, 2
01/26/2011 | Plotting densities in R with qplot() and ggplot() | 1
01/24/2011 | A taxonomy of data, review of univariate distributions | 1, 2
01/19/2011 | Course Overview, Data and Visual Analytics, Introduction to R | 1, 2

The schedule above contains pointers to pdf notes, blog posts, and textbook chapters (identifiable via a missing hyperlink). Please read all of them! If you have questions or comments, please feel free to post messages on the blog posts.

Assignments:

  • Project
  • Assignment 3 (due 2/16/2011): (1) Provide a full and detailed derivation of the MLE for theta=(mu,sigma^2) in the case of normally distributed data. (2) Derive the MLE for the parameter of a Poisson distribution. (3) Sample 10 points from a Poisson distribution with parameter 20, and compute the MLE (either numerically or using the MLE formula you derived). Repeat this procedure 1000 times, plot a histogram of the 1000 MLE values, and superimpose on it the true parameter value. (4) Repeat (3) but replace 10 with 100 and then with 500. (5) Reason about the shapes of the three histograms of MLE values and connect them to the theoretical MLE properties discussed in class (consistency, asymptotic normality).
  • Assignment 2 (due 2/7/2011): (1) Read the assigned reading including [2] (no submission needed). (2) The diamonds dataframe has 10 columns. For each column, explain what it corresponds to, what type of data it is (refer to the data taxonomy), graph it using one of the methods in the lecture notes [1], and comment on what distribution would be a good model for it (if no standard distribution applies, explain why and try to come up with your own distribution). (3) Sample 200 points from a mixture of Gaussians 0.3 N(0,1) + 0.7 N(3,1), plot the corresponding histogram in R, and overlay it with the density.
  • Assignment 1 (due 1/26/2011): (1) Read the note “A Quick Introduction to R” distributed in class and the first 6 chapters of the official R tutorial, http://cran.r-project.org/doc/manuals/R-intro.pdf (no need to submit anything; if you have questions or comments, enter them on the course website). (2) Download and install R. Execute the sample session in Appendix A of the official R tutorial (see (1) above). Print and submit the command line output (use the function sink(file='echoFile.txt', split=TRUE)).
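
The simulation described in Assignment 3, parts (3) and (4), might be set up roughly as follows. This is only a minimal sketch, not a reference solution. It assumes the Poisson MLE is the sample mean, which is the result part (2) asks you to derive. The helper name simulate_mle, the seed, and the number of histogram breaks are illustrative choices that do not come from the assignment.

```r
# Sketch of the simulation in Assignment 3, parts (3)-(4).
# Assumption: the MLE of the Poisson parameter is the sample mean.
set.seed(1)            # arbitrary seed, used only for reproducibility

lambda <- 20           # true parameter (from the assignment)
reps   <- 1000         # number of repetitions (from the assignment)
sizes  <- c(10, 100, 500)

# Hypothetical helper: draw `reps` samples of size n and
# return the MLE (sample mean) of each sample.
simulate_mle <- function(n, lambda, reps) {
  replicate(reps, mean(rpois(n, lambda)))
}

# One vector of 1000 MLE values per sample size.
mles <- lapply(sizes, simulate_mle, lambda = lambda, reps = reps)
names(mles) <- sizes

# One histogram per sample size; the vertical line marks the true parameter.
par(mfrow = c(1, 3))
for (n in sizes) {
  hist(mles[[as.character(n)]], breaks = 30,
       main = paste("n =", n), xlab = "MLE of lambda")
  abline(v = lambda, col = "red", lwd = 2)
}
```

Comparing the spread of the three histograms, for example with sapply(mles, sd), is one way to start on part (5).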

Please contribute any interesting discussion or questions concerning data and visual analytics here, and concerning R programming here.