In R, data is normally stored in an object called a data frame. Roughly, this corresponds to a matrix where each row is a data sample and each column is a dimension. There are two main differences between a data frame and a matrix: (a) each column of a data frame may have a different data type (binary, numeric, categorical, etc.), whereas a matrix is homogeneous, and (b) each column of a data frame has a name corresponding to that dimension, which may be used to refer to the dimension instead of the less intuitive and harder to remember column index.
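The two differences can be seen in a small sketch. The values and the object names m and df are made up for illustration only.

```r
# A matrix is homogeneous: mixing numbers and strings coerces every cell to character.
m <- cbind(c(5.1, 4.9), c("Setosa", "Versicolor"))
typeof(m)

# A data frame keeps one type per column and gives every column a name.
df <- data.frame(sepal.length = c(5.1, 4.9),
                 variety = factor(c("Setosa", "Versicolor")))
sapply(df, class)
```

In the matrix the numbers end up stored as strings, while the data frame keeps sepal.length numeric and variety categorical.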
Reading Data into a Data Frame Object
In most cases, the data exists as a text or binary file that can be loaded into a data frame. The simplest and most common case is a text file with one sample per row and fields separated by tabs or commas (missing values should be denoted by NA). The names of the columns may be given in the first row of the file. Such a file is loaded with the command read.table, whose header argument indicates whether the first row holds the dimension names or the first sample; for comma-separated files, also pass sep="," since the default separator is whitespace. The names should not contain spaces (use . to separate words) or other special characters, or they may be parsed incorrectly into the frame.
For example, consider the Iris dataset file (4 measurements for each of 150 iris flowers of three different types), which does have an initial line with the dimension names. We load it using read.table with header=TRUE and store the result in a data frame named ID.
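A minimal sketch of such a call follows. The name of the actual Iris file is not given here, so the sketch writes three sample rows to a temporary file as a stand-in. The stringsAsFactors argument is an assumption: since R 4.0.0, string columns are read as plain character vectors unless it is set to TRUE.

```r
# Write a tiny tab-separated sample to a temporary file (stand-in for the real data file).
path <- tempfile(fileext = ".txt")
writeLines(c("sepal.length\tsepal.width\tpetal.length\tpetal.width\tvariety",
             "5.1\t3.5\t1.4\t0.2\tSetosa",
             "4.9\t3.0\t1.4\t0.2\tSetosa",
             "7.0\t3.2\t4.7\t1.4\tVersicolor"), path)

# header = TRUE: the first row holds the dimension names, not a sample.
# stringsAsFactors = TRUE: read string columns as categorical factors
# (this was the default before R 4.0.0).
ID <- read.table(path, header = TRUE, sep = "\t", stringsAsFactors = TRUE)
str(ID)
```

Calling str(ID) prints the type of every column, which is a quick check that the header and the column types were parsed as intended.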
There are other R routines for loading text files in different formats. It is also possible, of course, to convert such files to the above form using a word processor or a script.
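Two such routines are read.csv and read.delim, which are variants of read.table preconfigured for comma-separated and tab-separated files with a header row. The sketch below reads from inline strings through the text argument so that it needs no file; the two sample rows are made up.

```r
# read.csv: comma-separated fields, header row expected by default.
csv_text <- "sepal.length,sepal.width,variety\n5.1,3.5,Setosa\n7.0,3.2,Versicolor"
from_csv <- read.csv(text = csv_text)

# read.delim: tab-separated fields, header row expected by default.
tsv_text <- "sepal.length\tsepal.width\tvariety\n5.1\t3.5\tSetosa\n7.0\t3.2\tVersicolor"
from_tsv <- read.delim(text = tsv_text)

# Both calls should yield the same data frame.
all.equal(from_csv, from_tsv)
```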
We can see a summary of the data frame using the command summary
> summary(ID)
  sepal.length    sepal.width     petal.length    petal.width          variety
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   Setosa    :50
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   Versicolor:50
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   Virginica :50
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
We immediately see several things: (a) R correctly identified the first row as the names of the variables or dimensions, (b) it correctly identified the first four variables as numeric and the last one as categorical (since the first four columns contained numbers and the last contained strings), and (c) it computed the min, max, mean, and quartiles of each numeric variable and a count of each distinct value of the categorical variable.
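The pieces of that summary can also be computed one at a time. The sketch below assumes R's built-in iris data set as a stand-in for the file, with its columns renamed to match ID; the built-in set spells the category labels in lowercase.

```r
# Stand-in for the loaded file: the built-in iris data with renamed columns.
ID <- iris
names(ID) <- c("sepal.length", "sepal.width", "petal.length", "petal.width", "variety")

sapply(ID, class)          # type of each column: numeric or factor
quantile(ID$sepal.length)  # min, quartiles, and max of one numeric column
mean(ID$sepal.length)      # mean of the same column
table(ID$variety)          # number of samples in each category
```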
The list of dimensions names is obtained using the command names
> names(ID)
[1] "sepal.length" "sepal.width"  "petal.length" "petal.width"  "variety"
Accessing Data in Data Frames
The most straightforward way to access values in a data frame is the matrix index notation, for example, ID[3,5] for the fifth dimension of the third sample or ID[,2] for the second dimension of all samples. The data is displayed with a header line containing the dimension names above the values.
> ID[1,]
  sepal.length sepal.width petal.length petal.width variety
1          5.1         3.5          1.4         0.2  Setosa
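A few more forms of the index notation are sketched below, again assuming the built-in iris data set with renamed columns as a stand-in for ID. The last two lines go slightly beyond the text above: they select columns by name and rows by a logical condition.

```r
# Stand-in for the loaded file: the built-in iris data with renamed columns.
ID <- iris
names(ID) <- c("sepal.length", "sepal.width", "petal.length", "petal.width", "variety")

ID[3, 5]                               # fifth dimension of the third sample
head(ID[, 2])                          # second dimension, first few samples
ID[1:3, c("sepal.length", "variety")]  # rows 1 to 3, two columns chosen by name
ID[ID$sepal.length > 7.5, ]            # all samples that satisfy a condition
```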
A more intuitive way is to refer to the dimensions by their names rather than the column index. This is done using the $ notation. For example to display the sepal.length measurement of the first 10 samples:
> ID$sepal.length[1:10]
 [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9
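The $ notation is one of several equivalent ways to pick a column. The sketch below compares them, assuming the built-in iris data set with renamed columns as a stand-in for ID; the variable names by_dollar, by_bracket, by_name, and by_index are illustrative.

```r
# Stand-in for the loaded file: the built-in iris data with renamed columns.
ID <- iris
names(ID) <- c("sepal.length", "sepal.width", "petal.length", "petal.width", "variety")

by_dollar  <- ID$sepal.length[1:10]        # column by name with $
by_bracket <- ID[["sepal.length"]][1:10]   # column by name with [[ ]]
by_name    <- ID[1:10, "sepal.length"]     # matrix notation with a column name
by_index   <- ID[1:10, 1]                  # matrix notation with a column index

# All four forms return the same numeric vector.
identical(by_dollar, by_bracket) && identical(by_name, by_index)
```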
If one data frame is used frequently, it may be attached. We can then refer to variable names without the data frame name and the $ sign:
> attach(ID)
> sepal.length[2]
[1] 4.9
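An attached data frame stays on the search path until it is detached, and its column names can be hidden by other objects with the same names. The sketch below shows the attach and detach pair, followed by with(), which gives the same short names for a single expression only. It assumes the built-in iris data set with renamed columns as a stand-in for ID.

```r
# Stand-in for the loaded file: the built-in iris data with renamed columns.
ID <- iris
names(ID) <- c("sepal.length", "sepal.width", "petal.length", "petal.width", "variety")

attach(ID)         # columns become visible by name
sepal.length[2]
detach(ID)         # remove the data frame from the search path again

# with() evaluates one expression with the columns in scope, nothing to detach.
with(ID, sepal.length[2])
```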
Besides being easy to remember, code, and debug, referring to variables by name rather than by column index has the benefit that plots and other graphics automatically label the axes with the appropriate string. A simple example is
plot(sepal.length ~ sepal.width)
which produces the following figure (with correct labels on x and y axes automatically)
This “automatic labeling” of the axes also works with much more complex graphics (trellis, multiple panels, etc.).
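The formula form of plot also accepts a data argument, so the same labeled figure can be produced without attaching the data frame. A sketch, assuming the built-in iris data set with renamed columns as a stand-in for ID:

```r
# Stand-in for the loaded file: the built-in iris data with renamed columns.
ID <- iris
names(ID) <- c("sepal.length", "sepal.width", "petal.length", "petal.width", "variety")

# The variable names in the formula are looked up in ID and reused as axis labels:
# sepal.width on the x axis, sepal.length on the y axis.
plot(sepal.length ~ sepal.width, data = ID)
```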