In R, data is normally stored in an object called a data frame. Roughly, this corresponds to a matrix in which each row is a data sample and each column is a dimension. There are two main differences between a data frame and a matrix: (a) each column of a data frame may have a different data type (binary, numeric, categorical, etc.), whereas a matrix is homogeneous, and (b) each column of a data frame has a specific name corresponding to that dimension, which may be used to refer to the dimension instead of the less intuitive and harder to remember column index.
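To make the contrast concrete, here is a minimal sketch that builds a small data frame and a matrix from the same values. The object names (df, m) and the values are invented for illustration only.

```r
# Hypothetical toy data: three samples, two dimensions of different types
df <- data.frame(height = c(1.2, 3.4, 5.6),
                 kind   = c("a", "b", "a"))

# Each column of a data frame keeps its own type
sapply(df, class)

# A matrix must be homogeneous, so cbind() coerces the numbers to strings
m <- cbind(height = c(1.2, 3.4, 5.6), kind = c("a", "b", "a"))
typeof(m)

# Columns of a data frame can be addressed by name
df$height
```

In the matrix the numeric values can no longer be used in arithmetic without converting them back, which is the practical cost of homogeneity.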

Reading Data into a Data Frame Object

In most cases, the data exists as a text or binary file which may be loaded into a data frame. The easiest and most common case is a text file in which each row represents a sample and the values are separated by tabs or commas (missing values should be denoted by NA). The names of the columns may be specified in the first row of the file. In this case we load the data using the command read.table, with the header argument indicating whether the first row holds the dimension names or the first sample. By default read.table splits fields on whitespace, so a comma-separated file also needs the argument sep=",". Note that the names should not contain spaces or other special characters (use . to separate words), or they may be parsed incorrectly into the frame.
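The effect of the header argument can be sketched as follows. The file is a made-up two-sample example written to a temporary location, so the path and its contents are assumptions for illustration rather than the dataset used below.

```r
# Write a small tab-separated file to a temporary location (made-up data)
path <- tempfile(fileext = ".txt")
writeLines(c("sepal.length\tvariety",
             "5.1\tSetosa",
             "NA\tVersicolor"), path)

# header = TRUE: the first row holds the dimension names
with_header <- read.table(path, header = TRUE, sep = "\t")

# header = FALSE: the first row is treated as an ordinary sample
no_header <- read.table(path, header = FALSE, sep = "\t")

nrow(with_header)                    # 2 samples
nrow(no_header)                      # 3 rows: the names were read as data
is.na(with_header$sepal.length[2])   # the NA entry became a missing value
```

Without the header, R assigns default names (V1, V2, ...) and the first column is no longer numeric, since it now contains the string "sepal.length".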

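For example, consider the Iris dataset file (4 measurements for each of 150 iris flowers of three different types), which does have an initial line with the dimension names. In R 4.0.0 and later, strings are read as plain character data by default, so we also pass stringsAsFactors = TRUE to have the text column stored as a categorical variable. We load the file using

ID <- read.table("iris.txt", header = TRUE, stringsAsFactors = TRUE)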

There are other R routines for loading text files in different formats, such as read.csv and read.delim for comma-separated and tab-separated files. It is of course also possible to convert a file to the above form using a text editor or a script.
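As a rough illustration of one such routine, the sketch below reads a small comma-separated file with read.csv and with an equivalent read.table call. The file contents are invented and written to a temporary path.

```r
# Made-up comma-separated file in a temporary location
path <- tempfile(fileext = ".csv")
writeLines(c("sepal.length,variety",
             "5.1,Setosa",
             "4.9,Setosa"), path)

# read.csv assumes a header row and a comma separator by default
from_csv <- read.csv(path)

# The same file read with read.table and explicit arguments
from_table <- read.table(path, header = TRUE, sep = ",")

# Compare the two results
all.equal(from_csv, from_table)
```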

We can see a summary of the data frame using the command summary

summary(ID)
sepal.length    sepal.width     petal.length    petal.width
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
 Median :5.800   Median :3.000   Median :4.350   Median :1.300
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
       variety
 Setosa    :50
 Versicolor:50
 Virginica :50

We immediately see several things: (a) R correctly identified the first row as the names of the variables or dimensions, (b) it correctly identified the first four variables as numeric and the last one as categorical (since the first four columns contained numbers and the last contained strings), and (c) it computed the min, max, mean and quartiles for each of the numeric variables and a count of each distinct value for the categorical variable.
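The individual pieces of this summary can also be obtained one at a time. The sketch below uses R's built-in iris data set as a stand-in for the file, which is an assumption: its column names (Sepal.Length, ..., Species) differ from those in iris.txt.

```r
# Built-in stand-in for the file used in the text
data(iris)

# Type of each column: four numeric columns and one factor
sapply(iris, class)

# Counts per category, as in the last block of the summary
table(iris$Species)

# Min, quartiles and max of a single numeric column
quantile(iris$Sepal.Length)

# Mean of the same column
mean(iris$Sepal.Length)
```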

The list of dimension names is obtained using the command names

names(ID)
[1] "sepal.length" "sepal.width"  "petal.length" "petal.width"  "variety"
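The same command can be used to change names, and make.names shows how R turns an arbitrary string into a syntactically valid name. This is a small sketch on an invented data frame, not on the Iris file.

```r
# Hypothetical two-column data frame
df <- data.frame(a = 1:2, b = c("x", "y"))
names(df)

# Assigning to names() renames a column in place
names(df)[2] <- "label"
names(df)

# make.names() replaces invalid characters, such as spaces, with dots
make.names("sepal length")
```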

Accessing Data in Data Frames

The most straightforward way to access values in a data frame is the matrix index notation: for example, ID[3,5] for the fifth dimension of the third sample, or ID[,2] for the second dimension of all samples. When whole rows are selected, the data is displayed with a header line containing the dimension names above the values.

ID[1,]
  sepal.length sepal.width petal.length petal.width variety
1          5.1         3.5          1.4         0.2  Setosa
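A few more variations of the index notation are sketched below, again on the built-in iris data set as a stand-in (an assumption, since its column names differ from those of the file).

```r
data(iris)

# Fifth dimension of the third sample
iris[3, 5]

# Second dimension, first few samples only
head(iris[, 2])

# Rows 1 to 3, columns 1 and 5
iris[1:3, c(1, 5)]

# Rows selected by a logical condition on one of the dimensions
iris[iris$Sepal.Length > 7.5, ]

# Number of samples and number of dimensions
dim(iris)
```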

A more intuitive way is to refer to the dimensions by their names rather than by the column index. This is done using the $ notation. For example, to display the sepal.length measurement of the first 10 samples:

ID$sepal.length[1:10]
 [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9
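Names can also be used inside the bracket notation, which is convenient when the name is held in a variable. A minimal sketch on the built-in iris data set (a stand-in with slightly different column names):

```r
data(iris)

# Three equivalent ways to get the first 10 values of one dimension
by_dollar  <- iris$Sepal.Length[1:10]
by_double  <- iris[["Sepal.Length"]][1:10]
by_bracket <- iris[1:10, "Sepal.Length"]

# The name may be stored in a variable (hypothetical variable name)
column <- "Sepal.Length"
by_variable <- iris[[column]][1:10]

by_dollar
```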

If one data frame is used frequently, it may be attached. We can then refer to its variables by name, without the data frame name and the $ sign.

attach(ID)
sepal.length[10]
[1] 4.9
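An attached data frame stays on the search path until it is detached, and the function with offers a similar shorthand for a single expression without attaching anything. The sketch below uses the built-in iris data set as a stand-in.

```r
data(iris)

# Attach, use a variable by name, then detach when done
attach(iris)
Sepal.Length[10]
detach(iris)

# with() evaluates one expression inside the data frame
# without changing the search path
with(iris, Sepal.Length[10])
```

Detaching avoids surprises later in a session, when a variable of the same name in the workspace would take precedence over the attached column.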

Besides being easy to remember and to code and debug, referring to variables by name rather than by column index has the benefit that plots and other graphics automatically label the axes with the appropriate string. A simple example is

plot(sepal.length ~ sepal.width)

which produces the following figure (with correct labels on x and y axes automatically)

This “automatic labeling” of the axes also works with much more complex graphics (trellis, multiple panels, etc.).
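The same kind of plot can be drawn without attaching the data frame by passing it through the data argument. The sketch below uses the built-in iris data set as a stand-in and sends the graphics to a null PDF device so that it runs non-interactively; the replacement axis labels are illustrative.

```r
data(iris)

# Discard graphics output so the sketch runs without a display
pdf(NULL)

# Axis labels are taken from the variable names in the formula
plot(Sepal.Length ~ Sepal.Width, data = iris)

# The automatic labels can still be overridden explicitly
plot(Sepal.Length ~ Sepal.Width, data = iris,
     xlab = "Sepal width (cm)", ylab = "Sepal length (cm)")

# Close the graphics device
dev.off()
```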