R and SPSS do many things differently. R is a command-line program, whereas SPSS uses a point-and-click GUI. While R's interface is faster and more flexible, some things can be jarring to a veteran SPSS user.
If you have data saved already in an SPSS format, you can read it into R using the following commands:
library('foreign')
my.data <- read.spss('path/to/my_data.sav', to.data.frame = TRUE)  #Return a data frame rather than a list
Simply typing the name of the dataset (in this case “my.data”) will show you the data. SPSS has an annoying convention of naming variables in capital letters. Unless you prefer typing with capslock on, it’s probably best to change the variable names to lowercase. In addition, we attach the dataset to our workspace, so that we can refer to the variables without specifying which dataset they are in.
names(my.data) <- tolower(names(my.data))
attach(my.data)
my.data
Data can also be read in a number of other formats documented on this wiki. read.csv() and read.table() are especially useful for text files.
When saving data, it is best to use a text format. Text formats can be read and written on any computer and with almost any statistics program. Unlike SPSS’s proprietary format, text files do not go out of date and hence are suitable for archiving.
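As a minimal sketch of the text-file round trip described above, the following writes R's built-in USArrests data frame to a CSV file and reads it back. The temporary file path and the object names csv.path and arrests.copy are chosen for illustration only.

```r
# Write the built-in USArrests data to a CSV file in a temporary location.
# tempfile() is used here only so the example does not touch your own files.
csv.path <- tempfile(fileext = ".csv")
write.csv(USArrests, file = csv.path, row.names = TRUE)

# Read it back; row.names = 1 uses the first column (state names) as row names.
arrests.copy <- read.csv(csv.path, row.names = 1)

# The round trip should preserve the dimensions and column names.
dim(arrests.copy)
names(arrests.copy)
```

In practice you would replace the temporary path with a file name of your own.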
For the purposes of this document, we shall use one of R’s built-in datasets. You can access it by doing the following:
data(USArrests)
attach(USArrests)
USArrests
The function sort() sorts a variable into increasing or decreasing order.
Murder                        #Unsorted -- as entered
sort(Murder)                  #Sorted in increasing order
sort(Murder,decreasing=TRUE)  #Sorted in decreasing order
The function order() returns the index positions one would need to sort other variables. For indexing variables, see the R manual or this wiki. Indexing works as in many programming languages, except that indexes begin at 1 instead of 0. my.variable[1] is the first element of the variable, and my.data[1,4] is the first element of the fourth variable in the data frame.
o <- order(Murder)                              #The indexes for sorting by Murder increasing
o.decreasing <- order(Murder, decreasing=TRUE)  #The indexes for sorting by Murder decreasing
These orders can be used to sort the entire dataset if desired.
USArrests[o,]             #USArrests sorted by lowest murder rate first
USArrests[o.decreasing,]  #USArrests sorted by highest murder rate first
We don’t need to store the indexes in a separate variable, although sometimes that is useful. A more compact way to achieve the same result is
USArrests[order(Murder),]
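To make the behaviour of order() concrete, here is a small sketch on a made-up vector (the names x and arrests.sorted are illustrative, not part of the dataset). It also shows one way to sort the data frame without attach(), by referring to columns with $, and to break ties with a second sort key.

```r
x <- c(30, 10, 20)   # a made-up vector, for illustration only
order(x)             # positions that would sort x: 2 3 1
x[order(x)]          # same result as sort(x): 10 20 30

# Sort by Murder (highest first), breaking ties by Assault (lowest first).
# The minus sign reverses the order of a numeric key.
arrests.sorted <- USArrests[order(-USArrests$Murder, USArrests$Assault), ]
head(arrests.sorted, 3)
```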
Selecting cases is simply a matter of indexing, as described in the R documentation. In R we can pass logic statements to the indexes, and we will see the results for any conditions that evaluate to TRUE. To see which Murder rates are greater than 13, we can do
Murder > 13
To hide all the Murder rates that are 13 or lower, we can pass the logical indexes generated by the condition “Murder > 13” to Murder.
Murder[Murder > 13]
Similarly to using order(), we can use the indexes to view all of the dataset for states with the highest murder rates.
> dangerous.states <- (Murder > 13)
> USArrests[dangerous.states,]
               Murder Assault UrbanPop Rape
Alabama          13.2     236       58 21.2
Florida          15.4     335       80 31.9
Georgia          17.4     211       60 25.8
Louisiana        15.4     249       66 22.2
Mississippi      16.1     259       44 17.1
South Carolina   14.4     279       48 22.5
Tennessee        13.2     188       59 26.9
Logic statements can be combined with the & (and) and | (or) operators.
> USArrests[(Murder > 13) & (UrbanPop < 50),]
               Murder Assault UrbanPop Rape
Mississippi      16.1     259       44 17.1
South Carolina   14.4     279       48 22.5
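The same selection can also be written with subset(), which evaluates the condition inside the data frame so that attach() is not needed. This is offered as an alternative sketch rather than the approach used above; the name rural.dangerous and the second set of thresholds are illustrative.

```r
# Rows where both conditions hold; column names are looked up in USArrests.
rural.dangerous <- subset(USArrests, Murder > 13 & UrbanPop < 50)
rural.dangerous

# An "or" condition, keeping only two of the columns.
subset(USArrests, Murder > 16 | Rape > 40, select = c(Murder, Rape))
```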
Labeled and ordered variables are handled more nicely in R than in SPSS. To create a factor where order does not matter, simply create a variable with text instead of numbers.
gender <- c("male","female","male","female","male")
t.test(score ~ gender)  #Assumes a numeric variable score of the same length as gender
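When data is imported into R, variables consisting of text may be converted to factors. This was the default for read.table() and read.csv() before R 4.0.0; newer versions keep text as character vectors unless stringsAsFactors = TRUE is given. Factors have numerical values associated with each label. We can create a factor from a character vector as follows: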
> gender <- as.factor(gender)  #Make gender a factor
> levels(gender)               #Show the levels of this variable
[1] "female" "male"
> as.numeric(gender)           #Convert each label to its numeric equivalent
[1] 2 1 2 1 2
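For a variable whose categories do have a natural order, factor() accepts an explicit level order together with ordered = TRUE. A minimal sketch with made-up responses (the variable name agreement and its values are invented for illustration):

```r
agreement <- c("high", "low", "medium", "low", "high")  # made-up data

# Without explicit levels, the levels would be sorted alphabetically.
agreement <- factor(agreement,
                    levels = c("low", "medium", "high"),
                    ordered = TRUE)

levels(agreement)        # "low" "medium" "high"
as.numeric(agreement)    # 3 1 2 1 3
agreement < "high"       # comparisons respect the level order
```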
In R one typically gives variables meaningful names rather than using variable labels. In other words, the variable name _is_ the label. For example, good names for a variable holding the respondent’s gender are “gender” or “sex”. Bad names include “VAR100” and “RGNDR”. R can handle long names like “subject.gender” or “yearly.income.per.capita”, although such names are unwieldy. Since R is a programming language, one can always give variables convenient names and document them in the source code of one’s analysis.
If one nevertheless has a strong desire to attach variable labels to variables, there are several ways of doing so. One is using attr().
attr(gender, "label") <- "Respondent's gender"
What we just did was to set another attribute on the variable gender, named “label” (we could have given it any name, e.g. “varlab”). Let’s query the label of the variable gender:
> attr(gender, "label")
[1] "Respondent's gender"
Tables in SPSS are not suitable for publication. You should think carefully about table layout before putting one in your paper, and you should even think carefully about whether you really want to use a table rather than a graphic. If you use a word processor such as Microsoft Word, it is probably best to lay the table out by hand. If you use LaTeX, it is easier to create publication-quality tables. The package quantreg has a function latex.table() for this purpose.
Recoding and transforming variables is easy in R. Transformations basically use the same notation one would use when writing the transformation on paper.
Murder <- Murder^2          #the square of Murder
Assault <- sqrt(Assault)    #square root of Assault
UrbanPop <- scale(UrbanPop) #Z score of UrbanPop
Common transformations in the social sciences include sqrt(), log(), atan() and others.
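Note that when the variables come from an attached data frame, assignments like the ones above create new variables in the workspace; they do not change the columns of USArrests itself. One way to store transformed columns in a data frame is transform(), sketched here with illustrative names (arrests.t, log.assault, z.urbanpop):

```r
# Add two transformed columns to a copy of USArrests.
# scale() returns a one-column matrix, so as.numeric() flattens it.
arrests.t <- transform(USArrests,
                       log.assault = log(Assault),
                       z.urbanpop  = as.numeric(scale(UrbanPop)))
head(arrests.t)

# The z-scores have mean 0 and standard deviation 1 (up to rounding error).
mean(arrests.t$z.urbanpop)
sd(arrests.t$z.urbanpop)
```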
Turning continuous variables into discrete ones in R can be done with indexes. In general you probably don’t actually want to discretize your variables. It’s a common practice in some areas, but it’s almost always bad statistical practice. One reason you might want to discretize your variables is if you want a quick-and-dirty way to fit a non-linear curve. Here’s how you might discretize a variable.
murder.factor <- ordered(rep(NA, length(Murder)), levels = c("Low", "Medium","High"))  #One NA per case
murder.factor[Murder < 5] <- "Low"
murder.factor[Murder >= 5 & Murder < 13] <- "Medium"
murder.factor[Murder >= 13] <- "High"
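Base R also provides cut(), which does the same discretization in one call. The sketch below uses the same break points as the code above; right = FALSE makes each interval include its lower bound, matching the >= conditions. The name murder.cut is illustrative, and the column is taken directly from USArrests so the example does not depend on attach().

```r
murder.cut <- cut(USArrests$Murder,
                  breaks = c(-Inf, 5, 13, Inf),
                  labels = c("Low", "Medium", "High"),
                  right = FALSE,          # intervals are [lower, upper)
                  ordered_result = TRUE)  # return an ordered factor

table(murder.cut)   # number of states in each category
```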
Also see: exchange data between SPSS and R