From SPSS to R

R and SPSS do many things differently. R is a command-line program, whereas SPSS uses a point-and-click GUI. While R's interface is faster and more flexible, some things can be jarring to a veteran SPSS user. For example:

  • R starts with a blank prompt. Data and variables are not visible until you ask for them.
  • VALUE LABELs, VARIABLE LABELs, and other fundamental SPSS concepts are superfluous in R.
  • Sorting operations, case selection, and transformations are done using commands.
  • Table output is formatted as plain ASCII text by default. More work must be done to produce tables that can be pasted into a LaTeX or word processing document.
  • SPSS has a primary proprietary format (.sav), whereas R does not. Many people choose to use CSV or tab-separated files for R.

Reading and Writing Data in R

If you have data saved already in an SPSS format, you can read it into R using the following commands:

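library('foreign')
#to.data.frame=TRUE returns a data frame; by default read.spss() returns a list
my.data <- read.spss('path/to/my_data.sav', to.data.frame=TRUE)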

Simply typing the name of the dataset (in this case “my.data”) will show you the data. SPSS files often have variable names in capital letters. Unless you prefer typing with Caps Lock on, it’s probably best to change the variable names to lowercase. In addition, we attach the dataset to the search path, so that we can refer to the variables without specifying which dataset they are in.

names(my.data) <- tolower(names(my.data))
attach(my.data)
my.data

Data can also be read in a number of other formats documented on this wiki. read.csv() and read.table() are especially useful for text files.
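As a minimal sketch of reading a text file, the example below passes a small tab-separated table to read.table() through its text argument, so that it runs without any file on disk. The variable names and values are hypothetical and serve only as an illustration. read.csv() works the same way, but assumes a header row and a comma separator by default.

```r
# Hypothetical tab-separated data, supplied inline so the example is self-contained
raw.text <- "subject\tgender\tscore
1\tmale\t12
2\tfemale\t15
3\tmale\t11"

# header = TRUE: the first line holds the variable names
# sep = "\t":    fields are separated by tabs
survey <- read.table(text = raw.text, header = TRUE, sep = "\t")

str(survey)   # show the variables and their types
```

To read a real file, replace the text argument with the path to the file.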

When saving data, it is best to use a text format. Text formats can be read and written on any computer and with almost any statistics program. Unlike SPSS’s proprietary format, text files do not go out of date and hence are suitable for archiving.
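A short sketch of saving a dataset as text and reading it back, using the built-in USArrests data. The file is written to a temporary location chosen by tempfile(); in practice you would supply your own path.

```r
# Write the data frame to a CSV file (temporary path, for illustration only)
out.file <- tempfile(fileext = ".csv")
write.csv(USArrests, file = out.file, row.names = TRUE)

# Read it back; the first column holds the row names (the state names)
arrests.copy <- read.csv(out.file, row.names = 1)

# Check that the copy has the same shape as the original
dim(arrests.copy)
dim(USArrests)
```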

For the purposes of this document, we shall use one of R’s built-in datasets. You can access it by doing the following:

data(USArrests)
attach(USArrests)
USArrests
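Since nothing is visible until you ask for it, a few inspection commands are worth knowing. The sketch below also shows two ways to reach a variable without attach(), which some users prefer because assignments to an attached variable create a separate copy rather than changing the dataset.

```r
data(USArrests)

head(USArrests, 3)              # the first three cases
str(USArrests)                  # variable names, types, and first values
summary(USArrests)              # basic descriptive statistics per variable

mean(USArrests$Murder)          # refer to a variable with the $ operator
with(USArrests, mean(Murder))   # or evaluate an expression inside the dataset
```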

Sort cases

The function sort() sorts a variable into increasing or decreasing order.

Murder #Unsorted -- as entered
sort(Murder) #Sorted in increasing order
sort(Murder,decreasing=TRUE) #Sorted in decreasing order

The function order() returns the index positions that would put a variable in sorted order; these positions can then be used to sort other variables. For indexing variables, see the R manual or this wiki. Indexing works as in many programming languages, except that indexes begin at 1 instead of 0. my.variable[1] is the first element of the variable, and my.data[1,4] is the first element of the fourth variable in the data frame.
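A few indexing forms, sketched on the built-in USArrests data. The $ operator is used here so that the example does not depend on the dataset being attached.

```r
USArrests$Murder[1]        # first element of the variable Murder
USArrests[1, 4]            # first case, fourth variable (Rape)
USArrests[1:3, ]           # first three cases, all variables
USArrests[, "Assault"]     # all cases of one variable, selected by name
USArrests["Alabama", ]     # one case, selected by row name
```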

o <- order(Murder) #The indexes for sorting by Murder increasing
o.decreasing <- order(Murder, decreasing=TRUE)

These orders can be used to sort the entire dataset if desired.

USArrests[o,] #USArrests sorted by lowest murder rate first
USArrests[o.decreasing,] #USArrests sorted by highest murder rate first

We don’t need to store the indexes in a separate variable, although sometimes that is useful. A more compact way to achieve the same result is

USArrests[order(Murder),]
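order() also accepts several sort keys, much like listing several variables in SPSS's SORT CASES. In this sketch, ties on the first key are broken by the second, and a minus sign reverses the direction of a numeric key.

```r
# Sort by UrbanPop (increasing); within equal UrbanPop, by Murder (decreasing)
by.urban <- USArrests[order(USArrests$UrbanPop, -USArrests$Murder), ]
head(by.urban)
```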

Select cases

Selecting cases is simply a matter of indexing, as described in the R documentation. In R we can pass logical conditions as indexes, and R returns the elements for which the condition evaluates to TRUE. To see which Murder rates are greater than 13, we can do

Murder > 13

To hide all the Murder rates that are 13 or less, we can pass the indexes generated by the condition “Murder > 13” to Murder.

Murder[Murder > 13]

Similarly to using order(), we can use the indexes to view all of the dataset for states with the highest murder rates.

 > dangerous.states <- (Murder > 13)
 > USArrests[dangerous.states,]
               Murder Assault UrbanPop Rape
 Alabama          13.2     236       58 21.2
 Florida          15.4     335       80 31.9
 Georgia          17.4     211       60 25.8
 Louisiana        15.4     249       66 22.2
 Mississippi      16.1     259       44 17.1
 South Carolina   14.4     279       48 22.5
 Tennessee        13.2     188       59 26.9
 
 

Logical conditions can be combined with the & (and) and | (or) operators.

 > USArrests[(Murder > 13) & (UrbanPop < 50),]
               Murder Assault UrbanPop Rape
 Mississippi      16.1     259       44 17.1
 South Carolina   14.4     279       48 22.5
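The base function subset() is an alternative to logical indexing that may feel closer to SPSS's SELECT IF. It looks up variable names inside the data frame, so the dataset does not need to be attached. A minimal sketch:

```r
# Same selection as above: high murder rate and low urban population
high.rural <- subset(USArrests, Murder > 13 & UrbanPop < 50)
high.rural

# The select argument keeps only some of the variables
subset(USArrests, Murder > 13, select = c(Murder, Assault))
```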
 

Value labels

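Labeled and ordered variables are handled more nicely in R than in SPSS. To create a categorical variable where order does not matter, simply create a variable with text instead of numbers. Functions such as t.test() treat it as a grouping variable.

gender <- c("male","female","male","female","male")
score <- c(12, 15, 11, 14, 13) #Hypothetical scores, for illustration only
t.test(score ~ gender)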

When data are imported into R, variables consisting of text may be converted to factors. This was the default before R 4.0.0; in later versions, functions such as read.csv() do so only if called with stringsAsFactors=TRUE. Factors have numerical values associated with each label. We can create a factor from a character vector as follows:

 > gender <- as.factor(gender) #Make gender a factor
 > levels(gender) #Show the levels of this variable
 [1] "female" "male"
 > as.numeric(gender) #Convert each label to its numeric code
 [1] 2 1 2 1 2
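If your data arrive as numeric codes, as is common in SPSS files, factor() can attach labels to the codes in much the same way VALUE LABELS does. The codes and labels below are hypothetical.

```r
# Hypothetical numeric codes: 1 = male, 2 = female
gender.code <- c(1, 2, 1, 2, 1)

# levels lists the codes, labels gives the text for each code, in the same order
gender <- factor(gender.code, levels = c(1, 2), labels = c("male", "female"))

gender          # prints the labels rather than the codes
table(gender)   # frequency table using the labels
```

Use ordered() instead of factor() when the categories have a natural order.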

Variable labels

In R one typically gives variables meaningful names rather than using variable labels. In other words, the variable name _is_ the label. For example, good names for a variable holding the respondent’s gender are “gender” or “sex”. Bad names include “VAR100” and “RGNDR”. R can handle long names like “subject.gender” or “yearly.income.per.capita”, although such names are unwieldy. Since R is a programming language, you can always give variables convenient names and document them in the source code of your analysis.

If one nevertheless has a strong desire to attach variable labels to variables, there are several ways of doing so. One is using attr().

attr(gender, "label") <- "Respondent's gender"

What we just did was set another attribute on the variable gender, named “label” (we could have given it any other name, e.g. “varlab”). Let’s query the label of the variable gender:

 > attr(gender, "label")
 [1] "Respondent's gender"
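The same idea can be applied to every variable in a dataset. This sketch works on a copy of USArrests; the label texts are paraphrased from the dataset's help page. Note that such attributes are not used by R's own functions and can be lost when a variable is subsetted.

```r
arrests <- USArrests   # work on a copy

# One label per variable, named after the variable it describes
var.labels <- c(Murder   = "Murder arrests per 100,000",
                Assault  = "Assault arrests per 100,000",
                UrbanPop = "Percent urban population",
                Rape     = "Rape arrests per 100,000")

# Attach each label to its variable
for (v in names(var.labels)) {
  attr(arrests[[v]], "label") <- var.labels[[v]]
}

sapply(arrests, attr, "label")   # list all labels at once
```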

Producing tables

Tables in SPSS are not suitable for publication. You should think carefully about table layout before putting one in your paper, and you should even think carefully about whether you really want to use a table rather than a graphic. If you use a word processor such as Microsoft Word, it is probably best to lay the table out by hand. If you use LaTeX, it is easier to create publication-quality tables. The package quantreg has a function latex.table() for this purpose.
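If you lay out tables by hand in a word processor, one possible workflow is to compute the numbers in R and print them tab-separated, so that they can be pasted and converted to a table. The sketch below uses only base R; the choice of statistics is arbitrary.

```r
# Mean and standard deviation of every variable, rounded to two decimals
summary.table <- round(sapply(USArrests,
                              function(x) c(mean = mean(x), sd = sd(x))), 2)
summary.table

# Print tab-separated text; col.names = NA leaves an empty header cell
# above the row names so that the columns line up
write.table(summary.table, sep = "\t", quote = FALSE, col.names = NA)
```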

Recoding and Transforming variables

Recoding and transforming variables is easy in R. Transformations basically use the same notation one would use when writing the transformation on paper.

#Results are stored under new names so that the attached originals are not masked
murder.squared <- Murder^2 #the square of Murder
assault.sqrt <- sqrt(Assault) #square root of Assault
urbanpop.z <- scale(UrbanPop) #Z score of UrbanPop

Common transformations in the social sciences include sqrt(), log(), atan() and others.
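To keep transformed variables together with the original data, much as COMPUTE adds a column in SPSS, one option is the base function transform(). The new variable names below are arbitrary.

```r
# Add two new variables to a copy of the dataset; USArrests itself is unchanged
arrests <- transform(USArrests,
                     log.assault = log(Assault),
                     urbanpop.z  = as.numeric(scale(UrbanPop)))

head(arrests)
```

scale() returns a one-column matrix, so as.numeric() is used here to store the result as an ordinary variable.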

Turning continuous variables into discrete ones in R can be done with indexes. In general you probably don’t actually want to discretize your variables. It’s a common practice in some areas, but it’s almost always bad statistical practice. One reason you might want to discretize a variable is to get a quick-and-dirty way to fit a non-linear curve. Here’s how you might discretize a variable.

#Start with one missing value per case, then fill in the categories
murder.factor <- ordered(rep(NA, length(Murder)), levels = c("Low", "Medium","High"))
murder.factor[Murder < 5] <- "Low"
murder.factor[Murder >= 5 & Murder < 13] <- "Medium"
murder.factor[Murder >= 13] <- "High"
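The base function cut() offers a more compact route to the same three categories. In this sketch, right = FALSE makes each interval include its lower bound and exclude its upper bound, which matches the conditions used above.

```r
murder.cut <- cut(USArrests$Murder,
                  breaks = c(-Inf, 5, 13, Inf),        # category boundaries
                  labels = c("Low", "Medium", "High"),
                  right = FALSE,                       # intervals are [lower, upper)
                  ordered_result = TRUE)               # return an ordered factor

table(murder.cut)   # number of states in each category
```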


Also see: exchange data between SPSS and R

 
getting-started\translations\spss2r.txt · Last modified: 2008/05/21
 