Basic Data Manipulation (MT3)
Learning outcomes
At the end of today’s workshop you should be able to do the following:
Demonstrate knowledge of how to interact with R using RStudio
Identify the basic structures of a data frame: the number and names of variables and cases
Create a new variable from existing variables in a data frame.
Filter information out of a data frame and compare summary statistics across filtered observations
Select out one or more variables from all of the variables in a data frame.
Preliminaries
Before you start, we need to install the required packages, which today are tidyverse and mosaic.
install.packages("tidyverse")
install.packages("mosaic")
Remember that you have to tell R what packages you plan to use in order to use them in the file. So we need to load these packages as well as install them (you could also use the command require()):
library(tidyverse)
library(mosaic)
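Installing and loading are separate steps, and it can help to check which packages are already installed before installing anything. The following is a minimal sketch using only base R; the object names `needed`, `is_installed` and `not_installed` are chosen here for illustration.

```r
# Packages this workshop relies on
needed <- c("tidyverse", "mosaic")

# requireNamespace() returns TRUE if a package can be loaded and FALSE otherwise,
# without attaching it to the session
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

# Names of any packages that would still need install.packages()
not_installed <- needed[!is_installed]
print(not_installed)
```

Note that library() stops with an error when a package is missing, whereas require() returns FALSE and gives a warning.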
Loading data
We now want to start to work with some data. We use a dataset from the World Values Survey.
You can go to the World Values Survey site to see what we’ll be working with. Don’t download anything: http://www.worldvaluessurvey.org/WVSContents.jsp. We will be looking at wave 6 data, collected between 2010 and 2014.
You can access the .rds (R data) file of the data here: https://github.com/thomcurran/PB130/raw/master/MT3/Workshop/WV6_Data_R_v20180912.rds. The file is called WV6_Data_R_v20180912.rds. This file is also in the MT3 workshop folder. The .rds extension is specific to R; most commonly we’ll work with .csv files.
Now we are going to open the file in R using RStudio or RStudio Server. Use the command below, but be sure to change the filepath to the one where you have put the WV6_Data_R_v20180912.rds file. An easy way to do this is to click on the file name in the Files pane, change the name to WVS and click OK. The code will appear in the console. Just copy and paste that code into your R chunk.
WVS <- readRDS("C:/Users/currant/Dropbox/Work/LSE/PB130/MT3/Workshop/WV6_Data_R_v20180912.rds")
This command instructs R to read the rds file (hence readRDS()) and assign it (<-) to an object called WVS. Note that readRDS() is a base R function, so it works without loading any packages. The tidyverse also provides read_rds() in its readr package, which is a thin wrapper around readRDS(). WVS is a data table that we shall use to practise some basic analysis.
The data table contains the World Values Survey. You should see that the workspace area in the upper righthand corner of the RStudio window now lists a data set called WVS that has 89565 observations of 440 variables. As you interact with R, you will create a series of objects. Sometimes you load an object as we have done here, and sometimes you create an object yourself as the result of a computation or some analysis you have performed.
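To see how an .rds file behaves without touching the survey file, here is a small sketch that saves a made-up data frame to a temporary file and reads it back. The object `toy` and its values are invented for illustration and are not WVS data.

```r
# A small toy data frame standing in for the survey data (values are made up)
toy <- data.frame(V2 = c(840, 368), V10 = c(1, 2))

# Write it to a temporary .rds file, then read it back
path <- tempfile(fileext = ".rds")
saveRDS(toy, path)
toy_again <- readRDS(path)

# Compare the object that was read back with the one that was saved
identical(toy, toy_again)
```

An .rds file stores exactly one R object, which is why the result of readRDS() has to be assigned to a name of your choosing.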
Examining the data
The World Values Survey, which started in 1981, consists of nationally representative surveys conducted in almost 100 countries. It is the largest non-commercial, cross-national, time series investigation of human beliefs and values ever executed, currently including interviews with almost 400,000 respondents.
The WVS seeks to help scientists and policy makers understand changes in the beliefs, values and motivations of people throughout the world. Thousands of political scientists, sociologists, social psychologists, anthropologists and economists have used these data to analyze topics such as economic development, democratization, religion, gender equality, social capital, and subjective well-being. You might remember from PB101 some trends that have been examined with this data.
Impressive! So, what can we do to look at this data? There are a few options.
Double-click on it in the environment pane (right corner) to view it.
Use the View() command in the console
Just type the dataframe name in the console
View(WVS) # is the same as:
WVS
You should see columns of numbers with each row representing a different respondent. The first entry in each row, corresponding to the first column, is simply the row number. The second column is the first variable, the third is the second variable, and so on. You can use the right arrow in the data screen to examine the complete data set.
Note that the row numbers in the first column are not part of the WVS data. R adds them as part of its printout to help you make visual comparisons. You can think of them as the index that you see on the left side of a spreadsheet. In fact, the comparison to a spreadsheet will generally be helpful. R has stored the WVS data in a kind of spreadsheet or table called a dataframe.
You will have seen that looking at the dataframe in its entirety is quite cumbersome. We would probably rather just take a look at some portion of our data, such as the top or the bottom of the data. We can do this with the head and tail commands. Run them now to check out the first and last few rows of the data. By default, head and tail show you 6 rows. You can change that by adding a comma and a number after the name of the data frame.
head(WVS)
head(WVS, 8)
tail(WVS)
tail(WVS, 12)
What happens?
Okay, a bit messy. What about finding out the dimensions of the data frame with dim()?
dim(WVS)
This command should output [1] 89565 440, indicating that there are 89565 rows and 440 columns (we’ll get to what the [1] means in a bit), just as it says next to the object in your workspace. That’s a lot of data!
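The two numbers returned by dim() can also be picked out individually. The sketch below uses a made-up three-row data frame called `toy` (not the WVS data) to show this.

```r
# Toy data frame with 3 rows and 3 columns (values are made up)
toy <- data.frame(V2 = c(840, 840, 368), V10 = c(1, 2, 3), V11 = c(2, 2, 4))

# dim() returns a vector of two numbers: rows first, then columns
size <- dim(toy)

size[1]   # number of rows, the same value as nrow(toy)
size[2]   # number of columns, the same value as ncol(toy)
```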
You can see the names of these columns (or variables) by typing the names() command
names(WVS)
You should see that the data frame contains the columns that match those in the WVS codebook (see: https://github.com/thomcurran/PB130/raw/master/MT3/Workshop/F00007761-WV6_Codebook_v20180912.pdf).
Another way to obtain similar information is to ask for the structure of the data using the command str()
str(WVS)
As we saw with the dim command, the str command tells us that the data has 89565 observations and 440 variables. As with the names command, str also tells us the names of the variables and the format of these variables. It tells us that V1 (wave) contains only the number 6 and has the labelled format “num” (which is short for “numeric”, which means a number). It also tells us that in V1 the number 6 has a “chr” label (which is short for “character”, which means a word or character string, in this case “wave”). Although not in this data frame, R can understand other data formats, including integers and categorical variables.
At this point, you might notice that many of the commands in R look a lot like functions from math class. So, invoking R commands means supplying a function with some number of arguments that appear in parentheses. For example, when you say that a variable y is a function of another variable x you write: y=f(x). The dim, names and str commands each took a single argument, the name of a data frame WVS. The idea of an R command working like a function will remain helpful in the future.
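The function analogy can be made concrete. In this sketch, `f` is a hypothetical function written only to mirror y = f(x), and `toy` is made-up data; the second half shows that arguments to a built-in command can be supplied by position or by name.

```r
# A hypothetical function of one argument, mirroring y = f(x)
f <- function(x) {
  x^2 + 1
}
y <- f(3)   # 3 squared plus 1

# Built-in commands work the same way: arguments go inside the parentheses
toy <- data.frame(id = 1:10, score = (1:10) * 2)

by_position <- head(toy, 3)          # arguments matched by their order
by_name <- head(x = toy, n = 3)      # the same call with the arguments named

identical(by_position, by_name)
```

Naming arguments is optional here, but it can make longer calls easier to read.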
Summary statistics
In psychology, we are often interested in knowing basic summary statistics of variables and also in generating new variables that contain new information. One very useful command in R, using the mosaic package, is the favstats command (read as “fave stats”). What, you might ask, are a psychologist’s favorite statistics? How about the following?
mean
median
standard deviation
minimum
maximum
Let’s try the favstats command with the WVS data frame for the Happiness question, V10:
“Taking all things together, would you say you are”
1. Very happy
2. Rather happy
3. Not very happy
4. Not at all happy
The function favstats() uses the following structure favstats(~VARIABLENAME, data = DATANAME). That is, we use a tilde ~ before the name of our variable, and we also specify the data as being a particular data object we plan to use.
favstats(~V10, data = WVS)
As you can see, we have summary information about people’s happiness from wave 6 of the WVS. But can you spot the problem here?
If you said the minimum value -5 is outside of the range of values (1-4) then you are correct! This is because in the WVS not everybody answered this question, and some responded that they “did not know”. So rather than 1-4, the V10 variable ranges from -5 to 4, and includes the following codes (see codesheet here: https://github.com/thomcurran/PB130/raw/master/MT3/Workshop/F00007761-WV6_Codebook_v20180912.pdf):
-5. HT: Missing-Dropped out survey; RU: Inappropriate response
-4. Not asked in survey
-3. Not applicable
-2. No answer
-1. Don’t know
All of these are redundant for our purposes. This is common when retrieving large data sets: they are often messy and require manipulating or cleaning before analysis! We want tidy data!
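To see why these codes matter, consider a small sketch with made-up answers (not real WVS responses). Leaving the negative codes in drags the mean away from anything meaningful.

```r
# Made-up answers: 1-4 are real responses, negative numbers are missing-data codes
answers <- c(1, 2, 2, 3, -1, -2, -5)

# Mean of everything, including the missing-data codes
mean(answers)

# Keep only the real responses, then take the mean again
valid <- answers[answers >= 1]
mean(valid)
```

The second mean describes the four people who actually answered; the first one mixes answers with bookkeeping codes and describes nobody.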
Data manipulation
During this course, we shall use a variety of data verbs, commands that “do” things to data (hence “verb”). The list of data verbs includes:
mutate(): change the data, thus ‘mutate’ it in some way
filter(): filter some subset of cases out from the total list of cases
select(): select some subset of variables out from the total list of variables
join functions such as left_join(): join two different data tables together
group_by(): group a data table using a variable (e.g. gender) to understand some patterns in the data
summarise(): summarise the data to find some useful statistic; often combined with group_by to contrast differences across groups
gather() and spread(): gather and spread re-shape the data by making the data narrower (for gather) or wider (for spread) depending on what we need. Normally, we need to re-shape data when we want to change the particular cases that we have so that we can draw insights based on alternative cases to those we already have. For example, to draw a graph, we might need to change the case into a summary statistic, such as the average number of people engaged in an activity, rather than having the case be each unique individual in a dataset.
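As a small taste of how two of these verbs combine, the sketch below groups a made-up data frame and computes one mean per group. It assumes the dplyr package (part of the tidyverse) is installed; `toy`, `group` and `score` are illustrative names, not WVS variables.

```r
library(dplyr)

# Toy data with made-up values
toy <- data.frame(
  group = c("a", "a", "b", "b"),
  score = c(1, 3, 2, 4)
)

# group_by() marks the groups; summarise() then returns one row per group
group_means <- summarise(group_by(toy, group), mean_score = mean(score))

group_means
```

Here the cases change from individual rows to groups, which is the kind of shift in "what counts as a case" described in the list above.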
Piping
Before we do much else, we want to think about how to write good code. One of the ways of doing that is by using “pipes”. In R, the %>% symbol is called a pipe. You should read the pipe as saying “Then” or “And then”. A pipe “pipes in” an argument so that you are able to use it in a function.
I like to think of my programs that include the pipe as stories told by a 6-year-old, in which every step is joined to the next by “and then”.
That way, I don’t get confused about what it means because I always get the functions in the right sequence.
So the command we will use to tell R what data to use is the following:
WVS %>% # data first, and then..
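To make the “and then” reading concrete, here is a sketch comparing a nested call with its piped equivalent on made-up data. It assumes dplyr is installed (loading it makes %>% available); `toy`, `id` and `score` are illustrative names.

```r
library(dplyr)

toy <- data.frame(id = 1:5, score = c(3, 1, 4, 1, 5))

# Nested call: has to be read from the inside out
nested <- select(filter(toy, score > 1), id)

# Piped version: take toy, and then keep scores above 1, and then keep the id column
piped <- toy %>%
  filter(score > 1) %>%
  select(id)

# Both routes give the same result
identical(nested, piped)
```

The piped version lists the steps in the order they happen, which is what keeps the sequence of functions easy to follow.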
Selecting variables
Let’s start to examine the data a little more closely. Rarely would we conduct analyses on the entirety of a data set the size of the WVS, so a useful starting point is to retain only the variables of interest (and get rid of all others). Let’s say we are interested only in a country-level analysis of the well-being variables measured by V10 (feeling of happiness) and V11 (subjective health). We can keep only these variables, along with V2 (country), by creating a new dataframe (WVS_short) using the select() function.
WVS_short <- # create new data frame called WVS_short
WVS %>% # original dataframe pipe ("and then")
select(V2, V10, V11) # variables of interest to retain
head(WVS_short) # inspect new dataframe
We now have a more manageable data frame WVS_short that contains only the variables of interest to our analyses.
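The select() function can also drop variables rather than keep them. The following sketch, on a made-up data frame that borrows the WVS column names purely for illustration, shows both directions; it assumes dplyr is installed.

```r
library(dplyr)

# Toy data frame with made-up values
toy <- data.frame(V1 = 6, V2 = c(840, 368), V10 = c(1, 2), V11 = c(2, 3))

# Keep only the named columns
kept <- toy %>% select(V2, V10, V11)

# Or drop a column by putting a minus sign in front of its name
dropped <- toy %>% select(-V1)

names(kept)
names(dropped)
```

With 440 variables, naming the few to keep is usually shorter than naming the many to drop.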
Filtering and mutating
Great. Now remember that V10 and V11 contain redundant information in their range of values (i.e., values of -5 to -1 are non-responses or “don’t knows”). Let’s now filter those redundant values out of our WVS_short dataframe using the filter() function.
WVS_short <- # into existing dataframe
WVS_short %>% # from existing dataframe pipe ("and then")
filter(V10 >= 1 & V11 >= 1) # keep rows where V10 and V11 are both 1 or above (numbers unquoted so the comparison is numeric)
WVS_short # check dataframe
From eyeballing the data we can see that there appear to be no negative values in our data frame. How else can we check this more rigorously using favstats()?
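Alongside favstats(), a couple of base R checks can confirm that a filter did its job. The sketch below uses made-up data (not WVS_short) and assumes dplyr is installed. It also illustrates why comparison values are best written as plain numbers: when one side is quoted, R compares both sides as text.

```r
library(dplyr)

# Made-up data containing some negative missing-data codes
toy <- data.frame(V10 = c(1, -2, 3, 4), V11 = c(2, 1, -1, 4))

cleaned <- toy %>%
  filter(V10 >= 1 & V11 >= 1)

# Smallest remaining value in each column
min(cleaned$V10)
min(cleaned$V11)

# Is any remaining value below 1?
any(cleaned$V10 < 1 | cleaned$V11 < 1)

# Pitfall: a quoted number forces a text comparison, and "10" sorts before "9"
10 >= 9
10 >= "9"
```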
So we have successfully removed the redundant information. Now let’s home in on this data. Let’s say that in our project we are only interested in well-being within the USA. To do this, we need to create a new dataframe (Just_USA) from a subset of cases within the WVS_short dataframe. Here we will use the V2 variable (country) and use the filter() function. Looking at the WVS codesheet tells us that the numeric code for USA in V2 is 840 (codesheet is here: https://github.com/thomcurran/PB130/raw/master/MT3/Workshop/F00007761-WV6_Codebook_v20180912.pdf).
Just_USA <- # into new dataframe
WVS_short %>% # from WVS_short
filter(V2 == 840) # Filter on the numeric country code
Just_USA # check the new dataframe
Let’s go through a step-by-step interpretation of this code:
First, we tell R we want to assign the name Just_USA to our new dataframe.
Second, we tell R to pipe in (to use) the WVS_short data.
Third, we then (%>%) tell R to use the filter command, which takes the variable “V2” and keeps only the case(s) for which the value is equal to 840 (the use of two equals signs == is very important here because it tests whether two values are equal, whereas a single equals sign assigns a value. You will get an error if you use only one equals sign)
Fourth, we tell R to display our new dataframe Just_USA. (notice, there’s no pipe %>% at the end of the line before Just_USA. Have a think why that might be)
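What == does can be seen by running it on its own. In this sketch the data are made up and the country codes are used purely as example numbers; dplyr is assumed to be installed.

```r
library(dplyr)

toy <- data.frame(V2 = c(840, 368, 840, 356), V10 = c(1, 2, 2, 3))

# == asks a question of every row: is V2 equal to 840?
toy$V2 == 840

# filter() keeps only the rows where the answer is TRUE
just_840 <- toy %>%
  filter(V2 == 840)

nrow(just_840)
```

The first expression returns one TRUE or FALSE per row, and filter() simply keeps the TRUE rows.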
Mutating
We now have a data file that contains interpretable information on happiness (V10) and subjective health (V11) for Americans contained in the Just_USA dataframe. This is great but there is one more thing we need to do. Remember that we considered wellbeing as a variable that consists of both happiness and subjective health. Well, we need to create that wellbeing variable in the Just_USA dataframe. To do that, we are going to use the mutate() function to add happiness and subjective health together and create a new variable called “wellbeing”.
Just_USA <- # into existing dataframe
Just_USA %>% # from existing dataframe
mutate(wellbeing = V10 + V11) # add up V10 and V11 and assign the sum to a new variable called wellbeing
Just_USA # check dataframe
Notice here that a single equals sign (=) inside mutate() assigns the result of the calculation to the new variable name. So when you are naming or assigning something it is =, and when you are testing whether two values are equal it is ==. We could of course do the same for any other calculation. Try, for instance, using the mutate() function to create a new variable called “product” that contains V10 multiplied (*) by V11.
Now that we have created the wellbeing variable, we can address our focal aim, which is to understand levels of wellbeing in the USA. To do that, let’s use the favstats() function to see what the average wellbeing score is for people in the USA.
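The two kinds of equals sign can appear side by side in one mutate() call. The sketch below uses made-up values and illustrative column names (`total`, `both_equal`), and assumes dplyr is installed.

```r
library(dplyr)

toy <- data.frame(V10 = c(1, 2, 4), V11 = c(1, 3, 4))

toy <- toy %>%
  mutate(
    total = V10 + V11,        # single = gives the new column its name
    both_equal = V10 == V11   # double == tests equality, row by row
  )

toy
```

The `total` column holds numbers, while `both_equal` holds TRUE or FALSE for each row.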
favstats(~wellbeing, data = Just_USA)
Given that wellbeing can range from 2 through 8 (each of the two items is scored 1 to 4), a mean of 3.67 suggests that people in the USA have fairly high wellbeing (remember that lower scores indicate better wellbeing).
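To see which totals are possible when two items scored 1 to 4 are added together, a quick base R sketch can enumerate every pairing. The object name `totals` is chosen here for illustration.

```r
# Every possible pairing of two items that are each scored 1 to 4
totals <- outer(1:4, 1:4, FUN = "+")

# Smallest and largest possible total
range(totals)
```

Knowing the possible range is what lets us judge whether an observed mean sits near the favourable or the unfavourable end of the scale.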
Activity
Now it’s your turn to do some data manipulation with the WVS. I want you to create a string of code to find the average wellbeing of people in Iraq. Then I want you to report the difference in average levels of wellbeing between people from the USA and people from Iraq.
Filter for only Iraq
Calculate the wellbeing score for Iraq
Find the average wellbeing score for Iraq
Calculate the average wellbeing difference between the USA and Iraq
Conclusion
That was an introduction to basic data manipulation in R and RStudio, but we will provide you with more functions and a more complete sense of the language as the course progresses. Feel free to browse around the websites for R and RStudio if you’re interested in learning more, or find more labs for practice at http://openintro.org.
You can also practice R using a package called Swirl. The following code installs the package swirl.
install.packages("swirl")
To start swirl, you need to load the package with the library() function, and then run it with swirl()
library("swirl")
swirl()
Once you have loaded the package, enter your name and follow the instructions in the console. Complete lesson 1: “R Programming: The basics of programming in R”
Also, don’t forget to complete the essential readings!