Welcome to the CSCU 2016 Data Carpentry Workshop!
June 13-14, 2016
Instructors:
- Erika Mudrak
- Lynn Johnson
- Emily Davenport
Helpers:
- Francoise Vermeylen
- Stephen Parry
- Kevin Packard
- David Kent
Helpful links:
Monday Morning Notes:
How can we do reproducible science?
- For each step in the process (data cleaning, data summarizing, analysis, figure generation, etc.), create a separate script. Future you will appreciate it!
- Another benefit: if you update your raw data, you should be able to use the same scripts to update your results.
So what are we doing in this workshop?
- 1. Raw data - Excel (Monday morning)
- 2. Data cleaning - OpenRefine (Monday morning)
- 3. Summarizing - R dplyr, ggplot2, SQL databases (Monday afternoon/Tuesday morning)
- 4. Analysis - R loops and functions (Tuesday afternoon)
- 5. Results formatting - R Markdown (Tuesday afternoon)
Excel:
- Cardinal Rules for spreadsheets:
- Put all your variables in columns
- Put each observation in its own row
- Don't combine info into one cell
- Leave raw data alone!
- Other things to consider:
- What value to use for missing values? (NA, null, N/A?) Consider the downstream application you might open it in (R, SQL, etc)
- Column names: only alpha/numeric, avoid spaces in names, have the names be understandable, don't start with a number
- Place comments in their own column, don't use highlighting - it won't transfer to stats programs!
- Read the paper on Tidy Data by Hadley Wickham (we'll send it out after the workshop)
- What do we do with units? Keep a separate spreadsheet or file that lists more info about each of the columns, including units and a description of what the data is
- Dates - arrrrg!
- Excel doesn't actually store dates as you might think: it stores them as the number of days since some reference date in the past, and that reference date changes between versions of Excel! See, arrrrg! (See the R snippet at the end of this list for converting these serial numbers.)
- So, what to do?
- Create a separate column, in that cell you can type `=MONTH()` and reference the full date. This will pull just the month from the original date column. Can also do this with `=DAY()` and `=YEAR()`.
- OR you can use a Julian date (1-365, see point below for how) column and a year column
- `=A3-DATE(YEAR(A3), 1, 0)` ... where A3 is the cell you're transforming
- Data validation - in the 'Data' tab, you can set limits or rules for when you're entering data. Helps you catch typos or values that are way outside of what you might expect to see. If you enter something that doesn't fit with your rules, an error message will pop up.
- Another trick is to sort your data. Often wrong data will sort to the top or bottom of a column.
- You can also use conditional formatting to color cells by their values. This is NOT retained if you export to R or SAS, but it can help you clean the data.
- How to save your data: proprietary formats can change or be discontinued. Try to save your data in a plain text format, either .csv or tab delimited, by going to "Save As" and choosing .csv or Text (Tab delimited). When you do this, coloring, fonts, and column sizes will go away. The only thing that will stay is the text itself. This is why we don't want to use highlighting to note important information.
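- If an exported date column shows up in R as Excel's serial day numbers, you can convert it with as.Date(). A minimal sketch (assuming the default Windows date system, whose day counts start at 1899-12-30; old Mac workbooks use a 1904-01-01 origin instead):
serial_dates <- c(42534, 42535)               # hypothetical values exported from Excel
as.Date(serial_dates, origin = "1899-12-30")  # "2016-06-13" "2016-06-14"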
OpenRefine:
- Open up the program
- If you need to, use a browser to navigate to http://127.0.0.1:3333/
- Download the data somewhere on your computer you'll be able to find it: https://www.dropbox.com/s/kbb4k00eanm19lg/Portalrodents19772002_scinameUUIDs.csv?dl=0
- Open up that file using OpenRefine
- Text facets allow you to mass edit entries in your table (for example, change names, fix typos, strip extra spaces, etc.)
- Next to scientificName you'll see the arrow, click on that -> 'Facet' -> "Text Facet"
- You can see a summary of what's in that column
- If you click on 'cluster' in the upper right hand corner of that facet box, you can see you have tools that will try to be smart and cluster things that look very similar together. You can play with the type of clustering (for instance, try method "nearest neighbor" in that menu)
- You can also edit individual cells
- Text facet by column "country". You can batch edit in that facet. For instance, next to the 15 instances of 'US', hit edit, and we can change those all to "UNITED STATES"
- You could also change "HT" to "Haiti" to be more clear.
- You can also split a column into two by clicking the column name -> "Edit Column" -> "Split into two columns" (uncheck remove original column)
- These were all text facets
- You can also do numeric facets. In the facet, you'll see a histogram of the numeric values. You can drag that window in the histogram and filter out anything outside of the range you specify
- Can also create scatterplot facets to compare two variables to each other. Can go in just like the histogram and choose which values you want to examine.
- You can do facets on top of facets: so if you filter by a value of a variable, you can then sort using a text facet to sort by a species name, for example.
- OpenRefine tracks all of the changes you make! This is MUCH nicer than doing it by hand if you were editing in Excel.
- You can save these steps by going to the "Undo/Redo" button -> click "Extract" , this will save a JSON file of all these steps.
- Now, if you want to do the same steps on another notebook with very similar data, you can go to "Undo/Redo" -> click "Apply" and paste that JSON code here.
- To save everything after processing, hit "export" at the very top of the page. You can save it as a csv or tab delimited
- Q: What if you have private data? No worries, this is run using a browser but is not online. It's only on your computer and everything is totally private.
Monday Afternoon Notes:
Get your project set up:
- Open up RStudio and make a script file "File" -> "New File" -> "R script"
- You can add comments to your scripts using #
- Go to "File" -> "Save As", make a new folder called "R Session 2", and save the script as Rcode.R
- Now download the data again: https://github.com/datacarpentry/ecology-workshop/blob/master/data.md and save it to that same "R Session 2" folder where we saved the .R file.
- What is an R project? It's a handy way to store your data, scripts, and results in one place.
- "File" -> "New Project" -> "Existing Directory"
Here's the code that Lynn is typing:
svydat <- read.csv("surveys.csv", header=TRUE) # Loads our data into R
# Some useful commands for looking at your data: ----------------------------------
head(svydat) # Look at the top few rows of your table
dim(svydat) # How many rows and columns are in your table
nrow(svydat) # How many rows?
ncol(svydat) # How many columns?
str(svydat) # useful function that tells you the type of each column in your dataset and shows the first few entries in each column
# Important note: Always check that R is recognizing the variables the way you want it to! Are numbers numeric? Are grouping variables factors? Are descriptions characters?
# Subsetting the data: ---------------------------------------
# use brackets to subset [rows, columns]
svydat[1:10, 1:4] # displays the first ten rows of the first four columns
svydat[1:10,c("record_id", "plot_id", "species_id")] # can also include which columns to pull by naming them
svydat$sex # If we only want to see one column, we can use the "$" and the column name after the variable
svydat$sex[1:10] # If we only want to see the first 10 rows of the sex column
svydat$plot_id[is.na(svydat$weight)] # If you want to know the plot_ids for where weight is missing (is.na finds missing values)
table(svydat$plot_id[is.na(svydat$weight)]) # Makes a summary table for each plot_id of how many entries have missing values
table(svydat$plot_id[!is.na(svydat$weight)]) # If we want to know the summary of how many entries we have for each of our plots where we have weight data
# Let's force record_id and plot_id to be factors (they're now numerical) -----------------
svydat$plot_id <- as.factor(svydat$plot_id)
svydat$record_id <- as.factor(svydat$record_id)
# Let's look at a summary for all of our variables and make sure things make sense:
summary(svydat)
# There are some odd values for sex and species_id. Let's go in and make sure everything makes sense. --------------------------
table(svydat$plot_id, useNA="ifany") # the table() function won't normally include missing values, but you can force it to using the useNA argument
levels(svydat$plot_id) # There are 24 different levels of this factor
levels(svydat$species_id) # We see a bunch of two letter codes for each species, but we also see some blanks. Those should be "NA" values
svydat$species_id[svydat$species_id==""] <- NA
table(svydat$species_id, useNA="ifany") # R recognizes "" as a level, but there aren't actual values of that anymore in our data.
svydat$species_id <- droplevels(svydat$species_id) # This will drop any unused levels from a factor variable (for us, the "")
table(svydat$species_id, useNA="ifany")
svydat$sex[svydat$sex==""] <- NA
svydat$sex <- droplevels(svydat$sex)
# If you know your data is going to have missing values that are stored as blank, you can do this when importing your data:
svydat2 <- read.csv("surveys.csv", header=TRUE, na.strings="")
# Converting between factor and numeric can be tricky.
# If you want to go numeric -> factor, you can just type as.factor() around your variable and that's fine.
svydat2$hindfoot_length <- as.factor(svydat2$hindfoot_length)
svydat$hindfoot_length[1:10]
svydat2$hindfoot_length[1:10]
# But now let's say you want to go from factor -> numeric. You might think you could use the as.numeric() function:
as.numeric(svydat2$hindfoot_length)[1:10]
# Whaaaat?
# When you change values into a factor, it stores them alpha-numerically in order. So, things starting with "a" will be listed first. When you go to change back to numeric, it will reassign the value of where the level was in the list of all factor levels. That's NOT what you want!
# Two things you can do to change factors into numerics:
as.numeric(as.character(svydat$hindfoot_length))[1:10] # Change to a character and then back to a number
as.numeric(levels(svydat2$hindfoot_length))[svydat2$hindfoot_length][1:10] # This is more confusing, but this is faster if you have large data
# dplyr ----------------------------------
# If you have your own computer and haven't installed this package yet, type:
install.packages("dplyr")
# Everyone should then load the package:
library("dplyr")
# Create a local dataframe
svy <- tbl_df(svydat)
# What does this do?
# It'll print out info on your table in a pretty way on your screen. It'll only show what will fit on your screen and it'll list what you can't see.
# Otherwise it's fine to do anything we're going to do the rest of the afternoon on a normal data.frame
# Careful: local dataframes do not retain row.names!
print(svy, n=20) # will show you more than the 10 things the local dataframe will display by default
# 5 main verbs/functions in dplyr:
# filter
# select
# arrange
# mutate
# summarise (+ group_by)
# FILTER: ------------------------
# Will return only the ROWS that meet the criteria you want to see.
# If we only wanted to see entries from January, 1983
svy[svy$year==1983 & svy$month==1, ] # Base R
filter(svy, year==1983, month==1) # dplyr
jan1983 <- filter(svy, year==1983, month==1) # Save that filtered data to a new variable
# What if we only wanted to see entries from January, 1983 for the species "DS"
print(svy[svy$year==1983 & svy$month==1 & svy$species_id=="DS", ], n=22)
print(svy[svy$year==1983 & svy$month==1 & svy$species_id=="DS" & !is.na(svy$species_id), ], n=22)
filter(svy, year==1983, month==1, species_id=="DS") # If you do this filtering with dplyr, it'll automatically filter out the NA values we saw with the base R command
# What if we wanted to keep all the observations for the species labeled "UP" and "UR"
filter(svy, species_id=="UP" | species_id=="UR")
filter(svy, species_id %in% c("UR", "UP")) # If we have a long list we can use the %in%
# Challenge: Create a data.frame containing observations for species "PF" where the weight is not missing.
# How many males and females are there in that data set?
pf <- filter(svy, species_id=="PF", !is.na(weight))
table(pf$sex)
# SELECT: ----------------------
# Will pull out COLUMNS we're interested in
# Let's say we only want species, sex, hindfoot_length, and weight
svy[ ,c("species_id", "sex", "hindfoot_length", "weight")] # base R
select(svy, species_id, sex, hindfoot_length, weight) # dplyr
# Can also choose a range of columns:
select(svy, sex:weight, contains("id")) # contains() pulls anything containing "id" (record_id, plot_id, species_id); sex:weight gives all columns between sex and weight, inclusive
# In addition to contains(), you can use starts_with(), ends_with(), matches()
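# For example, quick sketches of the other helpers (using the column names in our surveys data):
select(svy, starts_with("species"))  # species_id
select(svy, ends_with("_id"))        # record_id, plot_id, species_id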
# "chaining" or "pipes" --------------------------
# Comes from the magrittr package (it's installed with dplyr, so you have it already)
# Allows you to take output from one function and dump it in the next function, rather than nesting several functions or having to save intermediate data over and over
# We want to return year, species, and weight, but only when weight is less than 5.
# Just using what we know so far, we can do this in two steps:
sp_wts = select(svy, year, species_id, weight)
filter(sp_wts, weight < 5)
# This creates an intermediate dataset, which we don't need.
# We could nest our functions:
filter(select(svy, year, species_id, weight), weight<5) # This first evaluates the inside command (the select), then the outside command (the filter).
# This is ok, but it can get very confusing when you nest many things.
# How can we do this with pipes?
svy %>%
  select(year, species_id, weight) %>%
  filter(weight < 5)
# %>% is a "pipe". Anytime you see this, the output of the expression to the left of the pipe is input into the expression that is to the right of the pipe.
# The pipes are much easier to read than a bunch of nested functions
# We can save the output to a variable to use later:
sm_sp <- svy %>%
  select(year, species_id, weight) %>%
  filter(weight < 5)
# Challenge: Create a data.frame with year, species_id, and weight for observations from the year 2000.
# What is the mean weight of those observations?
sp2000 <- svy %>%
  select(year, species_id, weight) %>%
  filter(year==2000)
summary(sp2000$weight) # Summary will ignore NA values
mean(sp2000$weight) # the mean() will not ignore the NA and instead return an NA
mean(sp2000$weight, na.rm=TRUE) # But you can do this
# Or, you could filter out any NAs in your command to generate sp2000
sp2000 <- svy %>%
  select(year, species_id, weight) %>%
  filter(year==2000 & !is.na(weight))
mean(sp2000$weight)
# ARRANGE -------------------------
# Let's say we want to return year, month, and hindfoot_length, but sorted by year and month
svy[order(svy$year, svy$month), c("year", "month", "hindfoot_length")] # Base R
svy %>%
  select(year, month, hindfoot_length) %>%
  arrange(year, month)
# The year increases as we go down the table. What if we wanted to sort from the most recent year back to the earliest? Use the desc() function to sort in descending order
svy %>%
  select(year, month, hindfoot_length) %>%
  arrange(desc(year), month)
# MUTATE -------------------------------
# What if we want to create a new variable: the ratio of weight to hindfoot length
svy$wlratio1 <- svy$weight/svy$hindfoot_length # base R
mutate(svy, wlratio2=weight/hindfoot_length) # dplyr, doesn't actually save it to the dataset
svy %>%
  mutate(wlratio2 = weight/hindfoot_length) # With pipes, doesn't actually save it to the dataset
# If we want to save that to the dataset:
svy <- svy %>%
  mutate(wlratio2 = weight/hindfoot_length)
# What if we want to get rid of variables from our dataset?
svy <- svy %>%
  select(-wlratio1, -wlratio2) # a minus inside select() drops those columns and keeps the rest (transmute() would have dropped everything else too)
# SUMMARISE -------------------------------
# useful with the group_by function to create quick summaries of the variables in your dataset
# What if we want to calculate average weight by species?
tapply(svy$weight, svy$species_id, mean, na.rm=TRUE) # One way in base R
aggregate(weight ~ species_id, svy, mean) # Another base R way. It drops rows with missing weight, so species with no weight data at all won't appear
svy %>%
  group_by(species_id) %>%
  summarise(mean(weight, na.rm=TRUE))
# That kept species whose mean is NaN (no non-missing weights); we could drop the missing weights first by piping in a filter:
svy %>%
  group_by(species_id) %>%
  filter(!is.na(weight)) %>%
  summarise(mean(weight))
# Can also use summarise to count the number of observations in your dataset:
svy %>%
  group_by(month) %>%
  summarise(count=n())
# Or by using the tally command:
svy %>%
  group_by(month) %>%
  tally()
# Challenge: For each species and year, count the number of observations per sex and also calculate the mean weight
svy %>%
  group_by(species_id, year, sex) %>%
  filter(!is.na(weight)) %>%
  summarise(count=n(), mean_wt=mean(weight))
# Create dataset for plots after break: -----------------------------
# Remove observations with missing values for species_id, weight, hindfoot_length, sex
svy_complete <- svy %>%
  filter(!is.na(species_id), !is.na(weight), !is.na(hindfoot_length), !is.na(sex))
# Let's also remove any species where there aren't many observations (perhaps they aren't reliable or we just want to make sure we have many observations)
species_nonrare <- svy %>%
  group_by(species_id) %>%
  summarise(count=n()) %>%
  filter(count >= 50, !is.na(species_id)) %>%
  select(species_id)
# Only keep data from svy_complete for species in the nonrare dataset:
svy_complete <- svy_complete %>%
  filter(species_id %in% species_nonrare$species_id)
# Your dimensions should be 30463 rows with 9 columns
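# A quick way to check that (the expected counts come from the line above):
dim(svy_complete) # should print: 30463 9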
# To write the file into your project folder:
write.csv(svy_complete, "svy_complete.csv", row.names=FALSE)
# ggplot2 ----------------------------------
# If you are on your own computer and don't have ggplot2 installed, run:
install.packages("ggplot2")
# Everyone, load ggplot2:
library("ggplot2")
# Let's all load the dataset from the webpage to make sure we all are working with the same data:
# Download svy_complete.csv from the course webpage: http://erdavenport.github.io/2016-06-13-cornell/
# Clear out everything we did earlier, start with a fresh R session:
rm(list=ls(all=TRUE))
# Read in the data we just downloaded:
svy <- read.csv("svy_complete.csv", header=TRUE)
# What does ggplot do?
# ggplot initializes a basic graph structure, then elements are added to the plot
# aes (aesthetics) - maps the variables to the plot
# geom - add geometric objects to the plot (geom_point, geom_line, geom_boxplot, geom_smoother, etc.)
# scales - modify axes and labels
# facets - plot different panels of your data
# Three examples: scatterplot, side-by-side boxplots, and longitudinal data
# Scatterplots --------------------------------------
# Let's make a scatterplot of hindfoot length vs. weight
ggplot(data=svy) # Initializes plot, don't really see anything!
ggplot(data=svy, aes(x=weight, y=hindfoot_length)) # Initializes plot, shows the axes, but no data yet
ggplot(data=svy, aes(x=weight, y=hindfoot_length)) +
  geom_point() # This will add the points.
# Note: if you split a command over two lines, keep the plus at the end of the prior line
# You can save plots as objects:
svy_plot <- ggplot(data=svy, aes(x=weight, y=hindfoot_length)) # That's saved and we can run it later.
svy_plot + geom_point()
# How can we customize our plots?
ggplot(data=svy, aes(x=weight, y=hindfoot_length)) +
  geom_point(alpha=0.05) # alpha controls the transparency of the points. So, we should get more transparent points
# Change color:
ggplot(data=svy, aes(x=weight, y=hindfoot_length)) +
  geom_point(alpha=0.05, color="blue") # color will set the color of the points
# Color by species:
ggplot(data=svy, aes(x=weight, y=hindfoot_length)) +
  geom_point(alpha=0.05, aes(color=species_id)) # color by species_id
# We could have included the color information in the first ggplot call. Then, those colors would be the default for any plot that you add on:
ggplot(data=svy, aes(x=weight, y=hindfoot_length, color=species_id)) +
  geom_point(alpha=0.05) # color by species_id
# Let's add on some nicer labels:
ggplot(data=svy, aes(x=weight, y=hindfoot_length, color=species_id)) +
  geom_point(alpha=0.05) + # Make points transparent
  xlab("Weight (g)") + # Set the x-axis label to Weight (g)
  ylab("Hindfoot Length (mm)") + # Set the y-axis label to Hindfoot Length (mm)
  scale_color_discrete(name="species") # Change the label above the legend
# side-by-side boxplots for hindfoot_length by species
ggplot(data=svy, aes(x=species_id, y=hindfoot_length)) +
  geom_boxplot()
# Add the individual observations on top of the boxes with some jitter:
ggplot(data=svy, aes(x=species_id, y=hindfoot_length)) +
  geom_boxplot() +
  geom_jitter(alpha=0.1, color="tomato")
ggplot(data=svy, aes(x=species_id, y=hindfoot_length)) +
  geom_jitter(alpha=0.1, color="tomato") +
  geom_boxplot() + # putting the boxplot after the jitter draws the boxes on top of the points
  xlab("Species Code") +
  ylab("Hindfoot Length (mm)")
# Challenge: Create violin plots of weight by species (using geom_violin) without the jitter; add appropriate axis labels and a main title (using ggtitle)
ggplot(data=svy, aes(x=species_id, y=weight)) +
  geom_violin() +
  xlab("Species code") +
  ylab("Weight (g)") +
  ggtitle("Weight by Species")
# Eeek, ugly. Looks like the distribution of our weights are skewed. Let's put the y-axis on a log scale:
ggplot(data=svy, aes(x=species_id, y=weight)) +
  geom_violin() +
  xlab("Species code") +
  ylab("Weight (g)") +
  ggtitle("Weight by Species") +
  scale_y_log10()
# Longitudinal plot of the number of each species over time (year). Create a dataframe of counts by year and species
yearly_counts <- svy %>%
  group_by(year, species_id) %>%
  tally()
ggplot(data=yearly_counts, aes(x=year, y=n)) +
  geom_line()
# We aren't distinguishing by species above, not useful. We want separate lines for each species.
ggplot(data=yearly_counts, aes(x=year, y=n, group=species_id)) +
  geom_line()
# Better! There are multiple lines now, but we can't tell which is which. Let's color by species:
ggplot(data=yearly_counts, aes(x=year, y=n, group=species_id, color=species_id)) +
  geom_line()
# Sometimes we'll want a separate plot for each of our species, rather than having a bunch of lines on one plot. We can do this using faceting:
ggplot(data=yearly_counts, aes(x=year, y=n, group=species_id, color=species_id)) +
  geom_line() +
  facet_wrap(~ species_id)
# Now, what if we want to split by sex? First, create a new dataset:
yearly_sex <- svy %>%
  group_by(year, species_id, sex) %>%
  tally()
# And plot:
ggplot(data=yearly_sex, aes(x=year, y=n, color=species_id, group=sex)) +
  geom_line() +
  facet_wrap(~species_id)
# Can't tell which sex is which. Let's color by sex rather than species.
ggplot(data=yearly_sex, aes(x=year, y=n, color=sex, group=sex)) +
  geom_line() +
  facet_wrap(~species_id)
# You can change the appearance of your graphs by changing the theme:
ggplot(data=yearly_sex, aes(x=year, y=n, color=sex, group=sex)) +
  geom_line() +
  facet_wrap(~species_id) +
  theme_bw() # This will get rid of the gray background
# Challenge: plot the average weight for each species over time by year. Hint: need to use dplyr first to create the dataset to plot.
# First, create the dataset:
yearly_wt <- svy %>%
  group_by(year, species_id) %>%
  summarize(avg_wt = mean(weight))
ggplot(data=yearly_wt, aes(x=year, y=avg_wt, color=species_id)) +
  geom_line()
# Now let's break this up by sex as well:
yearly_wt_sex <- svy %>%
  group_by(year, species_id, sex) %>%
  summarize(avg_wt=mean(weight))
ggplot(data=yearly_wt_sex, aes(x=year, y=avg_wt, color=species_id, group=species_id)) +
  geom_line() +
  facet_wrap(~ sex)
# What if we want the two sexes to be two rows rather than columns? Use facet_grid instead:
ggplot(data=yearly_wt_sex, aes(x=year, y=avg_wt, color=species_id, group=species_id)) +
  geom_line() +
  facet_grid(sex ~ .)
Tuesday Morning
This morning we'll start with databasing with SQL.
If you don't still have the data from yesterday, get it here: https://figshare.com/articles/Portal_Project_Teaching_Database/1314459
To open the database, click the "open database" button (the folder icon) and open portal_mammals.sqlite
Great, but how can we make a database using our own data?
First, make sure individual tables of yours are tidy (rectangular, single observation per row, etc.)
You can create your own database by importing the three CSV files (surveys, species, plots) from the ecology dataset
To query the database, click on the surveys table, and then the "Execute SQL" tab:
Let's pull the year column from the surveys table:
SELECT year FROM surveys
We can select multiple columns by separating with commas. Let's pull year, month, and day from surveys:
SELECT year, month, day FROM surveys
The "*" is a wildcard character and means to pull everything:
SELECT * FROM surveys
It is good form to use all caps for SQL commands and use lowercase for your variables
The DISTINCT command will pull the unique values within a column.
SELECT DISTINCT species_id FROM surveys
It can also be used to pull unique combinations if you put in multiple columns
SELECT DISTINCT species_id, year FROM surveys
You can also do calculations from within the query. Let's display weight in kg rather than g:
SELECT plot_id, species_id, sex, weight/1000 FROM surveys
Floats vs. integers: if you want the result of a computation to be a float, make sure there's a decimal point somewhere in the operation:
SELECT plot_id, species_id, sex, weight/1000.0 FROM surveys
You can display only certain numbers of decimals by using the ROUND function:
SELECT plot_id, species_id, sex, ROUND(weight/1000.0, 2) FROM surveys
Challenge:
Write a query that returns the year, month, day, species_id, and weight in mg.
Answer:
SELECT year, month, day, species_id, weight*1000 FROM surveys
Filtering ------------------------
You can also use SQL to select only certain rows, using a WHERE clause
SELECT year, month, species_id, weight*1000 FROM surveys WHERE species_id="DM"
Our query is getting long. Convention is to put each clause on its own line to keep things readable, and to end the command with a semicolon:
SELECT year, month, species_id, weight*1000
FROM surveys
WHERE species_id="DM";
Can also include numeric conditionals. What if we only want records from the year 2000 onward?
SELECT year, month, species_id, weight*1000
FROM surveys
WHERE year >= 2000;
Can also have more complicated conditions using AND/OR. For example, if we want only the DM species from the year 2000 onward:
SELECT year, month, species_id, weight*1000
FROM surveys
WHERE year >= 2000 AND species_id="DM";
Style statement - you can wrap things in parentheses to make them more readable:
SELECT year, month, species_id, weight*1000
FROM surveys
WHERE (year >= 2000) AND (species_id="DM");
What if we wanted all of the Dipodomys species (DM, DO, DS)?
SELECT year, month, species_id, weight*1000
FROM surveys
WHERE (species_id="DM") OR (species_id="DO") OR (species_id="DS");
Challenge:
Write a query that returns the day, month, year, species_id, and weight (in kg) for individuals caught on Plot 1 that weigh more than 75 g
Answer:
SELECT day, month, year, species_id, weight/1000.0
FROM surveys
WHERE (plot_id = 1) AND (weight > 75);
What if we have a bunch of species we want to select? We can use an IN statement, similar to dplyr in R yesterday:
SELECT day, year, month, species_id, weight/1000.0
FROM surveys
WHERE (species_id IN ("DO", "DM", "DS"));
It's always good to start simple with your queries and build up. Also, comment SQL code using at least two dashes at the beginning of a line:
-- Get post-2000 data on Dipodomys species
-- This info is in the surveys table, want all columns
SELECT *
FROM surveys
-- Sampling year is in the column "year" and we want to include 2000 and later
WHERE (year >= 2000);
You can also sort in SQL using the command ORDER BY. Let's say we want to sort by the taxa name in ascending order (ASC):
SELECT *
FROM species
ORDER BY taxa ASC;
Descending order is (DESC).
Can combine multiple sorts together as well:
SELECT *
FROM species
ORDER BY taxa ASC, species ASC;
Challenge:
Write a query that returns year, species, and weight in kg from the surveys table, sorted with the largest weights at the top.
Answer:
SELECT year, species_id, weight/1000.0
FROM surveys
ORDER BY weight DESC;
You can sort by columns you don't even include as output:
SELECT genus, species
FROM species
WHERE taxa="Bird"
ORDER BY species_id ASC;
Order of operations in SQL: rows are filtered with WHERE, then the results are sorted with ORDER BY, then the requested columns are displayed with SELECT. That's why you can sort on a column you don't display.
Challenge:
Let's combine everything together: Using the surveys table, write a query to display the three date fields, species_id, and weight in kg (rounded to two decimal places) for individuals captured in 1999, ordered alphabetically by species_id. Write the query as a single line, then put each clause on its own line and see how much more legible the query becomes.
Answer:
SELECT year, month, day, species_id, ROUND(weight/1000.0, 2)
FROM surveys
WHERE year=1999
ORDER BY species_id ASC;
You can also use aggregation functions within SQL to get summaries of your data. How many entries do we have in all columns? How much does everything in our dataset weigh? Can we round that to 2 decimal places?
SELECT COUNT(*), ROUND(SUM(weight)/1000.0, 2)
FROM surveys;
Challenge:
Write a query that returns: total weight, average weight, and the min and max weights for all animals caught over the duration of the survey. Can you modify it so that it outputs these values only for weights between 5 and 10?
Answer:
SELECT SUM(weight), AVG(weight), MIN(weight), MAX(weight)
FROM surveys
WHERE (weight >=5) AND (weight<=10);
Oftentimes you don't want to aggregate over everything in a table, but within groups defined by a variable. Just like in dplyr, you can use GROUP BY to aggregate by some variable in your data. For instance, by species_id:
SELECT SUM(weight), AVG(weight), MIN(weight), MAX(weight)
FROM surveys
WHERE (weight >=5) AND (weight<=10)
GROUP BY species_id;
Challenge:
Write queries that return: how many individuals were counted in each year for each species, and the average weight of each species in each year. Can you combine them into one query?
Answer (combined):
SELECT year, species_id, COUNT(*), AVG(weight)
FROM surveys
GROUP BY year, species_id;
If you want to filter on an aggregate statistic (which WHERE can't do), you can use the HAVING clause:
SELECT species_id, COUNT(species_id)
FROM surveys
GROUP BY species_id
HAVING COUNT(species_id) >10
ORDER BY COUNT(species_id);
You can save SQL queries for future use. Why would you want to do this? You can run the same queries on updated data. To do this, you need to CREATE VIEW:
CREATE VIEW species_count AS
SELECT species_id, COUNT(species_id)
FROM surveys
GROUP BY species_id
HAVING COUNT(species_id) >10
ORDER BY COUNT(species_id);
Now if you look over to the side, you'll see a new table under the views tab. If you update your data, this table will update too.
You can query a view just like a table:
SELECT *
FROM species_count
Challenge: Write a query that returns the number of each species caught in each year sorted from the most often caught species to the least occurring ones within each year starting from the most recent records. Save this query as a VIEW.
CREATE VIEW count_per_year AS
SELECT species_id, year, COUNT(*)
FROM surveys
GROUP BY year, species_id
ORDER BY year DESC, COUNT(*) DESC;
So, we have a bunch of information spread across multiple tables. How can we smoosh it all together, to get the info we need into one table we can run analyses on? We can use JOIN
SELECT *
FROM surveys
JOIN species
ON surveys.species_id = species.species_id;
Note that in the ON clause we name the table we're pulling from first (surveys), put a period, and then the column we want to merge on (species_id). The equals sign says this column must match the corresponding column from the other table (species.species_id)
If the column names are the same we can save some typing:
SELECT *
FROM surveys
JOIN species
USING (species_id);
If we only want certain columns from each table, you can specify which columns you want in the SELECT statement:
SELECT surveys.year, surveys.month, surveys.day, species.genus, species.species
FROM surveys
JOIN species
USING (species_id);
And you can join two tables and aggregate data as well. If you want the average weight by plot type:
SELECT plots.plot_type, AVG(surveys.weight)
FROM surveys
JOIN plots
ON surveys.plot_id=plots.plot_id
GROUP BY plots.plot_type;
SQL also has functions for missing data. If you want to recategorize missing data, you can use the IFNULL command:
SELECT species_id, sex, IFNULL(sex, "U")
FROM surveys;
That column name is UGLY. Also, it looks like SQL syntax, which means it would be very hard to reference it in code. We can rename it using AS:
SELECT species_id, sex, IFNULL(sex, "U") AS non_null_sex
FROM surveys;
When we joined the surveys and species tables before, some of the records were excluded because of null values for species_id. We can fix that now by specifying a new species name if the species table data is missing. Let's name that "AB":
SELECT surveys.year, surveys.month, surveys.day, species.genus, species.species
FROM surveys
JOIN species
ON surveys.species_id = IFNULL(species.species_id, "AB");
We can do the opposite as well. We can set cells to null if they meet some criterion. For instance, let's set plot_id to null wherever the plot_id is 7:
SELECT species_id, plot_id, NULLIF(plot_id, 7) AS partial_plot_id
FROM surveys;
There are other commands that might be useful to you:
ABS()
LENGTH()
LOWER()
UPPER()
NULLIF(x, y)
ROUND()
TRIM()
IFNULL(x, y)
RANDOM()
REPLACE(string, find, replace)
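For example, a quick sketch using a few of the string functions on the species table:
SELECT genus, UPPER(genus), LOWER(genus), LENGTH(genus)
FROM species;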
Challenge: Write a query that returns the number of individuals of each genus caught in each plot, sorted in descending order of the counts.
Two step answer:
CREATE VIEW combined AS
SELECT *
FROM surveys
JOIN species
ON surveys.species_id = species.species_id;
SELECT genus, plot_id, COUNT(*)
FROM combined
GROUP BY plot_id, genus
ORDER BY COUNT(*) DESC;
One step answer:
SELECT plot_id, genus, COUNT(*)
FROM surveys
JOIN species
ON surveys.species_id = species.species_id
GROUP BY plot_id, genus
ORDER BY COUNT(*) DESC;
Challenge: Write a query that finds the average weight of each rodent species (i.e., only include species with Rodent in the taxa field).
Answer:
SELECT species.species_id, AVG(weight)
FROM surveys
JOIN species
USING (species_id)
WHERE species.taxa="Rodent"
GROUP BY surveys.species_id;
Tuesday Afternoon
# Set up that folder with all the data on the computer you're using if you don't still have it.
# Accessing your SQL database from R:
library(RSQLite)
# To connect to your database
myDB <- "portal_mammals.sqlite"
conn <- dbConnect(drv=SQLite(), dbname=myDB)
# To query the database
dbGetQuery(conn, "SELECT count(*) FROM surveys")
# To list all the tables in your database:
dbListTables(conn)
# To list all the fields in a table:
dbListFields(conn, "surveys")
# Often you'll want to save your query to a variable rather than type it in the dbGetQuery command. This makes it easier to read and edit your query
q <- "SELECT DISTINCT year, species_id FROM surveys"
result <- dbGetQuery(conn, q)
# We can use the results of some code we run in R to query our database directly.
# Let's say we want information from our database for every other year in a range. First, let's get the range of years from the data
yearRange <- dbGetQuery(conn, "SELECT min(year), max(year) FROM surveys")
# Now, let's create a sequence of every other year:
years <- seq(yearRange[,1], yearRange[,2], by=2)
# The paste command is super useful in R. It will "paste" together character strings or R expressions into one string:
paste("This", "That")
# Now let's write a query to pull info for those years from the database
q <- paste("SELECT surveys.year species.taxa, COUNT(*) as count
FROM surveys
JOIN species
ON surveys.species_id=species.species_id
WHERE species.taxa='Rodent' AND surveys.year IN (", paste(years, collapse=','),
") GROUP BY surveys.year, species.taxa", sep="")
rCount <- dbGetQuery(conn, q)
# We can also build a SQL database from within R.
# Read in those csv datatables into R
species <- read.csv("species.csv")
surveys <- read.csv("surveys.csv")
plots <- read.csv("plots.csv")
newDB <- "portalR.db"
myConn <- dbConnect(drv=SQLite(), dbname=newDB)
dbListTables(myConn)
dbWriteTable(myConn, "species", species)
dbWriteTable(myConn, "surveys", surveys)
dbWriteTable(myConn, "plots", plots)
dbListTables(myConn)
# Every time you open a database connection from R, be sure you close your connection:
dbDisconnect(myConn)
Tuesday Afternoon Notes
How do we communicate our results?
1. RMarkdown for reports
2. We'll learn some new tricks in R: loops and functions
Install the Rmarkdown package:
install.packages("rmarkdown")
library(rmarkdown)
We will use the same data that we used for ggplot yesterday, svy_complete.csv
Make sure that you have your project setup to the folder where the dataset is saved.
Lessons are on the course website already. These were all created with R Markdown; check out how pretty they are! You too can make pretty things!
In R Markdown, the code and the contents of the document are interwoven, so making changes and updates is easy.
Goals:
1. Learn a bit about markdown code.
2. Learn to introduce code chunks.
3. Put them all together and get clever.
Analysis and methods all in one place!
File -> New File -> Text File to make a README file
Date
Name
Description
File -> Save As -> README.md
.md=markdown
See the "Preview HTML" button, click it! See a txt file of what we have so far.
Let's organize our file.
Create a level one header (largest level) by putting a # at the start of the line
A level two header can be made with two hashtags (##)
When you click Preview HTML, files will be saved before it renders.
Markdown syntax:
Add:
data files
scripts
results
with two spaces at the end of each line to put them on separate lines
Blank lines will still be blank lines even without the two spaces
Add *'s to get bullets:
* data files
* scripts
* results
To make **bold** and to make _italicized_
Comments in R markdown can be made like this: <!-- comment here -->
Wrap text in backticks to format it as inline code, e.g. `cat README.md`
Adding a code block for commands to execute in R:
begin with ```
and end with ```
```
install.packages("rmarkdown")
```
now we have a code block!
Hyperlinks:
2 methods:
1) Wrap the text in the hyperlink in square brackets followed by the hyperlink in parentheses: i.e. [Text](hyperlink)
The data for this project originally came from the [Portal Project Teaching Database](https://figshare.com/articles/Portal_Project_Teaching_Database/1314459)
2) To link to individual data files, surround the text in square brackets, followed by another set of square brackets supplying a reference id. Then, on a following line, provide the hyperlink for each reference id
: i.e. [Text1][ref1] [Text2][ref2]
The data files were downloaded 6/14/16, including both [species level data][link1] and [plot level data][link2].
[link1]:https://ndownloader.figshare.com/files/2292169
[link2]:https://ndownloader.figshare.com/files/3299474
Can do all sorts of other formatting as well: indents, lists, etc - just google it!
File -> New File -> R Markdown
Title = On the origin of my dissertation
Author = Me.
Using HTML as default output
Will produce a new R Markdown file with a template. It can help you get started if you are new to Rmd or just haven't used it in a while.
"Knit HTML" is in the place where "preview HTML" was before.
Will ask you to save the file as a .Rmd file.
The R code is evaluated! The code and the output are included in the document.
Let's add some more text to the end of the file.
The dimensions of the cars dataset are:
```{r}
dim(cars)
```
Can also use Chunks -> Insert Chunk
OR
Ctrl+Alt+I
so you don't have to keep writing ```{r} ```
Chunk options:
Sometimes we want to see the R code in the file, sometimes we don't!
There are options!
echo=FALSE option means that the R code will not be printed into the document.
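For example, a minimal chunk (only the summary output will appear in the rendered document, not the code):
```{r, echo=FALSE}
summary(cars)
```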
Can give names to the chunks to make navigation easier:
```{r summary}
summary(cars)
```
Using the chunk menu at the bottom of the .Rmd file, you can jump from chunk to chunk. Give the chunks useful names!
What if you have code that you want run, but not shown? You can hide your results:
```{r summary2, results="hide"}
summary(cars)
```
```{r summary3, include=FALSE}
summary(cars)
```
Make a new Rmd file. And knit.
Insert chunk to load data
# Loading the data
```{r}
surveys_raw <- read.csv("https://ndownloader.figshare.com/files/2292172", header=TRUE)
# or download the data onto your computer and read in the local file instead of using the URL
```
# Loading the libraries
```{r load libraries}
library(ggplot2)
library(dplyr)
```
# Filter the data
Removing records with missing values for species_id, weight, or hindfoot_length:
```{r filter data}
surveys_complete <- surveys_raw %>%
filter(species_id !="") %>%
filter(!is.na(weight)) %>%
filter(!is.na(hindfoot_length))
```
Removing records for rare species:
```{r remove rare}
species_counts <- surveys_complete %>%
group_by(species_id) %>%
tally
head(species_counts)
frequent_species <- species_counts %>%
filter(n>=10) %>%
select(species_id)
surveys_complete <- surveys_complete %>%
  filter(species_id %in% frequent_species$species_id)
```
Other markdown options:
```{r, warning=FALSE, message=FALSE, include=TRUE, echo=TRUE}
```
# Analysis
The distribution of hindfoot lengths as a function of species:
```{r}
ggplot(data=surveys_complete, aes(x=species_id, y=hindfoot_length)) +
geom_boxplot()
```
Let's add a table of contents. Change the header to:
---
title: "Ecology Test"
author: "Name"
date: "June 14, 2016"
output:
html_document:
toc: true
---
Add extra cool things like numbered sections, colors and themes, figure size options
MAKE SURE TO INDENT AS SHOWN AND INCLUDE SPACES AFTER THE COLONS!!!!!!!!!!!
---
title: "Ecology Test"
author: "Name"
date: "June 14, 2016"
output:
html_document:
toc: true
number_sections: true
theme: cerulean
highlight: espresso
fig_width: 10
fig_height: 3
---
How to do figure captions:
Add this option to the header:
---
title: "Ecology Test"
author: "Name"
date: "June 14, 2016"
output:
html_document:
fig_caption: yes
---
Then within a chunk, specify the name of the figure caption
```{r hindfoot-boxplot, fig.cap="Figure 1: Hindfoot length within each species"}
[INSERT R CODE HERE]
```
While it’s nice to have all of your code and results in one place, sometimes you’ll also want your figures printed to a separate file so that you can include them in manuscripts, presentations, or posters. You can specify the self_contained option as false in the header, which will generate all the files used to render the HTML.
---
title: "Ecology Test"
author: "Name"
date: "June 14, 2016"
output:
html_document:
self_contained: false
---
When you render HTML, it will create a folder with all the image files needed to put into the html document
Challenge:
1) Hide code for boxplots
- add "echo=FALSE" to chunk
2) Change fig dimension to 8 x 8
- Change fig_width and fig_height in header
3) Add another boxplot plotting the distributions of weights across the species, ensuring there are respectable titles and axis labels. Also, just like the previous figure, make sure there is a legend and that the code to generate the figure does not appear in the final report.
Add a new code chunk,
```{r boxplot2, echo=FALSE}
ggplot(data=surveys_complete, aes(x=species_id, y=weight)) +
  geom_boxplot() +
  xlab("Species") +
  ylab("Weight (g)") +
  ggtitle("Weight across species")
```
4) Add a sentence under the new figure: “The total number of species examined is X”, where X is the total number of species in the filtered dataset, as evaluated by inline code.
In the markdown section (i.e. not in a code chunk) write:
"The total number of species examined is `r count(frequent_species)`.
Programming Basics- Make a new markdown document
If/else statements
Let’s say that we discover from our collaborator that there was an issue in 1984. All of the scales had not been calibrated, and we need to increase the weights of any measurements made in that year by 10%. How can we go through and update our table?
use svy_complete.csv
library("dplyr")
library("ggplot2")
surveys <- read.csv('svy_complete.csv')
if (conditional statement) {
  action
} else {
  do this other thing
}
if (surveys$year[1]==1984) {
print("Great Scott, it's 1984!")
} else {
print("it's not 1984")
}
Challenge
Let’s say we’re interested in knowing whether an animal is large or not, with a cut-off of at least one ounce. Write an if/else statement that evaluates whether the first animal in our data is larger than an ounce. (Hint: one ounce is 28.3g)
if (surveys$weight[1] > 28.3) {
print("the first animal weighs more than one ounce")
} else {
print("nope, too tiny." )
}
But this only works on the first row. We want to do it on all rows.
Loops
General syntax:
for (variable in vector) {
  do something
}
a simple loop :
for (i in 1:10) {
print (i)
}
# You can put if statements inside for-loops!
for (i in 1:dim(surveys)[1]) {
  if (surveys$year[i]==1984) {
    print("Great Scott, it's 1984!")
  } else {
    print("it's not 1984")
  }
}
# Save a copy of surveys in case we mess it up!
surveys_adjusted <- surveys
# Let's say that we discover from our collaborator that there was an issue in 1984. All of the scales had not been calibrated, and we need to increase the weights of any measurements made in that year by 10%. How can we go through and update our table?
# We don't need the else, since we're not going to adjust the non-1984 values.
for (i in 1:dim(surveys)[1]) {
  if (surveys$year[i]==1984) {
    surveys_adjusted$weight[i] <- surveys$weight[i]*1.1
  }
}
#Check that this worked:
surveys %>%
group_by(year) %>%
summarize(mean_weight=mean(weight))
surveys_adjusted %>%
group_by(year) %>%
summarize(mean_weight=mean(weight))
Challenges:
- Using a for loop, using an if/else statement, and without using dplyr, tally the number of animals that weigh over an ounce in our dataset.
- to tally: initialize a variable, and in the loop, add one each time:
- var <- 0
- var <- var + 1
largeA <- 0
for (j in 1:dim(surveys)[1]) {
  if (surveys$weight[j] > 28.3) {
    largeA <- largeA + 1
  }
}
largeA
- For the animals that are not over an ounce in weight, how many of them are female and how many of them are male?
skinnyM <- 0
skinnyF <- 0
largeA <- 0
for (j in 1:dim(surveys)[1]) {
  if (surveys$weight[j] > 28.3) {
    largeA <- largeA + 1
  } else {
    if (surveys$sex[j] == "M") {
      skinnyM <- skinnyM + 1
    } else {
      skinnyF <- skinnyF + 1
    }
  }
}
Discussion on nice formatting of code (indenting in particular). It looks better in RStudio than in this notepad because RStudio automatically suggests spacing and indenting.
# Functions
If you have to do anything more than once, put it in a function.
Our collaborator has noticed more problems with the data. They were wrong about the calibration issues in 1984 and have told us to discard the updated table we made. However, they realize that the person who recorded the data in 1984 somehow transformed all of the data they collected - both the weights and the hindfoot_length. To get the correct values, we will need to multiply the recorded values by 1.1245697375083747 and add 10 to both of those variables. Your collaborator is very insistent that you use all of the significant digits provided when you convert values!
Could do this with loops
for (i in 1:dim(surveys_adjusted)[1]) {
  if (surveys_adjusted$year[i] == 1984) {
    surveys_adjusted$weight[i] <- surveys$weight[i]*1.1245697375083747+10
    surveys_adjusted$hindfoot_length[i] <- surveys$hindfoot_length[i]*1.1245697375093747+10
  }
}
BUT! There is a typo in this crazy number. When making the adjustment more than once, you can make things more consistent with a function
general syntax:
function_name <- function(arguments) {
  do stuff
}
convert_1984 <- function(myval) {
  myval_adjusted <- myval*1.1245697375083747 + 10
  return(myval_adjusted)
}
convert_1984(1)
# If we put a "1" in the parentheses, the function will use the value 1 for myval, apply the changes to it, and return the result.
# Now, let’s use this function in our loop to alter the values of weight and hindfoot_length:
for (i in 1:dim(surveys_adjusted)[1]) {
  if (surveys_adjusted$year[i] == 1984) {
    surveys_adjusted$weight[i] <- convert_1984(surveys$weight[i])
    surveys_adjusted$hindfoot_length[i] <- convert_1984(surveys$hindfoot_length[i])
  }
}
# Challenge
Your collaborator tells you that you can use the length of the hindfoot to calculate brain volume. Apparently the hindfoot of these creatures is equal to the diameter of their skulls. Write a function that will calculate the volume of the animals' skulls and apply it to this dataset. Hint: the volume of a sphere is 4/3 * pi * r^3
lengthToVolume <- function(len) {
  return(4/3 * pi * (len/2)^3)  # the hindfoot length is the diameter, so the radius is len/2
}
for (i in 1:dim(surveys_adjusted)[1]) {
print(lengthToVolume(surveys_adjusted$hindfoot_length[i]))
}
How would you save the results to a file instead of printing to screen?
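One sketch of an answer (the column and file names here are just examples): store the volumes in a new column instead of printing them, then write the table out with write.csv as we did earlier.
surveys_adjusted$skull_volume <- lengthToVolume(surveys_adjusted$hindfoot_length) # works on the whole column at once, no loop needed
write.csv(surveys_adjusted, "surveys_with_skull_volumes.csv", row.names=FALSE)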