
A Data Carpentry Workshop

Cornell University Statistical Consulting Unit (CSCU)

Jun 13-14, 2016

9:00am - 5:00pm

Instructors: Erika Mudrak, Emily Davenport, Lynn Johnson

Helpers: Francoise Vermeylen, Stephen Parry, Kevin Packard, David Kent

General Information

Data Carpentry workshops teach basic concepts, skills and tools for working more effectively with data.

We will cover data organization in spreadsheets and data cleaning with OpenRefine; R day 1: managing and analyzing data with dplyr and visualizing data with ggplot2; SQL for data management; and R day 2: an introduction to programming and automatic reports with R Markdown. Participants should bring their laptops and plan to participate actively. By the end of the workshop, learners should be able to manage and analyze data more effectively and to apply the tools and approaches directly to their ongoing research.

Who: The course is aimed at faculty, research staff, postdocs, graduate students, advanced undergraduates, and other researchers in any field. Priority will be given to people from Cornell Departments that support CSCU. See this page for a list of such departments.

Where: Albert R. Mann Library Room B30A, 237 Mann Drive, Cornell University. Get directions with OpenStreetMap or Google Maps.

Requirements: Participants must bring a laptop with a Mac, Linux, or Windows operating system (not a tablet, Chromebook, etc.) on which they have administrative privileges. They should have a few specific software packages installed (listed below). They are also required to abide by Data Carpentry's Code of Conduct.

Prerequisites: We especially encourage registration for those who may be less familiar with the above topics. To allow for coverage of more advanced R topics, we require that participants be familiar enough with R and RStudio to:

  • navigate the RStudio interface (console, scripts, tabs for workspace, history, files, plots, packages, help)
  • make a project in a new directory
  • load a data.frame via read.csv() or read.table()
  • access columns of a data.frame with $
  • use assignment operators (<- or =) and comments (#)
  • explore a data.frame via head(), str(), summary(), nrow(), ncol(), names(), rownames(), table(), levels(), mean(), length(), max(), min()
  • modify data types with as.factor(), relevel(), as.numeric()
  • create vectors with c() or seq() and index them with [ ] bracket notation
  • subset data with [,] bracket notation and logical vectors (==, !=, <, >, %in%) for conditions
  • make simple plots with plot(), barplot(), hist()
  • understand the concept of functions and their arguments
  • get help via ? or by searching the Help tab
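
As a rough self-check, most of the skills listed above fit into a short R session like the one below. This is only an illustrative sketch: the data, the object names (csv_text, surveys_demo, heavy) and the column names are made up for the example and are not workshop material.

```r
# Hypothetical toy data, embedded as text so the example is self-contained.
csv_text <- "species,sex,weight
DM,M,40
DO,F,48
DM,F,42
PP,M,17
DO,M,51
DM,F,44"

# Load a data.frame (read.csv() also accepts a file path instead of text =)
surveys_demo <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Explore the data.frame
head(surveys_demo)
str(surveys_demo)
nrow(surveys_demo)
names(surveys_demo)
table(surveys_demo$sex)

# Modify data types
surveys_demo$species <- as.factor(surveys_demo$species)
surveys_demo$species <- relevel(surveys_demo$species, ref = "DO")
surveys_demo$weight  <- as.numeric(surveys_demo$weight)
levels(surveys_demo$species)

# Create a vector and use it to index rows
ids <- seq(1, 5, by = 2)            # rows 1, 3 and 5
odd_rows <- surveys_demo[ids, ]

# Subset rows with logical conditions, then summarize a column
keep  <- surveys_demo$weight > 41 & surveys_demo$species %in% c("DM", "DO")
heavy <- surveys_demo[keep, ]
mean_heavy <- mean(heavy$weight)
```

If each line here makes sense to you, and you could also draw a quick hist(surveys_demo$weight) and look up ?relevel, you are likely at the expected level.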

If you have never used R or want a refresher, please prepare for the Data Carpentry Workshop by attending CSCU's free workshops:
  • Learn the above in Introductory Statistics Using R on June 9.
  • Practice the above in Intermediate Statistics Using R on June 10.

Fee: We charge a $40 fee to help defray costs.

Contact: Please mail for more information.

Useful Links


Please be sure to complete these surveys before and after the workshop.

Pre-workshop Survey

Post-workshop Survey


We will use The Portal Project Teaching Database

Get it here:

Data for OpenRefine Lesson

Data for ggplot2 Lesson svy_complete.csv (right click on the link to save)


These are the lessons that were used during the workshop

Introductory PowerPoint (Monday morning)

Excel lessons (Monday morning)

OpenRefine lessons (Monday morning)

R lessons (Both days)

SQL lessons (Tuesday morning)


Right click on the links to save:

This is the script that Lynn generated on her computer during the lessons for dplyr and ggplot2 (Monday afternoon)

This script contains the code Erika used to query SQL databases from R (Tuesday afternoon)

This is the README markdown file Emily generated (Tuesday afternoon)

This is the Rmd document Emily generated during the R Markdown lesson (Tuesday afternoon)

This is the Rmd document Emily generated during the if/else, loops, and functions lesson (Tuesday afternoon)

Cheat Sheets

Here are some links to cheat sheets you can reference later:





Preliminary Schedule

Day 1

Morning: Data organization in spreadsheets and data cleaning with OpenRefine
Afternoon: R day 1 (managing and analyzing data with dplyr, visualizing data with ggplot2)

Day 2

Morning: SQL for data management
Afternoon: R day 2 (intro to programming, and automatic reports with R Markdown)

We will use this Etherpad for chatting, taking notes, and sharing URLs and bits of code. If that link goes dead, an export of the Etherpad is here.


To participate in a Data Carpentry workshop, you will need working copies of the software described below. Please install everything (or at least download the installers) before the start of the workshop. Bring and use your own laptop to ensure that your tools are set up properly for an efficient workflow once you leave the workshop.

Please follow these Setup Instructions.

We maintain a list of common installation issues on the Configuration Problems and Solutions wiki page. It is written as a reference for instructors but may also be useful to participants.