Welcome to Cornell MBG Software Carpentry
Instructors:
Nina Overgaard Therkildsen
Erika Mudrak
Emily Davenport
Helper:
David Kent
Links:
Bootcamp website: https://erdavenport.github.io/2016-08-11-cornell-mbg/
Good Morning Ithaca!
The Shell
Download the data here: http://swcarpentry.github.io/shell-novice/setup/
Follow the lessons here: http://swcarpentry.github.io/shell-novice/
pwd shows the path to the directory where we are currently working
ls shows all the files and folders that are in this directory ("folder" is the same as "directory")
modify a command's behavior with "flags", usually letters after a dash "-"
ls --help gives a list of all these flags
Google is your friend!
Tab is also your friend: hit tab to autocomplete file names while in the shell
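For example, a quick session might look like this (the home directory and its contents shown here are hypothetical; yours will differ):
$ pwd
/Users/nelle
$ ls -F
Desktop/  Downloads/  data-shell/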
cd will change directories
Change directories into the shell-novice-data directory - put up orange stickies if you need help
where is the link to download the zip file? http://swcarpentry.github.io/shell-novice/setup/
cd shortcuts:
- If you ever feel lost in your file system, you can always type "cd" and that will bring you back to your home directory
- cd .. will take you up one directory
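A quick demo of both shortcuts (paths shown are hypothetical):
$ cd /Users/nelle/data-shell/data
$ cd ..
$ pwd
/Users/nelle/data-shell
$ cd
$ pwd
/Users/nelle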
List out the contents of the north-pacific-gyre folder using ls
Challenge: Starting from /Users/amanda/data/, which of the following commands could Amanda use to navigate to her home directory, which is /Users/amanda?
mkdir makes a new directory (or folder)
- best practices: don't put spaces in file or folder names, because in shell, spaces go between commands and arguments. You can get around this by putting the path in quotes, but it's easier to not put spaces in directory names
rm deletes files only. To remove a directory, use the flag "-r"
$rm -r thesis
BE CAREFUL! There is no recycle bin or trash to retrieve mistakenly deleted files here!
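One hedge against mistakes: most systems support the -i flag, which makes rm ask for confirmation before each deletion (the file name here is just an example; answer y or n at the prompt):
$ rm -i thesis/draft.txt
rm: remove regular file 'thesis/draft.txt'? n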
use up arrow to scroll through recent commands
rename the directory called thesis to dissertation, then move it one directory up
$mv thesis/ dissertation
$mv dissertation/ ..
get to the molecules directory
$cd data-shell/molecules
$ls
getting info about files
wc counts the number of lines, words and characters in a file
$wc cubane.pdb
head shows the top few lines of a file
$head cubane.pdb
less prints a screen's worth of file at a time, so you can "scroll" through it
$less cubane.pdb
Wildcards:
* matches 0 or more characters, so *.pdb matches ethane.pdb, propane.pdb, and every file that ends with ".pdb".
wc *.pdb will run the word count command on all files with the .pdb extension
If we run wc -l instead of just wc, the output shows only the number of lines per file:
$ wc -l *.pdb
Pipes!
rather than print the result of a word count to the screen, we can redirect it to a file to keep for later
$wc -l *.pdb >lengths.txt
print out the contents of the file to the screen
$cat lengths.txt
now let's sort the lengths.txt file by line count, using the -n flag to say the sort is numeric instead of alphabetical, and put the output into a new file called sorted-lengths.txt
$sort -n lengths.txt >sorted-lengths.txt
but these two steps generate an intermediate file that we don't really need, and intermediate files can fill our directories with junk. Run both commands in one line with a pipe "|", which you can type on a US keyboard by pressing Shift with the backslash key above the Enter key.
$ wc -l *.pdb | sort -n
This runs the first command and sends its results to the second command, i.e. first get the line counts via wc, then sort those results. This gives us the final result without generating the intermediate file lengths.txt
Then we can look at the beginning of this result:
$ wc -l *.pdb | sort -n | head -n 1
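This shows the line count and name of the shortest file; with the lesson's data, the output should look something like:
9 methane.pdb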
___________ break __________
Go into the creatures folder
Let's make a backup of the basilisk.dat and unicorn.dat files
Using wildcards won't work for copy and renaming files in bulk. Instead let's learn how to use loops in the shell.
for filename in basilisk.dat unicorn.dat
do
head -n 3 $filename
done
The syntax for writing a loop in the shell is always:
for variable in list
do
something
done
The "done" signifies the end of the loop statement
Your variable name can be anything you want. We wrote "filename" above because that is intuitive, but you can say anything: carrot, onion, muffins, book, x, etc
for filename in *.dat
do
echo $filename
head -n 100 $filename | tail -n 20
done
Spaces in file names cause issues in for loops: if a file name contains a space, the loop will treat each word as a separate file. You can get around this problem by wrapping the file name in quotes. Example: My thesis.txt would be expanded to "My" and "thesis.txt" if quotes aren't used, but "My thesis.txt" in quotes will work, as shown below.
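A minimal sketch ("my thesis.dat" is a hypothetical file with a space in its name); note that $filename is wrapped in quotes inside the loop as well:
for filename in "my thesis.dat" unicorn.dat
do
head -n 1 "$filename"
done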
So, to do the copying:
for filename in *.dat
do
cp $filename original-$filename
done
What if you want to run a program on all the files of a certain type (with a certain extension)? First, let's set up a loop that will just echo file names to make sure that file names are being written correctly. Using an echo first in your loop is always a good idea. Double check that your commands will execute how you expect them to before actually running them (and potentially messing something up):
for datafile in *.txt
do
echo $datafile stats-$datafile
done
That works as expected, so let's run the program goostats on every file:
for datafile in *.txt
do
bash goostats $datafile stats-$datafile
done
If we run this, we have no way of knowing which file it's on. It's better to print out the name of the file being processed to the screen so we can monitor the progress of the program.
for datafile in *.txt
do
echo $datafile
bash goostats $datafile stats-$datafile
done
Let's learn how to make a script that can be run on command line.
Move back into the molecules folder
We'll be using nano as our text editor, but if it doesn't work on your computer you can use any text-editing software (like Notepad or Notepad++)
Let's save a little program called middle.sh that outputs the middle few lines of a file.
head -n 15 octane.pdb | tail -n 5
We can run this by typing
bash middle.sh
This works, but it's not much better than just typing these commands into the command line. Instead, we can alter the script so it can take an argument on the command line:
head -n 15 "$1" | tail -n 5
Let's run this giving input of what file we want to take the middle lines of:
bash middle.sh octane.pdb
We can run the same script now on any file to get the middle few lines of the file.
Let's make a very flexible script where we can specify the number of lines from the top, the filename, and number of lines at the bottom:
head -n "$2" "$1" | tail -n "$3"
To run this:
bash middle.sh propane.pdb 20 2
We know what we want to input as arguments from command line into this script, but if other people use it they won't know what we wanted. We should comment our script to remind ourselves what we expect for each variable.
# Select lines from the middle of a file
# Usage: bash middle.sh filename end_line number_lines
head -n "$2" "$1" | tail -n "$3"
Let's make another script that sorts filenames by their length:
# Sort filenames by their length
# Usage: bash sorted.sh one_or_more_filenames
wc -l "$@" | sort -n
We can run this on all of the .pdb files:
bash sorted.sh *.pdb
Saving things like this to a script is helpful so we have a record of exactly how we ran something. If we ever need to redo this later, we can easily rerun the script.
Break for lunch! Be back at 12:30 for R time
___________ R lesson_________________
Get the data here
http://swcarpentry.github.io/r-novice-inflammation/setup/
Download the data from above to your desktop
If you want to follow along with the lessons: http://swcarpentry.github.io/r-novice-inflammation/
To figure out where you are in your filesystem, type:
getwd()
This is the same as pwd in the shell.
We want to move into the same directory as where our data is stored. To change directories in R, we can use the command setwd():
setwd("~/Desktop/r-novice-inflammation-data")
dir() will list out the contents of the current directory.
dir()
If you give dir() a path, it will list out the files in that path:
dir("data/")
Our data is in csv format, so let's use the function read.csv to read in our data into R:
read.csv("data/inflammation-01.csv")
If you ever need help with a function, put a ? in front of the function name. This will pop up a help menu that will tell you about the function, including parameters, and examples.
Let's change some of the arguments for read.csv. For instance, there isn't a header for this file:
read.csv("data/inflammation-01.csv", header=FALSE)
Variables in R:
- If you want to store data or a value for later, you can assign it to a variable.
- Variables work similarly to bash: you can use almost any word, and you assign with either "=" or "<-"
weight_kg <- 55
You can treat variables in R just like you would treat variables in algebra. You can multiply numbers, add things, etc:
2.2*weight_kg
Can also make new variables from old variables:
weight_lb <- 2.2*weight_kg
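Note that changing weight_kg afterwards does not change weight_lb; R evaluated 2.2*weight_kg once, at assignment:
weight_kg <- 100
weight_lb # still 121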
We've been typing into the console, but in general you will want to save all of your R code to a script. Start a new script in RStudio by going to the File menu -> New File -> R Script
In RStudio, you can run code from your script directly in the console by putting your cursor on a line and hitting Ctrl+Enter (Cmd+Enter on a Mac) or hitting the green Run arrow above the script. You can also select multiple lines of code to run.
Let's save that file we are reading in as a variable called dat:
dat <- read.csv("data/inflammation-01.csv", header=FALSE)
Some useful commands to look at data in R:
head() # This will look at the top few rows of the data, all columns
tail() # This will look at the last few rows of the data, all columns
class(dat) # This will tell you what type of object you're looking at.
A data.frame is most often what your data will be stored as if it's in a table.
dim(dat) # Tells you the number of rows by columns of your data table
To access certain parts of our data, we can subset using square brackets:
dat[1,1] # This will pull the value in the first row and the first column for you. The order when subsetting is always rows comma columns.
What if you want the first 10 observations? You can slice:
dat[1:10, 1:10] # The colon is what is used for slicing. It will show the first value through the 10th value, in this case.
What if you want every other value? c() stands for concatenate and is how you can give multiple values:
dat[c(1,3,5,7,9), 1:10] # This gives us the first 10 columns for rows 1,3,5,7,9.
If you don't include a value when indexing, you will get all values returned:
dat[ ,1:10] # This will return all rows, but only the first 10 columns.
Let's say we want to calculate the average inflammation for patient number 1 (aka: row 1).
First, let's just store the information for patient 1 to a variable:
patient_1 <- dat[1, ]
mean(patient_1)
We get an error if we try to run this because R is interpreting patient_1 as a data.frame. We can force R to recognize patient_1 as a bunch of numbers:
patient_1 <- as.numeric(dat[1, ])
mean(patient_1)
Other useful stats functions:
min() # find the minimum value
max() # find the maximum value
sd() # find the standard deviation of values
How can we figure out what the maximum inflammation value is across everyone on day 7?
max(dat[,7])
What if we want to calculate the mean inflammation across all individuals at once? We can use something called an "apply" statement to run a function across all rows or columns of some data at the same time.
apply(X = dat, MARGIN = 1, FUN = mean)
What about mean inflammation across all patients by day? You can use the opposite margin (columns):
apply(X = dat, MARGIN = 2, FUN = mean)
For apply statements, you first enter the data you want to process, then the margin (the rows or the columns you want to process), and then the function you want to run over those rows.
Just like in the shell, you can comment your code using "#". R will not interpret anything after the #.
Be sure to save your scripts as you're working on them! R scripts are saved with the .R ending.
R is a great tool for plotting and visualizing your data. Let's save the results of the apply statements we made above to use them later:
avg_patient_inflammation <- apply(X = dat, MARGIN = 1, FUN = mean)
avg_day_inflammation <- apply(X = dat, MARGIN = 2, FUN = mean)
Let's plot the average patient inflammation by day:
plot(avg_patient_inflammation) # This isn't actually very interesting, because the order of individuals in the data isn't meaningful
plot(avg_day_inflammation)
You can nest functions within other functions:
plot(apply(X = dat, MARGIN = 2, FUN = max))
plot(apply(X = dat, MARGIN = 2, FUN = min))
For required arguments in functions, you don't need to actually write out the argument name. You can list your arguments in order and R will automatically assign them to the required arguments in order:
apply(dat, 2, mean)
apply(X = dat, MARGIN = 2, FUN = mean) # These two lines do the same thing.
If you want to input arguments out of order, then you must specify the argument name:
apply(MARGIN = 2, X = dat, FUN = mean)
###########
Functions
Functions always take the following form:
NAME_OF_FUNCTION <- function(VARIABLE) {
DO_SOME_STUFF
}
fahr_to_kelvin <- function(temp) {
kelvin <- ((temp - 32) * (5/9)) + 273.15
return(kelvin)
}
To run the function:
fahr_to_kelvin(32)
fahr_to_kelvin(212)
Let's make another function that is kelvin to celsius
kelvin_to_celsius <- function(temp) {
celsius <- temp - 273.15
return(celsius)
}
kelvin_to_celsius(0)
How to convert fahrenheit to celsius incorporating the functions we've already written?
kelvin_to_celsius(fahr_to_kelvin(32)) # One way is to nest the two functions like this.
Challenge! Make a brand new function that takes in Fahrenheit and returns Celsius, using the two functions we've previously generated:
fahr_to_celsius <- function(temp) {
temp_k <- fahr_to_kelvin(temp)
temp_c <- kelvin_to_celsius(temp_k)
return(temp_c)
}
A note about functions: any variables that you create within a function disappear once the function is done running. This is good, because these temp variables won't muddy up your working space.
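A quick illustration of this scoping behavior (scope_demo is just a made-up name):
y <- 5
scope_demo <- function() {
y <- 10 # this y exists only inside the function
return(y)
}
scope_demo() # returns 10
y # still 5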
################
Let's move away from making functions for temperature conversions. Let's make a function that will center our data:
center <- function(data, desired) {
new_data <- (data - mean(data)) + desired
return(new_data)
}
Aside about saving workspace images when you exit R: It seems like you may want to save workspace images, which means when you open R the next time you'll get back everything you were working on today as it is. However, this isn't the best practice for reproducible research. You want all of your steps to be recorded in a script so you or anyone else can redo it. If you save the workspace image, all of your typos and mistakes are saved as well!
------ Break -------
Before the break we were making a function that will center data. Let's test out that function on a list of just zeros:
z <- c(0,0,0,0)
center(z, 3)
Let's run the center function on the 4th day inflammation values:
centered <- center(dat[,4], 0)
How do we know if it worked right? Let's compare some stats from the original data to the centered data:
min(dat[,4])
mean(dat[,4])
max(dat[,4])
min(centered)
mean(centered)
max(centered)
We can see that things are shifted to a mean of 0 and that the min and max values have shifted as well.
We know that if we only centered the data, the standard deviation should not have changed. Let's make sure they match:
sd(dat[,4])
sd(centered)
R only shows so many significant digits to the screen, but it stores many more than that. To compare you can do these types of checks:
sd(dat[,4]) - sd(centered) # this should equal 0
all.equal(sd(dat[,4]), sd(centered))
You should always do a few sanity checks like this when you write your own functions to make sure they are working the way you intended them to.
When you write a function, you probably want to include a description of the function and an example of how it's used in the function itself, similar to what we did when writing a script in the shell:
center <- function(data, desired) {
# return a new vector containing the original data (data)
# centered around the desired value (desired)
# Example center(c(1,2,3), 0) => c(-1,0,1)
new_data <- (data - mean(data)) + desired
return(new_data)
}
###########################
Let's write a function called analyze that takes in a file name and displays min, average, and max inflammation of all patients by day.
analyze <- function(filename) {
# This function takes a file name as an argument (should be a csv file) and displays graphs of the min, mean, and max for all patients for each day.
dat <- read.csv(filename, header=FALSE)
plot(apply(dat, 2, max))
plot(apply(dat, 2, min))
plot(apply(dat, 2, mean))
print("done")
}
analyze("data/inflammation-01.csv")
analyze("data/inflammation-02.csv")
Ta-da! Our function works on any file name that we give it.
Some guidelines for writing code, in the order that they matter:
1. It should work!
2. It should be readable to humans
3. It should be efficient
In general, it's not a good use of your time to spend optimizing your code to be as fast as possible if you don't have it working yet. Also, it's more important for you to be able to use your code later, so be sure to make it human readable.
For loops
best_practice <- c("Let", "the", "computer", "do", "the", "work")
best_practice[1]
print_words <- function(sentence) {
print(sentence[1])
print(sentence[2])
print(sentence[3])
print(sentence[4])
print(sentence[5])
print(sentence[6])
}
print_words(best_practice)
Is there a limitation in this function? Will it work for all sentences? [spoiler: nope!]
new_sentence <- c("this", "is", "my", "second", "sentence")
print_words(new_sentence)
Is this what we wanted? Nope, there's an NA.
What can we do to make this function more flexible, so that it works on sentences with variable length? Let's add a for loop to the function
print_words <- function(sentence) {
for (word in sentence) {
print(word)
}
}
print_words(new_sentence)
a_longer_sentence <- c("this", "is", "a", "very", "very", "long", "sentence", "but", "not", "really", "that", "long")
print_words(a_longer_sentence)
len <- 0
vowels <- c("a", "e", "i", "o", "u")
for (v in vowels) {
len <- len + 1
}
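After the loop finishes, len holds one count per vowel:
len # 5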
Count the even numbers:
mynumbers <- c(2,7,8,2,3,4,6,7,8,2,3,4)
evens <- 0
for (num in mynumbers) {
if (num %% 2 == 0) {
evens <- evens + 1
}
}
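%% is the modulo operator (remainder after division), so num %% 2 == 0 is TRUE for even numbers. After the loop:
evens # 8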
Now that we understand loops, let's go back to our analyze function. We want to run the analyze function on every data file that is sitting in our data folder. How can we do that automatically in R, without having to write out each file name?
First, get a list of files that sit in the directory:
list.files(path="data/")
This lists out every file in the directory. Let's only list out the files that say "inflammation" and save those into a variable:
filenames <- list.files(path="data/", pattern="inflammation", full.names=TRUE) # full.names = TRUE means that the path before the file names will be included
for (f in filenames) {
print(f)
analyze(f)
}
This prints out all the plots individually. That's a lot of scrolling. Let's modify the analyze function so that it displays the min, mean, and max plots for each file on the same page:
analyze <- function(filename) {
# This function takes a file name as an argument (should be a csv file) and displays graphs of the min, mean, and max for all patients for each day.
dat <- read.csv(filename, header=FALSE)
par(mfrow=c(3,1))
plot(apply(dat, 2, max))
plot(apply(dat, 2, min))
plot(apply(dat, 2, mean))
print("done")
}
Challenge!
Write a function called "analyze_all" that takes a filename pattern as one argument and the path to the folder holding the data to be analyzed as another and runs the function analyze for each file whose name matches the pattern
analyze_all <- function(folder, pattern) {
# Runs the analyze function for each file in the directory "folder" that matches the filename pattern "pattern".
filenames <- list.files(path=folder, pattern=pattern, full.names=TRUE)
for (f in filenames) {
analyze(f)
}
}
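With the function completed as above (and analyze already defined), a run might look like:
analyze_all(folder="data", pattern="inflammation")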
Day 2:
We'll be starting this morning with Git. Lessons: http://erdavenport.github.io/git-lessons/index.html
If you want to follow the commands as I type them: https://www.dropbox.com/s/mtn8ui0yp1s99yv/git_commands.txt?dl=0
Why do we want to use version control?
-Keep track of versions (avoid the final, final_forreal, final_really_really files)
-Collaboration (share code and manage and document who is editing what when and why)
-For the future you - keep track of changes (what seems obvious to you now, but not 6 months down the road)
Example:
A mission to Mars: Wolfman and Dracula need to work together to plan it
Before, they either worked on the project in sequence or always together; neither worked well
After you commit something to git, it's never lost; once you push to a remote like GitHub, it's also backed up in the cloud
git config --global user.name "NAME" # use your real name
git config --global user.email "EMAIL ADDRESS"
git config --global color.ui "auto"
git config --list
Git is going to track a folder on your computer - a convenient way to organize files for projects
In the shell:
Go to your home directory
cd
mkdir planets
cd planets
git init
ls
ls -a
# you'll see that git has created the hidden folder .git
# If you accidentally initialize a folder with git, you can just delete the hidden .git folder (but that will delete your entire version-control history)
# Once you have initialized a directory, git will version control all of its subdirectories too. It's generally a bad idea to nest git repositories (i.e. to run git init in a subdirectory of an existing git directory), because changes there are already tracked by the parent repository
git status
nano mars.txt
Type "Cold and dry, but everything is my favorite color"
NOTE: On Windows: can type "notepad.exe mars.txt" to start or edit a text file in notepad
# Check what's in the file
cat mars.txt
git status
# Output: "nothing added to commit but untracked files present (use "git add" to track)"
# We see a new file in the folder. Git is saying that it sees it, but it's not doing anything with it yet
git add mars.txt
git status
# Now we've told git to pay attention to this file, but we haven't actually told it to keep track of it yet
git commit -m "Start notes on Mars as base"
# If you don't include a message, git is going to ask what you changed
# Git will only save changes when you commit (different from autosave in Word, where you can undo every change)
# Each commit gets a unique barcode
# What happens is that git stores the changes in the hidden .git directory. You won't see any new files when you type ls, only a single version of your file, but you can access all older versions through the hidden .git folder
What is a good commit message?:
Short, so you can quickly read through (50 characters or so)
Be specific and informative (don't just say "fix typos")
Why separate steps for add and commit?:
Sometimes you're making changes to multiple different files at once and want to track a change in one before you're finished editing the other
Typically you would commit immediately after adding
# Type more text in our mars.txt
nano mars.txt
"The two moons may be a program for Wolfman"
cat mars.txt
git status
git add mars.txt
# This stages the file to prepare for committing the change
git diff --staged
# Compare our staged version to the already committed version
# The output is a little cryptic, but there are different GUIs you can add on to get an output that is easier to interpret
git commit -m "Add converns about the effects of Mars' moons on Wolfman"
nano mars.txt
"But the mummy will appreciate the lack of humidity"
cat mars.txt
git diff
git commit -m "added notes about humidity"
# This does not commit anything because the modified file hasn't been staged
git add mars.txt
git status
git commit -m "added notes about humidity"
# General rule: commit early and often (every time you think you might want to go back to a previous version)
Compare to specific older versions of commits:
git diff HEAD~1 mars.txt
git diff HEAD~2 mars.txt
But when you have lots of commits, it may be more helpful to look at the commit history (shows you what changes have been made, when, and by whom):
git log
Can use the unique git commit identifier (seen in the commit log history) to compare to a specific previous version (you don't need to type the entire identifier, you can just put in the first characters)
e.g.
git diff 89b0e779 mars.txt
git diff 89b0be77 mars.txt
To look at the most recent versions, you can use the diff HEAD command, for changes made longer ago, the history is more useful
What if we accidentally overwrite our file?
nano mars.txt
Delete everything in the file and type instead:
"We will need to manufacture our own oxygen"
# Pull back the last committed version
git checkout HEAD mars.txt
[Now we hadn't saved the changes we had made under a different name, so we lost those changes. If we had wanted to keep them as another document, we should have saved and committed them before doing the checkout]
In your file system you will only see the version that you currently have checked out
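You can also check out a specific older version by its commit identifier (the hash shown here is hypothetical), then return to the latest commit:
git checkout f22b25e mars.txt
git checkout HEAD mars.txt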
####### BREAK
If you want to clear off your Terminal screen, you can type "clear"
Git is best for tracking simple text files, not binary files like images etc.
Create a series of files:
touch a.dat b.dat c.dat
mkdir results
touch results/a.dat results/b.dat
git status
Create an ignore file
nano .gitignore
Type these two lines (the first ignores all files with a .dat extension, the second everything in the results folder):
*.dat
results/
git status
# Now you only see the .gitignore file. We want to commit it so other users can also see it
git add .gitignore
git commit -m "added gitignore file"
-------------------------------
How to use Github to sync files between your computers
Github is free as long as you keep your repositories public
Bitbucket is very similar to Github, but allows private repositories for free
Go to Github.com
Create an account or sign in
Create a new repository
Give it the same name as the git directory you just created in your home folder (planets)
Type into the shell:
git remote add origin https://github.com/erdavenport/planets.git
git remote -v
Using Github is not like using Dropbox: transfers of files don't happen automatically, you have to push them
git push origin master
Now you can see the files you committed on your computer in your Github web interface
Next we'll be pulling files from the internet to our computer
git pull origin master
Now we'll simulate what it's like to collaborate with someone (or yourself across multiple computers)
cd /tmp
git clone https://github.com/erdavenport/planets.git
ls
cd planets
nano pluto.txt
Add this file to the repository
git add pluto.txt
git commit -m "Some notes about Pluto"
git status
Now we want to push this up to our remote repository
git push origin master
(will ask for user name and password)
Now we have the Pluto file in our tmp directory and our online directory. Now we want to pull it to our planets folder in our home directory
cd ~/planets/
ls
pwd
git pull origin master
But what happens if two people have made changes to the same file and both push them onto the master?
Let's have a look
In the home directory type
nano mars.txt
"This line was added in the home directory"
git add mars.txt
git commit -m "adding a line in our home directory"
git push origin master
Now make a change in the copy in the temporary folder, commit those changes and push to the remote
cd /tmp/planets/
nano mars.txt
"We added a different line to the temporary copy"
git add mars.txt
git commit -m "adding a line in our home directory"
git push orgin master
These changes were rejected because they weren't made on top of the most recent version online
You therefore first need to pull the current version and merge
git pull origin master
You will get an error message telling you that git doesn't know which of your changes you would like to keep and that you need to resolve the conflict
Open the file with nano and you will see the two conflicting changes
nano mars.txt
You can edit the file to remove the conflict, then stage and commit the merged version
git add mars.txt
git commit -m "Merging changes from github"
git push origin master
Now our temporary directory has the merged version, as does the online repo, but we still have an older version in our home directory
cd ~/planets
git pull origin master
Clone the repository created by your instructor. Add a new file to it, and modify an existing file (your instructor will tell you which one). When asked by your instructor, pull her changes from the repository to create a conflict, then resolve it.
################# R lessons Friday Afternoon ####################
Lessons for today: http://swcarpentry.github.io/r-novice-inflammation/
- We're going to do "making choices"
Get the data here
http://swcarpentry.github.io/r-novice-inflammation/setup/
analyze <- function(filename) {
# Plots the average, min, and max inflammation over time.
# Input is character string of a csv file.
dat <- read.csv(file = filename, header = FALSE)
avg_day_inflammation <- apply(dat, 2, mean)
plot(avg_day_inflammation)
max_day_inflammation <- apply(dat, 2, max)
plot(max_day_inflammation)
min_day_inflammation <- apply(dat, 2, min)
plot(min_day_inflammation)
}
analyze_all <- function(pattern) {
# Runs the function analyze for each file in the current working directory
# that contains the given pattern.
filenames <- list.files(path = "data", pattern = pattern, full.names = TRUE)
for (f in filenames) {
analyze(f)
}
}
To get started, let's make a new project in RStudio. Go to File -> New Project. Navigate to the r-novice-inflammation folder and set that as your R project folder.
Within an R project, it will automatically set your directories to your project folder, which is handy when you're coming back to something later.
How can we save out plots to files?
pdf("simpleplot.pdf") # This opens a connection to your file system
plot(1:10, 1:10) # This will plot a scatter plot
abline(h=6)
dev.off() # This closes the connection to the file system. You have to run this or you won't be able to open your files.
Anything you plot between pdf() and dev.off() will be written to the file, so you can combine multiple plots into one document or add things to one plot (a trendline, for example).
pdf() will overwrite an existing file with the same name.
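For example, both plots below end up in the same PDF (this assumes the avg/max variables from the apply statements earlier are still defined):
pdf("two-plots.pdf")
plot(avg_day_inflammation)
plot(max_day_inflammation)
dev.off()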
# Side lesson on conditionals
num <- 37
if (num > 100) {
print("greater")
} else {
print("not greater")
}
num <- 150
if (num > 100) {
print("greater")
} else {
print("not greater")
}
You don't always have to include an else statement in a conditional. For instance:
if (num > 100) {
print("greater")
}
That code will print "greater" if the number is greater than 100, and print nothing otherwise.
You can continue to make choices using else if statements:
if (num > 100) {
print("greater")
} else if (num < 50) {
print("less than 50")
} else {
print("50<=x<=100")
}
Let's write a general function that will test whether a number is positive, negative, or zero.
sign <- function(num) {
if (num > 0) {
return(1)
} else if (num == 0) {
return(0)
} else {
return(-1)
}
}
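Trying it out (note this masks R's built-in sign() function, which behaves the same way for these inputs):
sign(3) # 1
sign(0) # 0
sign(-3) # -1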
You can also combine tests using "&", which means AND
if (1 > 0 & -1 > 0) {
print("both parts are true")
} else {
print("at least one of the parts is not true")
}
We can also combine tests using "|", which means OR
if (1 > 0 | -1 > 0) {
print("at least one of these is true")
} else {
print("neither part is true")
}
Let's write a function plot_dist that plots a boxplot if the length of the vector is greater than a threshold that you specify. It will plot a stripchart otherwise.
To demonstrate boxplots and stripcharts, let's read in data from inflammation-01:
dat <- read.csv("data/inflammation-01.csv", header=FALSE)
boxplot(dat[,10])
stripchart(dat[,10])
plot_dist <- function(x, threshold) {
if (length(x) > threshold) {
boxplot(x)
} else {
stripchart(x)
}
}
mydat <- c(4,7,8,3,5,6,7,8,9,2,23,4,6,7,7,8,3,5,6,7)
length(mydat)
plot_dist(mydat, 10)
plot_dist(mydat, 50)
What if you want to plot a histogram?
hist(dat[,10])
Let's edit our function to add an argument "use_boxplot" (TRUE by default); if use_boxplot=FALSE then we plot a histogram instead:
plot_dist <- function(x, threshold, use_boxplot=TRUE) {
if (length(x) > threshold & use_boxplot) {
boxplot(x)
} else if (length(x) > threshold & !use_boxplot) {
hist(x)
} else {
stripchart(x)
}
}
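Trying the new version with mydat from above (length 20):
plot_dist(mydat, 10) # boxplot: length > 10 and use_boxplot is TRUE
plot_dist(mydat, 10, use_boxplot=FALSE) # histogram: length > 10 but use_boxplot is FALSE
plot_dist(mydat, 50) # stripchart: length is not > 50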
analyze <- function(filename, output=NULL) {
# Plots the average, min, and max inflammation over time.
# Input is character string of a csv file.
if (!is.null(output)) {
pdf(output)
}
dat <- read.csv(file = filename, header = FALSE)
avg_day_inflammation <- apply(dat, 2, mean)
plot(avg_day_inflammation)
max_day_inflammation <- apply(dat, 2, max)
plot(max_day_inflammation)
min_day_inflammation <- apply(dat, 2, min)
plot(min_day_inflammation)
if(!is.null(output)) {
dev.off()
}
}
analyze("data/inflammation-01.csv", output="inflammation-01.pdf")
analyze("data/inflammation-02.csv", output="inflammation-02.pdf")
A tip on best practice: it's best not to save results and data in the same folder. Let's create a new folder from R called "results":
dir.create("results")
analyze("data/inflammation-01.csv", output="results/inflammation-01.pdf")
This is great, but we still need to type the name of each file.
We want to have all output files have the same name as input files, but with the pdf extension.
We can use sub to replace characters in strings
sub(pattern="csv", replacement="pdf", "inflammation-01.csv")
f <- "inflammation-01.pdf"
sub(pattern="csv", replacement="pdf", f)
file.path("results", sub("csv", "pdf", f))
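With f set to "inflammation-01.csv", that last call returns:
[1] "results/inflammation-01.pdf"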
analyze_all <- function(pattern) {
# Directory names for the data and results
data_dir <- "data"
results_dir <- "results"
# Runs the function analyze for each file in the current working directory
# that contains the given pattern.
filenames <- list.files(path = data_dir, pattern = pattern, full.names = FALSE)
for (f in filenames) {
pdf_name <- file.path(results_dir, sub("csv", "pdf", f))
analyze(file.path(data_dir, f), output=pdf_name)
}
}
analyze_all("inflammation")
That's cool, because it allows you to process all of the data in your folder in one command.
One issue with that function though, is that the data and results paths are set in the function. Let's make the function more generalizable so that the user can input their own data and results directories:
analyze_all <- function(pattern, data_dir, results_dir) {
# Runs the function analyze for each file in the current working directory
# that contains the given pattern.
filenames <- list.files(path = data_dir, pattern = pattern, full.names = FALSE)
for (f in filenames) {
pdf_name <- file.path(results_dir, sub("csv", "pdf", f))
analyze(file.path(data_dir, f), output=pdf_name)
}
}
analyze_all(pattern="inflammation", data_dir="data", results_dir="results")
Break! Be back by 2:45
Creating dynamic documents using RMarkdown and KnitR
Since we're starting something new, let's clear out our workspace. We can do this by quitting RStudio, or by clicking the little broom icon in the Environment tab. Additionally, you can type rm() in the console, putting the variable you want to drop inside the parentheses. To remove everything, type rm(list=ls()).
To create an RMarkdown report, click new file -> RMarkdown. You may have to install some packages if you don't have some packages loaded.
If you need to install packages, go to the Packages tab and install the rmarkdown package.
Once that's done, go to File -> New File -> R Markdown. Fill in the title and author but leave it set to "Document"
R Markdown documents contain two things: R code "chunks", where you type R code that R will evaluate, and text, formatted with a formatting language called markdown. By combining R chunks and markdown, you can combine descriptions and text with the output of your actual code, and write full reports that fully integrate your analysis and results.
If you want a brief cheat sheet for writing in markdown: https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet
R best practices:
- If you update a function, delete any intermediate functions that you won't want to save later. Your scripts should be exactly the code you will use to analyze your data.
- Start your scripts with a description of what the code is for, who wrote it, and when you wrote it
- First thing, include what libraries you are going to load (for instance, rmarkdown) at the top of your script. Then people reading your script will see those libraries right away and know to download them.
- Include your sessionInfo() every time you run a script. This will let you know what version of R, what type of computer, and what versions of packages were run when the script was run. This is helpful if you need to recreate results later and can't. Sometimes there are version conflicts or operating system conflicts.
- Helpful tip in RStudio: Ctrl+Shift+C will put a "#" in front of every highlighted line, turning it into one big comment.
- Another helpful RStudio tip: if you have a comment in your script to start a new section, you can add a string of dashes after the word and it'll then be collapsable (eg. # My analyze function ----)
- If you're going to reuse a function over and over in different scripts, save your functions in separate files. You can then "source" these files into R, which loads those functions and saves you copying and pasting them into every script that uses them. If you update the function, you only need to update it in one place, rather than going file by file (see the example after this list).
- There are also guidelines for styles of how to write code. You can use either the Google R style guide (https://google.github.io/styleguide/Rguide.xml) or Hadley Wickham's style guide (http://adv-r.had.co.nz/Style.html). The most important thing is to be consistent.
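A minimal sketch of sourcing (analysis_functions.R is a hypothetical file name):
source("analysis_functions.R") # defines every function saved in that file in your current session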
R data types:
- Several types of data: character (words), integer (1,2,3,4), numeric (anything with a decimal), logical (TRUE or FALSE)
- You can ask what type of object a variable is - class(object)
- You can reassign class types. To make a number a character: as.character(4) will return "4" (in quotes)
- A vector is a series of single objects in a row. You can use the c() command to create a vector: myvector <- c(3,4,6,7,8,9)
- Can also find the class of a vector: class(myvector)
- We can sort characters or numbers. Keep in mind that when you sort numbers that are stored as characters, things may look out of order to you. All of the numbers starting with 1 will come first (eg: 1,10,11,12,2,20). Convert to number if you want to sort it numerically (1,2,10,11,12,20)
- If you want to add new values on to a vector, you can concatenate using c() c(myvector, 3)
- Missing data. R has a special class just for missing data encoded by NA. If you want to test whether a variable is missing data or not, you can type is.na(myvector) and that will return true or false depending on if something is missing or not.
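For example:
is.na(c(1, NA, 3)) # FALSE TRUE FALSE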
- A vector is a one-dimensional data type (think of it like a list).
- A matrix is one type of two-dimensional data.
- To make a new matrix:
m <- matrix(1:15, nrow=3, ncol=5)
- Another two-dimensional object type is a data.frame. Matrices can only accept one type of value (so all entries must be numeric or character, you can't mix the two). Data.frames can have a mix of both numbers and character strings.
- Factor is another type of data. A factor is a categorical variable.
- To access a single column in a data.frame, you can type the name of the data.frame, a dollar sign, and the name of the column:
carspeeds$Color # will pull the Color column
Online LaTeX equation editor (handy for equations in R Markdown): https://www.codecogs.com/latex/eqneditor.php
Notes for running R from command line or Gitbash in Windows
(these work for the Mann Lib computers)
First we need to find the location for a file called "Rscript.exe"
On Mann computers it is C:\Program Files\R\R-3.3.1\bin\
Click on the Windows icon in the lower left-hand corner, right-click on "Computer" and choose Properties
Click "Advanced system settings"
In the lower right-hand corner, click the button for "Environment Variables"
- If there is already a variable called PATH, click edit
- If there is not a PATH variable, click New, and write "PATH" in the box for Variable name
In the box for Variable value, scroll to the end and enter the path we found above, without quotes and separated from any existing entries by a semicolon: C:\Program Files\R\R-3.3.1\bin\
Click OK a bunch to get out of the settings dialogues.
Get to your folder with an R script (something.R)
- either through the command line window
- or by navigating to the folder, right-clicking, and choosing "Git Bash Here". This will open a Git Bash window already navigated to this folder
Run the script by typing at the prompt:
Rscript session-info.R