Welcome to the Software Carpentry at PSU
June 22-23, 2015
Instructors: Brad Taber-Thomas and Emily Davenport
Helpers:
------------------------------------------------------------

Links:
Bootcamp webpage: http://swcarpentry.github.io/matlab-novice-inflammation/
Bootcamp repository: https://github.com/erdavenport/2015-06-22-psu
Shell history: https://www.dropbox.com/s/5cywdtvhvwhj3z4/history.txt?dl=0 # This has been removed, but see Emily's terminal output below:
Emily's terminal output: https://www.dropbox.com/s/cl9ibo20awl6j4z/shell_terminal_output.txt?dl=0

git history: https://www.dropbox.com/s/5cywdtvhvwhj3z4/history.txt?dl=0 # This has also been moved, it is now located here:
https://www.dropbox.com/s/c94htqqgs4se5hu/git_history.txt?dl=0

The plan: 
Day 1:
Day 2:
Further reading:
Shell:
Set up:
Lesson Notes: 

References:

http://software-carpentry.org/v5/novice/ref/01-shell.html
http://swcarpentry.github.io/matlab-novice-inflammation/


When naming files with dates, put the year first, then the month, then the day (YYYY/MM/DD order)
but no slashes, right? <--- Correct, don't use slashes in file names (because the shell will interpret those as folder names). Write YYYY-MM-DD or YYYYMMDD instead.
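
If you'd rather let the shell build the date for you, here's a minimal sketch. The "notes_" prefix and the .txt extension are made up for illustration; use whatever name fits your project.

```shell
# Build a date-stamped file name (the "notes_" prefix is just an example).
stamp=$(date +%Y-%m-%d)          # year-month-day, e.g. 2015-06-22, no slashes
file_name="notes_${stamp}.txt"
echo "$file_name"
```

Because the year comes first, ls will list files named this way in date order.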

(The following code is for GitBash users only)
To color your files in windows enter:
cd
ls -a
nano .bash_profile

in new line enter:
LS_COLORS='di=1:fi=0:ln=31:pi=5:so=5:bd=5:cd=5:or=31:mi=0:ex=35:*.rpm=90'
export LS_COLORS
alias ls='ls -F --color --show-control-chars'
Save
Exit text file
Restart GitBash

For terminal users
cd
ls -a
nano .bash_profile
add new line with:
alias ls='ls -G'     <-- adds color only
alias ls='ls -FG'    <-- adds color and folder slash

save and restart terminal



Matlab
Go to the bootcamp webpage in Hammer: http://erdavenport.github.io/2015-06-22-psu/
Under the Matlab section you'll see a link to the data: http://erdavenport.github.io/2015-06-22-psu/matlab_data.zip
Save that into your home folder (if you're downloading it via firefox, it'll automatically download into your "Downloads" folder. In your terminal, cd to your home directory, then "mv Downloads/matlab_data.zip ~") <- relative path!! shell makes an entrance into our matlab lesson

Open a terminal. 
cd work
ls
mkdir swc_workshop
cd swc_workshop
ls
cd ../..
ls

Is your data file there? If not, move it to your home directory. 
Make sure you're in the same directory, then unzip:

unzip matlab_data.zip

If you see a matlab_data folder now, you're good to go. 
Challenge! Move the matlab_data folder into the swc_workshop folder (which is in the work directory)
rm matlab_data.zip  # (we don't need it anymore)
cd work/swc_workshop
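
One possible answer to the challenge, sketched in a throwaway sandbox folder so it can be run anywhere without touching your real files. The sandbox and the empty placeholder file are assumptions for illustration; in class you would only need the mv line, run from your home directory.

```shell
# Set up a pretend home directory with the same layout as in class.
sandbox=$(mktemp -d)
mkdir -p "$sandbox/work/swc_workshop" "$sandbox/matlab_data"
touch "$sandbox/matlab_data/inflammation-01.csv"   # empty placeholder file

# The actual answer: move the folder using relative paths.
(
    cd "$sandbox"
    mv matlab_data work/swc_workshop
)

moved=$(ls "$sandbox/work/swc_workshop")
echo "$moved"
rm -rf "$sandbox"                                  # clean up the sandbox
```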

Load a module:
module avail
module avail matlab
module load matlab
module list

Each time you open a new terminal window, you'll need to load matlab.

Why matlab??
- Operates on large matrices well
- Default for neuroscience and other fields for both data collection and presentations
- Drawbacks: $$$$ 

matlab
(matlab should've opened up)

Go to Desktop in the menu bar and uncheck everything but "Command window"
question mark button up at the top will open up the documentation/help files (or go to "Help" on the menu -> "Product Help")

Go back to the terminal window. Let's glance through the data we're going to work with. 
matlab

We'll need to open a new terminal tab to actually navigate around the shell. 
cd matlab_data
ls
(should see inflammation files)
Each row is a subject, each column is a different timepoint
head inflammation-01.csv
(should see a bunch of numbers, comma separated)
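
You can also count the rows and columns from the shell before loading anything into matlab. A small sketch on a made-up toy file (3 subjects, 4 timepoints), since the numbers here are not the real inflammation data:

```shell
# Toy CSV standing in for an inflammation file (values are invented).
csv=$(mktemp)
printf '0,1,2,3\n0,2,1,4\n0,1,3,2\n' > "$csv"

head -n 2 "$csv"                                             # peek at the first two rows
rows=$(wc -l < "$csv" | tr -d ' ')                           # one line per subject
cols=$(head -n 1 "$csv" | tr ',' '\n' | wc -l | tr -d ' ')   # one field per timepoint
echo "$rows rows (subjects), $cols columns (timepoints)"
rm "$csv"
```

Swap in inflammation-01.csv for the toy file to check the real data.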

Go to that other tab that is matlab, we're going to read in the files. 
Challenge! Go through the help documentation and try to find a function that will read files that are comma separated?
Let's use csvread(filename), where filename is the name of the file we want to read into matlab
In the matlab command window:
Go to File -> Set Path -> Add with Subfolders -> (add the path to the swc_workshop folder)
You should see it add two paths to that list of paths. Close out (don't need to save for the future)

Let's load one of our files:
csvread('inflammation-01.csv')
You should see a big matrix printed to your screen; however, it isn't saved in matlab.
clc
clc will clear the screen of all the numbers (make it nice and pretty)

csvread is a function
functions take parameters (in our case, the file name)

Let's save the data into matlab so that we can work with it in the program:
patient_data = csvread('inflammation-01.csv');
The semi-colon suppresses any output to the screen. Allows you to run things without a ton of numbers being printed to the screen. Use the semi-colon!

Now, the variable 'patient_data' contains the contents of the array. DISPlay the variable
disp(patient_data)

Variables:
- must start with a LETTER, but can contain numbers and underscores.

weight_kg = 55;

Go up to Desktop, click on workspace. This will show you what variables you have stored in your current matlab session.

weight_lb = 2.2*weight_kg;

disp(['Weight in pounds: ', num2str(weight_lb)]);
num2str(weight_lb)
weight_lb

Variables:
- Variables in programming are just like variables in math class
- In math, you can assign any number to x in an equation (y = mx +b, for instance, for the slope of a line) to get an answer (say slope is 2 and intercept is 1: y = 2x + 1), You can enter any number for x and get y.

disp(['Weight in pounds: ', num2str(weight_lb)]);
Challenge! Change the previous command to say 'Weight in kg' and then list out weight in kg
disp(['Weight in kg: ', num2str(weight_kg)]);
weight_kg = 100
disp(['Weight in pounds: ', num2str(weight_lb)]);
Ooops, haven't updated the weight_lb variable! Let's rerun that:
weight_lb = 2.2 * weight_kg;
disp(['Weight in pounds: ', num2str(weight_lb)]);
Yay! It's updated!

who
This lists what variables you have in the workspace
We don't need the weight variables, so let's get rid of those:
clear weight_lb
clear weight_kg

The ans variable holds the result of the last command whose output you didn't assign to a variable.
ans gets overwritten every time you run another command like that.

clear all (would get rid of everything, but let's not do that!)

Exercise! http://swcarpentry.github.io/matlab-novice-inflammation/01-intro.html 1/4 the way down the page, do the Predicting Variable Values Challenge (it's green)
Answers:
mass = 95
age = 102

clear age mass
who
whos
whos will tell you a bit more info than who (size of each variable, the class of the variable, etc)

How much data do we have?
size(patient_data)
This displays 60 40 (these are rows and columns - always in that order!)

All data in matlab is stored as an array. If you have a list of numbers (1,2,3,4), that's a 1 dimensional array called a vector. 

class(patient_data)
class tells us what the data type is. patient_data is of class double: numbers that are allowed to have decimal points. If there aren't decimal points, you can use an integer type (whole numbers). If you store a number with decimals as an integer, matlab rounds it to the nearest whole number.

Sometimes we might want to make a toy dataset to play with:
M = magic(8)

This created an 8x8 array, saved as M, where all of the columns, rows, and both main diagonals add up to the same thing.

Indexing:

- What if we want just part of an array?
M(5,6)
This will have us grab the 5th row, 6th column data point. 

- What if we want all of row 5?
M(5,:)
The colon tells matlab to grab all of the columns
 
 - What about all of column 6?
 M(:,6)
 
 - What about rows 1 - 4 and everything in between?
 M(1:4)
 Ooops, that returns just the first four elements (counting down the first column). Which columns do we want?
 M(1:4, :)
 This gives us rows 1-4 and all columns. 

- What if you want all rows, and then every column after the 6th column?
M(:, 6:end)
The end designates the last column 

- What if we want to skip rows?
M(2:3:end, :)
We're starting at the 2nd row and taking every 3rd row until the last row.

- What about skipping rows and columns?
M(2:3:end, 2:2:end)

We can "slice" with numbers, but we can also slice with characters (or text)
element = 'oxygen';

Challenge! "Slicing" green box on this page: http://swcarpentry.github.io/matlab-novice-inflammation/01-intro.html
Answers:
1. gen, oye, xyge
2. element(:): You're telling it to return every element of the array. Each are put on their own line. Showing the contents of a variable is not the same as indexing into a variable. 

Back from coffee break. 
We're about 2/3 through the lesson if you're following along. 
http://swcarpentry.github.io/matlab-novice-inflammation/01-intro.html

Let's start analyzing the patient data. Let's find the average of all the datapoints in the data set:
mean(patient_data(:))

Neat tip: if you put your cursor next to a parenthesis, an underline will show up under the matching parenthesis
mean(patient_data) 
Will give you the mean of each column of the data

Display the maximum data point:
disp(['Max inflammation: ', num2str(max(patient_data(:)))])

Display the minimum data point:
disp(['Min inflammation: ', num2str(min(patient_data(:)))])

Display the standard deviation of all the points:
disp(['Standard deviation: ', num2str(std(patient_data(:)))])

Open up the command history tab, highlight the last three disps, and press F9 to run them again:
Mac - Function F9
PC - F9

Let's pull all data for patient 1
patient_1 = patient_data(1, :);
disp(['Max inflammation: ', num2str(max(patient_1))])
max(patient_data(1,:))

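What if we want the mean across all subjects at each timepoint?
mean(patient_data, 1)
We are passing two parameters here: the first is the data, the second is the dimension to average along (1 averages down the rows and gives one mean per column; 2 averages across the columns and gives one mean per row)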

What's the size of the data that was output by the last command above?
size(mean(patient_data, 1))

One thing matlab is really good at is visualizing data. 
imagesc(patient_data)
This takes a matrix, and shows it to you in color (so you see the higher values have a different color than the lower values)

Let's calculate the means at each time point and plot that:
ave_inflammation = mean(patient_data, 1);
plot(ave_inflammation)
disp(ave_inflammation)
plot(max(patient_data, [], 1));

Why the extra brackets [] in that max function? max has another mode: if you give it two equally sized arrays, it outputs a new array holding, for each element, the larger of the two original values. The empty [] fills that second slot so that the 1 is read as the dimension.

Challenge! The green Plots box on this page: http://swcarpentry.github.io/matlab-novice-inflammation/01-intro.html
Answers:
1. There is no patient 0 or day 0. The lines are slightly slanted because they're jumping from day to day.
2. plot(std(patient_data, 0, 1));

How can we do 2 plots side by side?
subplot(1, 2, 1)
subplot sets up the plotting space, but doesn't actually plot anything
plot(max(patient_data, [], 1));
subplot(1, 2, 2)
plot(min(patient_data, [], 1));
subplot(2,2,1)
The first two parameters of subplot stand for the number of rows and the number of columns of plots; the third picks which panel to draw in

If you're following along with the notes, we're now going to make scripts:
http://swcarpentry.github.io/matlab-novice-inflammation/02-scripts.html

Go to Desktop in the top bar and click Editor. The scripts you save have a .m extension, but they're just text files.
Click the little page in the upper left corner to open up a new script. 
% signifies that the text following it is just a comment. Matlab will ignore anything you write after that.
Comment your code a lot!! Future you will forget what you were trying to do. 
Some comments to consider putting at the top of your script:

note: use rm to remove folders in MATLAB on Hammer --> goes to home file when you just delete using the 'Desktop' Tab, but it's hidden and takes up unnecessary space



%%%%%%%%%%%% This should all be in your script %%%%%%%%%%%%
% script analyze.m for inflammation patient data
% bct3 wrote this script 
patient_data = csvread('inflammation-01.csv');
disp(['Analyzing inflammation-01.csv:'])
disp(['Max inflammation: ', num2str(max(patient_data(:)))])
disp(['Min inflammation: ', num2str(min(patient_data(:)))])
disp(['Standard Deviation inflammation: ', num2str(std(patient_data(:)))])

ave_inflammation = mean(patient_data, 1);
plot(ave_inflammation)
ylabel('average')

print -dpng 'average.png'

subplot(1,2,1)
plot(max(patient_data, [], 1));
ylabel('max')

subplot(1,2,2)
plot(min(patient_data, [], 1));
ylabel('min')

print -dpng 'patient_data-01.png'
%%%%%%%%%%%% End of what is in script %%%%%%%%%%%%%%%%%%


Matlab (Day 2)
Loops: http://swcarpentry.github.io/matlab-novice-inflammation/03-loops.html

Open a terminal
mmlsquota
This command will tell you your storage quotas for different spaces on the cluster.
2 numbers to pay attention to are the "size" and the "limit". If your size is getting close to the limit, you may start to have problems running matlab.

Go to Applications (top left), Accessories, right click on terminal and it'll add the launcher to desktop. 

Right click on the terminal icon, go to Properties, go to the Launcher tab. Under command it should say gnome-terminal --working-directory=work/swc_workshop

You can set up multiple terminal launchers with different paths (if you're working on multiple projects)

In the terminal, load the matlab module: module load matlab
Then type matlab to open up matlab
It should open up to the script that you were making anyway. 

Make sure to add the data folder to the path
Try running the first line of the script:
patient_data = csvread('inflammation-01.csv');

whos
Make sure the patient_data variable is in there

clear all

Run the whole script. 
analyze

You should get output that displays file name, max, min, sd, and the figure

We're going to make a change to the script: We only want it to output one figure rather than two figures. 
%%%%%%%%%%%% This should all be in your script %%%%%%%%%%%%
% script analyze.m for inflammation patient data
% bct3 wrote this script 
patient_data = csvread('inflammation-01.csv');
disp(['Analyzing inflammation-01.csv:'])
disp(['Max inflammation: ', num2str(max(patient_data(:)))])
disp(['Min inflammation: ', num2str(min(patient_data(:)))])
disp(['Standard Deviation inflammation: ', num2str(std(patient_data(:)))])

subplot(1,3,1)
ave_inflammation = mean(patient_data, 1);
plot(ave_inflammation)
ylabel('average')

subplot(1,3,2)
plot(max(patient_data, [], 1));
ylabel('max')

subplot(1,3,3)
plot(min(patient_data, [], 1));
ylabel('min')

print -dpng 'patient_data-01.png'
%%%%%%%%%%%% End of what is in script %%%%%%%%%%%%%%%%%%

That's great, but we only analyzed one data file. Let's write some loops to analyze everything. 

Open a new script. 

word = 'brain'

Let's say we wanted to print one letter at a time:
disp(word(1))
disp(word(2))
disp(word(3))

Ugh. Boring. I'd rather automate the code so that it prints out each letter. That way, it doesn't matter how long the word is, or if we add letters to the end of the word later. 

word = 'ofc'
for letter = 1:4             % the 1:4 just stands for 1,2,3,4
    disp(word(letter))    % we use the value of letter to index (letter will equal 1,2,3,4 as the loop runs)
end

Darn, this still isn't as flexible as we'd like it. If the word is only 3 letters, when it gets to the fourth iteration of the loop it gives us an error. 

for letter = 1:length(word)    % length will find the size of the variable we're looping over. 
    disp(word(letter))
end

Great, by using length(), we now have a flexible loop that will adjust to the word we give it. 

Sometimes we'll want to add a counter inside a loop:

word = 'brain';
len = 0;
for letter = 1:length(word)
    len = len + 1;
    disp(word(letter))
end

How did len get to five? Each time through the loop we added 1 to the value of len. So, for each iteration of the loop it looks like this:
1 = 0 + 1
2 = 1 + 1
3 = 2 + 1
4 = 3 + 1
5 = 4 + 1

Challenge! Incrementing with loops (green box in the notes about the word aluminum).
Answer:
word = 'aluminum';
len = 0;
for letter = 1:length(word)
    len = len + 1;
    disp(word(1:len));
end

Striding:
disp(1:3:11)

Can go backwards too:
disp(11:-3:1)

Challenge! Display the letters of 'brain' backwards, one per line. 
Answer:
for letter = length(word):-1:1
    disp(word(letter));
end

Great, we now know the basic structure of a loop in matlab. Let's open a new script and start to loop over our inflammation files:

for idx = 1:12
    file_name = sprintf('inflammation-%d.csv', idx);    % the %d stands for a whole number; here it gets filled in with the looping variable idx
    disp(file_name)
end

Ok, that's close, but we need to pad the single digit numbers with a leading zero. We can add the '02' right before the d, which tells it that we want two digits, padded with a leading zero.

for idx = 1:12
    file_name = sprintf('inflammation-%02d.csv', idx);    % the %02d pads idx to two digits with a leading zero
    disp(file_name)
end
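
Side note: the shell's printf understands the same %02d format code, so you can preview the expected file names from a terminal tab. This is a shell aside, not part of the matlab script, and the loop bounds simply mirror the 1:12 used here.

```shell
# Print inflammation-01.csv through inflammation-12.csv from the shell.
idx=1
names=""
while [ "$idx" -le 12 ]; do
    name=$(printf 'inflammation-%02d.csv' "$idx")   # %02d: two digits, zero padded
    echo "$name"
    names="$names $name"
    idx=$((idx + 1))
done
```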

Let's go back to our analyze.m script and update it:
Highlight all your code in the script and hit tab. We're going to encase all of that code in a for loop:

%%%%%%%%%%%% This should all be in your script %%%%%%%%%%%%
% script analyze.m for inflammation patient data
% bct3 wrote this script 

for idx = 1:12
    file_name = sprintf('inflammation-%02d.csv', idx);
    img_name = sprintf('patient_data-%02d.png', idx);
    
    patient_data = csvread(file_name);
    
    disp(['Analyzing ', file_name, ':'])
    disp(['Max inflammation: ', num2str(max(patient_data(:)))])
    disp(['Min inflammation: ', num2str(min(patient_data(:)))])
    disp(['Standard Deviation inflammation: ', num2str(std(patient_data(:)))])

    subplot(1,3,1)
    ave_inflammation = mean(patient_data, 1);
    plot(ave_inflammation)
    ylabel('average')

    subplot(1,3,2)
    plot(max(patient_data, [], 1));
    ylabel('max')

    subplot(1,3,3)
    plot(min(patient_data, [], 1));
    ylabel('min')

    print('-dpng', img_name)
end
%%%%%%%%%%%% End of what is in script %%%%%%%%%%%%%%%%%%

gnome-open is useful on this cluster to open up the files


So, our script is pretty good, but it still isn't flexible to how many files are in the folder. What if we add more patient data? We'll need to re-write the script. 
Navigate to the folder where the data are stored. 

We've added the files variable below. This is a new variable type: struct. It's a variable that can have other variables in it. 
files.name will show all of the names of the files. 
length(files) will show how many files we have
strrep stands for string replace. Below, we're using it to replace '.csv' in the file name with '.png' so we can save the image with the same name as the input file. 

%%%%%%%%%%%% This should all be in your script %%%%%%%%%%%%
% script analyze.m for inflammation patient data
% bct3 wrote this script 

files = dir('inflammation*.csv');

for idx = 1:length(files)
    %file_name = sprintf('inflammation-%02d.csv', idx);
    file_name = files(idx).name;
    %img_name = sprintf('patient_data-%02d.png', idx);
    img_name = strrep(files(idx).name, '.csv', '.png');
    
    patient_data = csvread(file_name);
    
    disp(['Analyzing ', file_name, ':'])
    disp(['Max inflammation: ', num2str(max(patient_data(:)))])
    disp(['Min inflammation: ', num2str(min(patient_data(:)))])
    disp(['Standard Deviation inflammation: ', num2str(std(patient_data(:)))])

    subplot(1,3,1)
    ave_inflammation = mean(patient_data, 1);
    plot(ave_inflammation)
    ylabel('average')

    subplot(1,3,2)
    plot(max(patient_data, [], 1));
    ylabel('max')

    subplot(1,3,3)
    plot(min(patient_data, [], 1));
    ylabel('min')

    print('-dpng', img_name)
end
%%%%%%%%%%%% End of what is in script %%%%%%%%%%%%%%%%%%


Yay! We can loop! And we can write scripts. 

We've been using built in functions in matlab, but we can write our own functions. 

Navigate to the swc_workshop folder in the terminal and let's do some spring cleaning.
rm patient_data*
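
rm doesn't ask before deleting, so it's worth previewing what a wildcard matches first. A minimal sketch in a throwaway sandbox folder (the sandbox and the empty files in it are made up for illustration):

```shell
sandbox=$(mktemp -d)
touch "$sandbox/patient_data-01.png" "$sandbox/patient_data-02.png" "$sandbox/analyze.m"

ls "$sandbox"/patient_data*      # preview: which files does the pattern match?
rm "$sandbox"/patient_data*      # then delete exactly those files
remaining=$(ls "$sandbox")       # the script should still be there
echo "$remaining"
rm -rf "$sandbox"
```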

Let's make some simple functions:

%%%%%%%%%%%% This should be in your new script %%%%%%%%%%%%%
% file fahr_to_kelvin.m
function ktemp = fahr_to_kelvin(ftemp)
    ktemp = ((ftemp - 32)*(5/9)) + 273.15;
end
%%%%%%%%%%%% End of script %%%%%%%%%%%%%%%%%%%%%%%

Now run the function:
fahr_to_kelvin(32)

Let's try another function:

%%%%%%%%%% This should be another new script %%%%%%%%%%%%%%
% file kelvin_to_celsius.m
function ctemp = kelvin_to_celsius(ktemp)
    ctemp = ktemp - 273.15;
end
%%%%%%%%%%% End of script %%%%%%%%%%%%%%%%%%%%%%%

Now run that function:
kelvin_to_celsius(0.0)

Let's combine a couple of the functions we already have into one last function:

%%%%%%%%%%%% Another separate script %%%%%%%%%%%%%%%%%
% file fahr_to_celsius.m
function ctemp = fahr_to_celsius(ftemp)
    ktemp = fahr_to_kelvin(ftemp);
    ctemp = kelvin_to_celsius(ktemp);
end
%%%%%%%%%%%% End of script %%%%%%%%%%%%%%%%%%%%%%

fahr_to_celsius(32)

Try to keep functions between 20-40 lines. Much larger than that, they become really hard to manage and typos get pretty hard to spot. 

COFFEEEEE! Be back at 10:50 please. 

Now that you're caffeinated, a challenge!
Start at Concatenating in a function green box here: http://swcarpentry.github.io/matlab-novice-inflammation/04-func.html

disp(['the', 'brain', 'rocks!', num2str(1)])

%%%%%%%%%%%%%% Function from the challenge %%%%%%%%%%%%%
% file fence.m
function output = fence(original, wrapper)
    output = strcat(wrapper, original, wrapper);
end
%%%%%%%%%%%%%% End of script %%%%%%%%%%%%%%%%%%%%%

%%%%%%%%%%%%%% Outer function from challenge %%%%%%%%%%%%%
% file outer.m
function output = outer(word)
    output = strcat(word(1), word(end));  % joins the first and last letters of word
end
% in the command window, outer('helium') gives you 'hm'
%%%%%%%%%%%%%% End of script %%%%%%%%%%%%%%%%%%%%%

outer('brain')

%%%%%%%%%%%%%% center.m script %%%%%%%%%%%%%%%%%%%%
% file center.m
function out = center(data, desired)
    out = (data - mean(data(:))) + desired;
end
%%%%%%%%%%% End of script %%%%%%%%%%%%%%%%%%%%%%%%

z = zeros(2,2)
center(z, 3)

z = z + 1
center(z, 3)

Let's try the center function on our data. Let's start with just one file, so that we know it's doing what we think it should be doing.


data = csvread('inflammation-01.csv');
centered = center(data(:), 0);

size(centered)

Let's add some help comments to our function. Put these right under your function definition. The function needs to be on line one:
%%%%%%%%%%%%%% center.m script %%%%%%%%%%%%%%%%%%%%
function out = center(data, desired)
    % Center data around desired
    % out = center(data, desired)
    % returned "out" array of centered data
    
    out = (data - mean(data(:))) + desired;
end
%%%%%%%%%%% End of script %%%%%%%%%%%%%%%%%%%%%%%%

Challenge! "Testing a Function" question #3 on the bottom of this page: http://swcarpentry.github.io/matlab-novice-inflammation/04-func.html

%%%%%%%%%%%%%%%% Challenge script %%%%%%%%%%%%%%%%%%
function [] = run_analysis(filename)
    disp(['running...', filename])
    patient_data = csvread(filename);
    
    pause(2)
    close()
    
    ave_inflammation = mean(patient_data, 1);

    subplot(1,3,1)
    plot(ave_inflammation)
    ylabel('average')

    subplot(1,3,2)
    plot(max(patient_data, [], 1));
    ylabel('max')

    subplot(1,3,3)
    plot(min(patient_data, [], 1));
    ylabel('min')
    
end
%%%%%%%%%%%%%%%% End of script %%%%%%%%%%%%%%%%%%%%

%%%%%%%%%%%%%%% Batch analysis script %%%%%%%%%%%%%%%%%
function [] =  batch_analysis()
    files = dir('inflammation*.csv');
    for idx = 1:length(files)
        disp(files(idx).name);
        run_analysis(files(idx).name)
    end
end
%%%%%%%%%%%%%%% End of script %%%%%%%%%%%%%%%%%%%%

batch_analysis

When we do defensive programming, we want to try to catch errors before they happen. For instance, you might want to only process files that have exactly 60 rows. You can use something called an assert statement to check that before the analysis runs.

%%%%%%%%%%%%%%%% Challenge script %%%%%%%%%%%%%%%%%%
function [] = run_analysis(filename)
    disp(['running...', filename])
    patient_data = csvread(filename);
    
    assert(size(patient_data, 1) == 60, 'Files must have 60 rows!')
    
    pause(2)
    close()
    
    ave_inflammation = mean(patient_data, 1);

    subplot(1,3,1)
    plot(ave_inflammation)
    ylabel('average')

    subplot(1,3,2)
    plot(max(patient_data, [], 1));
    ylabel('max')

    subplot(1,3,3)
    plot(min(patient_data, [], 1));
    ylabel('min')
    
end
%%%%%%%%%%%%%%%% End of script %%%%%%%%%%%%%%%%%%%%


Control-C will cancel any loop or process running in matlab. Useful when you accidentally start something that takes a long time to run. 

What's the difference between "=" and "=="
- "=" is used to assign to a variable
- "==" is used to test if two variables or objects are the same/equal


GIT

Lesson material: http://swcarpentry.github.io/git-novice/

Open terminal, type git (tells you all the git options)
    on a Mac, if you get an Xcode error, install it:)
    
Why use git?
    multiple versions of the same file get annoying!
        E.g., ... http://www.phdcomics.com/comics/archive.php?comicid=1531
        version1.doc, version2.doc, final.doc, final24.doc
    is very good at managing files/versions through the cloud
    never lose ANYTHING...EVER!
    time travel (go back in time and see previous versions you presented at the World Conference 2013)
    detects when you might overwrite changes and will prompt you to reconcile changes
    
git config --global core.editor "nano -w"
git config --global user.name "Your Name"
git config --global user.email "your@email.com"
git config --global color.ui "auto"
git config --list
    shows you your configuration settings
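
If you leave off --global, a setting applies to one repository only. A small sketch in a temporary repo, so your real configuration is untouched (the temp folder is an assumption for illustration):

```shell
repo=$(mktemp -d)
git init -q "$repo"

# No --global here: these values are stored in this repo's own .git/config.
git -C "$repo" config user.name "Your Name"
git -C "$repo" config user.email "your@email.com"

configured_name=$(git -C "$repo" config --get user.name)
echo "$configured_name"
rm -rf "$repo"
```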

Let's set up a git repository!
cd (to go Home)
mkdir planets
cd planets
git init
    this will initialize your repository
    should see message about git repository (aka "repo") being initialized
ls -a
    lists all files/folders, even hidden ones (which have names that start with "." dot)
    should see .git (that's your hidden git repo folder, if you delete it your git repo/version history is gonezo)
    
git status
    tells you what's happening in your git repo
    you're on branch master (we're staying on "master" for these lessons, so it should always say master today)

Problem: Is it a good idea to make a folder "mars" inside planets, and initialize mars as a git repo?
    pretty much always a bad idea to make repos within repos!

nano mars.txt
type some notes in there about mars
ctrl + o to save, enter to save as mars.txt
ctrl + x to exit
ls and you'll see your mars.txt file
cat mars.txt
    shows contents of file
git status
    you'll see your mars.txt file as an untracked file
git add mars.txt
    adds mars so git will track it
git commit -m "Starts notes on Mars as a base"
    you'll see a note about what you did, files changed

that was the git add - commit cycle: you update a file, put it in the "staging area" (with git add), and then commit (the changes in your staging area are what get committed)
    committing only what is in the staging area is helpful so you can commit only those changes you want; changes that you might not be done with can be staged/committed later
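
Here's the whole add - commit cycle as one runnable sketch. It uses a temporary repo and a made-up line of notes so it doesn't interfere with your planets folder; the name and email are the same placeholders as in the config step.

```shell
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" config user.name "Your Name"
git -C "$repo" config user.email "your@email.com"

echo "some notes about mars" > "$repo/mars.txt"
before=$(git -C "$repo" status --short)       # "?? mars.txt" means untracked
git -C "$repo" add mars.txt                   # put the file in the staging area
git -C "$repo" commit -q -m "Starts notes on Mars as a base"
after=$(git -C "$repo" status --short)        # empty: nothing left to commit
commits=$(git -C "$repo" log --oneline | wc -l | tr -d ' ')

echo "before: $before | after: '$after' | commits: $commits"
rm -rf "$repo"
```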

git status
    shows you status of repo (should be clean, you don't have any new changes to commit)
git log
    shows log of your commits, commit messages, so you can see what/when/who committed

nano mars.txt
    add another line of text to edit the file
    ctrl + o, enter, ctrl + x (save, accept name, exit)

git status
    modified mars.txt that aren't staged for commit (we made changes, but haven't staged them for commit)

but before we commit, let's see what we changed in our file (e.g., maybe the changes we made messed up the file and we want to compare it to the old version)...
git diff
    first line shows you "diff" (the command you ran) and the two files you're comparing
    - indicates deletions, + additions to the file

git commit -m "add concerns about effects of mars' moons on Wolfman"
    oops, something went wrong there, all we see is the git status again with mars.txt still unstaged, and message that "no changes added to commit"
    we forgot to git add!
    
git add mars.txt
git commit -m "add concerns about effects of mars' moons on Wolfman"
    you'll see nice commit message, # files changed, # lines inserted

nano mars.txt
    add a third line of text
    save/exit

cat mars.txt
git diff
    shows line added in green with +

git add mars.txt
git diff
    didn't do anything because it's in the staging area (files have to be unstaged to be diffed; once files are staged git considers that staged version as the current version so there's nothing to diff it with)
    you can edit mars.txt again even though it hasn't been committed, and it will get thrown off the stage (i.e., will be unstaged)
    you can still diff a staged file, "git diff --staged"
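
To see the difference between the two diffs side by side, here's a sketch in a temporary repo (the file contents are made up):

```shell
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" config user.name "Your Name"
git -C "$repo" config user.email "your@email.com"

echo "line one" > "$repo/mars.txt"
git -C "$repo" add mars.txt
git -C "$repo" commit -q -m "first line"

echo "line two" >> "$repo/mars.txt"
git -C "$repo" add mars.txt                   # stage the new line

unstaged=$(git -C "$repo" diff)               # empty: file matches the staging area
staged=$(git -C "$repo" diff --staged)        # staging area vs. last commit
echo "$staged"
rm -rf "$repo"
```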

git commit -m "Discuss concerns about mars' climate for Mummy"
    you'll see your nice commit message
git status
    on master, nothing to commit, we're all caught up

git log
    we've made three commits, first/initial commit is at bottom, most recent commit at top

Problem: Committing changes to git: which would save changes to myfile.txt to local git repo?
    $ git add myfile.txt
    $ git commit -m "my recent changes"

Problem: Make "bio" git repository
    cd ..
        get out of planets so we don't make a repository in a repository (and get stuck in a black hole)
    mkdir bio
    cd bio
    git init
    nano me.txt
        write a three line bio
    git add me.txt
    git commit -m "edited my life"
    nano me.txt
        modify one of the lines, and add a fourth line
    git diff

Let's go back to planets
cd ../planets
ls
    should see mars.txt
git diff HEAD~1 mars.txt
    head = current commit, ~1 = go 1 commit back (from head), and compare mars.txt in that commit to current version
git diff HEAD~2 mars.txt
    compare current mars.txt to mars.txt from 2 commits ago
git log
    get commit id (long crazy string next to "commit") for the commit you want to compare to
git diff fae08aekje83 mars.txt
    or whatever your crazy long string is:) (you usually only need to use the first 10ish characters of it, enough so you know it's going to uniquely identify the commit)
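
A sketch of those three ways of picking an older version, run in a temporary repo with three made-up commits. The short id is looked up with git log rather than typed by hand, since yours will be different.

```shell
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" config user.name "Your Name"
git -C "$repo" config user.email "your@email.com"

# Three commits, each adding one line to mars.txt.
for n in one two three; do
    echo "line $n" >> "$repo/mars.txt"
    git -C "$repo" add mars.txt
    git -C "$repo" commit -q -m "add line $n"
done

one_back=$(git -C "$repo" diff HEAD~1 mars.txt)        # vs. 1 commit ago
two_back=$(git -C "$repo" diff HEAD~2 mars.txt)        # vs. 2 commits ago
short_id=$(git -C "$repo" log -1 --format=%h HEAD~1)   # abbreviated commit id
by_id=$(git -C "$repo" diff "$short_id" mars.txt)      # same as HEAD~1, by id

echo "$two_back"
rm -rf "$repo"
```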
nano mars.txt
    add line about needing to manufacture oxygen or whatever
    save/exit
cat mars.txt
OOPS, we didn't like that change, we want to go back to the last committed version...
git checkout HEAD mars.txt
    head = last commit, and then specify which file you want to checkout from that commit
    the stuff we didn't want in mars.txt will be GONE FOREVER! because we didn't commit it
    you can replace HEAD with a commit ID or HEAD~3 (or any number of commits ago)
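
The undo step as a runnable sketch, again in a temporary repo with made-up file contents:

```shell
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" config user.name "Your Name"
git -C "$repo" config user.email "your@email.com"

echo "committed line" > "$repo/mars.txt"
git -C "$repo" add mars.txt
git -C "$repo" commit -q -m "notes on mars"

echo "a change we regret" >> "$repo/mars.txt"    # edit, but never commit
git -C "$repo" checkout -q HEAD mars.txt         # bring back the committed version
restored=$(cat "$repo/mars.txt")
echo "$restored"
rm -rf "$repo"
```

The regretted line is gone for good, because it was never committed.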

PROBLEM: Recovering older versions of a file...
    Which commands below will let Jennifer recover the last committed version of her Python script called data_cruncher.py? Answer is 5 (both 2 and 4)...
        $ git checkout HEAD data_cruncher.py
        $ git checkout <unique ID of last commit> data_cruncher.py

Let's ignore things we don't want to be version controlled by git (big data files, non-text files, images, raw data, etc.); version controlling just SCRIPTS is common (results files can be reproduced from those scripts at any time). Here's how we ignore....

mkdir results
touch a.dat b.dat c.dat results/a.out results/b.out
ls results
    see a.out and b.out
git status
    lots of new files, but we don't want to see those every time and we don't want to control them
so let's make a git ignore file that lists the things we want git to ignore...
nano .gitignore
    add these two lines, then save/exit...
    *.dat
    results/
git status
    all those new files are gone, and we just see .gitignore
    it's up to you if you version control .gitignore; Emily chooses to do so, Brad does too (because I mess it up sometimes and want to go back to my previous versions:)
git add .gitignore
git commit -m "added git ignore file"
git status
git add a.dat
    note: you will get an error if you try to stage a file that you have in your git ignore file
    you can use "git add -f" to force staging that file, but you probably just want to remove it from your git ignore
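
The ignore steps in one runnable sketch, using a temporary repo with the same empty file names as in class:

```shell
repo=$(mktemp -d)
git init -q "$repo"
mkdir "$repo/results"
touch "$repo/a.dat" "$repo/b.dat" "$repo/c.dat" "$repo/results/a.out" "$repo/results/b.out"

before=$(git -C "$repo" status --short | wc -l | tr -d ' ')   # 3 .dat files + results/
printf '*.dat\nresults/\n' > "$repo/.gitignore"
after=$(git -C "$repo" status --short)                        # only .gitignore is left

# Staging an ignored file is refused unless you force it with -f.
if git -C "$repo" add a.dat 2>/dev/null; then
    add_result="staged"
else
    add_result="refused"
fi

echo "before: $before entries | after: $after | git add a.dat: $add_result"
rm -rf "$repo"
```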

GIT HUB
sign up for an account at github.com
pros: cloud based repositories, free
cons: pretty much only allows public repositories
alternatives: https://bitbucket.org/

In web browser...
    log in to github, click + in top right corner to create a new repository "planets"
    under quick setup, click "ssh" and copy LINK.git

Back in terminal window...let's connect our planets repo to the online github planets repo
    cd ~/planets
    git remote add origin LINK.git

Now let's "push" our repository to the github repository
    git push origin master
            origin = what we're calling our github repository (pretty standard to just use that name, must match what you used when you added the remote link above)
            master = which branch you're on--we aren't doing any "branching" today, you're always on the master branch
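
If you want to practice the remote add + push steps without touching github, here's a sketch that uses a local "bare" repository as a stand-in for the github one. Assumptions (not from the lesson): the temp directory, the hub.git name, the demo identity, and the symbolic-ref line, which only makes sure the branch is called master like in these notes (newer git installs may default to a different name).

```shell
work=$(mktemp -d)
cd "$work" || exit 1

# stand-in for the empty repository you created on github (a local bare repo)
git init -q --bare hub.git

# a small local planets repo with one commit
git init -q planets
cd planets || exit 1
git config user.name "Demo User"          # hypothetical identity for the demo
git config user.email "demo@example.com"
git symbolic-ref HEAD refs/heads/master   # name the branch master, as in these notes
echo "Cold and dry" > mars.txt
git add mars.txt
git commit -q -m "notes on mars"

# connect the local repo to the remote; on github this would be LINK.git
git remote add origin "$work/hub.git"
git remote -v

# push the master branch to the remote we named origin
git push -q origin master

# the remote now has a master branch pointing at our commit
git ls-remote origin
```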

Pair up with neighbor to share repository:
Person A-- on github go to your planets repository page, settings link on right, and click collaborators
    add Person B's github username

Then Person B-- on your computer, go to your Desktop folder (or any folder other than where you have your own planets directory)
    on github, click search bar at the top and search for Person A's username, go to their planets repo, and in lower left panel copy the https link for cloning the repo
    back on your computer in the shell...
        git clone https://github.com/PersonA/planets (or whatever their https link is)
        ls
            you'll see you have a planets directory now
        cd planets
        let's make a new file...
            nano pluto.txt
                add some text to it, save/exit
            git add pluto.txt
            git commit -m "added notes on pluto"
            git push origin master
                this pushes your new files up to Person A's planets repository on github

Person A can now get those changes Person B made into A's planets directory on A's machine
    Person A, in their shell on their computer, from in their planets directory that they pushed up to github earlier...
        git pull origin master
            pulls down new commits that have been made to Person A's planets repo on github
        ls
            you should now see the new pluto.txt file

Now let's create a conflict...
Person A
    make some edits to mars.txt
    git add mars.txt
    git commit -m "person A made some changes"
    git push origin master
        push those changes up to github

After that, Person B
    make some edits to mars.txt
    git add mars.txt
    git commit -m "person B made some OTHER changes!"
    git push origin master
        try to push those changes up to github

Rejected! Git knows--Person A made some edits that Person B didn't have in their repository; before trying to push to github, Person B needed to git pull to get latest version of github planets repository

Person B:
    git pull origin master
        message about a conflict in mars.txt that you need to resolve, do it NOW!
    nano mars.txt
        now it looks kind of crazy, you see both Person A's version and Person B's version, and you can pick which you want (or delete both and type something totally new)
        save/exit
    git add mars.txt
    git commit -m "resolved merge conflict"
    git push origin master
        now it should push just fine; git defers to you, the human, when you've resolved a conflict--that means you can mess it up, and that the person who pushes changes to a repository first has no problem, and the person who gets there second has to deal with the conflict:)
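
The whole conflict story above can be replayed on one machine. This is a sketch, not what we typed in class: a local bare repo (hub.git) plays the role of github, the directories a and b play Person A and Person B, and the file contents are made up. The --no-rebase flag on pull is also an addition; newer git versions want you to say whether to merge or rebase when histories have diverged, and merging is what the notes describe.

```shell
work=$(mktemp -d)
cd "$work" || exit 1

# stand-in for the shared github repo (a local bare repo, branch named master)
git init -q --bare hub.git
git --git-dir=hub.git symbolic-ref HEAD refs/heads/master

# Person A starts the repo and pushes the first version of mars.txt
git clone -q hub.git a 2>/dev/null
cd a || exit 1
git config user.name "Person A"; git config user.email "a@example.com"
git symbolic-ref HEAD refs/heads/master
echo "Cold and dry" > mars.txt
git add mars.txt
git commit -q -m "notes on mars"
git push -q origin master

# Person B clones after that first push
cd "$work" || exit 1
git clone -q hub.git b
cd b || exit 1
git config user.name "Person B"; git config user.email "b@example.com"

# Person A edits mars.txt and pushes first
cd "$work/a" || exit 1
echo "Person A says: two moons" > mars.txt
git add mars.txt
git commit -q -m "person A made some changes"
git push -q origin master

# Person B edits the same line, commits, and gets rejected on push
cd "$work/b" || exit 1
echo "Person B says: no moons" > mars.txt
git add mars.txt
git commit -q -m "person B made some OTHER changes!"
git push -q origin master 2>/dev/null || echo "push rejected, need to pull first"

# pulling reports the conflict and leaves both versions in mars.txt
git pull -q --no-rebase origin master || echo "conflict in mars.txt"
cat mars.txt    # both versions, separated by <<<<<<< ======= >>>>>>> marker lines

# resolve by writing the text we actually want, then add, commit, push
echo "Two moons (Person A was right)" > mars.txt
git add mars.txt
git commit -q -m "resolved merge conflict"
git push -q origin master
```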



SHELL SCRIPTS:
http://swcarpentry.github.io/shell-novice/05-script.html

cd
cd Desktop/shell-novice/data/users/nelle/molecules/
ls
    see cubane.pdb, ethane.pdb, etc. files

Script to take lines 11-15 of file (first 15 lines, then the last 5 of those)...
head -15 octane.pdb | tail -5
nano middle.sh
    type "head -15 octane.pdb | tail -5" in there, save/exit
bash middle.sh
    see same output as just typing those commands in terminal
nano middle.sh
    replace octane.pdb with "$1" (include quotes! necessary when there are spaces in filenames--which you should avoid at all costs :)...
    head -15 "$1" | tail -5
bash middle.sh octane.pdb
    same output again:)
bash middle.sh cubane.pdb
    now you see lines 11-15 of cubane.pdb
nano middle.sh
    head "$2" "$1" | tail "$3"
bash middle.sh octane.pdb -10 -3

Future you: Will be very confused about what the heck this script was for, so let's add comments to the beginning of the script to let future us know what this thing did...
    # Select lines from the middle of a file
    # Usage: middle.sh filename -end_line -num_lines
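
Here's the finished middle.sh in one runnable piece, in case you want to try it away from the molecules folder. Assumptions (not from the lesson): the temp directory, and sample.txt (20 numbered lines made with seq) standing in for octane.pdb so it's easy to see which lines come out.

```shell
demo=$(mktemp -d)
cd "$demo" || exit 1

# sample file with 20 numbered lines (stand-in for octane.pdb)
seq 1 20 > sample.txt

# write the script, comments included; the quoted 'EOF' keeps $1 $2 $3 literal
cat > middle.sh <<'EOF'
# Select lines from the middle of a file
# Usage: middle.sh filename -end_line -num_lines
head "$2" "$1" | tail "$3"
EOF

# take the first 15 lines, then keep the last 5 of those (lines 11-15)
bash middle.sh sample.txt -15 -5
```

Inside the script, $1 is the filename, $2 is the option passed to head, and $3 is the option passed to tail.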

nano sorted.sh
    wc -l "$@" | sort -n
    save/exit; note--the $@ will take as many parameters as you want (whereas $1 or $2 can hold only a single parameter)
    
bash sorted.sh *.pdb ../creatures/*.dat
    we're passing lots of files to the sorted.sh script! (all files matching *.pdb, and all files in the creatures directory matching *.dat)

nano sorted.sh
    add some comments for future you!
        #list files sorted by number of lines
        #usage: sorted.sh filelist
        #note: file list can contain any number of files to sort by number of lines
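
And sorted.sh as a self-contained sketch. Assumptions (not from the lesson): the temp directory and the three small sample files, which are made up so the line counts are obvious.

```shell
demo=$(mktemp -d)
cd "$demo" || exit 1

# three sample files with 3, 1, and 2 lines
printf 'a\nb\nc\n' > three.txt
printf 'a\n' > one.txt
printf 'a\nb\n' > two.txt

# the quoted 'EOF' keeps "$@" literal inside the script file
cat > sorted.sh <<'EOF'
# list files sorted by number of lines
# usage: sorted.sh filelist
wc -l "$@" | sort -n
EOF

# shortest file first; the "total" line from wc sorts to the bottom
bash sorted.sh *.txt
```

The shell expands *.txt into the three filenames before the script starts, and "$@" hands all of them to wc.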

history | tail -4 > redo-figure-3.sh
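    this grabs the last 4 lines of your history (the history command itself counts as one of them) and dumps them into the redo-figure-3.sh file; each line starts with a history number that you'll need to delete, and then you can go into the file in a text editor and modify as desired to create a script based on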