Start the program. (Double-click on the google-refine.exe file. Java services will start on your machine, and Refine will open in your Firefox browser).
Note the file types Open Refine handles: TSV, CSF, *SV, Excel (.xls .xlsx), JSON, XML, RDF as XML, Google Data documents. Support for other formats can be added with Google Refine extensions.
In this first step, we’ll browse our computer to the sample data file for this lesson. In this case, I’ve modified the Portal_rodents csv file. I added several columns: scientificName, locality, county, state, country and I generated several more columns in the lesson itself (JSON, decimalLatitude, decimalLongitude). Data in locality, county, country, JSON, decimalLatitude and decimalLongitude are contrived and are in no way related to the original dataset. When doing this demo, the columns: JSON, decimalLatitude, decimalLongitude can be deleted, and then recreated if time, with a call to a locality service, and subsequent parsing of the JSON data returned by the service.
If you haven’t already, download the data from:
https://www.dropbox.com/s/kbb4k00eanm19lg/Portal_rodents_19772002_scinameUUIDs.csv?dl=0
Once Refine is open, you’ll be asked if you want to Create, Open, or Import a Project.
- Click Browse, find Portal_rodents_19772002_scinameUUIDs.csv
- Click next to open Portal_rodents_19772002_scinameUUIDs.csv
- Refine gives you a preview - a chance to show you it understood the file. If, for example, your file was really tab-delimited, the preview might look strange, you would choose the correct separator in the box shown and click "update preview."
- If all looks well, click _Create Project._
```
## Faceting
*Exploring data by applying multiple filters*
OpenRefine supports faceted browsing as a mechanism for
* seeing a big picture of your data, and
* filtering down to just the subset of rows that you want to change in bulk.
Typically, you create a facet on a particular column. The facet summarizes the cells in that column to give you a big picture on that column, and allows you to filter to some subset of rows for which their cells in that column satisfy some constraint. That's a bit abstract, so let's jump into some examples.
[More on faceting](https://github.com/OpenRefine/OpenRefine/wiki/Faceting)
## Leading - Trailing Whitespace
One of the most magical bits of Refine, the moment you realize what you’ve been missing. Refine has several clustering algorithms built in. Experiment with them, and learn more about these algorithms and how they work.
In OpenRefine, clustering refers to the operation of “finding groups of different values that might be alternative representations of the same thing”. For example, the two strings “New York” and “new york” are very likely to refer to the same concept and just have capitalization differences. Likewise, “Gödel” and “Godel” probably refer to the same person.
- In this example, in the scientificName Text Facet we created in the step above, click the _Cluster_ button.
- In the resulting pop-up window, you can change the algorithm method, and keying function. Try different combinations to see the difference.
- For example, with this dataset, the _nearest neighbor_ method with the _PPM_ keying function shows the power of clustering the best.
- Intentional errors in these scientific names have been introduced to show how errors (typos) in any position can be found with this method. All errors can then be fixed by simply entering the correct value in the box on the right. Often, the algorithm has guessed correctly.
- After corrections are made in this window, you can either Merge and Close the Cluster pop-up, or Merge and Re-cluster.
Note that at any time, in any cell of the Facet box, or data cell in the Refine window, you have access to “edit” and can fix an error immediately. Refine will even ask you if you’d like to make that same correction to every value it finds like that one (or not). ```` - Scroll to Country. Look at the fourth row. Edit “United States of America” to “UNITED STATES” and apply to all matching cells. - Create a text facet of Country. Edit “US” to match “UNITED STATES” - What is HT? Click on it to filter the data table by it. In the locality column we see that it is Haiti. Edit via the text facet to be HAITI.
## Split / Undo - Redo
If data in a column needs to be split into multiple columns, and the strings in the cells are separated by a common separator (say a comma, or a space), you can use that separator to divide up the bits into their own columns.
- Create a numeric facet on weight, and investigate the larger values.
You can also investigate records by two variables at once
- Click on the arrow of any column and choose Facet > Scatterplot Facet
- The resulting scatterplot matrix shows all columns- click on the scatterplot for hfl and wgt. A detailed scatterplot will appear in the facet tab.
- Draw a box around the lower middle blob of points. This will filter the data file to the records matching those ranges of hfl and wgt.
- Now create a text facet for scientificName. We see that records of this combination of hfl and wgt are made up of 5 species.
Previous: Getting Started with OpenRefine Next: Scripts from OpenRefine