Note: Although this guide was written primarily for users running Ubuntu or Debian, all of the instructions (aside from the installation procedures) should work on Windows and Mac machines as well. If you are running Windows or OS X, please follow the installation instructions here.
Until I began my current research project earlier this year, my work has focused primarily on small, geographically circumscribed groups of scientists, philosophers, and other intellectuals. Consequently, I’ve never had much use for databases in my research. Now that I’m beginning my dissertation research in earnest, I am being forced to deal with a very new challenge: collecting and organizing information on hundreds of individuals, publications, and organizations from all over the world.
Now, if you’re like me, you find the prospect of having to collect birth, death, education, and employment records for the subjects of your research to be an exceedingly dull exercise, but one you need to have finished before you can actually begin to sit down and write. Fortunately, with OpenRefine (formerly GoogleRefine) and Freebase you can automate most of this initial data collection process.
Installation & Importing Your Data
To begin, you’ll want to grab the latest release of OpenRefine from Github. After downloading and extracting the zip file, you should open up a terminal window and run the following commands:
This should both compile OpenRefine and launch the program’s server (it might take a minute, but Refine should eventually launch a web browser pointing to http://127.0.0.1:3333). When that’s all finished, you should see a screen that looks like this:
Now that OpenRefine is up and running, we’ll need some data to work with. For the purposes of this guide I’ll be using a public Google Docs spreadsheet, but you can also use TSV, CSV, Excel (.xls and .xlsx), JSON, XML, and RDF as XML files. You can also sign in and authorize OpenRefine to access your Google Drive files.
To add a Google spreadsheet click on the “Google Data” tab of the Create Project dialog screen, and paste the sharable link to your file in the URL text field and click next.
After you’ve imported your spreadsheet, you will be presented with a preview of your database and the opportunity to make adjustments. If everything looks fine, simply click “Create Project.”
Assuming your project imported correctly, you should see a screen like the one below. For the purposes of this guide I’m using a list of people associated with the publication of The Limits to Growth, but feel free to use any list of names, films, or books, but it’s a good idea to use a small list until you have gotten a hang of the program. You’ll notice that this list has some spelling errors and inconsistent capitalization. Don’t worry; in the next step we’ll clean up this data.
Reconciling Your Data
In order to do anything useful with this list, we must first reconcile it with Freebase. To do this, first click the drop-down menu of the column you’d like to work with, and click on “Reconcile” > “Start reconciling…”:
Next you’ll be presented with a list of possible reconciliation services, including “Freebase Query-Based Reconciliation” and “Freebase Reconciliation Service.” Don’t use either of those (as of 2013-09-11 both of these are broken following a Freebase API upgrade), instead we are going to add our own service using the “Add Standard Service” button. From there, you’ll be presented with the option to add a new reconciliation service. Paste the URL http://reconcile.freebaseapps.com/reconcile and then click “Add Service.”
Assuming everything is working properly, you have a new reconciliation service called “Freebase v1 Reconciliation Service on new-freebaseapps.com.” Click on the new service and you’ll be presented with a list of options that will look something like this:
From here you’ll want to pick the most applicable (but also most general) category available. For our purposes here, I am going to pick the “Person” option and click “Start Reconciling.”
After OpenRefine compares your data with the Freebase database, you’ll then need to begin reconciling your data. Depending on whether or not your data is already in Freebase, your spreadsheet cells will look something like this:
If you have multiple reconcilable options or you would like to make sure that the pairing is correct, you can click on the hyperlinked text and get a preview like this:
If you are confident in the pairing, click the check mark next to the appropriate option. If no Freebase data is available, click the select the check mark next “Create New Topic.” Do this until all your data is reconciled.
Automatically Populating A Database
Now that we have our data reconciled, we can move on to something much more impressive: automatically populating new database columns. To start, click the drop-down menu on the column you just reconciled and select “Edit Columns” > “Add columns from Freebase.”
After you’ve done that you’ll see a screen like this:
You can add as many columns as you’d like in this dialog screen, but I used these properties for the sake of simplicity:
And here’s (part of) my result:
When you are finished, you can save your project by clicking the export button in the upper-right of your screen.
This is only a tiny sample of what you can do with OpenRefine + Freebase. If you spend some time looking at the Freebase schema (e.g. here’s the schema for people) you can get really creative.
3 thoughts on “Automating Basic Research Tasks with OpenRefine and Freebase”
I was about to read the article, and then I realized the nice new google-less logo. I downloaded open refine but the new logo is not there. What version of open-refine are you using and where can I download it from?
oops, sorry, I just realized I have to download it from github…