Messy Data?

Open Refine comes to the rescue

Presented by Weiwei Shi
Digital Intiatives Applications Librarian, University of Alberta Libraries
@weiwei_shi | weiwei.shi@ualberta.ca

Have you ever needed to deal with this?

Or this?

Or this?

Data is Messy!

Data Cleaning is a necessary evil

How not to hate other people's data?

Open Refine!

What is OpenRefine

Who uses OpenRefine

What does OpenRefine offer?

  • Powerful yet easy-to-use interface
  • Quickly and interactively explore new datasets, normalize them and process them via API services
  • Easy-to-learn scripting language

What does it do?

  • Applying transformations in bulk
  • Extending existing data with more data from external sources
  • Getting your data to forms that other tools can consume

What does it NOT do?

  • Entering new data one cell at a time
  • In-depth data analysis
  • Data visualization

Start a OpenRefine Project

Get your data in

Sample Data - Provincial Hospitals in Zimbabwe

An API request to Google Places to search places of type "food" within 200 meter radius of University of Alberta: https://maps.googleapis.com/maps/api/place/nearbysearch/json?location=53.5235571,-113.5239325&radius=200&types=food&key=AIzaSyD-aV4ZDVgfqqNMC78UiKjpp65VOOmGrDM

Rows vs Records

A row is a single line of your project.

A record is a combination of one or multiple rows identifying a unique object and sharing the same first column.

Sorting, Facetting and Filtering

Facetting can give you a big picture of the dataset:
http://localhost:3333/project?project=1928396612960

Transforming

Clustering

Geocoding with OpenRefine

Fetching


Say your project has a column named address that contains street address, using "Add Column by Fetching URLs", and use the follow expression to fetch result from Google Geocoding API: "http://maps.google.com/maps/api/geocode/json?sensor=false&address=" + escape(value, "url")

Parsing


Now you need to parse the json response into useful data by using the following expression: with(value.parseJson().results[0].geometry.location, pair, pair.lat +", " + pair.lng)

Reconciliation Services

With OpenRefine, we can perform reconciliation against any web service supporting the Reconciliation Service API.

  • OpenDataRise: A flavor of Open Refine that particularly designed to use with CKAN open source data portal
  • dbpedia: a linked-data data extraction of Wikipedia
  • OpenCorporate: share data on corporate entities as open data

How to use reconciliation services

Example: OpenRefine & OpenCorporates

OpenRefine Expression Language (GREL)

GREL is designed to resemble Javascript. So you can expect basic things to work, and know how they would work: