Using OpenRefine and BatchRefine to export large CSV datasets to RDF



Remark: I wrote this post a long time ago but never published it. Unfortunately, BatchRefine is still based on the old 2.x codebase of OpenRefine, but some might still find this useful.

In this post I will show how you can use a headless OpenRefine instance called BatchRefine to process large amounts of CSV data without memory problems. In our experience, a normal OpenRefine instance is not really suited for processing large files. If you have ever tried to load several hundred thousand or even millions of lines into OpenRefine, you know what I mean.

In this example I transform the Open Data data set for the Swiss National Archive to RDF. The export is provided as a TSV file of around 4 million lines.

My former project colleague Giuliano Mega also did a short video that demonstrates OpenRefine and BatchRefine within the Fusepool P3 platform.

Toolchain

Besides OpenRefine and BatchRefine, I will work with Unix tools, which are well suited for processing files. Among them:

  • file – determine file type
  • wc – word, line, character, and byte count
  • split – split a file into pieces
  • curl – transfer a URL
  • iconv – character set conversion
  • head – display first lines of a file
  • pbpaste – provide copying and pasting to the pasteboard (the Clipboard) from command line
  • serdi – Read and write RDF syntax

Note that there is, in general, a difference between the BSD and GNU implementations of these utilities. I tested all of this on my macOS 10.12 system, which ships the BSD variants. Things might or might not work the same way on platforms like Linux, which use the GNU versions by default. Please also consult the man pages of the utilities for more information about what they can do; I only explain what we really need for our task.
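
If you are not sure which flavour you have, a quick check is to ask one of the utilities for its version: the GNU tools understand --version, while the BSD tools on macOS just complain about an unknown option. The output below is illustrative:

» split --version
split (GNU coreutils) 8.31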

I will also use Docker for running BatchRefine.

Getting started

First, we need to know something about the input file. While OpenRefine can handle different input encodings, BatchRefine assumes the input to be UTF-8; as far as I remember, we never implemented support for other codepages. This means you have to convert the input file to UTF-8 before you feed it to BatchRefine.

So let us start with checking the input file type:

» file INPUT.tab
INPUT.tab: ISO-8859 text, with very long lines, with CRLF, CR line terminators

In our case this is not UTF-8, so we need to convert it accordingly. This can be done like this:

» iconv -f ISO-8859-1 -t utf8 INPUT.tab > INPUT_UTF8.tab

If your file has an encoding other than ISO-8859-1, check iconv -l for a list of supported input and output encodings and adjust the command above accordingly.
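
The list printed by iconv -l is long, so it helps to filter it. For example, to see which names of the ISO-8859 family your iconv knows about (the exact output depends on the iconv implementation):

» iconv -l | grep -i 8859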

Once iconv has finished, we can check again:

» file INPUT_UTF8.tab
INPUT_UTF8.tab: UTF-8 Unicode text, with very long lines, with CRLF line terminators

That looks more like it! Now let’s check how many lines we will process:

» wc -l INPUT_UTF8.tab
 4002100 INPUT_UTF8.tab

The -l option of wc simply counts the number of lines in a file, in our case around 4 million. As I mentioned before, this is not something you want to load into OpenRefine. But we still have to use OpenRefine first to create our RDF mapping. A good approach is to use just a few hundred lines to define the mapping in OpenRefine and test it accordingly. We can easily create such a sample by using head:

» head -n 200 INPUT_UTF8.tab > SAMPLE.tab

This creates a new file called SAMPLE.tab that contains the first 200 lines of the large file. You can verify with wc that it is indeed 200 lines long.
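
For example, a quick sanity check:

» wc -l SAMPLE.tab
     200 SAMPLE.tab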

OpenRefine

Now it’s time to create the RDF mapping in OpenRefine. I will not go into details here; all you need is an instance of OpenRefine that has the RDF extension installed. I didn’t try to install the RDF extension on the latest OpenRefine version, but I do remember that it was broken for a long time. If you don’t want to check yourself, simply grab the OpenRefine binary we created within the EU FP7 Fusepool P3 project. It is not on the latest level of the OpenRefine source code, but it ships a working RDF extension. By the way, if anyone wants to update this code, feel free to create a pull request!

Please consult the RDF Refine documentation about how to create an appropriate RDF configuration, I won’t go into any details about that here.

Once you have a working RDF configuration and have tested it accordingly, we need to export the configuration. This is done by extracting the operation history: go to Undo/Redo, click the Extract… button there, and you should see a dialogue similar to this one:

OpenRefine configuration export

Click select all and copy the JSON to your clipboard with Ctrl-C. If you are on macOS you can paste it into a file from the command line:

» pbpaste > bar-config.json
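
If you are on Linux instead, a clipboard tool like xclip can do the same; this is just a hint and assumes xclip is installed, I only tested the macOS route:

» xclip -selection clipboard -o > bar-config.json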

We will have to provide this file to BatchRefine via an HTTP URI. For that, we launch a simple HTTP server in a separate terminal window, started in the directory where you put the JSON file:

» python -m SimpleHTTPServer 8000
Serving HTTP on 0.0.0.0 port 8000 ...
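
The command above uses Python 2. If you only have Python 3 available, the equivalent built-in module is http.server; it should serve .json files as application/json as well, but do verify that with the curl check below:

» python3 -m http.server 8000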

Note that for BatchRefine it is essential that the content type of this file is declared as JSON. You can test this with a simple curl HEAD request:

» curl -I http://localhost:8000/bar-config.json 
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.10
Date: Sat, 25 Feb 2017 16:10:49 GMT
Content-type: application/json
Content-Length: 237562
Last-Modified: Sat, 18 Feb 2017 14:32:33 GMT

If you get text/plain or something similar, this will most probably not work for BatchRefine, and instead of RDF you will get weird error messages, so be warned :slight_smile: If you got application/json, we are all set for BatchRefine.

BatchRefine

Now we need to fire up BatchRefine. As I mentioned, this is a headless version of OpenRefine created within the Fusepool P3 project. You can either run it as described in the README or use Docker. I like Docker better, as it is very easy to deploy. Simply execute:

» docker run --rm -it -p 8310:8310 fusepool/p3-batchrefine

This will start a foreground-instance of BatchRefine in a container and assign local port 8310 to it.
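
To confirm that the container is up and that the port mapping is in place, you can run docker ps in another terminal; the container ID below is illustrative and some columns are elided:

» docker ps
CONTAINER ID   IMAGE                     ...   PORTS
1a2b3c4d5e6f   fusepool/p3-batchrefine   ...   0.0.0.0:8310->8310/tcp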

Let’s test if this works with our small test-file:

» curl -XPOST -H 'Accept: text/turtle' -H 'Content-Type:text/csv' --data-binary @SAMPLE.tab 'localhost:8310/?refinejson=http://localhost:8000/bar-config.json'

Now there is a little catch: if you run BatchRefine in Docker, this will not work, because localhost inside the container is not localhost on your computer, so BatchRefine will not find the JSON configuration. If this is the case, you need to replace localhost:8000 with the IP address of your computer; the port stays the same. Alternatively, you can put the file on a public web server somewhere, just make sure the content type is still JSON.
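
On macOS you can look up the address of your primary interface with ipconfig getifaddr (en0 is an assumption, your interface name may differ) and use it in the request URL; 192.168.1.132 is simply the address of my machine:

» ipconfig getifaddr en0
192.168.1.132
» curl -XPOST -H 'Accept: text/turtle' -H 'Content-Type:text/csv' --data-binary @SAMPLE.tab 'localhost:8310/?refinejson=http://192.168.1.132:8000/bar-config.json'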

If everything worked well, you will see something like this in the BatchRefine process output:

16:19:13.110 [...BatchRefineTransformer] User-Agent:curl/7.51.0 (61744ms)
16:19:13.111 [...BatchRefineTransformer] Expect:100-continue (1ms)
16:19:13.111 [...BatchRefineTransformer] Content-Type:text/csv (0ms)
16:19:13.111 [...BatchRefineTransformer] Accept:text/turtle (0ms)
16:19:13.111 [...BatchRefineTransformer] Host:localhost:8310 (0ms)
16:19:13.111 [...BatchRefineTransformer] Content-Length:31634 (0ms)
16:19:13.112 [...BatchRefineTransformer] ----------------------------------------------------- (1ms)
16:19:13.128 [...BatchRefineTransformer] GET http://192.168.1.132:8000/bar-config.json HTTP/1.1 (16ms)
16:19:13.156 [...BatchRefineTransformer] HTTP/1.0 200 OK [Server: SimpleHTTP/0.6 Python/2.7.10, Date: Sat, 25 Feb 2017 16:19:13 GMT, Content-type: application/json, Content-Length: 237562, Last-Modified: Sat, 18 Feb 2017 14:32:33 GMT] (28ms)

and in the shell where I executed curl, I get (among other triples):

...
<http://data.alod.ch/bar/id/archivalresource/4897> a locah:ArchivalResource ;
    dc:title "Eidgenössisches Departement für auswärtige Angelegenheiten: Handakten Flavio Cotti, Bundesrat (1993-1999)" ;
    alod:referenceCode "E2850.3*" ;
    locah:maintenanceAgency <http://data.helveticarchives.ch/isil/CH-000018-2> ;
    time:intervalStarts "1993"^^<http://www.w3.org/2001/XMLSchema#gYear> ;
    time:intervalEnds "1999"^^<http://www.w3.org/2001/XMLSchema#gYear> ;
    locah:maintenanceAgencyCode "CH-000018-2" ;
    alod:recordID "4897" ;
    alod:legacyTimeRange "1993-1999" .

So we successfully transformed SAMPLE.tab to RDF! What did we do with this command? We told BatchRefine that we are feeding it an input file of type text/csv and that we would like to get text/turtle as a result, which is one of many RDF serializations. BatchRefine then acts like an OpenRefine instance and simply converts the file to RDF based on the JSON configuration. Note that, as far as I know, RDF Refine only supports RDF/XML and Turtle, so we have to stick to one of these two as output format.

Bulk-processing

Now we are almost done, but we still need a way to automate the BatchRefine processing for the many small files; in the end, we don’t feel like doing this manually. There are many ways of doing that; in this post I explain a shell-script variant that does not need any other languages or programming. Feel free to replace this part with whatever language you are familiar with.

First, we need to split our large file into something smaller. I noticed that OpenRefine has absolutely no memory problems with chunks of 100’000 lines, so let’s split our input file:

» split -l 100000 INPUT_UTF8.tab

This will generate a bunch of files; as I didn’t specify any special options, split will name them on its own. The first file in my case is called xaa. Note that I did this in a directory called input; this is needed by my shell scripts later.
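
With 100’000 lines per chunk and roughly 4 million input lines, you should end up with about 41 files; a quick way to check:

» ls input | wc -l
      41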

Now the problem is that BatchRefine relies on the first line of the CSV file to figure out how each column is mapped. This header line is only present in xaa, so we need to write it into every other file. This can be done with a shell script I call split-clean-header.sh:

#!/bin/bash

# Prepend the header line (taken from xaa) to every chunk and store the
# result as <name>.tsv. Chunks that already have a .tsv version are
# skipped, so the script can safely be re-run.
for file in ./input/*
do
  filename=$(basename "$file")

  # skip the .tsv files this script creates itself
  case "$filename" in *.tsv) continue ;; esac

  if [ ! -f "input/${filename}.tsv" ]
  then
    echo "Processing ${filename}..."
    if [ "$filename" = "xaa" ]
    then
      # xaa already contains the header, so just copy it
      cp "input/${filename}" "input/${filename}.tsv"
    else
      head -n 1 input/xaa | cat - "input/${filename}" > "input/${filename}.tsv"
    fi
  else
    echo "Found input/${filename}.tsv, skipping ${file}"
  fi
done

Save this and execute it; you will get a bunch of new files with the same names but an additional .tsv extension. Since xaa already contains the header line, the script simply copies it instead of prepending the header a second time.
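
A quick sanity check, assuming your shell is bash or zsh, is to compare the header of xaa.tsv with the first line of one of the later chunks (xab is just an example name); if the command prints nothing, the headers match:

» diff <(head -n 1 input/xaa.tsv) <(head -n 1 input/xab.tsv)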

Now everything is set, and we can execute the processing on every single file we created. This can again be done with a simple shell script, which I call bulk-refine.sh:

#!/bin/bash

# Send every .tsv chunk to BatchRefine and store the Turtle result in
# the target/ directory. Already transformed chunks are skipped, so the
# script can be re-run after an interruption.
mkdir -p target

for file in ./input/*.tsv
do
  filename=$(basename "$file")

  if [ ! -f "target/${filename}-transformed" ]
  then
    echo "Processing ${filename}..."
    curl -XPOST -H 'Accept: text/turtle' -H 'Content-Type:text/csv' \
      --data-binary "@${file}" -o "target/${filename}-transformed" \
      'localhost:8310/?refinejson=http://localhost:8000/bar-config.json'
  else
    echo "Found target/${filename}-transformed, skipping ${file}"
  fi
done

On my MacBook, this script transforms the 4 million lines to RDF in roughly 10 minutes!

For large files I prefer the N-Triples serialization, so I use serdi to convert everything from Turtle to N-Triples. You can find the tool at https://drobilla.net/software/serd; on macOS I installed it with brew. The script I call bulk-to-ntriples.sh:

#!/bin/bash

# Convert all transformed Turtle files to N-Triples and collect them in
# a single file. Start fresh so a re-run does not append duplicates.
rm -f complete.nt

for file in ./target/*
do
  serdi -i turtle -o ntriples "$file" >> complete.nt
done

If you now run wc -l complete.nt, you will see how many triples were generated. In my case I get around 34 million triples in a file of 4.4 GB (uncompressed).
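
For example; the figure shown here is just the rounded number from my run, yours will differ:

» wc -l complete.nt
 34000000 complete.nt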

Happy SPARQLing!

Credits

My thanks go to Andrey Bratus and Giuliano Mega from SpazioDati. They were working with me in the Fusepool P3 team and did the amazing OpenRefine/BatchRefine work!