Remark: I wrote this post a long time ago but never published it. Unfortunately, BatchRefine is still based on the old 2.x codebase of OpenRefine, but some might still find this useful.
In this post I will show how you can use a headless OpenRefine instance called BatchRefine to process large amounts of CSV data without memory problems. A normal OpenRefine instance is, in our experience, not well suited to processing large files. If you ever tried to load several hundred thousand or even millions of lines into OpenRefine, you know what I mean.
In this example I transform the open data set of the Swiss National Archive to RDF. The export is provided as a TSV file of around 4 million lines.
My former project colleague Giuliano Mega also did a short video that demonstrates OpenRefine and BatchRefine within the Fusepool P3 platform.
Besides OpenRefine and BatchRefine, I will work with Unix tools that are well suited for processing files. Among them:
file – determine file type
wc – word, line, character, and byte count
split – split a file into pieces
curl – transfer a URL
iconv – character set conversion
head – display first lines of a file
pbpaste – provide copying and pasting to the pasteboard (the Clipboard) from the command line
serdi – read and write RDF syntax
Note that there is in general a difference between the BSD and GNU implementations of these utilities. I tested all this on my macOS 10.12 system, which uses the BSD variants. It might or might not work on platforms like Linux, which uses GNU by default. Please also consult the man-pages of the utilities for more information on what they can do; I only explain what we really need for our task.
I will also use Docker for running BatchRefine.
First, we need to know something about the input file. While OpenRefine can handle different input encodings, BatchRefine assumes the input to be UTF-8. As far as I remember, we never implemented support for other codepages, which means you have to convert the input file to UTF-8 before you feed it to BatchRefine.
So let us start with checking the input file type:
» file INPUT.tab
INPUT.tab: ISO-8859 text, with very long lines, with CRLF, CR line terminators
In our case this is not UTF-8, so we need to convert it accordingly. This can be done like this:
» iconv -f ISO-8859-1 -t utf8 INPUT.tab > INPUT_UTF8.tab
If your file has an encoding other than ISO-8859, check iconv -l for a list of supported input and output encodings and adjust the command above accordingly.
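As a quick sanity check, the round trip can be tried on a throw-away file (the file names here are made up for the demo):

```shell
# Create one ISO-8859-1 encoded line ("Zürich", with octal \374 for ü)
# and convert it to UTF-8.
printf 'Z\374rich\n' > latin1-demo.txt
iconv -f ISO-8859-1 -t UTF-8 latin1-demo.txt > utf8-demo.txt
file utf8-demo.txt   # should now report UTF-8 text
cat utf8-demo.txt
```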
Once iconv has finished, we can check again:
» file INPUT_UTF8.tab
INPUT_UTF8.tab: UTF-8 Unicode text, with very long lines, with CRLF line terminators
That looks more like it! Now let’s check how many lines we will process:
» wc -l INPUT_UTF8.tab
 4002100 INPUT_UTF8.tab
The -l option of wc simply counts the number of lines in a file, in our case around 4 million. As I mentioned before, this is not something you want to load into OpenRefine. But we still have to use OpenRefine first to create our RDF mapping. A good approach here is to use just a few hundred lines to define the mapping in OpenRefine and test it accordingly. We can easily create such a sample with head:
» head -n 200 INPUT_UTF8.tab > SAMPLE.tab
In this case it will create a new file called SAMPLE.tab that contains the first 200 lines of the large file. You can verify with wc that it is indeed 200 lines long.
Now it's time to create the RDF mapping in OpenRefine. I will not go into details here; all you need is an instance of OpenRefine that has the RDF extension installed. I didn't try to install the RDF extension on the latest OpenRefine version, but I do remember that it was broken for a long time. If you don't want to check yourself, simply grab the OpenRefine binary we created within the EU FP7 Fusepool P3 project. It is not based on the latest OpenRefine source code, but it ships a working RDF extension. By the way, if anyone wants to update this code, feel free to create a pull request!
Please consult the RDF Refine documentation about how to create an appropriate RDF configuration, I won’t go into any details about that here.
Once you have a working RDF configuration and have tested it accordingly, we need to export it. This is done by extracting the operation history: go to Undo/Redo, click the Extract… button there, and you should see a dialogue similar to this one:
Click select all and copy the JSON to your clipboard with Ctrl-C. If you are on macOS you can paste it into a file:
» pbpaste > bar-config.json
We will have to provide this to BatchRefine via an HTTP URI. For that, launch a simple HTTP server in a separate window, started in the directory where you put the JSON file:
» python -m SimpleHTTPServer 8000
Serving HTTP on 0.0.0.0 port 8000 ...
Note that for BatchRefine it is essential that the content-type of this file is declared as JSON. You can test this with a simple curl HEAD request:
» curl -I http://localhost:8000/bar-config.json
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.10
Date: Sat, 25 Feb 2017 16:10:49 GMT
Content-type: application/json
Content-Length: 237562
Last-Modified: Sat, 18 Feb 2017 14:32:33 GMT
If you get text/plain or something similar, this will most probably not work for BatchRefine, and instead of RDF you will get weird error messages, so be warned! If you got application/json, we are all set for BatchRefine.
Now we need to fire up BatchRefine. As I mentioned, this is a headless version of OpenRefine created within the Fusepool P3 project. You can either run it as described in the README or use Docker. I like Docker better, as it is very easy to deploy. Simply execute:
» docker run --rm -it -p 8310:8310 fusepool/p3-batchrefine
This will start a foreground-instance of BatchRefine in a container and assign local port 8310 to it.
Let's test if this works with our small test file:
» curl -XPOST -H 'Accept: text/turtle' -H 'Content-Type:text/csv' \
    --data-binary @SAMPLE.tab \
    'localhost:8310/?refinejson=http://localhost:8000/bar-config.json'
Now there is a little catch: if you run BatchRefine in Docker, this will not work, as localhost inside the Docker container is not localhost on your computer, so BatchRefine will not find the JSON configuration. In that case, replace localhost:8000 with the IP address of your computer; the port stays the same. Alternatively, you can put the file on a public webserver somewhere, just make sure the content-type is still JSON.
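Since the exact address differs per machine, here is a small sketch of how the request URL can be assembled; the interface name en0 and the IP address are placeholders from my setup:

```shell
#!/bin/bash
# HOST_IP must be an address the Docker container can reach; on macOS
# something like `ipconfig getifaddr en0` prints it (en0 is an assumption,
# use your active interface). Hard-coded here for the sketch.
HOST_IP=192.168.1.132
ENDPOINT="localhost:8310/?refinejson=http://${HOST_IP}:8000/bar-config.json"
echo "$ENDPOINT"
# The transform call then becomes:
#   curl -XPOST -H 'Accept: text/turtle' -H 'Content-Type:text/csv' \
#       --data-binary @SAMPLE.tab "$ENDPOINT"
```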
If everything worked well, you will see something like this in the BatchRefine process output:
16:19:13.110 [...BatchRefineTransformer] User-Agent:curl/7.51.0 (61744ms)
16:19:13.111 [...BatchRefineTransformer] Expect:100-continue (1ms)
16:19:13.111 [...BatchRefineTransformer] Content-Type:text/csv (0ms)
16:19:13.111 [...BatchRefineTransformer] Accept:text/turtle (0ms)
16:19:13.111 [...BatchRefineTransformer] Host:localhost:8310 (0ms)
16:19:13.111 [...BatchRefineTransformer] Content-Length:31634 (0ms)
16:19:13.112 [...BatchRefineTransformer] ----------------------------------------------------- (1ms)
16:19:13.128 [...BatchRefineTransformer] GET http://192.168.1.132:8000/bar-config.json HTTP/1.1 (16ms)
16:19:13.156 [...BatchRefineTransformer] HTTP/1.0 200 OK [Server: SimpleHTTP/0.6 Python/2.7.10, Date: Sat, 25 Feb 2017 16:19:13 GMT, Content-type: application/json, Content-Length: 237562, Last-Modified: Sat, 18 Feb 2017 14:32:33 GMT] (28ms)
and on my shell where I executed curl, I get (among others):
...
<http://data.alod.ch/bar/id/archivalresource/4897> a locah:ArchivalResource ;
    dc:title "Eidgenössisches Departement für auswärtige Angelegenheiten: Handakten Flavio Cotti, Bundesrat (1993-1999)" ;
    alod:referenceCode "E2850.3*" ;
    locah:maintenanceAgency <http://data.helveticarchives.ch/isil/CH-000018-2> ;
    time:intervalStarts "1993"^^<http://www.w3.org/2001/XMLSchema#gYear> ;
    time:intervalEnds "1999"^^<http://www.w3.org/2001/XMLSchema#gYear> ;
    locah:maintenanceAgencyCode "CH-000018-2" ;
    alod:recordID "4897" ;
    alod:legacyTimeRange "1993-1999" .
So we successfully transformed SAMPLE.tab to RDF! What did this command do? We simply told BatchRefine that we feed it an input file of type text/csv and would like to get text/turtle back, which is one of many RDF serializations. BatchRefine then acts like an OpenRefine instance and converts the file to RDF based on the JSON configuration. Note that, as far as I know, RDF Refine only supports RDF/XML and Turtle, so we have to stick to one of these two as output format.
Now we are almost done, but we need to find a way to automate the processing of many small files with BatchRefine; after all, we don't feel like doing this manually. There are many ways of doing that; in this post I show a shell-script variant that does not need any other languages or tools. Feel free to replace this part with whatever language you are familiar with.
First, we need to split our large file into something smaller. I noticed that OpenRefine has absolutely no memory problems with chunks of 100'000 lines, so let's split our input file:
» split -l 100000 INPUT_UTF8.tab
This will generate a bunch of files; as I didn't specify any special options, split names them on its own. The first file in my case is called xaa. Note that I did this in a directory called input, which my shell scripts below rely on.
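The naming scheme is easy to see on a toy example (the demo file below is made up):

```shell
# 250 lines split into chunks of 100 lines: split names the pieces
# xaa, xab, xac, ... in order.
seq 1 250 > split-demo.txt
split -l 100 split-demo.txt
wc -l xaa xab xac
```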
Now the problem is that BatchRefine relies on the first line of the CSV file to figure out how each column is mapped. This information is lost in all files except xaa, so we need to prepend this first line to every other file. This can be done with a small shell script:
#!/bin/bash
for file in ./input/*
do
    filename=$(basename "$file")
    if [ ! -f "input/"$filename".tsv" ]
    then
        echo Processing $filename...
        head -n 1 input/xaa | cat - "input/"$filename > "input/"$filename".tsv"
    else
        echo Found "input/"$filename".tsv", skipping $file
    fi
done
Save this and execute it; you will get a bunch of new files with the same names but an additional .tsv extension. One problem remains: the file xaa now gets a double header, as it already contained one. You can delete that manually or fix the script accordingly.
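A one-off fix for that chunk can look like this; I show it on a made-up demo file so it is easy to try, for the real run replace demo.tsv with input/xaa.tsv:

```shell
# Simulate the doubled header, then drop the first line again with tail.
printf 'ID\tTitle\nID\tTitle\n1\tFoo\n' > demo.tsv
tail -n +2 demo.tsv > demo.tsv.fixed && mv demo.tsv.fixed demo.tsv
head -n 1 demo.tsv   # the header now appears only once
```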
Now everything is set and we can execute the processing on every single file we created. This can again be done with a simple shell script:
#!/bin/bash
for file in ./input/*.tsv
do
    filename=$(basename "$file")
    if [ ! -f "target/"$filename"-transformed" ]
    then
        echo Processing $filename...
        curl -XPOST -H 'Accept: text/turtle' -H 'Content-Type:text/csv' \
            --data-binary "@"$file -o "target/"$filename"-transformed" \
            'localhost:8310/?refinejson=http://localhost:8000/bar-config.json'
    else
        echo Found "target/"$filename"-transformed", skipping $file
    fi
done
On my MacBook, this script transforms the 4 million lines to RDF in roughly 10 minutes!
For large files I prefer the N-Triples serialization, so I use serdi to convert everything from Turtle to N-Triples. You can find the tool at https://drobilla.net/software/serd; on macOS I installed it with brew. The conversion script:
#!/bin/bash
for file in ./target/*
do
    filename=$(basename "$file")
    serdi -i turtle -o ntriples $file >> complete.nt
done
If you now run wc -l complete.nt, you will see how many triples were generated! In my case I get around 34 million triples in a file of 4.4GB (uncompressed).
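One caveat: since every chunk is transformed independently, the concatenated file may contain duplicate triples. Because N-Triples is line-based, sort -u is enough to deduplicate; a sketch on a made-up three-line file:

```shell
# Two identical triples plus one distinct one; sort -u keeps one of each.
printf '<urn:a> <urn:b> "x" .\n<urn:a> <urn:b> "x" .\n<urn:a> <urn:b> "y" .\n' > demo.nt
sort -u demo.nt > demo-dedup.nt
wc -l < demo-dedup.nt
```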