Discovering RDF Data Cubes


#1

I noticed that I often help people with datasets we co-created or maintain, so I thought I would move this kind of discussion here so that others can learn from it as well. In this particular case it’s about a sample RDF Data Cube we created a while back for the Swiss environment agency, the Bundesamt für Umwelt (BAFU). The person asking discovered the GitHub repository and had a look at the sample SPARQL queries, which unfortunately point nowhere for some reason.

Without further ado, here are the questions and my answers, as they are very useful for people new to RDF and to RDF Data Cubes in particular:

  1. So far, there are two graphs with BAFU data, with each one dataset, right?

Yes, that is still the old data, but it is a good base to start working with.

  2. How can I request all the graphs or all the datasets that have BAFU data (not all graphs or all datasets available at the endpoint)?

All queries I post can be executed on: https://test.lindas-data.ch/sparql-ui/
If you run them from the command line or through an API with curl or
similar, use the endpoint shown in the query window: https://test.lindas-data.ch/sparql

In LINDAS there is no proper separation, so you can’t do that specific
query directly, but you can get there with some SPARQL filtering:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?g WHERE {
  GRAPH ?g {
    ?s ?p ?o
  }
  FILTER(CONTAINS(STR(?g), "FOEN"))
}

  3. Can I consider that a graph is just “a group of datasets” and that a
    graph is used within an endpoint to exclude the datasets that don’t
    belong to this graph?

Yes, it’s mainly for dynamically getting data in or out. If you do not
specify a graph, the query runs against the so-called “default” graph, which
in the current LINDAS setup spans all graphs.

But you can also specify one or multiple graphs, for example:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX qb: <http://purl.org/linked-data/cube#>
SELECT *
FROM <https://linked.opendata.swiss/graph/FOEN/UBD28> # comment out one of these FROMs to see what happens
FROM <https://linked.opendata.swiss/graph/FOEN/UBD66>
WHERE {
  ?dataset a qb:DataSet .
}
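
As a variation on the two queries above, here is a sketch that does not name the graphs with FROM but lets the GRAPH keyword bind the graph name, so you see which dataset lives in which graph. It assumes, as in the filter example earlier, that all BAFU graph URIs contain "FOEN".

```sparql
PREFIX qb: <http://purl.org/linked-data/cube#>

# List each qb:DataSet together with the named graph it is stored in.
SELECT DISTINCT ?g ?dataset
WHERE {
  GRAPH ?g {
    ?dataset a qb:DataSet .
  }
  # Assumption: BAFU graph URIs contain "FOEN".
  FILTER(CONTAINS(STR(?g), "FOEN"))
}
```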

  4. Can I assume that there is only 1 qb:MeasureProperty per dataset?

In our cubes, yes, you can assume that. The RDF Data Cube spec also allows
more than one, and there are cases where this might make sense.
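
If you would rather check than assume, the following sketch counts the measures declared per dataset. It assumes the cube publishes a data structure definition linked with qb:structure and qb:component, as the RDF Data Cube vocabulary describes; I have not verified that the old BAFU sample data does, so an empty result may just mean the structure is missing.

```sparql
PREFIX qb: <http://purl.org/linked-data/cube#>

# Count the measure properties declared in each dataset's structure.
SELECT ?dataset (COUNT(DISTINCT ?measure) AS ?measureCount)
WHERE {
  ?dataset a qb:DataSet ;
           qb:structure ?dsd .      # assumption: a DSD is published
  ?dsd qb:component ?component .
  ?component qb:measure ?measure .
}
GROUP BY ?dataset
```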

  5. In the sample query I found here
    https://github.com/lindas-uc/bafu_ubd/blob/master/queries/sparql-ubd66-complete.rq,
    there are some things that I don’t understand:
  • Why is the actual measurement value linked to the qb:Observation with
    bafu:measurement, and not queried with qb:MeasureProperty?

Let’s get some observations:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX qb: <http://purl.org/linked-data/cube#>
SELECT *
#FROM <https://linked.opendata.swiss/graph/FOEN/UBD28> # comment out one of these FROMs to see what happens
FROM <https://linked.opendata.swiss/graph/FOEN/UBD66>
WHERE {
  ?obs a qb:Observation
} LIMIT 10

Let’s have a look at one random observation (you can browse this URI):

http://environment.data.admin.ch/ubd/66/measurement/1_1/Cd/1989-10-17T00%3A00%3A00

In there we find the link you mentioned (bafu:measurement). Open it in a
new window:

http://environment.data.admin.ch/ubd/66/qb/measurement

You will see that the subject URI is a qb:MeasureProperty.

The idea of RDF Data Cube is to define a generic model that can then be
“instantiated” for particular cubes. There is no predefined property that
holds the measured value, but there is a class, so you can create your own
qb:MeasureProperty for whatever makes sense in your data.

That might look like overkill for this particular cube, as we only seem
to have one qb:MeasureProperty:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX qb: <http://purl.org/linked-data/cube#>
SELECT *
#FROM <https://linked.opendata.swiss/graph/FOEN/UBD28> # comment out one of these FROMs to see what happens
FROM <https://linked.opendata.swiss/graph/FOEN/UBD66>
WHERE {
  ?measure a qb:MeasureProperty .
} LIMIT 10

but it makes a bit more sense when we use the same concept for
dimensions and attributes:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX qb: <http://purl.org/linked-data/cube#>
SELECT *
#FROM <https://linked.opendata.swiss/graph/FOEN/UBD28> # comment out one of these FROMs to see what happens
FROM <https://linked.opendata.swiss/graph/FOEN/UBD66>
WHERE {
  ?dimension a qb:DimensionProperty ;
      rdfs:label ?dimensionLabel .
}

Again, in the current BAFU setup there are not that many, but this list
will grow when we have more datasets. For Statistik Zürich, for example,
you will get a lot more results, even for qb:MeasureProperty.

Also note that qb:MeasureProperty acts as a class, so we basically define
a new property that is an instance of the class qb:MeasureProperty.

  • Does the OPTIONAL to request measurements mean that there can be
    observations without measurement?

Yes, that could happen in the BAFU dataset. I think I would do that
differently now and attach a measure with the value “NaN”, which is the
official XML Schema way of saying that this was measured but the result is
“Not a Number”. That will also validate, as “NaN” was defined exactly for
cases like that, so it’s considered a valid xsd:double (or xsd:float) value.
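
To see whether the dataset really contains observations without a measurement, a sketch like the following should work. It uses the measure property URI mentioned above; I have not checked what it currently returns on the endpoint.

```sparql
PREFIX qb: <http://purl.org/linked-data/cube#>

# Observations in UBD66 that have no value for the measure property.
SELECT ?obs
FROM <https://linked.opendata.swiss/graph/FOEN/UBD66>
WHERE {
  ?obs a qb:Observation .
  FILTER NOT EXISTS {
    ?obs <http://environment.data.admin.ch/ubd/66/qb/measurement> ?value .
  }
} LIMIT 10
```

An empty result would mean the OPTIONAL in the sample query is not strictly needed for this graph.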

  6. Is this query reliable to fetch all the dimensions that this dataset
    has?

SELECT * WHERE {
  GRAPH <https://linked.opendata.swiss/graph/FOEN/UBD66> {
    ?dimension a qb:DimensionProperty ;
        rdfs:label ?label .
  }
}

Yes. Note that your way of specifying the graph is just another way of
writing it, in this case without the FROM.

  7. How can I fetch all the unique values that a DimensionProperty can
    have? (for instance all the dates of bafu:date)?

You can use the DISTINCT keyword for that:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX qb: <http://purl.org/linked-data/cube#>
SELECT DISTINCT ?date
FROM <https://linked.opendata.swiss/graph/FOEN/UBD66>
WHERE {
  ?obs <http://environment.data.admin.ch/ubd/66/qb/date> ?date .
}
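
If you want an overview of all dimensions at once instead of going property by property, a sketch along these lines counts the distinct values per dimension. It assumes the qb:DimensionProperty declarations sit in the same graph as the observations, which the dimension query in question 6 suggests.

```sparql
PREFIX qb: <http://purl.org/linked-data/cube#>

# Number of distinct values used for each dimension in UBD66.
SELECT ?dimension (COUNT(DISTINCT ?value) AS ?distinctValues)
FROM <https://linked.opendata.swiss/graph/FOEN/UBD66>
WHERE {
  ?dimension a qb:DimensionProperty .
  ?obs a qb:Observation ;
       ?dimension ?value .
}
GROUP BY ?dimension
ORDER BY DESC(?distinctValues)
```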

  8. The queries listed in the README on this page
    https://github.com/lindas-uc/bafu_ubd don’t return any results. Are
    they outdated?

Yes, apparently I never updated them to the correct endpoint; see the
URIs above.


#2

Re 5: QB says that all measures are required. If you use a single measure per observation, that’s not a problem because you can just skip the observation, but if you use many measures per observation, it becomes a problem.
So how do you write NaN in RDF? I don’t think you can.


#3

I’m not sure I follow on No. 5; can you elaborate?

Regarding NaN, you can do this:

BASE <http://example.org/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

<observation> <measure> "NaN"^^xsd:double .

XSD defines it exactly for those cases and it is valid RDF.
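
On the consuming side you then have to decide what to do with those values. As a sketch, using the made-up example.org names from the snippet above, this skips NaN before averaging. It relies on "NaN" being the only lexical form XSD allows for that value.

```sparql
BASE <http://example.org/>

# Average all measured values, ignoring observations recorded as NaN.
SELECT (AVG(?value) AS ?mean)
WHERE {
  ?observation <measure> ?value .
  # Compare the lexical form, because NaN never equals itself numerically.
  FILTER(STR(?value) != "NaN")
}
```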


#4

  • Does the OPTIONAL to request measurements mean that there can be observations without measurement?
    • yes, that could happen in the BAFU dataset.

That’s not according to QB. Thanks for the NaN suggestion, I didn’t know that.
FWIW, xsd:decimal doesn’t have NaN but double and float do.

See LSD-Dimensions for some cube data crawled from the LOD cloud: https://www.slideshare.net/albertmeronyo/wi-40956859. The site seems to be down, though.

Cheers!


#5

Ah, now I get it with OPTIONAL. I was not aware of that, thanks for the hint. This dataset is quite old and I never felt comfortable with what I did there, so in the future everything will be done the NaN way.

Thanks for LSD-Dimensions, I did not know about that. Will talk to the ones who started that.


#6

We now use NaN in BigDataGrapes.
We prefer to use xsd:decimal because it has arbitrary precision, so I’m wondering whether it’s kosher to mix the two data types in different values of the same property.


#7

NaN is not valid for xsd:decimal; is that why you asked whether you can mix them? I don’t see any reason why not. My only concern is that consuming libraries might not be smart enough to handle a mix of data types.
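
If you want to know whether a property already mixes data types, a sketch like this groups its values by datatype. I use the BAFU measure property and graph from post #1 purely as an example; swap in your own property.

```sparql
# Count the values of one measure property per datatype.
SELECT ?datatype (COUNT(*) AS ?n)
FROM <https://linked.opendata.swiss/graph/FOEN/UBD66>
WHERE {
  ?obs <http://environment.data.admin.ch/ubd/66/qb/measurement> ?value .
}
GROUP BY (DATATYPE(?value) AS ?datatype)
```

More than one row in the result means consumers of that property have to cope with several datatypes.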