Data Cubes and Largish Data


A Potential Case Study

Not long ago I was looking at a Kaggle competition, a machine learning task about predicting earthquakes.

It was based on the idea that rock fractures at (local) geological scale have the same characteristics as a rock being squeezed in the lab: it makes tell-tale noises before fracturing catastrophically. The challenge was simply stated. Two sets of data: one, the grunting and squeaking noises the rock made when squeezed; two, the times when serious failures happened. Predict the latter from the former.

I didn’t do too well with it. It felt like I was hampered by four things, in no particular order:

  • sheer size of the data
  • processing power
  • lack of data management tools
  • an ex girlfriend doing my head in

Data Size

The data given was effectively a big sound file - big for me, a few GB (it says 2GB now; I’m sure it was more like 6) - and an associated timeline of when breakage events occurred.

So you start with very low-dimensional stuff - two dimensions at most. You want to find some features in there. Add another parallel stripe of, say, the mean level over the last 1000 samples.
So far so good.
But add an FFT, a spectrogram, a cepstrum, wavelet madness, and the shape of the data soon starts expanding.
Each of these is very likely as big as the initial data, with the more interesting ones going off into extra dimensions.
This very quickly gets huge.
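As a rough sketch of that expansion - with made-up numbers for the sample count, window and hop sizes - the arithmetic looks something like this:

```python
# Hypothetical back-of-envelope: how feature "stripes" multiply data size.
raw_samples = 600_000_000          # roughly a few GB of audio samples (invented figure)

def stft_size(n_samples, window=1024, hop=256):
    """Number of values in a magnitude spectrogram: frames x bins."""
    frames = 1 + (n_samples - window) // hop
    bins = window // 2 + 1          # one-sided spectrum
    return frames * bins

rolling_mean = raw_samples          # one value per sample: same size again
spectrogram = stft_size(raw_samples)

total = raw_samples + rolling_mean + spectrogram
print(spectrogram / raw_samples)    # roughly 2x the raw data from one transform
print(total / raw_samples)          # roughly 4x already, before wavelets etc.
```

One rolling-mean stripe doubles the data; one modest spectrogram roughly doubles it again, and it only goes up from there.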

Now here I have to apologise: I haven’t read the specs in a while, and I forget how they handle blobs rather than point-by-point, errm, datapoints.

But still, on to the next thing.


Processing Power

I only had (have) a very low-spec machine. OK, I accept things will take time. But it is very frustrating when you have only 4GB of memory but 500GB of HD space available. Why doesn’t it swap?! I never asked it for more than, say, 1GB in memory at a time. Bloody Python.
Or, when I was using Kaggle’s own machines, they got screwed up a bit by having a large quantity of stuff thrown at them. Same for MS, and the same for - I forget now - a couple of other similar services.

I’m sure Bergi will pick me up on not emphasising web-based processing resources. He’s right. But pragmatically, I want a supercomputer in my basement.

Data Management Tools

This is the leverage point, I reckon. I was writing horrid little scripts to save partial results over here and over there, then a day later forgetting where I’d left things. Had this been put down as, dare I say it, good metadata, I could have picked up where I left off.
More significantly, I should have been able to say, ok, drop that model. Fire at this one instead. Or ensemble x, y & z models.
Not trying to make RDF a scripting language, just leaving anchors around. I have this model available from my Goose Mating Rituals experiment. A very stupid script - do a bit of SPARQL, put the pieces together - is enough to go and try it.
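Something like this toy sketch is all I mean - plain tuples standing in for real RDF triples, a crude pattern match standing in for the bit of SPARQL, and every name entirely made up:

```python
# Toy sketch of "leaving anchors around": a tiny in-memory triple store of
# partial results, so a day later you can ask where you left things.
# (A stand-in for proper RDF + SPARQL; all identifiers are invented.)

triples = set()

def add(s, p, o):
    triples.add((s, p, o))

def match(s=None, p=None, o=None):
    """Crude triple-pattern match: None acts as a wildcard variable."""
    return [(ts, tp, to) for (ts, tp, to) in triples
            if s in (None, ts) and p in (None, tp) and o in (None, to)]

# Record partial results as you go, instead of horrid little save scripts.
add("model:gbm-v3", "trainedOn", "features/spectrogram.parquet")
add("model:gbm-v3", "storedAt", "runs/gbm-v3/model.pkl")
add("model:goose-lstm", "trainedOn", "goose-mating-rituals/calls.wav")
add("model:goose-lstm", "storedAt", "runs/goose/model.pkl")

# "I have this model available from my Goose Mating Rituals experiment."
for s, p, o in match(p="storedAt"):
    print(s, "->", o)
```

The point isn’t the ten lines of Python; it’s that the anchors survive between sessions, so "drop that model, fire at this one, ensemble x, y and z" becomes a query rather than archaeology.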

Ex Girlfriend Issues

She really did my head in. Only one bit is particularly relevant. I DID have quite a nice laptop, complete with an NVIDIA card, perhaps 50 times faster than this thing. One night on Skype we had a row, I got angry, and I slammed the laptop lid down, forgetting I’d left a USB stick on the keyboard. It all got rather broken.


I have been hand-waving, but hopefully you get the idea of the problems I felt I hit.
I’m pretty sure most, if not all, are relatively easily fixable. I’m not going to suggest anything here; I’m so out of touch that you’ll know better.

Well, OK, I don’t expect the RDF community will solve the issues with my ex. Not until the inclusion of a declared Null object, which probably breaks the open world assumption and gets nasty idiots elected to positions of power. In the meantime, I’m saving up for a new computer and OkCupid time.

Over to you.


Sounds like an interesting project. I like environment-related projects.
I also have one on my todo list.

Data Size

I’m not sure, but it sounds like you have some kind of audio data (or something that can be handled like audio), so maybe it can be treated the way image data is.

Don’t store the full data in RDF, just point to it or fragments of it with Media Fragments URIs.
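A minimal sketch of what I mean, assuming the data lives in one big audio file (the file name and times here are invented):

```python
# Point RDF observations at slices of one big audio file using Media
# Fragments temporal URIs (#t=start,end, Normal Play Time in seconds),
# rather than storing the samples themselves in the triple store.

AUDIO = "https://example.org/data/acoustic.wav"   # placeholder URL

def fragment_uri(base, start_s, end_s):
    """Build a Media Fragments temporal clip URI for [start_s, end_s)."""
    return f"{base}#t={start_s},{end_s}"

print(fragment_uri(AUDIO, 0.0, 1.5))
# An observation could then reference the clip with a single triple, e.g.:
# <obs/1> <usesSource> <https://example.org/data/acoustic.wav#t=0.0,1.5>
```

So the triple store only carries URIs and derived values, and the bulky samples stay in the file where they belong.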

Are you doing the FFT etc. on the fly? Then also store any additional data required for the processing in the observation. E.g. images can be scaled and rotated for data augmentation; I would create an intermediate dataset that contains multiple observations for one input observation, with the same Media Fragments URI but different arguments for scaling and rotating.

If the data can’t be handled with Media Fragments, I need some more details. Maybe I can propose an alternative, or you just need a very big machine for a very big triple store :wink:


Sorry, but I have to agree: I also want/have my supercomputer, which means I don’t have a solution for you at the moment. I have one dedicated machine for GPU tasks at home, with a powerful Nvidia card. On my todo list is some kind of scheduling software that starts the machine via Wake-on-LAN when there is enough solar power.

I expect there will be more powerful small, Raspberry Pi-like machines with NPUs, powerful enough for training, and everyone will have one at home. It will be shared with all devices on the local network and can be found via the Simple Service Discovery Protocol or something similar. But that’s the future. Right now you have to start with a small model, check whether it’s going in the right direction, and scale it up once you have the base architecture. I’m sure you’ll find some papers and posts on the Web about it.
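For the Wake-on-LAN part, a minimal sketch would be something like this (the MAC address is a placeholder, and the solar-power check is left out):

```python
import socket

def magic_packet(mac):
    """Wake-on-LAN magic packet: 6 x 0xFF followed by the target MAC repeated 16 times."""
    mac_bytes = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    return b"\xff" * 6 + mac_bytes * 16

def wake(mac, broadcast="255.255.255.255", port=9):
    """Broadcast the magic packet over UDP so the sleeping machine's NIC wakes it."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        s.sendto(magic_packet(mac), (broadcast, port))

# wake("00:11:22:33:44:55")   # call from the scheduler when solar power suffices
```

A scheduler would just call `wake()` when its power condition is met; the rest is up to the BIOS/NIC settings on the GPU machine.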

Data Management Tools

I have some ideas for this, but no ready solution (yet). I ported the concept of RDF loaders to Python and created some example code showing how it could be used in the field of machine learning. You can find it here, with a short readme and comments in the code. I also have some code to read the cube data in Python, but that’s connected to more static code and needs to be cleaned up before I can put it on GitHub. Maybe it’s something I can already show you via screen sharing, if you ping me.

Ex Girlfriend Issues

I think you are on the right track by working on a project :slight_smile: