A Potential Case Study
Not long ago I was looking at a Kaggle competition, a machine learning task about predicting earthquakes.
It was based on the idea that rock fractures at (local) geological scale have the same characteristics as a rock being squeezed in the lab: it makes tell-tale noises before fracturing catastrophically. The challenge was simply stated. Two sets of data: one, the grunting & squeaking noises the rock made when squeezed; two, the times when serious failures happened. Predict the latter from the former.
I didn’t do too well with it. It felt like I was hampered by four things, in no particular order:
- sheer size of the data
- processing power
- lack of data management tools
- an ex-girlfriend doing my head in
The data given was effectively one big (for me, a few GB; it says 2 GB now, but I’m sure it was more like 6) sound file and an associated timeline of when breakage events occurred.
So you start with very low-dimensional stuff, two dimensions at most. You want to find some features in there. Add another parallel stripe of, say, the mean level over the last 1000 samples.
So far so good.
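As a sketch of that one extra stripe, assuming pandas and a small random stand-in signal (the real file was far too big to load like this, and the window of 1000 samples is the figure from above):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the competition's acoustic signal:
# a few thousand random samples instead of the real multi-GB file.
rng = np.random.default_rng(0)
signal = pd.Series(rng.normal(size=10_000))

# One extra "parallel stripe": mean level over the last 1000 samples.
rolling_mean = signal.rolling(window=1000).mean()

# The first 999 entries have no full window yet, so they come out NaN.
print(len(signal), rolling_mean.isna().sum())  # → 10000 999
```

One stripe like this is cheap: it is the same length as the original series.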
But add an FFT, a spectrogram, a cepstrum, some wavelet madness, and the shape of the data very soon starts expanding.
Each of these derived sets is very likely as big as the initial data, and the more interesting ones go off into extra dimensions.
This very quickly gets huge.
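To make that concrete, here is a minimal numpy sketch (random data standing in for the real signal, and a frame length of 256 chosen arbitrarily) of how even a crude framed FFT turns one stripe of samples into a whole 2-D array of derived data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=150_000)          # 1-D raw signal

# Chop the signal into non-overlapping frames of 256 samples and
# FFT each frame: a crude spectrogram, just to show the expansion.
frames = x[: (len(x) // 256) * 256].reshape(-1, 256)
spec = np.abs(np.fft.rfft(frames, axis=1))

print(x.shape)     # (150000,)
print(spec.shape)  # (585, 129) -- frequency has become a new axis
```

A real spectrogram with overlapping windows, or a wavelet transform with many scales, expands things further still.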
Now here I have to apologise: I haven’t read the specs in a while now, and I forget how they handle blobs rather than point-by-point, errm, datapoints.
But still, onto the next thing.
I only had (and still have) a very low-spec machine. OK, I accept things will take time. But it is very frustrating when you have only 4 GB of memory but 500 GB of HD space available. Why doesn’t it swap?! I never asked for more than, say, 1 GB in memory at a time. Bloody Python.
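For what it’s worth, one workaround I could have tried is memory-mapping the raw file so only the slices you actually touch ever occupy RAM. A minimal sketch, assuming the signal has been dumped as raw float32 (the filename and the small demo file are invented here):

```python
import numpy as np

# Create a small demo file standing in for the real multi-GB dump.
demo = np.arange(1_000_000, dtype=np.float32)
demo.tofile("train_signal.f32")

# Memory-map it: only the pages you touch get loaded, so a 6 GB
# file can be processed under a 4 GB RAM budget, slice by slice.
sig = np.memmap("train_signal.f32", dtype=np.float32, mode="r")
chunk_mean = sig[0:100_000].mean(dtype=np.float64)
print(chunk_mean)  # → 49999.5
```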
Or, when I was using Kaggle’s own machines, they got screwed up a bit by having a large quantity of stuff thrown at them. Same for MS, same for (I forget now) a couple of other similar services.
I’m sure Bergi will pick me up on not emphasising web-based processing resources. He’s right. But pragmatically, I want a supercomputer in my basement.
Data Management Tools
This is the leverage point, I reckon. I was writing horrid little scripts to save partial results over here and over there, then a day later forgetting where I’d left things. Had all this been put down as, dare I say it, good metadata, I could have picked up where I left off.
More significantly, I should have been able to say: OK, drop that model, fire at this one instead. Or ensemble models x, y & z.
I’m not trying to make RDF a scripting language, just to leave anchors around. Say I have a model available from my Goose Mating Rituals experiment: a very stupid script that does a bit of SPARQL and puts the pieces together is enough to go and try it.
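As a toy sketch of what those anchors might look like, assuming an invented `ex:` vocabulary and that every saved partial result gets a few triples recorded alongside it, a query like this would be enough to find everything trained for a given task:

```sparql
# Invented vocabulary: ex: is a hypothetical namespace, and the
# ex:TrainedModel / ex:task / ex:savedAt terms are made up for the sketch.
PREFIX ex: <http://example.org/ml#>

SELECT ?model ?path
WHERE {
  ?model a ex:TrainedModel ;
         ex:task ex:GooseMatingRituals ;
         ex:savedAt ?path .
}
```

That is all the anchoring I mean: enough triples that a dumb script can ask "what do I have, and where did I put it?"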
Ex Girlfriend Issues
She really did my head in. Only one bit is particularly relevant here. I DID have quite a nice laptop, complete with an NVIDIA card, perhaps 50 times faster than this thing. One night on Skype we had a row, I got angry and slammed the laptop lid down, forgetting I’d left a USB stick on the keyboard. It all got rather broken.
I have been hand-waving, but hopefully you get the idea of the problems that I felt I hit.
I’m pretty sure most if not all of these are relatively easily fixable. I’m not going to suggest anything here; I’m so out of touch, you know better.
Well, OK, I don’t expect the RDF community to solve the issues with my ex. Not until the inclusion of a declared Null object, anyway, which probably breaks the open world assumption and gets nasty idiots elected to positions of power. In the meantime, I’m saving up for a new computer and OkCupid time.
Over to you.