“Big data, ready or not” April 25, 2012
- Doug Cutting, co-founder of the Apache Hadoop project and creator of Nutch and Lucene
- Nick Halstead - Founder/CTO DataSift
- Hilary Mason – Bit.ly Chief Scientist
- Andy Kirk – Visualising Data
- Edd Dumbill – Program Chair Strata Conference – Moderator
(Photo via Stewart Townsend)
What follows are my notes from the evening - they're not word for word correct (I'm not that good with an iPad keyboard yet) but I hope the sentiment is correct. Where I noted down the panelist, I've attempted to attribute the comments - however, again, I may not be attributing everyone's contributions correctly.
The panel started off with Edd asking, So what is big data? The answers ranged from the correct but slightly silly to the more considered:
lots of 0s and 1s
too big to fit in x (where x is your usual tool - Excel, SQL, memory etc) - Hilary
the combination of multiple data sets where they might fit in one machine/chunk of memory but couldn't be mined together. It's more about mining/analysing data sets together than storing them - Nick
There was a conversation about efficiencies, and about the impact of working at scale:
Pressures are increased due to size of data - pressure on servers but also on human factors - need efficiency to get the most out of things
An example was given of it taking 6 weeks to copy a file when moving data centres
A conversation later touched on the combination of cloud computing and big data, and that whilst it makes a lot of sense from a processing perspective, it depends where your data is starting from. A member of the audience pointed out that you can post your hard drive to Amazon for them to load the data.
The panel then moved on to, So what is data science or a data scientist?
Can ask questions and get answers back before we've forgotten why we asked the question - Hilary
Data science should be a thing, but isn't yet in the UK - Andy.
Probably more as a team than an individual pursuit. Analytics/management information exists already but the terminology possibly not so much.
More about insights than core skills - need to understand Maths but it's more about having a fascination
Lots of closeted/latent data scientists out there - says Doug - if you give them tools, they'll come out of the woodwork
People might not have taken the right courses, but they do have the right insights and interests
Data science is new as a role - a straw poll of the audience showed there are very few who are currently titled as data scientists, and even fewer who have had more than one role as such
Recruiters don't know what data scientists are (in the UK - Nick). We're very early in this.
As an aside, I chatted with a couple of recruitment agents at the event. They'd started to be asked for people with skills they'd never heard of. They attended the event as a way to learn more. They seemed quite human actually, and didn't even have a business card to offer me.
The conversation then moved on to, How does big data fit into a corporate environment?:
Siloed data is bad. It needs to be integrated with all the rest of the data in your business. Your Twitter mentions need to be with your orders etc. This is the only way to get the data in one place to be able to analyse it all in context and see the cause and effect.
The culture of the organisation plays a big part - there is no point in having all the data if the culture isn't open to the changes and doesn't have the ability/desire to respond to them.
Data visualisation can be used as eye candy - Andy had an example of a supermarket wanting prettier visualisations to grab attention. It wasn't about the data. It was all about getting the attention.
The conversation about data visualisation then left the corporate environment, and pulled out one of my favourite thoughts:
Most things lend themselves to timelines and storylines. There is a skill in getting the attention but there is more of a skill at making it stick - Andy
So more about making the person who has seen it return to it, think on it, learn from it. This I liked.
An interesting question was touched upon, one that was the subject of the debate at the Strata conference last month: Is domain knowledge important?
Hilary says: smart people are smart. The domain expertise can add constraints to what you'll explore.
which again I liked.
The first of the questions from the audience was: What can we do with social data/user generated data that we’re not exploiting now?
Hilary says: make use of user attention data, gain insight from hidden data, for instance, bit.ly don't currently analyse data within images/videos etc
Andy says: get access to individual insights, opinions, mindsets at the time. Will get a better view of people's real-time thoughts, not a 2-3 month lag waiting for research to come in. Also, sentiment analysis
Nick asks: But what about privacy? It comes down to our acceptance of whether we want to or are willing to be tracked to that level rather than what we can do.
The privacy question is interesting. Are people becoming more wary? Or will they?
I quite liked the cheek of this question - basically, what should the questioner move their business towards?
There was the inevitable question on: So relational databases vs NoSQL solutions. Which will be the choice in 5 years' time?
RDBMS solutions won't go away. There is and will continue to be a growth of alternative mechanisms. It'll be a combination of relational and some other things - Doug
This inevitably gets asked at every panel/forum/event, and it amazes me that anyone could ever think that there is a one-size-fits-all solution. People have tried to force large data sets or object based data into RDBMS solutions for years, and struggled. That doesn't mean RDBMS solutions don't work. It just means that in the same way that there are jobs in the house that are best done with a hammer and some that are best done with a screwdriver, there are jobs where RDBMS solutions work best, and others where they don't. You can use a screwdriver to bang in a nail, but it's not an efficient solution...
The final question was nice: What is the dark side of big data? What happens with that?
Probably more regulation on data. We share a lot of data with our credit card companies. Facebook owns a lot of information about us, but government/governance hasn't caught up yet. How we’re allowed to analyse the data will change. Not necessarily going to get it right initially - i.e. EU cookie laws etc - Nick
People are learning what should be public and what should be private. People are more aware now than they were a couple of years ago about what Facebook knows about them, and how that can be/is being used and making decisions accordingly - Doug
and my favourite answer
Waiting for governments to legislate is not the answer. We need to make good and ethical decisions - Hilary
My favourite soundbite of the evening has to be:
Get more data out there, and do fun things with it...
After the ODCC conference on Friday, and this, my brain keeps sparking off in all directions with possibilities. Hopefully, during my time off in May I can start to focus on which of these directions to pursue. I've always liked data, and my storyline project, whilst being a relatively small quantity of analogue data, is definitely helping me gain a much greater respect for some areas of data management/curation. It's going to be exciting and, hopefully, lots of fun!