This blog is about technology, software and social media. It's aimed as much at 'normal' people as at the tech-savvy. The author is Tony Gallacher.
Ernest Cline’s science fiction novel “Ready Player One” is about gamers. They compete in an online simulation for a fortune in prize money. The world watches as contestants leapfrog each other on a leaderboard, jostling for position. Kaggle, a platform for data science (or big data) competitions, works in a very similar way. Its one-line mission statement reads: “we’re making data science a sport.”
Deloitte, Ford, Facebook and Wikipedia are some of the companies that use Kaggle. In a typical competition, an organisation posts some historical data to the site and asks a question. How many of our clients will switch insurers before the end of their policy? Which posts will a WordPress user like? What is the likelihood that an HIV patient’s infection will become less severe? The challenge is to answer the question with the most accurate predictive model.
Submissions are crowdsourced from among the platform’s more than 40,000 data scientists. Some work in teams, others go it alone. While you might think this sort of work would suit the talents of actuaries, they make up only 1.2% of Kaggle’s members. Contributors come from a range of fields: they can be statisticians, biologists, hobbyists or philosophers. The largest group, 15.6%, are computer scientists.
Usually the host organisation offers a reward, often financial, often significant, for the best forecast, measured against a solution file. The contestants don’t see this file until after the competition deadline and, as always, the judges’ decision is final.
While Cline’s “Ready Player One” is set in a dystopian world, Kaggle is aiming for something quite different. The multi-stage Heritage Health Prize, worth $3m, will run until April next year. A US health care provider, the Heritage Provider Network (HPN), gave the website’s members anonymised claims data. HPN asked the data scientists to predict which patients would be admitted to hospital within the next year. Its goal is to identify and treat people before they need in-patient treatment.
Participants can submit updated models, as often as every day, while a competition is open. The league table ranks solutions in real time. A competitor might be at the top with a month to go but several places behind a week later – because more accurate solutions have been uploaded. That means the data scientists are motivated to continue to build even better predictive models, right up to the deadline. This is one reason why Kaggle has always crowdsourced better solutions than pre-existing benchmarks.
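For the tech savvy, here is a rough sketch in Python of how scoring against a hidden solution file and a live league table could fit together. It is an illustration only, not Kaggle's actual code: the error measure (root-mean-square error), the Leaderboard class, the team names and the numbers are all assumptions made up for this example.

```python
import math


def rmse(predictions, solution):
    """Root-mean-square error between a submission and the solution file.

    RMSE is an assumed metric for this sketch; lower is better.
    """
    if len(predictions) != len(solution):
        raise ValueError("submission and solution must have the same length")
    squared = [(p - s) ** 2 for p, s in zip(predictions, solution)]
    return math.sqrt(sum(squared) / len(squared))


class Leaderboard:
    """Hypothetical league table that keeps each team's best score."""

    def __init__(self, solution):
        self._solution = solution  # contestants never see this
        self._best = {}            # team name -> lowest error so far

    def submit(self, team, predictions):
        """Score a new submission and keep it if it beats the team's best."""
        score = rmse(predictions, self._solution)
        if team not in self._best or score < self._best[team]:
            self._best[team] = score
        return score

    def ranking(self):
        """Team names ordered from most to least accurate."""
        return [team for team, _ in sorted(self._best.items(), key=lambda kv: kv[1])]


# Made-up solution file: the true value for each of four cases.
solution = [3.0, 0.0, 1.0, 5.0]
board = Leaderboard(solution)

board.submit("alice", [2.0, 1.0, 1.0, 4.0])
board.submit("bob", [0.0, 0.0, 0.0, 0.0])
first = board.ranking()   # alice leads

# Bob uploads an improved model and overtakes alice.
board.submit("bob", [3.0, 0.0, 1.0, 4.5])
second = board.ranking()  # bob leads

print(first, second)
```

In this toy run, alice leads after the first round, then drops to second once bob uploads a more accurate model. That is the same leapfrogging that keeps real contestants improving their models until the deadline.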
Many people are still not sure what big data is. That’s partly because the term is just misleading. Not all sets of “big data” are actually that big (Kaggle has made significant advances with megabytes of data). It’s much more useful to talk about data science: the process of extracting value from datasets. Kaggle competitions are very effective examples of data science.