Big data researchers at Trinity College Dublin are to lead a €4 million EU-funded effort to enforce a degree of “quality control” over the ocean of data the world is rapidly accumulating.
The “Aligned” project is funded under the Horizon 2020 research budget and is designed to directly attack the inherent complexity, scale and inconsistency of data stores available on the web, explains Dr Kevin Feeney, a research fellow in Trinity’s school of computer science and statistics.
He will lead the project with senior research fellow Dr Rob Brennan, supported by associate professor of computer science Declan O’Sullivan.
“One of the big problems with big data sets is quality control,” says Dr Feeney. “You can mine it and get statistics from it, but any conclusions drawn are only as good as the data they came from and data gets messy very quickly.”
The goal is to find a better way: to develop methods that build in consistency and quality assurance for developers who want to access the information embedded within the data source.
“It will develop technology to allow software developers to incorporate big data from the web into their applications,” says Dr Brennan.
App developers, for example, want to develop new services based on the data but don’t want to be held up by inconsistencies and weaknesses inherent in the source material.
The Aligned (Aligned, quality-centric software and data engineering) consortium, which includes Oxford, Leipzig and Poznan universities and software companies in Germany and Poland, will make it easier for app developers and others seeking to mine big data to get what they want from the resource. The funding will sustain the work for three years.
One way to achieve improved quality is to begin building quality control into the data itself, says Dr Feeney. “We are starting to put more intelligence into our big data, the software is starting to be embedded in the data.”
Achieving this requires a blend of human intervention and machine learning, that is, computer software that gets better at its job through experience.
“We can build workflows where we can have humans checking data but it is easier and faster to get machine learning to check the data. Yet how can we reduce the human involvement necessary to handle bigger and bigger data sets?” he says. “We are looking for the best mix of human and artificial intelligence to deliver good data sets for doing good analysis.”
The hope is that the quality control systems set up by the Aligned project will open the door for the next generation of big data systems, making it easier – and cheaper – to deal with big data sets that have a growing complexity.
This would serve the needs of companies seeking to exploit data mining but also academic researchers who need ways to build and maintain data sets.
“Both enterprises and academics see the power of sharing and reusing data on the web but cleaning and maintaining data so that flexible applications or analytics can be built on is still a challenge and Aligned will address this need,” says Dr Feeney.
The consortium is not going to work in a vacuum; it will tackle existing, real world big data applications. The Wolters Kluwer Jurion legal information system will provide a test-bed for the technology flowing from the Aligned project.
Oxford’s software engineering research group is already working on a project with the UK National Health Service to share data and the results of drug trials coming from academic cancer research teams. This existing work will be brought into the Aligned mix and extended to cover big data in general and not just medical data sets.
Another intriguing project that already involves Dr Feeney and colleagues at Leipzig University is the international Seshat Global History Databank. It has the ambitious goal of bringing together everything we know about every human society that ever existed, from now back to prehistory.
The project began three years ago and has already assembled 30,000 data points. Dr Feeney serves as its technical editor.
It is an ideal example of what Aligned wants to accomplish, given that it will take existing data sets and publish expert-curated data that is both accurate and accessible, Dr Feeney says.
Aligned is also a real reputation builder for Trinity given the university’s lead role. It was one of only seven software engineering projects from among 75 submissions to receive funding in the first round of Horizon 2020 information technology funding.
Even though the project predominantly serves the needs of academics and companies, those with an interest in data mining can still participate, says Dr Feeney. “You can do this at home but it is not easy,” he says. “But if you can run a statistics package then you can mine data. You need to be adept at computers. It is not like using Word, but you don’t need a PhD or be a multinational corporation to use this.”
Part of the problem is whether the data set is open to general access or belongs to a company.
“The thing that makes it uneven is lots of the data sets are owned by corporations for looking at customer behaviour. But data is becoming increasingly accessible. And we are big on open publishing and will be making our findings available anywhere.”