Data issues hamper successful machine learning and AI launch
Source: John Liu
According to a survey of 277 data scientists and other artificial intelligence (AI) professionals at large companies across nearly 20 industries, AI is still in its early days, and many challenges stand in the way of its successful implementation.
Nearly all companies pursuing AI projects have run into problems with data quality and data labeling. Four out of five data scientists said training AI with data is more difficult than they thought, according to the survey conducted by Dimensional Research.
More specifically, the challenges are errors in the data, too little data, data that is not in a usable form, and a shortage of people and tools to label data.
The amount of data required to train an AI algorithm is huge: 72% of respondents reported that, in their current project, production-level model confidence will require more than 100,000 labeled data items, and 10% indicated they’d need more than 10 million.
“Labeling and annotating training data for machine learning projects is a serious problem for data science teams, and a significant obstacle to getting those projects into production,” says the report, titled “What Data Scientists Tell Us about AI Model Training Today” and published by Alegion, which commissioned Dimensional Research to conduct the survey.
Human resource issues
Nearly two-thirds of the data scientists surveyed said their machine learning (ML) projects have progressed beyond proof of concept (POC), the litmus test for an idea such as identifying strawberries and how ripe they are.
The next phase of feeding the algorithm with enough data “to be ready for validation in the real world” presents a host of challenges, the report says.
Human resources are also an issue: 80% of data scientists’ time is spent on preparing and managing data. This is problematic for companies because data scientists are expensive, and it is dissatisfying for data scientists “who take the job to do interesting, challenging and strategic work, not to draw boxes,” the report says. As a result, data scientists are left with little time to do what they were hired to do: use machine learning to improve the business and potentially “carve out a position of industry dominance through innovation,” the report says.
Outsourcing produces results
The surveyed data scientists said “offloading training data labeling and annotation” is associated with significantly higher rates of successful project deployment. This is not surprising, given the typical volume of training data involved, the small team sizes and the numerous data quality issues.
“Depending on the volume of data the algorithm requires as well as the number and complexity of the tasks needed to structure the data appropriately, an ML project team may need to find, train, and manage hundreds of people,” says the report.
Up to 71% of companies have outsourced some AI or ML activities. Companies that lack the right data, or enough of it, often outsource the data collection task.
Also, people are needed to supply human judgement to the data preparation process. “These data specialists drive the tools, label the data and evaluate the work of other people,” the report says.