Some thoughts about the "data quality in AI projects" conversation


There's been a lot of talk recently about data quality, especially since Andrew Ng published his thoughts about it a couple of months ago, essentially saying that algorithms are good enough as they are, and that model accuracy needs to be improved by feeding them better-quality data. As far as this statement goes, I wholeheartedly agree.

To prove this point, Andrew Ng's DeepLearning.ai is teaming up with Andrew Ng's Landing.ai to launch an AI competition, but with a twist:

The goal is not to create the most accurate model given a dataset; it's to create the dataset that will give the best accuracy given a specific model. All you can change are the data samples.

What are the rules?

a) It's an image recognition competition, where the goal is to recognize handwritten Roman numerals.

b) The model's input constraint is that you need to provide a dataset for each of i, ii, iii, iv, etc. So basically the data model is fixed. All you can do is clean up the data and labels and enrich the samples with new data.

I think this is a great initiative, but I have some remarks. I have commented on this here at Luddites and Executives (here, here and here) and also on LinkedIn (here and here) but wanted to chime in more specifically with what I think of this recent spate of "data quality" focus and Andrew Ng's way of presenting the problem.

"Quality" is not the issue

I believe that the way the "data quality" problem is presented in this conversation is wrong, and that it misses important points about the problem of data in real-world AI projects.

My view is that the largest room for improvement in model accuracy is in neither the algorithm nor the training data "quality".

The way to improve AI outcomes is to focus on understanding the semantic goal of your project, and how that goal is represented in and supported by your data. Then you can select and tune the AI model, the data samples and the data model to reflect your target semantics in the most effective way. A data project that is limited to cleaning up examples of individual classes, and is constrained to provide only positive examples of each class, might give us some interesting metrics and findings about cleaning up image data samples. But it will in no way address the data problem that real-world companies trying to do AI actually face: the goal of AI in the real world is not to recognize that a photo belongs to a certain category, it's to achieve a specific knowledge outcome that supports a decision.

If you want your AI in the real world to achieve such an outcome, you have to build a pipeline with several layers of data and models, some acting as triggers for other layers, and so on.

What makes data both useful and specific? It's what they mean, i.e. what information content they have!

And by the way, when I say "it's all about semantics", I don't mean this in the typical nerdy semantic web sense. I mean that it's about the data model.

There is a very annoying ambiguity in the AI world: the word "model". A model can be an ML engine that has been trained to assign values to inputs of some type, but a model can also be a way to structure datasets; in the latter case, it's the data model. I wonder whether the fact that the same word is used for both concepts is a cognitive or sociological reason why AI engineers tend not to see this data modeling problem.

Because it's not about "clean" data. It's about having data models that are consistent and meaningful according to what you're trying to accomplish.

For instance, to contrive an example using the Roman numerals in this competition, let's say that by analyzing and understanding my data I realize that there's an obvious confusion that makes it difficult, even for a human, to distinguish between some of the numerals: some are written in cursive and others in block script, and some use the top and bottom bars while others don't. In that case, I might want to segment my dataset between cursive and block first, and then branch to a second model that recognizes the actual numbers once they have been categorized as block or cursive. Obviously I don't have a clue whether that's the case here, and there may be other characteristics.

Even taking this simple and relatively intuitive example, I could then decide to not have just classes for {i, ii, iii, iv,...}, but have classes for {"i cursive with bars", "i cursive without bars", "i block", ii, iii, "iv with bars", "iv without bars",...}. Or, I could have three models: one that has classes {"cursive", "block"}, and then I have separate "cursive" and "block" models trained to run on the two output sets, each with classes {i, ii, iii, iv,...}.
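To make the two options a bit more concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the names (flat_classes, two_stage, the toy models) are hypothetical, the "models" are plain functions that read a field from a dictionary instead of trained classifiers, and the flat option only crosses numerals with script style, leaving out the bars. It only shows how the two data models differ in shape, not how the competition's model works.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-ins: a sample is a dict, a "model" is any callable
# that maps a sample to a label.
Sample = Dict[str, str]
Model = Callable[[Sample], str]

NUMERALS = ["i", "ii", "iii", "iv"]
STYLES = ["cursive", "block"]


def flat_classes(numerals: List[str], styles: List[str]) -> List[str]:
    """Option 1: one model whose class set encodes numeral and style together."""
    return [f"{numeral} {style}" for numeral in numerals for style in styles]


def two_stage(
    style_model: Model, numeral_models: Dict[str, Model]
) -> Callable[[Sample], Tuple[str, str]]:
    """Option 2: a style model routes each sample to a per-style numeral model."""

    def predict(sample: Sample) -> Tuple[str, str]:
        style = style_model(sample)
        numeral = numeral_models[style](sample)
        return style, numeral

    return predict


# Toy placeholders, not trained models: they simply read the answer
# from the sample so that the routing logic can be exercised.
def toy_style_model(sample: Sample) -> str:
    return sample["style"]


def toy_numeral_model(sample: Sample) -> str:
    return sample["numeral"]


pipeline = two_stage(
    toy_style_model, {style: toy_numeral_model for style in STYLES}
)

print(flat_classes(NUMERALS, STYLES))
print(pipeline({"style": "block", "numeral": "iv"}))
```

In the flat option the number of classes grows with every feature you cross in, while in the routed option each numeral model only ever sees one script style. Which of the two works better is exactly the kind of thing you would have to find out by experiment.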

This would depend on me analyzing the data, running different models on it to see what trips them up, and running some experiments to design the most promising combination of ML model and data model to do my processing.
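One simple way to "see what trips them up" is to count which pairs of true and predicted labels get mixed up most often. The sketch below assumes you already have two lists of labels from some model run; the function names and the toy labels are made up for illustration and are not results from the competition.

```python
from collections import Counter
from typing import List, Tuple


def confusion_counts(true_labels: List[str], predicted: List[str]) -> Counter:
    """Count how often each (true, predicted) label pair occurs."""
    if len(true_labels) != len(predicted):
        raise ValueError("label lists must have the same length")
    return Counter(zip(true_labels, predicted))


def top_confusions(counts: Counter, n: int = 3) -> List[Tuple[Tuple[str, str], int]]:
    """Return the n most frequent misclassified (true, predicted) pairs."""
    errors = [(pair, count) for pair, count in counts.items() if pair[0] != pair[1]]
    # Most frequent first; ties are broken alphabetically for stable output.
    errors.sort(key=lambda item: (-item[1], item[0]))
    return errors[:n]


# Made-up labels, for illustration only.
true_labels = ["ii", "ii", "ii", "iii", "iv"]
predicted = ["iii", "iii", "ii", "iii", "vi"]

for (truth, guess), count in top_confusions(confusion_counts(true_labels, predicted)):
    print(f"{truth} mistaken for {guess}: {count}")
```

If one pair dominates the list, that is a hint to look at those samples and ask whether a feature such as script style is hiding behind the confusion, and whether the data model should change to reflect it.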

But if you have a feature like this in your dataset, i.e. some feature that is a confounder, or that helps clarify the semantics of your data and makes it more digestible by AI models, then it's your data model that you need to evolve, not the "quality" of the samples themselves. This competition will not be able to discover this, because you have to provide the fixed model with data in a data model that has already been decided, with a preprocessing step and an ML model that are the same for everybody.

Of course, Andrew Ng and his followers have a rather academic and AI-theoretical bent, and this competition is interesting as an academic exercise to illustrate something about data quality, for sure.

But I think that this question of the data model is an important blind spot in this discussion, and one that this competition does not address.

Do you have reasons to think I'm wrong about this? This is born out of my experience running AI projects, but other people's mileage may vary.

As usual, feel free to let me know what you think on LinkedIn or here.