AI systems are only as good as the data we feed them

By focusing on the data rather than solely on the model, it’s possible to improve the performance of AI systems. Students Asfandyar Azhar and Nidhish Shah of the AI honors track arrived at this conclusion after placing third in a data-centric AI competition. They became so enthusiastic that they now want to launch a course about data-centric AI at TU/e.

Photo: screenshot of the meeting with Andrew Ng.

Data is an essential part of artificial intelligence, but most people who work on AI prefer to focus on the development of models. That’s a shame, students Asfandyar Azhar (Data Science) and Nidhish Shah (Computer Science) believe, because you sometimes obtain better results when you improve a dataset than when you modify the AI model.

The students came to this conclusion last September, when they took part in the virtual Data-Centric AI Competition, in which participants are asked to improve a dataset for a fixed model. Azhar and Shah applied numerous methods to improve the data quality, including manually combing through 10,000 images in the dataset. An undeniably tedious job, Shah says. “No one wakes up in the morning thinking: today, I’m going to clean data, and it will be a fun day.”


Other methods proved to be more interesting, Azhar says. “We used active learning, for example.” With this method, you let a model identify which data is most useful to learn from. The students wrote their own algorithm, which asked the existing model whether an image was useful or not. If it was, the image was selected for the dataset.
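The article does not describe how the students' algorithm judged usefulness. As a minimal sketch of the general idea, assuming that "useful" means "the model is uncertain about this image", one common approach is to score each image by the entropy of the model's predicted class probabilities and keep the images above a cutoff. The function names, the threshold, and the example probabilities below are all hypothetical and are not taken from the students' submission.

```python
import math


def prediction_entropy(probs):
    """Shannon entropy (in bits) of one predicted class distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def select_useful(predictions, threshold=0.8):
    """Return the indices of images the model is uncertain about.

    `predictions` holds one list of class probabilities per image, as a
    classifier might output them. The threshold is a hypothetical value
    chosen for illustration, not one reported in the article.
    """
    return [
        i
        for i, probs in enumerate(predictions)
        if prediction_entropy(probs) >= threshold
    ]


# Made-up model outputs for four images over two classes.
predictions = [
    [0.99, 0.01],  # model is confident: little left to learn here
    [0.55, 0.45],  # model is unsure: likely worth keeping
    [0.90, 0.10],
    [0.50, 0.50],  # maximally uncertain
]

selected = select_useful(predictions)
print(selected)
```

With these made-up numbers, only the second and fourth images pass the cutoff. In practice the threshold would need tuning, and other usefulness criteria (for example, the margin between the two most likely classes) are equally plausible.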

Boring or not, the results were impressive. The students managed to improve the model’s accuracy by twenty percent by cleaning the data. “In the real world, that would be a lot,” Shah says. “We were so impressed by it that we decided to continue to work in the field of data-centric AI.”

And more people should do that, Shah believes. “Practically all research in the field of AI focuses on the improvement of existing models, which we simply feed terabytes of data. But the models don’t leave much room for improvement anymore; we’ve practically reached the limit.” That is why, in his view, the gains that can be made by improving data are now relatively much larger.

Andrew Ng

The students took third place with their submission and were invited to an online meeting with Andrew Ng, professor at Stanford University and organizer of the competition. “At the end of our conversation, we asked him if we could connect with him on LinkedIn. He said yes, despite the fact that he still has 26,000 open requests.” Azhar laughs: “It’s like a stamp of approval.”

Computer scientist Ng is a huge source of inspiration for Azhar and Shah, who took their first steps in the world of AI with his Deep Learning course. “I always feel that I learn more slowly than other students. His way of explaining things made it easier for me,” Azhar says. “I never expected to work in the field of AI; I thought that it was too complicated for me.”

Elective course

Inspired by Andrew Ng, the students now want to launch their own elective course about data-centric AI at TU/e. “You can find some things about it online, but there’s no complete course that concentrates specifically on this subject,” Azhar says. Students don’t even know that they can look for data-centric AI, Shah adds. The main reason for launching their course is to introduce students to data-centric AI.

“We have a rough outline, but we haven’t gained a foothold yet. Lecturers are enthusiastic about it and also see the need for such a course, but they’re not the ones who make the decisions on new courses. I don’t know who does, but I’m going to hunt them down,” Azhar says.
