Training Set Cleaning
Why is this challenge important?
When dealing with massive datasets, noise becomes inevitable. This is an increasingly common problem for ML training, and noise in a dataset can come from many places:
Natural noise introduced during data acquisition.
Algorithmic labeling: e.g., weak supervision and labels generated automatically by machines.
Data collection biases (e.g., biased hiring decisions).
If trained on such noisy datasets, ML models may suffer not only lower quality but also risks along other quality dimensions such as fairness. Careful data cleaning can often mitigate this; however, it is a very expensive process if we need to investigate and clean every example. By using a more data-centric approach, we hope to direct human attention and cleaning effort toward the examples that matter most for improving the ML model.
In this data cleaning challenge, we invite participants to design and experiment with data-centric approaches to strategic cleaning of the training set of an image classification model. As a participant, you will be asked to rank the samples in the entire training set; we will then clean them one by one and evaluate the model's performance after each fix. The earlier the model reaches a high enough accuracy, the better your submission ranks.
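To make the ranking task concrete, here is a minimal baseline sketch: rank training examples by the model's own per-example loss on their assigned labels, so that likely-mislabeled examples come first. The function name, the use of logistic regression, and the synthetic demo data are all illustrative assumptions, not part of the official challenge API.

```python
# Hypothetical baseline: rank training examples by how "suspicious" their
# assigned labels look, using the model's per-example loss as the signal.
import numpy as np
from sklearn.linear_model import LogisticRegression

def rank_examples(X, y, seed=0):
    """Return indices of training examples, most suspicious first."""
    model = LogisticRegression(max_iter=1000, random_state=seed)
    model.fit(X, y)
    # Per-example negative log-likelihood of the assigned label.
    proba = model.predict_proba(X)
    nll = -np.log(proba[np.arange(len(y)), y] + 1e-12)
    # High loss -> likely mislabeled -> should be cleaned first.
    return np.argsort(-nll)

# Tiny synthetic demo: two well-separated clusters, with a few labels flipped.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.3, (50, 2)), rng.normal(2, 0.3, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
flipped = [3, 27, 81]
y[flipped] = 1 - y[flipped]

ranking = rank_examples(X, y)
# The flipped examples should appear at the front of the ranking.
print(sorted(ranking[:3].tolist()))
```

In practice, stronger baselines exist (e.g., cross-validated losses or influence-based scores), but the submission shape is the same: an ordering over all training examples.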
Similar to the other DataPerf challenges, the cleaning challenge comes in two flavors: an open division and a closed division.
In the open division, you will submit the output of running your cleaning algorithm on a given dataset. Then we will train the model and evaluate it based on your submission.
In the closed division, you will submit the cleaning algorithm itself, and we will run your algorithm to generate the output on several hidden datasets. Then we evaluate your submissions.
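For the closed division, the submission is the algorithm itself, so it must be a self-contained piece of code with a single entry point the organizers can call on hidden datasets. The class name, the method signature, and the centroid-distance heuristic below are all assumptions for illustration, not the official interface.

```python
# Hedged sketch of a closed-division submission: a self-contained cleaning
# algorithm exposing one entry point that can be run on any (X, y) dataset.
import numpy as np

class CentroidDistanceCleaner:
    """Rank examples by distance from their own class centroid, a simple,
    model-free proxy for label noise."""

    def rank(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        centroids = {c: X[y == c].mean(axis=0) for c in np.unique(y)}
        # Examples far from their own class centroid are suspicious.
        dists = np.array([np.linalg.norm(x - centroids[c])
                          for x, c in zip(X, y)])
        return np.argsort(-dists)

# Tiny demo: the last point sits in class 0's cluster but is labeled 1,
# so it should be ranked first (i.e., cleaned first).
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0],
              [5.1, 5.0], [5.0, 4.9], [0.0, 0.1]])
y = np.array([0, 0, 1, 1, 1, 1])
ranking = CentroidDistanceCleaner().rank(X, y)
print(ranking[0])  # → 5
```

Because the hidden datasets are unknown, a closed-division algorithm should avoid dataset-specific constants and rely only on the features and labels it is given.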
Contact
If you have any questions, please do not hesitate to contact Xiaozhe Yao. We also have an online discussion forum at https://github.com/DS3Lab/dataperf-vision-debugging/discussions.