Moving Fast With Broken Data: Implementing an Automatic Data Validation System for ML Pipelines

I recently came across an insightful research paper titled "Moving Fast With Broken Data" by Shreya Shankar, Labib Fawaz, Karl Gyllstrom, and Aditya G. Parameswaran from UC Berkeley and Meta. The paper addresses the significant issue of data corruption in machine learning (ML) pipelines, which often leads to decreased model accuracy. The authors present an automatic data validation system implemented at Meta that aims to solve this problem.

The paper highlights that ML models in production pipelines are frequently retrained on the latest partitions of continually growing datasets. Due to engineering bugs, these partitions often contain corrupted features, so it is crucial to detect data issues and block retraining before the model's accuracy suffers. The hard part is deciding when a partition is corrupted enough to block retraining: block too often and production is left with stale models, block too rarely and broken models get through.

The authors present the Partition Summarization (PS) approach to data validation, where each timestamp-based partition of data is summarized with data quality metrics, and these summaries are compared to detect corrupted partitions. The PS approach can be adapted to several data validation methods, each with its own pros and cons. Because none of these methods alone met the requirements for high precision and recall in detecting corruptions, the authors devised GATE, a data validation method designed for both. GATE showed a 2.1x average improvement in precision over the baseline in a case study with Instagram's data.
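To make the summarize-then-compare idea concrete, here is a minimal sketch of a PS-style check. It is not the paper's GATE method and not Meta's implementation: it is a plain z-score comparison that I am using as a stand-in, and the function names (`summarize`, `flag_partition`), the two metrics (completeness and mean), and the threshold of 3.0 are all my own assumptions for illustration.

```python
import math
import statistics
from typing import Dict, List, Optional

# A partition is a list of rows; each row maps feature name -> value (None = missing).
Partition = List[Dict[str, Optional[float]]]
Summary = Dict[str, float]


def summarize(partition: Partition, features: List[str]) -> Summary:
    """Reduce one partition to a flat dict of data quality metrics.

    Hypothetical metric choice: per-feature completeness and mean.
    """
    summary: Summary = {}
    n_rows = len(partition)
    for feature in features:
        values = [row[feature] for row in partition if row.get(feature) is not None]
        summary[f"{feature}.completeness"] = len(values) / n_rows if n_rows else 0.0
        summary[f"{feature}.mean"] = statistics.fmean(values) if values else 0.0
    return summary


def flag_partition(history: List[Summary], new: Summary,
                   z_threshold: float = 3.0) -> List[str]:
    """Return the metrics in `new` that deviate strongly from `history`.

    `history` must be non-empty. This is a simple z-score rule, used here
    only as an illustrative comparison step.
    """
    flagged = []
    for metric, value in new.items():
        past = [summary[metric] for summary in history]
        mu = statistics.fmean(past)
        sigma = statistics.pstdev(past)
        if sigma == 0:
            # The metric never varied historically: any change is suspicious.
            if not math.isclose(value, mu):
                flagged.append(metric)
        elif abs(value - mu) / sigma > z_threshold:
            flagged.append(metric)
    return flagged


def should_block_retraining(history: List[Summary], new: Summary) -> bool:
    """Block retraining if any metric of the new partition is flagged."""
    return len(flag_partition(history, new)) > 0
```

In use, you would keep a rolling window of summaries from recent partitions, call `summarize` on the newest partition, and skip retraining when `should_block_retraining` returns `True`. A rule this simple tends to raise many alerts, which is the kind of precision problem that motivates a more careful method.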

For practitioners, the takeaway is to validate each new partition automatically before retraining on it. The PS approach gives a way to catch data issues before they reach the model, and GATE improves the precision and recall with which corrupted partitions are detected, which in turn helps keep model accuracy high.
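Since precision and recall are the yardsticks here, it helps to spell out how they would be computed for a validator. The sketch below assumes you have a set of partitions the validator flagged and a set known to be corrupted. The function name, the partition identifiers, and the convention of returning 1.0 when a denominator is empty are my own assumptions, not something taken from the paper.

```python
from typing import Set, Tuple


def precision_recall(flagged: Set[str], corrupted: Set[str]) -> Tuple[float, float]:
    """Precision and recall of a validator over a set of partitions.

    precision = flagged partitions that were really corrupted / all flagged
    recall    = corrupted partitions that were flagged / all corrupted
    """
    true_positives = len(flagged & corrupted)
    # Assumed convention: an empty denominator yields 1.0 (nothing to get wrong).
    precision = true_positives / len(flagged) if flagged else 1.0
    recall = true_positives / len(corrupted) if corrupted else 1.0
    return precision, recall


# Hypothetical partition identifiers, for illustration only.
flagged = {"2023-01-03", "2023-01-05", "2023-01-07"}
corrupted = {"2023-01-03", "2023-01-04"}
precision, recall = precision_recall(flagged, corrupted)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Low precision means retraining is blocked for partitions that were actually fine, while low recall means corrupted partitions slip through to training.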

Overall, "Moving Fast With Broken Data" is a valuable account of how data corruption affects ML pipelines and of the automatic data validation system built at Meta to handle it. As someone who closely follows advancements in machine learning, I found it an essential read for understanding why data validation matters in the field.