A THEORETICAL FRAMEWORK FOR AI-DRIVEN DATA QUALITY MONITORING IN HIGH-VOLUME DATA ENVIRONMENTS

Nikhil Bangad; Vivekananda Jayaram; Manjunatha Sughaturu Krishnappa

Authors

Nikhil Bangad, Meta Inc., Texas, USA
Vivekananda Jayaram, JPMorgan Chase, Texas, USA
Manjunatha Sughaturu Krishnappa, Oracle America Inc., California, USA

Keywords:

Artificial Intelligence, Customer Interaction Platform, Predictive Analytics, Data Integration, ETL Processing, Data Warehouse, Distributed Systems

Abstract

This paper presents a theoretical framework for an AI-driven data quality monitoring system designed to address the challenges of maintaining data quality in high-volume environments. We examine the limitations of traditional methods in managing the scale, velocity, and variety of big data and propose a conceptual approach leveraging advanced machine learning techniques. Our framework outlines a system architecture that incorporates anomaly detection, classification, and predictive analytics for real-time, scalable data quality management. Key components include an intelligent data ingestion layer, adaptive preprocessing mechanisms, context-aware feature extraction, and AI-based quality assessment modules. A continuous learning paradigm is central to our framework, ensuring adaptability to evolving data patterns and quality requirements. We also address implications for scalability, privacy, and integration within existing data ecosystems. Although the paper does not provide practical results, it lays a robust theoretical foundation for future research and implementations, advancing data quality management and encouraging the exploration of AI-driven solutions in dynamic environments.

