LEVERAGING GENERATIVE ADVERSARIAL NETWORKS (GANS) FOR REALISTIC SYNTHETIC DATA GENERATION

Harish Narne

Authors

Harish Narne, Application Engineer, UiPath Inc, USA.

Keywords:

Leveraging Generative Adversarial Networks (GANs), Realistic Synthetic Data Generation, Real Data Could

Abstract

Creating synthetic data requires reproducing the mathematical and statistical properties of the original data with precision. In industries such as banking, using and sharing real data for research or model building raises significant privacy concerns because the data often contain sensitive information. Real data can also be hard to obtain, especially in niche fields where collecting a wide variety of high-quality records is expensive or difficult. Such scarcity or limited availability can hinder the training and testing of machine learning models. This article addresses that problem. Specifically, we aim to create a new dataset that shares the characteristics of an existing stock market dataset. The anonymized input dataset has several issues: imbalance, missing rows, duplicates, improper data formatting (no columns or rows), and values that are not normalized, scaled, or balanced. We examine generative adversarial networks (GANs) as a deep-learning approach, assess their ability to produce synthetic data, and compare the output with the original stock dataset. The core of our contribution is generating synthetic datasets that conceal some information while imitating the statistical features of the input. For example, the synthetic datasets can replicate the real dataset's distributions of stock prices, trading volumes, and market trends. The added variety in the generated datasets lets researchers and industry practitioners investigate a wider range of market conditions and investment strategies, and it has the potential to make machine-learning models more robust and better at generalising. We assess the synthetic data using averages, similarity measures, and correlations.
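The evaluation idea mentioned in the abstract, comparing averages and correlations between the real and the synthetic data, can be illustrated with a minimal sketch. It is not the author's implementation. The column names (price, volume), the helper names (toy_market, compare_datasets), and the toy data are all assumptions made for illustration; in particular, the "synthetic" table here is simply a second draw from the same toy process, standing in for the output of a trained GAN generator.

```python
import math
import random


def mean(xs):
    """Arithmetic mean of a list of floats."""
    return sum(xs) / len(xs)


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (assumes non-zero variance)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def compare_datasets(real, synth):
    """Compare two tables given as {column name: list of floats}.

    Returns the absolute gap in column means and the absolute gap in
    pairwise correlations. Smaller gaps mean the synthetic table is
    statistically closer to the real one on these two measures.
    """
    cols = sorted(real)
    mean_gap = {c: abs(mean(real[c]) - mean(synth[c])) for c in cols}
    corr_gap = {}
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            gap = pearson(real[a], real[b]) - pearson(synth[a], synth[b])
            corr_gap[(a, b)] = abs(gap)
    return mean_gap, corr_gap


def toy_market(n, seed):
    """Hypothetical toy data: volume loosely tracks price, plus noise."""
    rng = random.Random(seed)
    price = [100 + rng.gauss(0, 5) for _ in range(n)]
    volume = [1000 + 20 * (p - 100) + rng.gauss(0, 50) for p in price]
    return {"price": price, "volume": volume}


real = toy_market(500, seed=0)
synthetic = toy_market(500, seed=1)  # stand-in for GAN generator output

mean_gap, corr_gap = compare_datasets(real, synthetic)
print("mean gaps:", mean_gap)
print("correlation gaps:", corr_gap)
```

Comparing a table with itself gives gaps of exactly zero, which is a quick sanity check before applying the function to real generator output. Matching means and correlations is only a coarse check; it does not by itself show that the full distributions agree.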

