Abstract
Artificial Intelligence (AI) has become a transformative technology across many industries, enabling applications such as image recognition, natural language processing, and autonomous systems. The performance of an AI model depends critically on the quality and quantity of the data used to train it. However, acquiring and labeling large training datasets is resource-intensive and time-consuming, and it often raises privacy concerns. Synthetic data, meaning artificially generated data that mimics the distribution and characteristics of real-world data, is a promising way to address these challenges. This study explores the role of synthetic data in improving AI accuracy. Using techniques from computer graphics, data augmentation, and generative modeling, researchers and practitioners can create diverse and representative synthetic datasets that supplement or replace traditional training data.
