This paper presents a framework for accelerating machine learning pipelines. By recording fine-grained data provenance and applying reuse strategies, the system identifies and eliminates redundant computations. The approach addresses two key challenges, managing extensive data traces and accommodating non-deterministic operations, through advanced duplication and hierarchical reuse techniques. The framework integrates with existing data processing environments, and the paper reports substantial efficiency improvements and faster iterative development cycles for data professionals.
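To make the idea of provenance-based reuse concrete, the following is a minimal illustrative sketch, not the framework described in the paper. It assumes that each step's provenance can be summarized as a hash of the step name, its parameters, and the provenance keys of its inputs, and that a step marked non-deterministic is never reused. All names (`ReuseCache`, `source`, `run`, `scale`, `total`) are hypothetical.

```python
import hashlib
import json


def _digest(obj):
    # Stable hash of a JSON-serializable provenance record.
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode("utf-8")).hexdigest()


class ReuseCache:
    """Toy provenance-keyed result cache (hypothetical, for illustration only)."""

    def __init__(self):
        self.results = {}    # provenance key -> cached output
        self.executions = 0  # number of steps that actually ran

    def source(self, value):
        # Raw input data: its key is derived from the content itself.
        return _digest({"source": value}), value

    def run(self, step_name, func, inputs, params=None, deterministic=True):
        # `inputs` is a list of (key, value) pairs returned by source() or run().
        params = params or {}
        record = {"step": step_name, "params": params,
                  "inputs": [key for key, _ in inputs]}
        key = _digest(record)
        if deterministic and key in self.results:
            return key, self.results[key]  # redundant computation skipped
        value = func(*[val for _, val in inputs], **params)
        self.executions += 1
        if deterministic:
            self.results[key] = value
        else:
            # Assumption: a non-deterministic output gets a fresh key per run,
            # so downstream steps never reuse results derived from an older run.
            record["run"] = self.executions
            key = _digest(record)
        return key, value


# Two example steps, assumed pure (same inputs give the same output).
def scale(xs, factor):
    return [x * factor for x in xs]


def total(xs):
    return sum(xs)
```

In this sketch, rerunning an unchanged pipeline executes nothing, while changing one parameter reruns only that step and the steps downstream of it. A real system would also need to handle storage limits for the recorded traces, which this example ignores.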
@article{mohammed2025,
author = {Ahmed Sarwar Mohammed},
title = {{Data Workflow Acceleration: A Smart System for Redundancy Elimination in Machine Learning Pipelines}},
journal = {Journal of Information Technology and Digital World},
volume = {7},
number = {4},
pages = {271-282},
year = {2025},
publisher = {Inventive Research Organization},
doi = {10.36548/jitdw.2025.4.002},
url = {https://doi.org/10.36548/jitdw.2025.4.002}
}

