CREATING EFFICIENT AND SCALABLE DATA PIPELINES FOR CLOUD-BASED ANALYTICS

Authors

  • Raghavendra Sirigade, Texas A&M University, College Station, USA

Keywords:

Cloud-based Data Orchestration, Google Composer, Scalable Data Pipelines, Data Build Tool (DBT), Analytics Infrastructure Optimization

Abstract

This article presents a comprehensive approach to optimizing data orchestration for cloud-based analytics by transitioning from Apache Airflow to Google Composer within the Google Cloud Platform (GCP) ecosystem. The research addresses critical challenges in data pipeline management, including security, efficiency, and scalability, while leveraging GCP's managed services and adhering to industry best practices. By implementing Google Dataproc for dynamic resource allocation and integrating the Data Build Tool (DBT) for streamlined analytics code deployment, the study demonstrates significant improvements in data processing capabilities. The new architecture, built on a secure Virtual Private Cloud (VPC) network, incorporates auto-scaling mechanisms for compute nodes, storage, and network resources, ensuring adaptability to varying workloads. Results show a 40% reduction in pipeline execution time, a 300% increase in data volume handling capacity, and a 25% reduction in cloud infrastructure costs. Furthermore, the integration of DBT facilitated rapid deployment of analytics code, fostering a collaborative development approach and adherence to software engineering best practices. This solution not only enhanced data orchestration efficiency and pipeline scalability but also improved overall analytical capabilities, enabling more sophisticated machine learning models and reducing time-to-insight by 50%. The article provides valuable insights for organizations seeking to enhance their cloud-based analytics infrastructure in an increasingly data-driven business environment.

Published

2024-09-30