Journal of Artificial Intelligence and Big Data, 2025, Volume 5, Number 1
www.scipublications.com/journal/index.php/jaibd
DOI: 10.31586/jaibd.2025.6049
Review Article
Enhancing Scalability and Performance in Analytics Data Acquisition through Spark Parallelism
Hanza Parayil Salim 1,*, Yanas Rajindran 2
1 Staff Engineer, Neiman Marcus, Texas, USA
2 Lead Engineer, AT&T, Texas, USA
*Correspondence: Hanza Parayil Salim (hanzapsalim@gmail.com)
Abstract: Data acquisition is a critical component of modern data architecture, and REST
API integration is one of the most common approaches for sourcing external data. This
study evaluates the efficiency of several methods for collecting data via REST APIs and
benchmarks their performance. It explores how the Apache Spark distributed computing
platform can optimize large-scale REST API calls, improving scalability and processing
speed to meet the demands of high-volume data workflows.
Keywords: Distributed computing, Parallel processing, Data Acquisition, Apache Spark, RESTful
Web Services, REST API, Data Analytics
1. Introduction
REST APIs are commonly used for data acquisition because of their flexibility, scalability,
and standardization, and enterprises rely on them widely to acquire data from external
sources. In most cases, a large number of API calls must be made at a time, which raises
challenges such as latency, rate limits, and error handling. This paper specifically
examines the latency caused by traditional API calls and explores how it can be addressed
using Spark's parallel processing architecture. It discusses how Apache Spark [6]
DataFrames, RDDs, and UDFs (user-defined functions) can be used to parallelize REST API
calls and improve overall performance.
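As a rough illustration of the idea, the following minimal sketch distributes one call per URL across Spark tasks with `mapPartitions`. It is not the implementation benchmarked in this paper: `fake_fetch`, `fetch_partition`, `run_on_spark`, and the partition count are hypothetical names and values chosen for the example, and the HTTP request is replaced by a stand-in so the helper logic can be run without a cluster or a network.

```python
import json
from typing import Callable, Iterable, Iterator, Tuple


def fake_fetch(url: str) -> str:
    """Stand-in for an HTTP GET (assumption): returns a JSON string."""
    return json.dumps({"url": url, "status": 200})


def fetch_partition(
    urls: Iterable[str], fetch: Callable[[str], str] = fake_fetch
) -> Iterator[Tuple[str, str]]:
    """Call the API once per URL in a partition and yield (url, body)."""
    for url in urls:
        yield url, fetch(url)


def run_on_spark(urls, num_partitions=8):
    """Spread the calls over Spark tasks; requires pyspark to be installed."""
    # Imported lazily so the helpers above can be used without Spark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rest-parallel-sketch").getOrCreate()
    # Each partition is processed by one task, so partitions run in parallel.
    rdd = spark.sparkContext.parallelize(urls, num_partitions)
    return rdd.mapPartitions(fetch_partition).toDF(["url", "body"])
```

In practice `fake_fetch` would be replaced by a real HTTP client call, and the number of partitions would be tuned against the rate limits of the target API.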
2. REST API and Apache Spark
A REST API (also called a RESTful web API) is an application programming interface
(API) that follows the design principles of the representational state transfer (REST)
architectural style. REST APIs provide a lightweight, flexible way to integrate
applications and are known for their scalability, flexibility, portability, and
independence, since the client and server are separated. They provide a simple and
efficient method for accessing data from various sources, enabling developers to
integrate systems easily and retrieve information through basic HTTP requests. This makes
the data acquisition process streamlined and effective across different platforms and
programming languages.
In many real-life scenarios, we need to make parallel REST API calls [2]. In
applications with high traffic, such as e-commerce platforms or real-time dashboards,
making parallel API requests keeps the application responsive by reducing the load time
for fetching external data. Parallel API calls also help meet real-time requirements in
scenarios such as real-time analytics or monitoring systems, where data must be collected
from multiple sources without delay, stored in high-performance, scalable storage such as
Delta Lake [4], and then used for analytics and machine learning [11] applications.
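The effect of parallel requests on load time can be illustrated on a single machine before any cluster is involved. The sketch below is an illustration only, not the Spark-based approach examined in this paper: it uses a thread pool from the Python standard library, and `simulated_call`, its 50 ms latency, and the source names are assumptions that stand in for real endpoints.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def simulated_call(source: str, latency_s: float = 0.05) -> dict:
    """Stand-in for one REST call: sleeps to mimic network latency (assumption)."""
    time.sleep(latency_s)
    return {"source": source, "rows": len(source)}


def fetch_sequential(sources):
    """One call after another: total time grows with the number of sources."""
    return [simulated_call(s) for s in sources]


def fetch_parallel(sources, max_workers=8):
    """Overlap the waiting time of the calls; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(simulated_call, sources))


def timed(fn, *args):
    """Return (result, elapsed_seconds) for a single invocation of fn."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start
```

Calling `timed(fetch_sequential, sources)` and `timed(fetch_parallel, sources)` on the same list of sources returns identical results, with the parallel variant finishing sooner because the simulated waits overlap.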
How to cite this paper:
Salim, H. P., & Rajindran, Y. (2025). Enhancing Scalability and Performance in
Analytics Data Acquisition through Spark Parallelism. Journal of Artificial
Intelligence and Big Data, 5(1), 38-46.
DOI: 10.31586/jaibd.2025.6049
Received: February 2, 2025
Revised: March 8, 2025
Accepted: March 19, 2025
Published: March 22, 2025
Copyright: © 2025 by the authors. Submitted for possible open access publication
under the terms and conditions of the Creative Commons Attribution (CC BY) license
(http://creativecommons.org/licenses/by/4.0/).