Value Proposition and ETL Process in Big Data Environment

Prateek Kumar 1 and Veena Gaded 2*
1 RVCE, Bangalore, Karnataka, India.
2 RVCE, Bangalore, Karnataka, India. Email: veena.gadad@gmail.com
*Corresponding Author

Abstract: For any retail company, managing inventory is of prime importance. Every store should stock enough items to fulfill demand, so stores must be restocked before items run out of stock. Restock shipments originate at a fulfillment center, which supplies the distribution centers that serve the stores. Since distribution centers and fulfillment centers are generally far apart, there is a delay between the request for a restock and the arrival of items at the distribution center. To prevent out-of-stock conditions, the request must be made with this transit time in mind. The quantity of an item also determines the request time: only a few units of large items can be sent at once, so multiple transits may be needed to restock to the required numbers. Other conditions, such as general traffic and seasonal climate variations, can also affect the transit time of items. All of these conditions must be accounted for when deciding when an item is requested. The proposed system decides the request time and quantity of items, along with these variations, by training on years of data. This allows the system to work more efficiently and prevent out-of-stock conditions, increasing the company's sales.

Keywords: Big data, ETL process, HDFS, SparkML, SparkSQL, Value proposition.

I. Introduction

The value of a project is determined by how well it can serve the organization. Knowing what the project is supposed to do, and how it is going to achieve its goals, helps determine what the final result will look like.
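The restock-timing decision described in the abstract can be sketched as a classic reorder-point rule. This is an illustrative textbook formula with hypothetical numbers, not the paper's trained model, which learns these values from historical data:

```python
# Illustrative reorder-point calculation for the restock-timing
# problem in the abstract. Formula and numbers are a textbook
# sketch, not the paper's trained model.

def reorder_point(daily_demand, lead_time_days, safety_stock=0):
    """Stock level at which a restock request should be issued.

    daily_demand   -- average units sold per day
    lead_time_days -- transit time from fulfillment center to
                      distribution center (affected by traffic,
                      season, etc.)
    safety_stock   -- buffer against demand and lead-time variation
    """
    return daily_demand * lead_time_days + safety_stock

def transits_needed(request_qty, units_per_transit):
    """Large items may need multiple transits to reach the
    requested quantity (ceiling division)."""
    return -(-request_qty // units_per_transit)

# Example: 40 units/day demand, 5-day transit, 60 units of buffer.
rop = reorder_point(40, 5, safety_stock=60)   # 260 units
trips = transits_needed(500, 120)             # 5 transits
```

A learned system would replace the constant `daily_demand` and `lead_time_days` with per-item estimates that vary with season and traffic.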
This understanding allows determining the business value of the project and enables those responsible to make better decisions throughout its development. Various graphs are used to display the data and surface ongoing problems in a form that non-technical people can understand [1]. Since the data is huge, the visualization software must be selected for fast data ingestion and quick indexing so that visualizations render quickly. A Lucene-based text search engine is preferred for this operation because it stores indices as text, which makes searching faster; hence the Elastic stack is chosen for the project [7]. The Elastic stack contains Elasticsearch, a Lucene-based text search engine; Logstash, for data ingestion; and Kibana, for creating visualizations [8]. These visualizations are then studied to determine the importance of the project.

ETL (Extract, Transform, Load) is a database process. Data extraction pulls data from various sources, both homogeneous and heterogeneous [9]. Data transformation converts the data into a format that can be used for analysis. Data loading places the data in suitable storage, such as another database or a data mart. To increase performance, the process is parallelized; in this project, the parallelization is done using Spark. Spark also provides tools such as SparkSQL and SparkML, which makes it an all-in-one tool for this project. SparkSQL allows writing SQL queries to extract data from databases, and SparkML provides a built-in machine learning framework that can be trained on data obtained by filtering the available data for the required attributes. Spark is also compatible with Hadoop, allowing seamless integration with scalable storage: it can load data to and from HDFS in parallel, which speeds up storage operations and queries.
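The extract-transform-load flow described above can be sketched in plain Python. Here a thread pool stands in for the parallel transform stage that Spark would distribute across a cluster; the sources, records, and field names are hypothetical:

```python
# Minimal ETL sketch. Spark parallelizes the transform stage across a
# cluster; concurrent.futures stands in here to show the same idea on
# one machine. Records and field names are hypothetical.
import csv
import io
import json
from concurrent.futures import ThreadPoolExecutor

def extract(sources):
    """Pull rows from heterogeneous sources (CSV text, JSON text)."""
    rows = []
    for kind, payload in sources:
        if kind == "csv":
            rows.extend(csv.DictReader(io.StringIO(payload)))
        elif kind == "json":
            rows.extend(json.loads(payload))
    return rows

def transform(row):
    """Normalize one record into the format used for analysis."""
    return {"item": row["item"].strip().lower(), "qty": int(row["qty"])}

def load(rows, store):
    """Load transformed rows into the target store (a dict here;
    HDFS or a data mart in the real pipeline)."""
    for r in rows:
        store[r["item"]] = store.get(r["item"], 0) + r["qty"]
    return store

sources = [
    ("csv", "item,qty\nSoap ,3\nbrush,2\n"),
    ("json", '[{"item": "SOAP", "qty": "5"}]'),
]
with ThreadPoolExecutor() as pool:  # parallel transform stage
    cleaned = list(pool.map(transform, extract(sources)))
warehouse = load(cleaned, {})
# warehouse == {"soap": 8, "brush": 2}
```

In the project itself, `extract` corresponds to SparkSQL queries against the source databases and `load` writes to HDFS in parallel.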
Since Spark provides such seamless integration, it is an ideal tool for big data operations and for developing all the modules of this project.

II. Related Work

Many approaches have been used to date for value proposition analysis, each with its own advantages and disadvantages. This section describes some of the work done in the past, along with summarized conclusions. The study by Bowman and Forman [1, 6] tries to increase ETL efficiency and reduce processing time in a distributed system using the Hadoop Distributed File System (HDFS) and Apache Spark. Usually, the time taken by ETL processing grows with the number of records in the data source, since ETL must process as many records as the sources contain. The work uses an ETL method based on Spark. It is observed to be an efficient model, and the amount of data to be processed is reduced using this model.

International Journal of Distributed and Cloud Computing 7 (1) June 2019, 01-04
http://www.publishingindia.com/ijdcc
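A common way to achieve the reduction in processed data reported for the Spark-based ETL model above is incremental extraction: each run records a watermark (such as the last processed record ID or timestamp) and extracts only newer records. The following is a minimal sketch of that idea; the exact mechanism used in [1, 6] may differ, and the record layout here is hypothetical:

```python
# Incremental extraction with a watermark: each ETL run processes only
# records newer than the previous run, instead of the full source.
# The record layout and the watermark field are hypothetical.

def extract_incremental(source_rows, watermark):
    """Return rows newer than the watermark, plus the new watermark."""
    fresh = [r for r in source_rows if r["id"] > watermark]
    new_watermark = max((r["id"] for r in fresh), default=watermark)
    return fresh, new_watermark

rows = [{"id": 1, "item": "soap"},
        {"id": 2, "item": "brush"},
        {"id": 3, "item": "towel"}]

batch1, wm = extract_incremental(rows, watermark=0)  # first run: all rows
rows.append({"id": 4, "item": "comb"})
batch2, wm = extract_incremental(rows, wm)           # next run: new rows only
```

The second run touches one record rather than four, which is where the time saving comes from as the source grows.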