31st International Conference on Information Systems Development (ISD2023 Lisbon, Portugal)

Data Analysis on Blockchain Distributed File Systems: Systematic Literature Review

Miguel Rodrigues Baptista
Instituto Superior Técnico & INOV
Lisbon, Portugal
miguelbaptista@tecnico.ulisboa.pt

Miguel Mira da Silva
Instituto Superior Técnico & INOV
Lisbon, Portugal
mms@tecnico.ulisboa.pt

Paulo Rupino da Cunha
University of Coimbra DEI & CISUC
Coimbra, Portugal
rupino@dei.uc.pt

Cláudia Antunes
Instituto Superior Técnico
Lisbon, Portugal
claudia.antunes@tecnico.ulisboa.pt

Abstract

Interest in discovering information hidden in large amounts of data has exploded in the last decade, highlighting the need for efficient and effective tools to access all sources and kinds of data. At the same time, the need to secure and share valuable data has led to the development of new technologies, such as blockchain, that guarantee data integrity and transparency. Combining the two is a natural demand, but several issues become clear, such as inefficient data access and the need for data replication in common solutions. Indeed, the only existing approach is to emulate queries, mostly through smart contracts, and to apply traditional machine learning algorithms to the resulting data, which is stored externally to allow multiple accesses. In this paper, we perform a systematic literature review that supports these conclusions. We then discuss a new system architecture for analysing data stored in a blockchain, exploring the scalability and high-performance data access of distributed file systems and the fast, up-to-date predictions of a streaming analysis approach.

Keywords: Blockchain; Information System Security; Data Analysis; Incremental Machine Learning; Distributed File System

1. Introduction

Blockchain and data analysis are topics of high interest and are being integrated in a multitude of applications [1].
However, research combining them provides neither guidelines on how to access data on a blockchain nor guidance on how to analyse the collected data. Indeed, this process is not as straightforward as mining traditional databases. There is no standard structure for data stored in a blockchain that makes analysis time-efficient, as data cubes do for data warehouses. In particular, blockchain does not have a built-in query system, so most solutions fall into one of two categories: emulating queries with smart contracts and custom search engines, or extracting the data to a traditional database and accessing it from there. However, both solutions have issues. Querying data through smart contracts is costly and slow, while extracting data to an off-chain database loses the data integrity protections afforded by the blockchain and requires additional storage. Lastly, with smart contracts and custom search engines, analysing data stored in a blockchain is