RAG vs Fine-Tuning: Pipelines, Tradeoffs, and a Case Study on Agriculture

Microsoft

Angels Balaguer, Vinamra Benara, Renato Cunha, Roberto Estevão, Todd Hendry, Daniel Holstein, Jennifer Marsman, Nick Mecklenburg, Sara Malvar, Leonardo O. Nunes, Rafael Padilha, Morris Sharp, Bruno Silva, Swati Sharma, Vijay Aski, Ranveer Chandra

ABSTRACT

There are two common ways in which developers incorporate proprietary and domain-specific data when building applications of Large Language Models (LLMs): Retrieval-Augmented Generation (RAG) and fine-tuning. RAG augments the prompt with the external data, while fine-tuning incorporates the additional knowledge into the model itself. However, the pros and cons of both approaches are not well understood. In this paper, we propose a pipeline for fine-tuning and RAG, and present the tradeoffs of both for multiple popular LLMs, including Llama2-13B, GPT-3.5, and GPT-4. Our pipeline consists of multiple stages, including extracting information from PDFs, generating questions and answers, using them for fine-tuning, and leveraging GPT-4 for evaluating the results. We propose metrics to assess the performance of different stages of the RAG and fine-tuning pipeline. We conduct an in-depth study on an agricultural dataset. Agriculture as an industry has not seen much penetration of AI, and we study a potentially disruptive application: what if we could provide location-specific insights to a farmer? Our results show the effectiveness of our dataset generation pipeline in capturing geography-specific knowledge, and the quantitative and qualitative benefits of RAG and fine-tuning. We see an accuracy increase of over 6 p.p. when fine-tuning the model, and this is cumulative with RAG, which increases accuracy by a further 5 p.p. In one particular experiment, we also demonstrate that the fine-tuned model leverages information from across geographies to answer specific questions, increasing answer similarity from 47% to 72%.
Overall, the results point to how systems built using LLMs can be adapted to respond to and incorporate knowledge across a dimension that is critical for a specific industry, paving the way for further applications of LLMs in other industrial domains.

Keywords GPT-4 · Agriculture · Retrieval Augmented Generation · Fine-tuning

1 Introduction

Over the past few years, artificial intelligence and natural language processing have seen significant advancements, leading to the development of powerful large language models (LLMs) such as the Generative Pre-trained Transformer (GPT). The technology driving LLMs, including advanced deep learning techniques, large-scale transformers, and vast amounts of data, has propelled their rapid evolution. Models like GPT-4 (OpenAI, 2023) and Llama 2 (Touvron et al., 2023b) have demonstrated exceptional performance across numerous tasks and domains, often without specific prompts. These models surpass their predecessors and hold immense potential in various fields such as coding, medicine, law, agriculture, and psychology, closely approaching human-level expertise (Bubeck et al., 2023; Nori et al., 2023; Demszky et al., 2023). As LLM research continues, it is critical to identify their limitations and address the challenges of developing more comprehensive artificial general intelligence (AGI) systems. Moreover, the machine learning community must move beyond traditional benchmarking datasets and evaluate LLMs in ways that more closely resemble assessments of human cognitive ability.

The adoption of Artificial Intelligence (AI) copilots across various industries is revolutionizing the way businesses operate and interact with their environment. These AI copilots, powered by LLMs, provide invaluable assistance in

arXiv:2401.08406v3 [cs.CL] 30 Jan 2024