AI 2023, 4, 1–15. https://doi.org/10.3390/ai4010001 www.mdpi.com/journal/ai

Article

Data Synthesis for Alfalfa Biomass Yield Estimation

Jonathan Vance 1, Khaled Rasheed 1,2,*, Ali Missaoui 3 and Frederick W. Maier 2

1 School of Computing, University of Georgia, 415 Boyd Graduate Studies, 200 D. W. Brooks Drive, Athens, GA 30602, USA
2 Institute for Artificial Intelligence, University of Georgia, 515 Boyd Graduate Studies, 200 D. W. Brooks Drive, Athens, GA 30602, USA
3 Department of Crop and Soil Sciences, Institute of Plant Breeding Genetics and Genomics, University of Georgia, 4317 Miller Plant Science, Athens, GA 30602, USA
* Correspondence: khaled@uga.edu

Abstract: Alfalfa is critical to global food security. Alfalfa data are abundant nationally in the U.S. but often scarce locally, which limits the potential performance of machine learning (ML) models in predicting alfalfa biomass yields. Training ML models on local-only data results in very low estimation accuracy when the datasets are very small. Therefore, we explore synthesizing non-local data to estimate biomass yields labeled as high, medium, or low. One option to remedy scarce local data is to train models using non-local data; however, this works only about as well as using local data. We therefore propose a novel pipeline that trains models using data synthesized from non-local data to estimate local crop yields. Our pipeline, synthesized non-local training (SNLT, pronounced like "sunlight"), achieves a gain of 42.9% accuracy over the best results from regular non-local and local training on our very small target dataset. This pipeline produced the highest accuracy of 85.7% with a decision tree classifier. From these results, we conclude that SNLT can be a useful tool in helping to estimate crop yields with ML.
Furthermore, we propose a software application called Predict Your CropS (PYCS, pronounced like "Pisces"), designed to help farmers and researchers estimate and predict crop yields based on pretrained models.

Keywords: machine learning; data synthesis; generative models; alfalfa; biomass; precision agriculture; classification; climate change; yield prediction; deep learning

1. Introduction

The alfalfa crop is an important livestock feed and is crucial to global food security. In previous work, we used climate data to estimate alfalfa biomass yields, comparing the accuracies of feature selection techniques and machine learning (ML) models for this task. We obtained promising results using local training data, with R² values over 0.90, as we had access to rich curated datasets from state university variety trials [1]. However, our team is developing a software application to aid real-world farmers, whose datasets may be much smaller, so the current work addresses the problem of estimating yields for very small target datasets. We find that local training on very small target datasets results in very low accuracy, while non-local training on much larger datasets performs only about as well as local training. Our solution combines ideas inspired by [2], which shows success using pretrained models and sparse datasets, with ideas inspired by [3,4], which show the promise of deep learning generative models such as the adversarial autoencoder (AAE) [3] and generative adversarial networks (GANs) [4]. We propose a novel pipeline in which models are trained with data generated, or synthesized, by other deep learning (DL) models. In this pipeline, the training data are synthesized from non-local sources, and the resulting classifiers estimate local targets. We call this synthesized non-local training (SNLT, pronounced like "sunlight"), and we show that it consistently achieves better accuracy than both local and non-local training.
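The flow of the pipeline described above (fit a generator on non-local data, synthesize a training set, train a classifier on it, evaluate on the small local target set) can be sketched in a few lines. This is only a minimal illustration under stated assumptions, not the authors' implementation: a per-class Gaussian sampler stands in for the AAE or GAN generators, a nearest-centroid classifier stands in for the classifiers evaluated in the paper, and the data are random toy values with two unnamed features. All function and variable names are hypothetical.

```python
import random
import statistics
from collections import defaultdict

# Toy class centers for two unnamed features (assumption; not the paper's data).
CENTERS = {"low": (0.0, 0.0), "medium": (5.0, 5.0), "high": (10.0, 10.0)}


def make_toy(n_per_class, shift, rng):
    """Draw random toy rows around each class center, offset by `shift`."""
    rows, labels = [], []
    for label, center in CENTERS.items():
        for _ in range(n_per_class):
            rows.append([rng.gauss(c + shift, 1.0) for c in center])
            labels.append(label)
    return rows, labels


def fit_class_gaussians(rows, labels):
    """Stand-in generator: per-class, per-feature mean and standard deviation."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    params = {}
    for label, members in by_class.items():
        columns = list(zip(*members))
        means = [statistics.fmean(col) for col in columns]
        stdevs = [statistics.pstdev(col) for col in columns]
        params[label] = (means, stdevs)
    return params


def synthesize(params, n_per_class, rng):
    """Sample synthetic labeled rows from the fitted per-class Gaussians."""
    rows, labels = [], []
    for label, (means, stdevs) in params.items():
        for _ in range(n_per_class):
            rows.append([rng.gauss(m, s) for m, s in zip(means, stdevs)])
            labels.append(label)
    return rows, labels


def fit_centroids(rows, labels):
    """Stand-in classifier: one centroid (feature means) per class."""
    params = fit_class_gaussians(rows, labels)
    return {label: means for label, (means, _) in params.items()}


def predict(centroids, row):
    """Return the label of the centroid closest to `row` (squared distance)."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def accuracy(centroids, rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    hits = sum(predict(centroids, r) == y for r, y in zip(rows, labels))
    return hits / len(labels)


if __name__ == "__main__":
    rng = random.Random(0)
    source_X, source_y = make_toy(200, 0.0, rng)   # large non-local set
    target_X, target_y = make_toy(3, 0.5, rng)     # very small local set
    generator = fit_class_gaussians(source_X, source_y)
    synth_X, synth_y = synthesize(generator, 500, rng)
    model = fit_centroids(synth_X, synth_y)        # trained on synthetic data only
    print(f"accuracy on local targets: {accuracy(model, target_X, target_y):.3f}")
```

The point of the sketch is the data flow: the classifier never sees the non-local rows directly, only samples drawn from a generator fitted to them, and the local rows are used only for evaluation. Swapping in a deep generative model or a decision tree would change the two stand-in components but not this structure.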
Citation: Vance, J.; Rasheed, K.; Missaoui, A.; Maier, F.W. Data Synthesis for Alfalfa Biomass Yield Estimation. AI 2023, 4, 1–15. https://doi.org/10.3390/ai4010001

Academic Editor: Arslan Munir

Received: 12 November 2022
Revised: 29 November 2022
Accepted: 30 November 2022
Published: 21 December 2022

Copyright: © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

We extend the work of Xu