Investigating Masking-based Data Generation in Language Models

Ed S. Ma
eddsma@outlook.com

ABSTRACT
The current era of natural language processing (NLP) has been defined by the prominence of pre-trained language models since the advent of BERT. A feature of BERT and models with similar architecture is the objective of masked language modeling, in which part of the input is intentionally masked and the model is trained to predict this piece of masked information. Data augmentation is a data-driven technique widely used in machine learning, including research areas like computer vision and natural language processing, to improve model performance by artificially augmenting the training data set with designated techniques. Masked language modeling (MLM), an essential training objective of BERT, has introduced a novel approach to effective pre-training of Transformer-based models on natural language processing tasks. Recent studies have utilized masked language models to generate artificially augmented data for NLP downstream tasks. The experimental results show that mask-based data augmentation provides a simple but efficient approach to improving model performance. In this paper, we explore and discuss the broader utilization of these MLM-based data augmentation methods.

KEYWORDS
natural language processing, data augmentation, language models, neural networks

1 INTRODUCTION
Pre-trained language models (PLMs) have revolutionized the field of natural language processing, with BERT architectures standing out for their innovative design and impressive performance. These models, based largely on the Transformer [56] architecture, use bi-directional representations to understand context, thus pushing the limits of previous uni-directional models. A distinctive feature of BERT and similar models is the goal of masked language modeling, in which part of the input is intentionally masked and the model is trained to predict this masked token.
This strategy simulates a fuller understanding of the context and relationships between words, leading to a better understanding of language nuances and meaning. As a result, BERT-like models have found extensive application in a variety of NLP tasks, such as text classification, sentiment analysis, and question answering, significantly pushing the boundaries of what machines can understand and accomplish in the realm of human language.

(Independent project. Work in progress. Contact: eddsma@outlook.com)

NLP tasks require a significant amount of high-quality annotated data for several critical reasons, primarily related to the inherent complexity of human language and the need for machine learning models to effectively understand, interpret, and generate it. Languages are extremely complex, with countless nuances, exceptions, and rules. Comprising morphological, syntactic, semantic, and pragmatic aspects, they require understanding not only of words and phrases, but also of context, intent, and even cultural or social cues. In order for an NLP model to understand all of these elements and generate human-like text, it must learn from a variety of examples that exhibit these characteristics in many different contexts. This is where high-quality annotated data comes into play: it provides models with explicit labels or additional information that make it easier to learn the various features of the language.

The performance of machine learning models is highly dependent on the variety and amount of training data. Because these models learn by identifying patterns in the input data, a larger and more diverse data set exposes the models to a wider range of patterns and situations. This leads to better generalization ability, and the models can process new inputs more effectively.
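The masked language modeling objective discussed above — corrupting part of the input and training the model to recover the original tokens — can be sketched concretely. The following is a minimal illustration in pure Python of BERT's standard masking scheme (select roughly 15% of positions; of those, replace 80% with [MASK], 10% with a random token, and leave 10% unchanged). The example sentence, mask rate constant, and toy vocabulary are hypothetical, not taken from the paper:

```python
import random

MASK_RATE = 0.15  # fraction of tokens selected for prediction (BERT's default)

def mask_tokens(tokens, vocab, rng):
    """BERT-style masking: of the selected positions, 80% become [MASK],
    10% become a random vocabulary token, and 10% stay unchanged.
    Returns (corrupted tokens, labels); the label -100 marks positions
    that do not contribute to the MLM loss."""
    corrupted, labels = list(tokens), [-100] * len(tokens)
    n_select = max(1, round(MASK_RATE * len(tokens)))
    for i in rng.sample(range(len(tokens)), n_select):
        labels[i] = tokens[i]  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = "[MASK]"
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)
        # else: keep the original token (the remaining 10% of cases)
    return corrupted, labels

# Hypothetical example sentence and toy vocabulary
tokens = "the model is trained to predict the masked token".split()
vocab = ["cat", "runs", "blue", "paris"]
corrupted, labels = mask_tokens(tokens, vocab, random.Random(0))
```

Keeping some selected tokens unchanged (the final 10%) discourages the model from treating [MASK] as the only cue that a position needs predicting; the same corruption step is what MLM-based augmentation methods exploit, by letting a trained model fill masked positions with plausible alternative tokens.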
For example, if an NLP model is trained on annotated data from different domains such as literature, science, law, and social media, it can understand and generate texts related to each of these domains.

The quality of the annotated data is equally important. Poorly annotated data can mislead the model during training, resulting in suboptimal performance or even completely wrong outputs. Accurate annotations are fundamental to supervised learning, as they serve as the ground truth for the model. They help models distinguish between different elements of language, understand the relationships between words, and grasp the meaning and intent behind phrases or sentences. Therefore, high-quality annotated data plays a crucial role in training robust and reliable NLP models. It provides the rich, diverse, and precise input the models need to learn the complexities of the language, ensures their applicability in different domains and scenarios, and acts as an effective guide during the training process to optimize their performance.

It is often difficult and expensive to obtain quality annotated text data in large volume. The traditional, expensive way is to hire crowd workers with target-language capability to annotate data. An example is Amazon Mechanical Turk (AMT), a crowdsourcing service that enables individuals and businesses to outsource tasks to a distributed workforce who can perform these tasks virtually. Based on the requirements, workers (known as 'Turkers') go through the target data and manually annotate it per the instructions. Once a worker completes a Human Intelligence Task (HIT), the requester can review the work, approve or reject it based on the quality of the annotation, and then pay the worker. With its vast, diverse workforce, AMT is often used to create large annotated datasets for NLP downstream tasks. However, this annotation method, while common, is very expensive. Therefore, researchers have started exploring annotation methods with lower costs.
Some examples are methods based on distant supervision. Unlabeled dialogue corpora in the target domain can be easily curated from previous