Assessing Semistructured Merge in Version Control Systems: A Replicated Experiment

Guilherme Cavalcanti, Federal University of Pernambuco, Recife, Brazil, gjcc@cin.ufpe.br
Paola Accioly, Federal University of Pernambuco, Recife, Brazil, prga@cin.ufpe.br
Paulo Borba, Federal University of Pernambuco, Recife, Brazil, phmb@cin.ufpe.br

Abstract

Context: To reduce the integration effort arising from conflicting changes in collaborative software development, unstructured merge tools try to automatically resolve part of the conflicts via textual similarity, whereas structured and semistructured merge tools try to go further by exploiting the syntactic structure of the artefacts involved. Objective: Aiming to increase the existing body of evidence and to assess results for systems developed under an alternative version control paradigm, we replicate an experiment conducted by Apel et al. [1] that compares the unstructured and semistructured approaches with respect to the conflicts each one reports. Method: We applied both semistructured and unstructured merge to a sample 2.5 times larger than that of the original study in number of projects and 18 times larger in number of merge scenarios, and we compared the occurrence of conflicts. Results: Similar to the original study, we observed that semistructured merge reduces the number of conflicts in 55% of the scenarios of the new sample. However, the observed average conflict reduction of 62% in these scenarios is far higher than what has been observed before. We also bring new evidence that the use of semistructured merge can reduce the occurrence of conflicting merge scenarios by half. Conclusions: Our findings reinforce the benefits of exploiting the syntactic structure of the artefacts involved in code integration.
Besides, the reductions observed in the number and size of conflicts suggest that the use of semistructured merge, when compared to the unstructured approach, might decrease integration effort without compromising correctness.

Keywords: replication study, collaborative development, software merging, semistructured merge, version control systems.

I. INTRODUCTION

In a collaborative development environment, developers often implement tasks independently, using individual copies of project files. As a result, when merging the separate code contributions from each task, one likely has to deal with conflicting changes and dedicate substantial effort to resolving them. These conflicts occur for a number of reasons. For example, direct (or textual) conflicts arise when different developers change the same artefact without being aware of each other's changes, whereas indirect conflicts arise when concurrent modifications to different artefacts lead to build or test failures [2, 3]. Regardless of their nature, conflicts may hamper productivity: detecting and resolving them can be a tiresome and error-prone activity, and they delay the project while developers trace their causes and seek a solution.

To learn about the occurrence of conflicts and their consequences, previous empirical studies have investigated when developers detect conflicts and how often conflicts occur. Zimmermann [4], for instance, reports that textual conflicts occurred in 23% to 47% of all file integrations. Brun et al. [2] and Kasi and Sarma [5] found that textual conflicts occurred in an average of 15% of all merge scenarios (each a set consisting of a common base revision and its derived versions), and that 31% of the merge scenarios free of textual conflicts resulted in build or behavioral errors.
Such evidence motivates and guides the design of tools that use different strategies both to decrease integration effort and to improve correctness during code integration. For example, unstructured merge tools are purely text-based and resolve conflicts via textual similarity. A structured merge tool, on the other hand, is tailored to a specific programming language and uses knowledge of the language's grammar to resolve conflicts [6, 7, 8, 9, 10]. Finally, semistructured merge [1] attempts to combine the two: it exploits structural information about software artefacts to resolve conflicts automatically and, when this information is not sufficient, falls back to the usual textual resolution.

In a previous empirical study, Apel et al. [1] found that the semistructured approach was promising when compared to the unstructured one. By studying 24 projects using Subversion, a Centralized Version Control System (CVCS), and analysing a total of 180 merge scenarios, they found that, in 60% of their sample merge scenarios, the semistructured approach reduced the number of textual conflicts by, on average, 34%. They also found that, in 82% of their sample merge scenarios, the semistructured approach reduced the number of conflicting lines of code by, on average, 61%, and that, in 72% of their sample merge scenarios, it reduced the number of conflicting files by, on average, 28%.

Given the importance of such tools for collaborative software development, here we further investigate the hypothesis of Apel et al. [1] and replicate their study. To broaden the external validity of the original study, we analyse different systems stored on Git, a Distributed Version Control System (DVCS), since DVCSs have become more popular than traditional CVCSs and offer extra information, such as merge tracking, that helps us better understand software development processes [11, 12].
We then compare semistructured merge with unstructured merge in 3266 merge scenarios from 60 projects, a sample 2.5 times larger than that of the original study in number of projects and 18 times larger in number of merge scenarios.
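
To make the contrast between the two approaches concrete, the following Python sketch applies a simplified line-based check and a simplified member-level merge to the same merge scenario, in which two developers each add a different method to the same class. It is only an illustration under stated assumptions and is not the implementation evaluated in this study: the function names, the `Stack` example, and the representation of a class as a mapping from member signature to body are all hypothetical, and the line-based check conservatively treats touching change regions as overlapping.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Optional, Tuple

# Hypothetical model of a class: member signature -> body text.
Members = Dict[str, str]


def changed_regions(base: List[str], derived: List[str]) -> List[Tuple[int, int, List[str]]]:
    """Half-open line ranges of `base` that `derived` rewrites, with the new text."""
    ops = SequenceMatcher(None, base, derived, autojunk=False).get_opcodes()
    return [(i1, i2, derived[j1:j2]) for tag, i1, i2, j1, j2 in ops if tag != "equal"]


def unstructured_conflict(base: List[str], left: List[str], right: List[str]) -> bool:
    """Simplified textual check: both sides touch the same base region differently."""
    for l1, l2, l_text in changed_regions(base, left):
        for r1, r2, r_text in changed_regions(base, right):
            overlaps = max(l1, r1) <= min(l2, r2)  # touching regions count as overlapping
            identical = (l1, l2, l_text) == (r1, r2, r_text)
            if overlaps and not identical:
                return True
    return False


def semistructured_merge(base: Members, left: Members, right: Members) -> Tuple[Members, List[str]]:
    """Three-way merge per member, assuming the order of members is irrelevant.

    Members changed differently on both sides are reported as conflicts; a real
    tool would hand their bodies to a textual merge at this point.
    """
    merged: Members = {}
    conflicts: List[str] = []
    for sig in sorted(set(base) | set(left) | set(right)):
        b, l, r = base.get(sig), left.get(sig), right.get(sig)
        chosen: Optional[str]
        if l == r:        # same outcome on both sides (including both deleted)
            chosen = l
        elif l == b:      # only the right side changed this member
            chosen = r
        elif r == b:      # only the left side changed this member
            chosen = l
        else:             # both sides changed it in different ways
            conflicts.append(sig)
            continue
        if chosen is not None:
            merged[sig] = chosen
    return merged, conflicts


# Both developers add a different method at the same position in the file.
base_lines = ["class Stack {", "  int size() { return n; }", "}"]
left_lines = base_lines[:2] + ["  void push(int x) { items.add(x); }", "}"]
right_lines = base_lines[:2] + ["  int pop() { return items.removeLast(); }", "}"]

base_members = {"int size()": "return n;"}
left_members = {**base_members, "void push(int x)": "items.add(x);"}
right_members = {**base_members, "int pop()": "return items.removeLast();"}

textual = unstructured_conflict(base_lines, left_lines, right_lines)
merged, conflicts = semistructured_merge(base_members, left_members, right_members)
print("line-based conflict:", textual)
print("member-level conflicts:", conflicts, "merged members:", sorted(merged))
```

In this scenario the line-based check flags a conflict because both insertions land on the same position in the base file, whereas the member-level merge keeps both new methods, since they have different signatures. When both sides edit the body of the same member differently, the member-level merge reports that member as a conflict, which is where the fallback to textual resolution would apply.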