Estimating Foraging Values and Costs in Stack Overflow

Abim Sedhain, The University of Tulsa, Tulsa, Oklahoma, abs5423@utulsa.edu
Sruti Srinivasa Ragavan, Microsoft Research, Cambridge, UK, a-srutis@microsoft.com
Brett McKinney, The University of Tulsa, Tulsa, Oklahoma, brett-mckinney@utulsa.edu
Sandeep Kaur Kuttal, The University of Tulsa, Tulsa, Oklahoma, sandeep-kuttal@utulsa.edu

Abstract: We operationalized information foraging theory for Stack Overflow and built a semi-supervised model to recommend optimal information to developers.

I. INTRODUCTION

Software development is an information-intensive activity. Research has shown that developers spend almost 35% of their time foraging for information from various sources, including the Internet [1], [2]. Prior research has aimed to support developers by answering developers' questions [3]–[5]. Stack Overflow (SO) is the most popular Question and Answer (Q&A) website, covers a vast range of programming-related topics, and is the first place developers go when they run into any sort of programming issue [6].

Developers look for solutions on SO through query formulation. For each query, multiple question pages are returned; the developer must then determine which questions to forage for their answer. Once inside a question, the developer needs to read through the answers and digest the information within them. Many of the answers are unique and have their own strengths and weaknesses. Collectively, these steps increase a developer's foraging time as well as cognitive load [7].

To mitigate this issue, we conjecture that Information Foraging Theory (IFT) [8] can help. IFT explains how, fundamentally, humans seek information. It has been applied successfully to various domains (e.g., web, documents, visualizations) [8]–[11].
In software engineering, IFT has been applied to explain debugging [12], programmers' code navigation during debugging and reuse [13]–[16], and analysts' requirements foraging [17], as well as to recommend code locations [18].

IFT posits that human information-seeking behavior is fundamentally driven by value/cost considerations: a predator (e.g., a developer) foraging for information prey (e.g., an answer to a programming problem) aims to maximize the amount of valuable information gained relative to the cost spent foraging for it. The developer uses various cues (e.g., words in the question title, number of votes, answer count, tags) to optimize their foraging and to efficiently find the information they need.

Our research aims to build a semi-supervised model that highlights the relative importance of various cues in Stack Overflow and recommends optimal information to users, thereby lowering overall foraging costs.

II. METHODOLOGY

A. Operationalization: IFT for Stack Overflow

To leverage any theory in a new domain, we must first operationalize the theory's constructs for the domain. Therefore, we systematically mapped the contents of a SO post to IFT constructs. Specifically, as shown in the Appendix, we cataloged the foraging costs and values associated with each piece of information on a SO page.

To find the right information on SO, a user must engage in two distinct foraging activities: between-patch foraging, where the user chooses a relevant post (question) among several available questions, and within-patch foraging, where the user forages within the question for the most suitable answer. In our operationalization, each post is a patch; between-patch and within-patch foraging are between-post and within-post (but between-answer) foraging, respectively. We accounted for costs and values for both of these activities.

B. Data Collection

We used SOTorrent [19], an open dataset that contains the official SO dump, from the creation of the first post to the last edit on December 31, 2020. For this preliminary work, we extracted the data for one language, namely JavaScript; it is the most popular programming language on SO¹. To replicate the queries developers would search on SO, we gathered over 800 questions from the Leetcode problem section². Using those questions and the JavaScript tag as filters, we obtained 4896 questions and 4334 answers.

¹ https://insights.stackoverflow.com/survey/2021
² https://leetcode.com

2022 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC) | 978-1-6654-4214-5/22/$31.00 ©2022 IEEE | DOI: 10.1109/VL/HCC53370.2022.9833135

C. Model Building

Our aim is to build a supervised model that can predict the costs and values of a given patch. To do so, labeled data is necessary, but unfortunately it was not available in the SO dataset. Therefore, we took an initial unsupervised approach, namely clustering, to obtain labels (e.g., high value and low