Estimating Foraging Values and Costs in Stack Overflow
Abim Sedhain
The University of Tulsa
Tulsa, Oklahoma
abs5423@utulsa.edu
Sruti Srinivasa Ragavan
Microsoft Research
Cambridge, UK
a-srutis@microsoft.com
Brett McKinney
The University of Tulsa
Tulsa, Oklahoma
brett-mckinney@utulsa.edu
Sandeep Kaur Kuttal
The University of Tulsa
Tulsa, Oklahoma
sandeep-kuttal@utulsa.edu
Abstract—We operationalized information foraging theory for
Stack Overflow and built a semi-supervised model to recommend
optimal information to developers.
I. INTRODUCTION
Software development is an information-intensive activity.
Research has shown that developers spend almost 35% of their
time foraging for information from various sources, including
the Internet [1], [2]. Prior research has aimed to support
developers by answering their questions [3]–[5].
Stack Overflow (SO) is the most popular question-and-answer
(Q&A) website, covering a vast number of programming-related
topics, and is the first place developers go when they run
into a programming issue [6].
Developers look for solutions on SO through query formu-
lation. For each query, multiple question pages are returned;
the developer must then determine which questions to forage
for their answer. Once inside the question, a developer needs
to read through the answers and then digest the information
within them. Many of the answers are unique and have
their own strengths and weaknesses. Collectively, these steps
increase a developer’s foraging time as well as cognitive load
[7].
To mitigate this issue, we conjecture that Information For-
aging Theory (IFT) [8] can help. IFT explains how, fundamen-
tally, humans seek information. It has been applied success-
fully to various domains (e.g., web, documents, visualizations)
[8]–[11]. In software engineering, IFT has been applied to
explain debugging [12], programmers' code navigation during
debugging and reuse [13]–[16], and analysts' requirements
foraging [17], and to recommend code locations [18].
IFT posits that humans' information-seeking behavior is
fundamentally a value/cost consideration: a predator (e.g., a
developer) foraging for information prey (e.g., an answer to a
programming problem) aims to maximize the value of the
information gained relative to the cost spent foraging for it.
The developer uses various cues (e.g., words in the question
title, number of votes, answer count, tags) to optimize their
foraging and to efficiently find the information they need.
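The value/cost trade-off described above can be written compactly. As a sketch only, assume each candidate piece of information p in a candidate set P has an estimated value V(p) and a positive foraging cost C(p); these symbols are introduced here for illustration and are not notation taken from the study.

```latex
\[
R(p) = \frac{V(p)}{C(p)}, \qquad p^{*} = \arg\max_{p \in P} R(p)
\]
```

Under this reading, cues such as vote counts or tags would act as inputs to the forager's estimates of V(p) and C(p), and the forager prefers the candidate with the highest ratio R(p).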
Our research aims to build a semi-supervised model that
highlights the relative importance of various cues in Stack
Overflow and recommends optimal information to users,
thereby lowering overall foraging costs.
II. METHODOLOGY
A. Operationalization: IFT for Stack Overflow
To leverage any theory in a new domain, we must first
operationalize the theory's constructs for the domain.
Therefore, we systematically mapped the contents of an SO post
to IFT constructs. Specifically, as shown in the Appendix, we
cataloged the foraging costs and values associated with each
piece of information on an SO page.
To find the right information on SO, a user must
engage in two distinct foraging activities:
• Between-patch foraging, where the user chooses a relevant
post (question) among several available questions, and
• Within-patch foraging, where the user forages within the
question for the most suitable answer.
In our operationalization, each post is a patch; between-patch
and within-patch foraging are between-post and within-post
(but between-answer) foraging, respectively. We accounted for
costs and values for both these activities.
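The two foraging activities can be sketched as a small data model. The following is a minimal illustration, not the model built in this work: the class names, the `value` and `cost` fields, and the value-to-cost ratio used for selection are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Answer:
    answer_id: int
    value: float  # hypothetical estimated value of reading this answer
    cost: float   # hypothetical estimated cost of reading it (must be > 0)


@dataclass
class Post:
    """A question page, i.e., one patch in the operationalization."""
    post_id: int
    value: float  # hypothetical estimated value of visiting the post
    cost: float   # hypothetical estimated cost of visiting it (must be > 0)
    answers: List[Answer] = field(default_factory=list)


def ratio(item) -> float:
    """Value gained per unit of cost for a post or an answer."""
    return item.value / item.cost


def between_patch(posts: List[Post]) -> Post:
    """Between-patch foraging: pick the post with the best value/cost ratio."""
    return max(posts, key=ratio)


def within_patch(post: Post) -> Optional[Answer]:
    """Within-patch foraging: pick the best answer inside one post, if any."""
    return max(post.answers, key=ratio, default=None)
```

For example, `within_patch(between_patch(posts))` first selects a post and then an answer inside it, mirroring the between-post and within-post steps described above.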
B. Data Collection
We used SOTorrent [19], an open dataset that contains the
official SO dump, from the creation of the first post through
the last edit on December 31, 2020. For this preliminary work,
we extracted the data for one language, namely JavaScript; it
is the most popular programming language on SO¹. To replicate
the queries developers would search for on SO, we gathered over
800 questions from the LeetCode problems section². Using those
questions and the JavaScript tag as filters, we obtained
4,896 questions and 4,334 answers.
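The filtering step can be illustrated with a short sketch. The matching rule actually used is not specified here, so the word-overlap test below is a stand-in chosen for illustration. The field names `PostTypeId`, `Tags`, and `Title`, the `<tag1><tag2>` tag encoding, and the `min_overlap` threshold are likewise assumptions based on the public SO data dump layout rather than details from this work.

```python
import re


def parse_tags(tag_field: str) -> set:
    """Split a tag string such as '<javascript><arrays>' into a set of tags."""
    return set(re.findall(r"<([^<>]+)>", tag_field))


def tokens(text: str) -> set:
    """Lowercase a string and split it into alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def matching_questions(posts, query, tag="javascript", min_overlap=2):
    """Return question posts that carry `tag` and share words with `query`.

    `posts` is an iterable of dicts; PostTypeId 1 is assumed to mark questions.
    """
    query_tokens = tokens(query)
    hits = []
    for post in posts:
        if post["PostTypeId"] != 1:
            continue  # skip answers and other post types
        if tag not in parse_tags(post["Tags"]):
            continue  # keep only posts with the requested tag
        if len(query_tokens & tokens(post["Title"])) >= min_overlap:
            hits.append(post)
    return hits
```

A query here would be the title of one of the gathered problems; answers could then be collected by following the retained question IDs.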
C. Model Building
Our aim is to build a supervised model that predicts
the costs and values of a given patch. Doing so requires
labeled data, which was not available in the SO
dataset. Therefore, we first took an unsupervised approach,
namely clustering, to obtain labels (e.g., high value and low
¹ https://insights.stackoverflow.com/survey/2021
² https://leetcode.com
2022 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC) | 978-1-6654-4214-5/22/$31.00 ©2022 IEEE | DOI: 10.1109/VL/HCC53370.2022.9833135