Software research must become more reusable
This month I gave two invited talks that as paired pieces form a kind of bracket to hold together several big things I've been thinking about and working on lately: community-based research for developer science and what it gets us (to an audience of researchers), and what to do after you've measured the impact of AI badly in your organization (to an audience of leaders).
The first was to the UC Irvine inaugural forum on Developer Experience, invited by Tom Zimmerman (whose work you can read here). Speakers were invited across software research in both academia and industry, and included a full panel of the authors of the SPACE framework. You can read my live posting thoughts about the event.
One of the salient highlights from the interstitial, dinner and panel conversation included whether or not the Activity dimension of "SPACE" (which is really a general thought piece-style framework about topics to think about in developer productivity, rather than a set of specifically-defined, testable theories) is misaligned to the AI era. I believe that the vast majority of "activity" measures are badly used in engineering organizations, and that simplistic averages of things like PR throughput are likely to be wildly misleading given the obvious over-time variance in this work.
Without a real sense of the causal drivers of those changes, and very careful hierarchical multisystem statistical modeling of changes and variance over time (which I have to be honest never seen happen inside of an engineering organization – although tell me if you're doing this), counting up and averaging tiny chunks of output against some arbitrary threshold seems almost designed to ensure developers will be mismeasured at the very time that this "throughput" is becoming inherently decoupled from the processes of problem-solving. The meaning of "throughput" is clearly changing. The SPACE panel led me to believe I am in a minority view on this, though, as they all gave quite a lot of endorsement and defense of "Activity." For my part, I suspect characteristics of PRs that are entirely different from quantities and rates will prove to be critical to interpreting what we're seeing: e.g., types of tasks, features of PRs that allow for review and verification or not, and all sorts of characteristics of what a PR is actually doing. Activity measures are untroubled by such concerns.
I have always noticed that in the software teams I work with, the C of "SPACE" seems to go critically undermeasured, yet it is in relational and collaborative/coordinative elements of software work that we most clearly see some of the emerging impacts from AI– and those are also the activities from which we will probably need to derive the features of a "good PR" relative to the concerns of the team (stay tuned, because I actually have a few measurement answers I've been testing with real teams about how to better explore those features). I was delighted to hear Dr. Margaret-Anne Storey share that same perspective that we should pay more attention to collaboration/coordination from her experiences watching this framework have such a sizable impact on what the industry believes about itself.
For my talk, I have been thinking about how we get to make progress on understudied parts of people's experiences. I took the opportunity to give a talk that I've been meaning to give but hadn't had the right venue for before the UC Irvine event: Reusable Research.
Reusable research is simply, any research project that has a life longer than a short blip of attention when it gets published. That reuse could be shaping theory, providing data someone else builds on, or changing practice. It could also be the use we get out of evidence when we're able to translate it to a new domain or apply an underlying structural insight to a new version of the problem. Reused research is also research that makes it into those strange branching telephone games of tiktok summaries and instagram infographics (yes, we live in the modern world and should respect modern information flow, for better or worse!). Aiming for reusable research doesn't mean you can't value small, bespoke, and highly specific evidence. It just means you care beyond the conventional output of the study.

Not all reasons research gets reused are good ones. But almost every time a research project earns its keep in the library of human knowledge can be traced back to decisions and design choices that happened either before the study was run, or after the study is over. This can be exhausting for researchers when it's already a herculean task to continue to refine the specialized skills and domain knowledge of research itself. So it's easy for us to think this is unfair. Nevertheless, unfair realities are important – very important in the agentic coding era – to pay attention to the decisions and choices that shape not just how we study things, but why we choose to study some things and not others. It is often in those choices that we can uncover things that help people.
I opened the talk by asking the audience how many of them wonder if anyone will ever use their study after they publish it, and a roomful of arms shot up. Then I revealed this picture I simply cannot get over: The Isolator, an invention from Hugo Gernsberg that aimed to solve the problem of how to get people to focus at work by removing distractions and noise.

The Isolator Helmet, as an intervention to solve productivity problems, has a causal model of the world that we can model very simply. If noise -> distraction, and distraction -> stopping working, than an invention which reduces noise to zero should reduce stopping working to zero. So why don't we see every knowledge worker don the Isolator Helmet in the workplace?
This gets a laugh because it's quite easy to mentally simulate the Isolator Helmet intervention implementation: you'd feel hot, other things matter other than distraction, getting a little oxygen tank feels pretty arduous, people have goals other than production like not looking totally ridiculous in front of their friends. Even within the simplistic noise->distraction->less working model of the Isolator, we could start to complicate the model when we think about what those concepts mean. Is it the same "distraction" when it's a piece of the problem-solving you need to talk to a peer about? Is it the same "noise" when it's white noise that helps a person focus? Scientists call this specification. We also call hypotheses underspecified when they lack the precision and detail to become testable in the real world.
It's a good laugh, but I like this example because it's a silly way to get researcher-type people like me to start thinking about a serious concern I have. When we create simulations of the world (whether digital or in a laboratory) that have a profoundly simplistic mental model of what works to create outcomes, we end up with broken intervention ideas. And broken intervention ideas simply can't become reusable because the failures are so obvious once you roll them out into the real world, such as a workplace or a school.
What helps? I argued in my talk that we should try to use three main strategies:
1. Build with real people (their problems, the texture of their real lives, their needs)
2. Test theories systematically (so we can better know both what they actually are and when they fail and succeed)
3. Be a little braver than a single study (by this I mean: put effort, time and care into what happens to findings after you publish your study, both communicating evidence but also being open to the two-way street of adapting your thing to the elements of the world you didn't expect)
To me, Intervention studies are a particularly powerful way to do work that is marked by these three features, and I wish we would do more of them when we study developer experience.

For example, if you want to claim that your science project shows us evidence about how a certain message about vaccines can change parents' decision-making about vaccines, you'll quickly find you really need to specify your hypothesis to even design and run the intervention in the first place. It's very hard to get real people, trying to change with your tool, to stay abstract and fuzzy concepts...their concerns, needs, problems and pushback as you run your study will almost certainly deepen your understanding of the problem space you're trying to work in. Surveys are valuable, but it's easier to force people into the leading questions you want them to answer in observational work. Directly manipulating elements of your theory with an intervention is an incredible opportunity to create strong evidence.
That also means researchers shouldn't waste that incredible opportunity. You start to scrutinize things more because it's that much more expensive to waste the time it takes to run an intervention. You'll need to make choices about what not to do, and that forces specification because you are rewarded for building on better theory. It's simply harder to get it right without robust theory. In our Code Review Anxiety intervention study, we could have chosen to hammer on code skills and drilling syntax (indeed, the fact that this is a psychological appraisal intervention and not a skill-building intervention for code skills is something that our study was criticized for). But there is a theoretical reason we did not do this, even though some developers proposed that would be the fix for anxiety: anxiety theory tells us that it's not happening because of a deficit of skill. Rather, anxiety creates a deficit in your ability to access your skills. For high performing but highly anxious developers, who already know how to code, it's likely much more helpful to use strategies of increasing self-efficacy (which loosens the grip that anxiety takes on your executive functions, thereby improving your problem-solving!).
That's a successful intervention, but there are many unsuccessful interventions (because doing harder and more ambitious studies also means accepting that some of them will fail). What can we learn about that? I like John List's idea of a voltage drop: when an intervention seems to work well in the laboratory, but then loses its efficacy in the field. The simple insight here is that we truly need to think about how we can build in smaller-scale versions of the friction that exists in the real world, alongside considering the contexts that our interventions rely on. I have written about examples of this from the psychological affordances around software teams (fun fact re: reusable research – this review paper has been useful to at least several thousand people working in technical spaces, and yet I have gotten rejections on it from both software and psychology venues saying it's simply not useful. I know that it is; if you are an enterprising journal editor who wants to feature work that's helped real software teams you let me know if you want to publish it as a commentary).
More concepts I find helpful come from authors like Geoffrey Cohen, a Stanford psychological scientist who has published extensive work on psychological interventions, including improving people's sense of belonging and decreasing threat appraisals, highly relevant to what developers are going through today. I find the "three Ts" of successful interventions a really helpful checklist that I look at when I'm trying to design a change practice for a team: Targeted, Tailored, Timely.

Finally and perhaps most importantly, I advocated for soliciting our research questions from developers themselves. Our Code Review Anxiety intervention happened because developers themselves kept describing this experience to my co-author Carol and I every time they learned we were psychologists in tech – we were more called in to do this work than coming up with it alone (when you're a researcher trapped inside of the Isolator Helmet, no one can hear you scream your own ideas). Greg Wilson has collected a sample of research questions from practitioners, and it's something I've been thinking a lot about. I keep copious notes of the conversations and questions I have with developers about their lives, and these conversations massively shaped what's inside of The Psychology of Software Teams.
As I said to the UC Irvine DevEx Research Forum audience, it's incredibly scary to open up your science to the outside world. But it might just be worth it, and lead us to things we never would have discovered without being brave. Otherwise, I fear we risk remaining in simplistic models of the world that remain underspecified, and dissatisfying, to real teams. And the chance to do research is a precious privilege, so I think we should spend it on the things that help this community.
If you're research-aligned and interested in this sort of conversation for your eng org or otherwise, I'd be happy to consider giving it again (when my rapidly-filling book schedule allows!). I'd really like to connect with more orgs and communities that care about open science for the people in tech, and I think we're in a critical window of time when this community needs evidence to protect its practice, see a new future, and build for a human-centered version of our work.
Member discussion