How Not to Measure the ROI from AI in your Software Organization
You did your best, but it happened: somebody told you that you are the one who has to measure the ROI of AI on the developers in your software organization.
I like to imagine that when this is your first time getting pulled into being responsible for estimating a generalizable pattern about an enormous change as articulated and experienced by thousands of people and then reporting back on it in clean and simple and accessible enough numbers that your leadership won’t accuse you of overcomplicating things (“what do you need, a month?”), that night, there is a knock on your door. You open it to find a wizened person with a calming expression, Terry Pratchett-character-like, who comes into your house with a drink of your choice under one arm and a basic research textbook under the other. And the two of you stay up all night and hash out the plan that takes you from extremely panicked to I’ll do my f-king best I guess. It is one of my particular fantasies that there could be a secret society around the evidence science of software teams, available and on call for these moments. Perhaps once a month there is even a cross-societies meeting where you can listen to the health evidence people trying to assess whether nursing training is helping hospitals, and the education evidence people trying to assess whether math apps are helping rural school districts. The problems won’t get less big but they will get more shared.
Lacking that, here is a newsletter. I have been measuring the effects of interventions for a long time, and arguing about the ways to measure the effects of interventions for even longer. There is a lot here. But yay, your nightmare project could create some enduring skills? Measuring change will not become less relevant in our changing world, and when we try to evaluate ROI, we are usually talking about measuring the effect of an intervention. The I part is usually a lot more straightforward than the R.
But there are also some wins that come from not doing things, and we systematically tend to overlook them. There are a few very common pitfalls in how people tend to think about change and our Return (or the outcome measures, in my world) that create misleading findings. And misleading findings can be worse than just guessing, because they can be a lot harder to argue against. I cannot tell you how to solve everything in what will already be a too-lengthy newsletter (I did write a book which has a chapter on creating an evidence practice for your software organization and is coming out this year, so stay tuned for that). But these have been consistently useful concepts in my conversations right now with developer experience leads, infrastructure leaders, and organizational tool-buyers.
1. Don’t Assume Everything Is The Same
One of the emerging problems I see for understanding the impact of agentic coding on software development is the fact that how we use it has tremendous heterogeneity. That was the scientist way to say it, and here is the way you can say it to your boss: everyone uses this thing differently. There is a lot of different stuff going on. Do not think about this as one experiment in your organization, but as a million little experiments. Your best first job might be to ask which experiments are working and why, rather than assuming you have just set one experiment in motion. Or for some of us, the job might be finding which experiments are damaging and why. I am not interested in telling you what to think about this, I am interested in how we refine what we think.
One developer I spoke with said that he reads every response he gets from Claude Code, thinks about it, perhaps compares it against some other piece of code in the project, writes a prompt back or deliberates on it, and does a couple of pullups as he waits for it to respond again (I envied this, as I am working on my pullups at the gym at the moment, and because I have trouble remembering I have lat muscles, when I do a pullup I am constantly thinking to myself: stop using your shoulders. Being an orchestrator is not new for our cognition). Is this rote copy-pasting? Sometimes it is and sometimes it isn’t. But it is very different from the process that another developer friend of mine uses, as he tasks multiple agents with cross-conversation and experiments more with the validation between them than he concerns himself with their individual output, a process which has a cadence across several days. Another person has outfitted their Claude time with a learning skill built by some random psychologist they know, and it has added about forty-five minutes of non-coding, just-browsing time to their agentic coding day, which makes their day-by-day “velocity” numbers look bad. But if you had thought to test their comprehension of the generated files over a longer cadence (let’s say, two weeks), you would discover they are making leaps and bounds in learning a new language compared to people working “faster.” Many such cases.
Agentic coding workflows create personalization of tool × behavior interactions on a scale that is hard for us to wrap our minds around (and interactions are already pretty hard for most people to wrap their heads around). That’s one of the reasons you distrust simplistic studies about AI right now (me too). That has always been true for software developers to some extent, but now it’s escaping our grasp, because we can no longer default to simple and software-culture-loaded assumptions like “well, most reasonable people do it like this,” or, “I’m sure all the engineers use the tool this way because that’s what I do.” Look, I don’t fault you for starting there. We all start there.
But in a landscape of change, which is a landscape of heterogeneity, you are served by getting more specific. You need to unpack the mechanisms of what you really want to learn about (are they activating lat muscles? shoulders?). Or, you need to be clear that you cannot unpack those mechanisms and stand firm in the face of people trying to get you to use stupid uncontextualized outcome measures (like task-untroubled “velocity,” assuming that all the various different kinds of tasks everyone does should see equivalent speedup, which is an absurd statistical proposition. Imagine thinking that a hundred types of medicine should have the same effect on the body despite targeting completely different cells).
You can basically do two things here:
- Invest in rich enough investigations that you feel confident about how you’ve grouped people’s behaviors together, or
- Ask the questions that you can actually answer with higher-order groupings
With technology, I sometimes think of 1) as train & constrain: the way we get compelling estimates of effects is by tying them to concrete behaviors and their impact. For instance, 1) might look like bringing a group of developers into an intervention study where you give them all a training in how you want them to use AI to review and then verify that they’re doing that. Then you can compare this condition to another review condition and make a claim about that particular usage (the one you forced people to use).
Getting more specific about outcome measures is also required. If you are concerned about how developers are reviewing AI-generated code, you need to think about how you measure their reviewing instead of assuming they are all reviewing AI-generated code in the same way, on the same days, and for the same goals, and you need to think about how you operationalize a “good review” if that’s the outcome you want to see. There is also a quasi-experimental version of this where you find people inside of your organization, verify they are already doing a certain practice, and from there compare them to another group (hopefully you have also done some work to consider how well these groups compare to each other on demographics, roles, contexts, etc. If not, buy me a drink, er, schedule an evidence strategy consult with me, and we can figure that out together).
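To make that comparison concrete, here is a minimal sketch of the kind of analysis a train & constrain study sets you up for, assuming you have already operationalized a “good review” as a per-review score. The file name and column names are hypothetical placeholders, and a real analysis would still need all the grouping and confound work described above:

```python
# Minimal sketch: compare an operationalized "review quality" outcome between a
# trained-and-verified condition and a comparison condition.
# "review_outcomes.csv", "condition", and "review_quality" are hypothetical names.
import pandas as pd
from scipy import stats

reviews = pd.read_csv("review_outcomes.csv")

trained = reviews.loc[reviews["condition"] == "train_and_constrain", "review_quality"]
comparison = reviews.loc[reviews["condition"] == "comparison", "review_quality"]

# Welch's t-test (no equal-variance assumption) for the difference in means.
t_stat, p_value = stats.ttest_ind(trained, comparison, equal_var=False)

# A rough Cohen's d, so you can talk about magnitude, not just significance.
pooled_sd = ((trained.std() ** 2 + comparison.std() ** 2) / 2) ** 0.5
cohens_d = (trained.mean() - comparison.mean()) / pooled_sd

print(f"mean difference: {trained.mean() - comparison.mean():.2f}")
print(f"Welch's t = {t_stat:.2f}, p = {p_value:.3f}, d = {cohens_d:.2f}")
```

Even a tiny comparison like this forces the useful conversations: what counts as review quality, and who is actually in each condition.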
There is another very valuable version of this which is simply observational, and almost any version of 1) will either require it or be much better for it. It will not yet roll up into your single estimate of the ROI of AI, but it might be the precondition you need to decide what should go into that estimate: gather up a bunch of examples of how people are using it and start to figure out if there are shared patterns with behaviorally coherent stories, such that you can group them (at this point you may be thinking: my leadership didn’t give me enough time or money for this project. No, they probably didn’t; yes, you should advocate for yourself. This is a really hard problem they are spending a lot of money on, you have the momentum of the moment behind everyone’s urgency, and this is not the time to be chill about what you need to do your job).
For instance, perhaps you have a strong hunch that there are three basic “types” of AI users in your organization (it would be nice if this were informed by interviews, not just your gut instinct), and you want to take an observational look at what is happening for those three groups. Frankly, starting with the observational and the rich descriptive is what most people should be doing, and it’s very anxiety-relieving. Some good practices will start to become very obvious if you just let yourself observe what’s happening before you get out over your evidence skis and mandate how people should use their tools, top-down, a thing that people are famously responsive to and absolutely love, especially developers. If you are lucky enough to be in an organization that actually lets developers learn together, the power of mutually-transformative iteration across that cumulative culture will probably reveal the best patterns to you, and your job becomes vastly easier, as all you have to do is report it and advocate for it.
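If you want a picture of what that observational first pass can look like once interviews have given you candidate groupings, here is a minimal descriptive sketch. The usage-type labels, file name, and every column name are hypothetical placeholders, and the point is to describe, not to claim causation:

```python
# Minimal sketch of the descriptive first pass: summarize a few outcomes per
# hypothesized usage "type" before doing anything causal. The usage_type labels
# should come from interviews/coding of real behavior, not from gut instinct.
# "developer_usage.csv" and all column names are hypothetical placeholders.
import pandas as pd

devs = pd.read_csv("developer_usage.csv")

summary = devs.groupby("usage_type")[
    ["review_rate", "rework_rate", "self_reported_learning"]
].agg(["mean", "count"])

print(summary)
```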
But 2) is also a good choice for a lot of people in their particular organizational situations. Choosing 2) might look like telling your leadership that instead of being able to estimate the effects precisely from different types of usage of AI, you’re going to look at the organization’s average review rate before and after the introduction of AI and be able to see whether it took a hit or not. You can dispel a lot of fear with 2), because so many changes are actually not as dramatic as we think, or not happening as we assume. That can guide your next round of investigation, which will be into the taxonomy of usage itself. There are a lot of ways of estimating change that we have figured out in the world of observational causal inference, and you might benefit from investing in an applied statistician, or at least in someone who can help you handle variance over time and create reasonable thresholds for detecting change in a highly volatile environment. There are absolutely ways to do this; you should just know that it’s incredibly easy to generate false effects and find false signals.
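As a sketch of what the 2) route can look like, here is a minimal segmented regression (interrupted time series) on a weekly org-level metric, assuming a single rollout date. The file name, columns, and date are all hypothetical placeholders, and a real analysis would also need to handle seasonality, autocorrelation, and changes in how the metric itself is defined:

```python
# Minimal sketch: segmented regression on a weekly org-level metric (e.g. average
# review rate) before and after an AI tooling rollout.
# "weekly_review_rate.csv", "week", "review_rate", and the rollout date are
# hypothetical placeholders.
import pandas as pd
import statsmodels.formula.api as smf

weekly = pd.read_csv("weekly_review_rate.csv", parse_dates=["week"]).sort_values("week")
rollout = pd.Timestamp("2024-06-01")

weekly["t"] = range(len(weekly))                          # overall time trend
weekly["post"] = (weekly["week"] >= rollout).astype(int)  # level shift at rollout
weekly["t_post"] = weekly["t"] * weekly["post"]           # slope change after rollout

model = smf.ols("review_rate ~ t + post + t_post", data=weekly).fit()
print(model.summary())  # "post" ~ immediate shift, "t_post" ~ change in trend
```

The value here is less in the point estimate than in being forced to say what “took a hit” would even mean against the pre-existing trend.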
Because it is so easy to fool yourself this way, and lest you make the rookie mistake of thinking 2) is always the easy way out, I will tell you it is often harder than getting more specific. You will be tempted to think things like “but I get to sit at my computer and model big data instead of talking to people about how they use AI (ugh) and my bosses will think the human stuff isn’t as impressive as the data science stuff (uuuuugh).” I have had this thought myself, but this is the mind-killer. Big data is terrible when you don’t know what’s really happening. Everything will look significant. You can get stuck in a cave of all signal. You’ll almost certainly need to think about multivariate, multilevel models of data over time, and you’ll need to do detective work about the data’s gaps. You will need to look at what is likely highly variable data and pay attention to the fact that defining “a significant change” is actually quite difficult. You will realize the wrong things have been measured. You might even discover horrific-underbelly kinds of things in your organization’s data, like the fact that everyone assured you that software processes were totally instrumented and observed and visible and traceable, and that this is a wet sack of lies (never, not one single time, have I consulted with an engineering organization on their evidence and found that they had all the data they thought they had).
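And when I say multilevel models of data over time, I mean something in the direction of this sketch: the same kind of metric, but modeled per team with random intercepts, so team-to-team heterogeneity isn’t flattened into one org-wide average. Again, the file, columns, and rollout date are hypothetical placeholders:

```python
# Minimal sketch of a multilevel (mixed-effects) model: a weekly metric per team,
# with a random intercept per team, estimating the average within-team shift
# after rollout. "team_weekly_metrics.csv" and all column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

panel = pd.read_csv("team_weekly_metrics.csv", parse_dates=["week"])
panel["post"] = (panel["week"] >= pd.Timestamp("2024-06-01")).astype(int)

# Random intercept per team; the "post" coefficient is the average shift, and the
# group variance tells you how much teams differ from one another.
model = smf.mixedlm("review_rate ~ post", data=panel, groups=panel["team"]).fit()
print(model.summary())
```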
In fact, for as much money as many organizations spend on doing the thirty-millionth version of this type 2) study, you could probably run a couple of well-scoped pilot intervention studies of the 1) train & constrain type, but perhaps that’s my bias (intervention scientist really enjoys thinking about interventions: shocker).
2. Don’t Assume the Size, Shape, and Stability of Effects
Failure to appreciate task heterogeneity is also one of the biggest things developers criticize (rightfully) about AI research studies right now, and it is one of my own pet peeves. But another goddamn issue with this whole “what is agentic coding doing in software organizations” thing is that you can’t assume an effect that happens at first is going to stay the same. Yes, it’s getting absurd, but expecting change is the path to enlightenment.
Effects don’t stay the same. If you attended that monthly meeting and listened to your evidence peers, you would hear about plenty of interventions that seem to have a big effect, which then stabilizes or even vanishes. For example, when a new drug for depression enters the market, we frequently see that people have wonderful results from it. But then, the efficacy seems to plummet. There’s something buried in that situation: it was a new drug entering the market, and people get excited about new drugs. Depression medicine is infamously prone to placebo effects because depression (for a lot of cases; it’s a very big category, oh jfc there’s that heterogeneity again) is highly responsive to what people expect and think about. Just participating in a clinical trial for a depression drug is itself experienced as a positive intervention by a lot of people, because you get to go to a center where all the doctors and scientists care about you and your particular treatment-resistant depression, which is obviously very uplifting for some people, so even if they actually aren’t getting a medicine at all, they could experience a positive change in their depressive symptoms.
This is one of a thousand examples of a confound, and a reason that scientists get really unbelievably excited about the possibility of comparison and control groups. Even if you can't assign people to do things, you should think about whether your treatment group is really the same as your comparison group. There are undoubtedly placebo effects happening all the time from developer tools, although I’m not even sure placebo is the right word – there are effects happening based on how we are thinking about our tools, not just the tools themselves.
Another good one to think about? Novelty effect. If you introduce a new math app into a classroom and compare it to the old, boring way of teaching, a thing that might happen is that teachers are very excited about the new math app and so they also teach better. Students sense this, and engage more deeply, and believe that they’re capable of learning a little more. You, the researcher measuring the impact of a math app, are happy to report that the math app looks like it’s having incredibly good results on student learning, and you go on your merry way and roll the math app out to a hundred schools. A year later, the student learning scores are just the same as they were with the old method, because all that novelty is gone. It's not that something real didn't happen, it's just that the chain of causes is more complicated than the one we wanted it to be.
There’s some indication from emerging research on GitHub projects adopting Cursor (caveat: this is a very noisy view of the issue, in that individual-level effects cannot be estimated here, and estimates from mining GitHub should be undertaken with caution because inside-organization data might look very different) that when AI tooling is introduced into repos we might expect to see an initial spike in velocity that then goes way down, while task complexity changes (possibly because people are taking on different types of tasks than they would have without assistance). I think this is characteristic of changes to workflows in general and maps onto a reasonable hypothesis. We might see very large changes to the amount of code being generated, but we might also expect that number to drop as practices stabilize. You can’t know, and you shouldn’t make your estimate of a “good change” a static thing that you pick out of the first month of adoption. And thinking about the different-types-of-tasks piece shows you that a measure that meant one thing before AI (code complexity with everyone writing in their familiar language) can now mean something different (code complexity with everyone writing in new and unpracticed languages). Measurement is a practice, not a single thing, I’m so sorry.
By the way, here is another note for those of you who are going the applied-statistician route, or consulting with an evidence expert: ask us about the many statistical issues to check against when making an estimate of a pre/post change when the effects themselves are highly heterogeneous and there’s extreme temporal heterogeneity. When effects decay rapidly, we need to be super careful about setting policy on them and have a higher bar for showing meaningful change. 😬 I usually recommend that you plan to look at at least several quarters of software data, if not a full year, because seasonality and planning effects are so big in organizations.
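One cheap way to take your own temperature on decay, sketched below under the same hypothetical names as the earlier sketches: instead of a single pre/post number, look at the shift from baseline separately for each post-rollout quarter and see whether the early effect holds up:

```python
# Minimal sketch: per-quarter difference from the pre-rollout baseline, to see
# whether an early effect decays over time. File, columns, and rollout date are
# hypothetical placeholders.
import pandas as pd

weekly = pd.read_csv("weekly_review_rate.csv", parse_dates=["week"])
rollout = pd.Timestamp("2024-06-01")

baseline = weekly.loc[weekly["week"] < rollout, "review_rate"].mean()

post = weekly[weekly["week"] >= rollout].copy()
post["quarter"] = post["week"].dt.to_period("Q")

# A big first-quarter jump that shrinks in later quarters is a decay signal,
# not a license to set policy on the first month of adoption.
print((post.groupby("quarter")["review_rate"].mean() - baseline).round(2))
```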
But at least now you’ve thought of something you could incorporate into your measures: are developers doing new and unfamiliar tasks? Do they need learning support? Should I expect to shift time saved in code generation into new time needed in reviews? These are the practical recommendations you can begin to see take shape when you internalize that effects are dynamic over time, not magical fixed numbers.
We’re back to the recommendation I already made in the previous section: get more specific. Effects measured without a clear understanding of the mechanisms behind those effects are going to continually fall into these kinds of traps. Now, no project is perfect or will measure everything, but you should at least be thinking about the major confounds that might exist in your comparisons, and you should have at least some kind of idea of the mechanisms that seem reasonable. Why should the math app work better than the old method? Is it a principle of pedagogy? What behaviors are you expecting to emerge because you have introduced the math app?
Even just a little bit of thinking about this can put your project ahead of many others because once you think about a major confound you can try to measure and account for it at least a bit. This can also protect you from the mirror-image problem: not measuring a real effect when there actually is one.
3. Don’t Optimize for Individual Velocity Measures
This one is more opinionated, but it’s an educated opinion.
One of the issues we struggle against here is deeply rooted in our psychology: we want inherent properties of individuals to explain everything. However, in my experience working on evidence strategies with engineering organizations, this mental model can be a tremendous trap. You can assume that one of your secret jobs as an evidence designer is to puncture everybody’s fundamental attribution error. Tech loves its lone genius narratives, and it will try to find ways to shove that narrative into the AI era, no matter how ridiculous “throughput” measures start to look when code can be generated and the costs are all over the place.
Unfortunately, the bias toward individual-and-fixed-cognition explanations encourages micromanagement and turns managers into hall monitors of cognition, which not only feels icky to experience but also just isn’t going to get us the results we want. I deal with this a great deal in my book (local book author thinks you should read her book, news at eleven) and probably post about it ad nauseam, but I’m just going to keep saying it because it’s critical. Every measurement project is an opportunity to make an argument to your organization about what you should care about, and we should care about collective outcomes.
What developers need is time and space to learn; what teams need is a beneficial culture that will reward them for thoughtfully learning and for experimenting in ways that are collectively innovative. And in fact, we know that rewarding heroes for doing things the most right is actually not always as good for collective intelligence as rewarding people who make the group more right, even when they are individually less accurate themselves. You can measure learning culture in your org with my free measures, which we tested across thousands of folks; you can also develop your own specific and situated ways to measure something like learning culture in your org.
[Embedded TikTok from @dr_cat_thinks: research on collective innovation suggests that focusing on group performance, and rewarding people who make the group better even when they aren’t the highest individual performers, is often a better and more adaptive strategy than rewarding heroes, experts, and lone geniuses.]
This is one reason I steer many of the leaders I talk to into thinking about how we are asking developers to tell us about their culture and reflect on the shared environment, rather than looking at all developer experience measures as being about peering into people’s heads. This is also what self-report is better for! We’re not good reporters of our own learning because we don’t observe it very well; we are good reporters of the psychological affordances that our team is surrounding us with. This is yet another lesson of the assistive tech era – “let machines do what machines are good at, and spend our human energy on the things humans are good at” goes for data collection too. You can still use software metrics, especially as they feel powerful and defensible to you, but you better use the right interpretation of them.
9 times out of 10 I advise my evidence strategy clients to combine qualitative interviews, focus groups, experience sampling, and surveys with those comfy-feeling Big Data trace measures like software metrics. You can have it all. They can make each other more valuable, and they can shift your entire investigation toward the environment. Human beings are not good at counting and reporting their heartbeats over time, so we created ways to quietly capture that trace measure in the background, continuously, providing a depth of data a human could never produce on their own. But human beings have incredible sensemaking capacities, which reveal themselves on well-designed measurement instruments.
That well-designed part isn’t optional. To get good data about what's happening in our workplaces, we need implementation science first and foremost. You might ask yourself questions like: Why should people trust you enough to tell you what's happening to them? Why should they be honest with you? How have you helped all responders understand your measure the same way? How have you shown them what your intentions are? How have you made it easy for people in your organization to generously share their experiences? Who might be systematically less likely or able to respond? Can you go out of your way to compensate for that? What happens to their data after your project? How have you committed to benefitting them?
These questions are the ones that will take you from seeing measurement as a single task to seeing it as it really is: a practice that you need to have a relationship with. And that brings us, finally, after all these don’ts, to a do. Do think of measurement as a practice in your work. Do tell people what you are doing, and why. Do admit to yourself that measurement is hard and scary, and particularly hard and scary when things are as high stakes as they are right now. Do take measurement seriously as an opportunity to advocate for the things you believe in, rather than a test you’re going to try to get through as quickly as you can. Do the pullup and try to activate those lat muscles. I believe that you are stronger than you think.