
Unfortunately, you have to think about cause

So somebody came to you (maybe it was your own mind, that traitor!) and said you need to think about cause. Maybe you are asking yourself how to remove friction or identify blockers that are disrupting your colleagues’ work. Maybe there is a DevEx pilot we said we would evaluate back in January and now it’s June and it’s pretty much feeling like it’s time to evaluate the thing and didn’t evaluation feel a hell of a lot more obvious back in January, did anybody write that down?? Maybe you are responsible for thinking about how your product is changing things for your users (I hope one of you is).

At the heart of much of this is cause. We want our actions to change something for the better in the world; we want to know that our product does a thing; we want our changes to matter to our outcomes. We want to know what causes what. Unfortunately, thinking about cause is notoriously awful. Scientists have been arguing about how best to do it for longer than the tech industry has existed and we don’t even agree with ourselves. Human societies at scale have been wrong about causal relationships that are now accepted, such as “our cars put lead into the environment” and “smoking causes cancer.”

But in software, causal questions are here, and you might be asked to take a causal stance. This might look like explaining things (“we've decided these factors in DevEx drive developer productivity!”) or, a little bit more secretly, describing things (de-escalation tip from Cat: if somebody is asking you an impossibly huge explanation question about an ultimate cause, see if you can turn it into a more descriptive investigation! “What are 4-5 things that all of our developers seem to coherently agree on as necessary for unlocking their productivity?” “What is the most immediately changeable major thing we can agree is blocking productivity?” Try it on your friends!).

This won’t be a textbook chapter about causal inference (although I have a couple to recommend further down). In this piece I want to share four things I’ve learned about our relationship to thinking about cause when it comes to software teams and innovation work.

1. Practical Cause

I do not believe, as some writing on systems theory[1] seems to propose or at the very least look down upon you for being so stupid as to not realize, that cause is unknowable. At least not all causes. Not the cause that we understand as an immediate construct for our investigation, the cause that causes us to attempt to change something, with the hope that this change will impact some part of the world in a better or a worse way. So, practical cause. Perhaps a better way to describe it is this: I think that thinking about cause is useful even when it’s never complete.

Not everyone thinks this. In a certain social situation (a conference panel, let’s say) we could find ourselves arguing until the heat death of the universe about how all things break down into molecules and atoms, or worse, something-something-something quantum physics and the spacetime continuum (catch me at Comic-Con, by the way! The continuum I’m currently most concerned with is the spectrum between conformity and courage).

To fill up that time, we could beat each other over the head with terms like emergence, and nonlinearity, and how dumb people are when they say root cause. When I first heard software measurement arguments around the idea of identifying cause I had a significant experience of déjà vu. I may not be a software engineer but I went to a liberal arts school with four years of philosophy requirements and had a drama scholarship in a theater program that included both improv and debate kids, so I have had my fair share of traumatic arguments about reductionism. I understand that people in these software arguments are more likely to have physics backgrounds but it’s pretty similar to the theater-philosophy kids if you ask me.

The thing is, those kids still flung themselves (dramatically) into chairs and expected the result of that motor movement to be met with the affordance of sitting while they told you that cause was unknowable and you were stupid for trying. So we seem to feel comfortable with some causal assumptions. Similarly, we often have to make a decision about what to do for ourselves and other people. In order to face this, and despite the best efforts of these environments to entomb and crystallize me in the bogs of uncertainty, I became an applied scientist, which is a job defined by people asking you to pitch them a causal story.

Causal stories are really, really hard in software. The tech industry has very little even basic evidence about what matters to the people in it, but it is changing things all of the time. It’s difficult to think clearly about cause when you’re getting thrown around inside a tornado of change (psychology would call this uncertainty, a complex dynamic state with its own characteristics for what we are able to do inside it; perhaps for physics friends, we could invoke words like Lagrangian, the funniest line in the current movie I like to watch on planes, Twisters). Doesn’t matter that they’re hard. I am somehow always finding myself entering these industries where people interrupt your theory with a harried look on their face and say, “Yes, fine, thanks for the new data, but what should I do?”

What should we do is the question on top of all software development questions, or inside of them, biding its time until you need a really good jump scare. If software development questions were a state fair, what should we do would be the batter everything is deep-fried in, omnipresent no matter the extraordinary range of the other ingredients: Oreos! Pickles! A piece of pie! Ice cream! A cappuccino! TDD! Agile! Pair programming! Mobbing! Solituding![2] And this is a question that flings you into the world of cause and our mental models about it.

Practical cause is concerned with questions like: Does this thing seem reliably connected to the outcome I want? If so, why do I believe this works? When does this work? What is the mechanism by which this works? Is this really the thing that is making the other thing work or is there another thing, always here, never measured, that could be it too? What do we not know about our environment? What could be the next thing we map? Can we save those maps over time, such that we are beginning to build a cumulative library of evidence? Can we use real theory building to steer our investigations, even the scrappy ones? To not go insane I find it helpful to realize that with a practical cause approach in tech we are very often concerned with the relationships between two variables, for instance, “what learning about one variable tells you about the other,” rather than ultimate cause.
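
As a minimal, hedged illustration of that last point (every variable name and number below is invented for the example), the practical-cause starting point often looks like nothing fancier than checking how much knowing one variable tells you about another, before any causal story gets attached:

```python
# Illustrative sketch with made-up data: before any causal story, just check
# how reliably two variables move together. Association is not cause, but it
# is where the practical-cause questions above begin.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical weekly data for 40 teams: hours lost to flaky CI, and a
# self-reported productivity score (both names invented for this example).
ci_hours_lost = rng.normal(5, 2, size=40)
productivity = 80 - 2.5 * ci_hours_lost + rng.normal(0, 5, size=40)

r = np.corrcoef(ci_hours_lost, productivity)[0, 1]
print(f"correlation: {r:.2f}")  # a strong association; the causal questions start here
```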

2. Helpful Weirdnesses

Measuring and determining cause in a scientific way is very difficult. But we can develop our intuitions about cause in our everyday lives. I believe anyone can cultivate greater sensitivity about cause that is immediately actionable. Encouraging this for your team, colleagues, and community groups is engaging in a form of distributed cognition. Making it ok to raise counterfactuals is a big part of this: maybe together we can design better tests for our evidence. I think of this intuition like letting yourself be aware of helpful weirdness: the rough edges that poke holes in your intervention evidence, the behaviors you puzzle over.

Some of these have been well documented through painful experience by social scientists. Here’s an example. When I talk to developer experience teams about assessing first whether they’ve created a change at all, and second what the nature of that change might be, I often tell them about the novelty effect. Your observation of your intervention will carry a number of other, possibly causal factors with it, including the fact that the thing you’ve introduced is, usually, new. Does a brand-new math app deployed in a classroom really supercharge children’s motivation, or is it simply the fact that teachers and students have received something new in their environment, so they’re engaged in a way they won’t be six months from now (novelty effects tend to attenuate)? When we operationalize and measure some element of student engagement (logging into the app? Time spent on math quizzes?), it could be that what we observe is being caused by the clever learning science design of the app but it could also be that the size of change we observe would have happened with any new app at all (we often call this a counterfactual: the unobserved state of the world that might provide important evidence about our supposed effect). By allowing yourself to think about this, you start to realize that “we saw a change happen” will not be precise enough evidence to answer whether or not what we see is a result of app design, or mere novelty.

An intuition-booster: notice when we are making an argument about a quite specific thing (the learning science design of the app!) but measuring it at a higher level than our claim (using the app at all, which from the person-and-behavior pov includes all kinds of behaviors like sitting quietly, having a free thirty minutes in the lesson plan, looking at the colors of the app, doing a trivia game that your product team inexplicably put in there, and any number of other things. These are all real examples from consulting I did on research for edtech that stay within the app use itself and don't get into larger factors like student familiarity with apps, or other edtech being used, all of which are sources of a social scientist’s favorite word: heterogeneity).

Ok, let's intuition-boost some more: we can learn to break this change observation down even further by imagining more weirdness. What if some portion of that change we saw in student engagement is due to the novelty effect, and some portion is due to the learning science design? That is a believable scenario. How might we puzzle out these differences? Well, one scrappy and observational way would be to continue to measure the engagement over time, long after we would expect the novelty effect to dissipate but the learning science pattern to still be in place. A better resourced, higher standard of evidence way might be to consider control conditions: can we test multiple interventions in this environment, in a way that separates out the things we want to test and holds other changes equivalent? For instance, a suite of apps, some of which have a certain learning science engagement design, others of which do not?
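
To make the keep-measuring-after-the-novelty-wears-off idea concrete, here is a toy simulation; every number in it is invented and real engagement data would be far noisier, but it shows why an effect that is pure novelty and an effect that is persistent design look identical at launch and very different a year later:

```python
# Toy simulation, all numbers invented: novelty effects fade, design effects persist.
import numpy as np

months = np.arange(0, 13)                           # a year of observation
baseline = 50.0                                     # pre-intervention engagement score
novelty = 10.0 * np.exp(-months / 2.0)              # novelty bump, fading over ~2 months
design = np.full(months.shape, 8.0)                 # persistent learning-design effect

scenarios = {
    "novelty only": baseline + novelty,
    "design only": baseline + design,
    "both": baseline + novelty + design,
}

for label, series in scenarios.items():
    print(f"{label:12s}  month 0: {series[0]:5.1f}   month 12: {series[-1]:5.1f}")

# At month 0 every scenario looks like "the app worked." By month 12 the
# pure-novelty app has drifted back toward baseline, which is exactly what the
# scrappy keep-measuring-over-time approach is hoping to catch.
```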

Nurture your intuition further by expanding your scope of inquiry. Novelty effects also can kick off other effects: let’s say simply by purchasing an expensive and shiny app, a school district has directed attention to the value of changing things in learning and signaled its willingness to invest in learning.[3] From this, teachers might take the cue that this is an important moment to try new techniques in the classroom, or at the very least, face their math curricula with some schedule changes, which could kick off further changes. What if the change we observe in student engagement is actually due to teachers’ increased motivation, or the change in class schedule – perhaps they are using the non-app time differently now? These additional sources of variation (heterogeneity!) in the environment and social factors draw our attention to the fact that there are not just many causes, but many changes happening in complex real-world interventions. Some of these may be measurable, and some may be controllable. It is likely that many won’t be, and so we need to be careful about understanding what we have directly observed and what plausible other explanations there are than the one we most prefer ("our incredible app is the sole driver of all of this!").

Practically, we might think about solutions like ensuring better control conditions, including a check that will capture some presence of confounding or unexplained other causes at a high level (say, "how much did you change your other lesson plans" for both our treatment and control conditions), and/or tackling this with qualitative follow-up (in the ideal situation we get to the whole toolkit). For an example of that second one, a DevEx team could include a high-level measure of Developer Agency (you can use mine) and realize that the development practice they're looking at seems to work really well for teams with high agency but terribly for teams with low agency (there is a whole world around understanding these pathways! In psych we tend to call these mediators and moderators). Or in a less severe case, that low agency seems to attenuate the otherwise positive effect of a practice, like water diluting the dose of a medicine. I have argued before that these types of gate-keeping affordances around what software developers perceive as possible in their environments are probably going to be very critical for developer experience teams to see in order to not get blamed for failures that aren’t their own making, and/or to receive credit for successes that are otherwise hard to see. If our hypothetical DevEx team gets to interview a sample of their users about agency, they may also develop good working hypotheses for the next why questions, therefore setting up better future tests, and now we're cooking with gas (/evidence. Is this still a good expression? After the great turn against gas stoves??).
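
Here is a rough sketch, with entirely simulated data and invented effect sizes, of what that agency-as-moderator pattern could look like in a scrappy first pass: the question is simply whether the practice’s apparent effect differs between low-agency and high-agency teams.

```python
# Simulated sketch (invented data and effect sizes): does a practice's apparent
# effect depend on Developer Agency? A moderator in miniature.
import numpy as np

rng = np.random.default_rng(1)
n = 200

practice = rng.integers(0, 2, n)    # 1 = team adopted the practice
agency = rng.integers(0, 2, n)      # 1 = high Developer Agency score
# Simulated outcome: the practice only helps when agency is high (an interaction).
outcome = 60 + 10 * practice * agency + rng.normal(0, 5, n)

for a in (0, 1):
    effect = (outcome[(practice == 1) & (agency == a)].mean()
              - outcome[(practice == 0) & (agency == a)].mean())
    print(f"apparent effect of the practice when agency={a}: {effect:+.1f}")

# Pooling everyone averages these two stories together and can hide the
# moderation entirely; splitting by agency is the pathway question in miniature.
```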

It is more helpful weirdness to ask things like: what do we compare to what to get our change number in the first place? A precipitous change can be a signal even if the raw number isn’t very different from another number that at a surface glance seems relevant as a comparison. Numbers are tricksy and evil little spirits who can only be held to account by strict mental prisons. For example, the average unemployment rate for CS undergrads can look very similar quantitatively to the average unemployment rate for all undergrads in general, but the difference we’re concerned with might not be CS compared to everyone but CS compared to itself, a change over time. You might think this is obvious but it is often not. I once spent a very stressful couple of hours convincing the leadership of a large school district that when they have a situation with five years of steadily declining test scores, a result of test scores holding steady rather than continuing to decline (in other words, a change in the slope or rate of decline, but not yet in direction) might actually be a very positive signal even though it wasn’t (yet!) an increase in test scores. These causal reasoning helpful weirdnesses matter, and teaching people to correct misconceptions of them matters, because with the wrong causal story, sometimes they fire all the heroic teachers or stop doing the thing that is actually working (another Cat protip: I have learned people far prefer to compare between groups than within groups over time even though the latter is often what we really need. When you are constantly getting badgered about comparing between people and groups, try pushing for a within-unit measure instead).
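
That school-district conversation boils down to arithmetic like this (numbers invented): a between-group snapshot makes the district look like a failure, while the within-group change in the rate of decline is the signal that actually matters.

```python
# Invented numbers: a within-group change in slope vs. a between-group snapshot.
# A district's scores fell 4 points a year for five years, then held steady.
scores = [80, 76, 72, 68, 64, 64, 64]                 # years 0..6; change lands at year 5

deltas = [later - earlier for earlier, later in zip(scores, scores[1:])]
print("year-over-year change:", deltas)               # [-4, -4, -4, -4, 0, 0]

# Between-group snapshot: a neighboring district sits at 75 this year, so our 64
# looks like failure. Within-group: the slope went from -4 per year to 0 per year,
# which is the change the intervention can plausibly claim, even with no increase yet.
```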

Is your head spinning? Good. I have worked on this kind of thing for years and my head spins when I think about it. Remember, we’re being thrown around by the tornado that is the complex world we're IN, and we're lucky if we can occasionally find the eye from which we can observe. While this is frustrating, thinking deeply about even just one bonus weirdness like the novelty effect helps us think more deeply about what we are really comparing to what. It opens up the space for your team/org/mind to admit that confounds exist. Turn the weird and disorienting into help. Take every skepticism about your evidence and use it to deepen your understanding. Maybe you will end up with unexpectedly vital findings, like the nearly universal finding in edtech that teacher buy-in and teachers getting the right help, which is also impacted by what they think a school is demanding from them, are necessary preconditions to get an app successfully deployed in a classroom and achieve the learning outcomes we want. Skepticism can also help us defuse one of the most dangerous bombs out there in our current evidence-claim-drenched landscape, a control that isn’t really a control. For instance: perhaps the change we observed in what we’re calling an RCT is happening because the treatment group gets teachers’ time and attention and the control group doesn't!

3. Unspeakable Causes

Some causes are like ghosts, haunting us. To get at these may require exorcism.

A lot of important causes face a steep uphill battle in becoming things we can talk about at all. Things that we have deep social pressures to not talk about or think about are like this. For instance, maintaining coal power sources often costs us more than shutting them down but people have been working for years to make that a say-able thing in policy. Currently one of the most promising causes in understanding our physical well-being, social determinants of health, is being censored out of scientists’ work.

Software has its own ghosts. In our work on AI Skill Threat, I argued that developers’ experience of identity threat is one important possible cause for some negative outcomes that happen to engineering orgs adopting AI. This is of course a proximate cause, a practical cause; we could go back to that heat-death-of-the-universe level argument and talk about what creates identity threat (we take a practical and not heat-death-level stab at this in the lit review of the paper by describing the psychological models we use to predict the functions of threat and ingroups). Evidence toward the closer causes can be important to create more accountability in an organization, and there is a causal stance needed for that: some policies and cultural norms can damage our innovation engine, identity threat damages people’s ability to innovate, and the way that damage is happening here is by spiking identity threat. This is an observational study between groups, so it has the usual threats to validity; listen, I would love to directly test an intervention that tries to increase belonging and learning culture within a software team and measure if this decreases AI Skill Threat within that same team, we could even cook up comparison groups with other interventions, email me!

Observational work can be a mighty way to make the causal ghosts visible to others, though. Developers are suffering under mandates for AI without clarity. In the case of my AI Skill Threat work, I knew we needed evidence fast and we needed it with a big group and we needed it on something big, the psychological factors that felt unspeakable. And exorcisms are made easier by bringing in the priests/scientists: we believed we were onto something with this proximate cause, “developers most likely experience significant threat when imagining their roles changing, and threat changes how you think and what you prioritize,” because despite the limits of our study this explanation is situated all the way back in the fundamentals of both evolution and culture: we need to avoid threat to survive, and we need to engage in social learning via cultures in order to flexibly optimize for problem-solving on this planet. When dealing with the scariest ghost, a priest (scientist) can connect you to the power of better-worked empirical theories, which is why it's paradoxically often most important to involve evidence people like us in your scrappiest projects, the ones where you finally got the chance to study something unusual, if only because I find these tend to be the projects where the ghostly causal questions seep in. Trying to build on theory greatly reduces our search space for good intervention levers on important but under-resourced issues.

In (simplistic) economics, ghost causes are often shoved into the convenient mental boundary of an externality. But in real life and software, everything is an externality of something: you, me, your children, that friend-of-a-friend who has a lung condition you didn’t know about that just couldn’t take the cumulative insults of living in a space where air quality rules have been suspended for some greater economic need. Here I suppose I've inadvertently written myself into sympathy with the “there is no root cause” stuff. Root cause is fundamentally relativistic, observable only from your position. This is the hardest topic in thinking about cause so I've put it last.

4. Causal Bundles, Evidence Ladders & Positionality

Cause does not care about our mental comfort. We are capable of greatly tricking ourselves in how we bundle things and what categories we construct and then forget we constructed and treat as real. We need to be careful about the way our thoughts stick together, one thing merging with another. I find knowing about three different but related ideas very helpful here: causal bundles, an evidence ladder, and positionality.

We’re frequently invoking causal bundles, not single causes (this isn’t a term I’ve seen anybody else use, it’s just what I call it). These can produce startling effects, sometimes paradoxes, that we misinterpret when we don’t think holistically enough about our bundles of causes. Fields with more immediately obvious stakes, or at least higher standards for themselves, such as fields that make claims about evidence for causes of death and changes in them, have learned this better than software engineering.[4] For instance as epidemiologist and real-world causal inference expert Dr. Ellie Murray wrote recently, despite our strongly held beliefs that rises in chronic illness and rises in mortality will hang together (true!), our moving the needle on child mortality can still be a causal factor in why we see more sick kids: “children who die cannot get sick.”
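
Murray’s point can be made with a few lines of invented arithmetic: if the frailest children are the ones most likely both to die and to develop chronic illness, then saving more of them raises the observed illness rate among living children even though nothing about the underlying disease has gotten worse.

```python
# Invented arithmetic for "children who die cannot get sick": lowering mortality
# among the frailest children raises the observed chronic-illness rate among
# living children, even with no change in underlying disease risk.
cohort = 1000
frail = 100                      # children at high risk of both death and illness
illness_rate_frail = 0.60
illness_rate_rest = 0.05

def observed_illness(mortality_among_frail):
    frail_survivors = frail * (1 - mortality_among_frail)
    survivors = (cohort - frail) + frail_survivors
    sick = (cohort - frail) * illness_rate_rest + frail_survivors * illness_rate_frail
    return sick / survivors

print(f"high-mortality era: {observed_illness(0.50):.1%} of living children are sick")
print(f"low-mortality era:  {observed_illness(0.05):.1%} of living children are sick")
# More children alive, and a higher observed illness rate: the bundle, not a worsening cause.
```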

Dr. Julia Rohrer writes eloquently about causal bundles and our relationship to them in her recent paper, Thinking Clearly About Age, Period, and Cohort Effects:

“Causal effects,” in turn, refers to differences in the world resulting from (often merely hypothetical) extremely precise interventions that change only one thing. For example, if this morning I had had muesli for breakfast instead of yogurt—but kept everything else about my morning routine up to that point the same—would I feel more or less hungry right now? It is admittedly strange to think about age (or period or cohort) in that manner. If this morning, I had suddenly aged by 10 years—but kept everything else up to the point the same, including my cohort—it is not immediately clear what would be different. Age is just a number, after all. To come up with a coherent conception of age effects as causal effects, one has to think of age as a variable that indexes all sorts of processes that, in turn, have causal effects

Almost every story in psychology that I can think of that moves large groups of people to action is really a story that indexes all sorts of processes. This becomes very very obvious when we realize that we are most often (in software real-world scenarios) dealing with evaluating effectiveness, not just efficacy. Effectiveness is a way of talking about the real-world performance of an intervention (or medicine, or “treatment,” the word social scientists like to steal from people in medical fields). Efficacy is an “ideal circumstances” treatment effect: this medicine works to prevent or change x disease, assuming we’re doing it “right.” Of course, the boundaries of right and ideal are quite loaded here. But just knowing how to distinguish between one or the other will set you ahead in many conversations.

Evidence ladders can help us understand where we are with regard to our understanding of the effectiveness of some intervention we want to make. I find this entire piece very helpful. Get into a safe space, maybe with a pet next to you for comfort, and take a look at Figure 2. Do we do any of this, in software? Do we place our evidence on this ladder, from problem definitions to systematic reviews to candidate identification to causal inference…? Notice that the real-world observation is used to create formal predictions before we move into large-scale testing.

I believe that evidence ladders like this can at least help us communicate more systematically about what we know and don’t know, the density and quality of evidence behind specific interventions we propose to impact developer experience. I also believe that most evidence around software work hangs out, definitively, no higher than level 4. Give me five years and five million dollars, and I can get us to those levels 8 and 9: first large-scale testing of solutions which were vetted in a variety of settings and which use the best stimuli, then deploying them in a crisis.

This genuinely matters to me; it is actually my dream for software science and for doing science for the humans of software, because finally getting to this level of causal insight can unlock a virtuous cycle. In their excellent but dense textbook (read it if you want to go deep), Counterfactuals and Causal Inference, Morgan & Winship write: “In general, research that takes account of heterogeneity by splitting treatment states into mutually exclusive component states will break new ground if sufficient data are available to estimate the more narrowly defined treatment effects” (p. 39). This is a rather mild side comment that says something exquisite. In plainer terms, moving up our evidence ladder gets us not just better tests of our causes but better causes altogether.

Finally, running throughout everything I have said about cause in this piece are implicit and explicit pointers to positionality. In response to my first essay in this project, I was asked whether I was writing "as a scientist," which is to say, without an opinion. As an action researcher this is fundamentally not a thing I believe is possible. You are in the world that you’re studying, you’re in the tornado. Scientific conventions of WEIRD extraction have not taken this seriously enough. And positionality does not only mean our visible or invisible identities. We also have positionality as members of our own organizations or communities, causal stories and mental models that we are dwelling in and investing in over time, which can greatly distort our view of evidence. I have written about this distortion that comes for us when we are members of an organizational body as Corporate Attribution Error. We also have positionalities as members of software communities, as generational cohorts, as users invested in certain technology patterns. These impact what we see and how we see, and we should challenge ourselves to bring in other positions. At the same time, we should draw deep from the wisdom that our positionality brings. What causes can only you surface?

While I believe positionality matters far beyond marginalized identities, it is of course in works that center marginalized perspectives where we will unsurprisingly find the most guidance for this and the most wisdom. Moving from known positionality is a place where marginalized communities are much more advanced, aware, and expert. Guyan writes in Queer Data, “Data helps name the problem and translate issues that affect many people into a format that becomes an evidence base for action. For groups of people that have been misrepresented, marginalized and oppressed by existing systems of power, data can offer a powerful tool in the fightback against inequality and injustice” (p. 168). Of course, the opposite can be true, because it is never just data. Walter & Andersen write, in Indigenous Statistics: “we argue that the researcher’s standpoint dictates how he or she makes sense of the many competing theoretical frames and therefore selects it (or them) as most appropriate for the research” (p. 54), and “We are not just researchers; we are socially located researchers” (p. 86) and “For researchers, explicating our social position can be a double-edged sword. On the one hand, spelling out who we are, who we think we are, and why provides insight across the multiple facets of our lives, and life biography allows us an understanding of why we are the scholars and the researchers that we are [...] On the other hand, explicating our social position publicly can make us vulnerable as both people and researchers” (pp. 86-87).

Ultimately, many investigations and stances about cause around developer experience and software teams remain unclear because they’re unclear about their positionality and their theory. While we sure love to throw around the word framework in this jawn, most software conversations are not truly interested in developing a theoretical framework. In A Problem in Theory Muthukrishna & Henrich write:

useful theoretical frameworks tell scientists not only what to expect, but also what not to expect. They show the interconnections between theories. Understanding the mechanisms behind one phenomenon informs research in other areas, often limiting the likely hypotheses or strongly favoring some hypotheses over others. Each empirical result reverberates through the interconnected web of our understanding of a domain.

In a more meta position about what holds us back from theorizing itself, Adam Fetterman writes about the positionality of researchers who are afraid to attempt large claims in the aftermath of social and cultural criticism in his provocative editorial Be Daring and Cause Trouble:

There is now a torrential river of manuscripts where authors are reluctant to make any claim at all about the human condition that extends beyond the...orthodox, textbook statistics, and measures, seeking refuge in elaborate details of the methods, statistical models, and open-science practices [...] However, even the most well-powered, transparent, preregistered, replicated, and confirmatory research, with the most complex and impressive models, is meaningless if it does not answer important questions about human psychology.

Software finds itself frequently stranded between these same hard places: too-confident folk theorizing without rigorous theoretical frameworks on one side of things, too-timid investigations that are afraid to attack the important questions of humans in software on the other. Against the weight of all this causal theory and turmoil I would add my applied psychologist voice here: we who are trying to rehumanize tech are not just curious about answers to causal questions. We are in desperate need of aid as we try to convince others of our answers and persuade them of the value of the lives they are not hearing about. That is why we are all out here unfortunately having to think about cause and make arguments about it even when the situation is imperfect and the evidence is incomplete. The evidence is incomplete and people are hurting. Being clearer about cause, climbing evidence ladders, and mindfully speaking from your own positionality are critical for good science and for winning the arguments you need to win when you’re trying to help people.


  1. This is a generalized hot take from me rather than something with a specific citation. I sort of hope that if I only say this term once and not three times into a mirror it will be ok. In Psychology we talk about “complex dynamic systems” but have a different set of references and history than the ones I’m still working through in a syllabus for software. At any rate, some of my closest friends are systems theorists! ↩︎

  2. Not a formalized practice I suppose but the presence of things like pair programming and mobbing always makes me want a name for when programmers code alone and try to get as little feedback as possible. Lone wolfing? ↩︎

  3. Or in the case of top-down mandates of edtech that don’t allow for teacher-student feedback and agency, perhaps what they are signaling is disinvestment from the decision making in learning, which could kick off disengagement in the environment. ↩︎

  4. Software engineering has plenty of death in it but software is often also a means of making death distal and its cause diffuse. There are causes though! For instance, something caused, and someone didn’t fix, a bug in a pharmacy benefits manager software, and that blocked me from getting a prescription to help me breathe right before my wedding. I did not die obviously but it sure increased the odds for a while. Maybe being exposed to some bugs is like being exposed to cigarette smoke. I would love to someday partner with an epidemiologist and think about software as a public health risk. ↩︎