thinking about building evaluation ownership, theories of change — back from canadian evaluation society

this week i had the pleasure of attending the canadian evaluation society (#EvalC2015) meeting in montreal, which brought together a genuinely nice group of people thinking hard not just a-boot evaluation strategies and methodologies but also about how evaluation can contribute to better and more transparent governance, improving our experience as global and national citizens — that is, evaluation as a political (and even social justice) act as much as a technical one.

.

similarly, there was some good conversation around the balance between the evaluation function being about accountability versus learning and improvement and concern about when the pendulum swings too far to an auditing rather than an elucidating and improving role.

.

for now, i want to zoom in on two important themes and my own nascent reflections on them. i’ll be delighted to get feedback on these thoughts, as i am continuing to firm them up myself. my thoughts are in italics, below.

.

1. collaboration, neutrality and transparency

  • there were several important calls relating to transparency, including a commitment to making evaluation results public and taking steps to make sure citizens see these results (without influencing their interpretation of them or otherwise playing an advocacy role), and a call for decision-makers who claim to have made use of evidence to inform their decisions to be more open about how and which evidence played this role. this is quite an important point and it echoes some of the points suvojit and i made about thinking about the use of evaluative evidence ex ante. we’re continuing to write about this, so stay tuned.

.

  • there was quite a bit of pushback about whether evaluation should be ‘neutral’ or ‘arm’s length’ from the program — apparently this is the current standard practice in canada (with government evaluations). this pushback seems to echo several conversations in impact evaluation about beginning stakeholder engagement and collaboration far earlier in the evaluation process, including howard white’s consideration of evaluative independence.

.

  • part of the pushback on ‘arm’s length neutrality’ came from j bradley cousins, who will have a paper and ‘stimulus document’ coming out in the near future on collaborative evaluation that seems likely to be quite interesting. in another session, it was noted that ‘collaboration has more integrity than arm’s length approaches.’ i particularly liked the idea of thinking about how engagement between researchers and program/implementation folks could improve a culture of evaluative thinking and organizational learning — a type of ‘capacity building’ we don’t talk about all that often. overall, i am on board with the idea of collaborative evaluation, with the major caveat that evaluators need to report honestly about the role they play vis-a-vis refining program theory, refining the program contents, assisting with implementing the program, monitoring, etc.

.

2. building a theory of change and fostering ownership in an evaluation.

  • there was a nice amount of discussion around making sure that program staff, implementers, and a variety of stakeholders could “see themselves” in the theory of change and logic model/results chain. this meant not only that they could locate their own roles but also that these planning and communication tools reflected the language they were accustomed to using when talking about their work. ideally, program staff can also understand their roles and contributions in light of their spheres of direct and indirect influence.

.

  • john mayne and steve montague made some very interesting points about building a theory of change, which i will have to continue to process over the upcoming weeks. these include:
    • making sure to think about ‘who’ in addition to ‘what’ and ‘why’ — this includes, i believe, who is doing what (different types and levels of implementers) as well as defining intended reach, recognizing that some sub-groups may require different strategies and assumptions in order for an intervention to reach them.
      • as was noted “frameworks that don’t consider reach conspire against equity and fairness” because “risks live on the margin.” i haven’t fully wrapped my head around the idea of ‘theories of reach’ embedded or nested within the theory of change but am absolutely on-board with considering distributional expectations and challenges from the beginning and articulating assumptions about when and why we might expect heterogeneous treatment effects — and deploying quantitative and qualitative measurement strategies accordingly.
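
as a small aside, a minimal sketch of what pre-specifying one such heterogeneity assumption could look like in a quantitative model is below; the data, the hypothetical ‘remote’ sub-group indicator, and the effect sizes are all invented for illustration and are not from the conference discussion.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# simulated data, purely for illustration -- 'remote' stands in for a hypothetical
# hard-to-reach sub-group named in a theory of reach
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "treat": rng.integers(0, 2, n),
    "remote": rng.integers(0, 2, n),
})
# construct an outcome where the effect is weaker for the hard-to-reach group
df["y"] = 0.5 * df["treat"] - 0.4 * df["treat"] * df["remote"] + rng.normal(0, 1, n)

# pre-specified heterogeneity: does the effect differ for the hard-to-reach group?
model = smf.ols("y ~ treat * remote", data=df).fit(cov_type="HC1")
print(model.summary().tables[1])
```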

.

  • john mayne shared his early thinking that, for each assumption in a theory of change, the builders should articulate a justification for its:
    • necessity — why is this assumption needed?
    • realization — why is this assumption likely to be realized in this context?
    • this sounds like an interesting way to plan exercises towards collaboratively building theories of change (see the sketch just below).
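
to make the idea concrete for myself, here is a tiny, hypothetical sketch of how the two justifications might be recorded alongside each assumption; the structure and the example entry are mine, not drawn from mayne’s forthcoming paper.

```python
from dataclasses import dataclass

@dataclass
class Assumption:
    """one assumption sitting behind an arrow in a theory of change (hypothetical structure)."""
    statement: str
    necessity: str    # why is this assumption needed?
    realization: str  # why is it likely to be realized in this context?

# illustrative entry only -- the content is invented
session_assumption = Assumption(
    statement="trained health workers actually hold the monthly village sessions",
    necessity="without the sessions, the information link in the results chain never operates",
    realization="supervision visits and session stipends are budgeted for every district",
)
print(session_assumption)
```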

.

  • a productive discussion developed (fostered by john mayne, steve montague and kaireen chaytor, among others) around how to get program staff involved in articulating the theory of change. a few key points recurred — with strong implications for how long a lead time is needed to set up an evaluation properly (which will have longer-term benefits even if it seems slightly inefficient upfront):
    • making a theory of change and its assumptions explicit is part of a reflective practice of operations and implementation.
    • don’t try to start tabula rasa in articulating the theory of change (‘the arrows’) with the implementing and program staff. start with the program documents, including their articulation of the logic model or results chain (the ‘boxes’ in a diagrammatic theory of change) and use this draft as the starting point for dialogue.
    • it may help to start with one-on-ones with some key program informants, trying to unpack what lies in the arrows connecting the results boxes. this means digging into the ‘nitty-gritty’ micro-steps and assumptions, avoiding magical leaps and miraculous interventions. starting with one-on-ones, rather than gathering the whole group to consider the results chain, can help to manage some conflict and confusion and build a reasonable starting point — despite the fact that:
    • several commentators pointed out that it is unimportant whether the initial results chain is validated or correct — or was even set up as a straw-person. rather, what matters is having something common and tangible that can serve as a touchstone or boundary object in bringing together the evaluators and implementers around a tough conversation. in fact, having some flaws in the evaluators’ initial depiction of the results chain and theory of change creates opportunities for program staff to be the experts and correct these misunderstandings, helping to ensure that program staff are not usurped in the evaluation design process.
    • initial disagreement around the assumptions (all the stuff behind the arrows) in the theory of change can be productive if it is allowed to lead to dialogue and consensus-building. keep in mind that the theory of change can be a collaborative force. as steve montague noted, “building a theory of change is a team sport,” and it needs to be an iterative process between multiple stakeholders all on a ‘collective learning journey.’
      • one speaker suggested setting up a working group within the implementing agency to work on building the theory of change and, moreover, to make sure that everyone internally understands the program in the same way.
  • this early engagement work is the time to get construct validity right.
  • the data collection tools developed must, must, must align with the theory of change developed collectively. this is also a point shagun and i made in our own presentation at the conference, where we discussed our working paper on meaningfully mixing methods in impact evaluation. stay tuned!
  • the onus is on the evaluator to make sure that the theory of change is relevant to many stakeholders and that the language used is familiar to them.
  • there was also a nice discussion about making sure to get leadership buy-in and cooperation early in the process on what the results reporting will look like. ideally the reporting will also reflect the theory of change.

.

overall, much to think about and points that i will definitely be coming back to in later work. thanks again for a great conference.

oops, got long-winded about ‘median impact narratives’

i finally got around to reading a post that had been flagged to me a while ago, written by bruce wydick. while i don’t think the general idea of taking sampling and representativeness seriously is a new one, the spin of a ‘median narrative’ may be quite helpful in making qualitative and mixed work more mainstream and rigorous in (impact) evaluation.

.

anyway, i got a bit long-winded in my comment on the devimpact blog site, so i am sticking it below as well, with some slight additions:

first, great that both bruce and bill (in the comments) have pointed out (again) that narrative has real value in (impact) evaluation. this is true not just as a sales hook or a way of helping the audience understand a concept — but because it is critical to getting beyond ‘did it work?’ to ‘why/not?’

i feel bill’s point (“telling stories doesn’t have to be antithetical to good evaluation”) should be sharper — it’s not just that narrative is not antithetical to good evaluation but, rather, that it is constitutive of good evaluation and any learning and evidence-informed decision-making agenda. and bill’s right, part of the problem is convincing a reader that it is a median story that’s being told when an individual is used as a case study — especially when we’ve been fed outlier success stories for so long. this is why it is important to take sampling seriously for qualitative work and to report on the care that went into it. i take this to be one of bruce’s key points and why his post is important.

.

i’d also like to push the idea of a median impact narrative a bit further. the basic underlying point, so far as i understand it, is a solid and important one: sampling strategy matters for qualitative work and for understanding and explaining what a range of people experienced as the result of some shock or intervention. it is not a new point, but the re-branding has some important sex appeal for quantitative social scientists.

.

one consideration for sampling is that the same observables (independent vars) that drive sub-group analyses can also be used to help determine a qualitative sub-sample (capturing medians, outliers in both directions, etc). to the extent that theory drives what sub-groups are examined via any kind of data collection method, all the better. arthur kleinman once pointed out that theory is what helps separate ethnography from journalism — an idea worth keeping in mind.

a second consideration is in the spirit of lieberman’s call for nested analyses (or other forms of linked and sequential qual-quant work), using quantitative outcomes for the dependent variable to drive case selection, iterated down to the micro-level. the results of quantitative work can be used to inform the sampling of later qualitative work, targeting those representing the range of outcome values (on/off ‘the line’).
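
to make these two considerations concrete for myself, here is a rough sketch (with invented data, variable names, sample sizes, and cut-offs) of what covariate-based and residual-based selection of a qualitative sub-sample might look like; it is an illustration of the general idea, not a prescription.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# toy survey data, purely illustrative
rng = np.random.default_rng(1)
n = 500
df = pd.DataFrame({
    "hh_id": range(n),
    "treat": rng.integers(0, 2, n),
    "assets": rng.normal(0, 1, n),   # a baseline observable used in sub-group analyses
})
df["y"] = 0.4 * df["treat"] + 0.6 * df["assets"] + rng.normal(0, 1, n)

# consideration 1: pick cases by a baseline observable -- near the median and in both tails
ranked = df.sort_values("assets").reset_index(drop=True)
median_cases = ranked.iloc[len(ranked) // 2 - 2 : len(ranked) // 2 + 3]
tail_cases = pd.concat([ranked.head(3), ranked.tail(3)])

# consideration 2 (lieberman-style): fit the quant model, then pick cases on/off 'the line'
fit = smf.ols("y ~ treat + assets", data=df).fit()
df["abs_resid"] = (df["y"] - fit.fittedvalues).abs()
on_the_line = df.nsmallest(5, "abs_resid")   # well explained by the model
off_the_line = df.nlargest(5, "abs_resid")   # poorly explained; candidates for 'why?' interviews

qual_sample = pd.concat([median_cases, tail_cases, on_the_line, off_the_line]).drop_duplicates("hh_id")
print(qual_sample[["hh_id", "treat", "assets", "y"]])
```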

.

both of these considerations should fit within a framework that recognizes that qualitative work has its own versions of representativeness (credibility) as well as power (saturation) (which i ramble about here).

.

finally, in all of this talk about appropriate sampling for credible qualitative work, we need to also be talking about credible analysis and definitely moving beyond cherry-picked quotes as the grand offering from qualitative work. qualitative researchers in many fields have done a lot of good work on synthesizing across stories. this needs to be reflected in ‘rigorous’ evaluation practice. qualitative work is not just for pop-out boxes (i go so far as to pitch the idea of a qualitative pre-analysis plan).

.

thanks to both bruce and bill for bringing attention to an important topic in improving evaluation practice as a whole — both for programmatic learning and for understanding theoretical mechanisms (as levy-paluck points out in her paper). i hope this is a discussion that keeps getting better and more focused on rigor and learning as a whole in evaluation, rather than quant v qual.

dear indian (and other) pharmaceutical manufacturers who produce birth control.

one of the greatest behavioral nudges of all time is the week of different-colored placebo pills in a package of birth control to allow you to just keep taking pills and always be on the right schedule.

please adopt this innovation. it cannot alter costs that much.

sincerely, spending too much time scheduling on google calendar now that my US birth control pills have run out

small sunday morning thoughts on external validity, hawthorne effects

recently, i had the privilege of speaking about external validity at the clear south asia m&e roundtable (thank you!), drawing on joint work i am doing with vegard iversen on this question of when and how to generalize lessons across settings.

.

my main musing for today is how the conversation at the roundtable, as well as so many conversations that i have had on external validity, always bend back to issues of monitoring and mixed-methods work (and reporting on the same) throughout the course of the evaluation.

.

my sense is that this points to a feeling that a commitment to taking external validity seriously in study design is about more than site and implementing partner selection (doing so with an eye towards representativeness and generalization, especially if the evaluation has the explicit purpose of informing scale-up).

.

it is also about more than measuring hawthorne effects or trying to predict the wearing off of novelty effects and the playing out of general equilibrium effects should the program be scaled-up  — though all these things are clearly important.

.

the frequency with which calls for better monitoring come up as an external validity concern suggests to me that we need to take a hard look at what we mean by internal validity. in a strict sense, internal validity relates to the certainty of the causal claim of program ‘p’ on interesting outcome ‘y.’ but surely this also includes a clear understanding of what program ‘p’ itself is — that is, what is packed into the treatment variable behind that beta in the regression, which is likely not to have been static over the course of an evaluation or uniform across all implementation sites.

.

this is what makes monitoring using a variety of data collection tools and types so important — so that we know what a causal claim is actually about (as cartwright, among others, has discussed). this is important both for understanding what happened at the study site itself and for trying to learn from a study site ‘there’ for any work another implementer or researcher may want to do ‘here.’ some calls for taking external validity seriously seem to me to be veiled calls for re-considering the requirements of internal validity (and issues of construct validity).
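
purely as a sketch of the point (and not anything proposed at the roundtable), here is one way monitoring data on implementation fidelity might be read alongside the headline estimate, so that the ‘p’ in the causal claim is less of a black box. the site counts, fidelity measure, and effect sizes are invented, and the treatment-by-fidelity term is descriptive rather than a clean causal decomposition (fidelity is not randomly assigned).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# illustrative data: the 'program' was not delivered uniformly across sites
rng = np.random.default_rng(2)
n_sites, n_per = 20, 50
site = np.repeat(np.arange(n_sites), n_per)
fidelity = np.repeat(rng.uniform(0.3, 1.0, n_sites), n_per)  # hypothetical monitoring measure
treat = rng.integers(0, 2, n_sites * n_per)
y = 0.6 * treat * fidelity + rng.normal(0, 1, n_sites * n_per)  # effect scales with what was delivered
df = pd.DataFrame({"y": y, "treat": treat, "fidelity": fidelity, "site": site})

# headline estimate: what 'p' is this, exactly?
itt = smf.ols("y ~ treat + C(site)", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["site"]})

# the same data read alongside monitoring: effect allowed to vary with the delivered dose
# (descriptive only -- fidelity may be correlated with other site characteristics)
by_fidelity = smf.ols("y ~ treat * fidelity", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["site"]})

print(itt.params["treat"], by_fidelity.params["treat:fidelity"])
```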

.

as a side musing, here’s a question for the blogosphere: we usually use ‘hawthorne effects’/observer effects (please note, named for the factory where the effect was first documented, not for some elusive dr. hawthorne) to refer to changes in participant/subject/beneficiary behavior that occur strictly because they are being observed (outside of the behavior changes intended by the intervention itself).

.

but in much social science and development research, (potential) beneficiaries are not the only ones being observed. so too are implementers, who may feel more pressure to implement with fidelity to protocol, even if the intervention doesn’t explicitly alter their incentives to do so. can we also consider this a hawthorne effect? is there an existing term for observer-effects-on-implementers? surely such an effect must be one of the potential lessons from the recent paper on how impact evaluations help deliver development projects?

something to ponder: cataloging the evaluations undertaken in a country

working my way through “demand for and supply of evaluations in selected sub-saharan african countries,” which is a good read. there are several great points to note and consider but just one that i want to highlight here:

in no country was there a comprehensive library of evaluations that had been undertaken [there]

this seems like something that should change, as an important public good and source of information for national planning departments. i wonder if ethics review / institutional review bodies that work to register and approve studies may be able to take on some of this function.

gem from the anti-politics machine: they only seek the kind of advice they can take

i am starting to re-read the anti-politics machine after some time… and, of course, started with the epilogue — the closest ferguson comes to giving advice from his vivisection. here’s a gem that remains relevant ten-plus years later, in spite of major political changes in southern africa:

certainly, national and international ‘development’ agencies do constitute a large and ready market for advice and prescriptions, and it is the promise of real ‘input’ that makes the ‘development’ form of engagement such a tempting one for many intellectuals. these agencies seem hungry for good advice, and ready to act on it. why not give it?

.

but as i have tried to show, they only seek the kind of advice they can take. one ‘developer’ asked my advice on what his country could do to ‘help these people.’ when i suggested that his government might contemplate sanctions against apartheid, he replied, with predictable irritation, ‘no, no! i mean development!’

.

the only ‘advice’ that is in question here is advice about how to ‘do development’ better. there is a ready ear for criticisms of ‘bad development projects,’ so long as these are followed up with calls for ‘good development projects.’

what does it mean to do policy relevant evaluation?

a different version of this post appears here.

for several months, i have intended to write a post about what it actually means to do research that is ‘policy relevant,’ as it seems to be a term that researchers can self-ascribe* to their work without stating clearly what this entails or if it is an ex ante goal that can be pursued. i committed to writing about it here, alluded to writing about it here, and nearly stood up to the chicken of bristol in the interim. now, here goes a first pass. to frame this discussion, i should point out that i exist squarely in the applied space of impact evaluation (work) and political economy and stakeholder analysis (dissertation), so my comments may only apply in those spheres.

.

the main thrust of the discussion is this: we (researchers, donors, folks generally bought into the evidence-informed decision-making enterprise) should parse what passes for ‘policy relevant’ into ‘policy adjacent’ (or ‘policy examining?’) and ‘decision relevant’ (or ‘policymaker-relevant’) so that it is clear what we are all trying to say and do. just because research is conducted on policy does not automatically make it ‘policy relevant’ — or, more specifically, decision-relevant. it is, indeed, ‘policy adjacent,’ by walking and working alongside a real, live policy to do empirical work and answer interesting questions about whether and why that policy brought about the intended results. but this does not necessarily make it relevant to policymakers and stakeholders trying to make prioritization, programmatic, or policy decisions. in fact, by this point, it may be politically and operationally hard to make major changes to the program or policy, regardless of the evaluation outcome.

.

this is where more clarity (and perhaps humility) is needed.

.

i think this distinction was, in part, what tom pepinsky wrestled with when he said that it was the murky and quirky (delightful!) questions “that actually influence how they [policymakers / stakeholders] make decisions” in each of their own murky and quirky settings. these questions may be narrow, operational, and linked to a middle-range or program theory (of change) when compared to grander, paradigmatic questions and big ideas. (interestingly, and to be thought through carefully, this seems to be the opposite of marc bellemare’s advice on making research in agricultural economics more policy-relevant, in which he suggests pursuing bigger questions, partially linked to ag econs often being housed in ‘hard’ or ‘life’ science departments and thus dealing with different standards and expectations.)

.

i am less familiar with how tom discusses what is labelled as highly policy-relevant (the TRIP policymaker survey and seeing whether policymakers are aware of a given big-thinking researcher’s big idea) and much more familiar with researchers simply getting to declare that their work is relevant to policy because it is in some way adjacent to a real! live! policy. jeff hammer has pointed out that even though researchers in some form of applied work on development are increasingly doing work on ‘real’ policies and programs, they are not necessarily in a better position to help high-level policymakers choose the best way forward. this needs to be taken seriously, though it is not surprising that a chief minister is asking over-arching allocative questions (invest in transport or infrastructure?) whereas researchers may work with lower-level bureaucrats and NGO managers or even street-level/front-line workers, who have more modest goals of improving the workings and (cost-)effectiveness of an existing program or trying something new.

.

what is decision-relevant in a particular case will depend very much on the position of the stakeholder with whom the researcher-evaluator is designing the research questions and evaluation (an early engagement and co-creation of the research questions and plan for how the evidence will be used that i consider a pre-req to doing decision-relevant work — see, e.g., the beginning of suvojit‘s and my discussion of actually planning to use evidence to make decisions). intention matters in being decision-relevant, to my way of thinking, and so, therefore, does deciding whose decision you are trying to inform.

.

i should briefly say that i think plenty of policy-adjacent work is immensely valuable and useful in informing thinking and future planning and approaches. one of my favorite works, for example, the anti-politics machine, offers careful vivisection (as ferguson calls it) of a program without actually guiding officials deciding what to do next. learning what is and isn’t working (and why) is critically important. his book is a profound, policy-adjacent work (by being about a real program) but it did not set out to be directly decision-relevant nor is it. the book still adds tremendous value in thinking about how we should approach and think about development but it is unlikely that a given bureaucrat can use it to make a programmatic decision.

.

but here is where i get stuck and muddled, which is one of the reasons i put off writing this for so long. at some stage of my thinking, i felt that being decision-relevant, like being policy-adjacent, required working on real, live policies and programs. in fact, in a july 2014 attempt at writing this post, i was quite sympathetic to howard white’s argument in a seminar that a good way to avoid doing ‘silly IE’ (sillIE©?) is to evaluate real programs and policies, even though being about a real program is not an automatic buffer against being silly.

.

but i increasingly wonder if i am wrong about decision-relevance. instead, the main criterion is working with a decision-maker to sort out what decision needs to be made. one possible outcome of such an evaluation is learning that a particular way forward is definitely not worth pursuing, meaning that there is a serious and insurmountable design failure (~in-efficacy) as opposed to an implementation failure (~in-effectiveness). a clear-cut design failure firmly closes a door on a way forward, which is important in decision-making processes (if stakeholders are willing to have a closed door be a possible result of an evaluation). for example, one might (artificially) test a program or policy idea in a crucial or sinatra case setting — that is, if the idea can’t make it there, it can’t make it anywhere (gerring, attributed to yates). door closed, decision option removed. one might also want to deliver an intervention in what h.l. mencken called a ‘horse-doctor’s dose‘ (as noted here). again, if that whoppingly strong version of the program or policy doesn’t do it, it certainly won’t do it at the more likely level of administration. a similar view is expressed in running randomized evaluations, which notes that ‘proof-of-concept evaluations’ can show that even “a gold-plated, best-case-scenario version of the program is not effective.” door closed, decision option removed.

.

even more mind-bending, ludwig, kling, and mullainathan suggest that researchers may approximate the ‘look’ of a policy to test the underlying mechanism (rather than the entirety of the policy’s causal chain and potential for implementation snafus) and, again, directly inform a prioritization, programmatic, or policy decision. as they note, “in a world of limited resources, mechanism experiments concentrate resources on estimating the parameters that are most decision relevant,” serving as a ‘first screen’ as to whether a policy is even worth trying. again, this offers an opportunity to close a door and remove a decision option. it is hard to argue that this is not decision-relevant and would not inform policy, even if the experimental evaluation is not of a real policy, carried out by the people who would take the policy to scale, and so on. done well, the suggestion is (controversially) that if a mechanism experiment shows that, even under ideal or hyper-ideal conditions (and taking the appropriate time trajectory into account), a policy mechanism does not bring about the desired change, then that policy option could be dismissed on the basis of a single study.

.

but, the key criterion of early involvement of stakeholders and clarifying the question that needs to be answered remains central to this approach to decision-relevance. and, again, having an identified set of stakeholders intended to be the immediate users of evidence seems to be important to being decision-relevant. and, finally, the role of middle-range or programmatic theory (of change) and clearly identified mechanisms of how a program/policy is meant to lead to an outcome is critical in being decision-relevant.

.

to return to the opening premise, it does not seem helpful to label all evaluation research associated with a real-world policy or program as ‘policy relevant.’ it is often seen as desirable to be policy relevant in the current state of (impact) evaluation work but this doesn’t mean that all policy-adjacent research projects should self-label as being policy relevant. this is easy to do when it is not entirely clear what ‘policy relevance’ means and it spreads the term too thin. to gain clarity, it helps to parse studies that are policy adjacent from those that are decision-relevant. being relevant to decisions or policymakers demands not just stakeholder engagement (another loose term) but stakeholder identification of the questions they need answered in order to make a prioritization, programmatic, or policy decision.

.

there must, therefore, be clear and tangible decision-makers who intend to make use of the generated evidence to work towards a pre-stated decision goal — including a decision to shut the door on a particular policy/program option. while being policy-adjacent requires working alongside a real-world policy, being decision-relevant may not have to meet this requirement, though it does need to ex ante intend to inform a specific policy/program decision and to engage appropriately with stakeholders to this end.

.

this is far from a complete set of thoughts — i have more reading to do on mechanisms and more thinking to do about when murky and quirky decisions can be reasonably made for a single setting based on a single study in that murky and quirky setting. nevertheless, the argument that there should be some clear standards for when the term ‘policy relevant’ can be applied and what it means holds.

.

*in the same somewhat horrifying way that a person might self-ascribe connoisseur status or a bar might self-label as being a dive. no no no, vomit.