Aside

Something to Ponder: Cataloging The Evaluations Undertaken in a Country

Working my way through “Demand for and supply of evaluations in selected Sub-Saharan African countries,” which is a good read. There are several great points to note and consider, but just one that I want to highlight here:

In no country was there a comprehensive library of evaluations that had been undertaken [there].

This seems like something that should change, as an important public good and source of information for national planning departments. I wonder if ethics review / institutional review bodies that work to register and approve studies may be able to take on some of this function.

What Does It Mean To Do Policy Relevant Evaluation?

A different version of this post appears here.

For several months, I have intended to write a post about what it actually means to do research that is ‘policy relevant,’ as it seems to be a term that researchers can self-ascribe* to their work without stating clearly what this entails or whether it is an ex ante goal that can be pursued. I committed to writing about it here, alluded to writing about it here, and nearly stood up to the chicken of Bristol in the interim. Now, here goes a first pass. To frame this discussion, I should point out that I exist squarely in the applied space of impact evaluation (work) and political economy and stakeholder analysis (dissertation), so my comments may only apply in those spheres.

The main thrust of the discussion is this: we (researchers, donors, folks generally bought into the evidence-informed decision-making enterprise) should parse what passes for ‘policy relevant’ into ‘policy adjacent’ (or ‘policy examining?’) and ‘decision relevant’ (or ‘policymaker-relevant’) so that it is clear what we are all trying to say and do. Just because research is conducted on policy does not automatically make it ‘policy relevant’ — or, more specifically, decision-relevant. It is, indeed, ‘policy adjacent,’ by walking and working alongside a real, live policy to do empirical work and answer interesting questions about whether and why that policy brought about the intended results. But this does not necessarily make it relevant to policymakers and stakeholders trying to make prioritization, programmatic, or policy decisions. In fact, by this point, it may be politically and operationally hard to make major changes to the program or policy, regardless of the evaluation outcome.

This is where more clarity (and perhaps humility) is needed.

I think this distinction was, in part, what Tom Pepinsky wrestled with when he said that it was the murky and quirky (delightful!) questions “that actually influence how they [policymakers / stakeholders] make decisions” in each of their own murky and quirky settings. These questions may be narrow, operational, and linked to a middle-range or program theory (of change) when compared to grander, paradigmatic questions and big ideas. (Interestingly, and to be thought through carefully, this seems to be the opposite of Marc Bellemare’s advice on making research in agricultural economics more policy-relevant, in which he suggests pursuing bigger questions, partly linked to agricultural economists often being housed in ‘hard’ or ‘life’ science departments and thus dealing with different standards and expectations.)

I am less familiar with how Tom discusses what is labelled as highly policy-relevant (the TRIP policymaker survey and seeing whether policymakers are aware of a given big-thinking researcher’s big idea) and much more familiar with researchers simply getting to declare that their work is relevant to policy because it is in some way adjacent to a real! live! policy. Jeff Hammer has pointed out that even though researchers doing some form of applied work on development are increasingly working on ‘real’ policies and programs, they are not necessarily in a better position to help high-level policymakers choose the best way forward. This needs to be taken seriously, though it is not surprising that a chief minister asks over-arching allocative questions (invest in transport or infrastructure?), whereas researchers may work with lower-level bureaucrats and NGO managers or even street-level/front-line workers, who have more modest goals of improving the workings and (cost-)effectiveness of an existing program or trying something new.

What is decision-relevant in a particular case will depend very much on the position of the stakeholder with whom the researcher-evaluator is designing the research questions and evaluation (an early engagement and co-creation of the research questions and of a plan for how the evidence will be used, which I consider a prerequisite to doing decision-relevant work — see, e.g., the beginning of Suvojit’s and my discussion of actually planning to use evidence to make decisions). Intention matters in being decision-relevant, to my way of thinking, and so, therefore, does deciding whose decision you are trying to inform.

I should briefly say that I think plenty of policy-adjacent work is immensely valuable and useful in informing thinking and future planning and approaches. One of my favorite works, for example, The Anti-Politics Machine, offers a careful vivisection (as Ferguson calls it) of a program without actually guiding officials deciding what to do next. Learning what is and isn’t working (and why) is critically important. His book is a profound, policy-adjacent work (by being about a real program), but it did not set out to be directly decision-relevant, nor is it. The book still adds tremendous value to thinking about how we should approach development, but it is unlikely that a given bureaucrat can use it to make a programmatic decision.

But here is where I get stuck and muddled, which is one of the reasons I put off writing this for so long. At some stage of my thinking, I felt that being decision-relevant, like being policy-adjacent, required working on real, live policies and programs. In fact, in a July 2014 attempt at writing this post, I was quite sympathetic to Howard White’s argument in a seminar that a good way to avoid doing ‘silly IE’ (sillIE©?) is to evaluate real programs and policies, even though being about a real program is not an automatic buffer against being silly.

But I increasingly wonder if I am wrong about decision-relevance. Instead, the main criterion is working with a decision-maker to sort out what decision needs to be made. One possible outcome of such a decision process is that a particular way forward turns out definitely not to be worth pursuing, because there is a serious and insurmountable design failure (~inefficacy) rather than an implementation failure (~ineffectiveness). A clear-cut design failure firmly closes a door on a way forward, which is important in decision-making processes (if stakeholders are willing to have a closed door be a possible result of an evaluation). For example, one might (artificially) test a program or policy idea in a crucial or Sinatra case setting — that is, if the idea can’t make it there, it can’t make it anywhere (Gerring, attributed to Yates). Door closed, decision option removed. One might also want to deliver an intervention in what H.L. Mencken called a ‘horse-doctor’s dose’ (as noted here). Again, if that whopping strong version of the program or policy doesn’t do it, it certainly won’t do it at the more likely level of administration. A similar view is expressed in Running Randomized Evaluations, noting that ‘proof-of-concept evaluations’ can show that even “a gold-plated, best-case-scenario version of the program is not effective.” Door closed, decision option removed.

Even more mind-bending, Ludwig, Kling, and Mullainathan lay out how researchers may approximate the ‘look’ of a policy to test the underlying mechanism (rather than the entirety of the policy’s causal chain and potential for implementation snafus) and, again, directly inform a prioritization, programmatic, or policy decision. As they note, “in a world of limited resources, mechanism experiments concentrate resources on estimating the parameters that are most decision relevant,” serving as a ‘first screen’ as to whether a policy is even worth trying. Again, this offers an opportunity to close a door and remove a decision option. It is hard to argue that this is not decision-relevant and would not inform policy, even if the experimental evaluation is not of a real policy, is not carried out by the people who would take the policy to scale, and so on. Done well, the (controversial) suggestion is that if a mechanism experiment shows that, even under ideal or hyper-ideal conditions (and taking the appropriate time trajectory into account), a policy mechanism does not bring about the desired change, that policy option could be dismissed on the basis of a single study.

But the key criterion of early involvement of stakeholders and clarifying the question that needs to be answered remains central to this approach to decision-relevance. And, again, having an identified set of stakeholders intended to be the immediate users of evidence seems to be important to being decision-relevant. And, finally, the role of middle-range or programmatic theory (of change) and clearly identified mechanisms of how a program/policy is meant to lead to an outcome is critical in being decision-relevant.

To return to the opening premise, it does not seem helpful to label all evaluation research associated with a real-world policy or program as ‘policy relevant.’ It is often seen as desirable to be policy relevant in the current state of (impact) evaluation work, but this doesn’t mean that all policy-adjacent research projects should self-label as being policy relevant. That is easy to do when it is not entirely clear what ‘policy relevance’ means, and it spreads the term too thin. To gain clarity, it helps to parse studies that are policy adjacent from those that are decision-relevant. Being relevant to decisions or policymakers demands not just stakeholder engagement (another loose term) but stakeholder identification of the questions they need answered in order to make a prioritization, programmatic, or policy decision.

There must, therefore, be clear and tangible decision-makers who intend to make use of the generated evidence to work towards a pre-stated decision goal — including a decision to shut the door on a particular policy/program option. While being policy-adjacent requires working alongside a real-world policy, being decision-relevant may not have to meet this requirement, though it does require an ex ante intention to inform a specific policy/program decision and to engage appropriately with stakeholders to this end.

This is far from a complete set of thoughts — I have more reading to do on mechanisms and more thinking to do about when murky and quirky decisions can be reasonably made for a single setting based on a single study in that murky and quirky setting. Nevertheless, the argument that there should be some clear standards for when the term ‘policy relevant’ can be applied and what it means holds.

*In the same somewhat horrifying way that a person might self-ascribe connoisseur status or a bar might self-label as being a dive. No no no, vomit.

Aside

“Politically Robust” Experimental Design in Democracies and a Plea For More Experience Sharing

Sometimes I re-read a paper and remember how nice a sentence or paragraph was (especially when thinking that a benevolent or benign dictator might make research so much easier, as though easy were the main goal of research).

So it is with the paper by Gary King and colleagues (2007) on “A ‘Politically Robust’ Experimental Design for Public Policy Evaluation, with Application to the Mexican Universal Health Insurance Program”.

Scholars need to remember that responsive political behavior by political elites is an integral and essential feature of democratic political systems and should not be treated with disdain or as an inconvenience. Instead, the reality of democratic politics needs to be built into evaluation designs from the start — or else researchers risk their plans being doomed to an unpleasant demise. Thus, although not always fully recognized, all public policy evaluations are projects in both policy analysis and political science.

What would be nice is if researchers would share more of their experiences and lessons learned, not just in robust research design (though this is critical) but also in working to (and failing to) persuade local political leaders to go along with randomization schemes and to potentially hold off on any kind of scale-up until the results are in… and only if they are promising!

Pipeline Designs and Equipoise: How Can They Go Together?

I am writing about phase-in / pipeline designs. Again. I’ve already done it here, and more here. But.

The premise of a pipeline or phase-in design is that groups will be randomized or otherwise experimentally allocated to receive a given intervention earlier or later. The ‘later’ group can then serve as the comparison for the ‘early’ group, allowing for a causal claim about impact to be made. I am specifically talking about phase-in designs premised on the idea that the ‘later’ group is planned (and has perhaps been promised) to receive the intervention later. I take this to be a ‘standard’ approach to phase-in designs.
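As a concrete (and entirely hypothetical) illustration of the mechanics, here is a minimal simulation sketch of a phase-in design: groups are randomized to an ‘early’ or ‘later’ arm, and at midline the not-yet-treated ‘later’ groups serve as the comparison. All names and numbers below are made up for illustration, not drawn from any of the studies discussed.

```python
# Minimal sketch of a phase-in (pipeline) design with made-up data:
# groups are randomized to receive the intervention 'early' or 'later';
# at midline, the not-yet-treated 'later' groups serve as the comparison.
import numpy as np

rng = np.random.default_rng(42)

n_groups, n_per_group = 40, 50
early = rng.permutation(np.repeat([1, 0], n_groups // 2))  # 1 = early arm

true_effect = 0.3                      # hypothetical treatment effect at midline
group_effect = rng.normal(0, 0.5, n_groups)

outcomes, assignment = [], []
for g in range(n_groups):
    y_g = 10 + group_effect[g] + true_effect * early[g] + rng.normal(0, 1, n_per_group)
    outcomes.append(y_g)
    assignment.append(np.full(n_per_group, early[g]))

y = np.concatenate(outcomes)
d = np.concatenate(assignment)

# Simple difference in means at midline (clustering of standard errors at the
# group level is omitted here for brevity, but matters in practice).
impact_estimate = y[d == 1].mean() - y[d == 0].mean()
print(f"Estimated impact at midline: {impact_estimate:.2f} (true effect {true_effect})")
```

The key design feature is that the ‘later’ groups only serve as a valid counterfactual until they, too, are phased in, which is exactly why the questions about equipoise below matter.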

I’d like to revisit the issue of phase-in designs from the angle of equipoise, which implies some sense of uncertainty about the causal impact of a given intervention. This uncertainty provides the justification for making use of an ex ante impact evaluation. Equipoise literally translates to equal weight / force / interest. Here, the force in question is the force of argument about the impact of an intervention and which direction it will go (or whether there will be one at all).

There have already been some great conversations, if not decisive answers, as to whether, in social science research, the justification for using experimental allocation of an intervention needs to meet the standards of clinical equipoise or policy equipoise.* The key difference is the contrast between ‘a good impact’ (clinical equipoise) and ‘the best impact achievable with the available resources’ (policy equipoise). In either case, it is clear that some variant of equipoise is considered a necessary justification: for theoretical and/or empirical reasons, it just isn’t clear whether an intervention is (a) good (investment).

Whichever definition of equipoise you pursue, the underlying premise is one of a genuine uncertainty and an operational knowledge gap about how well a certain intervention will work in a certain setting at a certain point in time and at what degree of relative resource efficiency. This uncertainty is what lends credibility to an ex ante impact evaluation (IE) and the ethical justification for a leave-out (‘business as usual’ or perhaps ‘minimal/basic package’) comparison group. Hence, no RCTs on parachutes.

Uncertainty implies that the impact results could plausibly, if not with fully equal likelihood, come back positive, negative, null or mixed. At least some of those outcomes imply that a program is not a good use of resources, if not actually generating adverse effects. Such a program, we might assume, should be stopped or swapped for some alternative intervention (see Berk’s comments here).

To move forward from the idea of uncertainty, the following two statements simply do not go together despite often being implicitly paired:

  1. We are uncertain about the impact our intervention will bring about / cause, so we are doing an (any type of ex ante) IE.
  2. We plan to scale this intervention for everyone (implicitly, at least, because we believe it works – that is, the impacts are largely in the desired direction). Because of resource constraints, we will have to phase it in over time to the population.

Yes, the second point could be and is carried on to say, ‘this offers a good opportunity to have a clean identification strategy and therefore to do IE.’ But this doesn’t actually square the circle between the two statements. It still requires the type of sleight of hand around the issue of uncertainty that I raised here about policy champions.

Unless there are some built-in plans to modify (or even cancel) the program along the phase-in process, the ethics of statement 2 rests solely on the resource constraint (relative to actual or planned demand), not on any variant of equipoise. This is an important point when justifying the ethics of ex ante IE. And it is worth noting how few development programs have been halted because of IE results. It would be a helpful global public good if someone would start compiling a list of interventions that have been stopped, plausibly, because of IE outcomes, perhaps making note of the specific research design used. Please and thank you.

Moreover, unless there is some built-in planning about improving, tweaking or even scrapping the program along the way, it is not clear that the ex ante IE based on a phase-in design can fully claim to be policy relevant. This is a point I plan to elaborate in a future post but, for now, suffice it to say that I am increasingly skeptical that being about a policy (being ‘policy adjacent’ by situating a study in a policy) is the same as informing decisions about that policy (being ‘decision relevant’).

To me, the latter has stronger claims on being truly policy relevant and helping to make wise and informed decisions about the use of scarce resources – which I think is the crux of this whole IE game anyway. IEs of phase-in designs without clear potential for mid-course corrections (i.e. genuine decision points) seem destined for policy adjacency, at best. Again, the underlying premise of a phase-in design is that it is a resource constraint, not an evidence constraint, which dictates the roll-out of the program. But the intention to make a decision at least partly based on the evidence generated by an IE again rests on the premise of ex ante uncertainty about the potential for (the most cost-efficient) impact.

To come back to the issue of equipoise and phase-in designs: if the ethics of much of the work we do rests on a commitment to equipoise, then more needs to be done to clarify how we assess it and whether IRB/ethics review committees take it seriously when considering research designs. What information does a review board need to make that assessment?

Moreover, it requires giving a good think on what types of research designs align with the agreed concept of equipoise (whichever that may be). My sense is that phase-in designs can only be commensurate with the idea of equipoise if they are well-conceived, with well-conceived indicating that uncertainty about impact is indeed recognized and contingencies planned for in a meaningful way – that is, that the intervention can be stopped or altered during the phase-in process.

* I don’t propose to settle this debate between clinical and policy equipoise here, though I am sympathetic to the policy equipoise argument (and would be more so if more ex ante IEs tended towards explicitly testing two variants of an intervention against one another to see which proves the better use of resources moving forward – because forward is the general direction people intend to move in development).

Aside

On Science, from Eula Biss’s On Immunity

A nice reminder from Eula Biss (via On Immunity: An Inoculation) that science is a series of building blocks, with small tests and then bigger ones to see if each brick helps us reach higher and see farther.

Science is, as scientists like to say, “self-correcting,” meaning that errors in preliminary studies are, ideally, revealed in subsequent studies. One of the primary principles of the scientific method is that the results of a study must be reproducible. Until the results of a small study are duplicated in a larger study, they are little more than a suggestion for further research. Most studies are not incredibly meaningful on their own, but gain or lose meaning from the work that has been done around them… This doesn’t mean that published research should be disregarded but that, as John Ioannidis concludes, “what matters is the totality of the evidence” (p. 133)…

Thinking of our knowledge as a body suggests the harm that can be done when one part of that body is torn from its context. Quite a bit of this sort of dismemberment goes on in discussions about vaccination, when individual studies are often used to support positions or ideas that are not supported by the body as a whole… When one is investigating scientific evidence, one must consider the full body of information (p. 135).

that may not mean quite what you think it means: john henry and americana edition

occasionally on this site, i try to provide some background on phrases and cliches in social science and global health (such as here and here). it is a small public service to help folks not be sicilians yelling “inconceivable!” (or to keep them from starting land wars in asia, if at all possible).

today, the john henry effect.

.

the john henry effect is a reactive effect we could find in the comparison group of an experiment (or any non-intervention group) when the comparison group is aware it is not receiving treatment. with this knowledge, they might react by working harder to compensate for not having the intervention. the effect, apparently, also includes the reaction amongst the ‘non-treated’ of becoming discouraged at not having received the intervention and working less hard, though i am less familiar with this usage. in any case, we could just call them ‘reactive effects’ and, given all the other cultural roles and meanings of john henry, i wonder if we just should.
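as a toy illustration (with made-up numbers, not drawn from any real study), here is a minimal sketch of how such a reactive effect in the comparison group could bias a naive impact estimate: if untreated units ‘work harder’ to compensate, the estimated effect shrinks relative to the true one.

```python
# toy sketch of a john henry effect: controls who know they are untreated
# compensate by working harder, shrinking the naive difference in means.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

treat = rng.integers(0, 2, n)        # randomized assignment (1 = treated)
true_effect = 0.5                    # hypothetical true treatment effect
compensation = 0.3                   # hypothetical extra effort among controls

baseline = rng.normal(0, 1, n)
outcome = baseline + true_effect * treat + compensation * (1 - treat)

naive = outcome[treat == 1].mean() - outcome[treat == 0].mean()
print(f"true effect: {true_effect}; naive estimate with compensating controls: {naive:.2f}")
# the naive estimate is biased downward by roughly the compensation term
```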

the point of this post is not about the john henry effect but about john henry. however, a small point: david mckenzie’s post on the john henry effect (and that we shouldn’t be too worried about it) concludes, “often our best approach may be to try and reduce the likelihood of such effects in the first place – while it can be hard (or impossible) to hide from the treatment group the fact they are getting a treatment, in many cases the control group need not know they are controls.”

this seems at odds with mckenzie’s seeming support in other places for public randomization (example here) – in which case, the comparison group would very well know that they were not receiving the treatment. (the problem, in part, is that we have limited scope in the way of placebos in social science work. ethics aside, we simply don’t know how to give you a malaria-bednet-that-isn’t-really-protective in the way that i can give you a lookalike pill that has no active pharmaceutical ingredients. which is, perhaps, another argument for testing treatment variants against each other rather than treatment against just ‘business as usual’/nothing new.)

.

in any case, the real point of this post is about john henry the man/myth. from a recent conversation with a colleague, it was clear that, for him/her, the john henry effect could have just as easily been named for the researcher that discovered the effect or the site at which it was first noted (as in the hawthorne experiments).

which is fair enough. john henry is an element of americana folklore (though there may well be counterpart or antecedent stories in different cultures and i would be delighted to hear about them), so why should anyone else be clued in?

however, i had to sing a song about john henry in a 5th grade choir performance about american tall tales (quite possibly the last time i was permitted to sing on stage), so i am fully qualified to provide some background on john henry.

.

it seems (mostly according to here and here) that john henry was likely a real man — definitely black, possibly born a slave. he worked for the railroads following the civil war (in the late 1860s and 1870s). he was well-suited to this work as a “steel driving man,” as he was, from existing accounts, both quite tall and muscular. most accounts say he worked for the C&O Railroad (chesapeake & ohio) and many accounts put his work as drilling through the big bend mountain in west virginia, where it was decided it was more expedient to make a tunnel rather than go around the mountain (alternatively, he worked on the nearby lewis tunnel under similar circumstances).

“as the story goes, john henry was the strongest, fastest, most powerful man working on the rails. he used a 14-pound hammer to drill 10 to 20 feet in a 12-hour day – the best of any man on the rails. one day, a salesman came to camp, boasting that his steam-powered machine could outdrill any man. a race was set: man against machine. john henry won, the legend says, driving 14 feet to the drill’s nine. he died shortly after, some say from exhaustion, some say from a stroke.”

another account, from an alleged eyewitness interviewed by sociologist guy johnson in the 1920s, is:

“when the agent for the steam drill company brought the drill here, john henry wanted to drive against it. he took a lot of pride in his work and he hated to see a machine take the work of men like him. well, they decided to hold a test to get an idea of how practical the steam drill was. the test went on all day and part of the next day. john henry won. he wouldn’t rest enough, and he overdid. he took sick and died soon after that.”

john henry became the subject of ballads and work/hammer songs (e.g. here and here) and an important touchstone for the american labor and civil rights movements. he is a lot more than a possible effect in social experiments!

.

as a closing thought, when we discuss john henry effects, we mostly think about his working hard in compensation for not having the treatment (a machine) — or even proving that the treatment was unnecessary because of pride in the status quo. we think less about the fact that he died from it. given this part of the story, should we find john henry effects, we may want to consider not just that they might mess up our effect estimation — but that harms could be coming to groups not receiving interventions if they are over-compensating in this way (more akin to how john henryism and sojourner truthism are used in sociology and health psychology (e.g. here and here) to describe the african-american experience and weathering).

Refereeing an academic paper

The list below is 100% taken from the following sources; my only contribution is to mix them up into a three-page document.

Nevertheless, it may prove useful. Additions are, of course, welcome.

  • Assume that no referee reports are truly anonymous. It is fine to be critical but always be polite.
  • Skim the paper within a couple of days of receiving the request – my metro rides are good for this – so you can quickly tell whether this is a paper that is well below the bar for some obvious reason and can be rejected as quickly as possible.
    • Unless it is immediate junk, read the paper once and return to it a week later with deeper thoughts and a fresh mind.
    • Referee within one month.
  • Remember you are the referee, not a co-author. I hear a lot that young referees in particular write very long reports, which try to do way more than is needed to help make a paper clear, believable, and correct. I think 2 pages or less is enough for most reports.
  • Your report should not assume that the editor has a working knowledge of the paper.
    • The first paragraph should summarize the contribution. Reviewers should provide a concise summary of the paper they review at the start of their report and then provide a critical but polite evaluation of the paper.
    • Explain why you recommend that the paper be accepted, rejected, or revised.
      • If you would like the editor to accept the paper, your recommendation must be strong. The more likely you think the paper is to merit a revision, the more detailed the comments should be.
      • The referee report itself should not include an explicit editorial recommendation. That recommendation should be in a separate letter to the editor.
      • If you consistently recommend rejection, then the editor recognizes you are a stingy, overly critical person. Do not assume that the editor will not reveal your identity to the authors. In the long run, there are no secrets.
      • If you recommend acceptance of all papers, then the editor knows you are not a discriminating referee.

Possible considerations:

  • Research question and hypothesis:
    • Is the researcher focused on well‐defined questions?
    • Is the question interesting and important?
    • Are the propositions falsifiable?
    • Has the alternative hypothesis been clearly stated?
    • Is the approach inductive, deductive, or an exercise in data mining? Is this the right structure?
  • Research design:
    • Is the author attempting to identify a causal impact?
    • Is the “cause” clear? Is there a cause/treatment/program/first stage?
    • Is the relevant counterfactual clearly defined? Is it compelling?
    • Does the research design identify a very narrow or a very general source of variation?
    • Could the question be addressed with another approach?
    • Useful trick: ask yourself, “What experiment would someone run to answer this question?”
  • Theory/Model:
    • Is the theory/model clear, insightful, and appropriate?
    • Could the theory benefit from being more explicit, developed, or formal?
    • Are there clear predictions that can be falsified? Are these predictions “risky” enough?
      • Does the theory generate any prohibitions that can be tested?
      • Would an alternative theory/model be more appropriate?
        • Could there be alternative models that produce similar predictions—that is, does evidence on the predictions necessarily weigh on the model or explanation?
      • Is the theory a theory, or a list of predictions?
      • Is the estimating equation clearly related to or derived from the model?
  • Data:
    • Are the data clearly described?
    • Is the choice of data well‐suited to the question and test?
    • Are there any worrying sources of measurement error or missing data?
    • Are there sample size or power issues?
    • How were data collected? Is recruitment and attrition clear?
      • Is it clear who collected the data?
      • If data are self-reported, is this clear?
      • Could the data sources or collection method be biased?
      • Are there better sources of data that you would recommend?
      • Are there types of data that should have been reported, or would have been useful or essential in the empirical analysis?
      • Is attrition correlated with treatment assignment or with baseline characteristics in any treatment arm?
  • Empirical analysis:
    • Are the statistical techniques well suited to the problem at hand?
    • What are the endogenous and exogenous variables?
    • Has the paper adequately dealt with concerns about measurement error, simultaneity, omitted variables, selection, and other forms of bias and identification problems?
    • Is there selection not just in who receives the “treatment”, but in who we observe, or who we measure?
    • Is the empirical strategy convincing?
    • Could differencing, or the use of fixed effects, exacerbate any measurement error?
    • Are there assumptions for identification (e.g. of distributions, exogeneity)?
      • Were these assumptions tested and, if not, how would you test them?
      • Are the results demonstrated to be robust to alternative assumptions?
      • Does the disturbance term have an interpretation, or is it just tacked on?
      • Are the observations i.i.d., and if not, have corrections to the standard errors been made?
      • What additional tests of the empirical strategy would you suggest for robustness and confidence in the research strategy?
      • Are there any dangers in the empirical strategy (e.g. sensitivity to identification assumptions)?
      • Is there potential for Hawthorne effects or John Henry-type biases?
  • Results:
    • Do the results adequately answer the question at hand?
    • Are the conclusions convincing? Are appropriate caveats mentioned?
    • What variation in the data identifies the elements of the model?
    • Are there alternative explanations for the results, and can we test for them?
    • Could the author have taken the analysis further, to look for impact heterogeneity, for causal mechanisms, for effects on other variables, etc?
    • Is absence of evidence confused with evidence of absence?
    • Are there appropriate corrections for multiple comparisons / multiple hypothesis testing? (see the sketch after this list)
  • Scope:
    • Can we generalize these results?
    • Has the author specified the scope conditions?
    • Have causal mechanisms been explored?
    • Are there further types of analysis that would illuminate the external validity, or the causal mechanism at work?
    • Are there other data or approaches that would complement the current one?
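On the multiple-comparisons item above, here is a minimal, hypothetical sketch (assuming the statsmodels package is available; the p-values are made up) of applying a standard correction before declaring individual results significant:

```python
# Hypothetical p-values from a family of outcome tests in one evaluation.
from statsmodels.stats.multitest import multipletests

pvals = [0.003, 0.021, 0.045, 0.048, 0.21, 0.61]

# Benjamini-Hochberg false discovery rate control at 5%.
reject, pvals_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")

for p, p_adj, r in zip(pvals, pvals_adj, reject):
    print(f"raw p = {p:.3f}, adjusted p = {p_adj:.3f}, significant after correction: {r}")
```

A quick check like this is one way, as a referee, to gauge whether results that hover around conventional thresholds survive a correction for the number of hypotheses tested.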