Brief Thought on Commitment-To-Analysis Plans

First, I am starting a small campaign to push towards calling ‘pre-analysis plans’ something else before the train gets too far from the station. Something like ‘commitment to analysis plans’ or ‘commitment to analysis and reporting plans.’ I have two reasons for this.

  1. PAP just isn’t a super acronym; it’s kind of already taken.
  2. I think the name change moves the concept a step back from indicating that the researcher needs to pre-specify the entirety of the analysis plan and, rather, indicates the core intended data cleaning and coding procedures and the central analysis — and a commitment to completing and reporting those results, whether significant or not. This shift, towards a commitment rather than a straitjacket, seems like it would go some way towards addressing concerns expressed by Olken and others that the task of pre-specifying all possible analyses ex ante is both herculean and blinkered, in the sense of not incorporating learnings from the field to guide parts of the analysis. The commitment, it seems to me, should be partly about making clear to the reader of a study which analyses were ‘on plan’ and which came later, rather than claiming perfect foresight.

Second, speaking of those learnings from the field that may be incorporated into analysis… I had a moment today to think a bit about the possible views from the field that come from surveyors (as I am working on some of my dissertation analysis and already starting to form a list of questions to write back to the survey team with which I worked!). Among the decisions laid out by folks like Humphreys and McKenzie in their lists of what should be specified in a commitment-to-analysis plan (doesn’t ‘CAP’ sound nice?) about data cleaning, surveyors play very little role.

Yet a survey (or discussion) among survey staff about their experience with the questionnaire can yield information on whether there were any questions they systematically felt uncomfortable with or uncertain about, or that respondents rarely seemed to understand. Yes, many of these kinks should be worked out during piloting but, no, they aren’t always. Sometimes surveyors don’t work up the gumption to tell you a question is terrible until the research is underway and sometimes they themselves don’t realize it.

For example, in one field experiment with which I was involved, surveyors only admitted at the end (we conducted an end-of-survey questionnaire among them) how uncomfortable they were with a short-term memory test module (which involved asking respondents to repeat strings of numbers) and how embarrassing it was to ask these questions of their elders — to the point that some of them breezed through these questions pretty quickly during interviews and considered some of the answers they reported suspect. Some wrote fairly agonizing short essays to me in the end-of-survey questionnaire (it’s a good thing to make them anonymous!), asking me to “Imagine that you have to ask this question to an elder…” and proceeding to explain the extreme horror of this.* As the short-term memory module was not part of the central research question or main outcomes of interest, it was not subjected to any of the audit, back-check, or other standard data-quality procedures in place, and so the problem was not caught earlier.

I can imagine a commitment-to-analysis plan that committed to collecting and incorporating surveyor feedback. For example, a CAP that stated that if >90% of surveyors reported being uncertain about the data generated by a specific question, those data would be discarded or treated with extreme caution (and that caution passed on to the consumers of the research). Maybe this could be one important step to valuing, in some systematic way, the experience and insights of a survey team.
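
For concreteness, here is a minimal sketch of what operationalizing such a rule could look like, assuming a hypothetical file of anonymous surveyor feedback with one row per surveyor per question item; the file name, column names, and threshold below are illustrative, not a prescribed format.

```python
# Sketch: flag question items that most surveyors reported being uncertain about.
# Assumes a hypothetical CSV with columns: surveyor_id, item, uncertain (0/1).
import pandas as pd

UNCERTAINTY_THRESHOLD = 0.9  # the illustrative cutoff named in the CAP

feedback = pd.read_csv("surveyor_feedback.csv")

# Share of surveyors reporting uncertainty about each question item
uncertainty_rate = feedback.groupby("item")["uncertain"].mean()

flagged = uncertainty_rate[uncertainty_rate > UNCERTAINTY_THRESHOLD]
print("Items to discard or treat with extreme caution:")
print(flagged.sort_values(ascending=False))
```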

*For the record, I can somewhat imagine this, having once worked in a call center conducting interviews with older women following up on their pelvic floor disorder surgery and whether they were experiencing any urinary symptoms. In that case, however, most of the discomfort was on my side, as they were well versed in — and fairly keen on — talking about their health issues and experiences! Note to self: aim not to have pelvic floor disorder.

Back to Basics — Trusting Whether and How The Data are Collected and Coded

This is a tangential response to the LaCour and #lacourgate hubbub (with hats off to the summaries and views given here and here). While he is not implicated in all of the comments below, I am most certainly indebted to Mike Frick for planting the seed of some of the ideas presented below, particularly on member-checking (hopefully our under-review paper on the same will be out sometime in the future…). Salifu Amidu and Abubakari Bukari are similarly motivational-but-not-implicated, as are Corrina Moucheraud, Shagun Sabarwal and Urmy Shukla.

To a large extent, the LaCour response is bringing a new angle to an increasingly familiar concern: trusting the analysis. This means additional (and important) calls for replication and other forms of post-publication peer review (as Broockman calls for) as a guard against significance-hungry, nefarious researchers. Pre-analysis plans, analytic/internal replications, and so on, are all important steps towards research transparency. But they miss the fundamental tendency to treat data as ‘true’ once they make it into the familiar, rectangular format of a spreadsheet.

Given LaCour, it seems clear that we may need to take an additional step back to get to the heart of research: the data. We place a lot of trust in the data themselves — between advisers and advisees, between research collaborators, and between producers and users of large, public data sets. And, in turn, between PIs and research assistants and the actual team collecting the data. This trust is about, of course, whether the data exist at all and whether they measure what they purport to measure. (Green seems to have had a hunch about this?)

We should be clear about the foundations of this trust and what we might do to strengthen it. Ultimately, the LaCour story is a story about the production of data, not their analysis. The transparency agenda needs to expand accordingly, to address the fundamental principle that ‘shit in leads to shit out.’

Here are a few thoughts:

  • Start to teach data collection like it matters. Survey design and data collection are weirdly absent from many graduate programs — even those oriented towards research. You may pick these up in electives but they are rarely required, to my knowledge. Learning about construct validity, validating test instruments in new contexts, questionnaire design, the potential for interviewer effects, some of the murky and inchoate contents of the activity labelled as ‘formative work*,’ etc, need not be re-discovered by each new graduate student or research assistant who takes on field work. If a course-work model won’t work, then a much more explicit apprenticeship model should be sought for those pursuing primary empirical work. In terms of teaching, one occasionally might be forgiven for thinking that impact evaluators had only just discovered data collection and that there aren’t mounds of resources on household surveys, psychometrics, and questionnaire design that can be used to better ensure the quality and truthfulness of the data being collected. Interdisciplinary work needs to start with what data are collected — and by what means and measures — to answer a particular question.
  • Report on data quality practices. Lots of survey firms and researchers employ strategies such as data audits and back-checks. Good on you. Report it. This almost never makes it into actual publications but these are not just internal operations processes. Researchers do need to put forth some effort to make their readers trust their data as well as their analysis, yet far less work seems to go into the former. With the rise of empirical work in economics and other fields, this needs to be given more documented attention. If you threw out 5% of your data because of failed back-checks, tell me about it. I’d believe the remaining 95% of your data a lot more. The onus is on the researchers to make the reader trust their data.
  • Treat surveyors as a valuable source of information. It is increasingly common to at least have surveyors fill in a question at the end of the questionnaire about whether the respondent was cooperative (usually a Likert scale item) or some other brief reflection on how the interview went. I have no idea what happens to the data so produced — if they are used to throw out or differentially weight responses, do please tell the reader about it. Moreover, you can systematically ask your surveyors questions (including anonymously) about question items that they don’t trust. For example, I asked (in written form) this question of surveyors and most reported that it was incredibly embarrassing for them to ask their elders to play certain memory games related to short-term recall. This might be a good sign to tread lightly with those data, if not discount them completely (whether or not the surveyors faithfully asked the embarrassing question, it still suggests that it created a tense social interaction that may not have generated trustworthy data, even if it didn’t fall in the traditional space of ‘sensitive questions’). If nothing else, the surveyors’ assessments may be given as part of the ‘methods’ or ‘results’ attended to in publications. And, in general, remembering that surveys are human interactions, not matrix populators, is important.
  • Member-check. Member-checking is a process described by Lincoln and Guba (and others) that involves taking results and interpretations back to those surveyed to test interpretative hypotheses, etc. If some results really fly in the face of expectations, this process could generate some ‘red flags’ about which results and interpretations should be treated with care. And these can be reported to readers.
  • Coding. As with ‘formative work,’ what ‘we coded the open-ended data’ actually involved is often opaque, though this is where a lot of the interpretive magic happens. This is an important reason for the internal replication agenda to start with the raw data. In plenty of fields, it would be standard practice to use two independent coders and to report on inter-rater reliability (a minimal sketch of such a check appears just after this list). This does not seem to be standard practice in much of impact evaluation. This should change.
  • Check against other data sets. It would not take much time for researchers to put their own findings into context by comparing (as part of a publication) the distribution of results on key questions to the distribution from large data sets (especially when some questionnaire items are designed to mimic the DHS, LSMS, or other large public data sets for precisely this reason); this kind of comparison is also sketched just after the list. This is not reported often enough. This does not mean that the large, population-linked data set will always trump your project-linked data set but it seems only fair to alert your readers to key differences, for the purposes of internal believability as well as external validity.
  • Compare findings with findings from studies on similar topics (in similar contexts) — across disciplines. Topics and findings do not end with the boundaries of a particular method of inquiry. Placing the unexpectedness of your findings within this wider array of literature would help.
  • Treat all types of data with similar rigor and respect. (Cue broken record.) If researchers are going to take such care with quantitative data and then stick in a random quote as anecdata in the analysis without giving any sense of where it came from or whether it should be taken as representative of the entire sample or some sub-group… well, it’s just a bit odd. However you want to label these different types of data — quant and qual or data-set observations and causal-process observations — they are empirical data and should be treated with the highest standards known in each field of inquiry.
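
Picking up the coding bullet above, here is a minimal sketch of what the two-coder check might look like, assuming two hypothetical files in which independent coders assigned a category to each open-ended response (the file names, column names, and use of scikit-learn are illustrative assumptions, not anyone’s actual pipeline).

```python
# Sketch: inter-rater reliability for two independent coders of open-ended responses.
import pandas as pd
from sklearn.metrics import cohen_kappa_score  # assumes scikit-learn is available

coder_a = pd.read_csv("open_ended_coder_a.csv")  # columns: response_id, code
coder_b = pd.read_csv("open_ended_coder_b.csv")

merged = coder_a.merge(coder_b, on="response_id", suffixes=("_a", "_b"))
kappa = cohen_kappa_score(merged["code_a"], merged["code_b"])

print(f"Cohen's kappa across {len(merged)} double-coded responses: {kappa:.2f}")
# The kappa (and how disagreements were reconciled) is what belongs in the methods section.
```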
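
And picking up the benchmarking bullet, a minimal sketch of comparing a key item against a public data set; the file names, the variable, and the use of a two-sample Kolmogorov-Smirnov test are assumptions for illustration, not a required procedure.

```python
# Sketch: benchmark a key survey item against a large public data set (e.g., a DHS extract).
import pandas as pd
from scipy.stats import ks_2samp

project = pd.read_csv("project_survey.csv")  # hypothetical project data
benchmark = pd.read_csv("dhs_extract.csv")   # hypothetical public-data extract, same variable definition

var = "household_size"  # an item designed to mimic the benchmark questionnaire

print(project[var].describe())
print(benchmark[var].describe())

# A two-sample Kolmogorov-Smirnov test flags large distributional differences worth
# reporting to readers; it does not say which data set is 'right'.
stat, p_value = ks_2samp(project[var].dropna(), benchmark[var].dropna())
print(f"KS statistic = {stat:.3f}, p = {p_value:.3f}")
```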

I can’t assess whether any of these measures, singly or together, would have made a major difference in the LaCour case — especially since it remains nebulous how the data were generated, let alone with what care. But the LaCour case reveals that we need to be more careful. A big-name researcher was willing to trust that the data themselves were real and collected to the best of another researcher’s ability — and focused on getting the analysis right. In turn, other researchers bought into both the analysis and the underlying data because of the big-name researcher. This suggests we need to do a bit more to establish trust in the data themselves — and that the onus for this is on the researchers, big names or no, claiming to have led the data collection and cleaning processes. This is especially true given the unclear role for young researchers as potential replicators and debunkers, highlighted here. I hope the transparency agenda steps up accordingly.

*If on occasion a researcher reported on what happened during the ‘formative phase’ and about how the ‘questionnaire was changed in response,’ that would be really interesting learning for all of us. Report it. Also, if you are planning to do ‘qualitative formative work’ to improve your questionnaire, it would be good if you built in time in your research timeline to actually analyze the data produced by that work, report on that analysis, and explain how the analysis led to changing certain questionnaire items…

planning for qualitative data collection and analysis

this blog reflects conversations and on-going work with mike frick (@mwfrick), shagun sabarwal (@shagunsabarwal), and urmy shukla (@urmy_shukla) — they should receive no blame if this blog is wacky and plenty of credit if it is not.

a recent post by monkey cage contributors on the washington post, then summarized by BITSS, asked/suggested whether “exploratory, qualitative, historical, and case-based research is much harder to present in a results-free manner, and perhaps impossible to pre-register.” this was just a brief point in their larger argument, which i agree with. but it seems worth pausing to consider whether it is likely to be true. below, i discuss research design for qualitative work, sampling considerations, and analysis itself.

throughout, i take a ‘pre-analysis plan’ to be a commitment on what the research will analyze and report on but not a constraint on doing analyses that are ‘off plan.’ rather, the researcher just needs to be explicit about which analyses were ‘on plan’ and which were not, and to commit to reporting everything that was ‘on plan’ – or why such reporting is infeasible.

my conclusion: a conversation on pre-analysis plans needs to distinguish whether planning is possible from whether planning is currently done. in general, i feel planning for analysis is possible when you plan to do analysis.

disclaimer: in this post, my reference to ‘qualitative research’ is to the slice of social science research that has to do with talking to and/or observing and/or participating with living people. i claim no knowledge of the analysis of historical manuscripts and a wide array of other qualitative research. by extension, i am mostly talking about planning for the collection and analysis of data from in-depth interviews, focus group discussions, and forms of (participant-) observation.

designing research: working with and asking questions of living people implies an ethics-review process, for which the researcher will have to lay out at least the ‘domains’ (aka themes, topics, categories) of information s/he hopes to observe and ask people about. usually, one does not get away with saying “i am just going to head over here, hang out, and see what i find.” this requires planning. like a pre-analysis plan, the domains for qualitative work can set up some bounds for the minimum of what will be collected and reported (“we will collect and analyze data on the following topics: x, y, z“), even if the final report is that a particular domain ended up being a flop because no one wanted to talk about it or it proved uninteresting for some reason.

some of the most famous ethnographies (say, nisa and tuhami) focus on a single person, often to try to give perspective on a larger culture — which may not be what the ethnographer initially set out to study. but the ethnographer can still tell you that (“i went to look at x, and here’s what i found — but i also found this really interesting person and that’s what the rest of the book is about”). so this does not seem inconsistent with the underlying logic of a plan, with the understanding that such a plan does not dictate everything that follows but does mandate that one reports why things changed.

which brings us to the nature of qualitative work: it is often iterative and the researcher often conducts data collection, entry, and analysis in parallel. analysis from an early set of interviews informs questions that are asked and observations attended to later on. this is one of the exciting (to me) elements of qualitative research, that you get to keep incorporating new learnings as you go along. this approach need not be inconsistent with having a set of domains that you intend to explore. within each, maybe the questions get sharper, deeper, or more elaborate over time. or maybe one planned domain turns out to be way too sensitive or way off-base. again, the researcher can report, relative to the initial plan, that this is what happened between designing the research and actually doing it.

sampling: certain aspects of qualitative research can be planned in advance. usually the aim is to be in some way representative. one way to aim for representation is to consider sub-groups of interest. in large-n analysis, the researcher may be able to hope that sufficient numbers of sub-groups will appear in the sample by default. in smaller-n analysis, more purposive sampling plans may be needed to be sure that different sub-groups are engaged in conversation. but, specifying sub-groups of interest can be done in advance — hence, plannable. but, at least some branches of qualitative research suggest that representativeness is about outputs rather than inputs — and that what the researcher is seeking is saturation (i am thinking of lincoln and guba here), which has implications for planning.

‘saturation’ relates to whether the researcher is starting to hear the same answer over and over. in some cases, inputs are the determinant of representation — similar to the approach that can be taken in large-n work. let’s say that you want to get the input of the members of an elite government entity — a particular committee with 20 people on it. fine, plan to talk to all of them. representativeness is here achieved by talking to all of the relevant people (the whole population of interest) – and then finding some way of summarizing and analyzing the viewpoints of all of them, even if it’s 20 different viewpoints. there’s your sampling plan. (this may or may not be part of a pre-analysis plan, depending on how that is defined and at what stage of the research process it is required. i take these to be open questions.)

for less clearly bounded groups that nevertheless have clear characteristics and may be expected to think or behave differently — let’s say men versus women, older versus younger people, different economic quintiles, different ethnic groups, whatever — planning for saturation may look more like: plan to talk to men until we start getting repeated answers on key questions of interest, or conduct focus groups that are internally homogenous with respect to ethnicity until we start to hear similar answers within each ethnicity (because it may take different numbers within each). that is, if representativeness is focused on output, then it is insufficient to plan at the beginning “we will do two focus groups in each village in which we collect data.” the researcher can specify the sub-groups of interest but probably not the number of interviews, focus groups, or hours of observation required.

i make this point for two reasons. first, a pre-analysis plan for qualitative work should plan for iteration between considering what has been collected and whether more questions are necessary to make sense of the phenomena of interest. this makes it different in practice from a quantitative plan but the underlying principle holds. second, a pre-analysis plan, if it covers sampling, probably cannot plan for specific numbers of inputs unless the population is clearly bounded (like the committee members). rather, the plan is to aim for saturation within each sub-group of interest.
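
as an illustration only, here is one way a plan could operationalize ‘saturation’ within a sub-group: keep interviewing until several consecutive interviews add no new codes. the stopping rule and the toy data below are assumptions for the sketch, not a standard.

```python
# sketch: a simple saturation check, run after each new interview is coded.
def reached_saturation(coded_interviews, consecutive_without_new=3):
    """coded_interviews: list of sets of codes, in the order the interviews were conducted."""
    seen = set()
    streak = 0
    for codes in coded_interviews:
        new_codes = codes - seen
        seen |= codes
        streak = 0 if new_codes else streak + 1
        if streak >= consecutive_without_new:
            return True
    return False

# toy example: codes assigned to five interviews within one sub-group
interviews = [{"cost", "distance"}, {"cost", "trust"}, {"trust"}, {"cost"}, {"distance"}]
print(reached_saturation(interviews))  # True: the last three interviews added nothing new
```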

analysis: finally! in general, i feel more emphasis in a moving-towards-mixed-methods world needs to be put on analysis of qualitative inputs (and incorporation of those inputs into the larger whole). my hunch is that part of why people think planning for analysis of qualitative work may be difficult is because, often, people don’t plan to ‘analyze’ qualitative data. instead, perhaps, the extent of the plan is to collect data. and then they plan to find a good quote or story (“anecdata”) — which may raise some questions about whether social science research is being done. not planning for analysis can limit one’s ability to set out a plan for analysis. this is different than saying that planning is not possible — there are plenty of books on qualitative data analysis (e.g. here, here, here, and many others). here are some things that can be planned in advance:

  1. how will interviews and field notes be transcribed? verbatim?
  2. how will notes or transcripts be stored and analyzed? handwritten? in excel (my tendency)? using software such as NVivo?
  3. are you planning to be guided by theory or by the data in your analysis? for example, maybe the domains you initially planned to study were themselves drawn from some theory or framework about how the world works. this suggests that your analysis may be theory-driven and you may at least start with (promise to) closed-code your data, looking for instances and examples of particular theoretical constructs in the transcripts and field notes and “code” (label) them as such.

or, maybe the work is far more exploratory and you set out to learn what people think and do, in general, about a particular topic. it’s more likely that you’ll be open-coding your data — looking for patterns that emerge (ideas that are repeatedly raised). and it’s likely you’ll have some idea in advance that that is what you intend to do. even if you start out closed-coding, it may turn out that a whole lot of your data end up falling outside the initially planned theoretical framework. fine. that doesn’t mean that you can’t report on what did fit in the framework (=plan) and then analyze all that interesting stuff that happened outside it as well. which, i think, is why we are talking about pre-analysis plans rather than pre-analysis straitjackets. to close, in discussing whether pre-analysis plans for qualitative research — in the sense of talking to and watching living people, perhaps as part of a mixed-methods research agenda — are feasible, i hope the conversation is guided by whether planning is indeed possible in the social sciences as opposed to whether such planning currently takes place.

small thoughts on transparency in research (descriptions of methods, analysis)

there is currently a good deal of attention on transparency of social science research – as there should be. much of this is focused on keeping the analysis honest, including pre-analysis plans (e.g.) and opening up data for re-analysis (internal replication, e.g. here and here). some of this will hopefully receive good discussion at an upcoming conference on research transparency, among other fora.

but, it seems at least two points are missing from this discussion, both focused on the generation of the analyzed data itself.


intervention description and external replication

first: academic papers in “development” rarely provide a clear description of the contents of an intervention / experiment, such that it could be, plausibly, reproduced. growing up with a neuroscientist / physiological psychologist (that’s my pop), i had the idea that bench scientists had this part down. everyone (simultaneously researchers and implementers) has lab notebooks and they take copious notes. i know because i was particularly bad at that part when interning at the lab.*

then, the researchers report on those notes: for example, giving the precise dimensions of a water maze they built (to study rodent behavior in stressful situations) and a nice diagram so that you could, with a bit of skill, build your own version of the maze and follow their directions to replicate the experiment.

pop tells me i am overly optimistic about the bench guys getting this totally right. he agrees that methods sections are meant to be exact prescriptions for someone else to reproduce your study and its results. for example, they are very detailed on exactly how you ran the experiment, the apparatus used, where reagents (drugs) were purchased, etc. he also notes that one thing that makes this easier in bench science is that “most experimental equipment is purchased from a manufacturer which means others can buy exactly the same equipment. gone are the dark days when we each made our own mazes and such. reagents are from specific suppliers who keep detailed records on the quality of each batch…”

then he notes: “even with all this, we have found reproducibility to be sketchy, often because the investigators are running a test for the first time. a reader has to accept that whatever methodological details were missed (your grad student only came in between 1 and 3AM when the air-conditioning was off) were not critical to the results.” or maybe this shouldn’t go unreported and accepted.

the basic idea holds in and out of the lab: process reporting on the intervention/treatment needs to get more detailed and more honest. without it, the reader doesn’t really understand what the ‘beta’ in any regression analysis means – and with any ‘real world’ intervention, there’s a chance that beta contains a good deal of messiness, mistakes, and iterative learning resulting in tweaks over time.

as pop says: “an investigator cannot expect others to accept their results until they are reproduced by other researchers.” and the idea that one can reproduce the intervention in a new setting (externally replicate) is a joke unless detailed notes are kept about what happens on a daily or weekly basis with implementation and, moreover, these notes are made available. if ‘beta’ contained some things at one time in a study and a slightly different mix at a different time, shouldn’t this be reported? if research assistants don’t / can’t mention to their PIs when things get a bit messy in ‘the field’, and PIs in turn don’t report glitches and changes to their readers or other audiences, then there’s a problem.


coding and internal replication

as was raised not-so-long-ago by the nice folks over at political violence at a glance, the cleaning and coding of data for analysis is critical to interpretation – and therefore critical to transparency. there is not enough conversation happening about this – with “this,” in large part, being about construct validity. there are procedures for coding, usually involving independent coders working with the same codebook, then checking inter-rater reliability and reporting the resultant kappa or other relevant statistic. the reader really shouldn’t be expected to believe the data otherwise, on the whole “shit in, shit out” principle.

in general, checks on data that i have seen relate to double-entry of data. this is important but hardly sufficient to assure the reader that the findings reported are reasonable reflections of the data collected and the process that generated them. the interpretation of the data prior to the analysis – that is, coding and cleaning — is critical, as pointed out by political violence at a glance, for both quantitative and qualitative research. and, if we are going to talk about open data for reanalysis, it should be the raw data, so that it can be re-coded as well as re-analyzed.
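
for concreteness, a sketch of what reporting that double-entry check might look like, assuming two hypothetical entry files covering the same respondents and fields (the names and structure are made up for illustration):

```python
# sketch: compare two independent entries of the same questionnaires and report the mismatch rate.
import pandas as pd

entry_1 = pd.read_csv("entry_pass_1.csv").set_index("respondent_id").sort_index()
entry_2 = pd.read_csv("entry_pass_2.csv").set_index("respondent_id").sort_index()

# a cell counts as a mismatch unless both entries agree (or both are missing)
mismatches = (entry_1 != entry_2) & ~(entry_1.isna() & entry_2.isna())
rate = mismatches.to_numpy().mean()

print(f"cell-level mismatch rate across double-entered data: {rate:.1%}")
print("fields with the most discrepancies:")
print(mismatches.sum().sort_values(ascending=False).head())
```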


in short, there’s more to transparency in research than allowing for internal replication of a clean dataset. i hope the conversation moves in that direction — the academic, published conversation as well as the over-beers conversation.


*i credit my background in anthropology, rather than neuroscience, with getting better at note-taking. sorry, pop.

have evidence, will… um, erm? (4 of 6, going public)

this is a joint post with suvojit. it is also posted on people, spaces, deliberation.

in our last post, we discussed how establishing “relevant reasons” for decision-making ex ante may enhance the legitimacy and fairness of deliberations on resource allocation. we also noted that setting relevant decision-making criteria can inform evaluation design by clarifying what evidence needs to be collected.

we specifically focus on the scenario of an agency deciding whether to sustain, scale or shut down a given program after piloting it with an accompanying evaluation — commissioned explicitly to inform that decision. our key foci are both how to make evidence useful for informing decisions and how, recognizing that evidence plays a minor role in decision-making, to ensure decision-making is done fairly.

for such assurance, we primarily rely on Daniels’ framework for promoting “accountability for reasonableness” (A4R) among decision-makers. if the framework’s four criteria are met, Daniels argues, deliberations gain legitimacy and, he further argues, the resulting decisions gain fairness.

in this post, we continue with the second criterion to ensure A4R: the publicity of decisions taken drawing on the first criterion, relevant reasons. we consider why transparency – that is, making decision criteria public – enhances the fairness and coherence of those decisions. we also consider what ‘going public’ means for learning.

disclaimer: logistical uncertainties / room for conversation and experimentation

from the outset, we acknowledge the many unanswered questions about how much publicity or transparency suffices for fairness and how to carry it out.

  • should all deliberations be opened to the public? made available ex post via transcripts or recordings? or, is semi-transparency — explicitly and publicly announcing ex post the criteria deemed necessary and sufficient to take the final decision — acceptable, while the deliberation remains behind closed doors?
  • who is the relevant public?
  • can transparency be passive – making the information available to those who seek it out – or does fairness require a more active approach?
  • what does ‘available’ or ‘public’ mean in contexts of low-literacy and limited media access?

we do not address these questions — which are logistical and empirical as well as moral — here. as the first-order concern, we consider why this criterion matters.


fairness in specific decisions

any decision about resource allocation and limit-setting will be contrary to the preferences of some stakeholders – both those at and not at the decision table. in our scenario, for example, some implementers will have invested some quantity of blood, sweat and tears into piloting a program and may, as a result, have opinions on whether the program should continue; there are also those who were comfortable in their inaction (as a result of a lack of directives or funds or just plain neglect) and who will now have to participate in a scale-up. there will be participants who benefited during the pilot – and those who would have done so if the program were scaled – who may prefer to see the program maintained.

these types of unmet preferences shape Daniels’s central concern: what can an agency* say to those people whose preferences are not met by a decision to convince them that, indeed, the decision “seems reasonable and based on considerations that take… [their] welfare into account?”** being able to give acceptable explanations to stakeholders about a decision is central to fairness.


coherence across decisions

the acceptability of criteria for a given decision contributes to the fairness of that decision. but the long-run legitimacy of decision-makers benefits from consistency and coherence in organizational policy. transparency, and the explicitness it requires, can foster this.

once reasons for a decision are made public, it is more difficult not to deal with similar cases similarly – the use of ‘precedent’ in judicial cases aptly illustrates this phenomenon. treating like cases alike is an important requirement of fairness. Daniels envisions that a series of explicated decisions can function as an organizational counterpart of ‘case law’. future decision-makers can draw on past deliberations to establish relevant reasons. deviations from past decisions would need to be justified by relevant reasons.


implications for learning, decision-making and evaluations

if all decision-makers acknowledge that, at least, the final reasons for their decisions will be publicly accessible, how might that change the way they commission an evaluation and set about using the evidence from it?

first, it should encourage a review of past deliberations to help determine currently relevant reasons. second, it might encourage decision-makers and evaluators to favor, as relevant, reasons and measures that will be explainable and understandable to the public(s) when justifying their decisions.

  • in planning evaluations, decision-makers and researchers will have to consider the clarity of the methods of data collection and analysis — effectively, will it pass a ‘grandmother test’? moreover, does it pass such a test when that granny is someone affected by your allocative decision? remember the central question that makes this criterion necessary: what can an agency say to those whose preferences are not met by a decision to convince them that, indeed, the decision “seems reasonable and based on considerations that take… [their] welfare into account?”
  • there are reasons that decision-makers might shy away from transparency. in his work on health plans, Daniels notes that such organizations (speculatively) feared media and litigious attacks. in our pilot-and-evaluate scenario, some implementers may not be comfortable with publicizing pilots that may fail, or with raising the expectations of beneficiaries who are part of pilots.
  • the fear of failure may influence implementers; this may lead to low-risk/low-innovation pilots. again, this links back to an important consideration raised above, in the questions we did not answer: when and how much transparency suffices for fairness?


in our last blog, we stressed the importance of engaging stakeholders in setting ‘relevant reasons’ before a project begins, as a key step towards fair deliberative processes as well as a way of shaping evaluations to be useful for decision-making. ensuring publicity and transparency of the decision-making criteria strengthens the perception of a fair and reasonable process in individual cases and over time.

this also sets the stage for an appeals process, where stakeholders can use evidence available to them to advocate a certain way forward; it also allows stakeholders to revisit the decision-making criteria and the decisions they fostered – the subject of our next post in this series.

***

*we note that donors don’t actually often have to answer directly to implementers and participants for their decisions. we do not, however, dismiss this as a terrible idea.

**we are explicitly not saying ‘broader’ welfare because we are not endorsing a strictly utilitarian view that the needs of some can be sacrificed if the greater good is enhanced, no matter where or how that good is concentrated.

back (and forward) from ‘the big push forward’ – thoughts on why evidence is political and what to do about it

i spent the beginning of the week in brighton at the ‘big push forward‘ conference on the politics of evidence (#evpolitics), which mixed the need for venting and catharsis (about the “results agenda” and “results-based management” and “impact evaluation”) with some productive conversation, though with no immediate concreteness on how the evidence from the conference would itself be used.

in the meantime, i offer some of my take-aways from the conference – based on some great back-and-forths with some great folks (thanks!), below.

for me, the two most useful catchphrases were trying to get to “relevant rigor” (being relevantly rigorous and rigorously relevant) and to pay attention to both “glossy policy and dusty implementation.” lots of other turns-of-phrase and key terms were offered, not all of them – to my mind – terribly useful.

there was general agreement that evidence could be political in multiple dimensions. these included:

  • what questions are asked (and in skepticism of whose ideas they are directed), by whom, of whom, with whom in mind (who needs to be convinced), for whom – and why
  • the way questions are asked and how evidence is collected
  • how evidence is used and shared – by whom, where and why
  • how impact is attributed – to interventions or to organizations (and whether this fuels competitiveness for funds and recognition)
  • whether the originators of the idea (those who already ‘knew’ something was working in some way deemed insufficiently rigorous) or the folks who analyze evidence receive credit for the idea

questions and design. in terms of what evidence is collected, a big part of the ‘push back’ relates to what questions are asked and whether they help governments and organizations improve their practice. this requires getting input from many stakeholders on what questions are important to ask. in addition, it requires planning for how the evidence will be used, including what will be done if results are (a) null, (b) mixed, confused or inconclusive, and (c) negative. more generally, this requires recognizing that policy-makers aren’t making decisions about ‘average’ situations but rather decisions for specific situations. as such, impact evaluations and systematic reviews need to help them figure out what evidence applies to their situation. the sooner expectations are dispelled that an impact evaluation or a systematic review will provide a clear answer on what should be done next, the better.

my sense, which was certainly not consensus, is that to be useful and to avoid being blocked by egos, impact questions need to shift away from “does X work?” to “does X work better than Y?” and/or “how can X be made to work better?” this also highlights the importance of monitoring and the feedback of information into learning and decision-making (i.e.).

two more points on results for learning and decision-making. first, faced with the assertion that ‘impact evaluation doesn’t reveal *why* something works,’ it is unsatisfactory to say something along the lines of ‘we look for heterogenous treatment effects.’ it absolutely also requires asking front-line workers and program recipients why they think something is and is not working — not as the final word on the matter but as a very important source of information. second, as has been pointed out in many places (e.g.), designing a good impact evaluation requires explication of a clear “Theory of Change” (still not my favorite term but apparently one that is here to stay). further, it is important to recognize that articulating a ToC (or LogFrame or any similar tool) should never be one person’s all-nighter for a funding proposal. rather, the tool is useful as a way of collectively building consensus around mission and why & how a certain idea is meant to work. as such, time and money need to be allocated for a ToC to be developed.

collection. as for the actual collection of data, there was a reasonable amount of conversation about whether the method is extractive or empowering, though probably not enough on how to shift towards empowerment and the fact that extractive/empowering are not synonymous with quant/qual. an issue that received less attention than it should have was that data collection needs to align with an understanding of how long a program should take to work (and funding cycles should be realigned accordingly).

use. again, the conversation on the use of evidence was not as robust as i had hoped. however, it was pointed out early on (by duncan green) that organizations that have been commissioning systematic reviews in fact have no plan to use that evidence systematically. moreover, there was a reasonable amount of skepticism around whether such evidence would actually be used to make decisions to allocate resources to specific organizations or projects (for example, to kill or radically alter ineffective programs). rather, there is a sense that much impact evaluation is actually policy-based evidence-making, used to justify decisions already taken. alternatively, though, there was concern that the more such evidence was used to make specific funding decisions, the more organizations would be incentivized to make ‘sausage‘ numbers that serve no one. thus, the learning, feedback and improving aspects of data need emphasis.

empowerment in the use of data (as opposed to its collection) was not as much a part of the conversation as i would have hoped, though certainly people raised issues of how monitoring and evaluation data were fed back to and used by front-line workers, implementers, and ‘recipients.’ a few people stressed the importance of near-automated feedback mechanisms from monitoring data to generate ‘dashboards’ or other means of accessible data display, including alternatives to written reports.

a big concern on the use of evidence was ownership and transparency of data (and results), including how this leads to the duplication/multiplication of data collection. surprisingly, with regard to transparency of data and analysis, no one mentioned the recent reinhart & rogoff mess, nor anything about mechanisms for improving data accessibility (e.g.).

finally, there was a sense that the data collected need to be useful – that the pendulum has swung too far from a dearth of data about development programs and processes to an unused glut, such that the collection of evidence feels like ‘feeding the beast.’ again, this loops back to planning how data will be broadly used and useful before they are collected.