thoughts from #evalcon on evidence uptake, capacity building

i attended a great panel today, hosted by the think tank initiative and idrc and featuring representatives from three of tti’s cohort of think tanks. this is part of the broader global evaluation week (#evalcon) happening in kathmandu, focused on ‘building bridges: use of evaluation for decision making and policy influence.’ the notes on evidence uptake largely come from the session, while the notes on capacity building are my own musings inspired by the event.

.

one point made early on was to contrast evidence-informed decision-making with opinion-informed decision-making. i’ve usually heard the contrast painted as evidence-based versus faith-based decision-making and think the opinion framing is useful. it also comes in handy for one of the key takeaways from the session: maybe the point (and the feasible goal) isn’t to do away with opinion-based decision-making but rather to make sure that opinions are increasingly shaped by rigorous evaluative evidence. or, to be more bayesian about it, we want decision-makers to continuously update their priors about different issues, drawing on evidence.
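to make that bayesian framing concrete, here is a toy sketch (my own, not from the panel) of an opinion as a prior that shifts, rather than disappears, as evaluation evidence accumulates. the beta-binomial set-up and all the numbers are purely illustrative.

```python
# toy illustration: a decision-maker's belief that a program "works",
# updated as evaluation evidence accumulates (beta-binomial conjugate model).
from scipy import stats

prior = stats.beta(a=2, b=2)  # weak prior: roughly 50/50 that the program works

# hypothetical evidence base: 4 evaluations with positive findings, 1 null/negative
positive, null_or_negative = 4, 1
posterior = stats.beta(a=2 + positive, b=2 + null_or_negative)

print(f"prior mean belief:     {prior.mean():.2f}")      # 0.50
print(f"posterior mean belief: {posterior.mean():.2f}")  # ~0.67: the opinion is reshaped, not replaced
```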

.

this leads to a second point. in aiming for policy influence, we may become too focused on influencing very specific decision-makers for very specific decisions. this may lead us to lose sight of the broader goal of (re-)shaping the opinions of a wide variety of stakeholders and decision-makers, even if not linked to the immediate policy or program under evaluation. so, again, the frame of shaping opinions and aiming for decision-maker/power-center rather than policy-specific influence may lead to altered approaches, goals, and benchmarks.

.

a third point that echoed throughout the panel is that policy influence takes time. new ideas need time to sink in and percolate before opinions are re-shaped. secretary suman prasad sharma of nepal noted that, from a decision-maker’s point of view, evaluations are better and more digestible when they aim to build bit by bit. participants invoked a building-blocks metaphor several times and contrasted it with “big bang” results. a related and familiar point about the time and timing required for evaluation to change opinions and shape decisions is that planning for the next phase of the program cycle generally begins midway through current programming. if evaluation is to inform this next stage of planning, it requires the communication of interim results — or a more thoughtful shift of the program planning cycle relative to monitoring and evaluation funding cycles in general.

.

a general point that came up repeatedly was what constitutes a good versus a bad evaluation. this leads to a key capacity-building point: we need more “capacity-building” to help decision-makers recognize credible, rigorous evidence and to mediate between conflicting findings. way too often, in my view, capacity-building ends up being about how particular methods are carried out, rather than about the central task of identifying credible methodologies and weighting the findings accordingly (or about broader principles of causal inference). that is, capacity-building among decision-makers needs to (a) start from an understanding of how they currently assess credibility (on the radical premise that capacity-building exercises might generate capacity on both sides) and (b) help them become better consumers, not producers, of evidence.

.

a point that surfaced continually about how decision-makers assess evidence concerned objectivity and neutrality. ‘bad evaluations’ are biased and opinionated; ‘good evaluations’ are objective. there is probably a much larger conversation to be had about parsing objectivity from independence and engagement, as well as further assessment of how decision-makers judge neutrality and how evaluators might establish and signal their objectivity. as a musing: a particular method doesn’t guarantee neutrality, which can also be violated in shaping the questions, selecting the site and sample, and so on.

.

other characteristics of ‘good evaluation’ came out as well. good evaluation doesn’t confuse being critical with only being negative; findings about what is working are also appreciated. ‘bad evaluation’ assigns blame and accountability to particular stakeholders without taking a nuanced view of the context and events (internal and external) during the evaluation. ‘good evaluation’ involves setting evaluation objectives up front. ‘good evaluation’ also places the findings in the context of other evidence on the same topic; this literature/evidence review work, especially when it does not focus on a single methodology or discipline (and, yes, i am particularly alluding to RCT authors who tend to cite only other RCTs, at the expense of sectoral evidence and other methodologies), is very helpful to a decision-making audience, as is helping to make sense of conflicting findings.

.

a final set of issues related to timing and transaction costs. a clear refrain throughout the panel was the importance of the timing of sharing findings. this means paying attention to the budget-making cycle and sharing results at just the right moment. it means spotting windows of receptivity to evidence on particular topics, reframing the evidence accordingly, and sharing it with decision-makers and the media. it probably means learning a lot more from effective lobbyists. staying in tune with policy and media cycles in a given evaluation context is hugely time consuming, and the point was made, and is well taken, that the transaction costs of this kind of staying-in-tune for policy influence are quite high for researchers. perhaps goals for influence by the immediate researchers and evaluators should be more modest, at least when shaping a specific decision was not the explicit purpose of the evaluation.

.

if the immediate researchers can only do so much, two more indirect routes seem promising. one is to communicate the findings clearly to, and do the necessary capacity-building with, naturally sympathetic decision-makers (say, parliamentarians or bureaucrats with an expressed interest in x issue) so that they become champions who keep the discussion going within decision-making bodies. to reiterate, my view is that capacity-building efforts should prioritize helping decision-makers become evidence champions and good communicators of specific evaluation and research findings. this is an indirect road to influence but an important one, leveraging the credibility decision-makers have with one another. the second, also indirect, is to communicate the findings clearly to, and do the necessary capacity-building with, the types of (advocacy? think tank?) organizations whose job it is to track the timing of budget meetings and the shifting political priorities and local events to which the evidence can be brought to bear.

.

the happy closing point was that a little bit of passion in evaluation, even while trying to remain neutral and objective, does not hurt.

Aside

john oliver on why context/setting matters

#lastweektonight, on mandatory minimums (video here, article with embedded video).

.

context is important. for instance, shouting the phrase, “i’m coming,” is fine when catching a bus but not ok when you’re already on the bus.

tentative thoughts on ownership: work-in-progress

i am road-testing a few ideas from the conclusion of my thesis, in which i try to bring out two themes recurring throughout the analyses of the adoption and implementation of the phase 1 pilot of the amfm in ghana between 2010 and 2012. these themes are ownership and risk-taking. i have already written a bit about risk-taking here. below, i share some of my tentative ideas and questions on ownership (slightly edited from the thesis itself, including removing some citations of interviewees for now).

.

delighted for comments.

.

one undercurrent running throughout this thesis is the idea of ownership: of the definition of the problem and solution at hand, of the process of adopting the amfm, and of the program itself and its implementation.

.

in chapter 4, i introduced stakeholder ideas of paths not taken, including how the program might have “develop[ed], not negate[d], local production capacity,” for instance through support to local manufacturers to upgrade to meet WHO prequalification and through work to bring local government and industry (rather than global industry) into closer partnership. both those ultimately receptive to the amfm and those resistant to it acknowledged that all national stakeholders “would have preferred to have had quality, local drugs.” the very strength of the amfm design — high-level negotiations and subsidization — precluded local, structural changes.

.

in chapter 5, i highlighted that several key stakeholders refused to take — that is, to own — a stand on whether ghana should apply to the phase 1 pilot. moreover, the key institutional decision-makers in the country coordinating mechanism for the global fund (ccm) vacillated on whether or not to send the application, while a variety of circumstantial stakeholders felt they had a stake in the decision and worked to influence the process. in chapter 6, i analyzed how global ideas and actors played a role in ghana’s adoption of phase 1. in chapter 7, i described the way the amfm coordination committee (amfm-cc) was set up, which, in composition and process, differs from the ccm.

.

these points on alternatives not considered, on vacillation, on avoidance, and on outright resistance relate to conceptions of country ownership of development initiatives, as in the paris declaration. the absence of a national politics and an aligned problem stream, in particular, dovetails neatly with david booth’s ideas about what should be meant by country ownership (booth 2012). he proposes that it means an end to conditionality to “buy reform” and an end to channeling aid funding through “projects” as a way of bypassing country decision-making bodies, processes, and institutions (booth 2010).

.

the ccm represents an interesting example with which to examine country ownership. its explicit raison d’être is to foster ownership, and it does indeed bring together representatives of government bureaucracy, business, and civil society, “representing the views and interests of grant recipient countries.” yet this structure allowed for vacillation within and strong views without. we must consider this and also juxtapose the make-up of the ccm with that of the amfm-cc in terms of the stakeholders represented, the capacity and legitimacy to make relevant decisions, and a sense of ownership of the work ahead. having done this, it seems that, at a minimum, we must question whether the ccm composition when adopting phase 1 allowed for sufficient ownership. given the effort of ccm members to cede decision-making power to the minister of health, it seems that ccm members did not think so.

.

however, it is not fair to critique apparent limited ownership without raising three additional questions:

  • would ghana have tried out the amfm if political or bureaucratic actors had to take initial responsibility for the design?
  • did limited national ownership of the design and adoption decision allow national stakeholders to implement the initiative better and more “energ[etically]”, maximizing credit-seeking after minimizing the risk of blame during adoption (while recognizing that policy entrepreneurs and others still felt this risk keenly)?
  • how should we interpret ghana’s decision to continue with the global fund’s private sector co-payment mechanism?

these questions offer avenues for further analysis of the role of donors, the state, and the public.

.

indeed, ownership is not only an issue for capital-based elites; fox (2015) recently highlighted that “the current aid architecture deprives both african governments and african publics of agency.” in chapter 7, i introduced the views of citizens and businesspeople at the street level of implementation. during in-depth interviews, about 20% spontaneously said they wanted to see the amfm continued — a view that seems to have had no way of entering any debates about the future of the amfm and is absent from the academic literature on this initiative.

.

though in the minority, some respondents specifically voiced that they should have learned about the amfm through a government agency or professional association. two explicitly raised their position as stakeholders. one, who heard from her supplier, said, “i think it wasn’t fair because as major stakeholders, we should have been briefed before.” another, who heard first from the media, said, “i felt this was wrong since we are a major stakeholder. we should have met as partners.” these concerns relate to relations between ghana and the global fund as well as between accra-based elites and tamale-based retailers.

.

the events of both adoption and implementation of the amfm suggest that ownership is important (in no way a novel claim). note, though, that being just an implementer may accord a certain freedom to innovate, compared with having clear ownership of a new idea, decision-making power over adopting and implementing that idea, and, accordingly, more risk if the idea does not pan out.

.

if we accept that ownership is indeed important, which seems a plausible lesson to draw from this thesis, we also learn that simply giving decision-making power to some national stakeholders is insufficient. the right national stakeholders and their existing decision-making structures need to be in play. we may glean something about the relevant national stakeholders in this case through the composition of the amfm-cc and the committee characteristics raised as important (transparency, collaboration). but, given the views of some street-level implementers, ownership may require further consideration.

Aside

nice paragraph on local leadership

though i have read several of david booth’s papers on country ownership, i appreciate craig valters pointing me (conversation here) to booth’s joint work with sue unsworth on doing development in ways that are politically smart and locally led. the whole paper is worth a read. this paragraph stood out:

the question of local leadership is not about the nationality of front-line actors; nor is it about donor agency staff not being involved in the process. it is about relationships in which aid money is not the primary motivator of what is done or a major influence on how it is done… the starting point is a genuine effort to seek out existing capacities, perceptions of problems and ideas about solutions, and to enter into some sort of relationship with leaders who are motivated to deploy these capabilities.

Brief Thought on Commitment-To-Analysis Plans

First, I am starting a small campaign to push towards calling ‘pre-analysis plans’ something else before the train gets too far from the station. Something like ‘commitment to analysis plans’ or ‘commitment to analysis and reporting plans.’ I have two reasons for this.

  1. PAP just isn’t a super acronym; it’s kind of already taken.
  2. I think the name change moves the concept a step back from indicating that the researcher needs to pre-specify the entirety of the analysis plan towards, rather, indicating the core intended data cleaning and coding procedures and the central analysis — and committing to completing and reporting those results, whether significant or not. This shift, towards a commitment rather than a straitjacket, seems like it would go some way towards addressing concerns expressed by Olken and others that the task of pre-specifying all possible analyses ex ante is both herculean and blinkered, in the sense of not incorporating learnings from the field to guide parts of the analysis. The commitment, it seems to me, should be partly about making clear to the reader of a study which analyses were ‘on plan’ and which came later, rather than claiming perfect foresight.

Second, speaking of those learnings from the field that may be incorporated into analysis… I had a moment today to think a bit about the possible views from the field that come from surveyors (as I am working on some of my dissertation analysis and already starting to form a list of questions to write back to the survey team with which I worked!). In the decisions about data cleaning laid out by folks like Humphreys and McKenzie in their lists of what should be specified in a commitment-to-analysis plan (doesn’t a ‘CAP’ sound nice?), surveyors play very little role.

Yet a survey (or discussion) among survey staff about their experience with the questionnaire can yield information on whether there were any questions they systematically felt uncomfortable with or uncertain about, or that respondents rarely seemed to understand. Yes, many of these kinks should be worked out during piloting but, no, they aren’t always. Sometimes surveyors don’t get up the gumption to tell you a question is terrible until the research is underway, and sometimes they themselves don’t realize it.

For example, in one field experiment with which I was involved, surveyors only admitted at the end (we conducted an end-of-survey questionnaire among them) how uncomfortable they were with a short-term memory test module (which involved asking respondents to repeat strings of numbers) and how embarrassing it was to ask these questions of their elders, to the point that some of them breezed through these questions pretty quickly during interviews and considered some of the answers they reported suspect. Some wrote fairly agonizing short essays to me in the end-of-survey questionnaire (it’s a good thing to make them anonymous!), asking me to “Imagine that you have to ask this question to an elder…” and proceeding to explain the extreme horror of this.* As the short-term memory module was not part of the central research question or main outcomes of interest, it was not subjected to any of the audit, back-check, or other standard data-quality procedures in place, and so the problem was not caught earlier.

I can imagine a commitment-to-analysis plan that committed to collecting and incorporating surveyor feedback. For example, a CAP that stated that if >90% of surveyors reported being uncertain about the data generated by a specific question, those data would be discarded or treated with extreme caution (and that caution passed on to the consumers of the research). Maybe this could be one important step to valuing, in some systematic way, the experience and insights of a survey team.
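As a rough sketch of how such a commitment could work in practice (the column names, threshold rule, and data below are hypothetical illustrations, not an established tool or an actual CAP):

```python
import pandas as pd

# Hypothetical surveyor debrief: one row per surveyor per question item, with a
# yes/no flag for "I was uncertain about the data this question generated."
feedback = pd.DataFrame({
    "question_id": ["q1", "q1", "q2", "q2", "q3", "q3"],
    "surveyor_id": ["s1", "s2", "s1", "s2", "s1", "s2"],
    "uncertain":   [True, True, True, False, False, False],
})

UNCERTAINTY_THRESHOLD = 0.9  # the pre-committed >90% rule from the CAP

share_uncertain = feedback.groupby("question_id")["uncertain"].mean()
flagged_items = share_uncertain[share_uncertain > UNCERTAINTY_THRESHOLD].index.tolist()

# Per the CAP, data from flagged items are discarded or reported with heavy caveats.
print("Items to discard or treat with extreme caution:", flagged_items)
```

The point is less the code than the pre-commitment: the rule for acting on surveyor feedback is written down before the data come in.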

*For the record, I can somewhat imagine this, having once worked in a call center conducting follow-up interviews with older women about their pelvic floor disorder surgery and whether they were experiencing any urinary symptoms. In that case, however, most of the discomfort was on my side, as they were well versed in — and fairly keen on — talking about their health issues and experiences! Note to self: aim not to have pelvic floor disorder.

Thinking More About Using Personas/Personae In Developing Theories of Change

I have previously advocated, here (and here), for taking a ‘persona’ or character-based approach to fleshing out a theory of change. This is a way of involving a variety of stakeholders (especially those closer to the ground, such as intended beneficiaries and street-level implementers) in discussions about program and theory-of-change development — even when they are not physically at the table, since having them there is not always possible (though encouraged, of course).

This week, I had a new chance to put some of these ideas into action. A few lessons learned for future efforts:

  • This activity worked well in small groups. However, it may be too much to ask groups to fully develop their own personae, especially given possible time limits within the confines of a workshop.
    • It may be better to have some partially developed characters in mind (for example, characters that represent differing initial levels of the key outcomes of interest and variation on some of the hypothesized sub-groups of interest (explanatory variables)). Groups can then take a shorter amount of time to elaborate — rather than develop — these dossiers and give a name to each of their creations (Mary, Bob, Fatima, etc.). Alternatively, developing dossiers (and therefore articulating sub-groups of interest) could be a separate, opening activity.
  • Introducing any language about “role-playing” can lead to only one person in a group assuming the role of a given character and the group sort of playing ’20 Questions’ with that character, rather than everyone trying to consider and take on the thoughts, intentions, decisions, and steps a given character might take when confronted with a given intervention (as either a targeted beneficiary or an implementer). The idea is to get the team thinking about the potential barriers and enablers at multiple levels of influence (i.e. assumptions) that may be encountered on the path towards the outcomes of interest.
  • Speaking in “I” statements helps people try to think like the different adopted personae. I really had to nag people on this in the beginning but I think it was ultimately useful to get people speaking in this way. Relatedly, there may be important lessons from cognitive interviewing practice (how-to here) for getting activity participants to think out loud about the chain of small decisions and actions they would need to take when confronted with a new program or policy.
  • I noted a marked tendency this time around for men to only speak for male characters and for women, the same! There may be some creative ways to discourage this (thoughts welcome).
  • There are two potential key goals of an activity like this, which should be kept distinct (and be articulated early and often during the activity) even though they are iterative.
    • A first relates to Elaborating Activities, that is, developing a robust intervention so that nuanced activities and ‘wrap-around’ support structures (to use Cartwright and Hardie’s terminology) can be developed. This can lead to a laundry or wish list of activities — so if the exercise is at the brainstorming stage, this can be articulated as an ‘ok’ outcome or even an explicit goal.
    • A second relates to Explicating and Elaborating Assumptions, filling in all the intermediate steps between the big milestones in a results chain. This second goal is bound up in the process of moving from a log-frame to a robust theory of change (as noted by John Mayne at the Canadian Evaluation Society, this is adding all the arrows to the results-chain boxes), as well as developing a more robust and nuanced set of indicators to measure progress towards goals and uncover the mechanisms leading to change.
      • A nice wrap-up activity here could include sorting out the assumptions for which evidence is already available and those for which evidence should be collected and measured as part of the research work.
  • It remains an important activity to elaborate and verbally illustrate how X character’s routines and surroundings will be different if the end-goals are reached — given that social, environmental, infrastructural, and institutional change is often the goal of ‘development’ efforts. This last step of actually describing how settings and institutions may operate differently, and the implications for quotidian life, is an important wrap-up, and time needs to be made for it.

Of course, the use of personae (or an agent-based perspective) is only one part of elaborating a theory of change. But it can play an important role in guiding the other efforts to provide nuance and evidence, including highlighting where to fill in ideas from theoretical and empirical work to end up with a robust theory of change that can guide the development of research methods and instruments.

Would be great to hear further ideas and inputs!

Back to Basics — Trusting Whether and How The Data are Collected and Coded

This is a tangential response to the LaCour and #lacourgate hubbub (with hats off to the summaries and views given here and here). While he is not implicated in all of the comments below, I am most certainly indebted to Mike Frick for planting the seed of some of the ideas presented below, particularly on member-checking (hopefully our under-review paper on the same will be out sometime in the future…). Salifu Amidu and Abubakari Bukari are similarly motivational-but-not-implicated, as are Corrina Moucheraud, Shagun Sabarwal, and Urmy Shukla.

To a large extent, the LaCour response brings a new angle to an increasingly familiar concern: trusting the analysis. This means additional (and important) calls for replication and other forms of post-publication peer review (as Broockman calls for) as a guard against significance-hungry, nefarious researchers. Pre-analysis plans, analytic/internal replications, and so on are all important steps towards research transparency. But they miss the fundamental tendency to treat data as ‘true’ once they make it into the familiar, rectangular format of a spreadsheet.

Given LaCour, it seems clear that we may need to take an additional step back to get to the heart of research: the data. We place a lot of trust in the data themselves — between advisers and advisees, between research collaborators, between producers and users of large, public data sets, and, in turn, between PIs, research assistants, and the actual team collecting the data. This trust concerns, of course, whether the data exist at all and whether they measure what they purport to measure. (Green seems to have had a hunch about this?)

We should be clear about the foundations of this trust and what we might do to strengthen it. Ultimately, the LaCour story is a story about the production of data, not their analysis. The transparency agenda needs to expand accordingly, to address the fundamental truism that ‘shit in leads to shit out.’

Here are a few thoughts:

  • Start to teach data collection like it matters. Survey design and data collection are weirdly absent from many graduate programs — even those oriented towards research. You may pick these up in electives but they are rarely required, to my knowledge. Learning about construct validity, validating test instruments in new contexts, questionnaire design, the potential for interviewer effects, some of the murky and inchoate contents of the activity labelled as ‘formative work*,’ etc., need not be re-discovered by each new graduate student or research assistant who takes on field work. If a course-work model won’t work, then a much more explicit apprenticeship model should be sought for those pursuing primary empirical work. In terms of teaching, one might occasionally be forgiven for thinking that impact evaluators had just discovered data collection and that there aren’t mounds of resources on household surveys, psychometrics, and questionnaire design that can be used to better ensure the quality and truthfulness of the data being collected. Interdisciplinary work needs to start with what data are collected, and by what means and measures, to answer a particular question.
  • Report on data quality practices. Lots of survey firms and researchers employ strategies such as data audits and back-checks. Good on you. Report it. This almost never makes it into actual publications, but these are not just internal operations processes. Researchers do need to put forth some effort to make their readers trust their data as well as their analysis, but so much less work seems to go into this. With the rise of empirical work in economics and other fields, this needs to be given more documented attention. If you threw out 5% of your data because of failed back-checks, tell me about it. I’d believe the remaining 95% of your data a lot more. The onus is on the researchers to make the reader trust their data.
  • Treat surveyors as a valuable source of information. It is increasingly common to at least have surveyors fill in a question at the end of the questionnaire about whether the respondent was cooperative (usually a Likert scale item) or some other brief reflection on how the interview went. I have no idea what happens to the data so produced — if they are used to throw out or differentially weight responses, do please tell the reader about it. Moreover, you can systematically ask your surveyors questions (including anonymously) about question items they don’t trust. For example, I asked (in written form) this question of surveyors and most reported that it was incredibly embarrassing for them to ask their elders to play certain memory games related to short-term recall. This might be a good sign to tread lightly with those data, if not discount them completely (whether or not the surveyors faithfully asked the embarrassing question, it still suggests a tense social interaction that may not have generated trustworthy data, even if it didn’t fall in the traditional space of ‘sensitive questions’). If nothing else, the surveyors’ assessments may be given as part of the ‘methods’ or ‘results’ attended to in publications. And, in general, remembering that surveys are human interactions, not matrix populators, is important.
  • Member-check. Member-checking is a process described by Lincoln and Guba (and others) that involves taking results and interpretations back to those surveyed in order to test interpretative hypotheses, etc. If some results really fly in the face of expectations, this process could generate some ‘red flags’ about which results and interpretations should be treated with care. And these can be reported to readers.
  • Coding. As with ‘formative work,’ the nuances of ‘we coded the open-ended data’ are often opaque, though this is where a lot of the interpretive magic happens. This is an important reason for the internal replication agenda to start with the raw data. In plenty of fields, it would be standard practice to use two independent coders and to report on inter-rater reliability (see the first sketch after this list). This does not seem to be standard practice in much of impact evaluation. This should change.
  • Check against other data-sets. It would not take much time for researchers to put their own findings into context by comparing (as part of a publication) the distribution of results on key questions to the distribution in large data-sets (especially when some questionnaire items are designed to mimic the DHS, LSMS, or other large public data-sets for precisely this reason); the second sketch after this list shows the kind of comparison I mean. This is not reported often enough. This does not mean that the large, population-linked data-set will always trump your project-linked data-set, but it seems only fair to alert your readers to key differences, for the purposes of internal believability as well as external validity.
  • Compare findings with findings from studies on similar topics (in similar contexts) — across disciplines. Topics and findings do not end at the boundaries of a particular method of inquiry. Placing the unexpectedness of your findings within this wider array of literature would help.
  • Treat all types of data with similar rigor and respect. (Cue broken record.) If researchers are going to take such care with quantitative data and then stick in a random quote as anec-data in the analysis without giving any sense of where it came from or whether it should be taken as representative of the entire sample or some sub-group… well, it’s just a bit odd. However you want to label these different types of data — quant and qual or data-set-observations and causal-process observations — they are empirical data and should be treated with the highest standards known in each field of inquiry.

I can’t assess whether any of these measures, singly or together, would have made a major difference in the LaCour case — especially since it remains nebulous how the data were generated, let alone with what care. But the LaCour case reveals that we need to be more careful. A big-name researcher was willing to trust that the data themselves were real and collected to the best of another researcher’s ability — and focused on getting the analysis right. In turn, other researchers bought into both the analysis and the underlying data because of the big-name researcher. This suggests we need to do a bit more to establish trust in the data themselves — and that the onus for this is on the researchers — big names or no — claiming to have led the data collection and cleaning processes. This is especially true given the unclear role for young researchers as potential replicators and debunkers, highlighted here. I hope the transparency agenda steps up accordingly.

*If on occasion a researcher reported on what happened during the ‘formative phase’ and about how the ‘questionnaire was changed in response,’ that would be really interesting learning for all of us. Report it. Also, if you are planning to do ‘qualitative formative work’ to improve your questionnaire, it would be good if you built in time in your research timeline to actually analyze the data produced by that work, report on that analysis, and explain how the analysis led to changing certain questionnaire items…