Back to Basics — Trusting Whether and How The Data are Collected and Coded

This is a tangential response to the LaCour and #lacourgate hubbub (with hats off to the summaries and views given here and here). While he is not implicated in any of the comments below, I am most certainly indebted to Mike Frick for planting the seed of some of the ideas presented below, particularly on member-checking (hopefully our under-review paper on the same will be out sometime in the future…). Salifu Amidu and Abubakari Bukari are similarly motivational-but-not-implicated, as are Corrina Moucheraud, Shagun Sabarwal and Urmy Shukla.

To a large extent, the response to LaCour brings a new angle to an increasingly familiar concern: trusting the analysis. This means additional (and important) calls for replication and other forms of post-publication peer review (as Broockman calls for) as a guard against significance-hungry, nefarious researchers. Pre-analysis plans, analytic/internal replications, and so on are all important steps towards research transparency. But they miss the fundamental tendency to treat data as ‘true’ once they make it into the familiar, rectangular format of a spreadsheet.

Given LaCour, it seems clear that we may need to take an additional step back to get to the heart of research: the data. We place a lot of trust in the data themselves — between advisers and advisees, between research collaborators, between producers and users of large, public data sets and, in turn, between PIs, research assistants, and the actual teams collecting the data. This trust concerns, of course, whether the data exist at all and whether they measure what they purport to measure. (Green seems to have had a hunch about this?)

We should be clear about the foundations of this trust and what we might do to strengthen it. Ultimately, the LaCour story is a story about the production of data, not their analysis. The transparency agenda needs to expand accordingly, to address the fundamental truism that ‘shit in leads to shit out.’

Here are a few thoughts:

  • Start to teach data collection like it matters. Survey design and data collection are weirdly absent from many graduate programs — even those oriented towards research. You may pick these up in electives but they are rarely required, to my knowledge. Learning about construct validity, validating test instruments in new contexts, questionnaire design, the potential for interviewer effects, some of the murky and inchoate contents of the activity labelled as ‘formative work*,’ etc, need not be re-discovered by each new graduate student or research assistant who takes on field work. If a course-work model won’t work, then a much more explicit apprenticeship model should be sought for those pursuing primary empirical work. In terms of teaching, one might occasionally be forgiven for thinking that impact evaluators had discovered data collection and that there aren’t mounds of resources on household surveys, psychometrics, and questionnaire design that can be used to better ensure the quality and truthfulness of the data being collected. Interdisciplinary work needs to start with what data are collected, and by what means and measures, to answer a particular question.
  • Report on data quality practices. Lots of survey firms and researchers employ strategies such as data audits and back-checks. Good on you. Report it. This almost never makes it into actual publications, but these are not just internal operations processes. Researchers need to put forth some effort to make their readers trust their data as well as their analysis, yet far less work seems to go into the former. With the rise of empirical work in economics and other fields, this needs to be given more documented attention. If you threw out 5% of your data because of failed back-checks, tell me about it. I’d believe the remaining 95% of your data a lot more (a minimal sketch of this kind of reporting follows this list). The onus is on the researchers to make the reader trust their data.
  • Treat surveyors as a valuable source of information. It is increasingly common to at least have surveyors fill in a question at the end of the questionnaire about whether the respondent was cooperative (usually a Likert-scale item) or some other brief reflection on how the interview went. I have no idea what usually happens to the data so produced; if they are used to throw out or differentially weight responses, do please tell the reader about it. Moreover, you can systematically ask your surveyors questions (including anonymously) about question items that they don’t trust. For example, I asked surveyors this question in written form and most reported that it was incredibly embarrassing for them to ask their elders to play certain memory games related to short-term recall. This might be a good sign to tread lightly with those data, if not discount them completely (whether or not the surveyors faithfully asked the embarrassing question, it still suggests that the question created a tense social interaction that may not have generated trustworthy data, even if it didn’t fall in the traditional space of ‘sensitive questions’). If nothing else, the surveyors’ assessments may be given as part of the ‘methods’ or ‘results’ attended to in publications. And, in general, remembering that surveys are human interactions, not matrix populators, is important.
  • Member-check. Member-checking is a process described by Lincoln and Guba (and others) that involves taking results and interpretations back to those surveyed to test interpretative hypotheses, etc. If some results really fly in the face of expectations, this process could generate some ‘red flags’ about which results and interpretations should be treated with care. And these can be reported to readers.
  • Coding. As with ‘formative work,’ the nuances of ‘we coded the open-ended data’ are often opaque, though this is where a lot of the interpretive magic happens. This is an important reason for the internal replication agenda to start with the raw data. In plenty of fields, it would be standard practice to use two independent coders and to report on inter-rater reliability (see the sketch after this list). This does not seem to be standard practice in much of impact evaluation. This should change.
  • Check against other data-sets. It would not take much time for researchers to put their own findings into context by comparing (as part of a publication) the distribution of results on key questions to the distribution from large data-sets (especially when some questionnaire items are designed to mimic the DHS, LSMS, or other large public data-sets for precisely this reason); a simple comparison sketch follows this list. This is not reported often enough. This does not mean that the large, population-linked data-set will always trump your project-linked data-set, but it seems only fair to alert your readers to key differences, for the purposes of internal believability as well as external validity.
  • Compare findings with findings from studies on similar topics (in similar contexts) — across disciplines. Topics and findings do not end with the boundaries of a particular method of inquiry. Placing the unexpectedness of your findings within this wider array of literature would help.
  • Treat all types of data with similar rigor and respect. (Cue broken record.) If researchers are going to take such care with quantitative data and then stick in a random quote as anec-data in the analysis without giving any sense of where it came from or whether it should be taken as representative of the entire sample or some sub-group… well, it’s just a bit odd. However you want to label these different types of data — quant and qual or data-set-observations and causal-process observations — they are empirical data and should be treated with the highest standards known in each field of inquiry.
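
On reporting back-checks (the ‘report on data quality practices’ point above): here is a minimal, purely illustrative sketch of the kind of flagging and summary that could accompany a write-up. The column names, the 80% match threshold, and the toy values are all assumptions for illustration, not anyone’s actual protocol.

```python
import pandas as pd

# Hypothetical data: each row is one interview; 'backcheck_match' is the share
# of re-asked items that matched the original interview during an independent
# re-visit (None = this interview was not back-checked).
surveys = pd.DataFrame({
    "interview_id": [1, 2, 3, 4, 5, 6],
    "backcheck_match": [0.95, 0.40, 0.88, None, 0.92, None],
})

MATCH_THRESHOLD = 0.80  # assumed cut-off; projects set their own

checked = surveys.dropna(subset=["backcheck_match"])
failed = checked[checked["backcheck_match"] < MATCH_THRESHOLD]

# The point is to report these numbers to readers, not just use them internally.
print(f"Back-checked: {len(checked)} of {len(surveys)} interviews")
print(f"Failed back-check (match < {MATCH_THRESHOLD:.0%}): "
      f"{len(failed)} ({len(failed) / len(checked):.1%} of those checked)")
```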
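
On coding and inter-rater reliability: one common summary statistic is Cohen’s kappa, which adjusts the raw agreement between two coders for the agreement expected by chance. A minimal sketch, with made-up labels and assuming scikit-learn is available:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical codes assigned by two independent coders to the same set of
# open-ended responses (category labels invented for illustration).
coder_a = ["cost", "distance", "cost", "quality", "cost", "distance", "quality", "cost"]
coder_b = ["cost", "distance", "quality", "quality", "cost", "distance", "quality", "trust"]

raw_agreement = sum(a == b for a, b in zip(coder_a, coder_b)) / len(coder_a)
kappa = cohen_kappa_score(coder_a, coder_b)

# Report both alongside the coded results; kappa will generally be lower than
# raw agreement because some agreement happens by chance.
print(f"Raw agreement: {raw_agreement:.2f}")
print(f"Cohen's kappa: {kappa:.2f}")
```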
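
On checking against other data-sets: one simple, illustrative way to place a project sample next to a public benchmark (say, a household-size item designed to mimic a DHS question) is to compare summary statistics and run a two-sample test. The variable and all values below are simulated placeholders, not real DHS figures.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated stand-ins: household size in the project sample versus the same
# item extracted from a large public data-set.
project_hh_size = rng.poisson(lam=4.2, size=800)
public_hh_size = rng.poisson(lam=4.8, size=5000)

# A two-sample Kolmogorov-Smirnov test is one way to flag distributions that
# differ enough to deserve discussion in the write-up.
result = stats.ks_2samp(project_hh_size, public_hh_size)

print(f"Project mean: {project_hh_size.mean():.2f}, public mean: {public_hh_size.mean():.2f}")
print(f"KS statistic: {result.statistic:.3f}, p-value: {result.pvalue:.3g}")
# A large divergence does not mean the project data are wrong; it means the
# difference should be reported and explained to readers.
```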

I can’t assess whether any of these measures, singly or together, would have made a major difference in the LaCour case — especially since it remains nebulous how the data were generated, let alone with what care. But the LaCour case reveals that we need to be more careful. A big-name researcher was willing to trust that the data themselves were real and collected to the best of another researcher’s ability — and focused on getting the analysis right. In turn, other researchers bought into both the analysis and the underlying data because of the big-name researcher. This suggests we need to do a bit more to establish trust in the data themselves — and that the onus for this is on the researchers — big names or no — claiming to have led the data collection and cleaning processes. This is especially true given the unclear role for young researchers as potential replicators and debunkers, highlighted here. I hope the transparency agenda steps up accordingly.

*If on occasion a researcher reported on what happened during the ‘formative phase’ and about how the ‘questionnaire was changed in response,’ that would be really interesting learning for all of us. Report it. Also, if you are planning to do ‘qualitative formative work’ to improve your questionnaire, it would be good if you built in time in your research timeline to actually analyze the data produced by that work, report on that analysis, and explain how the analysis led to changing certain questionnaire items…

Aside

Gem From the Anti-Politics Machine: They Only Seek the Kind of Advice They Can Take

I am starting to re-read the Anti-Politics Machine after some time… and, of course, started with the epilogue — the closest Ferguson comes to giving advice from his vivisection. Here’s a gem that remains relevant ten-plus years later, in spite of major political changes in southern Africa:

Certainly, national and international ‘development’ agencies do constitute a large and ready market for advice and prescriptions, and it is the promise of real ‘input’ that makes the ‘development’ form of engagement such a tempting one for many intellectuals. These agencies seem hungry for good advice, and ready to act on it. Why not give it?

But as I have tried to show, they only seek the kind of advice they can take. One ‘developer’ asked my advice on what his country could do to ‘help these people.’ When I suggested that his government might contemplate sanctions against apartheid, he replied, with predictable irritation, ‘No, no! I mean development!’

The only ‘advice’ that is in question here is advice about how to ‘do development’ better. There is a ready ear for criticisms of ‘bad development projects,’ so long as these are followed up with calls for ‘good development projects.’

Thinking About Stakeholder Risk and Accountability in Pilot Experiments

This post is also cross-posted here in slightly modified form.

Since I keep circling around issues related to my dissertation in this blog, I decided it was time to start writing about some of that work. As anyone who has stood or sat near to me for more than 5 minutes over the past 4.25 years will know, in my thesis I examine the political economy of adopting and implementing a large global health program (the Affordable Medicines Facility – malaria, or “AMFm”). This program was designed at the global level (meaning largely in D.C. and Geneva, with tweaking workshops in assorted African capitals). Global actors invited select Sub-Saharan African countries to apply to pilot the AMFm for two years before any decision was made to continue, modify, scale up, or terminate it. It should also be noted from the outset that it was not fully clear what role the evidence would play in the Global Fund Board’s decision or how the evidence would be interpreted. As I highlight below, this lack of clarity helped to foster feelings of risk, as well as resistance among some of the national-level stakeholders to participating in the pilot.

To push the semantics a bit: several critics have noted (e.g.) that the scale and scope and the requisite new systems and relationships involved in the AMFm disqualify it from being considered a ‘pilot,’ though I use that term for continuity with most other AMFm-related writing.

My research focuses on the national and sub-national processes of deciding to participate in the initial pilot (‘phase I’) stage, specifically in Ghana. Besides the project’s scale and the resources mobilized, one thing that stood out about this project is that there was a reasonable amount of resistance to piloting this program among stakeholders in several of the invited countries. I have been very fortunate that my wonderful committee and outside supporters like Owen Barder have continued to push me over the years (and years) to try to explain this resistance to an ostensibly ‘good’ program. Moreover, I have been lucky and grateful that a set of key informants in Ghana have been willing to converse openly with me over several years as I have tried to untangle the reasons behind the support and resistance and to get the story ‘right’.

From the global perspective, the set-up of this global health pilot experiment was a paragon of planning for evidence-informed decision-making: pilot first, develop benchmarks for success, commission an independent evaluation (a well-monitored before-and-after comparison), and make decisions later.

In my work, through a grounded qualitative analysis, I distil the variety of reasons for supporting and resisting Ghana’s participation in the AMFm pilot into three main types: those related to direct policy goals (in this case, increasing access to malaria medication and lowering malaria mortality), those related to indirect policy goals (indirect insofar as they are not the explicit goals of the policy in question, such as employment and economic growth), and finally those related to risk and reputation (individual, organizational, and national). I take the last of these as my main focus for the rest of this post.

A key question on which I have been pushed is the extent to which resistance to participation (which meant resisting an unprecedented volume of highly subsidized, high-quality anti-malarial treatments entering both the public and the private sector) emerged from the idea of the AMFm itself versus the idea of piloting the AMFm with uncertain follow-up plans.

Some concerns, such as threats to both direct and indirect policy goals, related to the AMFm mechanism itself, including the focus on malaria prevention rather than treatment as well as broader goals related to national pride and the support of local businesses. The idea of the AMFm itself, as well as its role as a harbinger of new approaches to global health (such as market-based approaches), provoked both support and resistance.

But some sources of resistance stemmed more directly from the piloting process itself. By evidence-informed design, the Global Fund gave “no assurance to continue [AMFm] in the long-term,” so that the evaluation of the pilot would shape their decision. This presented limited risks to them. At the national level, this uncertainty proved troubling, as many local stakeholders felt it posed national, organizational, and personal risks for policy goals and reputations. Words like ‘vilification‘ and ‘chastisement‘ and ‘bitter‘ came up during key informant interviews. In a point of opposing objectives (if not a full catch-22, a phrase stricken from my thesis), some stakeholders might have supported the pilot if they knew the program would not be terminated (even if modified), whereas global actors wanted the pilot precisely to see whether the evidence suggested the program should (not) be terminated. Pilot-specific concerns related to uncertainties around the sunk investments of time in setting up the needed systems and relationships, which have an uncertain life expectancy. Also, for a stakeholder trying to decide whether to support or resist a pilot, it doesn’t help when the reputational and other pay-offs from supporting are uncertain and may only materialize should the pilot prove successful and be carried to the next stage.

A final but absolutely key set of concerns for anyone considering working with policy champions is what, precisely, the decision to continue would hinge upon. Would failure to meet benchmarks be taken as a failure of the mechanism and concept? A failure of national implementation capacity and managerial efforts in Ghana (in the face of a key donor)? A failure of individual efforts and initiatives in Ghana?

Without clarity on these questions about how accountability and blame would be distributed, national stakeholders were understandably nervous and sometimes resistant (passively or actively) to Ghana’s applying to be a phase I pilot country. To paraphrase one key informant’s articulation of a common view, phase I of the AMFm should have been an experiment on how to continue, not whether to continue, the initiative.

How does this fit in with our ideas of ideal evidence-informed decision-making about programs and policies? The experience recorded here raises some important questions when we talk about wanting policy champions and wanting to generate rigorous evidence about those policies. Assuming that the policies and programs under study adhere to one of the definitions of equipoise, the results from a rigorous evaluation could go either way.

  • What risks do the local champion(s) of a policy face in visibly supporting it?
  • Is clear accountability established for evaluation outcomes?
  • Are there built-in buffers for the personal and political reputation of champions and supporters in the evaluation design?

The more we talk about early stakeholder buy-in to evaluation and the desire for research uptake on the basis of evaluation results, the more we need to think about the political economy of pilots and of those stepping up to support policies and the (impact) evaluation of them. Do they exist in a learning environment where glitches and null results are considered part of the process? Can evaluations help to elucidate design and implementation failures in a way that has clear lines of accountability among the ‘ideas’ people, the champions, the managers, and the implementers? These questions need to be taken seriously if we expect government officials to engage in pilot research to help decide the best way to move a program or policy forward (including not moving it forward at all).

have evidence, will… um, erm (6 of 6, enforcing accountability in decision-making)

this is a joint post with suvojit, continuing from 5 of 6 in the series. it is also cross-posted here.

 

a recent episode reminded us of why we began this series of posts, of which this is the last. we recently saw our guiding scenario for this series play out: a donor was funding a pilot project accompanied by a rigorous evaluation, which was intended to inform further funding decisions.

in this specific episode, a group of donors discussed an on-going pilot program in Country X, part of which was evaluated using a randomized controlled trial. the full results and analyses were not yet in; the preliminary results, marginally significant, suggested that there ought to be a larger pilot taking into account lessons learnt.

along with X’s government, the donors decided to scale up. the donors secured a significant funding contribution from the Government of X — before the evaluation yielded results. indeed, securing government funding for the scale-up and a few innovations in the operational model had already given this project a sort of superstar status in the eyes of both the donors and the government. it appeared the donors in question had committed to the government that the pilot would be scaled up before the results were in. moreover, a little inquiry revealed that the donors did not have clear benchmarks or decision-criteria going into the pilot about key impacts and magnitudes — that is, the types of evidence and results — that would inform whether to take the project forward.

there was evidence (at least it was on the way) and there was a decision but it is not clear how they were linked or how one informed the other.
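
to make the missing piece concrete, an ex ante decision rule, agreed before the results arrive, could be as simple as the sketch below. everything in it (the outcome, thresholds, and decision categories) is hypothetical and only illustrates the kind of benchmark the donors in this episode appear to have lacked; nothing is drawn from the actual program.

```python
# Purely illustrative: a pre-agreed rule mapping evaluation results to a
# scale-up decision. All thresholds and names are hypothetical.
from dataclasses import dataclass

@dataclass
class PilotResult:
    effect_estimate: float       # e.g. percentage-point change in the key outcome
    ci_lower: float              # lower bound of the 95% confidence interval
    cost_per_beneficiary: float  # USD

MIN_EFFECT = 5.0   # smallest effect (pp) judged worth scaling, set ex ante
MAX_COST = 20.0    # maximum acceptable cost per beneficiary (USD), set ex ante

def scale_up_decision(result: PilotResult) -> str:
    if result.ci_lower <= 0:
        return "do not scale: effect not distinguishable from zero"
    if result.effect_estimate < MIN_EFFECT:
        return "modify and re-pilot: effect positive but below the benchmark"
    if result.cost_per_beneficiary > MAX_COST:
        return "modify: effective but too costly at the current design"
    return "scale up"

# example: a marginally significant result below the benchmark lands in the second branch
print(scale_up_decision(PilotResult(effect_estimate=3.0, ci_lower=0.2, cost_per_beneficiary=14.0)))
```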

 

reminder: scenario

we started this series of posts by admitting the limited role evidence plays in decision-making — even when an agency commissions evidence specifically to inform a decision. the above episode illustrates this, as well as the complex and, sometimes, messy way that (some) agencies, like (some) donors, approach decision-making. we have suggested that, given that resources to improve welfare are scarcer than needs, this approach to decision-making is troubling at best and irresponsible at worst. note that the lack of clear expectations and of a plan for decision-making is as troublesome as the limited use of outcome and impact evidence.

in response to this type of decision-making, we have had two guiding goals in this series of posts. first, are there ways to design evaluations that will make the resultant outcomes more useable and useful (addressed here and here)? second, given all the factors that influence decisions, including evidence, can the decision-making process be made more fair and consistent across time and space?

to address the second question, we have drawn primarily on the work of Norm Daniels, to consider whether and how decisions can be made through a fair, deliberative process that, under certain conditions, can generate outcomes that a wide range of stakeholders can accept as ‘fair’.

Daniels suggests that by meeting four key criteria, these “certain conditions” for fair deliberation can be satisfied, including in deliberations about which programs to scale after receiving rigorous evidence and other forms of politically relevant feedback.

 

closing the loop: enforceability

so far, we have reviewed three of these conditions: relevant reasons, publicity, and revisibility. in this post, we examine the final condition, enforceability (regulation or persuasive pressure).

meeting the enforceability criterion means providing mechanisms to ensure that the processes set by the other criteria are adhered to. this is, of course, easier said than done. in particular, it is unclear who should do the enforcing.*

we identify two key questions about enforcement:

  • first, should enforcement be external to or strictly internal to the funding and decision-making agency?
  • second, should enforcement rely on top-down or bottom-up mechanisms?

 

underlying these questions is a more basic, normative question: In which country should these mechanisms reside — the donor or the recipient? the difficulty of answering this question is compounded by the fact that many donors are not nation-states.

we don’t have clear answers to these questions, which themselves likely need to be subjected to a fair, deliberative process. Here, we lay out some of our own internal debates on two key questions, in hopes that they point to topics for productive conversation.

 

  1. should enforcement of agency decision making be internal or external to the agency?

this is a normative question but it links with a positive one: can we rely on donors to self-regulate when it comes to adopted decision-making criteria and transparency commitments?

internal self-regulation is the most common model we see around us, in the form of internal commitments such as multi-year strategies, requests for funds made to the treasury, etc. in addition, most agencies have an internal-but-independent ‘results’ or ‘evaluation’ cell, intended to make sure that M&E is carried out. in the case of DFID, for instance, the Independent Commission for Aid Impact (ICAI) seems to have a significant impact on DFID’s policies and programming. it also empowers the British parliament to hold DFID to account over a variety of funding decisions, as well as future strategy.

outside the agency, oversight and enforcement of relevance, transparency, and revisibility could come from multiple sources. from above, it could be a multi-lateral agency/agreement or a global INGO, similar to Publish What You Pay(?). laterally, the government of the country in which a program is being piloted could play an enforcing role. finally, oversight and enforcement could come from below, through citizens or civil society organizations, both in donor and recipient countries. this brings us to our next question.

 

  2. should enforcement flow top-down or bottom-up?

while this question could be answered about internal agency functioning and hierarchy, we focus on the potential for external enforcement from one direction or the other. and, again, the question is a normative one but there are positive aspects related to capacity to monitor and capacity to enforce.

enforcement from ‘above’ could come through multilateral agencies or through multi- or bi-lateral agreements. one possible external mechanism is for more than one donor to come together to make a conditional funding pledge to a program – contingent on achieving pre-determined targets. however, as we infer from the opening example, it is important that such commitments be based on a clear vision of success, not just on political imperatives or project visibility.

enforcement from below can come from citizens in donor and/or recipient countries, including through CSOs and the media. one way to introduce bottom-up pressure is for donors to adhere to the steps we have covered in our previous posts – agreement on relevant reasons, transparency and revisibility – and thereby involve a variety of external stakeholders, including the media, citizens, and CSOs. these stakeholders can contribute to a mechanism in which there is pressure from the ground on donors to live up to their own commitments.

media are obviously important players in these times. extensive media reporting of donor commitments is a strong mechanism for informing and involving citizens – in both donor and recipient countries; media are also relevant to helping citizens understand limits and how decisions are made in the face of resource constraints.

 

our combined gut feeling, though, is that in the current system of global aid and development, the most workable approach will probably include a mixture of formal top-down and informal bottom-up pressure. from a country-ownership point of view, we feel that recipient country decision-makers should have a (strong) role to play here (more than they seem to have currently), as well as citizens in those countries.

however, bilateral donors will probably continue to be more accountable to their own citizens (directly and via representative legislatures) and, therefore, a key task is to consider how to bolster their capacity to ensure ‘accountability for reasonableness’ in the use of evidence and in decision-making more generally. at the same time, multilateral donors may have more flexibility to consider other means of enforcement, since they don’t have a narrow constituency of citizens and politicians to be answerable to. however, we worry that the prominent multilateral agencies we know are also bloated bureaucracies with unclear chains of accountability (as well as a typical sense of self-perpetuation).

while there is no clear blueprint for moving forward, we hope the above debate has gone a small step towards asking the right questions.

 

in sum

in this final post, we have considered how to enforce decision-making and priority-setting processes that are ideally informed by rigorous and relevant evidence but also, more importantly, in line with principles of fairness and accountability for reasonableness. these are not fully evident in the episode that opened this post.

through this series of posts, we have considered how planning for decision-making can help in the production of more useful evidence and can set up processes to make fairer decisions. for the latter, we have relied on Norm Daniels’s framework for ensuring ‘accountability for reasonableness’ in decision-making. this is, of course, only one guide to decision-making, but one that we have found useful in broaching questions of not only how decisions are made but how they should be made.

in it, Daniels proposes that deliberative processes should be based on relevant reasons and commitments to transparency and revisibility that are set ex ante to the decision-point. we have focused specifically on decision-making relating to continuing, scaling, altering, or scrapping pilot programs, particularly those for which putatively informative evidence has been commissioned.

we hope that through these posts, we have been able to make a case for designing evaluations to generate evidence useful for decision-making, as well as for facilitating fair, deliberative processes for decision-making that can take account of the evidence generated.

at the very least, we hope that evaluators will recognize the importance of a fair process and will not stymie it in the pursuit of the perfect research design.

*in Daniels’s work, which primarily focuses on national or large private health insurance plans, the regulative role of the state is clear. in cases of global development, involving several states and agencies, governance and regulation become less clear. noting this lack of clarity in global governance is hardly a new point; however, the idea of needing to enforce the conditions of fair processes and accountability for reasonableness provides a concrete example of the problem.

Mo money, mo problems? AMF does not make Givewell’s top-three for 2013 #giving season

This blog is a cross-post with Suvojit. Update 21 December: the conversation has also continued here.

Recently, Givewell revised its recommendation on one of its previously top-ranked ‘charities,’ the Against Malaria Foundation (AMF), which focuses on well-tracked distributions of bednets. Givewell “find[s] outstanding giving opportunities and publish[es] the full details of our analysis to help donors decide where to give.” This approach seems to have succeeded in moving donors beyond tragic stories and heart-wrenching images as the basis for giving, and towards looking at effectiveness and funding gaps.

In its latest list, AMF does not rank amongst the top three recommended charities. Here, based on the experience with AMF, we outline the seeming result of Givewell’s attention on AMF, consider the possible lessons, and ask whether Givewell has learnt from this episode by taking clear steps towards changing its ranking methods to avoid similar mishaps in future. As it stands, around US$10m now lies parked (transparently and, hopefully, temporarily) with AMF as a result of its stalled distributions, a fact for which Givewell shares some responsibility.

Givewell lays out its thinking on revising AMF’s recommendation in detail. As a quick re-cap of that blog post: when Givewell looked at AMF two years ago, AMF was successfully delivering bednets at small to medium scale (up to hundreds of thousands of nets in some cases) through partnerships with NGOs (only the delivery of health products such as bednets and cash transfers meets Givewell’s current eligibility criteria). Following Givewell’s rating, a whole bunch of money came in, bumping AMF into a new scale, with new stakeholders and constraints. The big time hasn’t been going quite so well (as yet).

This is slippery ground for a rating service seeking credibility in the eyes of its donors. Currently, Givewell ranks charities on several criteria, including: strong evidence of the intervention’s effectiveness and cost-effectiveness; whether a funding gap exists and resources can be absorbed; and the transparency of activities and accountability to donors.

In its younger/happier days, AMF particularly shone on transparency and accountability. Recognizing that supplies of bednets are often diverted and don’t reach the intended beneficiaries, AMF is vigilant about providing information on ‘distribution verification’ as well as on households’ continued use and upkeep of nets.

These information requirements – shiny at small scale – create a glare at large scale, which is part of the problem AMF now faces. ‘Scale’ generally means ‘government’ unless you are discussing a country like Bangladesh with nationwide NGO networks. The first hurdle between information and governments is that the required data can be politically sensitive. Information on distribution and use is great for accountability to donors, but it can be threatening to government officials, who want to appear to be doing a good job (and/or may benefit from distributing nets to particular constituents or adding a positive price, etc).

As a second, equally important, hurdle: even if government agencies intend to carry out the distribution as intended (proper targeting, etc), data collection has high costs (monetary, personnel, and otherwise) – especially when carried out country-wide. AMF doesn’t actually fund or support collection of the data on distribution and use that it requires of the implementing agencies. AMF is probably doing this to keep its own costs low, instead passing collection costs and burdens on to the local National Malaria Control Programmes (NMCPs), which is definitely not the best way to make friends with the government. Many government bureaucracies in Sub-Saharan Africa are constrained not only by funds but also by the capacity to collect and manage data about their own activities.

What do these data needs mean for donors and what do they mean for implementers? For donors, whose resources are scarce, information on transparency and delivery can guide where to allocate money they wish to give. Givewell, by grading on transparency of funding flows and activities, encourages NGOs to compete on these grounds. Donors feel they have made a wise investment and the NGOs that have invested in transparency and accountability benefit from increased visibility.

At issue is that there seems to be a tension between focusing on transparency and the ability to achieve impact on the ground. If the donor, and possibly Givewell, do not fully take into account institutions (formal and informal), organizational relationships, and bureaucratic politics, the problem of a small organization not being able to replicate its own successful results at scale may resurface. Givewell says that it vets a given charity, but it is not clear what role potential implementing partners play in this process. Givewell likely needs to account for the views of stakeholders critical to implementation, including those people and organizations that may become more important stakeholders given a scale-up. The fact that NMCPs (or the relevant counterparts) as well as bilaterals and multilaterals are hesitant to work with AMF could have been weighed into Givewell’s algorithm.

Givewell seems to be listening and recognizing these challenges: first, through its publicly reasoned response to AMF’s performance; second, by posting reviews (in particular, this recent review by Dr. de Savigny); and third, by updating its selection criteria for 2013 to include a consideration of scalability. de Savigny’s review raises the issue of AMF’s strategies for working with governments, both coordinating with donor governments and supporting ‘recipient’ governments in determining data needs and collecting data.

What else can Givewell do now? Expand the criteria beyond need, evidence-base (intervention and organization) and commitment to transparency by also including:

  1. Feedback from previous implementing partners.

  2. Specific project proposals from applicants, in which they lay out a plan to implement their activity in a specific country. Potential funding recipients should think through and detail their government engagement strategy and gain statements of buy-in from likely implementing partners  – global and local – in that context.

  3. Givewell should more carefully calibrate how much money goes to organizations for proposed projects. Funding based on engagement in a particular country can help avoid problems of getting too much too fast: funding can be pegged to the requirements of the specific project that has been put up, for which the organization has need and absorptive capacity.