Back to Basics — Trusting Whether and How The Data are Collected and Coded

This is a tangential response to the LaCour and #lacourgate hubbub (with hats off to the summaries and views given here and here). While he is not implicated in any of the comments below, I am most certainly indebted to Mike Frick for planting the seed of some of the ideas presented below, particularly on member-checking (hopefully our under-review paper on the same will be out sometime in the future…). Salifu Amidu and Abubakari Bukari are similarly motivational-but-not-implicated, as are Corrina Moucheraud, Shagun Sabarwal and Urmy Shukla.

To a large extent, the LaCour response brings a new angle to an increasingly familiar concern: trusting the analysis. This means additional (and important) calls for replication and other forms of post-publication peer review (as Broockman calls for) as a guard against significance-hungry, nefarious researchers. Pre-analysis plans, analytic/internal replications, and so on are all important steps towards research transparency. But they miss the fundamental tendency to treat data as ‘true’ once they make it into the familiar, rectangular format of a spreadsheet.

Given LaCour, it seems clear that we may need to take an additional step back to get to the heart of research: the data. We place a lot of trust in data themselves — between advisers and advisees, between research collaborators, between producers and users of large, public data sets and, in turn, between PIs, research assistants, and the actual team collecting the data. This trust is about, of course, whether the data exist at all and whether they measure what they purport to measure. (Green seems to have had a hunch about this?)

We should be clear about the foundations of this trust and what we might do to strengthen it. Ultimately, the LaCour story is a story about the production of data, not its analysis. The transparency agenda needs to expand accordingly, to address the fundamental truism that ‘shit in leads to shit out.’

Here are a few thoughts:

  • Start to teach data collection like it matters. Survey design and data collection are weirdly absent from many graduate programs — even those oriented towards research. You may pick these up in electives but they are rarely required, to my knowledge. Learning about construct validity, validating test instruments in new contexts, questionnaire design, the potential for interviewer effects, some of the murky and inchoate contents of the activity labelled as ‘formative work*,’ etc., need not be re-discovered by each new graduate student or research assistant who takes on field work. If a course-work model won’t work, then a much more explicit apprenticeship model should be sought for those pursuing primary empirical work. In terms of teaching, one might occasionally be forgiven for thinking that impact evaluators had just discovered data collection and that there aren’t already mounds of resources on household surveys, psychometrics, and questionnaire design that can be used to better ensure the quality and truthfulness of the data being collected. Interdisciplinary work needs to start with what data are collected, and by what means and measures, to answer a particular question.
  • Report on data quality practices. Lots of survey firms and researchers employ strategies such as data audits and back-checks. Good on you. Report it. This almost never makes it into actual publications, but these are not just internal operations processes. Researchers need to put some effort into making their readers trust their data as well as their analysis, yet far less work seems to go into the former. With the rise of empirical work in economics and other fields, this needs to be given more documented attention. If you threw out 5% of your data because of failed back-checks, tell me about it. I’d believe the remaining 95% of your data a lot more (a minimal sketch of this kind of reporting follows this list). The onus is on the researchers to make the reader trust their data.
  • Treat surveyors as a valuable source of information. It is increasingly common to at least have surveyors fill in a question at the end of the questionnaire about whether the respondent was cooperative (usually a Likert-scale item) or some other brief reflection on how the interview went. I have no idea what happens to the data so produced — if they are used to throw out or differentially weight responses, do please tell the reader about it. Moreover, you can systematically ask your surveyors questions (including anonymously) about question items that they don’t trust. For example, I asked surveyors about this (in written form) and most reported that it was incredibly embarrassing for them to ask their elders to play certain memory games related to short-term recall. This might be a good sign to tread lightly with those data, if not discount them completely (whether or not the surveyors faithfully asked the embarrassing question, it still suggests that the question created a tense social interaction that may not have generated trustworthy data, even if it didn’t fall in the traditional space of ‘sensitive questions’). If nothing else, the surveyors’ assessments can be reported as part of the ‘methods’ or ‘results’ sections of publications. And, in general, remembering that surveys are human interactions, not matrix populators, is important.
  • Member-check. Member-checking is a process described by Lincoln and Guba (and others) that involves taking results and interpretations back to those surveyed to test interpretative hypotheses, etc. If some results really fly in the face of expectations, this process could generate some ‘red flags’ about which results and interpretations should be treated with care. And these can be reported to readers.
  • Coding. As with ‘formative work,’ what exactly is meant by ‘we coded the open-ended data’ is often opaque, even though this is where a lot of the interpretive magic happens. This is an important reason for the internal replication agenda to start with the raw data. In plenty of fields, it would be standard practice to use two independent coders and to report on inter-rater reliability (see the second sketch after this list). This does not seem to be standard practice in much of impact evaluation. That should change.
  • Check against other data-sets. It would not take much time for researchers to put their own findings into context by comparing (as part of a publication) the distribution of responses on key questions to the distribution in large data-sets (especially when some questionnaire items are designed to mimic the DHS, LSMS, or other large public data-sets for precisely this reason). This is not reported often enough (a comparison of this sort is sketched after this list). This does not mean that the large, population-linked data-set will always trump your project-linked data-set, but it seems only fair to alert your readers to key differences, for the purposes of internal believability as well as external validity.
  • Compare findings with findings from studies on similar topics (in similar contexts) — across disciplines. Topics and findings do not end at the boundaries of a particular method of inquiry. Placing the unexpectedness of your findings within this wider array of literature would help.
  • Treat all types of data with similar rigor and respect. (Cue broken record.) If researchers are going to take such care with quantitative data and then stick a random quote into the analysis as anec-data, without giving any sense of where it came from or whether it should be taken as representative of the entire sample or some sub-group… well, it’s just a bit odd. However you want to label these different types of data — quant and qual, or data-set observations and causal-process observations — they are empirical data and should be treated with the highest standards known in each field of inquiry.
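To make the back-check point a bit more concrete, here is a minimal sketch (in Python/pandas) of the kind of reporting I have in mind. The file and column names (‘interview_id,’ ‘backcheck_passed’) are hypothetical stand-ins for whatever a survey firm’s own tracking system actually produces.

```python
import pandas as pd

# Hypothetical inputs: the main survey data and a log of back-check results.
# Column names ('interview_id', 'backcheck_passed') are illustrative only.
surveys = pd.read_csv("interviews.csv")
backchecks = pd.read_csv("backchecks.csv")  # interview_id, backcheck_passed (True/False)

merged = surveys.merge(
    backchecks[["interview_id", "backcheck_passed"]],
    on="interview_id", how="left",
)

n_total = len(merged)
n_checked = merged["backcheck_passed"].notna().sum()
n_failed = merged["backcheck_passed"].eq(False).sum()

# These are the numbers worth putting in the paper, not just in an internal memo.
print(f"back-checked: {n_checked} of {n_total} interviews ({100 * n_checked / n_total:.1f}%)")
print(f"dropped after failed back-check: {n_failed} ({100 * n_failed / n_total:.1f}%)")

# The analysis then proceeds on the interviews that were not flagged.
clean = merged[~merged["backcheck_passed"].eq(False)]
```

The point is not the code, which is trivial; it is that the two resulting percentages so rarely appear in print.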
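Similarly, for the coding point: a minimal sketch of double coding with a reported inter-rater reliability statistic, using scikit-learn’s cohen_kappa_score. Again, the file and column names (‘coder_a,’ ‘coder_b’) are hypothetical.

```python
import pandas as pd
from sklearn.metrics import cohen_kappa_score

# Hypothetical file: each open-ended response coded independently by two coders
# into the same set of categories.
codes = pd.read_csv("open_ended_codes.csv")  # response_id, coder_a, coder_b

raw_agreement = codes["coder_a"].eq(codes["coder_b"]).mean()
kappa = cohen_kappa_score(codes["coder_a"], codes["coder_b"])

# Both numbers belong in the write-up: raw agreement can look flattering when a
# few categories dominate, which is exactly what kappa corrects for.
print(f"raw agreement: {raw_agreement:.2f}")
print(f"Cohen's kappa: {kappa:.2f}")

# Disagreements go back for reconciliation rather than quietly taking one coder's label.
disputes = codes[codes["coder_a"] != codes["coder_b"]]
print(f"responses needing reconciliation: {len(disputes)}")
```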
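And for the comparison against large public data-sets: a sketch of the sort of side-by-side that could go in an appendix. The variable name (‘hh_size’) and the assumption that it was measured the same way in both sources are for illustration only.

```python
import pandas as pd
from scipy import stats

# Hypothetical files and variable: 'hh_size' asked the same way in the project
# survey and in a public (e.g. DHS/LSMS) extract covering a comparable population.
project = pd.read_csv("project_survey.csv")
public = pd.read_csv("dhs_extract.csv")

# Side-by-side summary statistics for the reader...
print(pd.DataFrame({
    "project": project["hh_size"].describe(),
    "public": public["hh_size"].describe(),
}))

# ...plus a simple two-sample test of whether the distributions look alike.
ks_stat, p_value = stats.ks_2samp(project["hh_size"].dropna(),
                                  public["hh_size"].dropna())
print(f"KS statistic: {ks_stat:.3f} (p = {p_value:.3f})")
```

A large divergence does not automatically mean the project data are wrong (sampling frames and timing differ), but it is exactly the kind of difference readers deserve to see flagged.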

I can’t assess whether any of these measures, singly or together, would have made a major difference in the LaCour case — especially since it remains nebulous how the data were generated, let alone with what care. But the LaCour case reveals that we need to be more careful. A big-name researcher was willing to trust that the data themselves were real and collected to the best of another researcher’s ability — and focused on getting the analysis right. In turn, other researchers bought into both the analysis and the underlying data because of the big-name researcher. This suggests we need to do a bit more to establish trust in the data themselves — and that the onus for this is on the researchers — big names or no — claiming to have led the data collection and cleaning processes. This is especially true given the unclear role for young researchers as potential replicators and debunkers, highlighted here. I hope the transparency agenda steps up accordingly.

*If on occasion a researcher reported on what happened during the ‘formative phase’ and how the ‘questionnaire was changed in response,’ that would be really interesting learning for all of us. Report it. Also, if you are planning to do ‘qualitative formative work’ to improve your questionnaire, it would be good to build time into your research timeline to actually analyze the data produced by that work, report on that analysis, and explain how the analysis led to changing certain questionnaire items…

Buffet of Champions: What Kind Do We Need for Impact Evaluations and Policy?

This post is also cross-posted here and here.

I realize that the thesis of “we may need a new kind of champion” sounds like a rather anemic pitch for Guardians of the Galaxy. Moreover, it may lead to inflated hopes that I am going to propose that dance-offs be used more often to decide policy questions. While I don’t necessarily deny that this is a fantastic idea (and would certainly boost C-SPAN viewership), I want to quickly dash hopes that this is the main premise of this post.

Rather, I am curious why “we” believe that policy champions will be keen on promoting and using impact evaluation (and subsequent evidence syntheses of these), and I want to suggest that another range of actors, which I call “evidence” and “issue” champions, may be more natural allies. There has been a recurring storyline in recent literature and musings on (impact) evaluation and policy- or decision-making:

  • First, the aspiration: the general desire of researchers (and others) to see more evidence used in decision-making (let’s say both judgment and learning) related to aid and development, so that scarce resources are allocated more wisely and/or so that more resources are brought to bear on the problem.
  • Second, the dashed hopes: the realization that data and evidence currently play a limited role in decision-making (see, for example, the report on the evidence on evidence-informed policy-making, as well as here).
  • Third, the new hope: the recognition that “policy champions” (also “policy entrepreneurs” and “policy opportunists”) may be a bridge between the two.
  • Fourth, the new plan of attack: bring “policy champions” and other stakeholders into the research process much earlier in order to get uptake of evaluation results into the debates and decisions. This even includes bringing policy champions (say, bureaucrats) on as research PIs.

There seems to be a sleight of hand at work in the above formulation, and it is somewhat worrying in terms of equipoise and the possible use of the full range of results that can emerge from an impact evaluation study. Said another way, it seems potentially at odds with the idea that the answer to an evaluation is unknown at the start of the evaluation.

While I am not sure that “policy champion” has been precisely defined (and, indeed, this may be part of the problem), this has been done for the policy entrepreneur concept. So far as I can tell, the first articulation of the entrepreneurial (brokering, middle-man, risk-taking) role in policy-making comes from David E. Price in 1971. The idea was repeated and refined in the 1980s and then became more commonplace in 1990s discussions of public policy, in part through the work of John Kingdon. (There is also a formative and informative 1991 piece by Nancy Roberts and Paula King.)

Much of the initial discussion, it seems, came out of studying US national and state-level legislative politics, but the ideas have been repeatedly shown to have merit in other deliberative settings. Much of the initial work also focused on agenda-setting — which problems and solutions gain attention — but similar functions are also important in the adoption and implementation of policy solutions. Kingdon is fairly precise about the qualities of a policy entrepreneur — someone who has, as he calls it, a pet policy that they nurture over years, waiting for good moments of opportunity to suggest their policy as the solution to a pressing problem.

  • First, such a person must have a “claim to a hearing” — that is, at least behind the scenes, people must respect and be willing to listen to this person on this topic (especially if this person is not directly in a position with decision-making power).
  • Second, such a person must have networks and connections as well as an ability to bargain and negotiate within them. This is a person who can broker ideas across diverse groups of people, can “soften up” people to the entrepreneur’s preferred policy solution, etc.
  • Third, such a person must have tenacity, persistence and a willingness to risk personal reputation and resources for a policy idea.

In Kingdon’s and others’ conception, a policy entrepreneur has to work at selling their idea over a long period of time (which is presumably why Weissert (1991) also introduced the idea of policy opportunists, who only start to champion ideas once those ideas make it to the deliberating table and seem likely to move forward). In short, policy entrepreneurs (and, through the sloppy use of near-synonyms, policy champions) believe strongly in a policy solution for some reason and have put time, effort, and reputation into moving the idea forward. Note the nebulous “some reason” — I have not found a definition that specifies that policy entrepreneurs must come to promote a policy through a particular impetus. Glory, gold, God, goodness, and (g’)evidence all seem to be viable motivators that fit the definition.

My question is: is this what we need to support the use of research (and, specifically, impact evaluations and syntheses thereof) in decision-making? It is not clear to me that it is. Policy entrepreneurs are people already sold on a particular policy solution, whereas the question behind much evaluation work is ‘is this the best policy solution for this context?’ (recognizing that contextual and policy, if not clinical, uncertainty about the answer is important for an evaluation to be worthwhile). It seems to me, then, that what we (researchers and evaluators) actually need are people deeply committed to one of two things:

(1) The use of data and evidence in general (“evidence champions” or, at least loosely, technocrats) as an important tool in sound decision-making, and/or

(2) a particular issue or problem (“issue champions” — no doubt a sexier phrase is available). I’ll spend more time on the second.

An “issue champion,” for example, may be someone who has the same qualities as a policy entrepreneur but, rather than using claims to a hearing, a network, and tenacity to bring forward a policy solution, s/he uses these tools to bring attention to a problem — say, malaria mortality. This person feels that malaria is a problem that must be solved — and is open to finding the most (cost-)effective solution to the problem (or the means to do a good job of implementing that solution).

S/he is not, by contrast, someone already committed to believing that prevention, diagnostics, or treatment in any particular form or at any particular price is the best way forward until s/he has seen evidence of this in a relevant context. This is different from a “policy champion” who has, for example, been pushing for universal bednet coverage for the past 20 years. This is not to say that you don’t want the bednet champion to be well aware of your study and even to have input into defining the research questions and approving the research design (in fact, this seems vital in lending credibility and usefulness to the results). But the way the study is structured will matter for whether the bednet champion is open to taking up the range of possible results from your study.

If your question is ‘does approach A or approach B result in more efficient distribution of bednets?’ then yes, both sets of results will be interesting to the bednet champion.

But if the question is more of the type ‘are bednets the most cost-effective approach to addressing malaria mortality in our country?’ then the bednet champion is likely to be interested in trumpeting only one set of results: those that are significantly in favor of bednets as a solution to the malaria problem.

The malaria/issue champion (or general evidence enthusiast), on the other hand, may be more open to thinking about how to interpret and use the range of possible results from the study, which may be mixed, inconclusive, or even negative. (Throughout this discussion, I recognize that malaria, like all problems in human and economic development, does not have a silver-bullet answer and that, therefore, “A or not-A”-type evaluation questions will only get us so far in getting the right mix of tools in the right place at the right time; i.e. the answer is likely neither that bednets do no good nor that they are the only thing needed to tackle malaria.)

The worry, then, with the policy champion is that they are already committed to a policy solution. Will they change their mind on the basis of one study? Probably not (nor, necessarily, should they; but a meta-analysis may not sway them either). But insofar as “we” want decision-makers to learn about our evidence and to consider it in their deliberations, it may be issue, rather than policy, champions who are particularly important. They may make use of the results regardless of what they are. We cannot necessarily expect the same of the policy champion. Of course, a small army of evidence champions is also helpful. I do want to stress that it is critical to have policy champions and other stakeholders involved early in the research-design process, so that the right questions can be asked and the politically and contextually salient outcomes and magnitudes considered. But as an ally in the evaluation process and, say, a potential PI on an evaluation, it seems that the issue champions are the folks likely to stick with it.

And, yes, issue champions should probably have some moves ready, in case of a dance-off (as there will always be factors beyond evidence and data influencing decisions).

data systems strengthening

i have been saying for some time that my next moves will be into monitoring and vital registration (more specifically, a “poor richard” start-up to help countries measure the certainties of life: (birth), death, and taxes). (if village pastors could get it done with ink and scroll in the 16th c across northern Europe, why aren’t we progressing with technology?! surely this is a potentially solid application of the capacity of mobile phones as data collection and transmission devices.)

i stumbled onto a slightly different idea today: building backwards from well-financed evaluation set-ups for specific projects to more generalized monitoring systems. this would be in contrast to the more typical approach of skipping monitoring altogether, or of first working only to build monitoring systems (including of comparison groups), followed at some point by an (impact) evaluation once monitoring is adequately done.

why don’t more evaluations have mandates to leave behind data collection and monitoring systems ‘of lasting value,’ following on from an impact or other extensive, academic- (or outsider-)led evaluation? in this way, we might also build from evaluation to learning to monitoring. several (impact) evaluation organisations are being asked to help set up m&e systems for organizations and, in some cases, governments. moreover, many donors talk about mandates for evaluators to leave behind built-up capacity for research as part of the conditions of their grant. but maybe it is time to start talking about mandates to leave behind m&e (and MeE) systems — infrastructure, plans, etc.

a potentially instructive lesson (in principle if not always in practice) is that of ‘diagonal’ health interventions, in which funded vertical health programs (e.g. disease-specific programs, such as an HIV-treatment initiative) are required to also engage in overall health systems strengthening (e.g.).

still a nascent idea, but i think one worth having more than just me thinking about: how organisations that have developed (rightly or not) reputations for collecting and entering high-quality data for impact evaluation could build monitoring systems backwards, as part of what is left behind after an experiment.

(also, expanding out from DSS sites is an idea worth exploring.)

the onion theory of communication (doing surveys)

without too much detail, i’ll just note that i spent more time in the hospital in undergrad than i would have preferred. oftentimes, i, being highly unintelligent, would wait until things got really bad and then finally decide one night that it was time to visit the ER – uncomfortable but not non-functional or incoherent. on at least one occasion – and, because she’s wonderful, i suspect more – alannah (aka mal-bug, malice, malinnus) took me there and would do her homework, sometimes reading out loud to me to keep me entertained and distracted. in one such instance, she was studying some communications theories, one of which was called or nicknamed the onion theory of two-way communication. the basic gist is that revealing information in a conversation should be a reciprocal unpeeling: i share something, shedding a layer of social divide, then you do, and we both feel reasonably comfortable.

it didn’t take too long to connect that this was the opposite of how my interaction with the doctor was about to go. the doctor would, at best, reveal her name, and i would be told to undress in order to be examined, poked and prodded. onion theory, massively violated.

i mention all this because i have just been reading about assorted electronic data collection techniques, namely here, via here. first, i have learned a new word: ‘paradata.’ this seems useful. paradata are monitoring and administrative data that go beyond how many interviews have been completed; rather, they focus on the process of collecting data. they can include the time it takes to administer the questionnaire, how long it takes a surveyor to locate a respondent, and details about the survey environment and the interaction itself (i’d be particularly interested in hearing how anyone actually utilizes this last piece of data in analyzing the survey data itself, e.g. would you give less analytic weight to an interview marked ‘distracted’ or ‘uncooperative’ or ‘blatantly lying’?).
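a minimal sketch (in python) of the kind of sensitivity check i have in mind is below; the column names (‘cooperation,’ ‘outcome,’ ‘treated’) are made up for illustration, and dropping flagged interviews is just one crude option next to down-weighting them.

```python
import pandas as pd
import statsmodels.formula.api as smf

# hypothetical columns: 'cooperation' is the surveyor's end-of-interview rating
# (1 = hostile/distracted ... 5 = fully cooperative); 'outcome' and 'treated'
# stand in for whatever the analysis actually estimates.
df = pd.read_csv("survey_with_paradata.csv")

# headline estimate on all interviews.
full = smf.ols("outcome ~ treated", data=df).fit()

# sensitivity check: re-estimate without the interviews the surveyor flagged.
trusted = df[df["cooperation"] >= 3]
flagged_share = 1 - len(trusted) / len(df)
restricted = smf.ols("outcome ~ treated", data=trusted).fit()

print(f"share of interviews flagged: {flagged_share:.1%}")
print(f"estimate, all interviews:    {full.params['treated']:.3f}")
print(f"estimate, trusted only:      {restricted.params['treated']:.3f}")
```

if the two estimates diverge, that is something the reader should hear about; if they don’t, saying so costs a paragraph.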

the proposed process of monitoring and adjustment bears a striking resemblance to other discussions (e.g. pritchett, samji and hammer) about the importance of collecting and using monitoring data to make mid-course corrections in research and project implementation. it does feel like there is a certain thematic convergence underway around giving monitoring data its due. in the case of surveying, it feels like a slight shift towards the qualitative paradigm, where concurrent data collection, entry and analysis and iterative adjustment are the norm. not a big shift, but a bit.

but on the actual computer bit, i am less keen. a survey interview is a conversation. a structured conversation, yes. potentially an awkward conversation, and almost certainly one that violates the onion theory of communication. but even doctors – some of the ultimate violators – complain about the distance created between themselves and a patient by having a computer between them during an examination (interview), as is now often required to track details for pay-for-performance schemes (e.g.). so, while i appreciate and support the innovations of responsive survey design and recognize the benefits of speed and aggregation over collecting the same data manually, i do wish we could also move towards a mechanism that doesn’t put the surveyor behind a screen (certainly a tablet would seem preferable to a laptop). could entering data rely on voice more than keying in answers and achieve the same result? are there other alternatives that would at least maintain some semblance of a conversation? are there other possibilities that allow the flexibility of updating a questionnaire or survey design while also re-humanizing ‘questionnaire administration’ as a conversation?

update 2 Feb 2014: interesting article on interviews as human, interpersonal interactions.