Back to Basics — Trusting Whether and How The Data are Collected and Coded

This is a tangential response to the LaCour and #lacourgate hubbub (with hats off to the summaries and views given here and here). While he is not implicated in all of the comments below, I am most certainly indebted to Mike Frick for planting the seed of some of the ideas presented below, particularly on member-checking (hopefully our under-review paper on the same will be out sometime in the future…). Salifu Amidu and Abubakari Bukari are similarly motivational-but-not-implicated, as are Corrina Moucheraud, Shagun Sabarwal and Urmy Shukla.

To a large extent, the LaCour response is bringing a new angle on an increasingly familiar concern: trusting the analysis. This means additional (and important) calls for replication and other forms of post-publication peer review (as Broockman calls for) as a guard against significance-hungry, nefarious researchers. Pre-analysis plans, analytic/internal replications, and so on are all important steps towards research transparency. But they miss the fundamental tendency to treat data as ‘true’ once they make it into the familiar, rectangular format of a spreadsheet.

Given LaCour, it seems clear that we may need to take an additional step back to get to the heart of research: the data. We place a lot of trust in the data themselves: between advisers and advisees, between research collaborators, between producers and users of large, public data sets, and, in turn, between PIs, research assistants, and the actual team collecting the data. This trust is about, of course, whether the data exist at all and whether they measure what they purport to measure. (Green seems to have had a hunch about this?)

We should be clear about the foundations of this trust and what we might do to strengthen it. Ultimately, the LaCour story is a story about the production of data, not their analysis. The transparency agenda needs to expand accordingly, to address the fundamental constant that ‘shit in leads to shit out.’

Here are a few thoughts:

  • Start to teach data collection like it matters. Survey design and data collection are weirdly absent from many graduate programs, even those oriented towards research. You may pick these up in electives but they are rarely required, to my knowledge. Learning about construct validity, validating test instruments in new contexts, questionnaire design, the potential for interviewer effects, some of the murky and inchoate contents of the activity labelled as ‘formative work*,’ etc., need not be re-discovered by each new graduate student or research assistant who takes on field work. If a course-work model won’t work, then a much more explicit apprenticeship model should be sought for those pursuing primary empirical work. In terms of teaching, one occasionally might be forgiven for thinking that impact evaluators had discovered data collection and that there aren’t mounds of resources on household surveys, psychometrics, and questionnaire design that can be used to better ensure the quality and truthfulness of the data being collected. Interdisciplinary work needs to start with what data are collected, and by what means and measures, to answer a particular question.
  • Report on data quality practices. Lots of survey firms and researchers employ strategies such as data audits and back-checks. Good on you. Report it. This almost never makes it into actual publications, but these are not just internal operations processes. Researchers do need to put forth some effort to make their readers trust their data as well as their analysis, yet far less work seems to go into the former. With the rise of empirical work in economics and other fields, this needs to be given more documented attention. If you threw out 5% of your data because of failed back-checks, tell me about it; I’d believe the remaining 95% of your data a lot more (a minimal sketch of reporting back-check discrepancies appears just after this list). The onus is on the researchers to make the reader trust their data.
  • Treat surveyors as a valuable source of information. It is increasingly common to at least have surveyors fill in a question at the end of the questionnaire about whether the respondent was cooperative (usually a Likert scale item) or some other brief reflection on how the interview went. I have no idea what happens to the data so produced; if they are used to throw out or differentially weight responses, do please tell the reader about it. Moreover, you can systematically ask your surveyors questions (including anonymously) about question items that they don’t trust. For example, I asked (in written form) this question of surveyors and most reported that it was incredibly embarrassing for them to ask their elders to play certain memory games related to short-term recall. This might be a good sign to tread lightly with those data, if not discount them completely (whether or not the surveyors faithfully asked the embarrassing question, it still suggests that it created a tense social interaction that may not have generated trustworthy data, even if it didn’t fall in the traditional space of ‘sensitive questions’). If nothing else, the surveyors’ assessments may be given as part of the ‘methods’ or ‘results’ attended to in publications. And, in general, remembering that surveys are human interactions, not matrix populators, is important.
  • Member-check. Member-checking is a process described by Lincoln and Guba (and others) that involves taking results and interpretations back to those surveyed to test interpretative hypotheses, etc. If some results really fly in the face of expectations, this process could generate some ‘red flags’ about which results and interpretations should be treated with care. And these can be reported to readers.
  • Coding. As with ‘formative work,’ the nuance behind ‘we coded the open-ended data’ is often opaque, though this is where a lot of the interpretive magic happens. This is an important reason for the internal replication agenda to start with the raw data. In plenty of fields, it would be standard practice to use two independent coders and to report on inter-rater reliability (see the kappa sketch just after this list). This does not seem to be standard practice in much of impact evaluation. This should change.
  • Check against other data-sets. It would not take much time for researchers to put their own findings into context by comparing (as part of a publication) the distribution of results on key questions to the distribution from large data-sets (especially when some questionnaire items are designed to mimic the DHS, LSMS, or other large public data-sets for precisely this reason). This is not reported often enough. This does not mean that the large, population-linked data-set will always trump your project-linked data-set, but it seems only fair to alert your readers to key differences, for the purposes of internal believability as well as external validity.
  • Compare findings with findings from studies on similar topics (in similar contexts) and across disciplines. Topics and findings do not end at the boundaries of a particular method of inquiry. Placing the unexpectedness of your findings within this wider array of literature would help.
  • Treat all types of data with similar rigor and respect. (Cue broken record.) If researchers are going to take such care with quantitative data and then stick in a random quote as anec-data in the analysis without giving any sense of where it came from or whether it should be taken as representative of the entire sample or some sub-group… well, it’s just a bit odd. However you want to label these different types of data — quant and qual or data-set-observations and causal-process observations — they are empirical data and should be treated with the highest standards known in each field of inquiry.
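To make the back-check point above a bit more concrete, here is a minimal sketch (Python/pandas) of how back-check discrepancies could be computed and reported. The column names (respondent_id, the list of re-asked questions) and the 30% discrepancy threshold are hypothetical choices for illustration, not anyone’s actual protocol.

```python
import pandas as pd

# Hypothetical subset of questions re-asked during back-check visits.
BACKCHECK_QUESTIONS = ["hh_size", "owns_phone", "main_income_source"]

def backcheck_report(original: pd.DataFrame, backcheck: pd.DataFrame,
                     max_discrepancy: float = 0.3) -> pd.DataFrame:
    """Compare original and back-check responses per respondent and flag
    interviews whose discrepancy rate exceeds the (arbitrary) threshold."""
    merged = original.merge(backcheck, on="respondent_id", suffixes=("_orig", "_bc"))
    mismatches = pd.DataFrame(index=merged.index)
    for q in BACKCHECK_QUESTIONS:
        mismatches[q] = merged[f"{q}_orig"] != merged[f"{q}_bc"]
    merged["discrepancy_rate"] = mismatches.mean(axis=1)
    merged["flag_for_review"] = merged["discrepancy_rate"] > max_discrepancy
    return merged[["respondent_id", "discrepancy_rate", "flag_for_review"]]

# The share of interviews flagged, and what was then done with them, is exactly
# the kind of detail that could be reported in a paper's data section.
```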

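On the inter-rater reliability point in the coding bullet, here is an illustrative, pure-Python computation of Cohen’s kappa for two independent coders; the codes and responses are made up, and in practice one might reach for an existing statistics package instead.

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa: agreement between two coders, corrected for chance."""
    assert len(coder_a) == len(coder_b) and len(coder_a) > 0
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    counts_a, counts_b = Counter(coder_a), Counter(coder_b)
    # Chance agreement if each coder labelled at random with their own marginals.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a.keys() | counts_b.keys()) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical codes applied to ten open-ended responses by two independent coders.
coder_1 = ["cost", "distance", "cost", "quality", "cost",
           "other", "distance", "quality", "cost", "other"]
coder_2 = ["cost", "distance", "quality", "quality", "cost",
           "other", "distance", "cost", "cost", "other"]
print(round(cohens_kappa(coder_1, coder_2), 2))  # agreement beyond chance, roughly 0.72
```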
I can’t assess whether any of these measures, singly or together, would have made a major difference in the LaCour case, especially since it remains nebulous how the data were generated, let alone with what care. But the LaCour case reveals that we need to be more careful. A big-name researcher was willing to trust that the data themselves were real and collected to the best of another researcher’s ability, and focused on getting the analysis right. In turn, other researchers bought into both the analysis and the underlying data because of the big-name researcher. This suggests we need to do a bit more to establish trust in the data themselves, and that the onus for this is on the researchers (big names or no) claiming to have led the data collection and cleaning processes. This is especially true given the unclear role for young researchers as potential replicators and debunkers, highlighted here. I hope the transparency agenda steps up accordingly.

*If, on occasion, a researcher reported on what happened during the ‘formative phase’ and how the ‘questionnaire was changed in response,’ that would be really interesting learning for all of us. Report it. Also, if you are planning to do ‘qualitative formative work’ to improve your questionnaire, it would be good to build time into your research timeline to actually analyze the data produced by that work, report on that analysis, and explain how the analysis led to changing certain questionnaire items…


Oops, Got Long-Winded About ‘Median Impact Narratives’

*A revised version of this post is also available here.

I finally got around to reading a post that had been flagged to me awhile ago, written by Bruce Wydick. While I don’t think the general idea of taking sampling and representativeness seriously is a new one, the spin of a ‘median narrative’ may be quite helpful in making qualitative and mixed work more mainstream and rigorous in (impact) evaluation.

Anyway, I got a bit long-winded in my comment on the devimpact blog site, so I am sticking it below as well, with some slight additions:

First, great that both Bruce and Bill (in the comments) have pointed out (again) that narrative has a useful value in (impact) evaluation. This is true not just for a sales hook or for helping the audience understand a concept — but because it is critical to getting beyond ‘did it work?’ to ‘why/not?’

I feel Bill’s point (“telling stories doesn’t have to be antithetical to good evaluation“) should be sharper — it’s not just that narrative is not antithetical to good evaluation but, rather, it is constitutive of good evaluation and any learning and evidence-informed decision-making agenda. And Bill’s right, part of the problem is convincing a reader that it is a median story that’s being told when an individual is used as a case study — especially when we’ve been fed outlier success stories for so long. This is why it is important to take sampling seriously for qualitative work and to report on the care that went into it. I take this to be one of Bruce’s key points and why his post is important.

I’d also like to push the idea of a median impact narrative a bit further. The basic underlying point, so far as I understand it, is a solid and important one: sampling strategy matters to qualitative work and for understanding and explaining what a range of people experienced as the result of some shock or intervention. It is not a new point but the re-branding has some important sex appeal for quantitative social scientists.

One consideration for sampling is that the same observables (independent variables) that drive sub-group analyses can also be used to help determine a qualitative sub-sample (capturing medians, outliers in both directions, etc.). To the extent that theory drives what sub-groups are examined via any kind of data collection method, all the better. Arthur Kleinman once pointed out that theory is what helps separate ethnography from journalism, an idea worth keeping in mind.
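As a rough illustration of this first consideration, here is a short sketch (Python/pandas) of pulling a qualitative sub-sample that deliberately includes respondents near the median and in both tails of an observable; the ‘asset_index’ column and group sizes are made up for the example.

```python
import pandas as pd

def qual_subsample(df: pd.DataFrame, covariate: str, n_per_group: int = 3) -> pd.DataFrame:
    """Select cases near the median plus both tails of `covariate` for qualitative follow-up."""
    ranked = df.sort_values(covariate).reset_index(drop=True)
    low_tail = ranked.head(n_per_group).assign(qual_group="low")
    high_tail = ranked.tail(n_per_group).assign(qual_group="high")
    start = len(ranked) // 2 - n_per_group // 2
    middle = ranked.iloc[start:start + n_per_group].assign(qual_group="median")
    return pd.concat([low_tail, middle, high_tail], ignore_index=True)

# Hypothetical use: one row per respondent, with an asset index driving sub-group analyses.
# qual_cases = qual_subsample(respondents, "asset_index", n_per_group=3)
```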

A second consideration is in the spirit of Lieberman’s call for nested analyses (or other forms of linked and sequential qual-quant work): using quantitative outcomes on the dependent variable to drive case selection, iterated down to the micro-level. The results of quantitative work can be used to inform the sampling of later qualitative work, targeting those representing the range of outcome values (on/off ‘the line’).
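In that nested spirit, here is a minimal sketch (Python/numpy, with made-up data) of using residuals from a simple model of the outcome to pick ‘on-the-line’ and ‘off-the-line’ cases for qualitative follow-up; richer models would work the same way, the point being that case selection can be made explicit and reportable.

```python
import numpy as np

def select_nested_cases(x, y, ids, n_cases=2):
    """Fit a simple linear model of y on x, then return the cases the model
    predicts best ('on the line') and worst ('off the line')."""
    slope, intercept = np.polyfit(x, y, deg=1)
    abs_residuals = np.abs(y - (slope * x + intercept))
    order = np.argsort(abs_residuals)
    on_line = [ids[i] for i in order[:n_cases]]    # well predicted by the model
    off_line = [ids[i] for i in order[-n_cases:]]  # poorly predicted; worth asking why
    return on_line, off_line

# Hypothetical baseline scores (x) and endline outcomes (y) for ten respondents.
x = np.arange(10, dtype=float)
y = np.array([1, 2, 2, 4, 9, 5, 7, 8, 1, 10], dtype=float)
ids = [f"resp_{i}" for i in range(10)]
print(select_nested_cases(x, y, ids))
```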

Both these considerations should be fit into a framework that recognizes that qualitative work has its own versions of representativeness (credibility) as well as power (saturation) (which I ramble about here).

Finally, in all of this talk about appropriate sampling for credible qualitative work, we need to also be talking about credible analysis and definitely moving beyond cherry-picked quotes as the grand offering from qualitative work. Qualitative researchers in many fields have done a lot of good work on synthesizing across stories. This needs to be reflected in ‘rigorous’ evaluation practice. Qualitative work is not just for pop-out boxes (I go so far as to pitch the idea of a qualitative pre-analysis plan).

Thanks to both Bruce and Bill for bringing attention to an important topic in improving evaluation practice as a whole — both for programmatic learning and for understanding theoretical mechanisms (as Levy-Paluck points out in her paper). I hope this is a discussion that keeps getting better and more focused on rigor and learning as a whole in evaluation, rather than quant v qual.

Center and Periphery in Doing Development Differently

I have spent almost three weeks back in TX, which was supposed to be, in part, a time of immense productivity in front of our fireplace (yes, it is chilly here. Probably not enough to warrant a fire but still. I am sitting in front of the fireplace and paying for carbon credits to mitigate the guilt.) I brought home big batches of reading but am taking back far more of it with me to Delhi than I had planned.

Nevertheless, I did finally make it through Duncan Green’s post on his immediate thoughts on Doing Development Differently from Matt Andrews and team. So, that’s only three months behind schedule.

Many things are, of course, striking and exciting about this movement, including the idea of rapid iterations to promote (experiential) learning and tweaks, the importance of morale and relationships, and the time horizon.

But the most striking thing had to do with immersion, deep study and deep play*.

deep study of the system, based on continuous observation and listening. In Nicaragua, UNICEF sent public officials out to try and access the public services they were administering, and even made the men carry 30lb backpacks to experience what it’s like being pregnant! This is all about immersion, rather than the traditional ‘fly in, fly out’ consultant culture.

The idea is, it seems, to strike a blow at the ‘consultant culture’ of folks from D.C., London and Geneva parachuting in to solve problems (there’s probably an interesting discussion to be had about the relevance of area studies in this approach). But that is for another time. What is most immediately striking is that Duncan doesn’t report on UNICEF making the consultants visiting Nicaragua from NYC head out to remote areas and try to access services with pregnant-backpacks.

If I read the anecdote correctly (is there more written about this somewhere?), the target was public officials, which I take to mean Nicaraguan civil servants and politicians based in the capital or another metropolis. Which is an important (re-)lesson. Being from X country doesn’t automatically make you knowledgeable about all areas and details of X country (duh). Probably many of us have sat with civil servants who talk about ‘the hinterlands’ and ‘backwards’ areas and who seem quite surprised at what they find there, if they visit at all. There is a vast difference between the high-level and the street-level, between big decisions about adopting and championing a policy and the many small decisions involved in implementing that idea. Implementation is, as always, profoundly local. (This idea, incidentally, also applies to study design and the relationships between PIs, their research assistants and the field teams.)

This all suggests that, maybe, doing development differently (and probably doing evaluation differently) also has to do with shifting ideas about center and periphery (globally as well as nationally), about who has relevant knowledge, and thinking about immersion for program designers and decision-makers of a variety of types, whether from the country in question or not. This, in part, raises questions about who is doing the iteration and learning and how lessons are passed up as well as down different hierarchies (and spread horizontally). Looking forward to hearing and thinking more.

*It’s hard to resist a Geertz reference, since ‘continuous observation and listening’ sounds an awful lot like ‘participant-observation,’ a study technique that almost *never* comes up in ‘mixed-methods’ evaluation proposals.

the onion theory of communication (doing surveys)

without too much detail, i’ll just note that i spent more time in the hospital in undergrad than i would have preferred. often times, i, being highly unintelligent, would wait until things got really bad and then finally decide one night it was time to visit the ER – uncomfortable but not non-functional or incoherent. on at least one occasion – and because she’s wonderful, i suspect more – alannah (aka mal-bug, malice, malinnus) took me there and would do her homework, sometimes reading out loud to me to keep me entertained and distracted. in one such instance, she was studying some communications theories, one of which was called or nicknamed the onion theory of two-way communication. the basic gist is that revealing information in a conversation should be a reciprocal unpeeling. i share something, shedding a layer of social divide, then you do and we both feel reasonably comfortable.

it didn’t take too long to connect that this was the opposite of how my interaction with the doctor was about to go. the doctor would, at best, reveal her name and i would be told to undress in order to be examined, poked and prodded. onion theory, massively violated.

i mention all this because i have just been reading about assorted electronic data collection techniques, namely here, via here. first, i have learned a new word: ‘paradata.’ this seems useful. these are monitoring and administrative data that go beyond how many interviews have been completed; rather, they focus on the process of collecting data. they can include the time it takes to administer the questionnaire, how long it takes a surveyor to locate a respondent, and details about the survey environment and the interaction itself (i’d be particularly interested in hearing how anyone actually utilizes this last piece of data in analyzing the survey data itself. e.g. would you give less analytic weight to an interview marked ‘distracted’ or ‘uncooperative’ or ‘blatantly lying’?).
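as a purely hypothetical sketch of that parenthetical (python, with made-up flag names and arbitrary weights), paradata could be turned into analytic weights rather than just filed away; i have no idea whether anyone actually does this:

```python
import pandas as pd

# Hypothetical rule: interviews flagged in the paradata get less analytic weight.
# The flag names and weights below are invented purely for illustration.
FLAG_WEIGHTS = {
    "cooperative": 1.0,
    "distracted": 0.7,
    "uncooperative": 0.4,
    "blatantly_lying": 0.0,  # effectively dropped; report how many end up here
}

def add_paradata_weights(df: pd.DataFrame, flag_col: str = "interviewer_flag") -> pd.DataFrame:
    """Attach an analytic weight based on the surveyor's end-of-interview assessment."""
    out = df.copy()
    out["paradata_weight"] = out[flag_col].map(FLAG_WEIGHTS).fillna(1.0)
    return out

# One way the weights could enter analysis, e.g. a weighted mean of an outcome:
# weighted_mean = (df["outcome"] * df["paradata_weight"]).sum() / df["paradata_weight"].sum()
```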

the proposed process of monitoring and adjustment bears striking resemblance to other discussions (e.g. pritchett, samji and hammer) about the importance of collecting and using monitoring data to make mid-course corrections in research and project implementation. it does feel like there is a certain thematic convergence underway about giving monitoring data its due. in the case of surveying, it feels like there is a slight shift towards the qualitative paradigm, where concurrent data collection, entry and analysis and iterative adjustment are the norm. not a big shift but a bit.

but on the actual computer bit, i am less keen. a survey interview is a conversation. a structured conversation, yes. potentially an awkward conversation and almost certainly one that violates the onion theory of communication. but even doctors – some of the ultimate violators – complain about the distance created between themselves and a patient by having a computer between them during an examination (interview), as is now often required to track details for pay-for-performance schemes (e.g.). so, while i appreciate and support the innovations of responsive survey design and recognize the benefits of speed and aggregation over collecting the same data manually, i do wish we could also move towards a mechanism that doesn’t put the surveyor behind a screen (certainly a tablet would seem preferable to a laptop). could entering data rely on voice more than keying in answers to achieve the same result? are there other alternatives that at least maintain some semblance of a conversation? are there other possibilities that allow the flexibility of updating a questionnaire or survey design while also re-humanizing ‘questionnaire administration’ as a conversation?

update 2 Feb 2014: interesting article on interviews as human, interpersonal interactions.


back (and forward) from ‘the big push forward’ – thoughts on why evidence is political and what to do about it

i spent the beginning of the week in brighton at the ‘big push forward‘ conference on the politics of evidence (#evpolitics), which mixed the need for venting and catharsis (about the “results agenda” and “results-based management” and “impact evaluation”) with some productive conversation, though no immediate concreteness on how the evidence from the conference would itself be used.

in the meantime, i offer some of my take-aways from the conference below, based on some great back-and-forths with some great folks (thanks!).

for me, the two most useful catchphrases were trying to get to “relevant rigor” (being relevantly rigorous and rigorously relevant) and to pay attention to both “glossy policy and dusty implementation.” lots of other turns-of-phrase and key terms were offered, not all of them – to my mind – terribly useful.

there was general agreement that evidence could be political along multiple dimensions, including:

  • what questions are asked (and toward whose ideas skepticism is directed), by whom, of whom, with whom in mind (who needs to be convinced), for whom – and why
  • the way questions are asked and how evidence is collected
  • how evidence is used and shared – by whom, where and why
  • how impact is attributed – to interventions or to organizations (and whether this fuels competitiveness for funds and recognition)
  • whether the originators of the idea (those who already ‘knew’ something was working in some way deemed insufficiently rigorous) or the folks who analyze evidence receive credit for the idea

questions and design. in terms of what evidence is collected, a big part of the ‘push back’ relates to what questions are asked and whether they help governments and organizations improve their practice. this requires getting input from many stakeholders on what questions are important to ask. in addition, it requires planning for how the evidence will be used, including what will be done if results are (a) null, (b) mixed, confused or inconclusive, or (c) negative. more generally, this requires recognizing that policy-makers aren’t making decisions about ‘average’ situations but rather decisions for specific situations. as such, impact evaluations and systematic reviews need to help them figure out what evidence applies to their situation. the sooner expectations are dispelled that an impact evaluation or a systematic review will provide a clear answer on what should be done next, the better.

my sense, which was certainly not consensus, is that to be useful and to avoid being blocked by egos, impact questions need to shift away from “does X work?” to “does X work better than Y?” and/or “how can X be made to work better?” this also highlights the importance of monitoring and feedback of information into learning and decision-making (i.e.).

two more points on results for learning and decision-making. first, faced with the assertion that ‘impact evaluation doesn’t reveal *why* something works,’ it is unsatisfactory to say something along the lines of ‘we look for heterogeneous treatment effects.’ it absolutely also requires asking front-line workers and program recipients why they think something is and is not working, not as the final word on the matter but as a very important source of information. second, as has been pointed out in many places (e.g.), designing a good impact evaluation requires explication of a clear “Theory of Change” (still not my favorite term but apparently one that is here to stay). further, it is important to recognize that articulating a ToC (or LogFrame or any similar tool) should never be one person’s all-nighter for a funding proposal. rather, the tool is useful as a way of collectively building consensus around mission and why & how a certain idea is meant to work. as such, time and money need to be allocated for a ToC to be developed.

collection. as for the actual collection of data, there was a reasonable amount of conversation about whether the method is extractive or empowering, though probably not enough on how to shift towards empowerment and the fact that extractive/empowering are not synonymous with quant/qual. an issue that received less attention than it should have was that data collection needs to align with an understanding of how long a program should take to work (and funding cycles should be realigned accordingly).

use. again, the conversation about the use of evidence was not as robust as i had hoped. however, it was pointed out early on (by duncan green) that organizations that have been commissioning systematic reviews in fact have no plan to use that evidence systematically. moreover, there was a reasonable amount of skepticism around whether such evidence would actually be used to make decisions to allocate resources to specific organizations or projects (for example, to kill or radically alter ineffective programs). rather, there is a sense that much impact evaluation is actually policy-based evidence-making, used to justify decisions already taken. alternatively, though, there was concern that the more such evidence was used to make specific funding decisions, the more organizations would be incentivized to make ‘sausage‘ numbers that serve no one. thus, the learning, feedback and improving aspects of data need emphasis.

empowerment in the use of data (as opposed to its collection) was not as much a part of the conversation as i would have hoped, though certainly people raised issues of how monitoring and evaluation data were fed back to and used by front-line workers, implementers, and ‘recipients.’ a few people stressed the importance of near-automated feedback mechanisms from monitoring data to generate ‘dashboards’ or other means of accessible data display, including alternatives to written reports.

a big concern on the use of evidence was ownership and transparency of data (and results), including how this leads to the duplication/multiplication of data collection. surprisingly, with regard to transparency of data and analysis, no one mentioned the recent reinhart & rogoff mess, nor anything about mechanisms for improving data accessibility (e.g.).

finally, there was a sense that data collected needs to be useful – that the pendulum has swung too far from a dearth of data about development programs and processes to an unused glut, such that the collection of evidence feels like ‘feeding the beast.’ again, this loops back to planning how data will be broadly used and useful before it is collected.