Back to Basics — Trusting Whether and How The Data are Collected and Coded

This is a tangential response to the LaCour and #lacourgate hubbub (with hats off to the summaries and views given here and here). While he is not implicated in all of the comments below, I am most certainly indebted to Mike Frick for planting the seed of some of the ideas presented below, particularly on member-checking (hopefully our under-review paper on the same will be out sometime in the future…). Salifu Amidu and Abubakari Bukari are similarly motivational-but-not-implicated, as are Corrina Moucheraud, Shagun Sabarwal and Urmy Shukla.

To a large extent, the LaCour response brings a new angle to an increasingly familiar concern: trusting the analysis. This has meant additional (and important) calls for replication and other forms of post-publication peer review (as Broockman calls for) as a guard against significance-hungry, nefarious researchers. Pre-analysis plans, analytic/internal replications, and so on are all important steps towards research transparency. But they miss the fundamental tendency to treat data as ‘true’ once they make it into the familiar, rectangular format of a spreadsheet.

Given LaCour, it seems clear that we may need to take an additional step back, to the heart of research: the data. We place a lot of trust in the data themselves — between advisers and advisees, between research collaborators, between producers and users of large, public data sets, and, in turn, between PIs, research assistants, and the actual teams collecting the data. This trust is about, of course, whether the data exist at all and whether they measure what they purport to measure. (Green seems to have had a hunch about this?)

We should be clear about the foundations of this trust and what we might do to strengthen it. Ultimately, the LaCour story is a story about the production of data, not their analysis. The transparency agenda needs to expand accordingly, to address the fundamental truism that ‘shit in leads to shit out.’

Here are a few thoughts:

  • Start to teach data collection like it matters. Survey design and data collection are weirdly absent from many graduate programs — even those oriented towards research. You may pick these up in electives, but they are rarely required, to my knowledge. Learning about construct validity, validating test instruments in new contexts, questionnaire design, the potential for interviewer effects, some of the murky and inchoate contents of the activity labelled ‘formative work*,’ etc, need not be re-discovered by each new graduate student or research assistant who takes on field work. If a course-work model won’t work, then a much more explicit apprenticeship model should be sought for those pursuing primary empirical work. In terms of teaching, one might occasionally be forgiven for thinking that impact evaluators had just discovered data collection and that there aren’t already mounds of resources on household surveys, psychometrics, and questionnaire design that can be used to better ensure the quality and truthfulness of the data being collected. Interdisciplinary work needs to start with what data are collected, and by what means and measures, to answer a particular question.
  • Report on data quality practices. Lots of survey firms and researchers employ strategies such as data audits and back-checks. Good on you. Report it. This almost never makes it into actual publications, but these are not just internal operations processes. Researchers need to put some effort into making their readers trust their data as well as their analysis, yet far less work seems to go into the former. With the rise of empirical work in economics and other fields, this needs to be given more documented attention. If you threw out 5% of your data because of failed back-checks, tell me about it (a minimal sketch of how such flags might be generated appears after this list). I’d believe the remaining 95% of your data a lot more. The onus is on the researchers to make the reader trust their data.
  • Treat surveyors as a valuable source of information. It is increasingly common to at least have surveyors fill in a question at the end of the questionnaire about whether the respondent was cooperative (usually a Likert scale item) or some other brief reflection on how the interview went. I have no idea what happens to the data so produced — if they are used to throw out or differentially weight responses, do please tell the reader about it. Moreover, you can systematically ask your surveyors questions (including anonymously) about question items that they don’t trust. For example, I asked this of surveyors (in written form) and most reported that it was incredibly embarrassing for them to ask their elders to play certain memory games related to short-term recall. This might be a good sign to tread lightly with those data, if not discount them completely (whether or not the surveyors faithfully asked the embarrassing questions, it still suggests a tense social interaction that may not have generated trustworthy data, even if the items didn’t fall in the traditional space of ‘sensitive questions’). If nothing else, the surveyors’ assessments may be given as part of the ‘methods’ or ‘results’ attended to in publications. And, in general, remembering that surveys are human interactions, not matrix populators, is important.
  • Member-check. Member-checking is a process described by Lincoln and Guba (and others) that involves taking results and interpretations back to those surveyed to test interpretative hypotheses, etc. If some results really fly in the face of expectations, this process could generate some ‘red flags’ about which results and interpretations should be treated with care. And these can be reported to readers.
  • Coding. As with ‘formative work,’ the nuances of ‘we coded the open-ended data’ are often opaque, though this is where a lot of the interpretive magic happens. This is an important reason for the internal replication agenda to start with the raw data. In plenty of fields, it would be standard practice to use two independent coders and to report on inter-rater reliability (a sketch of one common measure, Cohen’s kappa, appears after this list). This does not seem to be standard practice in much of impact evaluation. This should change.
  • Check against other data-sets. It would not take much time for researchers to put their own findings into context by comparing (as part of a publication) the distribution of results on key questions to the distribution from large data-sets, especially when some questionnaire items are designed to mimic the DHS, LSMS, or other large public data-sets for precisely this reason; a simple comparison of this kind is also sketched after this list. This is not reported often enough. This does not mean that the large, population-linked data-set will always trump your project-linked data-set, but it seems only fair to alert your readers to key differences, for the purposes of internal believability as well as external validity.
  • Compare findings with findings from studies on similar topics (in similar contexts) — across disciplines. Topics and findings do not end with the boundaries of a particular method of inquiry. Placing the unexpectedness of your findings within this wider array of literature would help.
  • Treat all types of data with similar rigor and respect. (Cue broken record.) If researchers are going to take such care with quantitative data and then stick in a random quote as anec-data in the analysis without giving any sense of where it came from or whether it should be taken as representative of the entire sample or some sub-group… well, it’s just a bit odd. However you want to label these different types of data — quant and qual or data-set-observations and causal-process observations — they are empirical data and should be treated with the highest standards known in each field of inquiry.
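
To make the back-check point above concrete, here is a minimal sketch (Python/pandas, with hypothetical column names such as `resp_id` and `hh_size`) of how failed back-checks might be flagged and counted for reporting. It is an illustration of the general idea, not any particular firm's protocol:

```python
import pandas as pd

# Hypothetical data: original interviews plus a back-check sample that
# re-asked a few 'stable' questions (household size, roof material, etc.).
original = pd.DataFrame({
    "resp_id": [1, 2, 3, 4, 5],
    "hh_size": [4, 6, 3, 5, 2],
    "roof_material": ["metal", "thatch", "metal", "metal", "thatch"],
})
backcheck = pd.DataFrame({
    "resp_id": [2, 4, 5],
    "hh_size": [6, 3, 2],
    "roof_material": ["thatch", "metal", "thatch"],
})

check_vars = ["hh_size", "roof_material"]
merged = original.merge(backcheck, on="resp_id", suffixes=("_orig", "_bc"))

# Share of back-checked items that do not match the original interview.
mismatches = pd.concat(
    [merged[f"{v}_orig"] != merged[f"{v}_bc"] for v in check_vars], axis=1
)
merged["mismatch_rate"] = mismatches.mean(axis=1)

# Flag interviews whose mismatch rate exceeds a project-specific threshold;
# these are the cases a paper could report as re-interviewed or dropped.
flagged = merged.loc[merged["mismatch_rate"] > 0.25, "resp_id"].tolist()
print(f"Back-checked {len(merged)} of {len(original)} interviews; "
      f"{len(flagged)} flagged for follow-up: {flagged}")
```

Even reporting just these two numbers (the share back-checked and the share flagged or dropped) would go a long way.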
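
On the coding point, inter-rater reliability is cheap to compute and report. A minimal sketch using scikit-learn's Cohen's kappa on two hypothetical coders' labels for the same open-ended responses (the categories and data are made up for illustration):

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical: two coders independently assign each open-ended response
# to one of a small set of agreed-upon categories.
coder_a = ["cost", "distance", "cost", "quality", "distance", "cost", "other"]
coder_b = ["cost", "distance", "quality", "quality", "distance", "cost", "cost"]

kappa = cohen_kappa_score(coder_a, coder_b)
print(f"Cohen's kappa across {len(coder_a)} responses: {kappa:.2f}")

# Disagreements (here, the 3rd and 7th responses) go back for discussion and
# a documented reconciliation rule, and the kappa itself gets reported.
```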
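
And on checking against other data-sets: even a side-by-side of summary statistics, or a simple two-sample test, on the mimicked items is informative. A sketch with made-up numbers standing in for a DHS- or LSMS-style benchmark:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical: years of schooling in a project sample vs. the same item
# extracted from a large public data-set for a comparable population.
project_sample = rng.normal(loc=6.5, scale=2.5, size=400).clip(min=0)
public_benchmark = rng.normal(loc=5.8, scale=3.0, size=5000).clip(min=0)

print(f"project mean: {project_sample.mean():.2f}, "
      f"benchmark mean: {public_benchmark.mean():.2f}")

# A two-sample Kolmogorov-Smirnov test gives a rough read on whether the
# whole distributions differ, not just the means.
ks_stat, p_value = stats.ks_2samp(project_sample, public_benchmark)
print(f"KS statistic: {ks_stat:.3f} (p = {p_value:.3f})")
```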

I can’t assess whether any of these measures, singly or together, would have made a major difference in the LaCour case — especially since it remains nebulous how the data were generated, let alone with what care. But the LaCour case reveals that we need to be more careful. A big-name researcher was willing to trust that the data themselves were real and collected to the best of another researcher’s ability — and focused on getting the analysis right. In turn, other researchers bought into both the analysis and the underlying data because of the big-name researcher. This suggests we need to do a bit more to establish trust in the data themselves — and that the onus for this is on the researchers — big names or no — claiming to have led the data collection and cleaning processes. This is especially true given the unclear role for young researchers as potential replicators and debunkers, highlighted here. I hope the transparency agenda steps up accordingly.

*If on occasion a researcher reported on what happened during the ‘formative phase’ and about how the ‘questionnaire was changed in response,’ that would be really interesting learning for all of us. Report it. Also, if you are planning to do ‘qualitative formative work’ to improve your questionnaire, it would be good if you built in time in your research timeline to actually analyze the data produced by that work, report on that analysis, and explain how the analysis led to changing certain questionnaire items…

Thinking About Building Evaluation Ownership, Theories of Change — Back From Canadian Evaluation Society

This week I had the pleasure of attending the Canadian Evaluation Society (#EvalC2015) meeting in Montreal, which brought together a genuinely nice group of people thinking hard not just a-boot evaluation strategies and methodologies but also about how evaluation can contribute to better and more transparent governance, improving our experience as global and national citizens — that is, evaluation as a political (and even social justice) act as much as a technical one.

Similarly, there was some good conversation around the balance between the evaluation function serving accountability versus learning and improvement, and concern about the pendulum swinging too far towards an auditing rather than an elucidating and improving role.

For now, I want to zoom in on two important themes and my own nascent reflections on them. I’ll be delighted to get feedback on these thoughts, as I am continuing to firm them up myself. My thoughts are in italics, below.

  1. Collaboration, neutrality and transparency
    1. There were several important calls relating to transparency, including a commitment to making evaluation results public (and taking steps to make sure citizens see these results, without influencing their interpretation of them or otherwise playing an advocacy role), and a call for decision-makers who claim to have made use of evidence to be more open about how and which evidence informed their decisions. This is quite an important point and it echoes some of the points Suvojit and I made about thinking about the use of evaluative evidence ex ante. We’re continuing to write about this, so stay tuned.
    2. There was quite a bit of push back about whether evaluation should be ‘neutral’ or ‘arm’s length’ from the program — apparently this is the current standard practice in Canada (with government evaluations). This push back seems to echo several conversations in impact evaluation about beginning stakeholder engagement and collaboration far earlier in the evaluation process, including Howard White’s consideration of evaluative independence.
    3. Part of the push back on ‘arm’s length neutrality’ came from J. Bradley Cousins, who will have a paper and ‘stimulus document’ coming out in the near future on collaborative evaluation that seems likely to be quite interesting. In another session, it was noted that ‘collaboration has more integrity than arm’s length approaches.’ I particularly liked the idea of thinking about how engagement between researchers and program/implementation folks could improve a culture of evaluative thinking and organizational learning — a type of ‘capacity building’ we don’t talk about all that often. Overall, I am on board with the idea of collaborative evaluation, with the major caveat that evaluators need to report honestly about the role they play vis-a-vis refining program theory, refining the program contents, assisting with implementing the program, monitoring, etc.
  2. Building a theory of change and fostering ownership in an evaluation.
    1. There was a nice amount of discussion around making sure that program staff, implementers, and a variety of stakeholders could “see themselves” in the theory of change and logic model/results chain. This not only included that they could locate their roles but also that these planning and communication tools reflected the language with which they were used to talking about their work. Ideally, program staff can also understand their roles and contributions in light of their spheres of direct and indirect influence.
    2. John Mayne and Steve Montague made some very interesting points about building a theory of change, which I will have to continue to process over the upcoming weeks. They include:
      1. Making sure to think about ‘who’ in addition to ‘what’ and ‘why’ — this includes, I believe, who is doing what (different types and levels of implementers) as well as defining intended reach, recognizing that some sub-groups may require different strategies and assumptions in order for an intervention to reach them.
      2. As was noted, “frameworks that don’t consider reach conspire against equity and fairness” because “risks live on the margin.” I haven’t fully wrapped my head around the idea of ‘theories of reach’ embedded or nested within the theory of change but am absolutely on-board with considering distributional expectations and challenges from the beginning and articulating assumptions about when and why we might expect heterogeneous treatment effects — and deploying quantitative and qualitative measurement strategies accordingly.
    3. John Mayne shared his early thinking that, for each assumption in a theory of change, the builders should articulate a justification for its:
      1. Necessity — why is this assumption needed?
      2. Realization — why is this assumption likely to be realized in this context?
      3. This sounds like an interesting way to plan exercises towards the collaborative building of theories of change.
    4. A productive discussion developed (fostered by John Mayne, Steve Montague and Kaireen Chaytor, among others) around how to get program staff involved in articulating the theory of change. A few key points were recurring — with strong implications for how long a lead time is needed to set up an evaluation properly (which will have longer-term benefits even if it seems slightly inefficient upfront):
      1. Making a theory of change and its assumptions explicit is part of a reflective practice of operations and implementation.
      2. Don’t try to start tabula rasa in articulating the theory of change (‘the arrows’) with the implementing and program staff. Start with the program documents, including their articulation of the logic model or results chain (the ‘boxes’ in a diagrammatic theory of change) and use this draft as the starting point for dialogue.
      3. It may help to start with one-on-ones with some key program informants, trying to unpack what lies in the arrows connecting the results boxes. This means digging into the ‘nitty-gritty’ micro-steps and assumptions, avoiding magical leaps and miraculous interventions. Starting with one-on-ones, rather than gathering the whole group to consider the results chain, can help to manage some conflict and confusion and build a reasonable starting point — despite the fact that:
      4. Several commentators pointed out that it is unimportant whether the initial results chain was validated or correct — or was even set up as a straw-person. Rather, what is important is having something common and tangible that can serve as a touchstone or boundary object in bringing together the evaluators and implementers around a tough conversation. In fact, having some flaws in the initial evaluators’ depiction of the results chain and theory of change allows opportunities for program staff to be the experts and correct these misunderstandings, helping to ensure that program staff are not usurped in the evaluation design process.
      5. Initial disagreement around the assumptions (all the stuff behind the arrows) in the theory of change can be productive if they are allowed to lead to dialogue and consensus-building. Keep in mind that the theory of change can be a collaborative force. As Steve Montague noted, “building a theory of change is a team sport,” and needs to be an iterative process between multiple stakeholders all on a ‘collective learning journey.’
        1. One speaker suggested setting up a working group within the implementing agency to work on building the theory of change and, moreover, to make sure that everyone internally understands the program in the same way.
    5. This early engagement work is the time to get construct validity right.
    6. The data collection tools developed must, must, must align with the theory of change developed collectively. This is also a point Shagun and I made in our own presentation at the conference, where we discussed our working paper on meaningfully mixing methods in impact evaluation. Stay tuned!
    7. The onus is on the evaluator to make sure that the theory of change is relevant to many stakeholders and that the language used is familiar to them.
    8. There was also a nice discussion about making sure to get leadership buy-in and cooperation early in the process on what the results reporting will look like. Ideally the reporting will also reflect the theory of change.

Overall, much to think about and points that I will definitely be coming back to in later work. Thanks again for a great conference.

Oops, Got Long-Winded About ‘Median Impact Narratives’

*A revised version of this post is also available here.

I finally got around to reading a post that had been flagged to me a while ago, written by Bruce Wydick. While I don’t think the general idea of taking sampling and representativeness seriously is a new one, the spin of a ‘median narrative’ may be quite helpful in making qualitative and mixed work more mainstream and rigorous in (impact) evaluation.

Anyway, I got a bit long-winded in my comment on the devimpact blog site, so I am sticking it below as well, with some slight additions:

First, great that both Bruce and Bill (in the comments) have pointed out (again) that narrative has a useful value in (impact) evaluation. This is true not just for a sales hook or for helping the audience understand a concept — but because it is critical to getting beyond ‘did it work?’ to ‘why/not?’

I feel Bill’s point (“telling stories doesn’t have to be antithetical to good evaluation”) should be sharper — it’s not just that narrative is not antithetical to good evaluation but, rather, that it is constitutive of good evaluation and any learning and evidence-informed decision-making agenda. And Bill’s right, part of the problem is convincing a reader that it is a median story that’s being told when an individual is used as a case study — especially when we’ve been fed outlier success stories for so long. This is why it is important to take sampling seriously for qualitative work and to report on the care that went into it. I take this to be one of Bruce’s key points and why his post is important.

I’d also like to push the idea of a median impact narrative a bit further. The basic underlying point, so far as I understand it, is a solid and important one: sampling strategy matters to qualitative work and for understanding and explaining what a range of people experienced as the result of some shock or intervention. It is not a new point but the re-branding has some important sex appeal for quantitative social scientists.

One consideration for sampling is that the same observables (independent vars) that drive sub-group analyses can also be used to help determine a qualitative sub-sample (capturing medians, outliers in both directions, etc). To the extent that theory drives what sub-groups are examined via any kind of data collection method, all the better. Arthur Kleinman once pointed out that theory is what helps separate ethnography from journalism — an idea worth keeping in mind.
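
As a minimal sketch of what that could look like (pandas, with a hypothetical observable `baseline_income` standing in for whatever drives your sub-group analyses): pull the respondents closest to the median plus those at both tails as candidates for the qualitative sub-sample.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
# Hypothetical survey data with one observable used in sub-group analysis.
df = pd.DataFrame({
    "resp_id": range(1, 201),
    "baseline_income": rng.lognormal(mean=1.0, sigma=0.6, size=200),
})

median_val = df["baseline_income"].median()
df["dist_from_median"] = (df["baseline_income"] - median_val).abs()

# Qualitative sub-sample: a handful of 'median' respondents plus both tails.
median_cases = df.nsmallest(4, "dist_from_median")["resp_id"].tolist()
low_outliers = df.nsmallest(3, "baseline_income")["resp_id"].tolist()
high_outliers = df.nlargest(3, "baseline_income")["resp_id"].tolist()
print("median cases:", median_cases)
print("low outliers:", low_outliers, "high outliers:", high_outliers)
```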

A second consideration is in the spirit of Lieberman’s call for nested analyses (or other forms of linked and sequential qual-quant work), using quantitative outcomes for the dependent variable to drive case selection, iterated down to the micro-level. The results of quantitative work can be used to inform the sampling of later qualitative work, targeting cases that represent the range of outcome values (on/off ‘the line’).
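
A minimal sketch of that kind of case selection (hypothetical variables; ordinary least squares via statsmodels): fit the quantitative model, then take the cases closest to and furthest from the regression line as candidates for qualitative follow-up.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
# Hypothetical study data: an outcome plus a couple of observables.
df = pd.DataFrame({
    "outcome": rng.normal(size=200),
    "treated": rng.integers(0, 2, size=200),
    "baseline_assets": rng.normal(size=200),
})

X = sm.add_constant(df[["treated", "baseline_assets"]])
model = sm.OLS(df["outcome"], X).fit()
df["abs_residual"] = model.resid.abs()

# 'On the line': cases the model explains well. 'Off the line': cases it
# does not. Both are informative targets for qualitative follow-up.
on_the_line = df.nsmallest(5, "abs_residual").index.tolist()
off_the_line = df.nlargest(5, "abs_residual").index.tolist()
print("typical ('on-the-line') cases:", on_the_line)
print("deviant ('off-the-line') cases:", off_the_line)
```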

Both these considerations should be fit into a framework that recognizes that qualitative work has its own versions of representativeness (credibility) as well as power (saturation) (which I ramble about here).

Finally, in all of this talk about appropriate sampling for credible qualitative work, we need to also be talking about credible analysis and definitely moving beyond cherry-picked quotes as the grand offering from qualitative work. Qualitative researchers in many fields have done a lot of good work on synthesizing across stories. This needs to be reflected in ‘rigorous’ evaluation practice. Qualitative work is not just for pop-out boxes (I go so far as to pitch the idea of a qualitative pre-analysis plan).

Thanks to both Bruce and Bill for bringing attention to an important topic in improving evaluation practice as a whole — both for programmatic learning and for understanding theoretical mechanisms (as Levy-Paluck points out in her paper). I hope this is a discussion that keeps getting better and more focused on rigor and learning as a whole in evaluation, rather than quant v qual.


Dear Indian (and other) Pharmaceutical Manufacturers Who Produce Birth Control

One of the greatest behavioral nudges of all time is the week of different-colored placebo pills in a package of birth control to allow you to just keep taking pills and always be on the right schedule.

Please adopt this innovation. It cannot alter costs that much.

Sincerely, spending too much time scheduling on google calendar now that my US birth control pills have run out.

small sunday morning thoughts on external validity, hawthorne effects

*an updated version of this post, in which i try to answer some of my musings below, can be found here.

recently, i had the privilege of speaking about external validity at the CLEAR South Asia M&E Roundtable (thank you!), drawing on joint work i am doing with vegard iversen on this question of when and how to generalize lessons across settings.


my main musing for today is how the conversation at the round table, as well as so many conversations that I have had on external validity, always bend back to issues of monitoring and mixed-methods work (and reporting on the same) throughout the course of the evaluation.


my sense is that this points to a feeling that a commitment to taking external validity seriously in study design is about more than site and implementing partner selection (doing so with an eye towards representativeness and generalization, especially if the evaluation has the explicit purpose of informing scale-up).


it is also about more than measuring hawthorne effects or trying to predict the wearing off of novelty effects and the playing out of general equilibrium effects should the program be scaled-up — though all these things are clearly important.


the frequency with which calls for better monitoring come up as an external validity concern suggests to me that we need to take a hard look at what we mean by internal validity. in a strict sense, internal validity relates to the certainty about the causal claim of program ‘p’ on interesting outcome ‘y.‘ but surely this also includes a clear understanding of what program ‘p’ itself is — that is, what is packed into the treatment variable (whose coefficient we estimate) in the regression, which is likely not to have been static over the course of an evaluation or uniform across all implementation sites.


this is what makes monitoring using a variety of data collection tools and types so important — so that we know what a causal claim is actually about (as cartwright, among others, has discussed). this is important both for understanding what happened at the study site itself and for trying to learn from a study site ‘there‘ for any work another implementer or researcher may want to do ‘here.‘ some calls for taking external validity seriously seem to me to be veiled calls for re-considering the requirements of internal validity (and issues of construct validity).


as a side musing, here’s a question for the blogosphere: we usually use ‘hawthorne effects‘/observer effects (please note, named for the factory where the effect was first documented, not for some elusive dr. hawthorne) to refer to the changes in participant/subject/beneficiary behavior strictly because they are being observed (outside of the behavior changes intended by the intervention itself).


but in much social science and development research, (potential) beneficiaries are not the only ones being observed. so too are implementers, who may feel more pressure to implement with fidelity to protocol, even if the intervention doesn’t explicitly alter their incentives to do so. can we also consider this a hawthorne effect? is there another term for observer-effects-on-implementers? surely the potential for such an effect must be one of the potential lessons from the recent paper on how impact evaluations help deliver development projects?