
On Science, from Eula Biss’s On Immunity

A nice reminder from Eula Biss (via On Immunity: An Inoculation) that science is a series of building blocks, with small tests and then bigger ones to see if each brick helps us reach higher and see farther.

Science is, as scientists like to say, “self-correcting,” meaning that errors in preliminary studies are, ideally, revealed in subsequent studies. One of the primary principles of the scientific method is that the results of a study must be reproducible. Until the results of a small study are duplicated in a larger study, they are little more than a suggestion for further research. Most studies are not incredibly meaningful on their own, but gain or lose meaning from the work that has been done around them… This doesn’t mean that published research should be disregarded but that, as John Ioannidis concludes, “what matters is the totality of the evidence” (p. 133)…

Thinking of our knowledge as a body suggests the harm that can be done when one part of that body is torn from its context. Quite a bit of this sort of dismemberment goes on in discussions about vaccination, when individual studies are often used to support positions or ideas that are not supported by the body as a whole… When one is investigating scientific evidence, one must consider the full body of information (p. 135).

small thoughts on transparency in research (descriptions of methods, analysis)

there is currently a good deal of attention on transparency of social science research – as there should be. much of this is focused on keeping the analysis honest, including pre-analysis plans (e.g.) and opening up data for re-analysis (internal replication, e.g. here and here). some of this will hopefully receive good discussion at an upcoming conference on research transparency, among other fora.

but it seems that at least two points are missing from this discussion, both focused on the generation of the analyzed data itself.

 

intervention description and external replication

first: academic papers in “development” rarely provide a clear description of the contents of an intervention / experiment, such that it could plausibly be reproduced. growing up with a neuroscientist / physiological psychologist (that’s my pop), i had the idea that bench scientists had this part down. everyone (simultaneously researchers and implementers) has lab notebooks and they take copious notes. i know because i was particularly bad at that part when interning at the lab.*

then, the researchers report on those notes: for example, on the precise dimensions of a water maze they built (to study rodent behavior in stressful situations), and give you a nice diagram so that you can, with a bit of skill, build your own version of the maze and follow their directions to replicate the experiment.

pop tells me i am overly optimistic about the bench guys getting this totally right. he agrees that methods sections are meant to be exact prescriptions for someone else to reproduce your study and its results: for example, they are very detailed on exactly how you ran the experiment, the apparatus used, where reagents (drugs) were purchased, etc. he also notes that one thing that makes this easier in bench science is that “most experimental equipment is purchased from a manufacturer which means others can buy exactly the same equipment. gone are the dark days when we each made our own mazes and such. reagents are from specific suppliers who keep detailed records on the quality of each batch…”

then he notes: “even with all this, we have found reproducibility to be sketchy, often because the investigators are running a test for the first time. a reader has to accept that whatever methodological details were missed (your grad student only came in between 1 and 3AM when the air-conditioning was off) were not critical to the results.” or maybe this shouldn’t go unreported and accepted.

the basic idea holds in and out of the lab: process reporting on the intervention/treatment needs to get more detailed and more honest. without it, the reader doesn’t really understand what the ‘beta’ in any regression analysis means – and with any ‘real world’ intervention, there’s a chance that beta contains a good deal of messiness, mistakes, and iterative learning resulting in tweaks over time.

as pop says: “an investigator cannot expect others to accept their results until they are reproduced by other researchers.” and the idea that one can reproduce the intervention in a new setting (externally replicate) is a joke unless detailed notes are kept about what happens on a daily or weekly basis with implementation and, moreover, these notes are made available. if ‘beta’ contained some things at one time in a study and a slightly different mix at a different time, shouldn’t this be reported? if research assistants don’t / can’t mention to their PIs when things get a bit messy in ‘the field’, and PIs in turn don’t report glitches and changes to their readers or other audiences, then there’s a problem.
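
to make the ‘beta’ point concrete, here is a minimal sketch in python with made-up data – the variable names, effect sizes, and the two-phase setup are all hypothetical, purely for illustration. if the program was tweaked partway through implementation, a single pooled treatment coefficient quietly averages over the two versions; only documentation of the change lets a reader (or replicator) estimate them separately.

```python
# a toy illustration: the intervention was tweaked halfway through the study,
# so "treated" means something slightly different in each phase.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "treated": rng.integers(0, 2, n),          # randomized treatment indicator
    "phase2": np.repeat([0, 1], n // 2),       # second half of implementation
})
# hypothetical: the tweaked (phase-2) version of the program is twice as effective
true_effect = np.where(df["phase2"] == 1, 0.6, 0.3)
df["outcome"] = true_effect * df["treated"] + rng.normal(0, 1, n)

# the usual report: one pooled beta, averaging over whatever 'treated' meant at each point
pooled = smf.ols("outcome ~ treated", data=df).fit()
print(pooled.params["treated"])  # roughly 0.45, a blend of the two versions

# what detailed implementation notes would allow instead: phase-specific effects
by_phase = smf.ols("outcome ~ treated * phase2", data=df).fit()
print(by_phase.params[["treated", "treated:phase2"]])  # phase-1 effect and the shift
```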

 

coding and internal replication

as was raised not-so-long-ago by the nice folks over at political violence at a glance, the cleaning and coding of data for analysis is critical to interpretation – and therefore critical to transparency. there is not enough conversation happening about this – with “this,” in large part, being about construct validity. there are procedures for coding, usually involving independent coders working with the same codebook, then checking inter-rater reliability and reporting the resultant kappa or other relevant statistic. the reader really shouldn’t be expected to believe the data otherwise, on the whole “shit in, shit out” principle.
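
as a concrete (if toy) illustration of that kind of check: below is a minimal sketch assuming two independent coders have applied the same codebook to the same handful of responses. the category labels are invented, and cohen’s kappa is just one of several reliability statistics one might report.

```python
# two independent coders apply the same codebook to the same 8 responses;
# cohen's kappa measures agreement beyond what chance alone would produce.
from sklearn.metrics import cohen_kappa_score

coder_a = ["livelihood", "health", "health", "other", "livelihood", "health", "other", "livelihood"]
coder_b = ["livelihood", "health", "other",  "other", "livelihood", "health", "other", "health"]

kappa = cohen_kappa_score(coder_a, coder_b)
print(f"cohen's kappa: {kappa:.2f}")  # report this alongside the coded data

# disagreements should be adjudicated (and the adjudication rule reported), not silently dropped
disagreements = [i for i, (a, b) in enumerate(zip(coder_a, coder_b)) if a != b]
print("items needing adjudication:", disagreements)
```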

in general, checks on data that i have seen relate to double-entry of data. this is important but hardly sufficient to assure the reader that the findings reported are reasonable reflections of the data collected and the process that generated them. the interpretation of the data prior to the analysis – that is, coding and cleaning – is critical, as pointed out by political violence at a glance, for both quantitative and qualitative research. and, if we are going to talk about open data for reanalysis, it should be the raw data, so that it can be re-coded as well as re-analyzed.
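
and, since double entry keeps coming up, a small sketch of what that check might look like – the file names and columns here are hypothetical – plus a reminder that it only catches keying errors, not coding decisions:

```python
# compare two independently keyed versions of the same survey records and
# flag cells that disagree, so they can be checked against the paper forms.
# assumes both versions cover the same respondents and the same columns.
import pandas as pd

entry1 = pd.read_csv("survey_entry1.csv").set_index("respondent_id").sort_index()
entry2 = pd.read_csv("survey_entry2.csv").set_index("respondent_id").sort_index()

mismatches = entry1.compare(entry2)  # returns only the cells where the two entries differ
print(mismatches)

# this catches keying errors, but it says nothing about how raw answers were later
# cleaned and coded -- which is why the raw (pre-coding) data is what should be shared
# for re-analysis.
```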

 

in short, there’s more to transparency in research than allowing for internal replication of a clean dataset. i hope the conversation moves in that direction — the academic, published conversation as well as the over-beers conversation.

 

*i credit my background in anthropology, rather than neuroscience, with getting better at note-taking. sorry, pop.

we’re experimenting! also, clarifying types of replications

a nice article from chris said, discussing how we might alter publication rules (and the granting requirements of donor organizations) in a way that moves us closer to good, useful research – specifically, looking more toward the importance of the question and the rigor of the method used to answer it. i am, of course, fully in favor of focusing on important (in this case, policy-relevant) questions, rigorous design and implementation (in this case, with an eye toward scale-up potential), and solid data collection (no really, good regressions don’t fix bad data) – as well as publishing results that aren’t necessarily the sexiest but will ultimately move our understanding of what works forward in important ways.

Granting agencies should reward scientists who publish in journals that have acceptance criteria that are aligned with good science. In particular, the agencies should favor journals that devote special sections to replications, including failures to replicate. More directly, the agencies should devote more grant money to submissions that specifically propose replications. Moreover — and this is a fairly radical step that many good scientists I know would disagree with — I would like to see some preference given to fully “outcome-unbiased” journals that make decisions based on the quality of the experimental design and the importance of the scientific question, not the outcome of the experiment. This type of policy naturally eliminates the temptation to manipulate data towards desired outcomes.

(addition 30.04.2012: http://www.overcomingbias.com/2012/04/who-wants-unbiased-journals.html)

if we start taking replications more seriously in social science experiments, we may need to start being more precise with terms. there are a few possible variants/meanings of replication, potentially making it difficult for experimenters, donors, consumers of research, and other stakeholders to speak clearly with one another and set expectations.

  • one potential meaning is a program/experiment conducted in one location with one set of implementers, repeated in the same place with different implementers (say, the government versus an NGO). call this internal replication (?).
  • another type of replication would be transplanting the program/experiment to a different context, making either minor adjustments (such as language) or more substantive adjustments based on lessons learned from the first pass and a local stakeholder analysis. some range of this is external replication; it’s hard to know at what degree of modification we should really stop calling it a replication and just call it a new or extended experiment inspired by another, rather than selling it as a replication.
  • (of course, an internal replication, depending on the number of lessons learned on the first go-round and the modifications required for the second set of implementers to have a go, might itself actually be a new or extended experiment, rather than a replication. again, the line would be fuzzy but presumably some simple criteria/framework could be delineated)

h/t marginal revolution & rachel strohm

wait, we’re experimenting, right?

many of the descriptions of the ideal next World Bank president – at least the ones with which I agree – have called for a little more humility about how much we actually know about economic & human development and poverty reduction.

so it’s frustrating to see articles like this, which imply a low level of humility about the work we are doing and an unclear commitment to learning what actually does and does not work (regardless of felt commitment to poverty reduction & development).

a large part of the reason that experiments and impact evaluations in development have become popular is that we weren’t getting as far as we needed with theory, intuition, or observation alone. money and other resources were being put into programs when we didn’t know whether they were effective (even if things seemed to be changing in the presence of the program), let alone how they compared to other programs in terms of efficacy or cost-efficiency. process and implementation evaluations that could have improved subsequent program interventions were not being conducted and/or shared.

it seems like we need to pause and think about how and why we are experimenting.

  • we experiment because we don’t know what works – or whether something that works in one location will work in another. if we knew what worked, we would potentially be under some ethical obligation to do that thing for all people in all places we thought it would work. when we don’t know what works, or when there is at least genuine disagreement about the best approaches, an experimental design is justified. in short, we need to bring equipoise into social science research. in part, this means that we should be testing our new (experimental) idea against the best known or available intervention with a similar goal. new drugs are usually tested against a placebo and a regularly used treatment.
  • because we are experimenting, we should encourage the publication of null findings and laud these as equally important learning experiences. this requires that funders recognize such reporting as essential to the accountability of studies and program implementations. it also requires changing the strong bias of journal editors and reviewers toward publishing only significant findings. confidence intervals aside, null findings may be just as “significant” for our understanding of what works and doesn’t work in development as statistically significant results.
  • evaluations probably need to start to look more like programs that could be scaled up. there are good experimental reasons for manipulating only one or two key variable(s) at a time and trying to limit all other contamination, but there has to be increasing movement toward learning what works in situ, even if that means there is more than one moving part. and if it is really unclear how the findings from an experiment would be scaled up in a program or policy, then the experiment likely needs to be re-thought.
  • also, we need to think more about the ethics of doing social science experiments in low- and middle-income countries. in clinical research, there are increasing obligations for large pharmaceutical or academic institutions, if the drug proves effective, to make the drug available – at a minimum – to the host community.* this is because the host community bore some risk in participating in an experimental intervention – but more generally because any intervention alters biological and social patterns that will remain changed after the trial ends and the researchers leave the community to publish their results in scientific journals.
  • experimenting is good in a context in which we aren’t sure what works. NGO- and state-run programs need to be linked with evaluation efforts. there are roughly a bajillion graduate students interested in program evaluation, development economics, and so on and there are a large number of programs that are being run by governments or NGOs without any rigorous evaluation or clear delineation of ‘lessons learned’ – or at least evaluations that get talked about in the technocratic space. none of these programs will offer a perfect experimental design but, hey, that’s where the complex statistics come in. all we need is a yenta to link grad students to programs (and evaluation funding) and we’re set.
  • experiments, programs, policies, etc., need to allow us to learn about the implementation process as well as the outcomes. deviations from the initial design and unexpected hurdles along the way should be reported so that everyone can learn from them. yes, the reality of actually running these programs may make it more difficult to make causal inferences with certainty – but these aren’t just aberrations in an experimental design, they’re part of the reality into which any scaled-up effort would be plugged. this is similar to the distinction between “efficacy” and “effectiveness” in clinical research: knowing how an intervention performs under ideal experimental conditions (efficacy) may not tell us how the same intervention performs when applied under real-world circumstances or scaled up to other communities (effectiveness).
  • replication is central to the natural sciences but still largely under-utilized in the social sciences and development research. we need to recognize the importance of replication in confirming or dis-confirming the results from program implementation studies and encourage greater publication of replication studies.

*see, for example, “Moral standards for research in developing countries: from ‘reasonable availability’ to ‘fair benefits’” or “What makes clinical research in developing countries ethical? The benchmarks of ethical research”

*other inspiration

**big thanks to Mike for helping me sort through a lot of the ideas in this post