There’s been a lot of controversy over the attempts to replicate Simone Schnall’s research on disgust, cleanliness, and moral reasoning. In particular, contention has centered on a failed replication by Johnson, Cheung, and Donnellan, to which Schnall responded in the same commentary issue of Social Psychology.
In a previous piece, I covered features of Schnall’s original paper from 2008. Now, let’s take a look at her response to the replicators.
NOTE: Dr. Deborah Mayo touches on this specific case in a blog post here.
Schnall proposed that Johnson et al. failed to replicate her work due to a statistical artifact called a ceiling effect (2). Such an effect occurs when the survey scale is not adequate to capture the full response range; if the measurement scale were to include a higher ceiling, responses would vary throughout that space, but with a lower ceiling ratings are forced to bunch artificially at the maximum.
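To see how this bunching can hide a real difference, here is a minimal simulation. The numbers are entirely hypothetical, not Schnall's or the replicators' data: two groups differ by a full point on a latent severity scale, but rounding and clipping responses at the scale maximum pushes much of the control group to the top value and shrinks the observed difference.

```python
# Hypothetical illustration of a ceiling effect: a real 1-point latent
# difference between groups is compressed once responses are capped at
# the top of the rating scale. None of these numbers come from the studies.
import random

random.seed(0)
SCALE_MAX = 9  # top of the hypothetical severity scale

def observe(x):
    """Round a latent response to the scale and cap it at the ceiling."""
    return min(max(round(x), 0), SCALE_MAX)

# The manipulation truly lowers mean severity by 1 point on the latent scale.
control   = [observe(random.gauss(8.5, 1.5)) for _ in range(100)]
treatment = [observe(random.gauss(7.5, 1.5)) for _ in range(100)]

def mean(xs):
    return sum(xs) / len(xs)

def at_ceiling(xs):
    return sum(x == SCALE_MAX for x in xs) / len(xs)

# The observed difference falls short of the latent 1.0 because the
# control group bunches at the ceiling far more than the treatment group.
print(f"observed difference:  {mean(control) - mean(treatment):.2f}")
print(f"control at ceiling:   {at_ceiling(control):.0%}")
print(f"treatment at ceiling: {at_ceiling(treatment):.0%}")
```

The point of the sketch is just that responses pile up at the maximum and the between-group gap compresses, which is the mechanism Schnall claims invalidated the replication's null result.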
Looking at the data for the replication experiments, she notes that the mean (moral) severity ratings are far higher in the neutral conditions than they were in the original study, with a higher percentage of scores occurring at the extremes (Schnall 2). She converts all the effect sizes into percentages and finds that they are lower in the high-rated neutral conditions; she attributes this to a ceiling effect.
She concludes that such an effect prevented the replication study from getting significant findings. One could get “an observed lack of effect…merely from a lack in variance that would normally be associated with a manipulation,” and it follows that “the analyses reported by Johnson et al. are invalid and allow no conclusions about the reproducibility of the original findings” (ibid.).
What do others think?
In their response to Schnall, replicating authors Johnson, Cheung and Donnellan actually run another analysis of the data (4). They attempt to mitigate the influence of any possible ceiling effect by removing the highest ratings. They claim that this will have the effect of reducing bias from any artificial decrease in variance (ibid. 5). After performing a new, item-by-item analysis, they do not find any significant results, and take this to confirm the lack of an effect.
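The trimming idea can be sketched in a few lines. This is my rough reconstruction of the general approach, not Johnson et al.'s exact procedure, and the ratings below are made up for illustration: responses at the scale maximum are dropped, so whatever remains cannot be artificially bunched at a ceiling, at the cost of a smaller sample.

```python
# Rough sketch (my reconstruction, not the authors' actual analysis) of
# trimming ceiling responses before comparing conditions. Ratings are
# hypothetical.
SCALE_MAX = 9

def trim_ceiling(ratings):
    """Drop responses at the scale maximum, where bunching could occur."""
    return [r for r in ratings if r < SCALE_MAX]

neutral = [4, 6, 7, 9, 9, 9]
trimmed = trim_ceiling(neutral)
print(trimmed)   # the ceiling responses are gone, but so is half the sample
```

Note that the trimmed list is smaller than the original, which is exactly the loss of statistical power Simonsohn raises in the next paragraph.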
As Simonsohn highlights in a recent blog post (“Ceiling Effects and Replications”), eliminating the extreme data points has the effect of lowering power, since the sample sizes are decreased. The replicators maintain that this is acceptable, however, as their sample sizes still exceed those of the original experiment (Johnson, Cheung and Donnellan 4).
Simonsohn has, of course, performed his own analysis of whether the data are consistent with Schnall’s interpretation. His method examines all of the individual observations to find what proportion falls at each value of the dependent variable (in this case, the moral severity rating). If a real treatment effect were obscured by a ceiling effect, a graph of the data would show it: the groups’ proportions would differ at the values below the ceiling and overlap only at the ceiling itself (“Ceiling Effects and Replications”).
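A bare-bones version of this diagnostic is easy to write down. This is a paraphrase of the idea, not Simonsohn's code, and the two lists of ratings are invented for illustration: for each scale value, compute the share of each condition's responses landing there, and compare the resulting distributions.

```python
# Sketch of the proportion-by-value diagnostic (my paraphrase of the idea,
# not Simonsohn's code). Ratings below are hypothetical.
from collections import Counter

def value_proportions(ratings, scale_max=9):
    """Proportion of responses at each scale value from 0 to scale_max."""
    counts = Counter(ratings)
    n = len(ratings)
    return [counts[v] / n for v in range(scale_max + 1)]

neutral     = [5, 6, 7, 8, 9, 9, 9, 9, 9, 9]
cleanliness = [3, 4, 5, 6, 7, 8, 9, 9, 9, 9]

for v, (p_n, p_c) in enumerate(zip(value_proportions(neutral),
                                   value_proportions(cleanliness))):
    print(f"rating {v}: neutral {p_n:.0%} vs cleanliness {p_c:.0%}")
```

In this toy data the conditions differ at the lower values and converge at the top of the scale, which is the signature a genuine-effect-plus-ceiling would leave; Simonsohn's point is that the actual replication data show no such signature.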
Looking at the two studies, Simonsohn sees the original experiment as finding a real effect, and the replication as finding no effect – but no evidence at all of a ceiling effect (ibid.). He thinks Schnall’s rebuttal analysis was confused by its use of percentages to compute effect size. (His own analysis uses the difference of means to compute effect size.) It’s not surprising that she got lower percentages for scenarios with higher baseline ratings, he explains: in those cases, she had to divide by higher numbers. Because of this, lower percentages would be found regardless of the existence of any genuine ceiling effect (ibid.).
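The arithmetic behind this objection fits in a few lines. The means below are illustrative, not taken from either study: an identical absolute drop produces a smaller percentage whenever the baseline it is divided by is larger, with no ceiling effect involved at all.

```python
# Illustrative arithmetic (hypothetical means, not the studies' data):
# the same absolute difference of means yields a smaller percentage
# effect when the neutral-condition baseline is higher.
def pct_effect(neutral_mean, treatment_mean):
    """Effect size expressed as a percentage of the neutral baseline."""
    return (neutral_mean - treatment_mean) / neutral_mean

low_baseline  = pct_effect(5.0, 4.0)   # a 1-point drop from 5.0
high_baseline = pct_effect(8.0, 7.0)   # the same 1-point drop from 8.0
print(low_baseline, high_baseline)     # 0.2 vs 0.125
```

So lower percentages in the high-baseline conditions are expected on purely arithmetic grounds, which is why Simonsohn prefers the difference of means.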
So there doesn’t seem to be an artifactual explanation for the results. Of course, it’s difficult to see why one couldn’t just dismiss this failure to replicate as a possible result of chance: not to be ignored, necessarily, but also not to be seen as immediate evidence that the disgust/morality connection does not exist. This seems especially tempting in light of the other, successful replications.
(I maintain that the original research was flawed, as seen in my previous post – but that is a different matter.)
1. Johnson, David J., Felix Cheung, and M. Brent Donnellan. “Hunting for Artifacts: The Perils of Dismissing Inconsistent Replication Results.” Social Psychology (2014): 4-6. 6 July 2014.
2. Schnall, Simone. “Clean Data: Statistical Artifacts Wash Out Replication Efforts.” Social Psychology (2014): 1-6.
3. Simonsohn, Uri. “Ceiling Effects and Replications.” Data Colada, 4 June 2014. <http://datacolada.org/2014/06/04/23-ceiling-effects-and-replications>