Complication with common suicide prevention slogan [semi-academic post]

A common saying declares: “Suicide is a permanent solution to a temporary problem.”

This criticizes a form of flawed problem-solving assumed to underlie suicidal ideation. Putting aside the accuracy of that assumption in general, the claim is clearly irrelevant to the class of situations where the salient problem is in fact permanent.

An obvious example would be when a chronic and incurable medical condition reduces a person’s quality of life to an intolerable level, or is going to do so in the future. In these cases the pain, discomfort, or loss of autonomy motivating the decision to commit suicide may only be temporary in the vacuous sense that it ends when the person dies or is rendered unconscious.

It is actually quite easy to imagine cases where the problem identified by the person is a permanent one. My point is that in these situations, comments like the above do not help and may actually decrease morale (as they demonstrate that the speaker fails to understand the severity of the situation). I strongly suggest refraining from this slogan when one does not know what is motivating the person.


“Thinking, Fast and Slow” – Introductory Gripes (2/?)

In my last post I promised myself I’d critique Risk Savvy and Thinking, Fast and Slow – but I ended up being totally “out of it” for a month due to surgery recovery. So the bits and pieces will be coming up, but slowly! TFS came out first, so I’ll review it first. I hereby dedicate this post to the more minor gripes.

1) It might just be the Kindle version, but the endnote style is infuriating. There’s no in-text numbering system or anything like that; instead, one can look up the citation based on the first few words of the passage. Gigerenzer’s book uses a more fact-check friendly end-note system, where you can see the sources at the end of each chapter. I count that as a minor point in his favor.

2) This is more a frustration than something I’d blame Kahneman for – heavy reliance on research that’s been pretty hotly contested in the last few years. This includes Bargh’s 1996 research on the ideomotor effect (an “instant classic”), which Kahneman uses as an example and illustration of System 1’s associative properties [1]. The elderly-priming study became a source of conflict after Doyen et al. failed to replicate the effect [2]. Similarly, Zhong and Liljenquist’s research on the Macbeth effect came into question after a failure to replicate [3].

He may not have been able to anticipate the controversy, but the potential to do so is why I take issue with his call to uncritically accept these results. “The idea you should focus on… is that disbelief is not an option,” he writes. “The results are not made up, nor are they statistical flukes. You have no choice but to accept that the major conclusions of these studies are true” [4]. The body of research on priming does support the existence of these effects, and I agree we should acknowledge that we are subject to them. But the results absolutely can be statistical flukes, and they absolutely can be made up. We know for a fact that psychological researchers sometimes fabricate data. Should we really suspend all disbelief about them?  

(Kahneman isn’t ignorant of the priming replication issues; he’s been a major voice in subsequent discussions about replication procedure and etiquette.)


[1] Kahneman, Daniel. Thinking, Fast and Slow. New York: Farrar, Straus and Giroux (2011): LOC 896-897/9362.

[2] Bargh, John A. “Nothing in Their Heads: Debunking the Doyen et al. Claims Regarding the Elderly-Priming Study.” Blog post, 5 March 2012. Accessed 8 June 2014.

[3] Earp, Brian D, et al. “Out, Damned Spot: Can the ‘Macbeth Effect’ Be Replicated?” Basic and Applied Social Psychology 36.1 (2014): 91-98.

[4] Kahneman, Thinking, Fast and Slow. LOC 958/9362.


Kahneman and Gigerenzer, Pt 1: “Risk Savvy” and “Thinking, Fast and Slow”

I recently read Gerd Gigerenzer’s Risk Savvy: How to Make Good Decisions. In it, he argues that intuition is part of human rationality, and that simple rules surpass complex models in uncertain conditions – along with many other points.

The book is an obvious response to Daniel Kahneman, who wrote Thinking, Fast and Slow and co-founded prospect theory. I was going to write up Risk Savvy by itself, but in the end decided I couldn’t do it justice without discussing the first book.

In Thinking, Fast and Slow (from hereon out, TFS) Kahneman argues that human decision-making involves a dance between two systems of cognitive processes. One of these, the part we tend to associate ourselves with, involves conscious processes that can be rational and deliberative but take a lot of time and effort. The other system is unconscious, and is motivated by conserving energy. It reasons according to heuristics, and is comparably quite fast. In the vast majority of cases, we work with these unconscious processes (System 1); we are reluctant to put in cognitive effort, and so allow the conscious mind (System 2) to rely on what System 1 tells it. Unfortunately, that will often be a different answer from the kind we need. Since System 1 is largely associative, we often find ourselves answering a question that is merely related to the one actually asked.

The interaction between these systems helps explain why human behavior does not conform to expected utility theory, which assumes we are rational agents with a mind for optimization. While rational agent models can successfully account for a great deal of decision-making behavior, including risk aversion, they cannot explain other tendencies, such as risk-seeking in dire situations. Decision-making is better accounted for by Kahneman and Tversky’s prospect theory (1979); we are disproportionately sensitive to losses, for example, and seem to evaluate decisions based on an adaptation level that takes past situations into account (TFS 281, 284). However, we also have motivations that go beyond the limits of prospect theory. There is evidence, for example, that we are motivated by fear of regret (288) and desire for a coherent narrative (199) to make decisions that seem dramatically counter-intuitive.
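The reference-dependent, loss-averse value function at the core of prospect theory can be sketched in a few lines. The parameter values below are commonly cited estimates from Tversky and Kahneman’s later work, not figures taken from TFS:

```python
def value(x, alpha=0.88, loss_aversion=2.25):
    # Prospect-theory value function over gains and losses relative to a
    # reference point: concave for gains, convex and steeper for losses.
    # Parameter values are commonly cited estimates, not taken from TFS.
    if x >= 0:
        return x ** alpha
    return -loss_aversion * ((-x) ** alpha)

# Loss aversion: a $100 loss weighs more heavily than a $100 gain
gain, loss = value(100), value(-100)
print(round(gain, 1), round(loss, 1))
```

The asymmetry between `gain` and `loss` is exactly the disproportionate sensitivity to losses mentioned above.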

The non-rational nature of human decision making isn’t helped at all by our poor ability to reason about probability, a notion supported by a wealth of empirical evidence. Kahneman gives examples illustrating our inability to calculate probabilities properly, our insensitivity to minor changes in them, our tacit assumptions that they do not apply to us, and our struggles not to act in accordance with intuition even when we have been corrected with accurate information.

Our inability to understand probabilities and our reliance on “System 1” processes set us up to commit major errors. They also set us up to be taken advantage of. We need some level of protection from our poor understanding of probability, as well as from those who would use our heuristics against us; Kahneman, therefore, endorses a form of libertarian paternalism, in which policies are designed to subtly influence people’s behavior.

Risk Savvy: How to Make Good Decisions

Gigerenzer has a different take on human decision-making and risk analysis, one marked largely by a different attitude towards the role of intuition. One sees this divergence early in Risk Savvy, in his treatment of a shared analogy. Kahneman, in TFS, had appealed to visual illusions to explain cognitive illusions and biases. He highlights our inability to prevent unconscious processes from creating a visual illusion – knowing that the image is wrong doesn’t stop you from seeing it anyway (27). The only way to deal with this is to realize you can’t trust the perception – and by analogy, that you can’t always trust an intuition or assumption. Overcoming these illusions requires cognitive effort.

Gigerenzer too draws on visual illusions, but the point that he makes is radically different. In the case of optical illusions, he agrees that it’s clear your visual system has provided you with a distorted version of the stimulus. But it is a useful kind of distortion, he argues (RS 46). The brain, faced with limited information, draws on context to infer the content of the image, producing an information-rich representation rather than a perfectly faithful reproduction. We can only process so much information at a time, and the errors of both cognitive illusions and visual illusions demonstrate the ways that the brain extrapolates to make informed assumptions (ibid. 47).

These processes are efficient, but are they truly dependable? Kahneman sees reliance on “System 1” as lazy and prone to error; he goes so far as to claim that intuition is simply a process of recognition (TFS 237). Gigerenzer, on the other hand, views intuition as a form of unconscious intelligence essential to navigating the world. Rationality isn’t just a feature of the slow and deliberative processes: “Calculated intelligence may do the job for known risks,” he writes, “but in the face of uncertainty, intuition is indispensable” (RS 30).

This marks another important difference. From the standpoint of Gigerenzer’s argument, Kahneman and his colleagues fail to adequately distinguish considerations of risk from considerations of uncertainty. Risk refers to cases where probabilities are known; uncertainty refers to cases in which they are not. The distinction is important because while complex probability models are very good at explaining events retroactively and making predictions in cases where probabilities are already known, they are terrible at prediction in uncertain circumstances.

Conflating risk and uncertainty can lead to irrational behavior – for example, in cases of perceiving certainty where it does not exist and the misinterpretation of unknown probabilities for known ones (32). People place trust in forecasts, assuming that they have been given an accurate grasp of all relevant probabilities; but such decision-making tends to occur in the world of uncertainty, not in the world of risk.

Fortunately, rules of thumb and cognitive heuristics excel in uncertain situations. Gigerenzer draws on the bias-variance dilemma as justification: complex methods, involving multiple factors, will tend to be subject to higher variance. A simple method will often serve you better, since, with fewer free parameters, it suffers far less variance (97).
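A small simulation can make the bias-variance point concrete. The setup below is my own illustration, not an example from Risk Savvy: a simple least-squares line versus an exact degree-4 interpolating polynomial, compared on how wildly their predictions swing across resampled noisy datasets:

```python
import random

random.seed(1)
NODES = [1, 3, 5, 7, 9]   # fixed design points
SIGMA = 2.0               # noise standard deviation

def sample_ys():
    # True relationship y = 2x, observed with noise
    return [2 * x + random.gauss(0, SIGMA) for x in NODES]

def line_predict(ys, x):
    # Simple rule: least-squares line (few parameters, low variance)
    n = len(NODES)
    mx = sum(NODES) / n
    my = sum(ys) / n
    sxx = sum((xi - mx) ** 2 for xi in NODES)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(NODES, ys))
    return my + (sxy / sxx) * (x - mx)

def interp_predict(ys, x):
    # Complex rule: exact degree-4 Lagrange interpolation (zero error on
    # the sample, but a wildly different fit from sample to sample)
    total = 0.0
    for i, xi in enumerate(NODES):
        w = 1.0
        for j, xj in enumerate(NODES):
            if i != j:
                w *= (x - xj) / (xi - xj)
        total += ys[i] * w
    return total

def variance(vals):
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals) / len(vals)

# Predict at x = 10 over many resampled datasets
line_preds, interp_preds = [], []
for _ in range(2000):
    ys = sample_ys()
    line_preds.append(line_predict(ys, 10))
    interp_preds.append(interp_predict(ys, 10))

print(variance(line_preds), variance(interp_preds))
```

The interpolating polynomial chases every wiggle of the noise, so its predictions vary far more from sample to sample than the line’s do – the simple method wins on variance.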

In addition, he disputes Kahneman’s conclusion that humans are ill-equipped to reason about probability, and naturally limited in such a way that they will inevitably make poor decisions. He thinks it is inaccurate to place so much blame on natural limitations, or cognitive laziness. Instead, he argues, much of the problem lies with the ambiguous and confusing ways in which probabilities and statistical information are communicated. Research indicates that while both laypeople and experts struggle to understand probability statements and to reason based on those statements, they perform far better when the probabilities are expressed in the form of natural frequencies (169). In light of this, he suggests using tools like fact and icon boxes to communicate risk accurately (192). Icon boxes use graphics to summarize the results of well-designed studies in the form of natural frequencies; fact boxes do the same thing without pictures (203). These tools can be used to compare patient outcomes more clearly than mortality or survival rates, as they put the statistics into an easy-to-read format with a drastically higher rate of success for patient and doctor understanding (206).
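The natural-frequency idea is easy to demonstrate. The screening numbers below are illustrative placeholders in the spirit of Gigerenzer’s examples, not figures from the book:

```python
from fractions import Fraction

def natural_frequencies(prevalence, sensitivity, false_pos_rate, population=1000):
    # Translate conditional probabilities into whole-number counts
    # out of a concrete reference population
    sick = round(population * prevalence)
    true_pos = round(sick * sensitivity)
    healthy = population - sick
    false_pos = round(healthy * false_pos_rate)
    return sick, true_pos, healthy, false_pos

# Illustrative numbers (not from the book): 1% prevalence,
# 90% sensitivity, 9% false-positive rate
sick, tp, healthy, fp = natural_frequencies(
    Fraction(1, 100), Fraction(9, 10), Fraction(9, 100))
print(f"Of 1000 people, {sick} are sick and {tp} of them test positive;")
print(f"of the {healthy} healthy people, {fp} also test positive.")
print(f"So only {tp} of the {tp + fp} positives are true: "
      f"P(sick | positive) = {tp / (tp + fp):.0%}")
```

Stated as counts (“9 out of 98 positives are true”), the answer that trips people up in conditional-probability form becomes almost immediate.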

If simply using a different presentation format can assist with probabilistic reasoning, then our ability to learn is probably more malleable than Kahneman implies. Because of this, Gigerenzer promotes the establishment of curriculums to teach risk literacy at a young age. He appeals to studies showing that even elementary school-age children have the capacity to reason about probability when it is presented in the form of natural frequencies (246).

He then goes on to challenge those who argue, based on our poor grasp of probability, that we should establish paternalistic policies. Instead of influencing people’s behavior in minor ways, Gigerenzer supports educating people so they can make informed decisions. He also argues in favor of creating an environment where it is safe for people to admit errors, to increase safety and awareness and reduce the motivation for problematic behaviors like the ones practiced in defensive medicine.

“We don’t need more paternalism,” he snipes, “we need to teach people tools for thinking” (169). This is targeted at Kahneman, who supports a position of libertarian paternalism in the conclusion of Thinking, Fast and Slow.

I find myself agreeing with Gigerenzer about this; in fact, I find his position compelling in general. I get the eerie suspicion that my evaluation of these books would be different if I had read them in a different order. Although I read Thinking, Fast and Slow when it first came out, that was years ago, and since then it feels like I’ve read mountains of books on human failures to be rational. After so much exposure, Gigerenzer is refreshingly optimistic.

So when I began rereading Kahneman, I suddenly was bothered by what I saw as hints of pessimism and paternalism. My comparably harsh appraisal of his position might be influenced by having been primed to associate it with those qualities; I might feel more motivated to scrutinize his points, to nit-pick about the qualities of the studies he picks and fret over seemingly small inconsistencies.

But worrying about this might just be the consequence of reading two excellent books dealing with cognitive bias in the same week. At the very least, I think I am warranted in asking for consistency and decent supporting evidence – currencies that both authors fail to give at various points in their books. In the next post, I’ll take a closer look at the arguments in Risk Savvy and Thinking, Fast and Slow. I’ll poke around to see where the authors have failed to make their points, and also make the argument that Kahneman and Gigerenzer occupy two highly compatible positions.


Gigerenzer, Gerd. Risk Savvy: How to Make Good Decisions. New York: Viking, 2014. Kindle format.

Kahneman, Daniel and Amos Tversky. “Prospect Theory: An Analysis of Decision under Risk.” Econometrica 47.2 (1979): 263-292.

Kahneman, Daniel. Thinking, Fast and Slow. New York: Farrar, Straus and Giroux, 2011. Kindle format.

Upcoming: Gigerenzer and Kahneman

In the course of the last week, I read Gigerenzer’s “Risk Savvy”, and I loved it. That’s not to say that I agree with all of his argument, but I feel like it’s a refreshingly different take on the issue. This prompted me to reread “Thinking, Fast and Slow,” and I’ve decided it really isn’t fair to review one without the other.

So it’s in the pipeline. And I intend to actually engage with people on this one, so – here we go.

Measurement issues and Likert scales: Part 2/2

(A continuation of these posts.)

In a previous post I wandered into the debate concerning whether Likert scales can be interpreted as interval scales. I was mostly inspired by Carifio and Perla’s “Ten Common Misunderstandings”, which is itself a response to Jamieson’s 2004 article “Likert Scales: How to (Ab)use Them”.

I want to give Likert scales a fighting chance; they’ve enjoyed a lot of success across multiple fields, and I could be misinterpreting the debate entirely. (Honestly, I think this topic goes way beyond my understanding of psychological measurement issues. But that makes it more fun to think about, before I dive into a totally different topic with Gigerenzer’s “Risk Savvy”.)

I’ll start by granting the assumptions that 1) the scales in the individual items reflect a genuine rank ordering of participant attitudes and 2) the overall scales can be interval-type, given the right construction (and summative scoring). In that case, I’m interested in whether subtle features of scales can have substantive effects on how data should be interpreted.

Consider the number of response items, which could have an effect on two different levels. First, in a return to last post’s topic, there are several reasons to suspect it might affect the homogeneity of intervals between categories. For example, it seems like “neutral” refers to a narrower attitude category than either “slightly agree” or “slightly disagree”; any deviation in attitude level should immediately violate the definition of neutrality, meaning it will cover an extremely small range. The comparative size difference between the “middle” category and the rest of the surrounding intervals might be a function of just how many others there are.

Equality of interval sizes is key for approaching an interval scale; and a Likert scale researcher who seeks to avoid error from inappropriate testing should want to do all she can to approximate an interval scale in her scoring (Stevens 679). She would want to determine, then, whether the psychological distances among response categories are equal; if they are, it should be easier to swallow that equal differences in one’s summative score correspond to equal differences in the intensity of one’s underlying attitude.

Wakita, Ueshima and Noguchi claim to have constructed a way to evaluate the psychological distance between response options (2012). They use a generalized partial credit model to develop a formula that lets you calculate scale values for each category. GPCM is a model developed under Item Response Theory; although Likert scales originally grew out of a different paradigm, it’s an established practice by now to design and analyze them from an IRT perspective.

IRT models rely on three statistical assumptions (DeMars 2010, 33). One is, of course, the requirement that the data you have actually follows whatever model you’re using. Another is that there is only one underlying attribute or dimension θ of interest; in the Likert scale case, we would expect that to be a general attitude towards some target. Finally, one assumes that responses to items are independent from one another after controlling for θ. They are only correlated because of that dimension of interest.

Since Likert scales have, in essence, an ordered multiple-choice format, a polytomous model (such as GPCM) is needed for evaluation (ibid. 22). These models give functions for the probability of responding with a certain response category (or within a certain category range). The parameters of interest are typically estimated with Bayesian methods based on the likelihood function (61).
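A sketch of the GPCM category-probability function, using made-up item parameters, shows the kind of output these models give:

```python
import math

def gpcm_probs(theta, a, b):
    # Generalized partial credit model: probability of each of the m+1
    # ordered categories, given latent trait theta, discrimination a,
    # and step parameters b_1..b_m
    z = [0.0]                     # empty cumulative sum for the lowest category
    for bv in b:
        z.append(z[-1] + a * (theta - bv))
    denom = sum(math.exp(v) for v in z)
    return [math.exp(v) / denom for v in z]

# A hypothetical 5-category (Likert-style) item: as theta increases,
# probability mass shifts toward the higher response categories
for theta in (-2.0, 0.0, 2.0):
    probs = gpcm_probs(theta, a=1.0, b=[-1.5, -0.5, 0.5, 1.5])
    print(theta, [round(p, 2) for p in probs])
```

The step parameters `b` here correspond to the points where adjacent categories are equally probable – the intersection points the Wakita et al. study works back from.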

In their study, Wakita et al. administer 4-, 5-, and 7-point Likert scale formats to undergraduates to see if the different sizes yield data sets with different qualities (538). They assume that the relevant attitude dimension exists on a continuum that can be carved into open intervals; each response category corresponds to a certain range of that continuum, and is divided from another by the midpoint between intervals (ibid. 536). The actual category parameters are reflected by the points of intersection between adjacent categories, where they are equally probable. From these assumptions, the researchers are able to calculate the IRT scale values (538).

They find that although the 4- and 5-point scale values are distributed as expected, the 7-point values deviate markedly from expectations; and from this they conclude that the psychological distance between response categories has become distorted (544).

Other effects from the number of response categories

Although Wakita et al. didn’t find any notable changes in mean responses attributable to the number of response categories, other studies suggest that the level of fine-grainedness can directly impact descriptive statistics. The reasoning here is that if a scale is too wide or too narrow, participants might be biased to answer in a certain way, reflecting influences like social desirability or tendencies towards extreme responding.

Preston and Colman (2000) investigated this possibility for formats ranging from 2 to 11 category options. They found not only substantial variation in the reliability of the tests, but an increase in reliability indices as the number of categories increased (11). However, this may not be an issue for the sort of strictly defined tests that researchers like Carifio and Perla are concerned with. This is because Preston and Colman’s scales actually differ from traditional Likert scales: instead of asking subjects about their attitudes towards certain services, they ask directly about services’ quality (4).

This “not quite Likert” problem occurs throughout the literature on response-number effects. For example, the oft-cited Finn (1972) found such an effect across 3-, 5-, 7-, and 9-point scales – but his items weren’t really Likert items at all: the scale was linear in nature, and participants were assigning difficulty levels to jobs rather than responding to ‘attitude’ questions (257).

Marketing researchers like Dawes (2007) have also pursued the question. In a recent study he finds no significant difference between 5-point and 7-point response formats, but significantly lower means when 10 categories are used (75). However, the “scale” was administered over the telephone, which introduces a massive amount of possible complications.

(If I seem picky, it’s because psychological scales are designed such that they only give meaningful scores if certain assumptions are met; similarly, we can expect one’s data model to countenance errors whenever it is improperly specified. Carifio and Perla, for all their condescension, are right that proper operation will depend on a scale that is well-designed across levels (2007, 109). Frustratingly, the “Likert scale” label is often fixed to only vaguely similar kinds of scales. It is often hard to tell what sorts of differences would preclude inquiries about them from being relevant to number-of-categories issue.)


Even vs. odd

A closely related issue is whether the data can be affected by the number of response categories being even or odd – in other words, by whether a neutral option is present. In support of a neutral option, having such a category could help create even intervals between response choices, which in turn might reduce the riskiness of running parametric tests (Stevens 679). On the other hand, some researchers suspect the presence of a neutral option encourages subjects to cluster their responses around a central option, due to factors like a desire to avoid expressing extreme opinions, or misinterpreting the neutral category as an opt-out or catch-all option. For example, marketing researcher Ron Garland found significant differences among responses to 4- vs 5-point scales, and interpreted them to show that neutral options cause bias from socially desirable responding (1991).

Guy and Norvell (1977), observing that Likert scales often produce results skewed towards the extremes, also found a difference in responses depending on the presence or absence of a neutral category; Wong et al. (2011) did not. Neither did Kulas, Stachowski and Haynes (2008), though they did find evidence that responders used neutral response options to express ‘not applicable’ or ‘I don’t know’. I find this possibility quite interesting – if the neutral response is taken to include more than just an in-between attitude, then it’s hard to see how individual Likert items would be on an ordinal scale.

But would removing the option be worse than keeping it in? My uninformed suspicion is that forcing participants to choose a side when they have neutral feelings on the subject will yield made-up answers or might even generate a new opinion; this will either give a misleading picture of respondent attitudes, or actively influence them in one direction or another.

Some last thoughts

I might return to this subject in the future. I’d like to talk more about the ways summative Likert scales can be interpreted; about unidimensionality; about differences between Likert and Thurstone, and whether it matters. But for now, I’ll end by emphasizing that the Likert scale issues seem to occur on several different “levels”.

Even if attitudes have the sort of underlying quantitative structure needed for interval scale measurements, and even if a given Likert scale has excellent reliability across the board, there are still many factors a researcher will have to take into consideration. They’ll want to check that respondents interpret response categories as intended, and ask whether the responses that they do give have been distorted by social desirability. And these are just a few examples of the unique considerations social scientists have to deal with over the course of an inquiry.



Carifio, James and Rocco J. Perla. “Ten Common Misunderstandings, Misconceptions, Persistent Myths and Urban Legends about Likert Scales and Likert Response Formats and their Antidotes.” Journal of Social Sciences 3.3 (2007): 106-116.

DeMars, Christine. Item Response Theory. New York: Oxford University Press, 2010.

Edwards, Allen L. Techniques of Attitude Scale Construction. 2nd. New York: Irvington Publishers Inc., 1983.

Finn, R.H. “Effects of Some Variations in Rating Scale Characteristics on the Means and Reliabilities of Ratings.” Educational and Psychological Measurement 32 (1972): 255-265. Document.

Garland, Ron. “The Mid-Point on a Rating Scale: Is it Desirable?” Marketing Bulletin 2 (1991): 66-70.

Guy, Rebecca F and Melissa Norvell. “The Neutral Point on a Likert Scale.” The Journal of Psychology: Interdisciplinary and Applied 95.2 (1977): 199-204. Web.

Kulas, John T, Alicia A Stachowski and Brad A Haynes. “Middle Response Functioning in Likert-Responses to Personality Items.” Journal of Business Psychology 22 (2008): 252-259. Web.

Likert, Rensis. “A Technique for the Measurement of Attitudes.” Archives of Psychology 22 (1932): 5-55. Document.

Preston, Carolyn C and Andrew M. Colman. “Optimal number of response categories in rating scales: reliability, validity, discriminating power, and respondent preferences.” Acta Psychologica 104 (2000): 1-15. Document.

Stevens, S S. “On the Theory of Scales of Measurement.” Science (New Series) 103.2684 (1946): 677-680. Document.

Wakita, Takafumi, Natsumi Ueshima and Hiroyuki Noguchi. “Psychological Distance Between Categories in the Likert Scale: Comparing Different Numbers of Opinions.” Educational and Psychological Measurement 72.4 (2012): 533-546.

Wong, Chi-Sum, et al. “Differences Between Odd Number and Even Number Response Formats: Evidence From Mainland Chinese Respondents.” Asia Pacific Journal of Management 28 (2011): 379–399. Web.



ASIDE: some musings on ordinal rankings and individual Likert items

[In reference to this post.]

In “Measurement Issues and Likert Scales: Part 1” I suggested that the attitude scales given for individual Likert items may not actually meet the criteria for ordinal scales. Specifically, the ordering may not represent any natural rank; there are reasons to suspect that agreement and disagreement represent separate conceptual axes, or that there are actually two underlying constructs determining subject responses.

To clarify what I mean, imagine that instead of asking a subject to rate statements on a scale from strongly disagree to strongly agree, I was asking her to rate glamour shots on a scale from very ugly to very sexy. There seems to be a natural ranking there, or at the least it may end up being of practical use to someone in marketing research to force subjects to pick between the two; but it’s entirely possible that participants will encounter photos of individuals that they find both sexy and ugly. That won’t be captured in the data, and I’d suggest it would make it inappropriate to run certain statistical transformations on whatever information you get.

We do, colloquially, often state that we both agree and disagree with a statement, for various reasons. In a Likert test, one might be instructed to choose the neutral option in such a situation, or go with whichever feeling seems stronger in a forced-choice situation. And, once again, that might have practical uses. But if that is the case, I also think it would mean the response format is not truly ordinal.

In the test’s defense, a researcher might argue that the “unnaturalness” of such a scale should not matter to its use, if the overall scale is mathematically unidimensional. As DeMars (2010) observes, “for test responses to be multidimensional, different items have to tap into different combinations of the constructs, and examinees have to vary on both constructs” (39). It’s hard to see how the actual multidimensionality would matter otherwise; it would just be a sort of compound construct.

However, the methods used to test for this property vary in effectiveness, and the information they yield may be of the entirely wrong variety for our interests. We might find that a Likert scale seems to measure some practically unidimensional “attitude” without the item response categories representing an actual ranking, since the notion that disagreement and agreement are on the same scale has already been built into the possible item responses, and these techniques appear to be unable to check the validity of such assumptions.



DeMars, Christine. Item Response Theory. New York: Oxford University Press, 2010.

Measurement Issues and Likert Scales: Part 1 of 2

One of my major interests in philosophy of science has to do with how understanding the separation between various levels of inquiry can help pinpoint sources of error and other stumbling blocks in scientific research. I’m particularly fond of Mayo’s modified version of the Suppean hierarchy, which divides an inquiry into several levels of models, including separate levels for data analysis versus data generation and experimental design. Each level deals with different kinds of questions and different sources of potential error.

I’m curious about how these errors might affect each other, and whether they can have a compounding effect across different levels of inquiry. With this in mind, I’ve been looking at measurement issues in psychology, and the long-standing debate over the proper use of Likert scales.

Likert scales have been used as a gauge of subject attitudes since their introduction in the 1930s, beginning with attitudes towards African-Americans. The individual items each present a continuum of possible attitudes towards a target statement, which must involve a value judgement, not a matter of fact (Likert 12). The statement is accompanied by a horizontal line or box, with extreme attitudes (e.g., strongly disagree, strongly agree) placed at either end and intermediate attitudes placed at regular intervals in-between (17). Each possible response has a numerical value, and items typically come in a 5- or 7-point format. They are traditionally odd-numbered so that a ‘neutral’ option is available. However, even-numbered ones do exist.

Out of these properties, two major issues have developed for researchers. The first is whether different versions of the response format affect participant responses. Are the results affected by how fine-grained the scale is? Is there an appreciable difference in responses for even vs. odd-numbered formats?

The second issue concerns what kinds of analyses we can run, given the “scale level” of the data, while still being confident that the results are meaningful. This is the focus of the current post.

Ordinal, interval, or neither?

Individual Likert items seem to conform to an ordinal measurement scale. They model an ordering relation (in this case among levels of agreement), but the degree to which the level increases or decreases doesn’t seem to occur in equal units. Because of this quality, some researchers have argued that using parametric statistical analyses is inappropriate with Likert scale data. Parametric tests require interval or ratio scale data to yield accurate conclusions, and using them to describe and probe ordinal data will give useless results (Kuzon Jr, Urbanchek and McCabe 1996). If it is truly the case that these scales are ordinal, it marks a problem, because Likert scale data is overwhelmingly analyzed as if it is interval in nature (Jamieson 2004).
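A toy example (my own illustration, not from the cited papers) makes the worry concrete: two order-preserving codings of the same ordinal responses can flip which group has the higher mean, so any conclusion drawn from means leans on an interval assumption the data may not license:

```python
# Two order-preserving codings of the same 5-point ordinal responses
coding_a = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5}    # evenly spaced
coding_b = {1: 1, 2: 2, 3: 3, 4: 4, 5: 50}   # stretches the top category

group1 = [5, 4, 1, 1, 1]    # polarized responders
group2 = [3, 3, 3, 3, 3]    # uniformly neutral responders

def mean(responses, coding):
    return sum(coding[r] for r in responses) / len(responses)

# The rank ordering of responses is identical under both codings,
# yet which group has the higher mean flips
print(mean(group1, coding_a), mean(group2, coding_a))   # group2 higher
print(mean(group1, coding_b), mean(group2, coding_b))   # group1 higher
```

A rank-based (nonparametric) comparison would give the same answer under both codings; the mean-based one does not, which is exactly the critics’ point.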

But some researchers claim the tests truly are appropriate, because the scales really do qualify as interval measures. They claim that critics ignore that Likert scales are not interested in attitudes towards a single statement, but instead generate a value based on the sum of item responses. Uebersax refers to this kind of measurement scale as a summated rating scale (Likert Scales: Dispelling the Confusion). The recommendation to treat Likert scales as ordinal, proponents contend, indicates a conflation of scales with individual items (Carifio and Perla 110). The argument can be reconstructed roughly as follows:

  1. Likert scale items use an ordinal scale.
  2. It is not appropriate to use parametric tests to evaluate ordinal data.
  3. However, Likert scales sum the item scores to get an overall score.
  4. The overall score has the emergent property of being on an interval scale.
  5. Empirical testing confirms that Likert scales are interval scales.
  6. It is appropriate to use parametric tests with interval scale data.
  7. [conclusion] Therefore, it is appropriate to use parametric tests on Likert scale data.

To explain: let’s say you have a five-item Likert scale using a 7-point response format. If you look at any single item alone, you’d be working with ordinal data. But if you sum the values given for each item, you generate a score that falls in a range between 5 and 35.
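A quick sketch of the summation, using hypothetical responses:

```python
# Hypothetical responses on a five-item Likert scale with a 7-point format
# (1 = strongly disagree ... 7 = strongly agree).
responses = [6, 5, 7, 6, 4]

# The summated rating: one overall score per participant.
score = sum(responses)
print(score)  # 28

# The possible range of summated scores for this scale:
n_items, n_points = 5, 7
print(n_items * 1, n_items * n_points)  # 5 35
```

The question below is whether this summed score, unlike the individual items, earns interval-scale status.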

So the question arises: when such a scale is well-constructed, is it successfully producing interval data? And even if it’s not strictly interval data, can we analyze it as such for the sake of empirical inquiry? To answer these questions, we need to look at what the different measurement scale types are, and why they are important.

Definitions and motivations

Stevens’ original specification of different scales of measurement (categorical/nominal, ordinal, interval, and ratio) was motivated by the desire to model different kinds of relationships among data as well as the different sorts of transformations that could be applied while maintaining the fidelity of those relationships (1946, 677).

Interval scales, for example, are marked by both ordering and equal intervals/differences between different levels. The example Stevens offers is that of the Centigrade and Fahrenheit scales; essentially, these are scales with arbitrary zero points whose shape is not affected by adding constants (679). This is to be contrasted with ordinal scales, which only measure rank order, as in greater- or less-than. “In the strictest propriety,” he writes, “the ordinary statistics involving means and standard deviations ought not to be used with these scales, for these statistics imply a knowledge of something other than the relative rank-order of data” (ibid.). However, he notes that doing so can still generate useful results; it just happens that any unequal intervals will be conducive to error.
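Stevens’ Centigrade/Fahrenheit example can be worked out numerically: an interval scale is preserved under any linear transformation, so ratios of *differences* are invariant, while ratios of raw values are not (the zero point is arbitrary):

```python
def c_to_f(c):
    # Fahrenheit is a linear transformation of Celsius: F = (9/5)C + 32.
    return 9 * c / 5 + 32

a, b, c = 10.0, 20.0, 40.0  # three temperatures in Celsius

# Ratios of differences survive the transformation...
ratio_c = (c - b) / (b - a)                                  # (40-20)/(20-10)
ratio_f = (c_to_f(c) - c_to_f(b)) / (c_to_f(b) - c_to_f(a))  # (104-68)/(68-50)
print(ratio_c, ratio_f)  # 2.0 2.0

# ...but ratios of the raw values do not, because zero is arbitrary.
print(b / a)                  # 2.0
print(c_to_f(b) / c_to_f(a))  # 1.36
```

This is why it makes sense to say “today is 10 degrees warmer than yesterday” but not “today is twice as hot as yesterday” on either scale.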

Stevens didn’t rely on the properties of any underlying structures for sustaining the use of interval or ratio scales. This led to some conceptual problems; as Joel Michell summarizes regarding one example, “the fact that experimental participants respond to instructions to judge equal ratios does not mean that there are ratios there to be judged” (103). The levels-of-measurement theory was further developed by Suppes and Zinnes, who fleshed out an account in which interval and ratio scale data imply an underlying quantitative structure with specific properties (ibid. 100). This helps better explain why only certain transformations accurately preserve data relationships, and has the practical advantage of letting us know when assumptions of interval and ratio data will get us in trouble. I’ll call their version the Stevens-Suppes-Zinnes theory, or SSZ.

From SSZ, it follows that the capacity to generate a quantitative score on some linear scale is not sufficient to qualify Likert scales as being on the interval level. Scores are only “measurements” in their abstract relations to hypothetical constructs (here, attitudes). For the measurements to be interval-scale, then, the attitudes themselves would need to be structured such that differences in their levels are consistent and quantitative (ibid. 101).

It doesn’t seem as if we have been given justification to claim attitudes have this structure. In fact, there is reason to suspect attitudes like agreement and disagreement (or approval and disapproval) lack intuitive ordering. The “higher” end of the range might signify general happiness or complacency with a concept, while the other end might represent hatred or rejection of it; disagreement might not be “less” or “anti” agreement, but instead mark its own distinct attitude. From an SSZ perspective, there would then not even be reason to call the data ordinal; there is certainly not yet justification for calling it interval-scale. To quote Michell, “the kind of scale obtained depends entirely upon the kind of empirical structure modeled” (ibid. 102).

Does a Likert scale qualify?

Carifio and Perla insist that it is appropriate to use interval-data type analyses, and that “armchair” claims to the contrary misunderstand the logic of the scales and ignore empirical findings showing that the data is interval-type (109). They appeal to studies showing that data collected using a Likert response format correlates very strongly with data collected using a linear response format. Their discussion suggests they are working with a fundamentally different conception of what interval data is: they describe the linear format in a Carifio study as having generated data that is “empirically linear and interval in character (as both properties may be empirically tested for any scale or data set) at the subscale and full scale level” (ibid.). They seem to mean that they were able to show that the Likert scale data followed a normal distribution, and so was therefore interval-type data.

This is apparently a common belief in psychology: that if you have a normal distribution, you are using an interval scale (Thomas 198). However, this does not actually follow. Hoben Thomas, who uses the traditional criteria for the different scale types, demonstrates this point in “IQ, Interval Scales, and Normal Distributions.” He provides counterexamples to the conditional and its contrapositive, and does so, quite relevantly to our discussion, by showing how ordinal data can produce a normal distribution (201).
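The general point can be illustrated with a toy simulation (my own sketch, not Thomas’s actual demonstration): responses that we treat as merely ordinal, once summed over many items, will produce a roughly bell-shaped score distribution simply because of the central limit theorem. Normality of the scale scores therefore cannot certify that the underlying intervals are equal.

```python
import random

random.seed(0)  # reproducible toy example

# One hypothetical 7-point item; responses are just labels 1..7, and we make
# no assumption that adjacent labels are equally spaced in "attitude space".
def random_item():
    return random.randint(1, 7)

# Sum ten items per simulated participant, for 10,000 participants.
scores = [sum(random_item() for _ in range(10)) for _ in range(10_000)]

# The summed scores cluster symmetrically around the theoretical mean of 40
# (ten items averaging 4 each), looking approximately normal regardless of
# whether the item labels measure anything at the interval level.
mean = sum(scores) / len(scores)
print(round(mean, 1))
```

The bell shape here is an artifact of summation, not evidence about the measurement scale of the items.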

In that case, Carifio and Perla do not have a sound argument for classifying Likert scales as interval – because by “empirically interval”, they simply mean that the scores can be standardized and generate a normal distribution. They fret that critics do not understand important properties of these scales, “otherwise they might clearly understand how ordinal item response formats can and usually do produce scales that are empirically interval level scales” (110). But from my limited knowledge, it seems that the misunderstanding is on their end – they don’t understand the critics’ complaint. They assume that it comes from looking at the continuum of individual scale items instead of the “underlying continuum of the collection of items that is the ‘scale’ of the variable being measured” (ibid.). Instead, the concern is that even looking at the scale overall, the criteria for an interval scale have not been met.

It seems we have no guarantee that the intervals are evenly sized. It also may be relevant that individuals probably vary in their potential for extreme attitudes. Imagine you have ten 7-point items generating a scale with a possible range of 10-70. Two participants both scoring the maximum of 70 might hold very different levels of agreement, if they have different thresholds for what qualifies as ‘strong agreement’. And if you decide participants’ subjective scales should all be judged the same way for practical reasons, it would still be a stretch to say that the points of the scale represent equidistant gradations of one’s general attitude level.

Once again, we don’t know if attitudes have that sort of underlying structure in the first place. The need to adequately model, and preserve, certain kinds of data relations motivates differentiating between scale types; improperly labeling Likert scales as interval scales means endorsing inappropriate tests. Such practices may enjoy a fair amount of success, but will ultimately yield erroneous results “to the extent that the successive intervals on the scale are unequal in size” (Stevens 679).

In the next post, I’ll discuss how variations in the item response format can affect Likert (and Likert-type) measures, and whether the presence of a neutral option causes significant changes to the data.



Carifio, James and Rocco J. Perla. “Ten Common Misunderstandings, Misconceptions, Persistent Myths and Urban Legends about Likert Scales and Likert Response Formats and their Antidotes.” Journal of Social Sciences 3.3 (2007): 106-116.

Jamieson, S. “Likert Scales: How to (Ab)use Them.” Medical Education 38 (2004): 1212-1218.

Kuzon Jr, William M, Melanie G Urbanchek and Steven McCabe. “The Seven Deadly Sins of Statistical Analysis.” Annals of Plastic Surgery 37 (1996): 265-272.

Likert, Rensis. “A Technique for the Measurement of Attitudes.” Archives of Psychology 22 (1932): 5-55.

Michell, Joel. “Stevens’s Theory of Scales of Measurement and Its Place in Modern Psychology.” Australian Journal of Psychology 54.2 (2002): 99-104.

Stevens, S S. “On the Theory of Scales of Measurement.” Science (New Series) 103.2684 (1946): 677-680.

Thomas, Hoben. “IQ, Interval Scales, and Normal Distributions.” Psychological Bulletin 91.1 (1982): 198-202.

Uebersax, John S. “Likert Scales: Dispelling the Confusion.” 31 August 2006. Web. 20 July 2014.


Very brief thoughts about the DSM-5 and the “over-medicalization” issue in psychology

A friend linked me to an interview on Lateline this morning with Dr. Allen Frances. Frances criticizes the changes that were made in the development of the DSM-5. In particular, he thinks it will be erroneously used to classify normal behaviors as mental disorders.

Some thoughts on this (warning, I am way out of my league here):

I’m not bothered by medicalization. I don’t see anything intrinsically wrong with it: in other parts of medicine, it seems perfectly possible to have meaningful discussions about physical states that are common, or even inevitable parts of development, but that also happen to legitimately be medical conditions. The fact that something is par for the course doesn’t mean that it isn’t unpleasant, or that we don’t want to change the course of it or prevent it entirely.

What is it about psychology that makes it differ in this respect? For example, I know that grief is a near-universal experience, a natural response to a loss, but isn’t it still a deviation from usual functioning? Similarly, I don’t understand what is wrong with researching and documenting mild or transient disorders. Any complex system will have bugs from time to time, and learning about them can be really useful.

Sometimes those bugs can be really common. For an example from dermatology: we talk about keratosis pilaris as a medical condition, even though it occurs in 40% of the population. But we don’t think of it as representing an ideal condition, and we certainly have looked into ways to control it.

I think the issue is really the attitude towards psychological conditions; we’re worried about them getting inappropriate, drastic treatments. I don’t think over-medicalization is to blame for that, though.

Bem’s infamous psi study: some introductory notes (FTF Part 1/?)

Okay, so there have already been a billion analyses of “Feeling the Future.” And okay, yes, it came out three years ago. I’m a little late, and I promise I won’t retread too much – I just want to cover some basic points about the study. I’m actually most interested in Bem’s critics, some of whom are very naughty.  Although a good deal of the responses seem spot-on, others commit more grievous sins than the original research ever did. I think this is indicative of some pervasive, conceptual-level misunderstandings about statistics in psychology.

Bem (2011) proposes a new research program centered on well-controlled, easy-to-replicate experimental investigations of psi. He does acknowledge the skepticism surrounding psychic phenomena: the rates of paranormal belief among psychological researchers (and other scientists) are quite low, presumably because there is no known mechanism by which psi could plausibly occur. However, he thinks the mixed results of past research give us reason to keep investigating, seeing how “the discovery and scientific exploration of most phenomena have preceded explanatory theories, often by decades or even centuries” (408).

The specific variety of psi Bem wants to investigate is a form of precognition called retroactive influence. His studies are largely inspired by research on ‘presentiment’, in which researchers record subjects’ physiological responses to stimuli before those stimuli are actually presented. Researchers then evaluate whether the subjects show activity indicating pre-emptive emotional responses to the stimuli (408).

The idea of reversing standard procedure like this is used as a launching point for the design of the nine experiments he describes. The flip side of this choice is important: Bem specifically chooses experimental models familiar to academic psychologists, for ease of demonstration and replication. He selects phenomena like priming and recall because there is a wealth of experimental knowledge about them.

I’d like to note that Bem seems to try to be transparent in his work. It is hard to accuse him of any sort of intentional misrepresentation; he seems to go out of his way to discuss additional variables and research questions not mentioned in the main descriptions, and ways in which the results could have been affected. He understands that his work will be harshly scrutinized, and that his methods must be readily accessible and easily understood. In explaining his choices, he specifically appeals to ‘skeptical’ reasoning:

“If one holds low Bayesian a priori probabilities about the existence of psi – as most academic psychologists do – it might actually be more logical from a Bayesian perspective to believe that some unknown flaw or artifact is hiding in the weeds of a complex experimental procedure or an unfamiliar statistical analysis than to believe that genuine psi has been demonstrated. As a consequence, simplicity and familiarity become essential tools of persuasion.” (Bem 420)

The degree to which his choices actually assist him on that mission is up for debate. But before presenting the standard criticisms of Bem’s research, I have some miscellaneous but important notes about Bem’s methodology.

Bem describes nine experiments, which address four core topics. These are: approach and avoidance of positive and negative stimuli; priming; habituation and boredom; and facilitated word recall. I list the experiments in order of their appearance in the original paper, and would like to note a possible source of confusion for critics. As explained in a footnote, Experiment 5, concerning habituation and boredom, was actually a pilot study. It was the first of the nine experiments to be designed, executed, and analyzed, and it helped shape choices regarding many factors – including what variables to investigate – for all of the following experiments (415).

The next note regards data analysis. For the sake of simplicity, Bem primarily uses a one-sample t-test across sessions; to account for the possibility that the data isn’t normally distributed, he supplements this with nonparametric binomial tests (420). A number of transformations are imposed on the data. He makes an effort to justify them and to show how his results would be affected by different transformations, but it is still not always clear why he made certain choices that are likely to have increased the strength of the found effects, along with the likelihood of false positives.
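As a rough sketch of the supplementary binomial check (with hypothetical session counts, not Bem’s actual data): the exact one-sided probability of observing at least the obtained number of above-chance sessions, under a 50% chance hypothesis, can be computed directly:

```python
import math

# Hypothetical: 100 sessions, each classified as a "hit" (above-chance
# performance) or "miss", with chance probability 0.5 per session.
n, hits, p = 100, 60, 0.5

# One-sided exact binomial probability of observing >= 60 hits under chance.
p_value = sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
              for k in range(hits, n + 1))
print(round(p_value, 4))
```

Unlike the t-test, this makes no normality assumption about session-level scores; it only asks how surprising the overall hit count would be by chance.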

I’ll get to the criticisms in FTF Part 2.



Bem, Daryl J. “Feeling the Future: Experimental Evidence for Anomalous Retroactive Influences on Cognition and Affect.” Journal of Personality and Social Psychology 100.3 (2011): 407-425.