(A continuation of these posts.)
In a previous post I wandered into the debate over whether Likert scales can be interpreted as interval scales. I was mostly inspired by Carifio and Perla’s “Ten Common Misunderstandings”, which is itself a response to Jamieson’s 2004 article “Likert Scales: How to (Ab)use Them”.
I want to give Likert scales a fighting chance; they’ve enjoyed a lot of success across multiple fields, and I could be misinterpreting the debate entirely. (Honestly, I think this topic goes way beyond my understanding of psychological measurement issues. But that makes it more fun to think about, before I dive into a totally different topic with Gigerenzer’s “Risk Savvy”.)
I’ll start by granting the assumptions that 1) the scales in the individual items reflect a genuine rank ordering of participant attitudes and 2) the overall scales can be interval-type, given the right construction (and summative scoring). In that case, I’m interested in whether subtle features of scales can have substantive effects on how data should be interpreted.
Consider the number of response categories, which could have an effect on two different levels. First, in a return to last post’s topic, there are several reasons to suspect it might affect the homogeneity of intervals between categories. For example, it seems like “neutral” refers to a narrower attitude category than either “slightly agree” or “slightly disagree”; any deviation in attitude level should immediately violate the definition of neutrality, meaning it will cover an extremely small range. The comparative size difference between the “middle” category and the surrounding intervals might be a function of just how many other categories there are.
Equality of interval sizes is key to approximating an interval scale, and a Likert scale researcher who wants to avoid the errors introduced by inappropriate testing should do all she can to approximate one in her scoring (Stevens 679). She would want to determine, then, whether the psychological distances among response categories are equal; if they are, it should be easier to swallow that equal differences in one’s summative score correspond to equal differences in the intensity of one’s underlying attitude.
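To make the stakes concrete, here is a toy illustration (all numbers invented, not drawn from any of the studies discussed) of why interval homogeneity matters for summative scoring: if the response categories sit at unequal latent midpoints, a one-point score difference corresponds to different attitude differences at different points on the scale.

```python
# Hypothetical latent midpoints for a 5-point item (invented numbers).
equal = [1.0, 2.0, 3.0, 4.0, 5.0]      # equal psychological spacing
unequal = [1.0, 2.4, 3.0, 3.6, 5.0]    # narrow "neutral" in the middle

for label, mids in (("equal", equal), ("unequal", unequal)):
    # gap between adjacent categories in latent-attitude units
    gaps = [round(b - a, 2) for a, b in zip(mids, mids[1:])]
    print(label, gaps)

# With unequal spacing, moving from category 2 to 3 reflects a smaller
# attitude shift (0.6) than moving from 1 to 2 (1.4), even though both
# count as "one point" in the summative score.
```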
Wakita, Ueshima and Noguchi claim to have constructed a way to evaluate the psychological distance between response options (2012). They use a generalized partial credit model to develop a formula that lets you calculate scale values for each category. GPCM is a model developed under Item Response Theory; although Likert scales originally grew out of a different paradigm, it’s an established practice by now to design and analyze them from an IRT perspective.
IRT models rely on three statistical assumptions (DeMars 2010, 33). One is, of course, the requirement that the data you have actually follows whatever model you’re using. Another is that there is only one underlying attribute or dimension θ of interest; in the Likert scale case, we would expect that to be a general attitude towards some target. Finally, one assumes that responses to items are independent from one another after controlling for θ. They are only correlated because of that dimension of interest.
Since Likert scales have, in essence, an ordered multiple-choice format, a polytomous model (such as the GPCM) is needed for evaluation (ibid. 22). These models give functions for the probability of responding with a certain response category (or within a certain category range). The item parameters are typically estimated via a likelihood function, often with Bayesian methods (61).
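For concreteness, here is a minimal sketch of the GPCM’s category probabilities in Python. The parameter values are invented for illustration, not taken from any study mentioned here:

```python
import numpy as np

def gpcm_probs(theta, a, b):
    """Category response probabilities under the generalized partial
    credit model (GPCM).

    theta : latent trait level
    a     : item discrimination
    b     : step parameters b_1..b_K; category 0 contributes no step
    """
    # cumulative sums of a*(theta - b_j), with 0 for the lowest category
    steps = np.concatenate(([0.0], a * (theta - np.asarray(b))))
    z = np.cumsum(steps)
    expz = np.exp(z - z.max())  # subtract max for numerical stability
    return expz / expz.sum()

# A 5-point item with hypothetical parameters
probs = gpcm_probs(theta=0.5, a=1.2, b=[-1.5, -0.5, 0.5, 1.5])
print(probs)  # five probabilities summing to 1
```

The step parameters `b` here play the role of the category boundaries discussed below: each one marks where the model hands dominance from one response category to the next.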
In their study, Wakita et al. administer 4-, 5-, and 7-point Likert scale formats to undergraduates to see if the different sizes yield data sets with different qualities (538). They assume that the relevant attitude dimension exists on a continuum that can be carved into open intervals; each response category corresponds to a certain range of that continuum, and is divided from its neighbors by the midpoints between intervals (ibid. 536). The actual category parameters are reflected by the points of intersection between adjacent category response curves, where the two responses are equally probable. From these assumptions, the researchers are able to calculate the IRT scale values (538).
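Under a GPCM-style model, the intersection point of two adjacent category curves can be found numerically; it coincides with the corresponding step parameter. A sketch of that calculation, again with invented parameters (this is my own illustration, not Wakita et al.’s procedure):

```python
import numpy as np

def crossing_point(a, b, k, lo=-6.0, hi=6.0, tol=1e-8):
    """Find theta where categories k-1 and k are equally probable,
    by bisection on the log-probability difference. Under the GPCM
    this crossing equals the step parameter b_k."""
    def diff(theta):
        # cumulative GPCM exponents; normalization cancels in the
        # difference, so z[k] - z[k-1] = log P(k) - log P(k-1)
        steps = np.concatenate(([0.0], a * (theta - np.asarray(b))))
        z = np.cumsum(steps)
        return z[k] - z[k - 1]
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if diff(mid) > 0:
            hi = mid  # P(k) already exceeds P(k-1); crossing is lower
        else:
            lo = mid
    return 0.5 * (lo + hi)

theta_star = crossing_point(a=1.2, b=[-1.5, -0.5, 0.5, 1.5], k=2)
print(round(theta_star, 4))  # recovers the second step parameter, -0.5
```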
They find that although the 4- and 5-point scale values are distributed as expected, the 7-point values deviate markedly from expectations; from this they conclude that the psychological distance between response categories has become distorted (544).
Other effects from the number of response categories
Although Wakita et al. didn’t find any notable changes in mean responses attributable to the number of response categories, other studies suggest that the level of fine-grainedness can directly impact descriptive statistics. The reasoning here is that if a scale is too wide or too narrow, participants might be biased to answer in a certain way, reflecting influences like social desirability or tendencies towards extreme responding.
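There is also a purely mechanical route to such differences, separate from response bias: discretizing the same latent attitude distribution into different numbers of equal-width categories can shift the rescaled mean on its own. A toy simulation (all distributional choices invented):

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical latent attitudes on a continuum, slightly off-center
latent = rng.normal(0.3, 1.0, size=10_000)

def discretize(x, k, lo=-3.0, hi=3.0):
    """Map latent values to a k-point response via equal-width bins."""
    edges = np.linspace(lo, hi, k + 1)[1:-1]  # k-1 interior cut points
    return np.digitize(x, edges) + 1          # responses 1..k

for k in (4, 5, 7, 10):
    r = discretize(latent, k)
    # rescale scores to [0, 1] so means are comparable across formats
    print(k, round(((r - 1) / (k - 1)).mean(), 3))
```

Coarser formats quantize the latent distribution more aggressively, so the rescaled means drift apart across formats even with perfectly honest responders.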
Preston and Colman (2000) investigated this possibility for formats ranging from 2 to 11 category options. They found not only substantial variation in the reliability of the tests, but an increase in reliability indices as the number of categories increased (11). However, this may not be an issue for the sort of strictly defined tests that researchers like Carifio and Perla are concerned with, because Preston and Colman’s scales actually differ from traditional Likert scales: instead of asking subjects about their attitudes towards certain services, they ask directly about the services’ quality (4).
This “not quite Likert” problem occurs throughout the literature on response-number effects. For example, the oft-cited Finn (1972) found such an effect across 3-, 5-, 7-, and 9-point scales – but his items weren’t really Likert items at all, considering the linear nature of the scale and the fact that participants were assigning difficulty levels to jobs, not responding to ‘attitude’ questions (257).
Marketing researchers like Dawes (2007) have also pursued the question. In a recent study he finds no significant difference between 5-point and 7-point response formats, but significantly lower means when 10 categories are used (75). However, the “scale” was administered over the telephone, which introduces a host of possible complications.
(If I seem picky, it’s because psychological scales are designed such that they only give meaningful scores if certain assumptions are met; similarly, we can expect one’s data model to countenance errors whenever it is improperly specified. Carifio and Perla, for all their condescension, are right that proper operation will depend on a scale that is well-designed across levels (2007, 109). Frustratingly, the “Likert scale” label is often affixed to only vaguely similar kinds of scales, and it is often hard to tell which sorts of differences would make an inquiry irrelevant to the number-of-categories issue.)
Even vs. odd
A closely related issue is whether the data can be affected by the number of response categories being even or odd – in other words, by whether a neutral option is present. In favor of a neutral option, having such a category could help create even intervals between response choices, which in turn might reduce the riskiness of running parametric tests (Stevens 679). On the other hand, some researchers suspect the presence of a neutral option encourages subjects to cluster their responses around a central option, due to factors like a desire to avoid expressing extreme opinions, or misinterpreting the neutral category as an opt-out or catch-all option. For example, marketing researcher Ron Garland found significant differences among responses to 4- vs. 5-point scales, and interpreted them to show that neutral options invite bias from socially desirable responding (1991).
Guy and Norvell (1977), observing that Likert scales often produce results skewed towards the extremes, also found a difference in responses depending on the presence or absence of a neutral category; Wong et al. (2011) did not. Neither did Kulas, Stachowski and Haynes (2008), though they did find evidence that responders used neutral response options to express ‘not applicable’ or ‘I don’t know’. I find this possibility quite interesting – if the neutral response is taken to include more than just an in-between attitude, then it’s hard to see how individual Likert items would be on an ordinal scale.
But would removing the option be worse than keeping it in? My uninformed suspicion is that forcing participants to choose a side when they have neutral feelings on the subject will yield made-up answers, or might even generate a new opinion; this will either give a misleading picture of respondent attitudes, or actively influence them in one direction or another.
Some last thoughts
I might return to this subject in the future. I’d like to talk more about the ways summative Likert scores can be interpreted; about unidimensionality; about differences between Likert and Thurstone scaling, and whether they matter. But for now, I’ll end by emphasizing that the Likert scale issues seem to occur on several different “levels”.
Even if attitudes have the sort of underlying quantitative structure needed for interval scale measurements, and even if a given Likert scale has excellent reliability across the board, there are still many factors a researcher will have to take into consideration. They’ll want to check that respondents interpret response categories as intended, and ask whether the responses that they do give have been distorted by social desirability. And these are just a few examples of the unique considerations social scientists have to deal with over the course of an inquiry.
Carifio, James and Rocco J. Perla. “Ten Common Misunderstandings, Misconceptions, Persistent Myths and Urban Legends about Likert Scales and Likert Response Formats and their Antidotes.” Journal of Social Sciences 3.3 (2007): 106-116.
DeMars, Christine. Item Response Theory. New York: Oxford University Press, 2010.
Edwards, Allen L. Techniques of Attitude Scale Construction. 2nd ed. New York: Irvington Publishers Inc., 1983.
Finn, R.H. “Effects of Some Variations in Rating Scale Characteristics on the Means and Reliabilities of Ratings.” Educational and Psychological Measurement 32 (1972): 255-265.
Garland, Ron. “The Mid-Point on a Rating Scale: Is it Desirable?” Marketing Bulletin 2 (1991): 66-70.
Guy, Rebecca F and Melissa Norvell. “The Neutral Point on a Likert Scale.” The Journal of Psychology: Interdisciplinary and Applied 95.2 (1977): 199-204.
Kulas, John T, Alicia A Stachowski and Brad A Haynes. “Middle Response Functioning in Likert-Responses to Personality Items.” Journal of Business Psychology 22 (2008): 252-259.
Likert, Rensis. “A Technique for the Measurement of Attitudes.” Archives of Psychology 22 (1932): 5-55.
Preston, Carolyn C and Andrew M. Colman. “Optimal number of response categories in rating scales: reliability, validity, discriminating power, and respondent preferences.” Acta Psychologica 104 (2000): 1-15.
Stevens, S S. “On the Theory of Scales of Measurement.” Science (New Series) 103.2684 (1946): 677-680.
Wakita, Takafumi, Natsumi Ueshima and Hiroyuki Noguchi. “Psychological Distance Between Categories in the Likert Scale: Comparing Different Numbers of Opinions.” Educational and Psychological Measurement 72.4 (2012): 533-546.
Wong, Chi-Sum, et al. “Differences Between Odd Number and Even Number Response Formats: Evidence From Mainland Chinese Respondents.” Asia Pacific Journal of Management 28 (2011): 379–399.