If you ask a group of data analysts and data visualization experts to name the most important chart type for displaying data, the most likely answer is “the scatterplot”. And they have a point. Edward Tufte, in The Visual Display of Quantitative Information, crowned the scatterplot and its variants as the greatest of all graphical designs. The scatterplot encourages the viewer to assess relationships by showing how one variable changes with another.

But at the same time, Edward Tufte warned that “…statistical graphs, just like statistical calculations, are only as good as what goes into them. An ill-specified or preposterous model or a puny data set cannot be rescued by a graphic (or by calculation), no matter how clever or fancy. A silly theory means a silly graphic.”

Source: “The Visual Display of Quantitative Information”, Edward Tufte, based on Edward R. Dewey and Edwin F. Dakin, Cycles: The Science of Prediction (New York, 1947), p. 144.

Here’s another silly graphic showing a spurious correlation between chocolate consumption and Nobel laureates.

THE ECOLOGICAL CORRELATION FALLACY

But when it comes to spotting spurious correlations, there are far more important issues than the trivial, meaningless relationships shown in the charts above, the kind we all make fun of. Most of us would obviously not pick stocks based on the intensity of solar radiation, or expect to win a Nobel Prize by increasing one’s intake of chocolate. There is one dangerous type of spurious correlation, however, that is difficult to spot. This type of correlation may be statistically significant at the aggregate level but ultimately meaningless at the individual level. This is called the problem of “Ecological Fallacy”.

Let’s first define what an “Ecological Fallacy” is. An Ecological Fallacy is a logical fallacy that may occur when an observed relationship between aggregated variables differs from the true association at the individual level. This fallacy was first described by the late William S. Robinson in 1950, when he published “Ecological Correlations and the Behavior of Individuals.” The paper became an all-time classic and is one of the most influential methodological papers in the social sciences.

Here is a simple example. Say we’ve measured two variables—X and Y—related to 40 randomly selected individuals, 10 from each of 4 different states as shown in the table below.

If we draw all the individual measures on the scatterplot below and calculate the linear correlation coefficient we’ll see that we have a relatively strong negative correlation between variable X and variable Y. The correlation between the two variables is -0.786 for the set of 40 individual observations.

This example shows how easy it is to make contradictory inferences—depending on whether we look at individual data or aggregated data. And that’s dangerous.

Let’s take another example, this time a real one, from Alberto Cairo’s book The Functional Art. In Chapter 6 Alberto wants to test the validity of the hypothesis that “Obesity is, on average, inversely proportional to the average education of the population”. For that, Alberto Cairo pulls two publicly available data sets from different sources and draws the dot plot shown in the graph below. The first data set is from the US Census Bureau and shows the percentage of people by state holding a BA degree or higher. The second data set is from the Centers for Disease Control and Prevention (CDC) and shows the percentage of people who are obese by state.

Source: “Chocolate Consumption, Cognitive Function, and Nobel Laureates”, by Franz H. Messerli, M.D., The New England Journal of Medicine, October 10, 2012.

Source: “The Functional Art”, Alberto Cairo.

Source: Centers for Disease Control and Prevention

“Statistics isn't about discovering correlations, it's about eliminating coincidence.”

Nassim Nicholas Taleb

Lebanese-American philosopher

However, if we look at correlations at an aggregate level, say by state, then the scatterplot will be different, and so will its associated correlation. If we aggregate the data and plot the averages by state instead of the individuals, we’ll see that the association between variable X and variable Y is much stronger and runs in the opposite direction. The correlation for the set of four dots shown in the scatterplot below is 0.980.
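The sign reversal between individual and aggregated data can be reproduced with a small made-up data set. The Python sketch below uses hypothetical numbers, not the table from this article, so its coefficients differ from -0.786 and 0.980. The helper `pearson`, the state labels and the way the points are generated are all assumptions introduced for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical data: 4 states, 10 individuals each.
# The state centers rise together: (10, 10), (11, 11), (12, 12), (13, 13).
# Inside every state, Y falls as X rises.
offsets = [d - 4.5 for d in range(10)]  # -4.5, -3.5, ..., 4.5
states = {}
for g, name in enumerate(["A", "B", "C", "D"]):
    center = 10 + g
    states[name] = [(center + d, center - d) for d in offsets]

# Individual level: all 40 points pooled together.
xs = [x for points in states.values() for x, _ in points]
ys = [y for points in states.values() for _, y in points]
r_individual = pearson(xs, ys)

# Aggregate (ecological) level: one point per state, the state averages.
mean_xs = [sum(x for x, _ in pts) / len(pts) for pts in states.values()]
mean_ys = [sum(y for _, y in pts) / len(pts) for pts in states.values()]
r_ecological = pearson(mean_xs, mean_ys)

print(f"individual r = {r_individual:.3f}")  # individual r = -0.737
print(f"ecological r = {r_ecological:.3f}")  # ecological r = 1.000
```

In this toy construction the within-state spread is much larger than the between-state spread, which is why the pooled correlation stays negative even though the four state averages lie on a rising line.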

Alberto Cairo then designs the scatterplot shown below and calculates the linear correlation coefficient r. Based on the result of r = -0.67, Alberto Cairo concludes that there’s a solid negative correlation between obesity and education.

What Alberto Cairo calculated is called the Ecological Correlation, because the unit of analysis is not an individual person but a group of people: the residents of a state. However, it’s all too easy to draw incorrect conclusions from aggregate data. Alberto Cairo fell into the Ecological Fallacy trap. He inferred that relationships observed for groups necessarily hold for individuals; in other words, if states with higher educational attainment tend to have lower obesity rates, then less educated people must be more likely to be obese. Such inferences may be correct, but they are only weakly supported by the aggregate data. In reality, as we’ll see next, the correlation computed at the individual level is -0.111. The sign of the correlation is negative, as Alberto Cairo predicted, but the association is far weaker than he suggested. The ecological correlation gives the wrong inference about its strength.

The CDC survey data already included the obesity rate by education level and state. Below is a sample for the state of Alaska.

In 2011 the CDC survey included a total of 470,700 respondents, of whom 128,972 (27.4%) were obese. Of that total, 162,648 respondents were college graduates, of whom 33,505 (20.6%) were obese. Let’s calculate the individual correlation. The table below is a fourfold table showing, for the overall sample, the correlation between obesity and educational attainment (college graduate or higher) considered as properties of individuals rather than geographic areas. The Pearsonian (fourfold-point) correlation, that is, the individual correlation, is -0.111, slightly less than one-sixth of the corresponding ecological correlation calculated by Alberto Cairo.
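The individual correlation can be checked from the four counts quoted above. The Python sketch below rebuilds the fourfold table and applies the standard fourfold-point (phi) formula, (ad - bc) / sqrt((a+b)(c+d)(a+c)(b+d)). The variable names are assumptions chosen for illustration; only the four input counts and the ecological value of -0.67 come from the text.

```python
from math import sqrt

# Counts quoted in the text (CDC survey, 2011).
total = 470_700            # all respondents
obese = 128_972            # obese respondents
graduates = 162_648        # college graduates
obese_graduates = 33_505   # obese college graduates

# Cells of the fourfold (2x2) table, derived by subtraction.
a = obese_graduates              # graduate, obese
b = graduates - obese_graduates  # graduate, not obese
c = obese - obese_graduates      # non-graduate, obese
d = total - graduates - c        # non-graduate, not obese

# Fourfold-point (phi) correlation: Pearson's r for two binary variables.
phi = (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))

ecological_r = -0.67  # state-level correlation reported in the text
print(round(phi, 3))                 # -0.111
print(round(phi / ecological_r, 3))  # share of the ecological correlation
```

The ratio printed on the last line is what the text describes as slightly less than one-sixth.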

In closing his paper “Ecological Correlations and the Behavior of Individuals”, William S. Robinson stated that ecological correlations cannot validly be used as substitutes for individual correlations, and he added:

“I am aware that this conclusion has serious consequences, and that its effect appears wholly negative because it throws serious doubt upon the validity of a number of important studies made in recent years. The purpose of this paper will have been accomplished, however, if it prevents the future computation of meaningless correlations and stimulates the study of similar problems with the use of meaningful correlations between the properties of individuals.”

As you’ve seen, seven decades after William S. Robinson’s finding, people are still computing meaningless correlations. It’s time to get serious and rigorous about statistical data analysis. Otherwise, charlatanism and sophistry will be on the rise.

ALBERTO CAIRO'S REPLY

I’m aware of the ecological fallacy and amalgamation paradoxes. I describe them in later books! However, the obesity vs. education graphic isn’t a good example of them, I believe. According to researchers I consulted, the association exists down to the individual level (it’s weaker, but this is common when we aggregate or disaggregate data). In general, and with obvious exceptions, the more educated a person is, the less likely he or she is to be obese; the causal links are very complicated, of course. This may end up being wrong, but it’s what I got from experts (1).

Another matter is the wording I used. I agree with you that this needs attention, as the description I wrote of what the charts show is sloppy. If I remember well, the first time you sent me your articles I thanked you for pointing it out.

Throughout the years I’ve become more aware of how important it is to correctly describe what a chart shows, as doing it wrong may bias our perception of them. In this case, a better wording would be “at the state level there’s a positive association between education and obesity, and vice versa; but that doesn’t mean that the association is causal, and it may disappear or even reverse at lower levels of aggregation”. Clunkier, but perhaps closer to the truth.

Why didn’t I refer to confounders, ecological fallacies, amalgamation/Simpson’s paradoxes, causality, etc. in that section? This is essential to understanding why I think your critiques are a bit off, although I still consider them valid and useful: ‘The Functional Art’ isn’t a book about analytics or reasoning. It’s about the visual design of charts and infographics: choosing graphic forms, colors, typography, layout, and so forth.

Can we separate one from the other? You’ll argue that we can’t, and we can have a chat about it at some point. However, and as in the rest of the book, I didn’t make an assertion based on data myself, as I didn’t analyze anything. The education vs. obesity exercise is simply based on taking *somebody else’s* assertion and thinking about how to visualize it in different ways. In fact, that example appears in a chapter about visual perception and Cleveland’s scale of encodings.

(1) I do this in all books, as I’m painfully aware of my own knowledge gaps, and terrified of mistakes. I’m no statistician and, as you write, quoting Taleb, “statistics are hard”. I couldn’t agree more. In ‘The Truthful Art’, even if I discuss some elementary summary stats, my main recommendation, which I hammer constantly, is to “always consult with experts”.