If you ask a group of data analysts and data visualization experts to name the most important chart type for displaying data, the answer you will most likely get is “the scatterplot.” And they have a point. Edward Tufte, in The Visual Display of Quantitative Information, crowned the scatterplot and its variants as the greatest of all graphical designs. The scatterplot encourages the viewer to assess relationships by showing how one variable affects another.

But at the same time, Edward Tufte warned that “…statistical graphs, just like statistical calculations, are only as good as what goes into them. An ill-specified or preposterous model or a puny data set cannot be rescued by a graphic (or by calculation), no matter how clever or fancy. A silly theory means a silly graphic.”

Source: “The Visual Display of Quantitative Information”, Edward Tufte, based on Edward R. Dewey and Edwin F. Dakin, Cycles: The Science of Prediction (New York, 1947), p. 144.

Here’s another silly graphic showing a spurious correlation between chocolate consumption and Nobel laureates.

EDWARD TUFTE'S FUNDAMENTAL PRINCIPLES OF ANALYTICAL DESIGN

But when it comes to spotting spurious correlations, there are much more important issues than the trivial, meaningless relationships shown in the charts above, the kind we all make fun of. Most of us will obviously not pick stocks based on the intensity of solar radiation or expect to win a Nobel Prize by increasing one’s intake of chocolate. There is one dangerous type of spurious correlation, however, that is difficult to spot. This type of correlation may be statistically significant at the aggregate level but ultimately meaningless at the individual level. This is called the problem of the “Ecological Fallacy”.

Let’s first define what an “Ecological Fallacy” is. An Ecological Fallacy is a logical fallacy that may occur when an observed relationship between aggregated variables differs from the true association at the individual level. This fallacy was first described by the late William S. Robinson in 1950 when he published his “Ecological Correlations and the Behavior of Individuals.” The paper became an all-time classic and is one of the most influential methodological papers in the social sciences.

Here is a simple example. Say we’ve measured two variables—X and Y—related to 40 randomly selected individuals, 10 from each of 4 different states as shown in the table below.

If we draw all the individual measures on the scatterplot below and calculate the linear correlation coefficient we’ll see that we have a relatively strong negative correlation between variable X and variable Y. The correlation between the two variables is -0.786 for the set of 40 individual observations.

This example shows how easy it is to make contradictory inferences—depending on whether we look at individual data or aggregated data. And that’s dangerous.

Let’s take another example, this time a real one, from Alberto Cairo’s book The Functional Art. In Chapter 6 Alberto Cairo wants to test the validity of the hypothesis that “Obesity is, on average, inversely proportional to the average education of the population”. For that, he pulls two publicly available data sets from different sources and draws the dot plot shown in the graph below. The first data set is from the US Census Bureau and shows the percentage of people by state holding a BA degree or higher. The second data set is from the Centers for Disease Control and Prevention (CDC) and shows the percentage of people who are obese by state.

Source: “Chocolate Consumption, Cognitive Function, and Nobel Laureates”, by Franz H. Messerli, M.D., The New England Journal of Medicine, October 10, 2012.

Source: “The Functional Art”, Alberto Cairo.

Source: Centers for Disease Control and Prevention

“Statistics isn't about discovering correlations, it's about eliminating coincidence.”

Nassim Nicholas Taleb
Lebanese-American philosopher

Source: “The Functional Art”, Alberto Cairo.

However, if we look at correlations at an aggregated level, say by state, then the scatterplot will be different, and its associated correlation will be different. If we aggregate the data and plot the averages by state instead of the individual measurements, we’ll see that the association between variable X and variable Y is much stronger and runs in the opposite direction. The correlation for the set of four dots shown in the scatterplot below is 0.980.
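The sign flip between the individual and the aggregated view can be reproduced with a small synthetic data set. The sketch below does not use the article's table: the state labels, centers, and offsets are hypothetical values chosen only so that X and Y fall together within each state while the state averages rise together. The `pearson` helper is a plain implementation of the linear correlation coefficient.

```python
import math

def pearson(xs, ys):
    """Linear (Pearson) correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data (an assumption, not the article's table):
# 4 states, 10 individuals each. Within a state, Y falls as X rises;
# across states, the centers of X and Y rise together.
centers = {"A": (5.0, 5.0), "B": (6.0, 6.0), "C": (7.0, 7.0), "D": (8.0, 8.0)}
offsets = [k - 4.5 for k in range(10)]  # -4.5, -3.5, ..., 4.5 (mean 0)

individuals = [
    (state, cx + d, cy - d)
    for state, (cx, cy) in centers.items()
    for d in offsets
]

# Individual-level correlation: all 40 points pooled together.
r_individual = pearson([x for _, x, _ in individuals],
                       [y for _, _, y in individuals])

# Ecological correlation: one point per state, using the state averages.
state_means = []
for state in centers:
    xs = [x for s, x, _ in individuals if s == state]
    ys = [y for s, _, y in individuals if s == state]
    state_means.append((sum(xs) / len(xs), sum(ys) / len(ys)))

r_ecological = pearson([mx for mx, _ in state_means],
                       [my for _, my in state_means])

print(f"individual-level r: {r_individual:.3f}")
print(f"state-level r:      {r_ecological:.3f}")
```

With these made-up numbers the pooled correlation is negative and the correlation of the four state averages is positive, which is the same qualitative pattern as the example in the text.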

Alberto Cairo then designs the scatterplot shown below and calculates the linear correlation coefficient r. Based on the result of r = -0.67, Alberto Cairo concludes that there’s a solid negative correlation between obesity and education.

What Alberto Cairo calculated is called the Ecological Correlation, because the unit of analysis is not an individual person but a group of people: the residents of a state. However, it’s all too easy to draw incorrect conclusions from aggregate data. Alberto Cairo fell into the Ecological Fallacy trap. He made the inference that relationships observed for groups necessarily hold for individuals: in other words, if states with higher educational attainment tend to have lower obesity rates, then less-educated people must be more likely to be obese. These inferences may be correct, but they are only weakly supported by the aggregate data. In reality, as we’ll see next, the correlation computed at the individual level is -0.111. The sign of the correlation is negative, as Alberto Cairo predicted, but the association is far weaker than he suggested. The ecological correlation gives the wrong inference.

The CDC survey data already included the obesity rate by education level and state. Here below is a sample for the state of Alaska.

In 2011 the CDC survey included a total of 470,700 respondents, of whom 128,972 (27.4%) were obese. Of that total, 162,648 respondents were college graduates, of whom 33,505 (20.6%) were obese. Let’s calculate the individual correlation. The table below is a fourfold table showing, for the overall sample, the correlation between obesity and educational attainment (college graduate or higher) considered as properties of individuals rather than geographic areas. The Pearsonian (fourfold-point) correlation, that is, the individual correlation, is -0.111, slightly less than one-sixth of the corresponding ecological correlation as calculated by Alberto Cairo.
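The individual correlation can be checked from the four counts quoted above. The sketch below fills in the cells of the fourfold table by subtraction and applies the standard formula for the correlation between two binary variables (the phi coefficient). The cell names `a`, `b`, `c`, `d` and the function name `phi_coefficient` are labels introduced here for illustration; only the four input counts come from the text.

```python
import math

# Counts quoted in the text (CDC survey, 2011).
total = 470_700          # all respondents
obese = 128_972          # obese respondents
college = 162_648        # college graduates
college_obese = 33_505   # obese college graduates

# Remaining cells of the fourfold table, derived by subtraction.
a = college_obese              # college graduate, obese
b = college - college_obese    # college graduate, not obese
c = obese - college_obese      # not a college graduate, obese
d = total - college - c        # not a college graduate, not obese

def phi_coefficient(a, b, c, d):
    """Pearson correlation of two binary variables from a 2x2 table of counts."""
    numerator = a * d - b * c
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return numerator / denominator

phi = phi_coefficient(a, b, c, d)

print(f"obesity rate, all respondents:   {obese / total:.1%}")
print(f"obesity rate, college graduates: {college_obese / college:.1%}")
print(f"individual-level correlation:    {phi:.3f}")
```

Run on these counts, the result rounds to -0.111, which agrees with the individual correlation reported in the text.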

In his paper “Ecological Correlations and the Behavior of Individuals”, William S. Robinson closed with the important statement that ecological correlations cannot validly be used as substitutes for individual correlations, and he added:

I am aware that this conclusion has serious consequences, and that its effect appears wholly negative because it throws serious doubt upon the validity of a number of important studies made in recent years. The purpose of this paper will have been accomplished, however, if it prevents the future computation of meaningless correlations and stimulates the study of similar problems with the use of meaningful correlations between the properties of individuals.

As you’ve seen, seven decades after William S. Robinson’s finding, people are still computing meaningless correlations. It’s time to get serious and rigorous about statistical data analysis. Otherwise, charlatanism and sophistry will be on the rise.