When making a scatter plot, the two variables should not be placed arbitrarily. There is a rule governing this: the outcome variable should be shown on the vertical axis (also called y-axis), and the explanatory variable on the horizontal (or x-) axis.

This chart from the archives of the Economist has this reversed:


The title of the accompanying article is "Ice Cream and IQ"...

In a Trifecta Checkup (link), it's a Type DV chart. It's preposterous to claim eating ice cream makes one smarter without more careful studies. The chart also carries the xyopia fallacy: by showing just two variables, readers are unwittingly led to explain differences in "IQ" using differences in per-capita ice-cream consumption when lots of other stronger variables will explain any gaps in IQ.

In this post, I put aside my objections to the analysis, and focus on the issue of assigning variables to axes. Notice that this chart reverses the convention: the outcome variable (IQ) is shown on the horizontal, and the explanatory variable (ice cream) is shown on the vertical.

Here is a reconstruction of the above chart, showing only the dots that were labeled with country names. I fitted a straight regression line instead of a curve. (I don't understand why the red line in the original chart bends upwards when the data for Japan, South Korea, Singapore and Hong Kong should be dragging it down.)


Note that the interpretation of the regression line raises eyebrows because the presumed causality is reversed. For each 50 points increase in PISA score (IQ), this line says to expect ice cream consumption to raise by about 1-2 liters per person per year. So higher IQ makes people eat more ice cream.


If the convention is respected, then the following scatter plot results:


The first thing to note is that the regression analysis is different here from that shown in the previous chart. The blue regression line is not equivalent to the black regression line from the previous chart. You cannot reverse the roles of the x and y variables in a regression analysis, and so neither should you reverse the roles of the x and y variables in a scatter plot.

The blue regression line can be interpreted as having two sections, roughly, for countries consuming more than or less than 6 liters of ice cream per person per year. In the less-ice-cream countries, the correlation between ice cream and IQ is stronger (I don't endorse the causal interpretation of this statement).


When you make a scatter plot, you have two variables for which you want to analyze their correlation. In most cases, you are exploring a cause-effect relationship.

Higher income households cares more on politics.
Less educated citizens are more likely to not register to vote.
Companies with more diverse workforce has better business performance.

Frequently, the reverse correlation does not admit a causal interpretation:

Caring more about politics does not make one richer.
Not registering to vote does not make one less educated.
Making more profits does not lead to more diversity in hiring.

In each of these examples, it's clear that one variable is the outcome, the other variable is the explanatory factor. Always put the outcome in the vertical axis, and the explanation in the horizontal axis.

The justification is scientific. If you are going to add a regression line (what Excel calls a "trendline"), you must follow this convention, otherwise, your regression analysis will yield the wrong result, with an absurd interpretation!



Comments are closed.