The punditry has spoken: the most important data question at the Republican Convention is where different states are located. Here is the FiveThirtyEight take on the matter:

They crunched some numbers and argue that Trump's margin of victory in a state's primary is the best indicator of how close to the front that state's delegation is seated.

Others have put this type of information on a map:

A scatter plot with an added "trendline" is often misleading. Your eyes are drawn to the line, and away from the points that lie far from it. In fact, the R-squared of the regression line is only about 20%, meaning that Trump's margin accounts for only about a fifth of the variation in seating position. This is quite obvious from the distribution of green shades in the map below.

***

So, I wanted to investigate how robust this regression line is. The way statisticians address this question is as follows: imagine that the seating had been assigned completely at random. How likely is it that the actual seating plan would have arisen from such a random assignment?

Take the seating assignments from the scatter plot, then randomly shuffle them to create simulated random seating plans. We keep the same slots: for example, four states were given #1 positions in the actual arrangement, so in every simulation, four states get #1 positions. It's just that which four states get them is decided by flipping coins.
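To make the shuffling step concrete, here is a minimal sketch in Python. The state labels and positions are made up for illustration (I don't have the real data), and `shuffle_seating` is a hypothetical helper name, not anything from FiveThirtyEight.

```python
import random

def shuffle_seating(positions, rng):
    """Reassign the same set of seating slots to states at random.

    `positions` maps state -> seating position. The multiset of slots is
    preserved, so if four states held #1 in the actual plan, four states
    hold #1 in every shuffled plan.
    """
    states = list(positions)
    slots = list(positions.values())
    rng.shuffle(slots)  # in-place random permutation of the slots
    return dict(zip(states, slots))

# Hypothetical toy data: six placeholder "states", not the real RNC plan.
actual = {"A": 1, "B": 1, "C": 2, "D": 3, "E": 3, "F": 4}
rng = random.Random(0)  # fixed seed so the sketch is repeatable
simulated = shuffle_seating(actual, rng)
```

Every simulated plan has exactly the same slots as the actual one; only who sits where changes.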

I ran one hundred simulated seating plans at a time. For each plan, I created the scatter plot of seating position versus Trump margin (the mirror image of the FiveThirtyEight chart), and fitted a regression line. The following shows the slopes of the first 200 simulations:

The more negative the slope, the more power Trump margin has in explaining the seating arrangement.
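For readers who want to try this at home, the simulation loop might look something like the sketch below. The margins and positions are invented numbers, the function names are mine, and the slope is the ordinary least-squares slope of position on margin, which I am assuming is what a standard "fitted regression line" means here.

```python
import random

def ols_slope(x, y):
    """Least-squares slope of y regressed on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx

def simulated_slopes(margins, positions, n_sims, seed=0):
    """Shuffle positions across states n_sims times; return each plan's slope."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(n_sims):
        shuffled = positions[:]   # copy, so the actual plan is left untouched
        rng.shuffle(shuffled)     # same slots, random assignment
        slopes.append(ols_slope(margins, shuffled))
    return slopes

# Hypothetical toy data: Trump margin (points) and seating position per state.
margins = [30, 25, 12, 5, -3, -10, -20, 18]
positions = [1, 1, 2, 3, 3, 4, 5, 2]

actual_slope = ols_slope(margins, positions)
slopes = simulated_slopes(margins, positions, n_sims=100)
```

In this toy data the actual slope is negative by construction, since the made-up states with bigger margins were given seats nearer the front.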

Notice that even though all these plans are created at random, the slopes vary widely. In fact, there is one randomly created plan that sits right below the actual RNC plan, shown in red. So, it is possible, but very unlikely, that the RNC plan was randomly drawn up.

Another view of this phenomenon is the histogram of the slopes:

This again shows that the actual seating plan is very unlikely to be produced by a random number generator. (I plotted 500 simulations here.)

In statistics, we measure rarity by "standard errors". The actual plan is almost but not quite three standard errors away from the average random plan. A rule of thumb is that 3 standard errors or more is rare. (This corresponds to over 99% confidence.)
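Here is a rough sketch of how that distance can be computed. I am assuming the "standard error" is estimated as the standard deviation of the simulated slopes; the slope values below are invented for illustration, and both function names are hypothetical. The second function counts how often a random plan does at least as well as the actual one, which is the same idea read directly off the histogram.

```python
import statistics

def standardized_distance(actual_slope, simulated_slopes):
    """How many standard deviations the actual slope is from the average random slope."""
    mean = statistics.mean(simulated_slopes)
    sd = statistics.stdev(simulated_slopes)  # sample standard deviation
    return (actual_slope - mean) / sd

def empirical_p_value(actual_slope, simulated_slopes):
    """One-sided share of random plans with a slope at least as negative as the actual one.

    The +1 in numerator and denominator counts the actual plan itself,
    so the estimate is never exactly zero.
    """
    hits = sum(1 for s in simulated_slopes if s <= actual_slope)
    return (hits + 1) / (len(simulated_slopes) + 1)

# Hypothetical toy values, not the real simulation output.
toy_slopes = [-0.02, -0.01, 0.0, 0.01, 0.02]
toy_actual = -0.04

z = standardized_distance(toy_actual, toy_slopes)
p = empirical_p_value(toy_actual, toy_slopes)
```

With only a handful of toy slopes the numbers mean little; with a few hundred simulations, a distance near three standard deviations and a tiny share of random plans beating the actual one tell the same story.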

PS. Does anyone have the data corresponding to the original scatter plot? There are other things I want to do with the data but I'd need to find (a) the seating position by state and (b) the primary results nicely set in a spreadsheet.
