Against all logic, Cornell announced last week it would re-open in the fall because a mathematical model under development by several faculty members and grad students predicts that a "full re-opening" would lead to 80 percent fewer infections than a scenario of full virtual instruction. That's what was reported by the media.
The model is complicated, with loads of assumptions, and the report is over 50 pages long. I will put up my notes on how they attained this counterintuitive result in the next few days. The bottom line is - and the research team would agree - it is misleading to describe the analysis as "full re-open" versus "no re-open". The so-called full re-open scenario assumes the entire community including students, faculty and staff submit to a full program of test-trace-isolate, including (mandatory) PCR diagnostic testing once every five days throughout the 16-week semester, and immediate quarantine and isolation of new positive cases, as well as those in contact with such persons, plus full compliance with this program. By contrast, it assumes students do not get tested in the online instruction scenario. In other words, the researchers expect Cornell to get done what the U.S. governments at all levels failed to do until now.
The report takes us back to the good old days of best-base-worst-case analysis. There is no data for validating such predictions so they performed sensitivity analyses, defined as changing one factor at a time assuming all other factors are fixed at "nominal" (i.e. base case) values. In a large section of the report, they publish a series of charts of the following style:
Each line here represents one of the best-base-worst cases (respectively, orange-blue-green). Every parameter except one is given the "nominal" value (which represents the base case). The parameter that is manpulated is shown on the horizontal axis, and for the above chart, the variable is the assumption of average number of daily contacts per person. The vertical axis shows the main outcome variable, which is the percentage of the community infected by the end of term.
This flatness of the lines in the above chart appears to say that the outcome is quite insensitive to the change in the average daily contact rate under all three scenarios - until the daily contact rises above 10 per person per day. It also appears to show that the blue line is roughly midway between the orange and the green so the percent infected is slightly less-than halved under the optimistic scenario, and a bit more than doubled under the pessimistic scenario, relative to the blue line.
The vertical axis is presented in log scale, and only labeled at values 1% and 10%. About midway between 1 and 10 on the horizontal axis, the outcome value has already risen above 10%. Because of the log transformation, above 10%, each tick represents an increase of 10% in proportion. So, the top of the vertical axis indicates 80% of the community being infected! Nothing in the description or labeling of the vertical axis prepares the reader for this.
The report assumes a fixed value for average daily contacts of 8 (I rounded the number for discussion), which is invariable across all three scenarios. Drawing a vertical line about eight-tenths of the way towards 10 appears to signal that this baseline daily contact rate places the outcome in the relatively flat part of the curve.
Since there exists exactly one tick beyond 10 on the horizontal axis, the right-most value is 20. The model has been run for values of average daily contacts from 1 to 20, with unit increases. I can think of no defensible reason why such a set of numbers should be expressed in a log scale.
For the vertical axis, the outcome is a proportion, which is confined to within 0 percent and 100 percent. It's not a number that can explode.
Every log scale on a chart is birthed by its designer. I know of no software that automatically performs log transforms on data without the user's direction. (I write this line with trepidation wishing that I haven't planted a bad idea in some software developer's head.)
Here is what the shape of the original data looks like - without any transformation. All software (I'm using JMP here) produces something of this type:
At the baseline daily contact rate value of 8, the model predicts that 3.5% of the Cornell community will get infected by the end of the semester (again, assuming strict test-trace-isolate fully implemented and complied). Under the pessimistic scenario, the proportion jumps to 14%, which is 4 or 5 times higher than the base case. In this worst-case scenario, if the daily contact rate were about twice the assumed value (just over 16), half of the community would be infected in 16 weeks!
I actually do not understand how there could only be 8 contacts per person per day when the entire student body has returned to 100% in-person instruction. (In the report, they even say the 8 contacts could include multiple contacts with the same person.) I imagine an undergrad student in a single classroom with 50 students. This assumption says the average student in this class only comes into contact with at most 8 of those. That's one class. How about other classes? small tutorials? dining halls? dorms? extracurricular activities? sports? parties? bars?
Back to graphics. Something about the canonical chart irked the report writers so they decided to try a log scale. Here is the same chart with the vertical axis in log scale:
The log transform produces a visual distortion. On the right side, where the three lines are diverging rapidly, the log transform pulls them together. On the left side, where the three lines are close together, the log transform pulls them apart.
Recall that on the log scale, a straight line is exponential growth. Look at the green line (worst case). That line is approximately linear so in the pessimistic scenario, despite assuming full compliance to a strict test-trace-isolate regimen, the cases are projected to grow exponentially.
Something about that last chart still irked the report writers so they decided to birth a second log scale. Here is the chart they ultimately settled on:
As with the other axis, the effect of the log transform is to squeeze the larger values (on the right side) and spread out the smaller values (on the left side). After this cosmetic surgery, the left side looks relatively flat while the right side looks steep.
In the next version of the Cornell report, they should replace all these charts with ones using linear scales.
Upon discovering this graphical mischief, I wonder if the research team received a mandate that includes a desired outcome.