Chris P. tipped me about this wonderful webpage containing an analysis of high-grossing movies. The direct link is here.

First, a Trifecta checkup: This thoughtful web project integrates beautifully rendered, clearly articulated graphics with the commendable objective of bringing data to the conversation about gender and race issues in Hollywood, an ambitious goal that it falls short of achieving because the data only marginally address the question at hand.

There is some intriguing just-beneath-the-surface interplay between the Q (question) and D (data) corners of the Trifecta, which I will get to in the lower half of this post. But first, let me talk about the Visual aspect of the project, which for the most part, I thought, was well executed.

The leading chart is simple and clear, setting the tone for the piece:


I like the use of color here. The colored chart titles are inspired. I also like the double color coding - notice that the proportion data are coded not just in the lengths of the bar segments but also in the opacity. There is some messiness in the right-hand-side labeling of the first chart but probably just a bug.

This next chart also contains a minor delight: upon scrolling to the following dot plot, the reader finds that one of the dots has been labeled; this is a signal to readers that they can click on the dots to reveal the "tooltips". It's a little thing but it makes a world of difference.


I also enjoy the following re-imagination of those proportional bar charts from above:


This form fits well with the underlying data structure (a good example of setting the V and the D in harmony). The chart shows the proportion of words spoken by male versus female actors over the course of a single movie (Tin Men from 1987 is the example shown here). The chart is centered in the unusual way, making it easy to read exactly when the females are allowed to have their say.

There is again a possible labeling hiccup. The middle label says 40th minute which would imply the entire movie is only 80 minutes long. (A quick check shows Tin Men is 110 minutes long.) It seems that they are only concerned with dialog, ignoring all moments of soundtrack, or silence. The visualization would be even more interesting if those non-dialog moments are presented.


The reason why the music and silence are missing has more to do with practicality than will. The raw materials (Data) used are movie scripts. The authors, much to their merit, acknowledge many of the problems that come with this data, starting with the fact that directors make edits to the scripts. It is also not clear how to locate each line along the duration of the movie. An assumption of speed of dialog seems to be required.

I have now moved to the Q corner of the Trifecta checkup. The article is motivated by the #OscarSoWhite controversy from a year or two ago, although by the second paragraph, the race angle has already been dropped in favor of gender, and by the end of the project, readers will have learned also about ageism but  the issue of race never returned. Race didn't come back because race is not easily discerned from a movie script, nor is it clearly labeled in a resource such as IMDB. So, the designers provided a better solution to a lesser problem, instead of a lesser solution to a better problem.

In the last part of the project, the authors tackle ageism. Here we find another pretty picture:


At the high level, the histograms tell us that movie producers prefer younger actresses (in their 20s) and middle-aged actors (forties and fifties). It is certainly not my experience that movies have a surplus of older male characters. But one must be very careful interpreting this analysis.

The importance of actors and actresses is being measured by the number of words in the scripts while the ages being analyzed are the real ages of the actors and actresses, not the ages of the characters they are playing.

Tom Cruise is still making action movies, and he's playing characters much younger than he is. A more direct question to ask here is: does Hollywood prefer to put younger rather than older characters on screen?

Since the raw data are movie scripts, the authors took the character names, and translated those to real actors and actresses via IMDB, and then obtained their ages as listed on IMDB. This is the standard "scrape-and-merge" method executed by newsrooms everywhere in the name of data journalism. It often creates data that are only marginally relevant to the problem.





Comments are closed.