If you’d like to avoid a fatal traffic accident, what should you do? Drive slower? Avoid drinking alcohol? Roberts and Winters have found a correlation between traffic accidents and acacia trees.
So what’s going on? Is someone planting acacia trees at junctions and blocking the view? Do acacias drop leaves in a startling manner, causing accidents? You wouldn’t expect a falling leaf to cause an accident, but maybe it’s that unexpectedness that causes the problem. Will targeted lumberjacking make roads safer? The correlation is real, but that’s all it is: a correlation.
Just because one result correlates with another, it doesn’t mean you can draw a line implying causation. In my case I’ve implied acacia trees cause traffic accidents. Could it be the other way round? Do fatal accidents cause acacia trees? Maybe people plant acacias as a memorial to those who’ve died. Often there is a deeper reason for a connection.
Roberts and Winters’ paper, Linguistic Diversity and Traffic Accidents: Lessons from Statistical Studies of Cultural Traits, is aimed at people looking for correlations in linguistic and cultural data, but their warnings apply to anyone working with complex data, particularly if you don’t define a research question when you start your study.
One feature they highlight is historical accident. They find a correlation between acacia trees and tonal languages. Does one cause the other? Tonal languages are most commonly found in Africa, and languages tend to cluster because they have common historical roots. Acacias are (mainly) found in Africa. There’s no huge insight into the correlation, simply that you have two things common in Africa. If you think about the comparative safety of roads in Africa, it becomes clear why there’s a correlation between acacias and traffic accidents.
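To see how a shared location can manufacture a correlation on its own, here is a minimal sketch in Python. Everything in it is made up for illustration: the `in_region` flag stands in for "is in Africa", and the probabilities are arbitrary assumptions, not figures from Roberts and Winters. Within each group the two traits are generated independently, so any overall correlation comes purely from the shared region.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical places. All probabilities below are invented for illustration.
places = []
for _ in range(20000):
    in_region = rng.random() < 0.3
    # The two traits depend on the region only, never on each other.
    has_acacia = rng.random() < (0.8 if in_region else 0.1)
    has_tonal = rng.random() < (0.7 if in_region else 0.15)
    places.append((in_region, int(has_acacia), int(has_tonal)))

def trait_correlation(rows):
    """Correlation between the acacia flag and the tonal-language flag."""
    return pearson([r[1] for r in rows], [r[2] for r in rows])

overall = trait_correlation(places)
inside = trait_correlation([p for p in places if p[0]])
outside = trait_correlation([p for p in places if not p[0]])

print(f"all places:     {overall:.2f}")
print(f"inside region:  {inside:.2f}")
print(f"outside region: {outside:.2f}")
```

With these invented numbers the overall correlation should come out clearly positive, while the correlation inside and outside the region should hover around zero. Split the data by the common cause and the link between the two traits disappears.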
Another feature of making connections between data sets is that correlations can happen by chance. You can quantify how likely it is that a result is due to chance, but that by itself tells you little about the meaning of the result. If each result has just a 1% probability of being due to chance, but you’ve run 100 tests, then you should expect around one freak result. The more things you look at, the more chance there is of finding spurious results. The original paper has a handy quote from Nassim Nicholas Taleb: “This is the tragedy of big data: The more variables, the more correlations that can show significance. Falsity also grows faster than information; it is nonlinear (convex) with respect to data.”
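The arithmetic behind the "100 tests at 1%" example can be checked with a short Python sketch. It assumes the tests are independent and that there is no real effect anywhere, so each test has a flat 1% chance of throwing up a freak result. The function names and the number of simulated studies are my own choices for illustration.

```python
import random

ALPHA = 0.01    # chance that a single test gives a freak result
N_TESTS = 100   # number of tests run in one study

# Expected number of freak results, and the chance of getting at least one.
expected_hits = N_TESTS * ALPHA
p_at_least_one = 1 - (1 - ALPHA) ** N_TESTS

def count_false_positives(n_tests, alpha, rng):
    """Count tests that come up 'significant' when nothing real is going on."""
    return sum(rng.random() < alpha for _ in range(n_tests))

def simulate(n_studies=10000, seed=0):
    """Repeat the whole 100-test study many times and summarise the outcome."""
    rng = random.Random(seed)
    counts = [count_false_positives(N_TESTS, ALPHA, rng) for _ in range(n_studies)]
    mean_hits = sum(counts) / n_studies
    share_with_hit = sum(c > 0 for c in counts) / n_studies
    return mean_hits, share_with_hit

mean_hits, share_with_hit = simulate()

print(f"expected freak results per study: {expected_hits:.2f}")
print(f"chance of at least one:           {p_at_least_one:.3f}")
print(f"simulated mean per study:         {mean_hits:.2f}")
print(f"simulated share with at least one: {share_with_hit:.3f}")
```

Under these assumptions you expect one freak result per study on average, and roughly 63% of studies will contain at least one. Real data sets are rarely made of independent tests, so treat this as a rough guide and not a correction method.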
This is how Roberts and Winters can put together a chain of spurious correlations. It’s a valuable paper to refer to the next time you’re confronted with a study that produces peculiar results. You can also read their blog post about the paper.
Roberts S. & Winters J. (2013). Linguistic diversity and traffic accidents: lessons from statistical studies of cultural traits. PLoS ONE. DOI: 10.1371/journal.pone.0070902
The mean number of annual road fatalities per 100,000 people within a country as a function of the presence of Acacia nilotica. Image by Seán Roberts and James Winters. This image is licensed under a Creative Commons BY licence.
Correlation. Image by Randall Munroe/xkcd. This image is licensed under a Creative Commons BY-NC licence.