Tim De Chant at Per Square Mile has an interesting post discussing an article by Ruth Mace and Mark Pagel in which they did a statistical analysis of the distribution of Native languages at European contact in North America and found that the density of languages correlates inversely with latitude (when controlling for land area) and directly with habitat diversity (even when controlling for latitude). You can see Tim’s post for more details on how they actually did the analysis. Here I just want to point out a few considerations that aren’t really addressed in the article, interesting though the result is.
First, I should note that the article itself seems fine for what it is. I don’t see any glaring problems with the statistical analyses, and the authors are clearly aware of the potential issues with their data. They don’t try to push the statistics too far, which is a welcome change from how studies like this often go. Furthermore, I suspect the conclusions they reach do in fact reflect a real phenomenon despite the issues I raise below. As they point out, species distributions are generally known to follow a similar pattern, with more species per land area closer to the equator and in areas with more diverse habitats, all else being equal. It makes intuitive sense that this would be true for human cultural groups as well; more ecological niches to exploit should tend to result in more specialization and therefore more groups, generally at higher population densities, within a given area when these conditions obtain than when they don’t. Since there is also a general tendency for different cultural groups to have different languages (for a variety of reasons), in the aggregate language density should also show these patterns.
That said, I have some concerns about the data underlying this study. The language distributions they discuss appear to come from an atlas of the world’s languages published by a trade publisher and presumably intended for a popular rather than a scholarly audience. There’s nothing wrong with that, of course, but it makes me really want to know how the authors of that atlas got their data. Mace and Pagel don’t discuss this issue at all, merely taking the data from the atlas as given, but it’s important to note that determining these distributions is a much more difficult problem than it seems at first sight.
First of all, defining where one language ends and another begins can be tricky. In cases where totally unrelated languages border each other, it’s pretty easy. But in cases where large areas are covered by closely related, contiguous languages, as is the case for many parts of North America, the difference between a series of related languages and a chain of dialects within a single language becomes very important. This can be very difficult to determine, especially for poorly documented languages such as those spoken in many parts of North America. There are reasonably consistent ways to do this, but for a study like this I’d really like to see them spelled out to know which decisions are behind the data being analyzed.
Furthermore, this distribution of languages is apparently intended to represent the situation at “European Contact,” but Mace and Pagel don’t specify what they mean by that. Contact came at different times in different parts of North America, which is a rather large area. The linguistic situation in 1500 was really quite different from that in 1800, but both of these are reasonable dates for initial contact in different parts of the continent. Since contact came at different times in different places (i.e., it was “time-transgressive”), just mapping the situation in each subregion whenever contact occurred there, which is what I suspect the authors of this atlas did, creates a highly artificial construct when viewed at a continental scale. Trying to control for this and fix language boundaries at a single point continent-wide would be really quite difficult and require either a lot of guesswork or some sort of way to account for the effects of European colonization, depending on which time point you chose. Nevertheless, the kind of analysis Mace and Pagel are doing here really requires a single time point to make sense.
Like I said above, I don’t think this is a terrible paper given what it tries to do, and I suspect that its conclusions do actually point to a real pattern despite the important methodological shortcomings mentioned above. Without more detail on the underlying classification that was the basis for the statistical analysis, though, I don’t see it as being all that useful in actually establishing the reality of that pattern.
Mace, R., & Pagel, M. (1995). A Latitudinal Gradient in the Density of Human Languages in North America. Proceedings of the Royal Society B: Biological Sciences, 261(1360), 117-121. DOI: 10.1098/rspb.1995.0125