In the previous post I pointed out that biologists do not use hypothetical taxa and their inferred features as data. But wait, linguistics is a different discipline. Perhaps in linguistics it’s perfectly legal to use asterisked reconstructions (like, say, Proto-Germanic *wulfaz ‘wolf’) as data on which higher-order reconstructions can be based? The structure of language families is hierarchical. We traditionally group uncontroversially related languages into “branches”, and for the members of each branch a respective ancestral protolanguage is reconstructed by applying the comparative method, right? Then we compare the “proto-branch” languages to reconstruct the most recent common ancestor of the whole family, don’t we?
No, we don’t. Proto-Indo-European was not reconstructed by comparing Proto-Indo-Iranian, Proto-Slavic, Proto-Italic, Proto-Celtic, Proto-Germanic, Proto-Anatolian, etc., with one another. It has always been reconstructed by comparing data extracted from a multitude of documented languages such as Vedic, Avestan, Old Church Slavonic, Serbo-Croatian, Latin, Old Irish, Middle Welsh, Albanian, Classical Greek, Biblical Gothic, Old Norse, Old High German, Hittite, Luwian, and so forth. Proto-branch languages are reconstructed first and foremost for the sake of quality control. The nodes of the family tree are where the most conservative features of the whole branch roughly coalesce, and where it is convenient to check the consistency of the reconstruction. Proto-Germanic is not reconstructed just by comparing English with German, Dutch, Icelandic, Gothic, etc. The reconstruction is informed by the rest of the family tree as well. There is considerable feedback from reconstructed PIE to reconstructed PGmc. To give a historically important example (one among many), for more than fifty years in the 19th century a large set of exceptions to Grimm’s Law remained unexplained, until it occurred to Karl Verner to look for evidence in outgroup languages such as Classical Greek and Vedic. The conditioning environment of the Proto-Germanic process now known as Verner’s Law was obliterated in Germanic itself, but preserved elsewhere.
Linguistic reconstruction is not conducted consistently in a bottom-up fashion, by piecing together smaller units before handling larger ones. A phylogeny is assembled as a whole, and if it becomes part of a grander phylogeny, incorporating outgroup evidence, it may have to be reorganised, and hypothetical protolanguages may have to be redefined in order to optimise the enlarged model. This is what happened to Proto-Indo-European after the Tocharian and Anatolian languages had been added to the family tree. The common ancestor of the remaining IE languages is not the familiar Brugmannian reconstruction from the last decades of the 19th century. It has been deeply affected by new discoveries despite the fact that most Indo-Europeanists place both Tocharian and Anatolian outside the “crown group” containing all the modern IE languages.
Thus, protolanguage reconstructions are not “data”. They are forever provisional and hypothetical. Using them as data is a category error. There are some thoroughly studied families like Indo-European whose protolanguages are reconstructible in considerable detail. In such cases even a historical linguist may be tempted to believe that the great success of the model makes it “real”, so that a reconstructed PIE word is as valid a piece of data as a documented word from a documented language. Such a belief is perhaps partly justified when a reconstructed form is used as a piece of shorthand intelligible to experts (who do not have to bother to list the obvious cognates) – and only if the reconstruction is straightforward and uncontroversial. Even in IE, however, reconstructions may contain questionable elements or require special explanations. Other “Eurasiatic” families are even worse off. Reconstructible Proto-Uralic lexemes would barely fill a Swadesh list. The very validity of Altaic as a well-defined clade is disputed, and even assuming optimistically that Proto-Altaic is a valid concept, little of its structure can be reconstructed with any precision.
The use of *protoforms in datasets is not justifiable in any way if the reconstructions are highly conjectural, if they might be biassed (“improved” to make a point without sufficient evidence), or if they represent preliminary, speculative research whose quality remains controversial. The Languages of the World Etymological Database (LWED), produced by the Tower of Babel project, is precisely such a pioneering enterprise. Little wonder that the “Eurasiatic” reconstructions therein make liberal use of wildcard symbols, optional segments, variants and reconstructed features poorly supported by the comparative data – hallmarks of questionable comparison. They are also based on material drawn from a haphazard collection of sources, including some hopelessly outdated etymological dictionaries. And yet the compilers of the database claim that those reconstructions “may be used for regular comparative purposes – establishing phonetic correspondences and reconstruction – by future researchers”.
It’s a dangerous declaration: researchers, especially scientists with no linguistic training, may take it literally and believe that the etymologies in principle encapsulate reliable data, so that all the dirty work of actual linguistic analysis can be outsourced to the Tower of Babel team; the scientists can then use the condensed final product and need not worry about the rest. In the PNAS study of the Eurasiatic protolexicon the “final product” is then used as the basis for the determination of the size of cognate sets with a given meaning. So if one “proto-word” generates a cognate set spanning all the seven putative members of the Eurasiatic superfamily, and another one gives rise to a cognate set of just three, this difference can be expected to correlate with something real – even if the basis for the reconstruction is extremely tenuous, e.g. if on closer inspection it turns out that one of the alleged cognates is misreconstructed, another has been assigned the wrong meaning, still another is a loanword, and most present formal problems obvious to linguists but not necessarily to non-specialists. I will show in the next post that the “cognate set sizes” based on the LWED cannot be realistically determined with any reasonable accuracy, given the information provided in the database. The error may be so gross that the sizes determined in the study are simply fictitious.
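To make the worry concrete, here is a toy sketch in Python of how a “cognate set size” depends on whether each alleged cognate is taken at face value. Everything in it is an assumption made for illustration: the family labels are placeholders, the flagged problems are invented rather than taken from the LWED or the PNAS study, and the counting rule (one point per family represented) is only my simplified reading of how such sizes are obtained.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Reflex:
    """One alleged cognate in a proposed set (hypothetical record layout)."""
    family: str                    # placeholder label, not a real attribution
    problem: Optional[str] = None  # objection raised on inspection, if any


def cognate_set_size(reflexes, accept_flagged=True):
    """Count the distinct families represented in a proposed cognate set.

    With accept_flagged=False, reflexes that carry an objection are dropped
    before counting.
    """
    return len({r.family for r in reflexes
                if accept_flagged or r.problem is None})


# Invented example: a "proto-word" with alleged reflexes in seven families.
proposed = [
    Reflex("family_A"),
    Reflex("family_B", problem="loanword"),
    Reflex("family_C", problem="wrong meaning assigned"),
    Reflex("family_D"),
    Reflex("family_E", problem="misreconstructed"),
    Reflex("family_F"),
    Reflex("family_G", problem="irregular correspondence"),
]

size_at_face_value = cognate_set_size(proposed)
size_after_scrutiny = cognate_set_size(proposed, accept_flagged=False)
print(size_at_face_value, size_after_scrutiny)
```

In this made-up case the same entry counts as a set of seven or a set of three, depending solely on how much of the linguistic analysis one is prepared to take on trust.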
[► Back to the beginning of the Proto-World thread]