03 May 2013

The Uncertainty Principle in Linguistic Reconstruction

What is a “reconstructed protolanguage” like Proto-Indo-European? It’s customary to define it as the most recent common ancestor of a family of languages. If the structure of relationships within the family is represented as a clear-cut phylogenetic tree, it is even possible to offer a formal definition based on a pair of languages, A and B, each belonging to a different primary branch of the family (produced by the oldest split in its history). Thus, Indo-Europeanists generally agree that the Anatolian group (Hittite, Luwian and their close relatives) had split from the “core” part of the family before Core IE underwent further fragmentation. By putting it in this way we ascribe a “basal” status to Anatolian and tend to see the other branch as “PIE proper”. That’s because Core IE contains all the modern languages and all the familiar (and excellently documented) “classical” ones like Latin, Ancient Greek, and Vedic. We should realise, however, that the privileged position of Core IE with respect to Anatolian is merely an artifact of the history of IE studies and of the (accidentally) unequal written attestation of the two primary branches. It is more reasonable to say that the common ancestor bifurcated into a pair of daughter languages, Proto-Anatolian and Proto-Core (rather than to insist that either of them “split off first”). The fact that Proto-Anatolian has left no contemporary descendants is an accidental consequence of the vagaries of history, not of its “basal” or “less advanced” status.

Getting blurry
Hat tip: Jo Verrent
Hittite is a member of Anatolian; Greek is a member of Core IE. Therefore, we could define PIE as the most recent common ancestor of Hittite and Greek (or Luwian and Albanian, or Lycian and Welsh, if you prefer). Any such definition is fine as long as we accept the family-tree model as a valid representation of the history of languages. It is worth reflecting, however, that we reconstruct protolanguages by comparing a large number of sets of related replicators (words, morphemes and their phonological building-blocks) and tracing their history back, as if reversing the changes they have undergone. In a “perfect” family tree the genealogies of replicators simply follow the genealogies of languages. They all converge on the same point – the earliest bifurcation in the family tree. We have seen, however, that every replicator has its own family tree. For the oldest lexical layer of the Indo-European languages most of those family trees roughly coincide, producing the outline of a somewhat fuzzy but reasonably distinct language phylogeny. But do they all go back to the same time and place? Quite certainly not. We have no guarantee that all the features we call “Proto-Indo-European” co-occurred in one and the same speech community. Rather than that, their individual histories coalesce within a certain temporal range. The coalescent points may be considerably older than PIE as defined above. Such “deep coalescence” is especially likely when we combine the comparative method with internal reconstruction.

Deep coalescence
Let’s consider an example. We have a theoretical model of phonological alternations in PIE nominal paradigms. We know (or, to be precise, have reasons to believe) that some nouns originally had an “acrostatic” pattern of accentuation: the root syllable was accented throughout the declensional paradigm, and the quality of the root vowel alternated between *o and *e, whereas other nouns, accentually “mobile” ones, showed a parallel pattern in which an accented root vowel (normally *e) corresponded to the acrostatic *o, and the phonetic reduction of the vowel, caused by the shift of accent to a suffixal syllable, corresponded to the acrostatic *e. Traces of the acrostatic pattern are quite rare in the documented languages. We reconstruct the pattern on the basis of its scattered remains rather than fully preserved paradigms. Still, we are pretty confident that an alternation such as shown below occurred at some stage in the stem *dorw- ‘wood, tree’ (an acrostatic neuter):

  • nom./acc.sg. *dóru, gen.sg. *déru-s

However, no IE language preserves anything remotely like this. On strictly comparative grounds, without trying to make sense of the paradigm and its relation to other similarly behaving nouns, one could reconstruct something like *dóru/*dorw-ós (Indo-Iranian disagrees, suggesting *dóru/*dréu-s instead), and leave it at that. However, we have reasons to believe that such forms are analogical, reflecting various attempts to level out vowel alternations or replace the unproductive acrostatic pattern with a more common mobile one. If the reconstruction *dóru/*déru-s is correct, it belongs to a deep chronological stratum in the prehistory of PIE itself. It was in all likelihood replaced by “regularised” variants like *dóru/*dérw-os (with a vowel inserted in the genitive suffix), *dóru/*dorw-ós (with the root alternation eliminated and the genitive ending accented, as in more productive mobile patterns), and possibly others, before the disintegration of PIE unity. Quite possibly by that time the “original” paradigm had already been abandoned and forgotten. The large amount of hesitation and paradigmatic inconsistency found even in the most conservative languages on either side of the Anatolian/Core divide suggests that much of that polymorphy was inherited from the common ancestor. Fully coalescent forms must therefore be older and can’t be dated with much accuracy. Reconstructed PIE is only roughly bounded in time (and even less so geographically, since even the approximate location of the IE homeland remains a moot question). The more we rely on abstract analyses to recover “the oldest” forms of alternating paradigms and to understand the origin of the alternations, the less precisely their chronology can be determined.

Note that we are talking of a well-studied language family with 400+ extant members and a written record beginning in the second millennium BC. The PIE reconstruction is a monumental intellectual achievement, and yet it isn’t “a language” that could be ascribed to any single speech community at any time. It’s a large set of coalescent reconstructions distributed in time and possibly in space as well. Other protolanguages, even relatively uncontroversial ones, are usually still more nebulous. If we ever manage to prove that the IE languages are related to some other established family, the reconstructed features of the common ancestor will naturally be even harder to constrain, and the protolanguage itself more elusive and fragmentary. It is hard to predict how far back in time our best reconstructive methods can take us before the notion of “protolanguage” becomes too vague to be meaningful. We can only resolve this question empirically, by putting our methods to extreme tests. If we consistently fail, it may mean that we have already reached the limit. Fortunately, there is no shortage of enthusiasts undaunted by the difficulties of long-range comparative research. Their efforts are necessary and praiseworthy, but the results so far have been rather disappointing. Only time can tell if further progress can be achieved. But even if some “superfamily” groupings eventually win general acceptance, we are still light-years away from reconstructing anything resembling Proto-World. My personal view, which I have tried to justify in this series of posts, is that linguistic lineages are too ill-defined to remain identifiable at great depths of time. Language phylogenies are not real objects but products of our analysis, with all its limitations and simplifying assumptions. They are defined from a certain point of view, relative to an arbitrarily chosen historical frame of reference.
If you try to extend them far beyond that frame, their boundaries, fuzzy to begin with, just blur away. There is therefore no way to define Proto-World properly as “a language”, let alone reconstruct its features.

But perhaps we can talk meaningfully of global etymologies without insisting on the reconstruction of Proto-World? This will be the topic of the next post.



  1. I think that we cannot define protolanguages as equivalent to most recent common ancestors. The most recent common ancestor of the Indo-European languages undoubtedly existed, and we know some things about it. But to suppose that it looked exactly like the product of reconstruction is not reasonable. In particular, reconstruction smooths away all irregularities, but the principle of uniformity in time (that our time is not special) says that the actual MRCA was just as irregular as its descendants.

  2. I agree. It's the coalescent nature of reconstructions that makes them look uniform, but the actual points of coalescence for different lexical and grammatical features are unlikely to coincide in time with each other, let alone the MRCA. My whole point is that the two things (the MRCA and the product of the comparative/internal reconstruction) should not be confused with each other.

  3. the (accidentally) unequal written attestation of the two primary branches

    I suspect that this sort of inequality is actually the most likely outcome. If I get a chance I'll write a little simulator that generates random binary trees such that each leaf node either goes extinct or generates two descendants with fixed probability. I suspect that most trees will be extremely unbalanced, with most of the descendants from one branch and only a few from the other at every level. Unfortunately I do not know a metric for unbalancedness offhand.

  4. It's what biological family trees usually look like too. There are exceptions -- e.g. the earliest amniotes split almost immediately into the "mammalian" and the "reptilian" branches, which remain more or less equally diverse after 300 million years (well, actually there are many more modern species on the reptile/bird side, but we mammals like to think that we have been especially successful and that we are living in the Age of Mammals, so there's no way we could be basal amniotes). But in most cases we end up with a large "favourite daughter" plus its less lucky "basal" sisters.

  5. I like that you draw attention to internal reconstruction's temporal ambiguity, but I'm not sure it's justified to then extend that to all reconstruction. The comparative method, more strictly applied, will provide features of the MRCA pretty much by definition. Or at least, it can't get at older features - parallel development and convergence are problems, but those are things that often leave more discoverable traces if you pay good attention to relative chronology (palatalization in Old English and Old Frisian is a good example of a development that clearly postdates separate innovations in each daughter, and so can't predate any 'Proto-Anglo-Frisian').

    This hardly removes all chronological uncertainties, since the comparative method and internal reconstruction are often both applied hand-in-hand, but I think it's not good to gloss over just how much of the MRCA comparison really can recover. And it's also only internal reconstruction that will smooth out irregularities: direct comparison of two irregular paradigms should result in the reconstruction of an irregular paradigm (maybe an acceptable example would be the reconstruction of a paradigm like *wurkjan 'to work', past *wurhtē in Proto-Germanic, where this and a few other verbs anomalously don't have a stem vowel *-i- before the preterite ending, a situation preserved directly in most of the older Germanic languages).

    The biggest problem with something like Proto-World seems to be data rather than method. If irregular features get levelled out in different ways, the only method that can be applied is the less constrained technique of internal reconstruction. This is already the case to a fairly large extent in PIE; as you say, using this itself as a basis for reconstruction gets sketchy fast.

  6. The comparative method, more strictly applied, will provide features of the MRCA pretty much by definition. Or at least, it can't get at older features...

    It provides features of the MRCA of the forms being compared. But the question arises whether the different MRCAs recovered in this way really belong to "the same language" (i.e., if they really coexisted in the same ancestral speech community). The fact that languages are non-uniform populations subdivided into varieties militates against such a view of the reconstructed protolanguage. For example, let's consider Anglo-Frisian brightening (*a > *æ). The change took place in "Proto-Anglo-Frisian", which makes it pre-English, but dialectal contrasts such as Anglian cald, calf : West Saxon ċeald, ċealf are according to some authors (Richard Hogg, for example) more parsimoniously analysed as reflecting different brightening-blocking contexts in different parts of the Proto-Anglo-Frisian dialect network. The "most recent word ancestors" *kald-, *kalβ(-r)- are in this case older than the "most recent language ancestor".

  7. P.S. I should have added that Frisian has ald, kald, half, salt, etc., like Anglian but unlike WS.

  8. Yes, dialect variation can add some uncertainty, just like parallel innovation. But I think it's fair to say that it's a kind of difficulty that the comparative method is fairly sensitive to. In the case of OE and Frisian, the data speaks directly against reconstructing a unified 'Anglo-Frisian', to which processes like palatalization had applied in a shared, single history (Patrick Stiles makes this argument in his 1995 paper). You say that *kald- and *kalv(r) are older than the most recent common ancestor of English and Frisian - but I'm not sure that's really the proper way to look at it. At least in this case, we're looking at innovations, possibly parallel and possibly genuinely connected, spreading with observable differences across meaningfully differentiated dialects - and the data tells us this is what happened. In other cases, such as with the Ingvaeonic 'Nasalspirantengesetz', an innovation seems to spread with less variation across a set of grammars (in individuals' heads ultimately, of course) without any significant observable differentiation between them. Presumably this is because it was affecting a smaller and less differentiated speech community, rather than the sprawling dialect continuum that later Ingvaeonic and then Old English would become (or, it might be better to say, that the later Ingvaeonic dialect continuum probably grew out of a smaller slice of an earlier dialect continuum).

    It seems like there is a real difference between those kinds of innovations. The success of the comparative method without too many problems in a remarkable number of cases strongly suggests that the latter situation is not terribly uncommon; the many problems it runs into indicate that the former one is also pretty common. But in cases where there _is_ a 'MRCA' (meaning small and only very lightly differentiated dialects at the most varied), the comparative method ought to get at it. Careful attention to relative chronology and the like is pretty important to making sure the comparative method works properly, but that's hardly a new idea.

  9. But in cases where there _is_ a 'MRCA' (meaning small and only very lightly differentiated dialects at the most varied), the comparative method ought to get at it.

    Well, if. When dealing with related languages we can always apply the comparative method, no matter what the MRCA looked like (and how homogeneous it was). How do we reconstruct things like the Proto-Indo-European genitive of the 'tree' word? Hittite has nom.sg. tāru, gen. GIŠ-ruwas, most likely reflecting *dor-u, *dorw-os (but the vowel of the oblique stem is concealed by the Sumerogram); Sanskrit has dā́ru, drós, as if from *dór-u, *dr-éu-s (cf. Av. dāuru, draoš); Greek has dóru, dóratos/doúratos ~ dourós. Is there any sensible way in which we can compare such paradigms directly, without resorting to internal reconstruction? And if you look at nouns like 'sun' or 'fire', there are immense complications even within Germanic, not to mention PIE.

  10. I fully agree that in cases like the PIE accentual paradigms, direct comparison simply doesn't work (not uncommon for morphology) and internal reconstruction is necessary. And that adds more doubts and chronological uncertainties to the picture. It's basically a matter of quality of data: better data allows for the application of better methods and at least somewhat more secure reconstruction of actual 'languages'. That, I think, is at least part of the point you've been making, that for Proto-World our data is not good at all, and the methodology for reconstructing it is correspondingly tenuous.
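
The simulator suggested in comment 3 can be sketched roughly as follows. This is only a minimal Python illustration under stated assumptions, not the commenter's own code: the function names (`survivors`, `minority_share`, `simulate`), the default split probability of 0.55 and the 12 generations are all hypothetical choices. Each living lineage either dies out or splits in two at every step, and the imbalance is measured only at the primary split, as the share of living descendants that belong to the smaller of the two primary branches (0.0 means one branch is extinct, 0.5 means perfect balance).

```python
import random


def survivors(generations, p_split, rng):
    """Count living lineages descended from one founder after `generations` steps."""
    alive = 1
    for _ in range(generations):
        # Every living lineage independently splits in two (probability p_split)
        # or goes extinct (probability 1 - p_split).
        alive = 2 * sum(rng.random() < p_split for _ in range(alive))
        if alive == 0:
            break
    return alive


def minority_share(generations=12, p_split=0.55, rng=None):
    """Share of living descendants in the smaller primary branch (0.0 to 0.5).

    Returns None if the whole family has gone extinct.
    """
    if rng is None:
        rng = random.Random()
    # The two primary branches evolve independently after the first split.
    left = survivors(generations, p_split, rng)
    right = survivors(generations, p_split, rng)
    total = left + right
    if total == 0:
        return None
    return min(left, right) / total


def simulate(trials=2000, seed=0, **kwargs):
    """Run many families and keep the shares of those with living descendants."""
    rng = random.Random(seed)
    shares = [minority_share(rng=rng, **kwargs) for _ in range(trials)]
    return [s for s in shares if s is not None]


if __name__ == "__main__":
    shares = simulate()
    lopsided = sum(s < 0.1 for s in shares) / len(shares)
    print(f"{len(shares)} surviving families; "
          f"{lopsided:.0%} have under 10% of their leaves in the smaller branch")
```

On the missing metric: phylogenetics does have standard measures of tree imbalance, such as the Colless index, which sums the difference in leaf counts between the two subtrees over every internal node. The root-level share above is a deliberately cruder stand-in; checking imbalance "at every level" would require keeping the full tree rather than just the lineage counts.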