29 April 2013

Backwards in a Perfect Tree

Let us pretend for a moment that borrowing between distinct languages does not take place and that the lexical stock of any language consists partly of words inherited whole and partly of new words produced by the recombination of inherited morphological elements. Thus, the genealogies of all morphemes form neat family trees that are more or less congruent with the family trees of languages. Every language has a lineage extending indefinitely far back in time. Splits are permanent: if two languages stemming from a common ancestor diverge, their shared lineage forks out into two branches which never merge again.

Under such assumptions, it is possible to apply “coalescent thinking” to languages. If today the total number of languages is about 7,000 (give or take a few hundred), and if we know, both on the basis of our historical observations and thanks to inference from language comparison, that languages often split and diverge (but they do not converge), and that it has always been so, as far as we can tell, today’s linguistic diversity can be reduced to a much smaller number of ancestral languages spoken, say, 2,000 years ago. For example, the whole Romance group (Portuguese, Spanish, Catalan, French, Italian, Romanian, etc.) collapses into the Latin language.

Does it mean that the languages of the world were less diverse 2,000 years ago than they are now? Not necessarily. Many of the languages that were spoken at the time and contributed to the contemporaneous global diversity have died out leaving no descendants. The expansion of Latin, for example, led to the mass extinction of minority languages in many regions incorporated into the Roman Empire. One can draw a long list of languages spoken in ancient Italy itself – Etruscan, Raetic, Sicel, Oscan, Umbrian, Faliscan, South Picene, Venetic, Messapic, Lepontic, Cisalpine Gaulish, Ligurian, etc. – that had succumbed to the prestige of Latin even before the Romans completed their conquest of Gaul or the Balkans. However, if we restrict our attention to the genealogies of the extant languages, the number of ancestral lineages drops at each forking-point in the family tree that we pass travelling back in time. By the time we reach the most recent common ancestor of all the modern Indo-European languages (which, by the way, should not be confused with Proto-Indo-European), more than 400 IE languages spoken today will have become one.

The rate of language diversification is very capricious and contingent on extralinguistic factors – quirks of history. There are tiny families lingering on against all odds, some with a single member (like Basque or Burushaski), and there are really vast ones (like Austronesian or Niger-Congo, which make even Indo-European look puny in comparison). The retrospective coalescence of a large family dramatically decreases the number of ancestral lineages. More than 30 Polynesian languages merge into one if you travel back 2,000 years; but at a deeper date you reach the coalescent point of some 1,200 Malayo-Polynesian languages (including the already collapsed Polynesian subgroup), not to mention a still deeper Proto-Austronesian horizon. A few such mergers – and thousands of modern languages are reduced to a much smaller number of common ancestors.
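The backward count of lineages can be made concrete with a small sketch. Everything in it is an illustrative assumption: the `Node` class, the `lineages_at` function, the five made-up languages A to E and their split dates are hypothetical and stand for no real family. The sketch only shows how the number of lineages ancestral to the extant languages drops at each forking-point passed on the way back.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    # Years before present at which this node splits into its children;
    # 0 for extant languages, which have not split.
    split_age: float = 0.0
    children: List["Node"] = field(default_factory=list)


def lineages_at(node: Node, years_ago: float) -> int:
    """Count the lineages of extant languages still distinct `years_ago`."""
    # A leaf, or a node that had not yet split at that date, is one lineage.
    if not node.children or node.split_age <= years_ago:
        return 1
    # Otherwise the split lies between that date and the present.
    return sum(lineages_at(child, years_ago) for child in node.children)


# Hypothetical toy family: names and dates are invented for illustration.
toy_tree = Node("Proto-ABCDE", 6000, [
    Node("Proto-AB", 2000, [Node("A"), Node("B")]),
    Node("Proto-CDE", 4000, [
        Node("C"),
        Node("Proto-DE", 1000, [Node("D"), Node("E")]),
    ]),
])

for years_ago in (0, 1000, 2000, 4000, 6000):
    print(years_ago, lineages_at(toy_tree, years_ago))
```

In this toy family the count falls from five lineages today to one at the root. Lineages that died out without descendants are simply absent from the tree, which is why the count says nothing about how diverse the world was at any given date.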

Diagram legend: blue marks extant languages, thicker lines mark their lineages, red marks Proto-World.

Surely, if we only had more information about the languages of the world, or more sophisticated reconstructive techniques, we would be able to increase the sensitivity and the time-depth of the comparative method, and go on reducing things. The 7,000 speech communities of today would merge into, say, 50 common ancestors several millennia ago, and perhaps eventually into one, in the same way as the genealogies of the Y chromosomes of currently living men coalesce on a single individual whom we call Y-Chromosomal Adam. Let’s call that hypothetical language “Proto-World” (see the diagram above). Of course there would have been many – possibly hundreds – other languages contemporary with it. It would only be special with hindsight: the descendants of those other languages have not survived till now, not necessarily because they were in any way inferior, but simply because they were less lucky.

Is there any chance that Proto-World as defined above actually existed and that it could be reconstructed? I don’t think so, but I’ll explain my scepticism in the next post.


25 April 2013

One Big Family? Evidence, Please

Question 2: Are all the recorded languages ultimately related?
We normally say that two languages are related if they go back to a common ancestor. But we have already seen that “common ancestry” is a tricky notion in the case of entities with easily permeable fuzzy boundaries. To say that, for example, Latin and Sanskrit had a common ancestor is shorthand for saying that the most conservative core of their lexicons consists of linguistic replicators whose genealogies can be traced back to “the same” speech community, delimited in space and time. The core is quite thick, to be sure – several hundred lexical items with cognates in the other language. They show evidence of having undergone the characteristic sound changes that have affected each lineage during its separate history, they display similar inflectional patterns, etc. They are homological structures, not mere lookalikes. But the fact remains that both Latin and Sanskrit also contain thousands of lexical units whose history is more complicated. Some have been borrowed from one branch of Indo-European into another, others come from outside the family. For example, Latin has loanwords borrowed from Etruscan, a non-Indo-European language of ancient Italy (in addition to inner-IE loans from Greek, Gaulish, the extinct Italic languages, etc.); Sanskrit in turn has numerous words apparently imported from extinct and otherwise lost ancient languages – the enigmatic linguistic substrates of Central Asia and the Indian subcontinent.
The “core” vocabulary tends to erode away with the passage of time. The branches of the family trees we reconstruct do not represent complete languages but only their most durable cores, which get thinner and thinner as we run our reconstruction back in time. Proto-Indo-European is still a solid construct because Indo-European is a vast family with some excellently documented representatives providing first-class historical evidence, but many language families are defined on a much shakier basis. Uralic is not bad at all, but its lexical core shared between the primary branches of the family amounts to something like 200 reconstructible roots. Afroasiatic, with just a few dozen uncontroversial proto-morphemes, is already a borderline case. “Relatedness” understood as shared ancestry makes sense as long as we can support it with a large number of words and morphemes showing systematic phonological correspondences. If all we can parade as evidence is a handful of imperfectly matching lexical roots and some similar-looking inflectional endings, “relatedness” evaporates and cognacy becomes indistinguishable from accidental similarity.
It is quite possible that high-frequency units for which we can predict the lowest rate of lexical replacement and the longest survival time – for example personal pronouns – may be retained via vertical inheritance long enough to suggest remote relationship between otherwise distinct language families. This might be the source of some curious cross-family correspondences like the M-T phenomenon in several language families of northern Eurasia (where the nasal /m/ tends to occur in first-person pronouns and a coronal obstruent in second-person pronouns). But such evidence, no matter how tantalising, is hardly sufficient to demonstrate a “superfamily” relationship if not backed up by a substantial amount of data to which the comparative method could be applied to rule out chance agreement.
If the applicability of the family tree model is limited in this way, perhaps we should focus on individual linguistic replicators – the stuff of which languages are made – rather than languages themselves. It could be argued that despite horizontal diffusion the genealogies of related replicators will still converge at some point in the past. Their family trees will not be isomorphic with language phylogenies, but a borrowed morpheme also has its deeper history in the language it came from. Even if the notion of “language relatedness” can’t be extended ad infinitum, it is imaginable that most replicators, whether transmitted vertically between generations of speakers or horizontally between different speech communities, eventually coalesce with their relatives in one and the same ancestral speech community somewhere in the deep prehistory of language. I see no easy way to disprove such a possibility, but I see no way to prove it either.
The kind of thing I do not trust at all
Relatedness can be tested for items with reconstructable histories, because we know what regular changes they can be expected to have undergone along the way, and what correspondences they should exhibit. Without that knowledge, anything could be related to anything else. A long sequence of phonological changes can distort a word beyond recognition; semantic shifts can change its meaning. With a little bit of imagination it’s easy to invent an arbitrary scenario relating a word in Basque to a word in Georgian, Hungarian, Sumerian, or any language of one’s choice. It’s a popular sport among amateur long-range comparatists, but it is not the way sound historical linguistics should be practised. It is wiser to admit our ignorance than to use dubious methods to get untestable results. So my well-considered answer to Question 2 is, “I have no clue”. The null hypothesis in such cases is always that A is not related to B unless there is sufficient evidence to conclude otherwise. I apologise if this attitude sounds unromantic.


23 April 2013

Too Many to Communicate

Question 1: Was there a time when all humans spoke the same language?
From what we know (or can infer) about the social life of early humans in the Middle Paleolithic period (300-30 thousand years ago), our hunter-gatherer ancestors lived in small nomadic bands, each consisting of a few dozen (20-50) individuals. Several such bands may have maintained regular contacts and converged into loose ethnic units (“tribes”) totalling a few hundred members, which gathered seasonally for collective purposes such as ritual celebrations, marital exchange, etc. In such conditions a single speech community, capable of maintaining a shared linguistic code (unified by cultural transmission), can hardly grow larger than a tribe. In effect, a cluster of allied bands corresponds to a linguistic unit as well as a cultural one (with a shared system of customs and laws). Such a model is supported by studies of modern societies retaining an archaic type of organisation, such as the Indigenous Australians. At the time of first European contact, the population of Australia was probably about 300-500 thousand (its exact size is a matter of debate, but most estimates range within those limits). It supported about 250 tribal groups, each with its own language (sometimes more than one). Many of those languages were further subdivided into fairly diverse dialects. While some languages could boast one or two thousand speakers, others had just a few hundred (alas, those that have survived till now too often have only a few). It seems reasonable to assume that the speech communities of Paleolithic hunter-gatherers did not normally exceed about 1,000 members.
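As a back-of-the-envelope check on the Australian figures, the quoted population range can be divided by the quoted number of tribal groups. The function name `mean_group_size` is a hypothetical label for this sketch. The result is a mean per tribal group, not per language, and some groups had more than one language, so it only indicates the order of magnitude.

```python
# Figures quoted in the text: estimated population at first European
# contact (low and high ends of the range) and number of tribal groups.
POPULATION_RANGE = (300_000, 500_000)
TRIBAL_GROUPS = 250


def mean_group_size(population: int, groups: int = TRIBAL_GROUPS) -> float:
    """Average number of people per tribal group."""
    return population / groups


for population in POPULATION_RANGE:
    size = mean_group_size(population)
    print(f"{population:,} people / {TRIBAL_GROUPS} groups = {size:,.0f} per group")
```

The mean comes out between 1,200 and 2,000 people per tribal group, the same order of magnitude as the ceiling of about 1,000 suggested for a single speech community.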
In order for a single language to spread over the whole human population, that population would have had to be sufficiently small and geographically restricted. Population genetics can estimate past population sizes by reconstructing the family trees of a sample of DNA sequences found in the modern population. Roughly speaking, a population bottleneck at some time in the past narrows down genetic variation and reduces the number of ancestral lineages, as if forcing the genealogies of the modern variants of numerous DNA segments to coalesce within the same period. One recent study (Li & Durbin 2011) uses coalescent simulation applied to the complete human genome to show that the so-called effective population size (N) of non-African humans seems to have dropped down to about 1,200 at some time 10-60 thousand years ago. The African effective population was also reduced, though less severely, to about 5,700. Does it mean that the global population of humans was less than 10,000?

Humans were here. Many humans!

Not quite. The term “effective population” refers to an ideal theoretical model and for a variety of reasons may seriously underestimate the actual number of humans living at the time in question (the “census population”), even by an order of magnitude or so. It is not supposed to be a demographic parameter (which lay people, and even science reporters, may not realise). The Middle Paleolithic census population of Homo sapiens was certainly much larger – in all likelihood, many times larger – than the estimated N values during what is supposed to have been the most serious demographic bottleneck in the history of our species. Whether the total number of humans was closer to 30,000 or to 300,000 is open to debate, but in any case there were far too many of them to constitute one speech community, especially if the Out-of-Africa migrants were already a separate sub-population somewhere in the Near East, the Arabian Peninsula, and possibly elsewhere in Eurasia and/or Australia (depending on the exact date of the bottleneck). It’s hard to imagine that the same language was spoken in Paleolithic Sub-Saharan Africa and South Asia, no matter how strongly the latter was affected by a demographic crash. No single language, then; at any rate not in anatomically modern humans. We have always been multilingual.
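The arithmetic behind “far too many” can be spelled out. This is a minimal sketch that takes the ceiling of about 1,000 members per speech community suggested earlier in this post as a working assumption; the name `min_speech_communities` is hypothetical. It gives only a lower bound, since real communities would often have been smaller than the ceiling.

```python
import math

# Working assumption taken from the text: a Paleolithic speech community
# did not normally exceed about 1,000 members.
MAX_COMMUNITY_SIZE = 1_000


def min_speech_communities(census_population: int,
                           max_size: int = MAX_COMMUNITY_SIZE) -> int:
    """Smallest number of speech communities needed to hold the population."""
    return math.ceil(census_population / max_size)


# The two census figures mentioned in the text as bounds of the debate.
for population in (30_000, 300_000):
    print(f"{population:,} people -> at least "
          f"{min_speech_communities(population)} speech communities")
```

Even at the low end of the debated range, the population would have had to be divided among dozens of speech communities.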

21 April 2013

And Now for Something Completely Different: Proto-World!

In the next series of posts I am going to tackle the following questions:
  1. Was there a time when all humans spoke the same language? Or, in other words, was there a time when the total population of our species (presumably restricted to some part of Sub-Saharan Africa) constituted a single speech community?
    [Too Many to Communicate]
    Athanasius Kircher, Turris Babel
    Source: Heidelberg University Library
  2. Are all the recorded languages ultimately related? Note that this question is not a rephrasing of the previous one. In the past, there could have been any number of extinct languages unrelated to those known to us. The common ancestry of the known languages would be compatible with the multilingualism of early humans.
  3. If all the known language families are ultimately related, is it possible to reconstruct some features of their most recent common ancestor (variously referred to as Proto-World, Proto-Human or Proto-Sapiens)?
  4. What shall we make of the global family trees and global etymologies already proposed by some researchers? How plausible are they?
    [Related or Similar by Chance?]
    An excursus on Eurasiatic etc.
Stay tuned, please. I’ll probably start tomorrow.

14 April 2013

Mismatch in the Family

The fact that a language can be regarded as a bundle of coevolving replicators has important consequences for the family-tree model of language evolution. The family tree of a group of languages is the sum of the genealogies of those replicators that have formed coherent bundles (or “have been in the family”) for a sufficiently long time. Any branchings in the family tree tend to correspond to branching-points in the histories of the associated replicators. But this is only a statistical effect. Individual replicators may develop competing variants in the same speech community or invade different languages across communication barriers. Let’s imagine a situation (illustrated with the diagram below) in which a speech community undergoes internal differentiation into more than two languages. If the process were abrupt, we would expect all such splits to be binary. But a speech community is normally a network of numerous local or social sub-communities. Their historical individuation as separate languages takes time and proceeds gradually; innovations spread easily between mutually intelligible dialects. We have prolonged transitional periods between “a dialect network” (say, Vulgar Latin) and “a group of languages” (say, French, Spanish, Italian, Romanian, etc.), rather than a clean separation point. Little wonder that if replicators produce variants during the dialectal period, those variants do not have to undergo neat resolution, all in the same way, in the emerging languages. Quite the opposite, a good deal of mismatch can be expected. In our example, replicator A splits into two variants: A1 and A2, and replicator B splits into B1 and B2, all of them coexisting in the same language, Proto-XYZ, which then splits into X, Y, and Z. Let’s suppose that in each of the daughter languages only one variant of A or B survives and the other is lost. The variants may well end up segregated like this:

Conflicting testimony
  • X: A1, B1
  • Y: A1, B2
  • Z: A2, B2
The distribution of {A1, A2} suggests that X and Y share a common innovation (A1) and so are more closely related to each other than either of them is to Z. But the distribution of {B1, B2} shows a common innovation (B2) suggesting a closer relationship between Y and Z (to the exclusion of X). The mutual contradiction is only apparent, though. Not every common innovation of a cluster of languages arose after the separation of their most recent common ancestor from its relatives, and so none of them individually tells us much about the subgrouping of {X, Y, Z}. If nearly all replicators behave like A and only some are like B, we shall prefer {{X, Y}, Z}, but if the evidence is less robust, we may be unable to decide between {{X, Y}, Z} and {X, {Y, Z}} (or perhaps {Y, {X, Z}}, if we take more data into account).
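The weighing of conflicting testimony can be sketched as a simple tally. This is a deliberately naive majority count, not an established phylogenetic method, and the function names `pair_support` and `preferred_pair` are hypothetical. The first data set is the example above; the second, with nine replicators behaving like A and one like B, is invented purely to illustrate the “nearly all replicators behave like A” case.

```python
from collections import Counter
from itertools import combinations


def pair_support(reflexes_by_replicator, innovations):
    """Tally, for each pair of languages, the innovations they share."""
    support = Counter()
    for replicator, reflexes in reflexes_by_replicator.items():
        innovative = innovations[replicator]
        # Languages that retain the innovative variant of this replicator.
        sharers = sorted(lang for lang, variant in reflexes.items()
                         if variant == innovative)
        for pair in combinations(sharers, 2):
            support[pair] += 1
    return support


def preferred_pair(support):
    """Return the best-supported pair, or None if the top two are tied."""
    ranked = support.most_common(2)
    if not ranked:
        return None
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


# The example from the text: A1 and B2 are the innovative variants.
text_example = {
    "A": {"X": "A1", "Y": "A1", "Z": "A2"},
    "B": {"X": "B1", "Y": "B2", "Z": "B2"},
}
text_innovations = {"A": "A1", "B": "B2"}

# Invented data: nine replicators distributed like A, one like B.
skewed_example = {f"A-like {i}": dict(text_example["A"]) for i in range(9)}
skewed_example["B"] = dict(text_example["B"])
skewed_innovations = {name: "A1" for name in skewed_example}
skewed_innovations["B"] = "B2"

print(preferred_pair(pair_support(text_example, text_innovations)))
print(preferred_pair(pair_support(skewed_example, skewed_innovations)))
```

With one innovation on each side the tally is tied and no subgrouping is preferred. With the skewed data, X and Y come out as the closer pair, which corresponds to {{X, Y}, Z}.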

As a real-world example, consider three Slavic languages: Polish, Czech and Slovene. According to handbook classifications, Polish and Czech are members of the West Slavic grouping, characterised by a cluster of shared innovations, for example the regular development of Proto-Slavic *tj, *dj into consonants traditionally transcribed *c, *ʒ (= IPA [ʦ, ʣ], a pronunciation preserved in Polish, where the spelling is c, dz; in Czech, *ʒ ends up as /z/, but *c remains /ʦ/). Slovene, a South Slavic language, shows a different phonetic development. The clusters in question are reflected as Slovene č and j (presumably via the palatal stops *ť, *ď = IPA [c, ɟ]):

*světja ‘candle’
*medja ‘boundary’: Czech meze (< *meʒa)

There are, however, other changes that tell a different story. The Proto-Slavic non-initial sequences *or, *er, when followed by a consonant, developed in different ways in different parts of Slavic. In Polish and in some other West Slavic languages the vowel and the consonant simply swapped places, yielding *ro and *re, respectively. But in the Slavic dialects ancestral to Czech and Slovak they followed the South Slavic pattern: the outcome was *ra, *rě (with vowels that can be regarded, in Proto-Slavic terms, as the tense counterparts of lax *o and *e):

*morkъ ‘twilight’
*berza ‘birch’: Polish brzoza (< *breza), Czech bříza (< *brěza), Slovene brẹza (< *brěza)

It would be fair to say that the Czech-Slovak group is on the whole “West Slavic” but “South Slavic” in several respects, including the treatment of vowel + *r sequences. As could be expected, there are other complications as well. For example, the groups *tj, *dj do not yield the same outcome everywhere in South Slavic. In Serbo-Croatian we find ć, đ (= IPA [ʨ, ʥ]), which can plausibly be derived from the same source as the Slovene variants, but in Bulgarian (as well as Old Church Slavonic) the development is highly idiosyncratic: št, žd. It is unlikely that the Slovene/Serbo-Croatian type and the Bulgarian one represent a single “Proto-South Slavic” innovation (in fact, the former seems more akin to the West Slavic development). The linguistic diversity of the South Slavic dialectal network must have been considerable even before it started to break up into separate languages, and some of the pre-split variation still persists. The same is true of West Slavic, which is hardly uniform and whose separation from South Slavic on the one hand and East Slavic on the other is not entirely consistent: the family trees of individual replicators often fail to match each other. In such cases it is difficult to represent the historical relationships among the members of the linguistic grouping as a neat phylogeny with clearly distinct branches.