08 May 2013

Related or Similar by Chance?

There are many analogies between the history of languages and biological evolution, but the differences between the two domains are also important and should be highlighted. One of them is the degree to which evolving units can be expected to remain recognisable over long periods of time (let’s assume that “long” means millions of years in biology and thousands of years in linguistics). The contrast is particularly striking if we consider biological evolution at the molecular level. How do we decide whether two fragments of DNA in the genomes of two different species have shared ancestry? The problem is by no means trivial, but it helps if the fragments in question are sufficiently long. If we align them and find a high percentage of identical base-pair sequences, it becomes likely that the partial identity is due to common origin rather than convergent evolution or pure chance.  Of course sufficient care must be taken to rule out other possible explanations. For example, a lengthy periodic sequence in which a simple motif is repeated numerous times (like CACACACACACA...) is something that may have been generated many times independently by some kind of replication slippage. Whether the similarity of two such sequences means anything depends on the context in which they occur (e.g. whether they occupy the same locus in their respective genomes).

Note that the “alphabet” of the genetic code is very simple and (almost) the same in all domains of life, and the same is true of the interpretation of the codons (the translation of nucleotide triplets into amino acid sequences). Guanine is guanine, and cytosine is cytosine, no matter if they form a GC pair in the DNA of a blue whale, an amoeba, or a forget-me-not. The unity of the genetic code leaves no interpretative leeway: two nucleotides are either identical or different, not “similar” or “almost the same”.

By contrast, morphemes and words are encoded as strings of phonological segments whose inventories differ widely from language to language. There may be universal constraints on what is permitted to function as a speech segment, but individual languages enjoy a lot of freedom in this respect. Even varieties of one and the same language (say, Scottish English, Received Pronunciation, and General American) may have quite different phonological systems. The same speech sound may play a different role in different inventories. For example, the contrast between plain [t] and aspirated [tʰ] is not employed to differentiate meaning in English (the occurrence of the two sounds is predictable from the phonetic context), but in many languages (Mandarin, Armenian, Ancient Greek, Hindi, etc.) /t/ and /tʰ/ are distinct phonemes – contrastive units perceived as different “letters” of the mental code we use to represent words in the lexicon. Because of the operation of sound change, the pronunciation of a word (and its phonological encoding) can morph into something pretty well unrecognisable in the course of several generations.

Comparing genes
A typical gene may contain about 30,000 base pairs. A typical word contains just a few phonemes. This difference in internal complexity is really significant. If there is a 50% match between the base-pair sequences of two aligned genes, the probability of this agreement being accidental is practically zero (even if they code for rather different proteins and so have different “meanings”). If two words with “the same” meaning have “the same” form in two languages (remember that “sameness” is a tricky notion in linguistics, hence the cautionary quotation marks), their identity means nothing by itself: it may well be accidental.
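To put rough numbers on that contrast, here is a minimal sketch. It rests on simplifying assumptions that are not part of the argument above: every aligned position is treated as independent, the four bases are taken to be equally frequent (so a chance match at one position has probability 1/4), and the "word" is a hypothetical four-segment form drawn from a hypothetical inventory of 30 equally likely phonemes. The function name is likewise just illustrative. It computes the binomial tail probability of matching at least half of the positions purely by chance, working in log space because the gene-sized figure is far too small for ordinary floating-point numbers.

```python
import math

def log10_binom_tail(n: int, k_min: int, p: float) -> float:
    """Return log10 of P(X >= k_min) for X ~ Binomial(n, p), computed in log space."""
    log_p, log_q = math.log(p), math.log1p(-p)
    # Natural log of each term C(n, k) * p^k * (1 - p)^(n - k), for k = k_min..n.
    terms = [
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * log_p + (n - k) * log_q
        for k in range(k_min, n + 1)
    ]
    # Log-sum-exp: factor out the largest term so the sum does not underflow.
    top = max(terms)
    total = top + math.log(sum(math.exp(t - top) for t in terms))
    return total / math.log(10)

# Assumption: 30,000 independent positions, chance match probability 1/4 each.
gene = log10_binom_tail(30_000, 15_000, 0.25)

# Assumption: a 4-segment word, hypothetical inventory of 30 equiprobable phonemes.
word = log10_binom_tail(4, 2, 1 / 30)

print(f"gene, >= 50% identical by chance: about 10^{gene:.0f}")
print(f"word, >= 50% identical by chance: about 10^{word:.1f}")
```

Under these assumptions the gene-sized figure comes out far below 10^-1000, whereas the word-sized figure is a fraction of a percent for one pre-selected pair of words, and that is before allowing any leeway in meaning or in what counts as "the same" sound. Real sequences and real phoneme inventories are not uniform or independent, so the exact values mean little; only the gulf between the two matters.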

The most powerful device in the toolkit of historical linguists, the comparative method, is not interested in lookalikes but in words displaying recurring systematic correspondences – patterns resulting from the operation of historical sequences of regular sound changes. In order to undergo such changes (and to acquire their characteristic imprint, like a certificate of origin) a word must spend enough time in the company of other words in the same linguistic lineage. It may be difficult to believe that e.g. the English word daughter (RP /dɔːtə/, Gen.Am. /dɔtɚ ~ dɑtɚ/) is related to Polish córa /ʦura/ (with the same meaning), but the relationship is more or less obvious to anyone familiar with the regular sound changes in Germanic and Slavic, and with the correspondences they have left behind. Remember: not because the words are similar (they aren’t) but because the relationship between the most conservative lexical strata of Polish and English conforms to a well-defined pattern of formal correspondences, described in detail in the linguistic literature.

Greek theós and Latin deus not only have the same meaning (‘god’) but also look very similar: /tʰ/ and /d/ are both dental stops, /o/ and /u/ are both back rounded vowels (and we know that Greek -os often corresponds to Latin -us); the remaining segments are simply identical, and there is a similar hiatus (absence of a consonant) between the vowels in each case. There is, on the other hand, no plausible common source for the initial sounds, despite their similarity. No PIE word-initial consonant (or consonant cluster) can yield /d/ in Latin and /tʰ/ in Ancient Greek. The pair violates all known patterns of correspondence between the two languages. Moreover, a careful analysis shows that there are practically certain relatives of deus in Greek (e.g. the adjective dĩos ‘divine’ and the name of Zeús, the chief Olympian god), and plausible relatives of theós in Latin (e.g. fās ‘religious law’, fēstus ‘festive’, and fānum ‘temple’). They are hardly similar to theós, but they fit nicely into the known pattern of correspondences. One could entertain other possibilities, e.g. that theós might be a loanword from an unknown IE language in which PIE *d > *tʰ; such an assumption would account for the deviation from the expected pattern. But – apart from the fact that there would be other features of theós left unexplained – ad hoc recourse to otherwise unheard-of languages with arbitrary characteristics falls foul of Ockham’s Razor. The hypothesis that deus and theós are unrelated, and that theós is instead related to fās etc., is more parsimonious, and on the whole more compelling. The respective prototypes of deus and theós are therefore reconstructed as *deiwós versus *dʰh₁sós, derived from different PIE roots. It follows that their similarity is deceptive and there is no genetic link between them.

The likelihood that two words from different languages will display both a “matching” meaning and a “matching” form quite by chance is much higher than most people would imagine, and increases dramatically every time we relax the criteria of what constitutes a match. Without any formal controls, such as those that allow us to recognise spurious cognates in lexicons with reconstructable histories, matches are a dime a dozen and have no intrinsic value as evidence. If two languages are distantly related, real cognates are often as dissimilar as daughter and córa, and we can’t identify them just by eyeballing the material.
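The effect of relaxing the criteria can be illustrated with the formula quoted from Mark Rosenfelder in the comments below, 1 - (1 - p)^m. The following is only a sketch under his simplifying assumptions: p is the chance that one candidate word happens to match in form, m is the number of candidate words that the allowed semantic leeway lets the comparer try, and the tries are treated as independent. The values p = 0.089 and m = 20 are the ones implied by the quoted passage (1 - p = 0.911); the function name and the other values of m are purely illustrative.

```python
def p_any_match(p_single: float, candidates: int) -> float:
    """Chance that at least one of `candidates` independent tries is a match."""
    return 1.0 - (1.0 - p_single) ** candidates

# p = 0.089 follows from the quoted figure 1 - p = 0.911.
# m = 20 is the quoted semantic leeway; 1 and 5 are illustrative stricter settings.
for m in (1, 5, 20):
    print(f"m = {m:2d}: P(at least one chance match) = {p_any_match(0.089, m):.3f}")
```

With m = 20 this reproduces the quoted value of roughly 0.845, and the smaller values of m show how quickly the figure climbs as more "acceptable" meanings are admitted.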

Pieces should form a pattern
To be fair, the oldest words in the IE languages have often retained enough similarity to be “visibly related” even to a layperson. After all, it was the bare resemblance of some Sanskrit words to those in Greek, Latin, etc. (famously observed by Sir William Jones in 1786) that led to the discovery of the IE family. But it was the careful application of the comparative method and the reconstruction of the common “protolanguage”, as well as the unravelling of the changes transforming it into the historically known languages, that allowed linguists to progress from impressionistic speculation to something resembling a scientific model. Many of the correspondences that seemed evident to linguists in the early 19th century  have turned out to be misleading, and lots of non-evident ones have been discovered. Despite being regular, they couldn’t be spotted by a naive observer.

[To be continued in the next post.]


  1. I've been lurking for a long time in the Cybalist mailing list. I'm glad you started this blog - it's truly interesting to follow what you write here. I'd like to know what you think about this article, which was sent out through the mailing list a couple of days ago.

    "Linguists identify 15,000-year-old 'ultraconserved words'"


    The article is based on a study which can be found here:

    Best regards,
    Hakan Lindgren

  2. Hi, Håkan, great to see you here! I suppose you can guess what I think of the PNAS article, the hype surrounding it and the quality of "science reporting" in its wake. Since I was moving in my own argument towards the question of "ultraconserved vocabulary", I'm planning to tackle some of those issues soon. Anticipating future posts, let me just say that IMHO the results of the study can only be described as rubbish. Not because the method is seriously flawed but because the input data are mostly garbage. Also, the initial assumptions are questionable (the validity of Eurasiatic is just accepted in advance, not tested). The Tower of Babel project is run by staunch long-rangers, and many of the entries in the etymological database are strongly biassed in favour of the groupings supported by the editors. Pagel et al. swallow all that, hook, line, and sinker, quite uncritically. Where the hell is the linguistic material listed and discussed? We are only told how many "matches" a given lexical meaning scores in the database (e.g. the pronoun 'thou' is alleged to be conserved in all the seven members of the putative superfamily; I'd be interested to see how they figured it out, since even the ToB tendentious etymologies do not justify such a far-fetched judgement). I'm all the more appalled to see it all published by PNAS because I thought well of the early publications of basically the same team (the 2007 Nature paper was quite promising). I'm also all in favour of interdisciplinary research, and in particular of using the evolutionary biologists' insights and methods in studying linguistic evolution. But the only use of the most recent article is that I can use it to show my students how long-range research must not be done, and why.

  3. BTW, Sally Thomason on Language Log beat everybody in the blogosphere to it.

    1. Piotr, thank you for your quick reply! The subject is fascinating to me, but it can easily lead to some wishful thinking, and as a layperson it can be difficult to judge what you read. I'll read the Language Log as well.

      What do you say about the idea that basic vocabulary may look similar in unrelated language groups because all humans associate certain concepts with certain sounds? "M" for "me" or "us", the vowel "i" for small things, etc (there are always exceptions of course, like English "big"). Is this rubbish? Plausible but never provable? Is there a name for this idea?

    2. It's called phonosymbolism (or sound symbolism), and great linguists, including e.g. Otto Jespersen and Roman Jakobson, wrote extensively about it. It's a real phenomenon, although, as usual with such fuzzy notions as the "expressive value" of a word, it's somewhat elusive. Tendencies like the preferential use of certain consonants in pronouns have been connected with phonosymbolism by some, though it's hard to explain the apparently non-random distribution of different pronoun systems in this way.

    3. Hi, folks!

      Phonosymbolism includes (but isn't limited to) the imitation of child language by adult speakers. At the phonemic level, an example of this is the extensive use in Basque of expressive palatalization to convey an affective or diminutive meaning. There are also plenty of nursery words that originated in child language.

      Also at word level, there's a very large range of imitative (onomatopoeic) lexemes. Like other words, they can evolve and pass from one language to another. For example, Latin faber 'smith' derives from an onomatopoeic root *tap- ~ *dab- imitating the hitting of metal (Pokorný's etymology is a pile of rubbish).

      Quite often, linguists confuse expressiveness, which uses phonosymbolism at phonemic level or other devices such as reduplication, with phonosymbolism at word level. So if a word has expressive features they wrongly assume it has to be of imitative origin. Although it can't be demonstrated, possibly every word is ultimately of onomatopoeic origin in a hypothetical Proto-World language, but some linguists (e.g. Coromines) tend to overuse this resource when they don't know the actual etymology.

    4. Actually, reduplication is very frequent in nursery words, so I'd consider it to be a phonosymbolic feature at word level. For example, for 'sleep' we've got Catalan non-non, French dodo (reduplicated) and Basque lo < *do (non reduplicated).

      As pointed out long ago by Bertil Malmberg, words of this kind can't be used in long-range comparisons à la Ruhlen.

    5. A few years ago Boë et al. (2008) tested the combinatorial validity of Ruhlen & Bengtson's method of finding "global etymologies". Their conclusion was that the etymologies can be explained by random chance, but they were also able to find two other "global etymologies" for 'mother', in addition to R&B's AJA, by replicating the same procedure. They were -- guess what? -- MAMA and ANA, each supported by more than six families from at least three continents (and sometimes co-occurring in one and the same family). Here at least it's a safe bet that the result is not quite random but reflects children's universal articulatory preferences. I have little doubt that there were similar 'mother' words in at least some Middle Paleolithic languages, which of course doesn't mean that today's mamas derive directly from any of them.

  4. Haven't you written something about this earlier on Cybalist - demonstrative pronouns ("this", "that") often have similar sounds, as if those sounds were more or less natural for humans when we want to point to something?

  5. PS - your learning and your willingness to share what you know has made a lasting impression on me.

  6. There is phonosymbolism that is restricted to a specific group of languages, such as the nursery reduplication that Octavia is speaking of, and then there is something more general - not universal, just general.

    One example is consonant symbolism in which the impressionistically "thicker" or "heavier" consonant indicates a more forceful action or greater effect. So in Lakota there are triplets of verbs where the "lighter" sounding action gets progressively "heavier". I don't remember this example exactly, but there is one of the groups that includes a word 'ʃloka' in which the first verb means to poke a hole in something, the second means to router a hole out wider and the third means something more extensive, I forget what.

    You see some of this with gendered words. In this case the general pattern is that a more front vowel or a "sharper" sounding consonant is going to indicate female gender. In Lushootseed the demonstratives (the only part of the language that shows gender at all) have an alternation 'ti' M / 'tsi' F. In Manchu there's an alternation between 'a' and 'i' on the verbs. In Mandarin if a name has retroflex initials the chances are it's a male name, whereas palatal initials are for girls' names. (The actual meaning of the word used to form the name can override this, and of course it doesn't really apply with the other consonantal series.)

  7. I love the way Jespersen put it (in a very evolutionary way) in Language, Its Nature, Development and Origin XX.11:

    Sound symbolism, we may say, makes some words more fit to survive and gives them considerable help in their struggle for existence.

    Note that if phonosymbolism is adaptive, it should result in a lot of convergence.

  8. "The likelihood that two words from different languages will display both a “matching” meaning and a “matching” form quite by chance is much higher than most people would imagine, and increases dramatically every time we relax the criteria of what constitutes a match. "

    Mark Rosenfelder (a programmer with some background in linguistics) covered this in detail in 1998, including a statistical model to calculate how many "random" matches one could expect between two purely unrelated languages.

    Given this semantic leeway, the probability of a match on a single Quechua word is 1 - (1 - p)^m = 1 - 0.911^20 = .845. That's quite telling right there-- it means that, given his phonetic and semantic laxness, the comparer is ordinarily going to find a random match for almost every Quechua word.

  9. So finding "regular correspondences" in genetic data is possible simply because of the huge number of base pairs, and hence the small probability of long strings of identical base-pair sequences arising by chance? And this is without controlling for e.g. the adaptive value of certain sequences?

    In comparative linguistics, the strings are first of all small, so the probability of correspondences arising by chance is higher. But we also rely on the arbitrary nature of form-meaning pairs, and hence typically exclude cases of obvious onomatopoeia from consideration when determining cognates. In genetics, I understood that the "comparative reconstruction" is usually only done with the "junk DNA", i.e. the parts of the genome that have no biological function, and hence no adaptive pressure that might cause certain sequences to arise non-randomly.

  10. Of course adaptive phenotypic convergence is frequent and adaptive molecular convergence is possible and well-documented. It ought to be controlled for in phylogenetic analyses. I said the problem was far from trivial. Neutrally evolving DNA is important in molecular dating, since the rate of mutation is roughly constant in the absence of adaptive pressures, but of course you also study the evolution of functional DNA, in particular genes (coding sequences). They are so long (tens of thousands of base pairs) that adaptive convergence can affect them only to a small extent, but it is a real problem.