31 January 2013

Viral Stuff: The Top 100 Words We Use

The Oxford English Corpus is such stuff as Oxford Dictionaries are made on. It contains texts collected mainly from innumerable Internet sites. Their total length so far is about two billion words. The texts represent different varieties of English, different genres, styles and registers; they all come from present-day English (from the year 2000 onwards) and are supposed to be representative of the current state of the language. Here are the 100 commonest words in that vast material:

1     the
2     be
3     to
4     of
5     and
6     a
7     in
8     that
9     have
10    I
11    it
12    for
13    not
14    on
15    with
16    he
17    as
18    you
19    do
20    at
21    this
22    but
23    his
24    by
25    from
26    they
27    we
28    say
29    her
30    she
31    or
32    an
33    will
34    my
35    one
36    all
37    would
38    there
39    their
40    what
41    so
42    up
43    out
44    if
45    about
46    who
47    get
48    which
49    go
50    me
51    when
52    make
53    can
54    like
55    time
56    no
57    just
58    him
59    know
60    take
61    people
62    into
63    year
64    your
65    good
66    some
67    could
68    them
69    see
70    other
71    than
72    then
73    now
74    look
75    only
76    come
77    its
78    over
79    think
80    also
81    back
82    after
83    use
84    two
85    how
86    our
87    work
88    first
89    well
90    way
91    even
92    new
93    want
94    because
95    any
96    these
97    give
98    day
99    most
100   us

A few facts are worth noting.

Almost all these words are “native” in the sense that they continue forms inherited from Old English. Most of them can be traced back in time still further, to Proto-Germanic, and quite a large number have their roots in Proto-Indo-European, the most distant reconstructible ancestor of English. Only four of them (just, people, use, and the second syllable of because) are Old French loanwords (first attested in 14th-century documents). A few are of Old Norse origin: notably, the 3pl. personal pronouns they, their, and them, but also want, and possibly take, while get and give owe at least their initial /ɡ/ to Old Norse influence (the closely related Old English verbs began with the palatal glide /j/). The Old Norse loans were taken from the Scandinavian settlers in the Danelaw area, presumably between 800 and 1200. The remaining items (ca. 90% of the list) have “always” been English.

This illustrates the rule that the more common a word is, the less likely it is to undergo lexical replacement [see: Frequency of word-use predicts rates of lexical evolution throughout Indo-European history]. If we looked instead at the entire lexicon of present-day English, we would find that relatively recent borrowings from foreign languages, most often Latin or French, account for at least some 80% of the vocabulary. That’s because rarely used words are much more likely to be substituted.
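The "ca. 90%" figure for inherited words can be checked with a quick tally. This is a minimal sketch in Python that uses only the loanwords named above; the set and function names are made up for illustration, and whether get and give count as borrowed is left as a choice.

```python
# Loanwords named in the post; everything else in the Top 100 is treated as inherited.
OLD_FRENCH = {"just", "people", "use", "because"}
OLD_NORSE = {"they", "their", "them", "want", "take"}  # "take" is only "possibly" Norse
NORSE_INFLUENCED = {"get", "give"}  # native verbs whose initial /g/ is due to Norse
LIST_SIZE = 100


def native_share(*borrowed_groups):
    """Return the fraction of the list NOT covered by the given loanword groups."""
    borrowed = set().union(*borrowed_groups)
    return (LIST_SIZE - len(borrowed)) / LIST_SIZE


# Strict count: only outright loans are excluded.
print(native_share(OLD_FRENCH, OLD_NORSE))
# Generous count: get and give are excluded as well.
print(native_share(OLD_FRENCH, OLD_NORSE, NORSE_INFLUENCED))
```

Either way of counting lands within a point of 90%.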

Many of the most common items are not content words that indicate things, ideas, actions, states, etc., but function words that mean little or nothing by themselves. They join content words to modify their meaning, express grammatical relationships, glue the sentence together, and facilitate discourse. They include articles, pronouns, conjunctions, prepositions, simple adverbs, auxiliary and modal verbs, quantifiers, and miscellaneous “particles”. Nearly all the words in the top half of the list above (ranks 1–50) are of this kind (the only exceptions being say, get, and go, whose meaning is not particularly specific either). The “Top 100” words are extremely successful replicators: practically every sentence must contain a few of them. Their occurrences make up about 50% of the total material in the Oxford English Corpus!
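A coverage figure of this kind is easy to compute for any text. The sketch below is in Python and is not the method used by Oxford Dictionaries; it assumes the text has already been split into tokens (and, to match the list above, reduced to dictionary forms such as be and have). The function name and the sample sentence are invented for illustration.

```python
from collections import Counter


def top_n_coverage(tokens, n):
    """Return the share of all tokens accounted for by the n most frequent types."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(count for _, count in counts.most_common(n))
    return covered / total


# Toy sample: 8 tokens, of which "the" (3) and "and" (2) are the two commonest.
sample = "the cat and the dog and the bird".split()
print(top_n_coverage(sample, 2))  # 5 of 8 tokens
```

On a real corpus the same function, called with n=100, gives the proportion of running text made up of the hundred commonest words.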

29 January 2013

The Little Lambs Who Lost Their Way: Lexical Exceptions

Consider the following Old English words: gān ‘gone’, clāþ ‘cloth’, brād ‘broad’. They belonged to the same lexical set as OE gāt ‘goat’, and we would expect them to have evolved like the rest of the GOAT set, since they do not share any characteristic subregularities with any recognised “minority flock”. Even the spelling of gone and broad (similar to that used in stone and goat, respectively) suggests that they were still members of the GOAT set at the time when the modern orthographic conventions were becoming fixed. And yet they have parted company with other words containing OE ā. Broad has joined the CAUGHT set (with Modern English /ɔː/, as in cause), while the other two vary between CAUGHT and LOT (Modern English /ɒ/ or its unrounded counterpart /ɑ/ as in American dialects). Note also that while OE sc(e)ān ‘shone’ yields the expected outcome /ʃoʊn/ in America, the normative British pronunciation is /ʃɒn/, with a shortened vowel.

Such cases are truly irregular and call for individual explanation. We know that the shortening of the vowel of clāþ cannot date back to Old English (OE “claþ” would have become Modern English “clath”). OE ā produced a mid-low rounded vowel /ɔː/ (conventionally spelt ǭ to distinguish it from other O spellings) after the Norman Conquest, during the Middle English period. Indeed, the word was very often spelt clothe, clooth or cloothe in Middle English, apparently indicating a long-vowel pronunciation. Note that the OE plural clāþas has normally developed into Mod.E clothes, with /oʊ/ (the th may be mute, but that is another story). Today, however, clothes is no longer regarded as the plural of cloth, but rather as an independent collective noun (a case of word duplication!). The distribution of the modern pronunciations of cloth points to an early shortening of Middle English /ɔː/, as a result of which the word joined the LOT set. Then, in some (but not all) mainstream accents of Modern English, the short vowel was affected by the lengthening heard in moss, cost, lost, frost, moth, often, off, cough, etc., induced by the following voiceless fricative.

The development of broad must have been different, since the word does not show a short vowel in any major accent, and the final consonant is not a voiceless fricative. When the Great Vowel Shift of the 15th century transformed ME /ɔː/ into Early Modern English /oː/ (diphthongised to /oʊ/ in most contemporary varieties of English), one stray sheep left the flock as its vowel underwent an irregular lowering (for reasons that elude us). That lowered pronunciation merged with the new /ɔː/ that resulted from the smoothing of the diphthong /aʊ/ after the Great Vowel Shift (in such words as daughter, caught, law, cause, and drawn).

Gonna be gone
Perhaps there was another sheep of the same contrary disposition, since the long vowel of gone in the accents that rhyme it with drawn is best explained in the same way. Why do we find /gɒn/ ~ /gɑn/ as well? It’s hard to say at which historical stage the shortened variant originated. It could have appeared before the Great Vowel Shift, immediately after it, or still later, with the same result. It is quite possible that it has arisen many times. It is worth observing that high-frequency verbs often display irregular phonetic simplification, possibly because sloppy pronunciations are easier to tolerate in words more or less predictable from the context. Note the similarly unexpected short vowel of says and said, does and done, as well as been (pronounced like bin in American English). Been, said, does, done, says, and gone (in that order) are all among the 500 most frequently occurring English word-forms.

I will return to this interesting correlation between frequency of use and erratic behaviour (which usually consists in some kind of phonological erosion – the shortening, reduction or loss of speech segments).

28 January 2013

Flocking Behaviour: The Regularity of Sound Change

Sound change is amazingly regular. If different words contain the same sound (especially in the same context), the sound will be similarly affected by historical changes. As an example, consider Proto-Germanic /ai/, which regularly yields Old English /ɑː/, Middle English /ɔː/, and Modern English /oʊ/ (with the usual proviso concerning slightly different dialectal developments).

PGmc. /ai/
OE /ɑː/
ME /ɔː/
Mod.E /oʊ/

You could say that words containing the same sound (or phoneme, to use a technical term) prefer to evolve together in the same way, rather than independently. They behave like a flock of sheep moving in the same direction. The full picture is actually more complicated. Lexical sets may have subsets in which a more specific phonetic environment affects the pronunciation, giving rise to a subregularity.

  • For example, PGmc. *klaini- developed into pre-Old English *klāni-, but the *i in the second syllable caused an assimilatory change known as i-umlaut: the back vowel /ɑː/ changed into front /æː/. OE clǣne eventually developed into Modern English clean /kliːn/; and that is the regular development whenever i-umlaut applies.
  • If OE /ɑː/ (ME /ɔː/) was followed by /r/, the vowel failed to develop a u-like glide in mainstream Modern English, ending up as /ɔː/ instead, as in boar from OE bār.

In such cases, a smaller group becomes isolated from its mother flock and follows its own course. They often join another flock. Thus, clǣne and other umlauted forms joined the OE words containing /æː/ of other origin. In Middle English times they merged with still other lexical sets: words containing the Old English diphthong ēa (pronounced /æːɑ/), and words with an OE short /e/ that underwent lengthening in Middle English before a single consonant followed by a weak vowel (as in OE mete ʻfoodʼ). They formed one mighty herd of words with Middle English /ɛː/ > Modern English /iː/. Can you guess which Modern English word is the descendant of OE mete? [see bottom right].
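The flock metaphor can be put as a toy lookup in Python: the outcome is predicted from the phonetic environment, never from the individual word. The table encodes only the three developments described above (goat, boar, clean); the environment labels and function names are hypothetical, and real sound changes are of course not applied by table lookup.

```python
# Vowel at each stage for words with PGmc. /ai/, keyed by phonetic environment.
# Only the three developments described in the post are encoded here.
STAGES = ("PGmc.", "OE", "ME", "Mod.E")
VOWEL_HISTORY = {
    "default": ("ai", "ɑː", "ɔː", "oʊ"),   # the main flock, e.g. goat
    "before_r": ("ai", "ɑː", "ɔː", "ɔː"),  # no u-like glide before /r/
    "i_umlaut": ("ai", "æː", "ɛː", "iː"),  # fronted by *i in the next syllable
}

# Modern word -> environment label (labels are hypothetical, for illustration).
WORDS = {"goat": "default", "boar": "before_r", "clean": "i_umlaut"}


def modern_vowel(word):
    """Return the Mod.E vowel predicted for a word from its environment alone."""
    return VOWEL_HISTORY[WORDS[word]][-1]


def describe(word):
    """Format the whole chain of vowels, stage by stage."""
    history = VOWEL_HISTORY[WORDS[word]]
    return " > ".join(f"{stage} /{vowel}/" for stage, vowel in zip(STAGES, history))


for word in WORDS:
    print(f"{word}: {describe(word)}")
```

Adding a new word means assigning it to an environment, not writing a new rule, which is the sense in which the change is regular.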

Flocking behaviour (induced by a stimulus)
The regularity of sound change (at least in macroevolutionary time-scales) is extremely important in historical linguistics. It allows us to demonstrate, for a pair of languages, that they derive from a common ancestor: if they do, regular sound changes leave a statistically significant pattern of correspondences between their vocabularies. The expectation of regularity also helps us to distinguish evidence of common ancestry from the effects of linguistic borrowing, and from chance similarities. This remains true even if we realise that the regularity of sound change is not always perfect. In further posts I shall try to show when and why we may expect words to stray from their flock and travel on their own.

27 January 2013

Of Shades and Shadows: Duplication and Divergence

Words occasionally undergo duplication. This may happen in an inflected language if different grammatical forms of the same lexical item accumulate so many differences that speakers start regarding them as representing different units. For example, English shade and shadow used to be the same word, Old English (OE) sceadu, a feminine noun with oblique forms such as gen.sg. sceadwe (or sceaduwe, with a “parasitic” vowel). This inflectional pattern was somewhat irregular and restricted to just a handful of feminines (which happened to share some peculiarities explained by their earlier history). A speaker of Old English would normally have expected a genitive form like *sceade, without a superfluous /w/. It must have been tempting to re-interpret sceadwe as a case-form of a different (though closely related) word. In Middle English (ME) times we already have two different lexical items, shade and shadwe (the latter with spelling variants such as shadewe or shadowe). Although virtually synonymous (their meanings overlapped more than they do today), they were clearly two separate entries in the ME lexicon.

Shade and shadows in the streets of Vienna
The contrast soon increased as ME /a/ became lengthened whenever it was followed by a single consonant plus a reduced vowel (which the final -e was at the time). Eventually the weak vowel dropped out, but the lengthening persisted. Early in the 15th c. the pronunciation of shade was /ʃaːd/. As a result of the so-called Great Vowel Shift, which operated in the following decades, it developed into /ˈʃɛːd/, the ancestor of the modern forms (/ˈʃeɪd/ in the mainstream varieties of English, but with a number of dialectal variants). Meanwhile, the variant shadwe kept an inherited short vowel, becoming Modern English /ˈʃædəʊ/ (or something similar, depending on the accent). Today, both shade and shadow double up as verbs (to shade versus to shadow), and they did so already in Chaucer’s times. OE sceadwian ‘cover with shadow’ can only account for to shadow, so to shade must be an innovation based on the shorter variant of the noun.

A pair of related synonyms is unlikely to survive in the long run. One of them will eventually outcompete the other either because speakers have a reason to favour it consistently or because (even if there is no consistent bias) in a finite-size speech community random drift sooner or later eliminates one redundant variant. But there is a way in which duplicated words can escape extinction: they may develop different meanings or different grammatical functions. Today shade means ‘darkness caused by the screening of light’, ‘something that shuts out light’, or ‘degree of a colour, nuance (of anything)’, while shadow means ‘dark image cast by a body that blocks light’ or, figuratively, ‘hint, trace’ (as in the shadow of a doubt). They are rarely interchangeable. They have escaped direct competition by undergoing functional specialisation. That’s why they are still alive, a thousand years after they started drifting apart.
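The claim about random drift can be illustrated with a small simulation. The sketch below, in Python, borrows the Wright-Fisher model from population genetics as a stand-in; that choice is an assumption of this example, not something the argument above depends on. Each speaker in a new generation copies the variant of a randomly chosen member of the previous one, with no bias towards either synonym. The function name and parameters are hypothetical.

```python
import random


def drift_until_fixation(pop_size, initial_a, rng, max_generations=100_000):
    """Simulate unbiased copying until only one variant is left.

    Returns (speakers using variant A at the end, generations elapsed).
    """
    count_a = initial_a
    for generation in range(max_generations):
        if count_a in (0, pop_size):
            return count_a, generation  # one variant has been eliminated
        share_a = count_a / pop_size
        # Every new speaker picks variant A with probability equal to its current share.
        count_a = sum(1 for _ in range(pop_size) if rng.random() < share_a)
    return count_a, max_generations  # gave up before either variant won


# 50 speakers, evenly split between the two synonyms at the start.
final_a, generations = drift_until_fixation(50, 25, random.Random(2013))
winner = "A" if final_a == 50 else "B"
print(f"variant {winner} took over after {generations} generations")
```

Which variant wins depends on the seed, but with no bias at all one of them still ends up alone, and larger communities simply take longer to get there.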

Duplication followed by divergence is a common motif in linguistic evolution, and I intend to return to it soon.

25 January 2013

The Meaning of ‘Language Evolution’ (3): Microevolution, or the Variation and Change Around Us

Do you pronounce often with the vowel of lot or that of sauce? (or do you speak an accent in which lot and sauce contain the same vowel?). Do you say off’n with no /t/ even when speaking slowly and distinctly? Which syllable of controversy do you stress? Which plural form of fungus do you prefer: fungi or funguses? and if the former, do you pronounce it with /ŋɡ/ (as in anger) or /nʤ/ (as in angel)? with final /iː/ or /aɪ/?¹ Would you say different from others, different than others, or perhaps different to others? A toilet at a petrol station or a restroom at a gas station? A frying-pan or a skillet? Let’s not argue or don’t let’s argue? Do inflammable things burn or not? And if they aren’t inflammable, would you call them flammable or non-inflammable?

Some vowels on the move
There’s clearly a high level of variation within English, and the same is true of any language with a substantial number of speakers. People may use different pronunciations, different inflected forms, different grammatical constructions, different words for the same concept, and the same words with different meanings. English has several national varieties and a large number of local dialects. What is the source of that enormous diversity? How do new variants come into existence? What happens to variants that co-occur in the same dialect? Do their relative frequencies of use change in the course of time or reach an equilibrium? How is variation related to language change?  Are all variants equal or are some of them preferable to others?

The emergence of variation inside speech communities and the way variation changes over time are fundamental processes making large-scale language change possible. Let’s call those processes linguistic microevolution.
¹ This is a linguistic blog, so some special symbols will be used from time to time. Please make sure that the fonts in your browser support the International Phonetic Alphabet. Phonetic characters may not be visible in older versions of commonly used fonts, in which case it is preferable to select e.g. Arial Unicode MS rather than Arial as the default sans-serif font.

23 January 2013

The Meaning of ‘Language Evolution’ (2): Macroevolution, or the Rise and Fall of Languages

We may speak of ‘language evolution’ ignoring the origins question. Note that while language (uncountable), meaning a type of communication system (and the human ability to use it), is part of our biological endowment and works in a similar way in all human groups, there are thousands of very different individual languages (countable), most of them subdivided into regional or social dialects. What is the source of this variety? Has it always been there, or was there a time when all humans spoke one language?

Languages are transmitted from generation to generation of users. The transmission is cultural, not biological. Infants acquire or pick up their first language from their social environment (starting with their close family and caretakers, then drawing upon other sources of linguistic input). As opposed to formal learning, acquisition happens naturally and involves no explicit teaching. Children learn the vocabulary of their native tongue, work out its grammatical rules and master all the subtle ‘language-games’ connected with the social uses and functions of language.

Every language has its population of users – a historically continuous ‘speech community’ which maintains its code of communication through generations. We know that languages change in the process. The change is slow but inevitable. If a speech community splits in two or more parts, e.g. as a result of migration, different changes accumulate in the resulting groups of speakers, making communication between them increasingly difficult. Thus an originally common language produces different ‘daughters’, separated by communicative barriers.

As we examine the living or historically attested languages, we discover that many of them can be grouped into families – sets displaying structural correspondences that point to shared descent (a common ancestor in the past). For example, French, Spanish, Italian, Romanian and other Romance languages derive historically from the colloquial varieties of Latin once spoken across the Roman Empire. As the Empire collapsed and broke down into smaller and relatively isolated political units, spoken Latin was geographically fragmented, giving rise to numerous local languages.

The offspring of Latin (somewhat idealised and simplified)

While new languages emerge in this way, other, less lucky ones (and even whole families) die out – sometimes because a speech community is wiped out by a natural disaster or by inter-ethnic violence; more usually, because their original speakers have abandoned their old code, adopting instead one imposed by some other group. It is easy to see that the rise and fall of languages depends crucially on external (non-linguistic) factors: the history of human migrations, conquests, accidents of political and cultural history, vagaries of social prestige, etc. Let’s call those long-term and large-scale processes linguistic macroevolution. It is the story of language differentiation and extinction, the origin of linguistic families, results of contact between different speech communities, etc.

22 January 2013

The Meaning of ‘Language Evolution’ (1): The Misty Origins

When we say ‘language evolution’, we may mean a number of things. First, we may mean the origins of language. As a type of communication based on articulate speech, language has been around for a very long time. It may be as old as our species (ca. 200,000 years), if not older. We are biologically equipped for speaking and for understanding speech. To produce and process spoken messages we need complex anatomical, neural and cognitive prerequisites. They are roughly the same in all healthy humans but set us apart even from our closest cousins – chimps and bonobos. Although at present we cannot tell with any accuracy when those special prerequisites first appeared, they are too complex to have sprung into existence overnight as a result of, say, a yet unidentified genetic mutation that suddenly (and quite miraculously) enlarged and rewired the human brain, affected the configuration of the movable organs of speech (plus their neuromuscular control), dramatically improved the processing of auditory information, etc. – all at once.

Homo ergaster (Africa, 1.5 million years ago)
struggling to speak
It is a safe bet that the biological development of the language faculty proceeded stepwise and required a number of gradual improvements as well as enough time for genetic innovations responsible for those improvements to become fixed in early human populations. Language as a means of communication became fully developed and fine-tuned so long ago that direct evidence for most of its early history is beyond our reach. Writing was invented just a little more than 5,000 years ago (initially in Egypt and Mesopotamia), which means that the tangible ‘fossil’ record of ancient languages spans no more than about 2.5% of the history of anatomically modern Homo sapiens. Sophisticated methods of linguistic reconstruction can take us back slightly farther than that, but even so we can only scratch the surface to expose the shallow layers just underneath it.

So here’s the first meaning of ‘language evolution’: speculation about the prehistoric emergence of language (language in general, as opposed to individual languages) and the study of its biological underpinnings.