Language Evolution: Hypotheses Aren’t Facts

12 May 2013

When biologists prepare the data to be analysed by a phylogenetic inference program, they may use morphological characteristics of organisms (or of their fossils), appropriately measured and encoded as a data matrix. The choice of the characters to be analysed is important and difficult, and often controversial. When the family tree of a DNA or RNA region is to be assembled, sequences of nucleotides are treated as data. The phylogeny of a family of proteins is inferred from their amino acid sequences. In every case the data come from empirical observations and measurements. One can include living and fossil taxa in the same phylogenetic analysis, provided that the respective datasets are of the same kind. One can even integrate morphological and molecular data for living taxa with only morphological data for extinct taxa (for which molecular data are not available), and still get valid results.
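To make the idea of a combined data matrix concrete, here is a minimal sketch in Python. Everything in it is invented for illustration: the taxon names, the two morphological characters, the four-letter "sequences", and the helper `to_rows` are assumptions of mine, not real measurements or any particular program's input format. The only point is that a fossil taxon can sit in the same matrix as living ones, with its unavailable molecular characters coded as missing.

```python
# Toy character matrix: rows are taxa, columns are characters.
# All names and codings below are invented for illustration only.
MISSING = "?"

morph_characters = ["tail_present", "incisors_ever_growing"]  # hypothetical characters

matrix = {
    "living_A": {"morph": [1, 0], "dna": "ACGT"},
    "living_B": {"morph": [1, 1], "dna": "ACGA"},
    # The fossil has one unscorable morphological character and no DNA at all.
    "fossil_C": {"morph": [1, MISSING], "dna": None},
}


def to_rows(matrix, dna_length):
    """Flatten each taxon into one row, padding unavailable data with '?'."""
    rows = {}
    for taxon, data in matrix.items():
        morph = [str(state) for state in data["morph"]]
        if data["dna"] is None:
            dna = [MISSING] * dna_length
        else:
            dna = list(data["dna"])
        rows[taxon] = morph + dna
    return rows


rows = to_rows(matrix, dna_length=4)
for taxon, row in rows.items():
    print(taxon, " ".join(row))
```

Every cell that is not a question mark stands for something somebody observed or measured; the question marks stand for what nobody could observe.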

The Homo/Mus MRCA, an early restoration
(Disney 1928)

One can’t, however, replace real datasets with reconstructed ones derived from hypothetical taxa. Why? Precisely because hypotheses are hypotheses, not observations. You cannot measure the bones or count the teeth of a hypothetical animal, even if you have a fairly accurate idea of what it must have looked like. A scientific hypothesis is always tentative and open to revision or rejection. A family tree, no matter how well-supported, is also a hypothesis. If your analysis suggests that Homo sapiens and the house mouse (Mus musculus) are more closely related to each other than either is to the cat or to the horse, you can predict (or rather retrodict) what the most recent common ancestor of mice and men should have looked like, and in what way it must have been different from the common ancestor of cats and horses. Evolutionary trees feature numerous hypothetical taxonomic units with hypothetical morphological or molecular traits. They are very useful in making testable predictions, but they do not count as proven facts. New empirical evidence may make us revise the whole family tree, and many of our hypothetical taxa will then vanish along with the old tree. In any case, whatever we can infer about them always comes from the interpretation of observable, empirical data.
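A small sketch may show how completely a hypothetical ancestor depends on the observed tips plus the assumed tree. The post names no reconstruction method; I am using the bottom-up pass of Fitch parsimony here simply because it is short. The four taxa A to D, their character states, and the function name `fitch` are all invented for illustration.

```python
def fitch(tree, tip_states):
    """Bottom-up Fitch pass on a binary tree written as nested tuples.

    Returns (possible states at the root of `tree`, minimum number of changes).
    """
    if isinstance(tree, str):  # a tip, i.e. an observed taxon
        return {tip_states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch(left, tip_states)
    right_set, right_changes = fitch(right, tip_states)
    shared = left_set & right_set
    if shared:  # the children agree on at least one state: no extra change
        return shared, left_changes + right_changes
    # The children disagree: keep both options and count one change.
    return left_set | right_set, left_changes + right_changes + 1


# One binary character scored for four invented taxa (the only "observations").
tips = {"A": 1, "B": 1, "C": 0, "D": 0}

tree_1 = (("A", "B"), ("C", "D"))  # hypothesis 1: A and B are sisters
tree_2 = (("A", "C"), ("B", "D"))  # hypothesis 2: A and C are sisters

print(fitch(("A", "B"), tips))  # ancestor of A+B under hypothesis 1
print(fitch(("A", "C"), tips))  # ancestor of A+C under hypothesis 2
print(fitch(tree_1, tips)[1], fitch(tree_2, tips)[1])  # changes required by each tree
```

Under the first tree the ancestor of A and B is reconstructed with state 1. Under the second tree that ancestor does not exist at all, and the node that replaces it is ambiguous between 0 and 1. The tips never changed; only the hypothesis did.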

The same data may be compatible with an astronomical number of alternative models. Competing analyses yield different topologies of the family tree and different inferences about the hypothetical ancestral nodes. Those inferences implicitly encode the structure of a given tree, and so are not new independent “facts”. There would be no ancestral nodes if we had no phylogenetic hypothesis in the first place. We choose a family tree whose branches appear to be robustly supported and tentatively regard it as “the best”, but there is no universal metric to guarantee that our currently preferred hypothesis is optimal in some absolute sense. We can only say that we have arrived at a reasonably accurate reconstruction of past reality given our present knowledge (until falsified by new evidence).
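To give a rough feeling for "astronomical", one can count only the possible branching patterns and ignore everything else a model may contain (branch lengths, rates, and so on). For n labelled taxa the number of distinct rooted, strictly bifurcating trees is the double factorial (2n − 3)!! = 1 · 3 · 5 · … · (2n − 3), a standard combinatorial result. The function name below is my own.

```python
def rooted_tree_count(n_taxa):
    """Number of rooted binary tree topologies for n labelled taxa: (2n - 3)!!"""
    if n_taxa < 1:
        raise ValueError("need at least one taxon")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 3 (an empty product for n <= 2).
    for k in range(3, 2 * n_taxa - 2, 2):
        count *= k
    return count


for n in (3, 4, 5, 10, 20, 50):
    print(n, rooted_tree_count(n))
```

Three taxa can be arranged in 3 ways, four in 15, five in 105, and ten in 34,459,425. With twenty taxa the count already exceeds 10^21, which is why tree-search programs rely on heuristics rather than checking every possibility.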

What does it all have to do with linguistics (and with the PNAS article)? I will get to that in the next post, soon.
