It may seem too much to expect that a person’s geographic origin can be determined from a DNA sample. But, thanks to a mathematical technique called principal component analysis, this can be done with remarkable accuracy. It works by reducing multi-dimensional data sets to just a few variables.
We live in the age of “big data”. Voluminous data collections are mined for information using mathematical techniques. The data may be assembled in a matrix – a rectangular array of numbers – with a column for each individual and a row for each variable.
For a medical database, the variables might be age, height, weight, blood-group and numerous other relevant factors, resulting in a very large matrix. We can examine small tables of numbers visually and detect interesting patterns but, with many variables, each requiring a separate dimension, simple inspection may reveal nothing of value.
A simple example illustrates dimension reduction. Suppose we let the two axes of a graph measure height and weight. Taller people are usually heavier than shorter ones, so these two variables are not independent; they are correlated. Each individual is represented by a point, and all the points form a cloud. The cloud is not round in shape, but elongated. We can find a straight line through the centre of the cloud in the direction of elongation, so that all the points lie close to this line. Thus, the essentials of the two-dimensional cloud are captured in the one-dimensional line.
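The height-and-weight example above can be sketched in a few lines of Python. The data here are simulated (the numbers for mean height, weight and their relationship are invented for illustration); the direction of elongation is the leading eigenvector of the covariance matrix of the centred data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: 500 people, height in cm and a weight in kg that
# increases with height plus some scatter (illustrative values only).
height = rng.normal(170, 10, 500)
weight = 0.9 * height - 90 + rng.normal(0, 5, 500)

# Centre the cloud of points and find its direction of elongation:
# the eigenvector of the 2x2 covariance matrix with largest eigenvalue.
X = np.column_stack([height, weight])
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / (len(X) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)
direction = eigvecs[:, -1]          # eigh sorts eigenvalues ascending

# Projecting each point onto this line gives one number per person:
# the two-dimensional cloud reduced to a single dimension.
scores = Xc @ direction

# Fraction of the total variance captured by the line.
print(eigvals[-1] / eigvals.sum())
```

With these invented numbers, the single line captures over 90 per cent of the variation in the cloud, which is what makes the reduction from two dimensions to one worthwhile.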
Problems are much tougher to solve in higher dimensions; this is called “the curse of dimensionality”. Dimension reduction is essential in big data science. Interesting features can often be captured by isolating a few key combinations of variables. What is the best way to represent data so as to highlight features of interest? Can we reduce a large data set to a much smaller one while preserving essential characteristics? Is there redundancy that can be exploited, with many variables determined by others?
Many sophisticated analysis techniques have been developed that reduce the dimensions and reveal signals buried in extraneous noise. One method of great power is called principal component analysis (PCA). From data in a high-dimensional space, this method determines a small number of new variables called principal components, allowing us to spot patterns. PCA also allows us to visualize the data in a simple two-dimensional diagram that often encapsulates the essence of the problem. Clusters of points with distinct behaviour can often be detected.
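In practice, PCA is often computed with the singular value decomposition of the centred data matrix. The sketch below uses invented data: 300 "individuals" described by 50 variables that, by construction, depend on just two hidden factors plus noise, a stand-in for the redundancy that makes reduction possible.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data matrix: 300 individuals x 50 variables, generated from
# only two underlying factors plus a little noise (illustrative only).
factors = rng.normal(size=(300, 2))
loadings = rng.normal(size=(2, 50))
data = factors @ loadings + 0.1 * rng.normal(size=(300, 50))

# PCA via the singular value decomposition of the centred matrix.
centred = data - data.mean(axis=0)
U, s, Vt = np.linalg.svd(centred, full_matrices=False)

# The first two principal components: one (x, y) pair per individual,
# ready to plot as a two-dimensional diagram.
components = centred @ Vt[:2].T     # shape (300, 2)

# Fraction of total variance captured by the first two components.
explained = (s[:2] ** 2).sum() / (s ** 2).sum()
print(components.shape, round(explained, 3))
```

Because the simulated data really do have only two underlying factors, the first two components capture almost all the variance; with real data such as the genetic study described below, the fraction is far smaller, yet the leading components can still reveal the dominant patterns.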
PCA has many applications, in acoustics, seismology, forensic science, meteorology and medicine. An intriguing application in genetics has shown that DNA can be used to infer an individual's geographic origin with remarkable accuracy, often to within a few hundred kilometres.
A paper in the journal Nature, with lead author John Novembre of UCLA, studied the genetic variation in a sample of more than 3,000 European people. Each DNA specimen was genotyped at about half a million loci. PCA was then used to drastically reduce this data set to just two dimensions, which could be displayed as a two-dimensional plot.
The first two principal components are correlated with perpendicular combinations of longitude and latitude. With appropriate orientation, the resulting plot bears a striking resemblance to a map of Europe (a detail is shown in the figure). Individuals from the same region cluster together, so major populations can be identified. For example, clusters corresponding to the Iberian and Italian peninsulas are clear, and the Irish and British groups are easily distinguished.
The results show that European DNA samples carry detailed information about the geographic origin of their donors: 90 per cent of individuals could be placed within about 700 km of their place of origin.
Peter Lynch is emeritus professor at UCD School of Mathematics & Statistics – he blogs at thatsmaths.com