The geography of Europe is mapped in our genes

It may seem too much to expect that a person’s geographic origin can be determined from a DNA sample. But, thanks to a mathematical technique called principal component analysis, this can be done with remarkable accuracy. It works by reducing multi-dimensional data sets to just a few variables.

We live in the age of “big data”. Voluminous data collections are mined for information using mathematical techniques. The data may be assembled in a matrix – a rectangular array of numbers – with a column for each individual and a row for each variable.

For a medical database, the variables might be age, height, weight, blood group and numerous other relevant factors, resulting in a very large matrix. We can examine small tables of numbers visually and detect interesting patterns but, with many variables, each requiring a separate dimension, simple inspection may reveal nothing of value.

A simple example illustrates dimension reduction. Suppose we let the two axes of a graph measure height and weight. Taller people are usually heavier than shorter ones, so these two variables are not independent; they are correlated. Each individual is represented by a point, and all the points form a cloud. The cloud is not round in shape, but elongated. We can find a straight line through the centre of the cloud in the direction of elongation, so that all the points lie close to this line. Thus, the essentials of the two-dimensional cloud are captured in the one-dimensional line.
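The height and weight example can be sketched in a few lines of Python with NumPy. The data below are synthetic and invented purely for illustration (the means, spreads and the 0.9 slope are assumptions, not measurements), and the names `direction`, `explained` and `scores` are ours. The sketch finds the direction of elongation of the cloud and the share of the total variation that lies along it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (invented for illustration): heights in cm, weights in kg.
n = 500
height = rng.normal(170.0, 10.0, size=n)
weight = 70.0 + 0.9 * (height - 170.0) + rng.normal(0.0, 5.0, size=n)

# Data matrix with a row for each variable and a column for each individual.
X = np.vstack([height, weight])

# Centre each variable, then form the 2 x 2 covariance matrix.
Xc = X - X.mean(axis=1, keepdims=True)
cov = Xc @ Xc.T / (n - 1)

# The eigenvectors of the covariance matrix are the principal directions.
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
direction = eigvecs[:, -1]               # direction of greatest elongation
explained = eigvals[-1] / eigvals.sum()  # share of variance along that line

# One number per individual: the position along the line through the cloud.
scores = direction @ Xc

print("direction of elongation:", direction)
print("share of variance on the line:", explained)
```

Because height and weight are measured in different units, the result depends on the scales chosen; in practice the variables are often standardised before the analysis.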

Higher dimensions

Problems are much tougher to solve in higher dimensions; this is called “the curse of dimensionality”. Dimension reduction is essential in big data science. Interesting features can often be captured by isolating a few key combinations of variables. What is the best way to represent data so as to highlight features of interest? Can we reduce a large data set to a much smaller one while preserving essential characteristics? Is there redundancy that can be exploited, with many variables determined by others?

Many sophisticated analysis techniques have been developed that reduce the dimensions and reveal signals buried in extraneous noise. One method of great power is called principal component analysis (PCA). From data in a high-dimensional space, this method determines a small number of new variables called principal components, allowing us to spot patterns. PCA also allows us to visualize the data in a simple two-dimensional diagram that often encapsulates the essence of the problem. Clusters of points with distinct behaviour can often be detected.
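As a minimal sketch of how this works, the Python code below computes the first two principal components using the singular value decomposition, one standard way of carrying out PCA. The data are synthetic: two hypothetical groups of 100 individuals, each described by 50 variables, with the groups differing slightly in every variable. The function name `pca_scores` and all the numbers are assumptions made for illustration.

```python
import numpy as np

def pca_scores(X, k=2):
    """Return the first k principal component scores and their variance shares.

    X has a row for each variable and a column for each individual.
    """
    Xc = X - X.mean(axis=1, keepdims=True)           # centre each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = s[:k, None] * Vt[:k]                    # shape (k, individuals)
    share = (s**2 / np.sum(s**2))[:k]                # fraction of total variance
    return scores, share

rng = np.random.default_rng(1)
n_vars, n_per_group = 50, 100

# Two synthetic groups; the second is shifted by 1.0 in every variable.
group_a = rng.normal(0.0, 1.0, size=(n_vars, n_per_group))
group_b = rng.normal(0.0, 1.0, size=(n_vars, n_per_group)) + 1.0
X = np.hstack([group_a, group_b])

scores, share = pca_scores(X, k=2)

# Distance between the group averages along the first principal component.
gap = abs(scores[0, :n_per_group].mean() - scores[0, n_per_group:].mean())
print("variance shares of the first two components:", share)
print("gap between group means on component 1:", gap)
```

Plotting the two rows of `scores` against each other gives the kind of two-dimensional diagram described above: no single variable separates the groups well, but the first component, which combines all 50, does.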

PCA has many applications, in acoustics, seismology, forensic science, meteorology and medicine. An intriguing application in genetics has shown that DNA can be used to infer an individual’s geographic origin with remarkable accuracy, often to within a few hundred kilometres.

A paper in the journal Nature, with lead author John Novembre of UCLA, studied the genetic variation in a sample of more than 3000 European people. Each DNA specimen was genotyped at about half a million loci. PCA was then used to drastically reduce this data set to just two dimensions and depict it on a plane graph.

Components

The first two principal components are correlated with perpendicular combinations of longitude and latitude. With appropriate orientation, their visualization had a striking resemblance to a map of Europe (a detail is shown in the figure). Individuals from the same region cluster together so that major populations can be identified. For example, clusters corresponding to the Iberian and Italian peninsulas are clear, and the Irish and British groups are easily distinguished.

The results mean that European DNA samples contain vital information about their donors. Thus, one can place 90 per cent of individuals within about 700 km of their geographic origin.
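The link between principal components and geography can be imitated with a toy simulation. The sketch below is not the method or the data of the Nature study; it is an illustration under strong assumptions of our own: 300 hypothetical individuals scattered over a unit square, 2000 loci, and allele frequencies that vary linearly across the square. It then checks how well each map coordinate can be recovered from the first two principal components alone.

```python
import numpy as np

rng = np.random.default_rng(2)
n_people, n_loci = 300, 2000

# Hypothetical map coordinates for each individual on a unit square.
east = rng.uniform(0.0, 1.0, size=n_people)
north = rng.uniform(0.0, 1.0, size=n_people)

# Assumption: at each locus the allele frequency varies linearly over the map.
slope_e = rng.normal(0.0, 0.2, size=(n_loci, 1))
slope_n = rng.normal(0.0, 0.2, size=(n_loci, 1))
freq = 0.5 + slope_e * (east - 0.5) + slope_n * (north - 0.5)
freq = np.clip(freq, 0.02, 0.98)

# Genotype: number of copies (0, 1 or 2) of one allele at each locus.
# A row for each locus (variable) and a column for each individual.
G = rng.binomial(2, freq).astype(float)

# PCA by singular value decomposition of the centred genotype matrix.
Gc = G - G.mean(axis=1, keepdims=True)
U, s, Vt = np.linalg.svd(Gc, full_matrices=False)
pcs = s[:2, None] * Vt[:2]               # first two components, shape (2, people)

def r_squared(target, components):
    """Fraction of the variance of target explained linearly by the components."""
    A = np.column_stack([components.T, np.ones(components.shape[1])])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    resid = target - A @ coef
    return 1.0 - resid.var() / target.var()

r2_east = r_squared(east, pcs)
r2_north = r_squared(north, pcs)
print("R^2 for east coordinate:", r2_east)
print("R^2 for north coordinate:", r2_north)
```

In this toy setting the two components are rotated combinations of the east and north coordinates, so a plot of one against the other reproduces the layout of the individuals on the square up to rotation or reflection. Real genetic data are far noisier and more complicated.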

Peter Lynch is emeritus professor at UCD School of Mathematics & Statistics – he blogs at thatsmaths.com
