Microsoft quietly deletes largest public face recognition data set

Database of 10m faces has been used by military researchers and Chinese firms

Microsoft has quietly pulled from the internet its database of 10 million faces, which has been used to train facial recognition systems around the world, including by military researchers and Chinese firms such as SenseTime and Megvii.

The database, known as MS Celeb, was published in 2016 and described by the company as the largest publicly available facial recognition data set in the world, containing more than 10 million images of nearly 100,000 individuals.

The people whose photos were used were not asked for their consent; their images were scraped off the web from search engines and videos under the terms of the Creative Commons licence that allows academic reuse of photos.

Microsoft, which took down the database days after the Financial Times reported on its use by companies, said: “The site was intended for academic purposes. It was run by an employee that is no longer with Microsoft and has since been removed.”


Two other data sets have also been taken down since the FT report was published in April: the Duke MTMC surveillance data set built by Duke University researchers, and a Stanford University data set called Brainwash.

Brainwash used footage of customers in a cafe called Brainwash in San Francisco’s Lower Haight district, taken through a live-streaming camera. Duke did not respond to requests for comment. Stanford said it had removed the data set after a request by one of the authors of a study it was used for. A spokesperson said the university was “committed to protecting the privacy of individuals at Stanford and in the larger community”.

All three data sets were uncovered by Berlin-based researcher Adam Harvey, whose project Megapixels documented the details of dozens of data sets and how they are being used.

Microsoft’s MS Celeb data set has been used by several commercial organisations, according to citations in AI papers, including IBM, Panasonic, Alibaba, Nvidia, Hitachi, SenseTime and Megvii. Both SenseTime and Megvii are Chinese suppliers of equipment to officials in Xinjiang, where minorities, mostly Uighurs and other Muslims, are being tracked and held in internment camps.

Training algorithms

Microsoft itself has used the data set to train facial recognition algorithms, Mr Harvey’s investigation found.

The company named the data set Celeb to indicate that the faces it had scraped were photos of public figures. But Mr Harvey found that the data set included several arguably private individuals, among them security journalists such as Kim Zetter and Adrian Chen; Shoshana Zuboff, the author of The Age of Surveillance Capitalism; and Julie Brill, the former FTC commissioner responsible for protecting consumer privacy.

“Microsoft has exploited the term ‘celebrity’ to include people who merely work online and have a digital identity,” said Mr Harvey. “Many people in the target list are even vocal critics of the very technology Microsoft is using their name and biometric information to build.”

When the FT previously contacted people in the database, they were unaware of their inclusion. “I am in no sense a public person. There is no way in which I’ve ceded my right to privacy,” said Adam Greenfield, a technology writer and urbanist who was included in the data set.

“It’s indicative of Microsoft’s inability to hold their own researchers to integrity and probity that this was not torpedoed before it left the building,” he said. “To me, it is indicative of a profound misunderstanding of what privacy is.”

GDPR violation?

Tech experts said Microsoft may have been in violation of the EU’s General Data Protection Regulation (GDPR) by continuing to distribute the MS Celeb data set after the regulation came into effect last year.

“They are likely to have taken it down because their lawyers expressed concern that they do not have a basis to process special category data such as faces under article 9 of GDPR,” said Michael Veale, a technology policy researcher at the Alan Turing Institute. “They may not have a get-out clause for processing biometric data for the purposes of ‘uniquely identifying a natural person’.”

“Particularly as the use of the data set has moved from a purely research use to something that products are being built with,” he added. “There is reason to believe that the people in the data set cannot be considered to expressly and clearly have made their faces public.”

Microsoft said it was not aware of any GDPR implications and that the site had been retired “because the research challenge is over”.

Although Microsoft has deleted the database, it remains available to researchers and companies that had already downloaded it. Mr Harvey said it was still being shared on open-source websites.

“You can’t make a data set disappear. Once you post it, and people download it, it exists on hard drives all over the world,” he said. “Now it is completely disassociated from any licensing, rules or controls that Microsoft previously had over it. People are posting it on GitHub, hosting the files on Dropbox and Baidu Cloud, so there is no way from stopping them from continuing to post it and use it for their own purposes.” – Copyright The Financial Times Limited 2019