most Americans (87%) can be uniquely identified from just three pieces of personal data: a birth date, five-digit zip code and gender
This 87% re-identification risk estimate is incorrectly cited in http://smartdatacollective.com/tamaradull/298401/pii-anonymized-data-and-big-data-privacy as having been initially reported in a 2001 AT&T research paper [1]. It actually comes from work done by Latanya Sweeney using 1990 U.S. Census data [2].
The cited 87% estimate, in fact, has never been reproduced by other scientists. In 2006, Philippe Golle, a computer scientist at the Palo Alto Research Center, attempted to replicate Sweeney's 1990 Census results and also estimated the results using the 2000 Census data [3]. Using well-documented and more accurate estimation methods, Golle found that in 1990 only 61% of the US population was actually unique with regard to this combination of three characteristics, and that only 63% of the US population was unique using the 2000 Census data. Subsequently, Barth-Jones, using Golle's method, found that only 62% were unique in the 2010 Census data. [Unpublished work, Personal communication Daniel Barth-Jones, db2431@columbia.edu].
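To make concrete what "unique with regard to this combination of three characteristics" means, here is a minimal Python sketch that counts the share of records whose (birth date, ZIP, gender) combination occurs exactly once. It is only an illustration on invented toy records, not Golle's or Sweeney's estimation method, which worked from Census tabulations; the function name and field names (birth_date, zip5, gender) are assumptions made for this example.

```python
from collections import Counter

def fraction_unique(records):
    """Share of records whose (birth_date, zip5, gender) combination occurs exactly once."""
    keys = [(r["birth_date"], r["zip5"], r["gender"]) for r in records]
    counts = Counter(keys)
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(keys) if keys else 0.0

# Invented toy population: the first two people share all three attributes.
toy_population = [
    {"birth_date": "1970-01-01", "zip5": "02138", "gender": "F"},
    {"birth_date": "1970-01-01", "zip5": "02138", "gender": "F"},
    {"birth_date": "1970-01-02", "zip5": "02138", "gender": "F"},
    {"birth_date": "1985-06-15", "zip5": "94304", "gender": "M"},
]

print(fraction_unique(toy_population))  # 0.5 (2 of 4 records are unique)
```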
It should also be understood that the estimates made by both Golle and Sweeney will necessarily overestimate practically realizable re-identification risks. Their calculations assume that the entity attempting re-identification is capable of producing a perfect (or nearly perfect) population register of the entire U.S. population with respect to Date of Birth, Gender and current 5-digit Zip Code. This re-identification challenge has been described by Barth-Jones as the "myth of the perfect population register" [4].
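The effect of an incomplete register can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not a model from any of the cited papers: the person_id field, the field names and the function correct_reid_rate are all hypothetical, and a target counts as re-identified only when its three attributes match exactly one register entry and that entry is the right person.

```python
def key(rec):
    """Quasi-identifier used for linking: birth date, 5-digit ZIP, gender."""
    return (rec["birth_date"], rec["zip5"], rec["gender"])

def correct_reid_rate(targets, register):
    """Fraction of targets matching exactly one register entry that is the same person."""
    by_key = {}
    for entry in register:
        by_key.setdefault(key(entry), []).append(entry["person_id"])
    hits = 0
    for t in targets:
        candidates = by_key.get(key(t), [])
        if len(candidates) == 1 and candidates[0] == t["person_id"]:
            hits += 1
    return hits / len(targets) if targets else 0.0

# Invented population: A and B are unique, C and D share all three attributes.
population = [
    {"person_id": "A", "birth_date": "1970-01-01", "zip5": "02138", "gender": "F"},
    {"person_id": "B", "birth_date": "1985-06-15", "zip5": "94304", "gender": "M"},
    {"person_id": "C", "birth_date": "1990-03-03", "zip5": "10027", "gender": "F"},
    {"person_id": "D", "birth_date": "1990-03-03", "zip5": "10027", "gender": "F"},
]

perfect_register = population
imperfect_register = [p for p in population if p["person_id"] != "A"]  # A is missing

print(correct_reid_rate(population, perfect_register))    # 0.5
print(correct_reid_rate(population, imperfect_register))  # 0.25
```

In this toy case the uniqueness rate (0.5) is what a perfect register would deliver; dropping a single person from the register halves the achievable rate.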
In 2013, Sweeney provided an illustrative demonstration of how dramatically perfect population register challenges can reduce re-identification risks in practice, using the same three data elements (Zip, Gender and Birth Date) in an attempt to re-identify participants in the Personal Genome Project [5]. Once participants' names (which were embedded in the project's data files) were excluded from the re-identification risk calculations, only 28% of the participants with data for their Date of Birth, Gender and 5-digit Zip Code could actually be re-identified, in spite of the study's use of both voter registration and commercial data broker data to attempt to create a suitable population register for re-identification [6].
Notably, this real-world demonstration by Sweeney, which attempted to construct a population register from currently available commercial and governmental data sources, revealed realized re-identification risks of only about one third of the 87% level originally cited by Sweeney, and less than half of the 61-63% levels found by Golle and Barth-Jones [7].
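As a quick arithmetic check of those proportions, the short Python snippet below divides the 28% realized rate by each of the theoretical uniqueness estimates quoted above. The variable names are only for illustration.

```python
realized = 28  # % re-identified in the Personal Genome Project demonstration
estimates = {"Sweeney 1990": 87, "Golle 1990": 61, "Barth-Jones 2010": 62, "Golle 2000": 63}

# Ratio of the realized rate to each theoretical uniqueness estimate.
ratios = {name: realized / pct for name, pct in estimates.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}")
# Sweeney 1990: 0.32
# Golle 1990: 0.46
# Barth-Jones 2010: 0.45
# Golle 2000: 0.44
```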
References:
1) Krishnamurthy, B., Wills, CE. On the Leakage of Personally Identifiable Information Via Online Social Networks. 2001. ACM. http://www2.research.att.com/~bala/papers/wosn09.pdf
2) Sweeney, L. Uniqueness of Simple Demographics in the U.S. Population, LIDAPWP4. Carnegie Mellon University, Laboratory for International Data Privacy, Pittsburgh, PA, 2000. http://dataprivacylab.org/projects/identifiability/paper1.pdf
3) Golle, P. Revisiting the Uniqueness of Simple Demographics in the US Population. Proceedings of the 5th ACM Workshop on Privacy in Electronic Society, 2006, p. 77-80. https://crypto.stanford.edu/~pgolle/papers/census.pdf
4) Barth-Jones, Daniel C., The 'Re-Identification' of Governor William Weld's Medical Information: A Critical Re-Examination of Health Data Identification Risks and Privacy Protections, Then and Now (June 4, 2012). Available at SSRN: http://ssrn.com/abstract=2076397 or http://dx.doi.org/10.2139/ssrn.2076397
5) Sweeney, Latanya and Abu, Akua and Winn, Julia, Identifying Participants in the Personal Genome Project by Name (April 29, 2013). Available at SSRN: http://ssrn.com/abstract=2257732 or http://dx.doi.org/10.2139/ssrn.2257732
6) Bambauer, J. https://blogs.law.harvard.edu/infolaw/2013/05/01/reporting-fail-the-reidentification-of-personal-genome-project-participants/comment-page-1/
7) Barth-Jones, DC. http://blogs.law.harvard.edu/billofhealth/2013/10/01/press-and-reporting-considerations-for-recent-re-identification-demonstration-attacks-part-2-re-identification-symposium/