3 Matching Annotations
  1. Apr 2015
    1. most Americans (87%) can be uniquely identified from just three pieces of personal data: a birth date, five-digit zip code and gender

      This 87% re-identification risk estimate, incorrectly cited in http://smartdatacollective.com/tamaradull/298401/pii-anonymized-data-and-big-data-privacy as having been initially reported in a 2001 AT&T research paper [1], actually comes from work done by Latanya Sweeney using the 1990 U.S. Census data.[2]

      The cited 87% estimate, in fact, has never been reproduced by other scientists. In 2006, Philippe Golle, a Palo Alto Research Center Computer Scientist, attempted to replicate Sweeney's 1990 Census results and additionally estimated the results using the 2000 Census data. [3] Using well-documented and more accurate estimation methods, Golle found that in 1990 only 61% of the US population was actually unique with regard to this combination of three characteristics. Additionally, Golle found that only 63% of the US population was unique using the 2000 Census data. Subsequently, Barth-Jones, using Golle's method, additionally found that only 62% were unique in the 2010 Census data. [Unpublished work, Personal communication Daniel Barth-Jones, db2431@columbia.edu].
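
To make the uniqueness figures concrete, here is a minimal sketch of what "unique with regard to this combination of three characteristics" means when individual-level records are available. It is not the estimation method used by Sweeney or Golle, which this annotation does not describe. The function name `fraction_unique`, the field names and the four toy records are all hypothetical.

```python
from collections import Counter


def fraction_unique(records):
    """Share of records whose (birth_date, zip5, gender) combination occurs exactly once."""
    keys = [(r["birth_date"], r["zip5"], r["gender"]) for r in records]
    counts = Counter(keys)
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(keys) if keys else 0.0


# Toy, made-up population (assumption: not real data).
population = [
    {"birth_date": "1970-01-01", "zip5": "10001", "gender": "F"},
    {"birth_date": "1970-01-01", "zip5": "10001", "gender": "F"},  # shares a key with the record above
    {"birth_date": "1970-01-02", "zip5": "10001", "gender": "F"},
    {"birth_date": "1985-06-15", "zip5": "94304", "gender": "M"},
]

print(fraction_unique(population))  # 0.5: two of the four records are unique
```

The 61% to 63% figures quoted above are population-level estimates of this same quantity.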

      It should also be understood that the estimates made by both Golle and Sweeney will necessarily produce overestimates of practically realizable re-identification risks because the calculations assume that the entity attempting re-identification will be capable of producing a perfect (or nearly perfect) population register of the entire U.S. population with respect to their Date of Birth, Gender and current 5-digit Zip Code. This re-identification challenge has been described by Barth-Jones as the "myth of the perfect population register" [4].
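
A toy illustration of why an imperfect population register lowers realizable re-identification rates. The sketch assumes that a record counts as re-identified only when its three data elements match exactly one register entry and that entry is the right person. The function `correct_link_rate`, the person IDs and every record are invented for illustration and do not come from any of the cited studies.

```python
from collections import defaultdict


def correct_link_rate(dataset, register):
    """Fraction of dataset records linked to exactly one register entry that is the right person."""
    by_key = defaultdict(list)
    for person_id, key in register:
        by_key[key].append(person_id)
    correct = 0
    for person_id, key in dataset:
        candidates = by_key.get(key, [])
        if len(candidates) == 1 and candidates[0] == person_id:
            correct += 1
    return correct / len(dataset) if dataset else 0.0


# Hypothetical de-identified data set: every person is unique on (birth date, zip, gender).
dataset = [
    ("p1", ("1970-01-01", "10001", "F")),
    ("p2", ("1970-01-02", "10001", "F")),
    ("p3", ("1985-06-15", "94304", "M")),
    ("p4", ("1990-12-31", "60601", "M")),
]

# A perfect register lists everyone with current, correct values.
perfect_register = list(dataset)

# An imperfect register: p3 is missing, and p4 is listed under an outdated zip code.
imperfect_register = [
    ("p1", ("1970-01-01", "10001", "F")),
    ("p2", ("1970-01-02", "10001", "F")),
    ("p4", ("1990-12-31", "60614", "M")),
]

perfect_rate = correct_link_rate(dataset, perfect_register)
imperfect_rate = correct_link_rate(dataset, imperfect_register)
print(perfect_rate, imperfect_rate)  # 1.0 0.5
```

Even though all four toy records are unique, only half can be linked once the register has gaps and stale entries.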

      In 2013, Sweeney provided an illustrative demonstration of how dramatically the challenges of building a perfect population register can reduce re-identification risks in practice, using the same three Zip, Gender and Birth Date data elements in an attempt to re-identify participants in the Personal Genome Project [5]. Once participants' names (which were embedded in the project's data files) were excluded from the re-identification risk calculations, only 28% of the participants with data for their Date of Birth, Gender and 5-digit Zip Code could actually be re-identified, in spite of the study's use of both voter registration and commercial data broker data to attempt to create a suitable population register for re-identification. [6]

      Notably, this real-world demonstration by Sweeney of the realized re-identification risks when attempting to construct a population register using currently available commercial and governmental data sources revealed re-identification risks only one third of the 87% level originally cited by Sweeney, and less than half of the 61-63% levels found by Golle and Barth-Jones. [7]
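
The ratios in this comparison can be checked directly from the percentages quoted in this annotation. The variable names below are illustrative only.

```python
# Percentages as quoted in the annotation text.
realized_rate = 0.28  # Personal Genome Project demonstration, names excluded
estimates = {
    "Sweeney, 1990 Census": 0.87,
    "Golle, 1990 Census": 0.61,
    "Golle, 2000 Census": 0.63,
}

# Ratio of the realized rate to each theoretical uniqueness estimate.
ratios = {label: round(realized_rate / value, 2) for label, value in estimates.items()}

for label, ratio in ratios.items():
    print(label, ratio)
# Sweeney, 1990 Census 0.32
# Golle, 1990 Census 0.46
# Golle, 2000 Census 0.44
```

A ratio of 0.32 is roughly one third, and 0.44 to 0.46 is below one half, matching the comparison made above.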

      References:
      1) Krishnamurthy, B., Wills, CE. On the Leakage of Personally Identifiable Information Via Online Social Networks. 2001. ACM. http://www2.research.att.com/~bala/papers/wosn09.pdf
      2) Sweeney, L. Uniqueness of Simple Demographics in the U.S. Population, LIDAPWP4. Carnegie Mellon University, Laboratory for International Data Privacy, Pittsburgh, PA, 2000. http://dataprivacylab.org/projects/identifiability/paper1.pdf
      3) Golle, P. Revisiting the Uniqueness of Simple Demographics in the US Population. Proceedings of the 5th ACM Workshop on Privacy in Electronic Society, 2006, p. 77-80. https://crypto.stanford.edu/~pgolle/papers/census.pdf
      4) Barth-Jones, Daniel C., The 'Re-Identification' of Governor William Weld's Medical Information: A Critical Re-Examination of Health Data Identification Risks and Privacy Protections, Then and Now (June 4, 2012). Available at SSRN: http://ssrn.com/abstract=2076397 or http://dx.doi.org/10.2139/ssrn.2076397
      5) Sweeney, Latanya and Abu, Akua and Winn, Julia, Identifying Participants in the Personal Genome Project by Name (April 29, 2013). Available at SSRN: http://ssrn.com/abstract=2257732 or http://dx.doi.org/10.2139/ssrn.2257732
      6) Bambauer, J. https://blogs.law.harvard.edu/infolaw/2013/05/01/reporting-fail-the-reidentification-of-personal-genome-project-participants/comment-page-1/
      7) Barth-Jones, DC. http://blogs.law.harvard.edu/billofhealth/2013/10/01/press-and-reporting-considerations-for-recent-re-identification-demonstration-attacks-part-2-re-identification-symposium/

    2. intentional re-identification is different from accidental re-identification. accidents frequent. intention is hard.

      Accidental Data Recognitions

      @KenNeumeister suggested in this tweet:

      @dbarthjones intentional re-identification is different from accidental re-identification. accidents frequent. intention is hard.

      — Ken Neumeister (@KenNeumeister) March 29, 2015

      that accidental data re-identifications occur frequently. His comment was surprising to me because I've never experienced an accidental or "spontaneous" data re-identification in 25 years of working on health data de-identification. Our Twitter exchange about this was interesting, and I learned that his take on this may differ from mine because he apparently works in a different data domain than I do, but it started me thinking.

      Part of the reason for my having never encountered a single spontaneous re-identification in over two decades of opportunity for this to happen might be that I'd ended my experience providing patient care in a clinical setting before I began working in the area of data de-identification. I don't doubt that spontaneous health data re-identifications might occasionally occur for clinicians who work with de-identified data for patients that they've personally seen (or maybe even with their recognizing patients that have been treated in the units/offices where they work).

      A couple of points are probably worth mentioning though about such spontaneous clinical environment data "re-identifications". First, if they occur, they would typically only impact a single person and certainly not any large number of persons. In any large healthcare data set this should presumably constitute a very small re-identification risk among the total number of persons at risk of re-identification. Secondly, and more importantly, I'm not really sure we should consider such spontaneous events as "re-identifications". I'd propose that "spontaneous data recognitions" would be a more illuminating name for such events.

      Why does the distinction I'm making matter? Well, because it speaks to both the real source of potential privacy concern and the likelihood of privacy harms resulting from such an event. The de-identified data isn't the source of the critical information needed to enable possible privacy concerns here. The knowledge of a patient's details enabling a spontaneous recognition of the patient within a de-identified data set hasn't come from the data set -- it comes from external information that the data viewer already possessed independent of the data.

      Sure, on occasion we might suppose that the data "recognizer" might also learn something new about the patient from the data which wasn't already known to them. But with respect to the possibility of resulting privacy harms from such events, we need to remind ourselves that newly revealed information which is of great sensitivity, or otherwise capable of producing privacy harms for the patient, will likely be a comparatively rare occurrence. This is because the patient will most likely have been recognizable primarily due to their having some atypical / rare characteristics enabling their spontaneous recognition. There might be additional info within the data that the recognizer didn't already know, but initial recognitions will be rare events, and additional revelations resulting in privacy harms will be rarer still. This is basic probabilistic reasoning (in spite of the problem that even very smart people will sometimes avoid "doing the math" if they've been poked with a scary scenario first). See http://blogs.law.harvard.edu/infolaw/2014/11/21/the-antidote-for-anecdata-a-little-science-can-separate-data-privacy-facts-from-folklore/

      But the probabilistic reality remains that when a final outcome B can only stem from an initial rare event A, then even if B is itself only rare (or even moderately frequent) once A has occurred, occurrences of B will be rarer still than occurrences of A.
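
This reasoning is the multiplication rule for chained events, P(A and B) = P(A) * P(B | A), which can never exceed P(A). A minimal sketch follows. The two input probabilities are hypothetical numbers chosen only to show the arithmetic, not estimates from any study, and `joint_probability` is an illustrative name.

```python
from fractions import Fraction


def joint_probability(p_a, p_b_given_a):
    """Multiplication rule: P(A and B) = P(A) * P(B | A)."""
    return p_a * p_b_given_a


# Hypothetical values (assumptions, not measured rates).
p_recognition = Fraction(1, 1000)           # P(A): a spontaneous recognition occurs
p_harm_given_recognition = Fraction(1, 10)  # P(B | A): harm follows, given a recognition

p_harm = joint_probability(p_recognition, p_harm_given_recognition)
print(p_harm)  # 1/10000
```

Because P(B | A) is at most 1, the harm that depends on a recognition is never more likely than the recognition itself, whatever values are plugged in.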

      We can add to this probabilistic assurance the fact that, in general (at least for this context of spontaneous health data recognitions), having sufficiently detailed knowledge that a person has atypical characteristics (or a rare medical condition) which could enable their spontaneous recognition usually comes from having already been placed in a general position of trust with respect to participating in the patient's care, or otherwise having already had some sort of trusted relationship with the patient.

      So in the same way that we generally trust doctors, nurses and other healthcare providers with our personal health information, the same trust, and the ethics that enable such trust, should rightly be expected to be in place for sensitive information revealed through spontaneous medical care data recognition events.

      The very same reasoning also generally applies even for the context of non-medically trained personnel (statisticians, data analysts, etc.) accessing de-identified medical data. With properly de-identified data, spontaneous data recognitions are only likely to occur extremely rarely for those cases where there isn't already some sort of existing relationship with the person being recognized that enabled the detailed knowledge required to realize the recognition.

      Yes, on extremely rare occasions, statisticians and data analysts might spontaneously recognize their family members, neighbors, workplace friends/acquaintances or even themselves within properly de-identified health data, but the number of people for whom they will know sufficient details to allow this to occur will be very small in any large and properly de-identified data set. And when this does rarely occur, resultant privacy harms will likely also be rare because, in general, the knowledge of the details through which a spontaneous re-identification could occur will have been obtained through having an existing relationship with the individual who has been recognized. Put in statistical terms, we are helped in this specific context of spontaneous recognitions by the fact that knowledge of the intimate details needed to enable such recognitions is typically correlated with having some sort of trust relationship with us.

      I'd further argue though that we should also be supplementing these inherently probabilistic protections by consistently providing ethical training and mentorship to all personnel who access de-identified medical data. In my training as an HIV epidemiologist, I received plenty of meaningful mentorship about privacy ethics, but had admittedly little course-based training on these issues. This is an area where we can and should do more.

      However, I think it's helpful to recognize that although accidental or spontaneous re-identifications might occur on occasion, there's good reason to expect that they are likely to be quite rare and that, when they do occur, they will most often be unlikely to pose a significant source of privacy harms.

  2. Mar 2015
    1. According to a study published in 2006, 87% of the American population has a unique combination of ZIP code, birth date and sex.

      This 87% re-identification risk estimate, incorrectly cited in https://www.cippguide.org/2010/09/21/de-identification-re-identification/ as having been initially reported in a 2006 study [2], actually comes from work published in 2000 by Latanya Sweeney using the 1990 U.S. Census data.[1]

      The cited 87% estimate, in fact, has never been reproduced by other scientists. In the 2006 paper cited in this link, Philippe Golle, a Palo Alto Research Center Computer Scientist, attempted to replicate Sweeney's 1990 Census results and additionally estimated the results using the 2000 Census data. [2] Using well-documented and more accurate estimation methods, Golle found that in 1990 only 61% of the US population was actually unique with regard to this combination of three characteristics. Additionally, Golle found that only 63% of the US population was unique using the 2000 Census data. Subsequently, Barth-Jones, using Golle's method, additionally found that only 62% were unique in the 2010 Census data. [Unpublished work, Personal communication Daniel Barth-Jones, db2431@columbia.edu].

      It should also be understood that the estimates made by both Golle and Sweeney will necessarily produce overestimates of practically realizable re-identification risks because the calculations assume that the entity attempting re-identification will be capable of producing a perfect (or nearly perfect) population register of the entire U.S. population with respect to their Date of Birth, Gender and current 5-digit Zip Code. This re-identification challenge has been described by Barth-Jones as the "myth of the perfect population register" [3].

      In 2013, Sweeney provided an illustrative demonstration of how dramatically the challenges of building a perfect population register can reduce re-identification risks in practice. The same three Zip, Gender and Birth Date data elements were used in an attempt to re-identify participants in the Personal Genome Project [4]. Once participants' names (which were embedded in the project's data files in some cases) were excluded from the re-identification risk calculations, only 28% of the participants with data for their Date of Birth, Gender and 5-digit Zip Code could actually be re-identified, in spite of the study's use of both voter registration and commercial data broker data to attempt to create a suitable population register for re-identification. [5]

      Notably, this real-world demonstration by Sweeney of the realized re-identification risks when attempting to construct a population register using currently available commercial and governmental data sources revealed re-identification risks only one third of the 87% level originally cited by Sweeney, and less than half of the 61-63% levels found by Golle and Barth-Jones. [6]

      References:
      1) Sweeney, L. Uniqueness of Simple Demographics in the U.S. Population, LIDAPWP4. Carnegie Mellon University, Laboratory for International Data Privacy, Pittsburgh, PA, 2000. http://dataprivacylab.org/projects/identifiability/paper1.pdf
      2) Golle, P. Revisiting the Uniqueness of Simple Demographics in the US Population. Proceedings of the 5th ACM Workshop on Privacy in Electronic Society, 2006, p. 77-80. https://crypto.stanford.edu/~pgolle/papers/census.pdf
      3) Barth-Jones, Daniel C., The 'Re-Identification' of Governor William Weld's Medical Information: A Critical Re-Examination of Health Data Identification Risks and Privacy Protections, Then and Now (June 4, 2012). Available at SSRN: http://ssrn.com/abstract=2076397 or http://dx.doi.org/10.2139/ssrn.2076397
      4) Sweeney, Latanya and Abu, Akua and Winn, Julia, Identifying Participants in the Personal Genome Project by Name (April 29, 2013). Available at SSRN: http://ssrn.com/abstract=2257732 or http://dx.doi.org/10.2139/ssrn.2257732
      5) Bambauer, J. https://blogs.law.harvard.edu/infolaw/2013/05/01/reporting-fail-the-reidentification-of-personal-genome-project-participants/comment-page-1/
      6) Barth-Jones, DC. http://blogs.law.harvard.edu/billofhealth/2013/10/01/press-and-reporting-considerations-for-recent-re-identification-demonstration-attacks-part-2-re-identification-symposium/