A bug and a dilemma
A few months ago, I discovered that the SAS statistical software package, which is used worldwide by universities and other large organisations to analyse their data, contained (until quite recently) a bug that could result in data that the user thought they had successfully deleted, and that was not visible from within the application itself, still being present in the saved data file. This could lead to personally identifiable information (PII) about study participants being revealed, alongside whatever other data might have been collected from those participants, which, depending on the study, could potentially be extremely sensitive. I found this only by chance when looking at a SAS data file to try to work out why some numbers weren’t coming out as expected, for which it would have been useful to know whether numbers are stored in ASCII or binary. (It turned out that they are stored in binary.)
Here is how this bug works: Suppose that as a researcher you have run a study on 80 named participants, and you now have a dataset containing their names, study ID numbers (for example, if the study code within your organisation is XYZ, these might be XYZ100, XYZ101, etc., up to XYZ179), and other relevant variables from the study. One day you decide to make a version of the dataset that can be shared without the participants being identifiable, either because you have to deposit it in an archive when you submit the study to a journal, or because somebody has read the article and asked for your data. You could share this in .CSV file format, and indeed that would generally be considered best practice for interoperability; but there may be good reasons to share it in SAS’s native binary data file format with a .sas7bdat extension, which can in any case be opened in R (using a package named “sas7bdat”, among others) or in SPSS.
So you open your file called participants-final.sas7bdat in the SAS data editor and delete the column with the participants’ names (and any other PII, such as IP addresses, or perhaps dates of birth if these are not needed to establish the participants’ ages, etc.), then save it as deidentified-participants-final.sas7bdat, and share the latter file. But what you don’t know is that, because of this bug, in some unknown percentage of cases the text of most of the names may still be sitting in the sas7bdat binary data file, close to the alphanumeric participant IDs. That is, if the bug has struck, someone who opens the “deidentified” file in a plain text editor (which could be as simple as Notepad on Windows) might see the names and IDs among the binary gloop, as shown in this image.
I’m pretty sure these two people didn’t take part in this study.
This screenshot shows an actual extract from a data file that I found, with only the names and the study ID codes replaced with those of others chosen from the phone book. The full names of about two-thirds of the participants in this study were readable. Of course, you can’t read the binary data directly, and decoding it by hand would take a lot of work, but given the participant IDs (PRZ045 for Trump, PRZ046 for Biden) you can simply open the “anonymised” data file in SAS and find out all you want about these two people from within the application.
I have been told by SAS support (see screenshot below) that this bug was fixed in version 9.4M4 of the software, which was released on 16 November 2016. The support agent told me that the problem was known to be present in version 9.4M3, which was released on 14 July 2015; however, I do not know whether the problem also existed in earlier versions. I think it would be prudent to assume that any file in .sas7bdat format created by a version of SAS prior to 9.4M4 may have this issue. Neither the existence of the problem, nor the fact that it had been fixed, was documented by SAS in the release notes for version 9.4M4; equally, however, the support representative did not tell me that the problem is regarded as top secret or subject to any kind of embargo.
(The identity of the organisation that shared the data in which I found the bug has been redacted here.)
SAS is a complex software package and it will often take a while for large organisations to migrate to a new version. Probably by now most installations have been upgraded to 9.4M4 or later, but quite a few sites might have been using the previous version containing this bug until quite recently, and as I already mentioned, it’s not clear how old the bug is (i.e., at what point it was introduced into the software). So it could have been around for many years prior to being discovered, and it could well have still been around for two or three years after that date at many sites.
Now, this discovery caused me a dilemma. I worried that, if I were to go public with this bug, this might start a race between people who have already shared datasets made with a version prior to 9.4M4, trying to replace or recall their files, and Bad People™ looking for material online to exploit. That is, revealing the existence of the problem might increase the risk of data leaking out. On the other hand, it’s also possible that the bad people are already aware of the problem and are actively looking for that material, in which case every day that passes without the problem becoming public knowledge increases the risk, and going public would be the start of the solution.
Note that this is different from the typical “white hat”/“bug bounty” scenario, in which the Good People™ who find a vulnerability tell the software company about the bug and get paid to remain silent until a reasonable amount of time has passed to patch the systems, after which they are free to reveal the existence of the problem. In those cases, patching the software fixes the problem immediately, because the extent of the vulnerability is limited to the software itself. But here, the vulnerability is in the data files that were not anonymised as intended. There is no way to patch anything to stop these files from being read, because that only needs a text editor. The only remedy is for the files to be deleted from, or replaced in, repositories as their authors or guardians become aware of the issue.
In the original case where I discovered this issue, I reported it to the owner of the dataset and he arranged for the offending file to be recalled from the repository where he had placed it, namely the Open Science Framework. (I also gave a heads-up to the Executive Director of the Center for Open Science, Brian Nosek, at that time.) The dataset owner also reported the problem to their management, as they thought (and I completely agree) that dealing with this kind of issue is above the pay grade of any individual principal investigator. I do not know what has happened since, nor do I think it’s really my business. I would argue that SAS ought to have done something more about this than just sneaking out a fix without telling anybody; but perhaps they, too, looked at the trade-off described above and decided to keep quiet on that basis, rather than merely to avoid embarrassment.
I have spent several months wondering what to do about this knowledge. In the end, I decided that (a) there probably aren’t too many corrupt files out there, and (b) there probably aren’t too many Bad People™ who are likely to go searching for sensitive data this way, because it just doesn’t seem like a very productive way of being a Bad Person. So I am going public today, in the hope that the practical consequences of revealing the existence of this problem are unlikely to be major, and that giving people the chance to correct any SAS data files that they might have made public will be, on balance, a net win for the Good People. (For what it’s worth, I asked two professors of ethics about this, one of them a specialist in data-related issues, and they both said “Ouch. Tough call. I don’t know. Do what you think is best”.)
Now, what does this discovery mean? Well, if you use SAS and have made your data available using the .sas7bdat file format, you might want to have a look in the data files with a text editor and check that there is nothing in there that you wouldn’t expect. But even if you don’t use SAS, there may still be a couple of lessons for you from this incident, because (a) the fact that this particular software bug is fixed doesn’t mean there aren’t others, and (b) everyone makes mistakes.
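If you have more than a handful of files, eyeballing each one in a text editor gets tedious. Here is a minimal Python sketch of the same check. It knows nothing about the internal layout of .sas7bdat files; it simply pulls out runs of printable ASCII characters from any binary file (much as the Unix `strings` tool does) and reports which of the names you thought you had deleted still turn up. The function names and the four-character minimum are my own choices for illustration, and the sketch assumes that the leftover text is stored as plain single-byte ASCII, so names containing accented characters would need a different approach.

```python
import re
from pathlib import Path


def printable_runs(data: bytes, min_len: int = 4) -> list[str]:
    """Return every run of at least min_len printable ASCII characters in data."""
    # Bytes 0x20 to 0x7e cover space through tilde, i.e. printable ASCII.
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.group().decode("ascii") for match in pattern.finditer(data)]


def find_leaks(data: bytes, known_pii: list[str], min_len: int = 4) -> list[str]:
    """Return the entries of known_pii that occur in some printable run.

    The comparison is case-insensitive. Entries shorter than min_len can be
    missed, because runs that short are not extracted in the first place.
    """
    runs = [run.lower() for run in printable_runs(data, min_len)]
    return [item for item in known_pii if any(item.lower() in run for run in runs)]


def scan_file(path: str, known_pii: list[str]) -> list[str]:
    """Read a file from disk as raw bytes and check it for the given strings."""
    return find_leaks(Path(path).read_bytes(), known_pii)
```

You would call something like `scan_file("deidentified-participants-final.sas7bdat", names)`, where `names` is the list of participant names taken from your original, identified dataset. An empty result is reassuring but not proof of a clean file, so it is still worth looking at the output of `printable_runs` for anything else you did not expect.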
First, consider always using .CSV files to share your data, if there is no compelling reason not to do so. The other day I had to download a two-year-old .RData file from OSF and it contained data structures that were already partly obsolete when read by newer versions of the package that had created them; I had to hunt around online for the solution, and that might not work at all at some future point. Once I had sorted that out I saved the resulting data in a .CSV file, which turned out to be nearly 20% smaller than the .RData file anyway.
Second, try to keep all PII out of the dataset altogether. Build a separate file or files that connect each participant’s study ID number to their name and any other information that is not going to be an analysed variable. If your study requires you to generate a personalised report for the participants that includes their name then this might represent a little extra effort, but generally this approach will greatly reduce the chances of a leak of PII. (I suspect that for every participant whose PII is revealed by bugs, several more are the victims of either data theft or simply failure on the part of the researchers to delete the PII before sharing their data.)
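As a rough illustration of that separation, the following Python sketch splits a list of participant records into a private key table (study ID plus PII) and an analysis table (study ID plus everything else), and serialises the analysis table to CSV text. The function names and the column names in the usage example below (`study_id`, `name`, `score`) are hypothetical, and a real study would want the key file stored somewhere with much stricter access controls than the analysis file.

```python
import csv
import io


def split_pii(rows, id_field, pii_fields):
    """Split rows into (key_rows, analysis_rows).

    key_rows keep only the ID and the PII columns; analysis_rows keep the ID
    and every column that is not listed in pii_fields.
    """
    key_rows, analysis_rows = [], []
    for row in rows:
        key_rows.append({field: row[field] for field in [id_field, *pii_fields]})
        analysis_rows.append(
            {field: value for field, value in row.items() if field not in pii_fields}
        )
    return key_rows, analysis_rows


def to_csv(rows):
    """Serialise a non-empty list of dicts (all with the same keys) to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

For example, `split_pii(rows, "study_id", ["name"])` applied to records with `study_id`, `name`, and `score` columns gives a key table holding `study_id` and `name`, and an analysis table holding `study_id` and `score`. Because the CSV text is built from scratch out of the analysis rows alone, it can only contain what is in those rows, and since it is plain text you can read the whole file to confirm that.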
(Thanks to Marcus Munafò and Brian Nosek for valuable discussions about an earlier draft of this post.)