Is it legal for GHTorrent to aggregate Github user data?
Do users have a right to demand removal of their e-mail address from the GHTorrent data set? This is a frequent complaint about the project. GHTorrent supports research on Github software projects by making metadata (such as user activity and profile information) easy to index and search. This includes users' e-mail addresses, which allows all kinds of links to be created. But how legal is that?
Github is one of the biggest platforms for distributed software development, in particular for open source. The amount of activity on the platform makes it very attractive for scientific research on software development. For example, I found a paper showing that contributions by female developers are more likely to be adopted than those by male developers.
Such research requires combing through hundreds of thousands of projects and developers, which is hardly feasible if done by hand each time. Hence GHTorrent: in practical terms an offline mirror of all Github metadata, easy to search without having to visit Github for each research question.
Not everyone is happy with that, in particular because the metadata includes developer e-mail addresses. Researchers can use such an address as an identifier. The male/female research relied on that: using the e-mail addresses, the researchers could retrieve the developers' Google+ profiles, and from there their gender. And of course you can use an e-mail address to, well, send e-mail. Hence the many complaints about unwanted invitations to participate in all kinds of research.
The data is public. E-mail addresses are visible to all, so anyone who wants can acquire the same data by just querying the Github API or scraping the site. So it would seem it is just an ethical question whether you should do this or not. But not really. At least in Europe – where GHTorrent is operated from – there are strict privacy rules on personal data, and these also apply if the data was available from public sources.
An e-mail address is personal data under European rules, because it can be traced to a natural person (the developer). Any entity that brings together, indexes or makes available such data, is a “controller” or responsible entity for the result under these laws. A controller, by law, must show a basis in law for its processing, and comply with regulations on informing data subjects, and allowing them to view, correct and remove their personal data.
This applies also if the data is publicly available. Google discovered this in 2014 when it was hit with the “right to be forgotten”-decision of the European Court of Justice. Although Google search results are derived from public sources, Google is itself responsible for how it combines and presents those results. This means Google is a controller and thus has its own responsibility regarding basis in law, information and correction and removal, regardless of how this is handled or justified by Google’s sources.
The same applies to GHTorrent. This project brings public data together, but that data contains personal data. For that reason alone, the GHTorrent operators are a “controller” under the law and must inform users about what is going on, and allow them to correct their data or have it removed.
Removal, however, is not an absolute right. The criterion is whether the data has become irrelevant for the purpose for which it was collected. In the Google case, that meant search results for a person’s name that were “inadequate, irrelevant or excessive”. This criterion also applies to GHTorrent, but its application is more complex. What would be the GHTorrent equivalent of a 20-year-old news article about a person? I actually wouldn’t know.
There may be a simpler solution though. Data processing is legal only if there is a basis in law. Merely pointing at public sources is not enough (nor is saying you are doing scientific research). Permission was not obtained and no contractual relationship exists, so the only available basis in law is the “legitimate interests” criterion. Scientific research is a legitimate interest that may permit using data without permission. However, the scientist must then show a pressing need for the data (could you do without the data, or with less data?) and must also take all reasonable steps to protect the privacy of the data subjects.
Here things may go wrong, as this last step in practice means an opt-out must be offered. Again, this is not absolute. The law requires appropriate safeguards, and an opt-out is one such safeguard. Given the context of the GHTorrent data, other safeguards might also be appropriate. For example, e-mail addresses could be hashed (allowing later matching but not reuse), or removed and only provided to researchers who separately agree to keep them confidential. To me an opt-out seems the simplest solution, however.
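To make the hashing and opt-out ideas a bit more concrete, here is a minimal Python sketch. It is not how GHTorrent works; the function names, the record layout (`login`, `email`) and the choice of a keyed hash (HMAC-SHA256) are my own assumptions for illustration. I use a keyed hash rather than a plain one because e-mail addresses are easy to guess, so an unkeyed hash could be reversed by simply hashing a list of known addresses.

```python
import hashlib
import hmac


def pseudonymize_email(email: str, secret_key: bytes) -> str:
    """Return a stable token for an e-mail address (hypothetical helper).

    The same address always yields the same token under the same key,
    so records can still be matched, but the token cannot be used to
    send e-mail.
    """
    normalized = email.strip().lower()
    return hmac.new(secret_key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def scrub_records(records, secret_key, opted_out=frozenset()):
    """Replace e-mail addresses by tokens and drop opted-out users.

    `records` is assumed to be a list of dicts with an "email" field;
    `opted_out` is assumed to hold the tokens of users who asked to be
    left out of the data set.
    """
    scrubbed = []
    for record in records:
        token = pseudonymize_email(record["email"], secret_key)
        if token in opted_out:
            continue  # honour the opt-out: leave this user out entirely
        cleaned = dict(record)  # copy, so the input is not modified
        del cleaned["email"]
        cleaned["email_token"] = token
        scrubbed.append(cleaned)
    return scrubbed
```

In such a setup the operator would keep the key private and publish only the scrubbed records. Note that the output is still personal data in the legal sense as long as someone holds the key, so this is a safeguard, not a way out of the rules.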
So, to summarize: what GHTorrent is doing seems legal, but they must comply with European data protection legislation. In practice this means an opt-out mechanism must be provided, unless GHTorrent can find another privacy safeguard that provides an equivalent solution to the problem of e-mail address misuse.