Facebook's latest privacy debacle was driven by their failure to properly manage user IDs. This is not a new problem area and, as the EFF points out, Facebook has done this before. So while I don't know if Facebook will be interested in this post, those who care about protecting their users' privacy in an age of data sharing may want to have a look at the threats and defenses needed to share user IDs across sites. Securing user IDs isn't easy.
[Update 10/22/2010: Changed the title and intro and added three new sections at the end.]
Table of Contents
Section 1: Where should user IDs be put? URIs?
Section 6: Tracking Users Across Sites
1 Where should user IDs be put? URIs?
pasta.example.com lets people post their favorite Pasta types. Joe has an account on pasta.example.com. What should the URI to his favorite Pasta shape be? It could be http://pasta.example.com/favorites/shape. The website knows to pull up Joe's data because Joe's ID is in a cookie or maybe an authentication token.
In that case, however, Joe can't send his friend a link to his favorite pasta, because when the friend clicks on the URI pasta.example.com doesn't know which user's data to pull up. By putting Joe's identity into the URI, e.g. http://pasta.example.com/users/joe/favorites/shape, Joe can share the URI around and his friends can see Joe's favorite pasta shape.
The point is that by putting user IDs into URIs we make it possible for services to reason about multiple users and understand how to pull up their data. So, for example, if a recipe service wants to show Jane what Joe's and Jake's favorite pasta shapes are, the recipe service can just record http://pasta.example.com/users/joe/favorites/shape and http://pasta.example.com/users/jake/favorites/shape. If the user IDs weren't in the URI then the recipe service would need to understand what magic the pasta site was using to identify users (e.g. cookies, auth tokens, etc.) and learn that special magic.
By putting the User ID into the URI normal HTTP logic just works.
2 Referer [sic] Header
Let's say that Joe is accessing his medical records at https://cancer.example.com/users/joe. On that page is a link to an outside service with some relevant information. Joe clicks on that link and is taken to that outside site. In theory nobody needs to know where Joe came from but in reality browsers send referer [sic] headers that list the site the user came from. This HTTP header will contain the URI https://cancer.example.com/users/joe. This exposes Joe's identity and his association with cancer.example.com. If Joe's identity wasn't in the URI then that information wouldn't be leaked.
This attack is as old as the hills but as the recent Facebook referer leak shows, that doesn't mean people don't still screw it up. The 'solution' is that external links must come from a page that contains no user identifying information in the URI. There are a couple of ways of doing this but given the Facebook leak my guess is that oceans of technical ink will now be spilled on various tricks to avoid this problem so I don't see the need to add to it. I couldn't actually find a really solid article on the various ways to defuse the referer [sic] leak so if anyone has a good URI please let me know so I can add it here.
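One of those ways can be sketched as follows: rewrite every outbound link so that it passes through an intermediate page whose own URI carries no user ID. This is only a minimal sketch. The redirector path `/out`, the query parameter `to` and both function names are assumptions made up for illustration, and a real deployment would also want to restrict which targets it is willing to forward to.

```python
import html
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical redirector URI. The important property is that it
# contains no user-identifying information.
REDIRECTOR = "https://cancer.example.com/out"


def outbound_link(target_url):
    """Build the link to place on a user page instead of the raw external URL."""
    return REDIRECTOR + "?" + urlencode({"to": target_url})


def interstitial_page(redirector_url):
    """Body served at the redirector URI: a page that forwards the browser.

    The forwarding is done by the page itself (a meta refresh here) so that
    the external site sees, at most, the redirector URI as the referer.
    """
    target = parse_qs(urlsplit(redirector_url).query)["to"][0]
    safe_target = html.escape(target, quote=True)
    return '<meta http-equiv="refresh" content="0; url=' + safe_target + '">'
```

The page at https://cancer.example.com/users/joe would then link to `outbound_link("https://info.example.org/article")` rather than to the article directly. The forwarding has to be done by a real intermediate page. A plain HTTP redirect generally won't help, since browsers typically keep reporting the page that started the navigation as the referer.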
3 Man-in-the-Middle (MITM)
A user may log into a website securely over HTTPS (thus hiding their login and their identity) but then be switched to HTTP for the rest of the interaction. This would let an observer see the URIs going back and forth and see that the person talking to cancer.example.com is Joe.
SSL protects everything from the TCP payload on up. So if SSL is used for all communications, not just login, then all URIs will be protected against eavesdroppers. For any non-trivial data we should be using SSL to protect it.
4 Browser History
Browsers keep track of what URIs a user has visited, even if those URIs are over HTTPS. If the user's identity is in the URIs then this opens up various attack vectors. One is simply accessing the machine after the user has left and viewing the history. Even if the cookies and caches have been cleared, history tends to stick around, and so the attacker can see what sites the user has been to and what identities were used just by looking at the browser history.
The core of this attack is that an attacker basically takes control of the browser and can view the contents of the browser history at will. This is a fairly scary situation in and of itself. In the case that users are only expected to use the service from their own devices then this attack basically means that an attacker owns the browser and so can do much worse things than just look at history.
But what about scenarios where a user may access the site from someone else's device such as a kiosk? In this case we need to think carefully about the threat model. If we assume the machine is compromised before the user shows up then the user's identity is already compromised no matter what we do with URIs. So the only scenario where this attack really matters is one where the user is expected to use machines they don't control, the machine they use is expected to generally be clean, and only later on is the machine attacked and the user's data compromised.
If we think that scenario is realistic (I don't btw, I suspect in most cases if a machine is going to be compromised it happened before the user got there) then we need to rethink the whole design of the website. We will certainly need to run everything over SSL, ideally use something like Silverlight with heavily dynamic content and keep users' IDs out of URIs.
5 Browser Links
This one is just plain nasty. A web page can include links to arbitrary URLs and then check, for example through the styling the browser applies to visited links, whether the user has been to them before. It's a security hole plain and simple. But it's there and we can't ignore it. So we need to threat model it. The attack requires one to guess the exact URL, down to the last character, for it to work. In other words one can't just say "Hey has the user been to any site that begins with https://cancer.example.com/?". Instead one has to have the exact URL with the user ID and all other path values. So for the attack to be launched the attacker has to know both which site they are trying to attack and which specific user they are looking for.
How to address this attack depends on one's threat profile. In other words, a fairly small service dealing with fairly harmless information can probably ignore this attack. A major site dealing with secret data needs to be extremely concerned.
Unfortunately there really isn't much one can do about this attack. For example, let's say that a user goes to a bad guy's site and is fooled into logging in. So the bad guy knows who the user is. Now the bad guy site sends down a bunch of test links and sees that https://cancer.example.com/mainpage was clicked on. Even though the user's ID isn't in the link, it doesn't matter, the bad guy site knows the user has gone there.
The next step in the attack is for the bad guy to figure out what ID the user uses on cancer.example.com. This is where some defenses can come into play. One game one can play is to put cryptographically secure random numbers into all the URLs as query components. This will kill the link query attack dead since the attacker can't guess those numbers and so can never provide a test URL that will match against a logged in URL. In other words if the user's logged in URL is https://cancer.example.com/users/joe?id=[cryptographically secure random number] then the link guessing attack just fails.
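A minimal sketch of that game, assuming Python's standard `secrets` module as the source of cryptographically secure randomness. The function name `unguessable_url` is made up for illustration, and the parameter name `id` simply follows the example URL above.

```python
import secrets
from urllib.parse import urlencode


def unguessable_url(base_url):
    """Append a cryptographically secure random value as a query component.

    The link-guessing attack needs an exact URL match, so a component the
    attacker cannot predict means no test URL will ever match.
    """
    nonce = secrets.token_urlsafe(32)  # 32 random bytes from the OS CSPRNG
    return base_url + "?" + urlencode({"id": nonce})


# Every logged-in URL handed to the browser gets its own random value.
print(unguessable_url("https://cancer.example.com/users/joe"))
```

In this sketch the server never needs to remember the random value. Its only job is to make the URL that ends up in the browser's history impossible to reproduce from the outside.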
Of course if Website A is going to legitimately redirect the user to cancer.example.com and the redirect URL doesn't include some cryptographically secure random number then the protection fails. So where cross site redirects involving URLs pointing directly at the user's ID are involved this is a fairly tenuous mechanism. Thankfully this is a reasonably rare scenario. Typically when sites legitimately share user specific data they do so on the back end.
Another technique when dealing with data that isn't intended to be public is to use a cryptographically secure random number as the user's identifier. This isn't a foolproof technique since referer [sic] could still leak the identifier. But now the attack becomes substantially more complex since the attacker needs to be one of the referrers or have a relationship with one of the referrers. The attack is still possible, just more expensive.
One can up the ante even more by taking the real user ID and encrypting it with a known key. This keeps the conceptual model server side simple (decrypt, use ID) and provides the ability to change the visible ID as often as one wants. This approach can even work with partners if a shared key is used. In other words website A can redirect the user to website B with the user ID included in the URL as an encrypted value using a key shared by the websites.
6 Tracking Users Across Sites
There are plenty of scenarios where a user authorizes another site to learn something about them. So Joe, in our previous example, might want cancer.example.com to share data about him with white.cell.tracker.example.com and maybe discount.drug.example.com. In both cases cancer.example.com will have to provide links to the two other sites and those links will contain Joe's identity. This will then let the two sites exchange notes and realize they are both dealing with Joe. This allows everything from unwanted marketing to more nefarious scenarios.
If two websites want to conspire to track a user there is literally nothing the user can do to stop it. An excellent proof of this is xauth.org. This is a group of popular sites (including my current employer) who publish information about their users to a central site which can then track the user across those sites. Of course the sites promise to only share identity data where appropriate given the user's approval. No, really, they promise. That creating xauth.org isn't a criminal conspiracy helps to illuminate just how backwards our laws are when it comes to electronic privacy.
But in any case, xauth.org is perfect proof that if sites want to conspire to rob you of your privacy then there is absolutely nothing you can do about it. You will just have to blindly hope that the sites don't decide to go rogue. Of course you have no say in the matter.
That having been said, it's one thing for a group like xauth.org to exist ahead of time. That really can't be defended against if the attackers are sufficiently determined. It's another for a group of sites who weren't previously cooperating to decide to cooperate in the future and to be able to share their old logs and compare users.
For example, site A shares information about Joe with site B and C. In both cases site A identified Joe by using the user ID "Joe". If, in the future, site B and C want to conspire to track users across them (this can be extremely profitable since it allows for targeted advertising) then they just look for common IDs from site A.
To prevent that specific kind of retroactive data sharing one can use what are called pairwise unique IDs. That is, when site A identifies Joe to site B they can use a different identifier than they use with site C. Pairwise unique IDs can be generated either using encryption (e.g. take the user's actual ID and the site's ID and encrypt them together) or by look up tables. Strictly speaking look up tables are probably more secure since encryption keys can eventually be broken. But they are more expensive to implement. Check with your local crypto guru to make the proper trade off for your situation.
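Here is a rough sketch of the look up table flavor, with an in-memory dictionary standing in for whatever store site A would really use. The names `id_for_partner`, `site-b` and `site-c` are invented for the example.

```python
import secrets

# Hypothetical in-memory table: (internal user ID, partner) -> external ID.
_pairwise_ids = {}


def id_for_partner(user_id, partner):
    """Return the ID under which `partner` knows `user_id`, creating it on first use."""
    key = (user_id, partner)
    if key not in _pairwise_ids:
        _pairwise_ids[key] = secrets.token_urlsafe(16)  # random, reveals nothing
    return _pairwise_ids[key]


# What sites B and C would each find in their own logs.
users = ["joe", "jake", "jane"]
seen_by_b = {id_for_partner(u, "site-b") for u in users}
seen_by_c = {id_for_partner(u, "site-c") for u in users}

# Comparing old logs after the fact finds no common IDs to join on.
print(seen_by_b & seen_by_c)
```

Each partner keeps seeing a stable ID for Joe, so nothing breaks for them individually. What they lose is the shared value that made the retroactive comparison possible.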
7 Triangulating Services
Let's tell what I'm sure is a completely hypothetical story (cough). Imagine, if you will, a service that lets people create their own websites. In the URL for each user's website is a unique ID used to identify the owner of the website. The ID is just a jumble of numbers and there is no official mechanism to translate from the unique ID to something readily identifiable like an e-mail address or name.
Normally websites created with this service will display the person's name, e-mail, etc. Thus, of course, providing a way to map from the supposedly anonymous unique ID to the person's identity, but believe it or not, that isn't the hole in this case. But the service decides to add the ability to create anonymous websites. What makes the websites anonymous is that they won't automatically display the user's name or e-mail. But their unique ID? It's still in the path. But in theory this is 'o.k.' because again, there is no 'official' mechanism to go from the unique ID to more useful user identifiers.
However there was another, separate service that provided IM. This (hypothetical of course) service also used a unique ID for users. Now the IM service didn't display the unique ID for a user but it was included in the (theoretically thoroughly hacked and well documented) protocol in messages sent down to the IM client.
All would have been fine if it weren't for one tiny little issue. Both the IM and website service used the same identity provider and so used the same unique ID for the same user. So if user A had an anonymous website and sent IMs to user B then user B could pull out user A's user ID from the IM client, do a quick Internet search, and see if a website existed with the same user ID. If it did, and the website was anonymous, user B now knew that user A was the owner of the website.
The issue here is ID triangulation. By exposing the same ID for the same user in two different contexts an attacker can triangulate and determine who the user is.
We have already discussed encrypting IDs and using pairwise unique IDs. Now we need to throw in per service pairwise unique IDs. That way even if the same entity is accessing the same user's data across two different services they will get two different IDs.
Still, one has to think carefully about going to this level of obfuscation. In many cases it's a feature and not a bug that a service can track a user across multiple service front ends. So long as the service has been empowered to have this information (which may be by default in terms of public data) then there is no problem.
The problem comes when services are trying to hide this data. Note, btw, that at this point it doesn't even matter if the ID is in the URL or not. We have taken the argument beyond that issue. If the ID is available via any mechanism to the caller (cookie, auth token, whatever) across two different services then this kind of triangulation is possible.
So the bottom line is - ID in the URL or not, if you are sharing an identity provider with anyone (even other services at your own company) then you have the triangulation problem and need to threat model it to decide if it requires remediation.
8 Blowing user anonymity
Let's say that there is a website called party.example.com. It tracks social and business events and can send out SMS messages notifying its users when an event interesting to them is going to happen. Sanford is something of a party animal but he doesn't want his fellow CPAs to know that. To separate his two lives Sanford created two identities, email@example.com and firstname.lastname@example.org.
Sanford logs into party.example.com using email@example.com and asks to receive SMS messages about the loudest most insane parties where he lives. As part of this process Sanford is redirected using OAuth to his carrier BigTelco where Sanford logs in with his account firstname.lastname@example.org. All party.example.com knows is that Sanford uses BigTelco, it doesn't know anything about his identity at BigTelco. After granting party.example.com permission to send him SMS messages, BigTelco sends Sanford back to party.example.com with a refresh token that includes a BigTelco ID for Sanford.
Now Sanford again logs into party.example.com (using a different browser) but this time he logs in with the identity email@example.com. This time he subscribes to SMS notifications about business events in his area. Again Sanford is redirected to his carrier, BigTelco and again logs in as firstname.lastname@example.org. At this point however the permission request is short circuited since party.example.com already has permission to send Sanford SMS messages. So BigTelco just sends Sanford automatically back to party.example.com with a refresh token containing BigTelco's ID for Sanford.
The result is that party.example.com now has two different refresh tokens for two different accounts at party.example.com (email@example.com and firstname.lastname@example.org) which have the same BigTelco User ID. Bingo, party.example.com now knows that the same person owns both accounts, Sanford's anonymity is lost.
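To make concrete how little work this takes on party.example.com's side, here is a small sketch. The account labels and the BigTelco ID values are placeholders invented for the example, as is the function name `linked_accounts`.

```python
from collections import defaultdict


def linked_accounts(grants):
    """Group local accounts by the provider user ID found in their tokens.

    `grants` is an iterable of (local account, provider user ID) pairs.
    Any group with more than one account belongs to a single person.
    """
    by_provider_id = defaultdict(set)
    for local_account, provider_user_id in grants:
        by_provider_id[provider_user_id].add(local_account)
    return [accounts for accounts in by_provider_id.values() if len(accounts) > 1]


# Placeholder data: two accounts came back from BigTelco with the same ID.
grants = [
    ("party-account", "bigtelco-123"),
    ("business-account", "bigtelco-123"),
    ("someone-else", "bigtelco-999"),
]
print(linked_accounts(grants))
```

A single pass over the stored tokens is enough. Nothing about OAuth itself stands in the way once the same BigTelco ID shows up twice.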
This scenario involves OAuth but any situation where two websites are sharing data about a user will run into this issue.
In the previous section we talked about having to make pairwise per service unique IDs for users in order to protect their privacy. In other words an ID that is unique per calling application, per service being called, per user. Now we add another dimension which requires unique IDs: per third party account. In other words, in order to protect users' privacy it's necessary for third parties (like party.example.com) who are asking for a user ID to specify what user ID they know the user by. Then BigTelco needs to generate a new ID that is unique for the combination of [BigTelco User ID, BigTelco Service Type, calling party ID, ID by which the calling party knows the BigTelco User]. Only by generating a unique user ID any time any of these four values change can BigTelco be sure it won't blow their users' attempts to protect their own privacy.
Note, btw, that party.example.com, when it shares the ID it knows Sanford by, should apply all the techniques mentioned in this paper as well. What's good for the goose is good for the gander.
The core of protecting a user's privacy in a scenario like this is a four column table: Internal User ID, Internal Service ID, Calling Party ID & Calling Party's ID for the User. Any time any of these values change in an interaction with an external party a new external user ID is needed.
9 A side note on implementing secure IDs
Securing user IDs means constantly generating new IDs for users to be handed out to different parties in different circumstances. These IDs must later be traced back to an internal ID in a way that can't be hacked by outsiders. There are two sort of obvious ways to do this.
The one that comes to mind first is encryption. Just take the four values discussed in the previous section, concatenate them, encrypt them and use that as the user ID. But there are several practical problems with using encryption this way. First, encryption keys need to be rolled over. So what happens to old IDs when a new key is in use? Typically it's o.k. to use an old key for validation purposes for a while but eventually it needs to be retired. But then what? Will every user ID generated with the old key have to be rolled over? That isn't going to be much fun.
Also nothing ever gets forgotten on the Internet and some day our 'super duper strong cryptography' will be 'weak cryptography' that 12 year olds will amuse themselves by breaking on their neurally integrated processing implants. So by generating these IDs we create a situation where 10, 20, 30 years down the road the keys will be cracked and the secrets revealed. Having one's past from decades gone by come back to haunt one isn't a pleasant thought.
9.2 Tables v1
One could literally create a table with the four specified columns and a fifth column that contains a cryptographically secure random number to be used as the external user ID. When an ID is submitted it is looked up in the table to determine what user it maps to. By using a random number rather than encrypted content no secret information is allowed free onto the network. All the juicy stuff is strictly internal. This also means that user IDs, once created for a particular quadruple, don't ever have to change.
Typically look ups on the table will go from user ID to the four columns which have to be matched to the current context to make sure the right ID is being used and to discover the desired internal user ID.
But we will also need a reverse look up table if we are going to be nice and make sure that if someone asks us for a permission they already have then they will get the same ID they got last time. Building the reverse look up table isn't much fun but it's also par for the course for anyone who deals with denormalized databases. Alternatively, if one's user base is small enough then a relational database can be used, which makes the reverse look up trivial.
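As a rough illustration of the v1 design, the sketch below keeps both look ups in memory. The class name, the field names and the choice to raise `KeyError` on a context mismatch are all assumptions for the example, and a real system would back this with a database.

```python
import secrets
from typing import NamedTuple


class Context(NamedTuple):
    """The four columns from the previous section."""
    internal_user_id: str
    internal_service_id: str
    calling_party_id: str
    calling_party_user_id: str


class IdTableV1:
    def __init__(self):
        self._forward = {}  # external ID -> Context
        self._reverse = {}  # Context -> external ID (the reverse look up)

    def issue(self, ctx):
        """Return the external ID for this quadruple, creating it only once."""
        external_id = self._reverse.get(ctx)
        if external_id is None:
            external_id = secrets.token_urlsafe(16)
            self._forward[external_id] = ctx
            self._reverse[ctx] = external_id
        return external_id

    def resolve(self, external_id, service_id, calling_party_id):
        """Map an external ID back to the internal user, checking the context."""
        ctx = self._forward.get(external_id)
        if ctx is None or (ctx.internal_service_id, ctx.calling_party_id) != (
            service_id,
            calling_party_id,
        ):
            raise KeyError("unknown ID or ID used in the wrong context")
        return ctx.internal_user_id
```

Asking twice with the same quadruple returns the same ID, while changing any one of the four values produces a fresh one. The context check in `resolve` is what stops an ID issued to one partner from being replayed by another.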
9.3 Tables v2
If one is willing to be a little less friendly to forgetful developers then there is a way to simplify the table, get rid of the reverse look up and almost certainly increase overall security. The trick is that any time a partner asks for a permission about a user they will get a different user ID. This means that if a site forgets it has a permission and later remembers it will now have two different IDs. But this simplification means, amongst other things, that one doesn't need an ID for the user from the partner. Instead any time a permission is asked for the user will again be prompted and a new ID will be generated.
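A sketch of the v2 idea under the same caveats: the class and method names are hypothetical and an in-memory dictionary stands in for real storage.

```python
import secrets


class IdTableV2:
    def __init__(self):
        # external ID -> (internal user ID, internal service ID, calling party ID)
        self._forward = {}

    def issue(self, internal_user_id, service_id, calling_party_id):
        """Always mint a fresh external ID; nothing is looked up or reused."""
        external_id = secrets.token_urlsafe(16)
        self._forward[external_id] = (internal_user_id, service_id, calling_party_id)
        return external_id

    def resolve(self, external_id, service_id, calling_party_id):
        """Map an external ID back to the internal user, checking the context."""
        row = self._forward.get(external_id)
        if row is None or row[1:] != (service_id, calling_party_id):
            raise KeyError("unknown ID or ID used in the wrong context")
        return row[0]
```

Because every grant gets its own ID, the Sanford scenario from the previous section comes out differently: his two accounts at party.example.com would receive two unrelated BigTelco IDs without BigTelco ever having to learn what party.example.com calls him.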
2 thoughts on “User IDs – managing the mark of Cain”
Good stuff Yaron! I am reluctant to accept that all is lost with Privacy. We need to try to regain it to whatever degree we can. Simply using SSL/TLS all the time for all communication across the internet will be a big help. …and it is getting cheap enough that services like facebook, live, etc. should simply follow gmail’s example. Always on SSL significantly improves things.
I do think that wider use of SSL would be a big improvement and have been arguing for several years now in Microsoft to do just that. For example, when we first shipped the Live third party delegation system back in 2007 or so we ran the whole thing over SSL for that very reason. I remember all the doubters screaming that without (insanely expensive) hardware accelerators we would die within a day. They were wrong. We ran just fine terminating SSL directly on the application server. Much thanks to Microsoft’s crypto team who has been doing a ton of work to make SSL run fast on Windows.
But as the examples at the end of the article show, SSL won’t be enough in and of itself. Privacy leaks in many places. I’m actually planning on splitting this article. One part will just take the use cases at the end and show how privacy leaks and what we can do to manage it. The other part will specifically address privacy leakage in the context of the browser. That will hopefully help highlight the issues with privacy leaking even with SSL.