Building a data web needs to start with a clear statement of a user’s bill of rights. This is the core of the requirements that should drive the technologies used. Below I list what I think are the four fundamental rights and then explore their technical implications.
This is a continuation of my previous article asking why we don’t have an open data web. My list of core rights would include:
- A right to own one’s own data
- A right to full privacy in accessing and sharing one’s own data
- A right to use one’s data where, when and how one wants
- A right to exercise these rights successfully without being a computer/security expert
My belief is that the only really meaningful way to own one’s own data is for that data to be primarily collected and stored on devices that the user owns. All of a user’s information, ranging from blog posts to microblogs to emails to group postings, could easily fit on a smartphone. Photos and videos are the only outliers, and presumably the exponential growth in storage space will deal with those as well. There is, then, no technical reason that all of a user’s data can’t live on just about any reasonable user device. The foundations of an open data web should therefore start with the assumption that most interesting data starts on, and is primarily stored on, user devices.
We are already in a multi-device world, however. I would be extremely irritated if, say, a document I created on my phone was only available on my phone and not my PC or my tablet. Today we solve this kind of synchronization issue via some kind of cloud service. But that puts us back in the place of handing our information to someone else and asking for it back.
My belief is that not only must one’s own data be on one’s own devices, it must also be instantly and easily available on all of those devices with no effort on the part of the user. This means we need a peer-to-peer mesh that keeps all the devices in sync. This doesn’t preclude the use of a non-user-controlled entity if the user so chooses, but for that to be a meaningful choice (as opposed to a hard requirement) the system must work perfectly in the absence of such a third party. In other words, power up the mesh.
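To make the mesh idea concrete, here is a minimal sketch of device-to-device sync. The `Device` class and its last-writer-wins merge are my own illustrative assumptions, not a real protocol; a production mesh would need something like vector clocks or CRDTs to handle truly concurrent edits, plus encrypted transport between peers.

```python
import time

class Device:
    """A toy peer in the user's device mesh. Each record carries a
    (timestamp, value) pair; on merge, the newest write wins."""

    def __init__(self, name):
        self.name = name
        self.store = {}  # key -> (timestamp, value)

    def write(self, key, value, ts=None):
        self.store[key] = (ts if ts is not None else time.time(), value)

    def sync_with(self, other):
        # Exchange records in both directions; newest timestamp wins on conflict.
        for src, dst in ((self.store, other.store), (other.store, self.store)):
            for key, (ts, value) in list(src.items()):
                if key not in dst or dst[key][0] < ts:
                    dst[key] = (ts, value)

# Usage: a document created on the phone appears on the laptop after one
# pairwise sync, with no cloud service in the loop.
phone, laptop = Device("phone"), Device("laptop")
phone.write("doc/notes", "draft v1", ts=1)
laptop.write("doc/todo", "buy milk", ts=2)
phone.sync_with(laptop)
```

Note that a pairwise merge like this composes: syncing any connected set of devices eventually converges, which is what lets the mesh work without any coordinating server.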
The term privacy is tricky, so I need to define it a bit more. In the security community, when one says ‘privacy’ what is typically meant is hiding the content of a transmission. So to most security folks, if a communication is encrypted then it is private. This is true to a point, but I would argue it isn’t sufficient. Real privacy isn’t just hiding what you said; it is also hiding who you said it to. The attack in play here is called ‘traffic analysis’, and it focuses not so much on what is being said as on who it is being said to. Most security folks run away from the traffic analysis problem because it is genuinely hard and in many cases may actually be unsolvable. But it’s not our job to save the world, just to try.
The first, and easiest, step toward stopping traffic analysis is to use a network like Tor, which makes it more difficult to see who is sending messages to whom. But this isn’t, unfortunately, enough. For example, Mozilla has an awesome service that shares bookmarks between a user’s browsers on different machines. Mozilla implemented this service in all the best ways, including encrypting the content with a key that only resides on the user’s computers, so Mozilla can’t see the data themselves. So far, so good. But unfortunately, just the act of copying the data to Mozilla’s servers acts as a traffic beacon. Anyone watching Mozilla’s servers can see when the user is logged in, when they are active and, unless the user is using Tor, where the user is. To provide real privacy the user should be syncing their bookmarks across their devices using the previously described peer-to-peer mesh.
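The beacon problem is easy to see in code. In the toy sketch below (the `SyncServer` class and its `upload` method are hypothetical names of my own), the server never sees plaintext, yet it still accumulates exactly the metadata a traffic-analysis attacker wants: who uploaded, when, from where, and how much.

```python
class SyncServer:
    """Toy central sync server. Content is end-to-end encrypted, so the
    server stores only opaque blobs -- but the act of uploading still
    leaves a complete activity trail."""

    def __init__(self):
        self.metadata_log = []

    def upload(self, user, client_ip, ciphertext, ts):
        # The blob is unreadable; the metadata around it is not.
        self.metadata_log.append({
            "user": user,
            "ip": client_ip,
            "bytes": len(ciphertext),
            "time": ts,
        })

server = SyncServer()
# Two "perfectly" encrypted bookmark uploads from the same user...
server.upload("alice", "203.0.113.7", b"\x8f" * 412, ts=100)
server.upload("alice", "203.0.113.7", b"\x8f" * 415, ts=160)
# ...tell an observer that alice was online at t=100 and t=160, where she
# connected from, and roughly how much her bookmark file changed.
```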
But now imagine that a group of users want to share a browsing experience. Again, we could use a central server but again that central server acts as a traffic beacon. Only by moving the entire experience to the users’ devices can we hope to provide any reasonable protection against traffic analysis.
So the point is that providing meaningful privacy means getting out of the cloud and onto one’s own devices. Note that none of this means one can’t use the cloud. Heck, one of the peers could be a cloud service. But it does mean that the primary home of one’s own data is one’s own devices and that the default behavior is device-to-device sharing over encrypted connections routed through a traffic-analysis-resistant network like Tor.
If one is going to have meaningful control over one’s data then one needs an open ecosystem on one’s devices with which to use that data. If one has to ask some feudal lord for permission to run an app then one isn’t in an open ecosystem and one doesn’t have meaningful control over one’s data.
One also needs open standards for some combination of how data is stored on the devices (so that multiple apps, both on that device and on other devices the data is synced to, can use the data) and interfaces for well understood services like calendaring, blogging, messaging, etc. When to use a data layer standard and when to use an application protocol layer standard is something we need to explore on a case by case basis.
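A data layer standard could be as simple as a typed, versioned envelope around each record. The field names and the `org.example.blogpost` type below are purely illustrative assumptions, but they show the idea: any app on any synced device can recognise the record type and schema version before touching the payload.

```python
import json

# Hypothetical on-device record format: a self-describing JSON envelope.
record = {
    "type": "org.example.blogpost",    # well-known type name (assumption)
    "schema_version": 1,               # lets apps handle format evolution
    "created": "2013-04-01T12:00:00Z",
    "body": {"title": "Why no open data web?", "text": "..."},
}

# Any app -- a blog editor on the phone, a reader on the tablet -- can
# round-trip the record without private knowledge of who wrote it.
serialized = json.dumps(record)
parsed = json.loads(serialized)
```

An application protocol layer standard would instead specify the messages apps exchange (e.g. “publish post”, “fetch calendar”), leaving storage private to each app; which layer to standardize is the case-by-case question above.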
In addition one needs a software stack that implements all the infrastructure components of the system: code to handle everything from affiliating a user’s devices together to connecting users to each other to synchronizing data between devices.
There is also a lot of identity wonkery needed here (thankfully the technology is fairly straightforward). I will need to throw around terms like attestation, user-generated identities, web of trust, etc. I’ll have a whole article on this at some point. But for now the key point is that users need to be able to create an identity and build a community around that identity without being required to interact with any Identity Provider (IdP, the people who know your password). The IdP model is a real problem in terms of privacy because IdPs act as traffic beacons: they know when you log in, how often, who you are talking to, etc. Yes, yes, I know. There are ways to mitigate these issues without getting rid of IdPs. The way I think about it is less as IdPs and more as attestation servers. From that frame things are much easier. But whatever, I’ll get into the details later.
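The core of a user-generated identity is easy to sketch: the identifier is just a fingerprint of a public key the user generates locally, so no IdP is ever involved in “issuing” it. In the sketch below the 32 random bytes stand in for a real public key (a real system would generate an actual key pair, e.g. Ed25519); `new_identity` is a hypothetical name of my own.

```python
import hashlib
import secrets

def new_identity():
    """Create a self-issued identity: a local key plus its fingerprint.

    The fingerprint is the identity. Nobody granted it, nobody can
    revoke it, and later statements can be verified against the key.
    """
    public_key = secrets.token_bytes(32)  # placeholder for a real key pair
    fingerprint = hashlib.sha256(public_key).hexdigest()
    return public_key, fingerprint

pub, ident = new_identity()
# Friends bind the fingerprint to a person by signing attestations like
# "I, <their fingerprint>, vouch that <ident> is Alice" -- a web of trust,
# with attestation servers merely caching what users have already signed.
```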
But, again, the point is that the user owns their identity, nobody has the right to take it from them and they can use their own identities with their own data with whatever software they choose in cooperation with anyone else without asking for anyone’s permission.
One of the big failings of mesh-based systems is that they are complex. Throw in user-generated identities and things can get really ugly. Now throw in empowering users to decide what software to run, what rights to grant, etc., and the scope for abusing users quickly runs to infinity.
But if our solution to these problems is ‘don’t do bad things’ then we are screwed beyond all hope. It’s absurd to expect users to be computer and/or security experts and then blame them when they can’t deal with impenetrable dialogs and unfathomable configurations.
My own thinking is that the data web has to take a four-step (sigh, yes, four... sorry) approach to helping users live happy lives on the data web - Protection, Mitigation, Detection & Recovery.
Our first job is protection: to make sure that bad things are hard to do to users. This means looking at technologies like micro VMs as a way to let users run code on their devices that we can meaningfully limit, with a really reduced security surface area. Microsoft’s Drawbridge technology is particularly exciting to me.
But we also need to accept that we are going to fail. We are going to fail because sometimes the software and/or hardware barriers we put up won’t work. We are going to fail because sometimes users will make unfathomably bad decisions. Either way, stuff happens. So our next job is to think about mitigation. What can we do to reduce the damage when device security is violated? The main thing I can think of here is trying to slow things down so we have time to detect and recover. For example, maybe delete or mass-update orders can be delayed, or require space for escrowing previous state for some period of time? Can we mark devices as more and less trusted? So a cell phone (which is really managed by the carrier and hence ripe for abuse) might just be marked as off limits for certain kinds of data by other devices in the mesh. We need to think about ways of mitigating the damage from an attack in the period of time before we know the attack has occurred.
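The escrowed-delete idea can be sketched in a few lines. The `EscrowedStore` class and its hold-period policy are my own illustrative assumptions: destructive operations don’t take effect immediately; deleted data sits in escrow for a while, giving the user (or the mesh) time to notice a compromise and roll back.

```python
class EscrowedStore:
    """Mitigation sketch: deletes are soft for a hold period.

    A delete moves the value into escrow rather than destroying it;
    within the hold period the previous state can be restored."""

    def __init__(self, hold_period):
        self.live = {}
        self.escrow = {}            # key -> (deleted_at, value)
        self.hold_period = hold_period

    def delete(self, key, now):
        if key in self.live:
            self.escrow[key] = (now, self.live.pop(key))

    def undelete(self, key, now):
        if key in self.escrow:
            deleted_at, value = self.escrow[key]
            if now - deleted_at <= self.hold_period:
                self.live[key] = value
                del self.escrow[key]
                return True
        return False

store = EscrowedStore(hold_period=7)
store.live["photos"] = ["img1", "img2"]
store.delete("photos", now=0)                 # attacker (or the user) wipes the data
recovered = store.undelete("photos", now=3)   # caught inside the hold period
```

The same pattern generalizes: any irreversible operation (key rotation, mass update, device de-affiliation) can be given a delay window during which other devices in the mesh can veto it.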
Then we have to worry about detection and recovery. For me the flagship scenario is that all of a user’s data is destroyed, including all of their private keys. In this scenario we assume the user has no backups or the devices that were the backups were themselves compromised. How does a user recover their identity? Anything we do here is deeply scary because any recovery mechanism could just as easily be used by an attacker to hijack an identity. But I think we have to go here. My core thought is that identity is a consensual illusion. It has no existence outside of the beliefs of a group of people. So if those people change their beliefs (for good or bad reasons) as to who someone is then identities can be moved around. Scary but exciting stuff.
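Treating identity as a consensual illusion suggests a recovery mechanism: if enough previously-trusted friends attest that a new key belongs to the same person, the community rebinds the identity to that key. The sketch below is a bare k-of-n vote with hypothetical names of my own; a real design would need signed attestations, rate limits and delays to blunt the hijacking risk noted above.

```python
def recover_identity(trusted_friends, attestations, new_key, threshold):
    """Community-based identity recovery sketch.

    trusted_friends: set of friends the old identity had designated.
    attestations: mapping of friend -> key they vouch for.
    Returns True if at least `threshold` designated friends vouch
    for `new_key`, i.e. the community agrees this is the same person.
    """
    votes = sum(1 for f in trusted_friends
                if attestations.get(f) == new_key)
    return votes >= threshold

friends = {"bob", "carol", "dave", "erin"}
attests = {"bob": "KEY2", "carol": "KEY2", "dave": "KEY2"}
recover_identity(friends, attests, "KEY2", threshold=3)  # → True
```

The threshold is the whole game: set it too low and an attacker who fools a couple of friends steals the identity; set it too high and a legitimate user who lost everything can never come back.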
But to me the bottom line to making a usable system is to make sure that the cognitive load we put on users is at a level they can reasonably handle and that when things go wrong (and we must assume they will) we have workable, non-centralized mechanisms to let users recover. I don’t expect miracles here but it is our duty to try.
It’s worth noting that this work is being done on my own time with my own resources and is completely independent of my employer. So whatever you think of this article don’t blame my employer, it’s not their fault.