I first heard about the problem with UPnP on NPR. Windows XP had a major security flaw [1,2]. What really focused my attention was when I found out that the flaw was in a system called Universal Plug and Play. I worked in the network architecture group at Microsoft that owned UPnP and led the design effort on SSDP.
So I decided to investigate.
The three holes that have been found so far are:
- A buffer overflow that allows a remote machine to take over the UPnP machine.
- Performing an HTTP GET without checking how much information has been downloaded, which allows an attacker to cause the machine to run out of memory.
- Not checking for excessive network announcements, which leaves UPnP machines open to being used for DDOS attacks.
The first two holes, which I'll go into more below, were implementation flaws. Although it makes me sad to be associated with a product that suffered from such security holes, the reality is that the architecture group and the product group had a (strongly enforced) separation. As a member of the architecture group I was only supposed to work on the spec design. All implementation-related issues were to be handled by the product group. So the first two problems were out of my remit.
The third hole was a known problem. I had even written up a nice way to deal with it. But there was a fairly long and heated argument between the product group and the architecture group as to how serious this hole was. In the end the compromise that was reached is that the spec would identify the bug (which it did) but would not specify a solution.
How UPnP works
To understand the flaws it helps to understand the Simple Service Discovery Protocol (v1-03) which is UPnP's discovery mechanism. SSDP v1 was designed to support the SOHO (small office/home office) market. It was designed to allow a client (e.g. Windows ME/XP) to discover UPnP compatible devices and automatically configure them.
Scenario 1 - The device announces itself
When installing a network printer today the user has to find the printer's name, go to each client, type in that name and configure the printer. With UPnP the printer, as soon as it came on-line, would announce itself (NOTIFY) on an administratively scoped multicast channel and say "I'm here". The "I'm here" announcement (NOTIFY) contains a URL that points to a description of the device. This is a piece of XML that the UPnP client can read and learn things like "This is a XYZ brand printer."
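To make the announcement concrete, here is a minimal sketch of how a client might pull the description URL out of a NOTIFY. It assumes the message is HTTP-style text with a LOCATION header, as described above. The sample message, the UUID, the addresses and the function name `parse_notify` are all made up for illustration and are not taken from any real implementation.

```python
def parse_notify(raw: bytes) -> dict:
    """Parse an HTTP-style NOTIFY datagram into a dict keyed by upper-cased header name."""
    text = raw.decode("utf-8", errors="replace")
    head, _, _ = text.partition("\r\n\r\n")  # ignore anything after the header block
    lines = head.split("\r\n")
    request_line = lines[0].split()
    if len(request_line) != 3 or request_line[0] != "NOTIFY":
        raise ValueError("not a NOTIFY message")
    headers = {}
    for line in lines[1:]:
        # Split on the first colon only, so URLs in the value stay intact.
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().upper()] = value.strip()
    return headers


# Illustrative announcement; every value here is hypothetical.
SAMPLE = (
    b"NOTIFY * HTTP/1.1\r\n"
    b"HOST: 239.255.255.250:1900\r\n"
    b"NT: upnp:rootdevice\r\n"
    b"NTS: ssdp:alive\r\n"
    b"USN: uuid:printer-1234::upnp:rootdevice\r\n"
    b"LOCATION: http://192.168.1.20:8080/description.xml\r\n"
    b"CACHE-CONTROL: max-age=1800\r\n"
    b"\r\n"
)

if __name__ == "__main__":
    print(parse_notify(SAMPLE)["LOCATION"])
```

Note that nothing in the message proves the LOCATION URL actually belongs to the sender, which matters for the third bug.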
Scenario 2 - The Windows box finds available devices
Imagine a user buys a brand new Windows XP box and brings it into their home. When the XP box wakes up it can send out an administratively scoped multicast request (SEARCH) and ask "What services are on this network?" Then the printer can send a response (NOTIFY) directly to the XP box saying "Hi, I'm here, here is the URL for my description." The XP box can then retrieve the description.
With the above background, we can now see what went wrong in Microsoft's implementation of UPnP. Please keep in mind that I don't actually have a UPnP enabled box to play with (I use W2K) so I'm making educated guesses.
Bug #1 - The Buffer Overflow
Sending a stream of NOTIFY messages at the right frequency with the right content will cause a buffer overflow. Even worse, the overflow behavior is apparently fairly predictable so that a knowledgeable attacker can use the overflow to take control of the machine. Also there is nothing really novel here. It's not like buffer overflows are a new threat. The implementation simply didn't do its job and check for them.
Bug #2 - The Unchecked Buffer
When the client receives a NOTIFY it takes the URL in the location header, which points to the device's description, and executes an HTTP GET. My guess is that UPnP then tries to download the whole description and only starts parsing it once the download is complete. The problem is: what if the server is malicious and sends an infinite length response? In that case UPnP will suck up the data until it finally runs out of memory, which is exactly what the www.eeye.com attack does. Again, this isn't a novel threat. Checking response lengths isn't exactly state of the art defensive programming.
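The missing check is small. Below is a minimal sketch of a download loop that gives up once a response exceeds a cap. It is not Microsoft's code. The 64 KB limit, the chunk size and the names `read_bounded`, `DescriptionTooLarge` and `EndlessStream` are assumptions chosen for illustration.

```python
import io

MAX_DESCRIPTION_BYTES = 64 * 1024  # assumed cap; real descriptions are a few kilobytes
CHUNK_SIZE = 4096                  # assumed read buffer size


class DescriptionTooLarge(Exception):
    """Raised when a response exceeds the allowed size."""


def read_bounded(stream, limit=MAX_DESCRIPTION_BYTES, chunk_size=CHUNK_SIZE) -> bytes:
    """Read from a file-like object, aborting as soon as more than `limit` bytes arrive."""
    chunks = []
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # end of data
            break
        total += len(chunk)
        if total > limit:
            raise DescriptionTooLarge(f"response exceeded {limit} bytes")
        chunks.append(chunk)
    return b"".join(chunks)


class EndlessStream:
    """Hypothetical stand-in for a malicious server that never stops sending."""

    def read(self, n):
        return b"A" * n


if __name__ == "__main__":
    print(read_bounded(io.BytesIO(b"<root/>")))
    try:
        read_bounded(EndlessStream())
    except DescriptionTooLarge as exc:
        print("aborted:", exc)
```

In real use the stream could be the response object returned by `urllib.request.urlopen(url, timeout=...)`, which exposes the same `read(n)` method. The caller would close the connection when the exception is raised.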
Bug #3 - Assuming there is no evil in the world
The third bug is a bit more subtle. UPnP by its nature is the potential basis for a DDOS attack. This is unavoidable in a distributed, un-administered discovery service. The attack is to send out a multicast announcement saying "I'm a new service, find out about me" and include a URL that points at some victim's server.
Every machine within range of the multicast will make a GET request to the URL. This isn't so bad in itself since each machine will only send out one GET. What's bad is that the black hat could just keep pumping out announcements, all pointing to the same victim's machine. This acts as a request multiplier, since every announcement the bad guy sends out will result in N GET requests, where N is the number of listening machines.
There is no un-administered way to stop this kind of attack, but it can be slowed down to the point of not being very useful to black hats by doing simple analysis of announcement behavior. UPnP, when run only in the administrative scope, was designed to handle 20 to 50 discoverable devices. Think about a typical home or small office (the target markets for V1 of UPnP). Even with 802.11b and intelligent refrigerators the average office isn't going to have more than, say, 50 devices maximum. If the client finds itself having to deal, in a short period of time, with more than 50 unique devices or with more than a few announcements per second, then something has clearly gone wrong and the UPnP service should temporarily de-activate itself.
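Here is a minimal sketch of that kind of check. It is not the algorithm that was written for the spec, only an illustration of the two thresholds just described. The 10 second window, the 5 per second rate, the 5 minute cool-down and the class name `AnnouncementGuard` are all assumptions.

```python
from collections import deque


class AnnouncementGuard:
    """Tracks recent announcements and temporarily disables discovery on a flood."""

    def __init__(self, max_devices=50, max_rate=5.0, window=10.0, cooldown=300.0):
        self.max_devices = max_devices  # unique devices tolerated per window
        self.max_rate = max_rate        # announcements per second tolerated (assumed)
        self.window = window            # sliding window length in seconds (assumed)
        self.cooldown = cooldown        # shut-off duration in seconds (assumed)
        self._recent = deque()          # (timestamp, device id) pairs inside the window
        self._disabled_until = 0.0

    def accept(self, usn: str, now: float) -> bool:
        """Return True if the announcement should be processed, False if it is dropped."""
        if now < self._disabled_until:
            return False
        self._recent.append((now, usn))
        # Forget announcements that have fallen out of the sliding window.
        while self._recent and self._recent[0][0] <= now - self.window:
            self._recent.popleft()
        unique_devices = {device for _, device in self._recent}
        rate = len(self._recent) / self.window
        if len(unique_devices) > self.max_devices or rate > self.max_rate:
            # Something has clearly gone wrong: shut off for a while.
            self._recent.clear()
            self._disabled_until = now + self.cooldown
            return False
        return True


if __name__ == "__main__":
    guard = AnnouncementGuard()
    flood = [guard.accept(f"uuid:fake-{i}", now=100.0 + i * 0.01) for i in range(100)]
    print("accepted before shut-off:", sum(flood))
```

With these numbers an attacker gets at most about 50 GETs per client out of every cool-down period, instead of one GET per client for every announcement sent.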
Section 6.3.1 of v1-03 discusses just such an auto-shut off algorithm. We had developed a specification for the algorithm which would have explicitly defined 'best practices' for dealing with this type of situation but it was pulled from the specification at the last minute because the development team did not feel that the threat was sufficient to justify the effort of implementing the algorithm. The architecture team fought hard to try and prevent the algorithm's removal and to require its implementation but in the end the compromise decided by senior management was to document the issue but not solve it.
Appendix - Q&A
Q: The www.eeye.com report mentions chargen, why is that relevant?
A: Chargen is a service that, when connected to via TCP, will generate random ASCII characters. It was originally used to test things like printers. Normally pointing an HTTP client at chargen will just cause the client to terminate the connection. The reason is that HTTP responses have a specific format that chargen (statistically speaking anyway) isn't going to properly produce. Therefore the HTTP client will realize it's dealing with a loony HTTP server and cut the connection. However, Microsoft's UPnP service uses WinInet (I was PM for it for several years) to perform the HTTP GET request. WinInet was designed to be IE's HTTP client stack and is quite old. Back in the bad old days there were these loony HTTP servers that either had no HTTP response headers or had malformed HTTP headers. To deal with such situations WinInet really goes out of its way to be flexible in what it accepts. So pointing IE to a chargen port will cause IE to display the chargen output on the screen. This isn't a big deal because the user can always press the stop button. This also didn't have to be a big deal for UPnP. WinInet uses a read file metaphor on a stream interface, so the calling application is required to provide buffers to pull in data as it becomes available. If the total amount of retrieved data ever gets too large the calling application can tell WinInet to kill the connection. UPnP apparently never checks how much data it has already read in (device descriptions, typically, should be a couple of kilobytes long) or whether what it is reading in is a UPnP device description, so it just keeps sucking in buffers until it runs out of memory.
Q: The Microsoft report uses echo to attack the UPnP service, how does that work?
A: The echo port does what it sounds like it does: it returns whatever bytes are sent to it. I suspect that maybe Microsoft meant chargen rather than echo. If WinInet is pointed to echo, WinInet will return the bytes that are echoed back and then wait. WinInet is waiting for the server to close the connection, indicating the end-of-data. This will eventually happen when the TCP connection times out, but I can't see how this would do much more than just hang the UPnP service for a bit.
Q: Both reports say that UPnP supports broadcast addresses, what's the story with that?
A: UPnP is only allowed to use the IANA reserved administratively scoped multicast address for multicast announcements. It should not be listening/sending multicasts anywhere else.
Q: How was UPnP going to scale beyond 20 to 50 devices?
A: UPnP faced two distinct growth problems. One was related to existing corporate networks whose device discovery problems were far beyond what UPnP v1 could handle. The other was related to SOHOs where, as Bluetooth and 802.11b chipsets are shoved into everything, SOHOs will quickly find themselves with far more than 20 to 50 devices but still needing a completely un-administered network management solution. The solution to both problems was a UPnP directory server. This server, which is repeatedly referred to in SSDP, was intended to be a central collection point. Devices would register themselves with the directory and clients would tell the directory what kind of devices they were interested in. The directory server would then act as a router that would match devices and clients. We had gone so far as to develop a leadership election algorithm so that a new directory could automatically take over if another directory failed. This was useful in SOHO scenarios where someone may have a UPnP directory server on their Windows XP v10 (or whatever it will be called) machine but have no idea how to configure it. We had even worked out how to make the directory properly bridge between directory-aware and directory-unaware devices/clients and had started to deal with how to have distributed directories. But after I left the team, work in this area apparently stopped.
Q: Why couldn't the description URL always point to the device?
A: Having the description URL always point to the device would perhaps help to limit the DDOS attack, because the client often knows the local network mask and can tell whether the device is on the local network. This means that the client's GET request would first be fired at the device. Even if the device redirected the request somewhere else, this at least means that the device has to handle the load first, which by definition limits how bad the load can be. The problem with this strategy is that it prevents one from having devices in non-trivial networks. Imagine I have a group of printers in a fairly complex network with multiple local network masks. I have properly set up the routing of my administrative multicast scope so that the devices' multicasts will reach the clients, but the devices' actual locations are outside of the local scope. In that case there would be no way for the client to use the device, because the device's description would be on a machine that wasn't in the local network mask. Like most security related problems this was a trade-off. I still believe that if the auto-shut off algorithm is used, the DDOS threat is sufficiently minimized to be worth taking the risk in order to enable access to devices outside of the local network mask.
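For reference, the local network test discussed in this answer could look something like the sketch below. It assumes the client knows its own network in CIDR form and that the description URL carries a literal IP address. The function name `is_on_local_network` and the example addresses are hypothetical.

```python
import ipaddress
from urllib.parse import urlsplit


def is_on_local_network(url: str, local_network: str) -> bool:
    """Return True if the URL's host is a literal IP address inside `local_network`."""
    host = urlsplit(url).hostname
    if host is None:
        return False
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        # Host is a DNS name; this sketch does not resolve names and treats them as non-local.
        return False
    return address in ipaddress.ip_network(local_network)


if __name__ == "__main__":
    # A printer on the same subnet as the client.
    print(is_on_local_network("http://192.168.1.20:8080/description.xml", "192.168.1.0/24"))
    # A description URL pointing at some outside victim.
    print(is_on_local_network("http://203.0.113.9/index.html", "192.168.1.0/24"))
```

A client that rejected every URL failing this test would block the attack, but it would also reject the legitimate printers in the multi-subnet example, which is exactly the trade-off described above.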