Hearing on the radio that a major Windows sub-system I had helped to create was the cause of one of the biggest holes in Windows found up until then certainly got my attention. So I decided to investigate.
I first heard about it on NPR. Windows XP had a major security flaw [1,2]; even Microsoft was saying that this one was serious. That got my attention, as Microsoft generally denied the importance of security flaws, claiming that the flaw 'isn't in the wild' or 'doesn't affect real users' or 'has never been used', etc. What really focused my attention was when I found out that the flaw was in a system called Universal Plug and Play. I was the network architect for UPnP and was the lead author for SSDP, the algorithm at the center of the security hole. You can imagine the shape my stomach was in at the thought that I might be responsible for a security hole so egregious that even Microsoft agreed it was a problem.
So I decided to investigate.
The three holes that have been found so far are:
- A buffer overflow that allows a remote machine to take over the UPnP machine.
- Performing an HTTP GET without checking how much information has been downloaded, which allows an attacker to cause the machine to run out of memory.
- Not checking for excessive network announcements, which leaves UPnP machines open to being used for DDOS attacks.
As a member of the architecture team my job was to write specifications; the actual code was developed by a separate program management/development/test team. So it's tempting to just write the first two holes off as typical bad Microsoft programming practices. But the uncomfortable reality is that I was fully aware that the UPnP team's programming/testing practices left something to be desired. For example, at one point a snippet of code from the checked-in source tree, real code that was supposed to ship in the final product, was sent around and a contest was held to see who could figure out what it was supposed to do. I say supposed to do because the code didn't actually work. There were two contests, one for developers and one for PMs. I won the PM contest. The page of code was an AtoI function. Once you fixed the endless loop it turned out to require O(N^2) iterations, where N was the number of digits in the original ASCII number.
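For contrast, here is a minimal sketch of what an AtoI routine normally looks like: one loop iteration per digit, not O(N^2). It is not the code from the contest, which I no longer have; the function name and error handling are my own choices for illustration.

```python
def ascii_to_int(text: str) -> int:
    """Convert a string of ASCII decimal digits to an int, one pass over the digits."""
    if not text:
        raise ValueError("empty string")

    # Handle an optional leading sign.
    sign = 1
    digits = text
    if text[0] in "+-":
        sign = -1 if text[0] == "-" else 1
        digits = text[1:]
    if not digits:
        raise ValueError("no digits")

    value = 0
    for ch in digits:
        if not "0" <= ch <= "9":
            raise ValueError(f"not a decimal digit: {ch!r}")
        # Shift the running total one decimal place and add the new digit.
        value = value * 10 + (ord(ch) - ord("0"))
    return sign * value
```

Each character is examined exactly once, so the loop count grows with the number of digits and nothing more.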
We had one or two testers, depending on what other projects were around, and only one of the testers understood what was going on, and she could only cover a tiny bit of the code base. When I worked for IE the rule of thumb for network code was to have between 2 and 3 testers per developer, although we were lucky to have a 1 to 1 ratio. As a side note, the testers we did have for networking in IE were absolutely rock solid and easily the equivalent of 2 bodies apiece. UPnP had something like 2 testers for 5 or so developers.
Yes, I talked about the problem with the group's management. Yes, I talked about the problem with several of the testers and developers on the team. Some cared, most didn't. Eventually those of us who had any pride in our work just got up and left. Without support from management there really wasn't much else to do. The feeling of apathy and doom was pretty consistent throughout the project.
The third hole was a known problem. I had even written up a fairly nice way to deal with it. But I was shot down. The team felt that the threat was so small it wasn't worth dealing with. Microsoft seems to view security as primarily a PR issue and the third bug just didn't seem to be much of a PR threat.
How UPnP works
To understand the flaws it helps to understand the Simple Service Discovery Protocol (v1-03) which is UPnP's discovery mechanism. SSDP v1 was designed to support the SOHO (small office/home office) market. It was designed to allow a client (e.g. Windows ME/XP) to discover UPnP compatible devices and automatically configure them.
Scenario 1 – The device announces itself
When installing a network printer today the user has to find the printer's name, go to each client, type in that name and configure the printer. With UPnP the printer, as soon as it came on-line, would announce itself (NOTIFY) on an administratively scoped multicast channel and say "I'm here". The "I'm here" announcement (NOTIFY) contains a URL that points to a description of the device. This is a piece of XML that the UPnP client can read and learn things like "This is an XYZ brand printer."
Scenario 2 – The Windows box finds available devices
Imagine a user buys a brand new Windows XP box and brings it into their home. When the XP box wakes up it can send out an administratively scoped multicast request (SEARCH) and ask "What services are on this network?" Then the printer can send a response (NOTIFY) directly to the XP box saying "Hi, I'm here, here is the URL for my description." The XP box can then retrieve the description.
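In both scenarios the client ends up holding an announcement and pulling the description URL out of it. A minimal sketch of that step, assuming the announcement is HTTP-formatted text carried in a single datagram: the sample message is simplified and illustrative (the address and URL are made up), and `parse_announcement` is a hypothetical helper, not a function from any UPnP implementation.

```python
def parse_announcement(datagram: bytes) -> dict:
    """Extract the method and the Location URL from an HTTP-style announcement."""
    text = datagram.decode("ascii", errors="replace")
    lines = text.split("\r\n")

    # The first line looks like "NOTIFY * HTTP/1.1"; the method is the first token.
    method = lines[0].split(" ", 1)[0]

    headers = {}
    for line in lines[1:]:
        if not line:
            break  # a blank line ends the header block
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()

    return {"method": method, "location": headers.get("location")}


# Illustrative announcement; real messages carry more headers.
SAMPLE = (
    b"NOTIFY * HTTP/1.1\r\n"
    b"Host: 239.255.255.250:1900\r\n"
    b"Location: http://192.168.1.20/description.xml\r\n"
    b"\r\n"
)

print(parse_announcement(SAMPLE))
```

Note that nothing in the message proves the Location URL actually belongs to the sender, which matters for the third bug below.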
With the above background, we can now see what went wrong in Microsoft's implementation of UPnP. Please keep in mind that I don't actually have a UPnP enabled box to play with (I use W2K) so I'm making educated guesses.
Bug #1 – The Buffer Overflow
Sending a stream of NOTIFY messages at the right frequency with the right content will cause a buffer overflow. Even worse, the overflow behavior is apparently fairly predictable, so a knowledgeable attacker can use the overflow to take control of the machine. For what it's worth, Microsoft software development practices at the time specifically required looking for buffer overflows, and tools were available to check for them, but I have it on fairly good authority that no serious attempts were ever made to detect overflow conditions.
Bug #2 – The Unchecked Buffer
When the client receives a NOTIFY it takes the URL in the location header, which points to the device's description, and executes an HTTP GET. My guess is that what happens next is that UPnP tries to download the description and, once the download is complete, tries to parse it. The problem is – what if the server is malicious and sends an infinite length response? In that case UPnP will suck up the data until it finally runs out of memory, which is exactly what the www.eeye.com attack does.
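The fix is to count bytes as they arrive and give up once a sane limit is passed. A minimal sketch, assuming the response is available as a file-like object with a `read(n)` method; the 64 KB cap, the `read_bounded` name and the `DescriptionTooLarge` exception are my own illustrative choices, not anything from the UPnP code.

```python
import io

# Assumed cap for illustration; real device descriptions are far smaller.
MAX_DESCRIPTION_BYTES = 64 * 1024


class DescriptionTooLarge(Exception):
    """Raised when a response body exceeds the allowed size."""


def read_bounded(stream, limit=MAX_DESCRIPTION_BYTES, chunk_size=4096):
    """Read a response body chunk by chunk, aborting once it passes `limit` bytes."""
    chunks = []
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break  # the server closed the stream: normal end of data
        total += len(chunk)
        if total > limit:
            # Stop reading immediately instead of buffering an endless response.
            raise DescriptionTooLarge(f"response exceeded {limit} bytes")
        chunks.append(chunk)
    return b"".join(chunks)


# A well-behaved description is returned unchanged.
print(read_bounded(io.BytesIO(b"<root>printer</root>")))
```

With this in place a malicious server can cost the client at most `limit` bytes of memory per request, however long it keeps sending.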
Bug #3 – Assuming there is no evil in the world
The third bug is a bit more subtle. UPnP by its nature is the potential basis for a DDOS attack. This is unavoidable in a distributed, un-administered discovery service. The attack is to send out a multicast announcement saying "I'm a new service, find out about me" then include a URL that points at some victim's server.
Every machine within range of the multicast will make a GET request to the URL. This isn't so bad in itself since each machine will only send out one GET. What's bad is that the black hat could just keep pumping out announcements, all pointing to the same victim's machine. This acts as a request multiplier since every announcement the bad guy sends out will result in N GET requests, where N is the number of machines that hear the multicast.
There is no un-administered way to stop this kind of attack but it can be slowed down to the point of not being very useful to black hats by doing simple analysis of announcement behavior. UPnP when run only in the administrative scope was designed to handle 20 to 50 discoverable devices. Think about a typical home or small office (the target markets for V1 of UPnP). Even with 802.11b and intelligent refrigerators the average office isn't going to have more than say 50 devices maximum. If the client finds itself having to deal, in a short period of time, with more than 50 unique devices or with more than a few announcements per second then something has clearly gone wrong and the UPnP service should temporarily de-activate itself. Section 6.3.1 of v1-03 discusses just such an auto-shut off algorithm. We had developed a specification for the algorithm which would have explicitly defined 'best practices' for dealing with this type of situation but it was pulled from the specification at the last minute because the development team, over my objections, did not feel that the threat was sufficient to justify the effort of implementing the algorithm.
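To give a feel for how little code such a check needs, here is a minimal sketch of the idea. It is not the algorithm that was cut from the specification; the class name, the sliding window, the cooldown and every threshold other than the 50 device figure are assumptions made for illustration.

```python
from collections import deque


class AnnouncementGuard:
    """Temporarily disable discovery when announcements look abnormal."""

    def __init__(self, max_devices=50, max_per_second=5,
                 window_seconds=10.0, cooldown_seconds=60.0):
        self.max_devices = max_devices
        self.max_per_second = max_per_second
        self.window_seconds = window_seconds
        self.cooldown_seconds = cooldown_seconds
        self._recent = deque()        # (timestamp, device_id) pairs in the window
        self._disabled_until = 0.0

    def accept(self, device_id, now):
        """Return True if the announcement should be acted on, False to ignore it."""
        if now < self._disabled_until:
            return False  # still shut off after an earlier trigger

        self._recent.append((now, device_id))
        # Drop entries that have fallen out of the sliding window.
        cutoff = now - self.window_seconds
        while self._recent and self._recent[0][0] < cutoff:
            self._recent.popleft()

        unique_devices = {dev for _, dev in self._recent}
        too_many_devices = len(unique_devices) > self.max_devices
        too_fast = len(self._recent) > self.max_per_second * self.window_seconds

        if too_many_devices or too_fast:
            # Something has clearly gone wrong: stop fetching descriptions for a while.
            self._disabled_until = now + self.cooldown_seconds
            self._recent.clear()
            return False
        return True
```

A client would call `accept` before issuing the GET for a description and skip the request whenever it returns False. That caps the number of GETs any one client can be tricked into sending per cooldown period, which is what takes the profit out of the request multiplier.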
Losing the auto-shut off algorithm really hurt. I fought hard to keep it but eventually the lead developer just said straight out that he didn't care what I thought, he wasn't going to implement it. He went to the head of the architecture group and complained that I wasn't a team player. I was given clear marching orders – cut the section. All I could get them to agree to was to leave in the section describing the problem but not put in the solution. I was hoping this would eventually lead to us introducing the auto-shut off algorithm but I left the team soon after and no one seems to have followed up.
Appendix – Q&A
Q: Why is SSDP still a draft?
A: When I left the UPnP team I had progressed the SSDP draft to the point where the development team felt it was sufficient to use for V1. I had my own views on the subject and this among many other problems with the team caused me to move on. Apparently no one tried to finish the specification after I left even though the last version of the draft is over two years old.
Q: The www.eeye.com report mentions chargen, why is that relevant?
A: Chargen is a service that when connected to via TCP will generate random ASCII characters. It was originally used to test things like printers. Normally pointing an HTTP client at chargen will just cause the client to terminate the connection. The reason is that HTTP responses have a specific format that chargen (statistically speaking anyway) isn't going to properly produce. Therefore the HTTP client will realize it's dealing with a loony HTTP server and cut the connection. However Microsoft's UPnP service uses WinInet (I was PM for it for several years) to perform the HTTP GET request. WinInet was designed to be IE's HTTP client stack and is quite old. Back in the bad old days there were these loony HTTP servers that either had no HTTP response headers or had malformed HTTP headers. To deal with such situations WinInet really goes out of its way to be flexible in what it accepts. So pointing IE to a chargen port will cause IE to display the chargen output on the screen. This isn't a big deal because the user can always press the stop button. This also didn't have to be a big deal for UPnP. WinInet uses a read file metaphor on a stream interface so the calling application is required to provide buffers to pull in data as it becomes available. If the total amount of retrieved data ever gets too large the calling application can tell WinInet to kill the connection. UPnP apparently never checks to see how much data it has already read in (device descriptions, typically, should be a couple of kilobytes long) or whether what it is reading in is a UPnP device description, so it just keeps sucking in buffers until it runs out of memory.
Q: The Microsoft report uses echo to attack the UPnP service, how does that work?
A: The echo port does what it sounds like it does: it returns whatever bytes are sent to it. I suspect that maybe Microsoft meant chargen rather than echo. If WinInet is pointed to echo, WinInet will return the bytes that are echoed back and then wait. WinInet is waiting for the server to close the connection, indicating the end-of-data. This will eventually happen when the TCP connection times out but I can't see how this would do much more than just hang the UPnP service for a bit.
Q: Both reports say that UPnP supports broadcast addresses, what's the story with that?
A: UPnP is only allowed to use the IANA reserved administratively scoped multicast address for multicast announcements. It should not be listening/sending multicasts anywhere else.
Q: How was UPnP going to scale beyond 20 to 50 devices?
A: UPnP faced two distinct growth problems. One was related to existing corporate networks whose device discovery problems were far beyond what UPnP v1 could handle. The other growth problem was related to SOHOs where, as Bluetooth and 802.11b chip sets are shoved into everything, SOHOs will quickly find themselves with far more than 20 to 50 devices but still needing a completely un-administered network management solution. The solution to both problems was a UPnP directory server. This server, which is repeatedly referred to in SSDP, was intended to be a central collection point. Devices would register themselves with the directory and clients would tell the directory what kind of devices they were interested in. The directory server would then act as a router that would match devices and clients. We had gone so far as to develop a leadership election algorithm so that a new directory could automatically take over if another directory failed. This was useful in SOHO scenarios where someone may have a UPnP directory server on their Windows XP v10 (or whatever it will be called) machine but have no idea how to configure it. We even had worked out how to make the directory properly bridge between directory aware and directory un-aware devices/clients and had started to deal with how to have distributed directories. But after I left the team apparently work in this area stopped.
Q: Why couldn't the description URL always point to the device?
A: Having the description URL always point to the device would perhaps help to limit the DDOS attack because many times the client knows the local network mask and can tell if the device is on the local network. This means that the client's GET request would first be fired at the device. Even if the device redirected the request somewhere else this at least meant that the device had to handle the load, which by definition limits how bad the load can be. The problem with this strategy is that it prevents one from having devices in non-trivial networks. Imagine I have a group of printers in a fairly complex network with multiple local network masks. I have properly set up the routing of my administrative multicast scope so that the devices' multicasts will reach the clients but the devices' actual locations are outside of the client's local network mask. In that case there would be no way for the client to use the device because the device's description would be on a machine that wasn't in the local network mask. Like most security related problems this was a trade off. I still believe that if the auto-shut off algorithm is used, the DDOS threat is sufficiently minimized to be worth taking the risk in order to enable access to devices outside of the local network mask.
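The local network mask check discussed above is simple to express. A minimal sketch using Python's standard `ipaddress` module; the function name and the example addresses are made up for illustration, and the sketch deliberately treats host names as non-local since it does no DNS resolution.

```python
import ipaddress
from urllib.parse import urlsplit


def is_on_local_network(description_url: str, local_interface: str) -> bool:
    """Return True if the URL's host is a literal IP address inside the client's subnet.

    `local_interface` is the client's own address with its mask, e.g. "192.168.1.10/24".
    """
    network = ipaddress.ip_interface(local_interface).network
    host = urlsplit(description_url).hostname
    if host is None:
        return False
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        # A DNS name rather than a literal address; this sketch does not resolve names.
        return False
    return address in network


print(is_on_local_network("http://192.168.1.20/description.xml", "192.168.1.10/24"))
print(is_on_local_network("http://10.0.0.5/description.xml", "192.168.1.10/24"))
```

A client that refused every URL failing this test would be safe from being pointed at a remote victim, at the cost of the multi-subnet printer scenario described above.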