
Anatomy of a Commercial-Grade Immune System

Steve R. White, Morton Swimmer, Edward J. Pring, William C. Arnold, David M. Chess, John F. Morar
IBM Thomas J. Watson Research Center, P.O. Box 704, Yorktown Heights, NY 10598

Abstract

We have built the first commercial-grade immune system that can find, analyze and cure previously unknown viruses faster than the viruses themselves can spread. The system solves several important problems. A single console allows a customer administrator to decide whether viruses are submitted for analysis automatically, or whether explicit approval is required, and permits new virus definitions to be distributed automatically in response to a new virus, or held for the administrator's approval. A novel active network architecture permits the system to handle a vast number of customer submissions quickly, so the system can handle floods due to an epidemic of a fast-spreading virus, or due to submission of many uninfected files. The analysis center can analyze most viruses automatically, and with greater speed and precision than human analysts can. The analysis center runs the viruses in a virtual environment, so the process is safe and lets our programs analyze the behavior of the virus in real time. Viruses can be replicated in a number of operating system and application environments, including various national languages. Upconversion and downconversion of macro viruses are handled automatically. Both the active network and the analysis center are scaleable, so the system can easily accommodate ever-increasing loads. End-to-end security of the system allows the safe submission of virus samples and ensures authentication of new virus definitions. During the presentation, we will give a live demonstration of a pilot that we have run with customers, and review our experience with the pilot system.

Introduction

For the most part, virus incidents in the past occurred in a fairly regular pattern. Viruses spread slowly, most customers updated anti-virus software regularly, and anti-virus companies could usually keep ahead of the problem by analyzing the rafts of viruses circulated from both helpful and marginal sources. It was unusual for customers to get a new virus that had never been seen before, and for which a cure was not already available. New viruses were found at the rate of a few per day on average. Today new viruses are found at the rate of 8-10 per day, which is still well within the capabilities of the human virus analyzers at most anti-virus vendors. In the new world of Internet-borne viruses, however, viruses can become very widespread shortly after their first infection and in some cases before a cure is available. Virus incidents in the future will resemble the Internet Worm and the Melissa virus more than they will the now-ancient Stoned virus. Each new virus will have the potential to rage out of control unless a cure is made available quickly and distributed widely. Worse, there is nothing to prevent viruses from being written at a much faster pace than they are today. It is, in fact, easy to imagine viruses written at a fast enough pace that even a dedicated effort to hire and train new human virus analyzers could not keep up. Taken together, these two trends paint a disturbing picture: more new viruses than humans can handle, spreading more quickly than humans can respond. Whatever we do to solve this problem, it will look quite different from the current solution.


To solve this problem, we have built the first commercial-grade immune system that can find, analyze and cure previously unknown viruses faster than the viruses themselves can spread [1]. The system solves several important problems. While rapidity of response requires the entire system to be capable of automated operation, customer administrators can control which parts of their system are automated and which parts require manual intervention. A novel active network architecture permits the system to handle a vast number of customer submissions quickly, so the system can handle floods due to an epidemic of a fast-spreading virus. A virus analysis center can analyze most viruses automatically, and with greater speed and precision than human analysts can. Both the active network and the analysis center are scaleable, so the system can easily accommodate ever-increasing loads. End-to-end security of the system allows the safe submission of virus samples and ensures authentication of new virus definitions. We are piloting this immune system with customers, in conjunction with Symantec Corp.1

The remainder of this paper is structured as follows. We review some historical incidents of very rapidly spreading viruses, and viruses that caused widespread concern. We use these examples to understand what a system must do to solve the problem of epidemics of fast-spreading viruses. We discuss the three types of loads that will be placed on such a system – average loads, peak loads and overloads – and the requirements that such a system must satisfy. We then describe in detail our implementation of an immune system, focusing on the novel elements of the active network and virus analysis center, which work together to keep the system in constant operation and capable of handling very large virus epidemics. We close with a look at the capabilities of the pilot immune system, and a summary of current work in progress.

Epidemics and Floods

Any system that solves virus problems during an emergency must face the fact that the world is a very big place. There are hundreds of millions of PCs installed in the world at the time of this writing. Large anti-virus companies serve (easily) tens of millions of PCs. If just a tiny number of these decide to submit a possibly new virus for analysis on any given day, the anti-virus company could be faced with tens of thousands of new submissions. You can be sure that all of those customers are concerned enough to want their virus dealt with right away. Similarly, even if a recent virus has already been analyzed and a cure made available, concern that it is becoming widespread rapidly can cause huge numbers of people to request an update to their virus definitions. A virus epidemic, in particular, presents both problems simultaneously. A new, very fast-spreading virus could easily infect over a hundred thousand machines in one day.2 If most of those machines forward a copy for analysis, very long queues will develop at the anti-virus vendor. As recent virus incidents have taught us, for every person that gets such a virus, hundreds more will request updated definitions to ensure that they are protected from it. That’s a lot of downloads in a single day! We now examine several incidents that illustrate these problems, describe the nature of the problem in more detail, and discuss the various causes of these kinds of massive loads on a system that handles virus emergencies.

1 The pilot is built on top of an upcoming version of Norton AntiVirus, which is part of the Symantec Digital Immune System™. The Symantec Digital Immune System™ provides central management of Symantec applications within a corporation. As used in this paper, the isolated term “immune system” refers specifically to anti-virus technology developed at IBM, in conjunction with Symantec, to deal with the problem of viral epidemics. Its description in this paper does not necessarily imply any related product plans by either Symantec or IBM.

2 While we have not yet seen an infection this large in just one day, it is straightforward to describe how it could happen with today’s technology. For reasons that we hope are obvious, we decline to do so here. The Melissa virus was possible much earlier, and its rapidity of spread seems obvious in hindsight. We expect viruses with even faster spread rates in the future.


The Internet Worm

In November of 1988, a graduate student at Cornell University unleashed what came to be called the Internet Worm on the then-tiny Internet. It spread among two flavors of Unix systems on the Internet, infecting them directly without needing any human intervention. As a result, it spread extremely quickly. Within a few hours of its release, it was all over the world. Within a day, it had infected hundreds, or perhaps thousands, of Unix systems [2, 3]. Only fast and intensive efforts by teams of dedicated experts prevented it from becoming a permanent pest on the Internet. The Internet Worm was the first example of a virus3 that took advantage of the Internet explicitly to spread, and it spread (arguably) more quickly than any other virus to date. Today, the Internet is approximately a thousand times larger than it was in 1988 [4], and the world depends on it much more critically. A similar virus today would be a much larger problem.

The Michelangelo Virus

The Michelangelo virus was a run-of-the-mill boot virus that was discovered late in 1991. If an infected machine were booted on March 6 of any year, the Michelangelo virus would effectively destroy all data on the system’s hard drives [5]. A curious hysteria gripped much of the Western world in the weeks prior to March 6, 1992. Though there was no evidence that the virus was widespread and, indeed, good evidence that it was not, hype and publicity catapulted the threat to the front page of newspapers and the top story of television news programs. On March 5, 1992, the day before the dreaded Michelangelo virus was to strike, hordes of people crowded into software stores, denuding shelves of anything that resembled anti-virus software, then rushed home to their computers, fearing that doom was about to befall them.

Of course, the sky did not fall the next day. Sure, some computers were infected with the Michelangelo virus, and some of them had their data destroyed, but not very many. Our own estimates at the time were that more disks died of routine hardware failure that day than were affected by the Michelangelo virus. The Michelangelo virus never did cause the gigantic epidemic that some feared, but that’s not the point. The point is that people thought it would, so they bought, updated and used anti-virus software in much greater numbers that week than in any previous week. Their demand for updates was the highest it had ever been.

One way to see the flood of interest unleashed by Michelangelo Madness is in the following figure, which shows the number of reports of various viruses to central incident monitoring desks in large companies, around the first week of March, 1992. Each point represents two weeks of reported incidents for every 1000 PCs in the reporting population [6]. As you can see, a few Michelangelo virus infections were reported, but there were many more reports at the time of the Stoned virus (which was then the most prevalent virus) and far more reports of all other viruses. Notice two important things about this graph. The first is that the number of reported incidents is much larger in the two weeks around March 6 than at any other time, indicating that people scanned their systems and found viruses that were already there and probably had been for some time. Publicity caused people to react. The second thing to notice is that about five times as many viruses were reported in that period, suggesting that about five times as many people in this corporate population scanned their systems as usual.

3 For simplicity, we avoid the tedious linguistic debate that attempts to differentiate viruses and worms. From an end-user’s point of view, they both spread and cause very similar problems. We beg your indulgence.


[Figure 1 chart: reported virus incidents per 1000 PCs, in two-week intervals around March 6, 1992, plotted separately for all viruses, the Stoned virus, and the Michelangelo virus.]

Figure 1: Reported incidents of viruses around March 6, 1992, the date when the Michelangelo virus was set to strike. Because the popular hysteria caused so many people to scan their systems, reported incidents of all viruses were approximately five times higher than usual. This early example of a flood indicates that peak loads on a virus system can be much higher than average loads.

An even more telling statistic is that anti-virus vendors sold more copies of their products in the week preceding March 6 than in the entire rest of that year [7]. From this, we can estimate that the demand for anti-virus updates that week was at least 52 times higher than usual in the population as a whole. If most of those sales were in the few days before March 6, which is likely given the buying hysteria reported in news media, demand for updates might have been over a hundred times higher in those few days.

The Melissa Virus

On the afternoon of March 26, 1999, support desks at major anti-virus vendors began to get calls about pieces of mail that were arriving in people's electronic mailboxes. Each piece of mail had a subject line of “Important Message From [the name of the sender]”, and contained an attached Microsoft Word document, along with the text “Here is that document you asked for ... don't show anyone else ;-)”. The attached document caused the Microsoft Word macro-warning dialog to appear when it was opened. When opened, the document contained a list of Web sites. Some anti-virus products warned of a possibly new virus in the document. When anyone receiving the mail contacted the person that had apparently sent it, the sender would deny having sent any such thing. Based on the number of calls that began to come in, and the questions that began to be asked on various Web sites and newsgroups, whatever was causing this was rapidly becoming widespread.

Once anti-virus experts obtained copies of the document, analysis was simple. The virus, eventually dubbed “Melissa” (or “WM97/Melissa.A”), was relatively simple, but also very significant. It infected
other documents just like other macro viruses, installing itself in the global template and infecting documents that the user subsequently worked on. It also had another, more important, infection route. The virus accessed Microsoft Outlook, if it was installed on the machine, and emailed a copy of itself, as an attachment with the subject and content as described above, to the first fifty people in every accessible Outlook address book [8, 9]. Since the mail would arrive from someone who had the recipient in his or her address book, it was often opened, and the attached document was often opened as well, despite the warnings from Microsoft Word. Some people, presumably, had turned those warnings off previously to avoid the inconvenience of an extra mouse click when opening documents containing macros.

March 26, 1999 was a Friday. News of the Melissa virus spread quickly, by word-of-mouth, by alerts from anti-virus and security companies, and by extensive press coverage [10, 11]. Updates to anti-virus programs to handle the new virus were quickly developed and made available, and many corporate security officers worked overtime to update their networks, and install stop-gap patches to forbid mail with certain subject lines, or upgrade mail-server virus scanners to include the new signature. Nonetheless, over the next few days, some enterprise mail systems had to be taken down in order to purge copies of the virus from the systems. Some mail systems crashed because of the excess load caused by the fast-spreading messages. Within some companies, the impact of Melissa was as serious as the impact of the Internet Worm on the Internet as a whole over a decade earlier.

Melissa also had some subtler and more insidious effects. While the initial distribution of the virus was in a document containing a list of Web sites, and was posted to the newsgroup alt.sex, Melissa was able to spread to other documents as well. When an infected document is opened on a system where the virus has not run before, it is that document, and whatever it contains, that is mailed out to fifty people on every mailing list. There have been cases where a sensitive company document became infected, and later mailed copies of itself out, to addresses including some that should not have had a copy of the document. Concerned that the Melissa virus might send their own confidential documents outside the organization, some companies shut down their email systems entirely until they could be sure that the virus was cleaned up. This often took days.

The Melissa virus could be contained by largely manual methods because of three facts. (1) It began to spread on a Friday (giving the world a weekend of breathing space to install countermeasures before businesses opened on Monday). (2) It was easy to detect and prevent, since the virus consists of fewer than 100 lines of code, and took no serious steps to hide itself. (3) It was a rare occurrence, so anti-virus workers and company security officers were willing to work the weekend to develop and deploy solutions. If a Melissa-like event were to occur every Monday, or even daily, the manual resources to combat it would be hard to find and sustain.

Nevertheless, Melissa caught anti-virus vendors unprepared. The demand for updated virus protection was so large that major anti-virus vendors could not keep up with the flood of requests. Their download sites, and in some cases their information Web sites, failed to respond, presumably due to insufficient network or server capacity.
Our own study at the time showed major anti-virus vendors unable to update many of their customers for a period of several days during the peak of the Melissa virus concern. While an outage of several days might be tolerated when there are no pressing virus problems, reliable availability of a cure is most urgent when there is a virus emergency.

The ExploreZip Worm

A little over two months after the Melissa virus made headlines, a worm called “ExploreZip” began to spread on the Internet, using methods similar in some ways to Melissa, but very different in others. ExploreZip is a 32-bit Windows application, not a set of Microsoft Word macros. However, it spreads itself in a similar way: when an infected program is run, the worm sends a copy of itself as an attachment to email. The worm looks through the victim's electronic in-basket for addresses to send itself to, and it disguises its mail as replies to the mail that it finds. So if I am infected, and you send me mail, you will get back what seems to be a reply from me, saying “I received your email and I shall send you a reply ASAP.
Till then, take a look at the attached zipped docs.” The worm itself is attached, disguised as a self-extracting ZIP archive. If you attempt to open the attachment, the worm displays a convincing-looking archive error, but also installs itself in your system, and begins sending more copies of itself, this time under your name.

Infected systems do more than send out copies of the worm as mail. They also actively look across the local network (the “Network Neighborhood”, in Windows terms) for further systems to infect, and for files to destroy. The worm is designed to destroy files with certain extensions: c, h, cpp, asm, doc, xls, and ppt. These represent both program code and documents, the files that are most likely to contain valuable information. Whenever the worm finds a system whose drives or directories have been shared, and are writable, on the local network, it will destroy files with these extensions on those drives and directories. If it finds what seems to be a Windows directory, it will copy itself into it and patch that WIN.INI file, in hopes of infecting any system using that copy of Windows [12, 13, 14].

ExploreZip spread quickly and destructively [15]. Despite the experience that the world had had just two months before with Melissa, the worm was able to spread to many companies and individuals, and destroy hundreds of thousands of files.4 We have reports of companies where the worm was known to be active on the network, but because it took time to identify and locate the actual infected machine, important (and insufficiently backed-up) files were destroyed in the interim. ExploreZip, like Melissa, illustrates that it is important not only to detect and remove viruses from your organization, but to do it quickly, to minimize the amount of damage (in the form of leaked information, destroyed files, or anything else) that they can do. With viruses like ExploreZip, the threat grows much more substantial the longer they fester within an organization.

Types of Loads

With the exception of a few large incidents like those noted above, virus incidents in the past trickled in at a regular rate. In the vast majority of cases, new viruses were acquired by anti-virus vendors in “collections”, and were analyzed, and updated virus definitions for them were distributed, long before these viruses affected any customer. A new virus seen for the first time by a customer could be handled as an adjunct to the normal process of analyzing collections, albeit with higher priority. As a result, most anti-virus vendors have tuned their processes for dealing with new viruses to the average rate at which new viruses come in, and assume that there are a small number of incidents of any new virus. As we have seen, however, Internet viruses change these assumptions dramatically, and it becomes crucial to deal smoothly with much larger, faster-spreading incidents.

Average load

The average load on a virus protection system consists of the average rate at which samples are submitted for analysis, and the average rate of update requests by customers. If only a few viruses a day are created, and only a few of these show up in customer incidents, then the submission rate is very small and easily handled manually. If customers update primarily as a preventative measure, and do so regularly, the update rate may be fairly large but it is very predictable. This makes it easy to calculate the hardware and software capacity necessary to deal with this average load. Handling average loads is what a typical doctor does in scheduling time for patient visits, confidently predicting an average rate of unconnected maladies.

4 Sadly, we are being very conservative here. We have direct reports of well over a hundred thousand destroyed files from a very small number of surveyed sites. The ExploreZip worm is likely to have destroyed many millions of files worldwide.


Peak loads

As we have seen in the simple examples above, peak loads on a virus protection system can be more than an order of magnitude larger than average loads. In the past, most anti-virus vendors handled these as exceptions and regarded it as acceptable if they could not respond quickly, or at all. We have argued that we must handle peak loads much more smoothly and routinely than in the past. If the Red Cross could only handle the number of injuries seen on an average day, they wouldn’t be much good in an emergency.

Overloads

No matter how diligently we design a virus protection system, it will always be possible for it to be subjected to more requests for analysis, or more requests for updated virus definitions, than it can handle. There are two important questions.

1. Under what conditions can overloads occur? If the least increase in demand causes an overload, the system is useless in emergencies, and failures in these important situations may well erode confidence in the system even under normal conditions. Where possible, a good system will anticipate peak loads much higher than average loads, and perform well under most conceivable conditions.

2. If an overload does occur, what is the effect on the users? Do users merely experience a delay, until more capacity becomes available? Are they told that the system is unavailable and then required to submit their request again at some unspecified later time? Is their request thrown away entirely, without notification? Where possible, a good system will take care of users without further action on their part.

Causes of Peak Loads

Having decided that it is critical to do a good job of handling most conceivable peak loads, we turn our attention to their various causes.

Epidemics

The immune system is designed primarily to address widespread epidemics of fast-spreading viruses. Each time such an epidemic occurs, the system will experience a sudden and dramatic increase in load. While ordinarily the system will see slow-spreading new viruses sent in from only a few users over a matter of hours or days, during an epidemic caused by a fast-spreading virus, the system may see thousands or even tens of thousands of submissions in a single day. As the world becomes more connected, and other factors that we have identified in this paper come into play, these epidemics are likely to become significantly more common. But the epidemics that drove the initial development of the system are not the only possible cause of peak loads, and some of the other possible causes are in fact potentially more difficult to deal with.

Upload floods

A system that is designed to cope well with a flood of actual infected samples will not necessarily cope well with a sudden flood of samples that are not in fact infected at all. Analyzing a non-infected file to be sufficiently sure that it contains no virus can be more difficult and time-consuming than verifying that an infected sample contains a virus. If some factor in the world causes many users to become suspicious of a particular non-infected file or files in the same short time-span, the system may be flooded with
non-infected files. The humor columnist Dave Barry caused a vast upsurge of traffic on a university Web site some years ago, simply by mentioning the site in his widely read column. If some comedian or widely distributed joker were to say, for instance, that COMMAND.COM or EXPLORER.EXE contained a dangerous new virus, thousands of people might hear the rumor or misunderstand the joke, and submit uninfected copies of these files to the immune system. Similarly, if some other anti-virus program in wide use were to release a set of signatures that caused a false positive on some widely-distributed file, users would quickly flood the system with copies of that (uninfected) file, thinking that it probably contained a new virus. To avoid overloading the analysis center with time-consuming analyses of uninfected files, the system must provide a way to deal gracefully with this sort of load as well.

Download floods

In our discussion of the Michelangelo virus, we saw how press coverage of a particular virus caused a sudden large rise in the demand for anti-virus software. In the same way, we can expect that whenever some dangerous-sounding new virus is mentioned in the press, some percentage of users will decide, right then, to update their anti-virus software (which may have been allowed to get out of date before the coverage). Widely distributed hoax rumors describing fictional dangerous viruses can be expected to have the same effect. If the system is not structured to support peaks in the demand for downloads, users with an important need to get updates (due to an actual new virus) may be unable to access the system, due to overloading of the download links by users responding to hoaxes or hype.

Widespread false positives

False positives, in which anti-virus software claims that there is an infection when none is present, have been an ongoing problem in the anti-virus industry since its inception. Every organization that distributes virus definitions has had this problem, sometimes spectacularly so [16]. Widespread false positives – those which hit on very common files – could be real problems.

Before 1995, PC viruses were slow-moving file and boot viruses, which took six months to two years to become prevalent worldwide, if they ever did [17]. A virus definition that caused a widespread false positive was embarrassing but not fatal. The embarrassed organization would issue an update that (hopefully!) fixed the problem, sometimes accompanied by a notice to users of the problem. Organizations would update their virus definitions every few months. As a result, false positives that were discovered in previous months would have long since been fixed, and seldom affect most organizations.

Macro viruses, first seen in the wild in the PC world in 1995, spread much more quickly than the previous generation of file and boot viruses. These new viruses could become prevalent around the world in a matter of a few months. Organizations responded to this new threat by increasing the frequency with which they updated virus definitions to once a month, or even more often. As they did, anti-virus vendors had to decrease the time it took to respond to new viruses, and to respond to newly discovered false positives. The situation remained tenuously in balance.

The new generation of self-mailing viruses like Melissa and ExploreZip, and the faster viruses that will follow them, can become prevalent around the world in days, or even hours. A system which can respond to a new, rapidly spreading virus in days or hours could also, if nothing were done to prevent it, distribute virus definitions that cause a widespread false positive in days or hours. While this same system might be able to distribute a correction quickly, lots of people could have been affected in the meantime. Now, however, these falsely detected files could be sent up to the virus protection system to be analyzed, clogging the system itself and preventing it from working on legitimate viruses. This makes it more important than ever for false positives to be prevented in the first place.


Abuse

Peak loads on a virus protection system can be generated by abuse of it, even by legitimate and well-meaning customers of the system. If a user were to submit thousands of files to the system, the system could spend all of its time trying to analyze these files, and be unable to service legitimate customers. Clearly, these problems must be anticipated and solved by any useful system.

Requirements of a Commercial-Grade Solution

It is trivial to claim that a system solves this problem. It is even rather easy to build a system that only appears to solve it, or that does so badly. It is easy to make a toy system. It is more of a challenge to create a system that actually solves the problem of fast-spreading viruses, and does so reliably enough, and safely enough, that businesses will trust their critical operations to it. In this section, we discuss what a commercial-grade system must do.

Solve the Problem: Cure a Virus Faster than It Spreads

It may seem obvious, but a solution to the problem must, well, actually solve the problem: it must cure a new virus faster than the virus spreads. There are many useful things that could be done instead, and some of them even sound similar. You could make it easier for customers to get virus samples to a room full of virus analyzers. You could provide the virus analyzers with some tools to make their job easier. You could post virus definition updates on the Web as fast as your fingers can dance across the keyboard. However, unless the system can find, analyze and create a cure for a new virus, then deploy that cure faster than the virus can spread, it doesn’t solve the problem. Only a fast, end-to-end solution will work.

Detect New and Unknown Viruses

The first step in the solution is to detect new, previously unknown viruses at each client system. Fortunately, the anti-virus industry has developed a number of heuristics that do a reasonable job of this. The better your heuristic detection, the more effective you can be at combating new viruses. You can’t cure what you can’t find.

Handle Epidemics and Floods

By their very nature, fast-spreading viruses tend to infect lots of computers very quickly; they tend to cause epidemics. A system that updates virus definitions is nice, but if it is not available or does not respond quickly when there is an epidemic, it does not solve the problem. Similarly, a system that becomes unavailable when there are floods of various kinds misses the point of being there for customers in an emergency.

Speed Requires Automation

To respond quickly enough to fast-spreading viruses, the system must deploy a cure for a new virus within hours of its first discovery. (It may need to be even faster in the future.) In some cases, having to wait for a human to become available to examine a possibly infected machine, analyze a virus or test an update will mean the difference between nipping the virus in the bud and enduring a massive infection. To achieve a response that is consistently fast enough, the entire process of finding, analyzing, and curing a virus must be capable of automation. Customers can, of course, require manual intervention where it is consistent with
their business processes and their evaluation of the risks, but the option to automate the entire process will be necessary.

Scale Up with the Problem

The virus problem is always changing. Just this year, we saw, in Melissa, an entirely new type of virus that spread faster than any PC virus in history. In ExploreZip, we saw this rapid spread coupled with an extremely destructive payload. The virus problem will continue to change. It is entirely possible that viruses will be created at a much larger rate than they have been in the past. It is quite likely that they will spread even more quickly than they do today. As we have seen again and again, the problem can get suddenly worse without warning. A solution to the problem must be capable of scaling up when this happens, not months or years later. Both the architecture and the implementation must be capable of quickly scaling up to meet a much larger threat than we have seen to date.

Maintain Safety and Reliability

Clearly, a solution to the problem of epidemics must work reliably, especially during an epidemic. Perhaps not so obvious is that it needs to work almost flawlessly all the time. The reason has to do, again, with speed. A system that is fast enough will have to be automated. If customers cannot trust the system enough to enable its automation, they face the awful problem of trying to decide which is worse: risking a massive infection or risking the cure. If the system is reliable, and safe, and performs consistently all the time, customers will be able to trust it when it counts most.

Keep the Customer in Control

The customer must have sufficient flexibility to incorporate the system into his or her infrastructure consistent with the organization’s policies. It is not about technology; it is about protecting the customer.

Immune System Architectural Overview

In order to cure a new virus faster than it spreads, we have built an immune system for the world’s computers. Much like the biological immune system, it defends the “body” of computers against viruses that are seen once by any of them. It can find, analyze, and create a cure for a new, previously unknown virus, then make that cure available to all of the computers. It can do this completely automatically, and quite fast – most importantly, faster than the virus itself can spread. To see how this immune system functions, we now step through an example of detecting a virus at a client system, sending a sample of the virus to a local administrator, transporting it to a virus analysis center, analyzing it, and distributing the cure. In this example, all of the steps can be done automatically, with no humans at any of the computers involved.


Figure 2: Overview of the Immune System. A new, previously unknown virus is found in a client in one organization. A sample of the virus is transported through the organization’s administrator, to the immune system’s active network, where it travels through a hierarchy of gateways. If it cannot be handled directly by the gateways, it reaches the analysis center, where it is analyzed and a cure is prepared. The cure is distributed to the infected organization and made available to others who have not yet encountered the virus. The entire process can be done automatically.

Virus Detection

A possibly new virus is detected on a client system. This is done by an anti-virus product on that system, and can be done in a number of ways. Heuristics can detect a new, previously unknown virus either by its appearance, by simulating how it will behave when run, or by actually observing the behavior of the program or system [18, 19]. It is also possible that the anti-virus program has a signature that identifies the virus, but it cannot verify or disinfect the virus. This could happen either because it is a new virus that is similar to some existing virus, or because it is a known virus for which a signature was extracted but, perhaps because it was never seen in the wild before, no verification or disinfecting information was derived. (It is also possible that it is not a virus at all, but we will deal with that later in this example.) The client cannot determine if the file or other object is actually infected, but the heuristic or signature detection raised enough suspicion that further analysis is necessary.5 To that end, a sample of the suspicious object is extracted, packaged in a harmless form, and sent off to an anti-virus administrator system over the organization’s internal network.

5 It is mathematically impossible to do a perfect job of detecting all possible viruses – to correctly identify all possible viruses and correctly determine that uninfected objects contain no virus [20]. Nevertheless, in practice one can come extremely close. In an immune system like the one described here, it pays for the detection function to err on the side of too many detections – detecting as many viral things as possible as viral, and making a small but tolerable number of mistakes by detecting nonviral things as viral. The nonviral things are weeded out further along in the process.
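As a concrete illustration of the packaging step described above, the sketch below wraps a suspicious file in an inert, checksummed envelope before it is sent to the administrator system. The field names and the compress-and-encode representation are assumptions made for the example; the paper does not specify the actual client packaging format.

    import base64
    import hashlib
    import zlib
    from pathlib import Path

    def package_sample(path):
        # Illustrative only: the real packaging format used by the pilot is not described here.
        data = Path(path).read_bytes()
        return {
            "filename": Path(path).name,
            "md5": hashlib.md5(data).hexdigest(),   # checksum travels ahead of the body
            "length": len(data),
            # compressed and base64-encoded so the payload is never a directly runnable file
            "body": base64.b64encode(zlib.compress(data)).decode("ascii"),
        }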

Administrator System

The administrator system permits control and auditing of what leaves and enters the organization’s internal network via the immune system. An organization can have one or more administrator systems that collect captured samples and decide what action to take. The administrator system may have access to more recent virus definitions that handle the submitted sample. In this case, it can process the sample immediately by returning updated definitions to the immune system client that submitted the sample. If the administrator system cannot handle the sample by itself, it can forward the sample higher in the immune system hierarchy for analysis.

Before it is sent, the administrator might want to have potentially confidential information stripped from the sample. For instance, the administrator might want a potentially infected Microsoft Word document to have its text removed or replaced in order to avoid exposing possibly sensitive information outside the organization. This can be done automatically while leaving the operation of the virus intact for later analysis. Similarly, Microsoft Excel documents can be stripped by removing or replacing the contents of the spreadsheet cells, without affecting the macros.

Once samples are prepared for submission, there is a process that selects which samples should be forwarded for automated analysis. If an administrator system suddenly receives a thousand samples, it is likely that they are all infected with the same virus. Only a few representative samples will be submitted at first. Then, when the results of analyzing those samples are returned, the rest of the samples pending submission will be checked to see if they can be handled immediately. Samples that still can’t be handled locally (e.g. those infected with a different virus) are then queued for submission.

The administrator system also keeps track of the status of various samples – waiting to be submitted, submitted but not yet analyzed, analysis complete and updated virus definitions ready, etc. This makes it easy for the human administrator to understand the status of any active virus incidents in the organization. To ensure rapid response to a new virus, all of these functions can be carried out automatically. The administrator can also configure the system to require human intervention and choice in deciding if files need to be stripped, in prioritizing samples for submission or in submitting the samples themselves.
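The selection and status-tracking logic described above might look roughly like the following sketch. It groups pending samples by checksum and submits only a few representatives per group; in the real system, held-back samples are rescanned with the returned definitions rather than matched by checksum alone, and the class and field names here are hypothetical.

    from collections import defaultdict

    # Hypothetical status values; the paper names these states only informally.
    WAITING, SUBMITTED, RESOLVED = "waiting", "submitted", "resolved"

    class AdminQueue:
        def __init__(self, max_representatives=3):
            self.by_checksum = defaultdict(list)   # checksum -> sample ids with that checksum
            self.status = {}                       # sample id -> current status
            self.max_representatives = max_representatives

        def add(self, sample_id, checksum):
            self.by_checksum[checksum].append(sample_id)
            self.status[sample_id] = WAITING

        def pick_for_submission(self):
            # Submit at most a few representatives per distinct checksum.
            picks = []
            for checksum, ids in self.by_checksum.items():
                for sample_id in ids[: self.max_representatives]:
                    if self.status[sample_id] == WAITING:
                        picks.append((sample_id, checksum))
                        self.status[sample_id] = SUBMITTED
            return picks

        def apply_result(self, checksum, handled):
            # When results come back, resolve every pending sample that they cover.
            if handled:
                for sample_id in self.by_checksum[checksum]:
                    self.status[sample_id] = RESOLVED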

Active Network

Once samples are submitted from the administrator system, an active network processes them and transports them across the Internet for potential analysis by a central virus analysis center. This active network is designed to deal with epidemics or floods by handling as many submitted samples as possible within the network itself, leaving the analysis center to concentrate on a single copy of a new virus rather than its many siblings. Standard Internet transport and security protocols are used throughout to ensure reliable and safe transmission. The active network is a key part of a commercial-grade immune system. Without it, the system could not handle epidemics or floods. Without its security measures, the system would be vulnerable to eavesdropping and malicious spoofing.



This active network is described in more detail below. For our example, we assume that the sample appears to be a new virus, which cannot be handled within the network itself. It is therefore transported securely via the Internet to the virus analysis center.

Virus Analysis

Automated virus analysis is one of the keys to building an immune system. The virus analysis center’s job is to analyze the virus sample, to use the results of this analysis to create and test a cure for the new virus, and to package that cure as a virus definition update which can be distributed to users. This is another component of the immune system which is easy to do poorly, and which has been the subject of intense research and development at IBM Research for several years. The virus analysis center is described in more detail below.

Cure Distribution

Once the virus analysis center has created a cure in the form of a virus definition update, the update must be returned to the client that reported the initial infection. It must also be made available to other systems within the reporting organization and other systems around the world, so that they can be protected from the virus before the virus spreads to them. In our pilot immune system, the update is returned via the active network, using the same standard Internet transport and security protocols as were used for the sample on the way up. Once the update is received by the administrator system, any samples that are still waiting to be submitted are scanned to see if the updated virus definitions can handle them. (In particular, a copy of the original virus sample that was submitted for analysis is scanned.) When a sample can be handled by the new definitions, those definitions are sent to the client that submitted the sample. The clients install the updated virus definitions, scan themselves, and can disinfect whatever viruses are found. At the same time, the updated definitions can be made available to other client systems in the organization. As before, the immune system is designed so that all of this can happen automatically in the interest of rapid response. Also as before, the administrator can select which actions require human deliberation: sending the updated virus definitions to clients, scanning the clients with the updated definitions, and disinfecting any viruses that are found.

An Active Network to Handle Epidemics and Floods

Overview

The role of the active network is twofold. Under average loads, it provides a safe, reliable means of transporting virus samples from a customer to the virus analysis center, and transporting the resulting new virus definitions back to the customer. Under peak loads, such as epidemics and floods, it has the critical responsibility of handling potentially huge volumes of traffic both ways without clogging up the analysis center with requests to analyze the same virus (or the same clean file) over and over again. By its nature, the virus analysis center performs very computationally intensive tasks, and it cannot feasibly keep up with the millions of potential files that the immune system may receive during an epidemic or flood. The active network must intermediate between these requests and the analysis center.


Figure 3: The Active Network. Administrator systems, which send virus samples to the immune system, form the leaves of the active network. Samples travel through a hierarchy of filters, which handle the sample if it has already been analyzed as uninfected or as a known infected file. Otherwise, they forward it to the analysis center for analysis, resulting in updated virus definitions which are distributed downward to the gateways, to the administrator systems, and ultimately to the clients.

The active network is composed of nodes called “gateways”, which are arranged in a tree. The leaves of the tree are individual administrator systems, from which sample submissions originate and to which virus definition updates are delivered. The root of the tree is the virus analysis center. The purpose of this hierarchical structure is to ensure that adequate computing power is available to administrators to address their needs even when there are epidemics and floods.

Each gateway has two primary functions when a virus sample is submitted. First, it checks to see if it can handle the sample by itself. It does this by trying to match a checksum6 of the sample file with a database of checksums that correspond to previously analyzed files – files that are known to be clean and files known to contain a particular virus. If a match is found, a result is returned indicating that the file is not infected, or that it is known to be infected and can be handled with a virus definition set of a particular version or later. This can be done very quickly since the checksum is part of the header of the request. If the checksum matches that of a previously analyzed file, the sample file itself is not even transmitted to the gateway. If the checksum matches that of a file that has already been forwarded higher in the active

network for further analysis, the gateway does not have to receive the file. Instead, it waits for the results of the analysis in progress, and sends its results to everyone who submitted that same file. This means that floods of clean files, or known infected files like ExploreZip, can be dealt with very quickly at the lowest levels of the active network.

The second function of a gateway is to scan the sample file with the latest virus definitions, to see if these definitions handle the virus. It may be, for instance, that an administrator system, or a gateway lower in the tree, has not yet received the latest virus definitions. If the sample file can be handled, this definition file is returned, and the administrator system or lower gateway node is updated. This means that epidemics of a known virus, even one that was just examined by the analysis center a few minutes ago, can be handled quickly by the active network. If the sample file has not been analyzed before, and is not handled by the latest virus definitions, the gateway forwards the sample to the next higher node in the tree, which may be another gateway or may be the analysis center.

Under average loads, the gateways are flow-through systems. Sample files are held in the gateways only long enough to check them for known viruses. If they cannot be handled directly by the gateway, they are immediately sent to the next higher node. Under normal conditions, the rate at which files move through the gateways is dependent only on the rate at which the analysis center can accept them at the top of the tree. Under exception conditions, such as extremely heavy loads or a temporary outage higher in the network, sample files are held in a queue for transmission. This optimizes the speed with which samples can be processed under normal conditions, while providing a graceful method of dealing with exception conditions.

Once a sample has been examined by the analysis center, a message is returned to the active network indicating whether or not the sample was infected. The gateway adds this result to its database of previous results, using the file’s checksum as an index. If any files in the submission queue have this same checksum, they are removed from the queue. Status messages for all of these files are returned down the gateway tree to the administrator systems that submitted them, just as if they had all been analyzed. At the same time, the gateway returns the checksum to the gateways lower in the tree so they can similarly update their databases.

If the sample was infected, or if the sample contained a false positive that has now been corrected, an updated virus definition file is returned to the gateway. The gateway scans its submission queue to determine if any pending samples can be handled with this new definition file. For any samples that can now be handled, the updated virus definition file is returned to the gateways lower in the tree so they can similarly check their submission queues, and ultimately to the administrator system so the new definitions can be distributed. At the same time, status information is returned down the gateway tree to the administrator system, to inform the administrator of the identity of the virus and the version number of the virus definition file that handles this virus.

6 We use MD5, an Internet standard based on cryptographic technology, for checksums in the immune system.
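A minimal sketch of the two gateway functions described in the preceding paragraphs is given below. The method names, the callback-style replies, and the representation of the latest definitions are all assumptions made for the illustration; the actual gateway implementation is not specified in this paper.

    class Gateway:
        # Illustrative gateway node; the real protocol and data structures are not public.
        def __init__(self, scan_with_latest_definitions, forward_up):
            # scan_with_latest_definitions(sample) -> definitions version that handles it, or None
            # forward_up(checksum, sample)         -> escalate toward the analysis center
            self.scan = scan_with_latest_definitions
            self.forward_up = forward_up
            self.results = {}    # checksum -> verdict for files already analyzed
            self.in_flight = {}  # checksum -> replies waiting on an analysis already in progress

        def handle_header(self, checksum, reply):
            # Header-only check: the file body is transmitted only if it is really needed.
            if checksum in self.results:
                reply(self.results[checksum])           # clean or known infected: answer at once
                return False                            # do not send the file
            if checksum in self.in_flight:
                self.in_flight[checksum].append(reply)  # same file already on its way up: just wait
                return False
            self.in_flight[checksum] = [reply]
            return True                                 # first sighting at this node: send the file

        def handle_file(self, checksum, sample):
            version = self.scan(sample)
            if version is not None:                     # current definitions already handle it
                self.complete(checksum, ("known_virus", version))
            else:
                self.forward_up(checksum, sample)       # complete() runs when the result comes back

        def complete(self, checksum, verdict):
            # Record the verdict and fan it out to everyone who submitted this checksum.
            self.results[checksum] = verdict
            for reply in self.in_flight.pop(checksum, []):
                reply(verdict)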

Safety and Reliability

A system that is intended to handle virus emergencies must be reliable, especially in an emergency, and must not expose customers to risks such as disclosure of their sensitive information or the delivery of a forged virus definition file from an unscrupulous source. The immune system represents a significant advance over current industry practice in both of these dimensions. To meet the objective of reliability, a system must have a transaction protocol that guarantees delivery of the sample to the appropriate gateway or analysis center, ensures that an appropriate response is generated, and guarantees delivery of the updated virus definitions (or other response) back to the administrator system. We would not want to put the customer in the position of wondering if the sample arrived at the analysis center, or if a response might have gotten lost on the way back. In order to handle certain kinds of
floods, the transaction protocol must permit meta-information about the sample (e.g. its checksum) to be sent, and acted upon, without having to send the entire sample file, which may potentially be quite large and time-consuming to transmit.

To meet the objective of security, a virus protection system must encrypt the virus sample, virus definition files and any information sent along with them, to prevent disclosure of potentially sensitive customer information. In fact, the immune system encrypts the entire transaction stream that sends virus samples and virus definitions through the active network. A virus protection system must also authenticate the updated virus definition files, both to certify to the administrator that they came from the authentic analysis center, and to ensure that they have not been changed en route. The immune system does this too.

Figure 4: The Active Network Protocol Stack. The special-purpose transaction protocols that implement the active network are built on top of international standards for structured data, reliable transport, and secure communications.

We have created special-purpose transactions for use in the active network. These transactions send samples up, and send back status information and virus definition files. However, we have been careful to use only international standard protocols for the structure, transport and security protocols on which this communication is based. As a transaction protocol we use HTTP, an Internet standard. For security, we use SSL, an Internet security standard. We use DES, RSA and DSA as the underlying cryptographic primitives, which are international standards. We use TCP/IP, again an Internet standard, as a transport protocol. It is notoriously difficult to get transaction and security protocols right, and attempting to create new ones when established, well-understood protocols will do is ill advised at best. For a system that must be reliable and must be secure, time-tested international standards are the right choice.
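To make the checksum-first transaction concrete, the sketch below shows what a client-side submission over HTTP-over-SSL might look like. The gateway address, URL paths, header name, and JSON responses are hypothetical; the pilot's actual message formats are not described in this paper.

    import hashlib
    import requests  # ordinary HTTPS client; TLS provides the encrypted channel

    GATEWAY = "https://gateway.example.com"  # hypothetical gateway address

    def submit_sample(path):
        with open(path, "rb") as f:
            data = f.read()
        checksum = hashlib.md5(data).hexdigest()

        # Step 1: send only the meta-information; the checksum travels in a header,
        # so a previously analyzed or already in-flight file is never re-transmitted.
        r = requests.post(GATEWAY + "/submit", headers={"X-Sample-MD5": checksum}, timeout=30)
        r.raise_for_status()
        status = r.json()

        # Step 2: upload the body only if this gateway has never seen the checksum before.
        if status.get("action") == "send_file":
            r = requests.post(GATEWAY + "/submit/" + checksum, data=data, timeout=300)
            r.raise_for_status()
            status = r.json()
        return status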


It should be clear, however, that the active network is not “Web-based” in any sense. Administrator systems are not Web browsers. The immune system software that they run cannot connect to any machines other than immune system gateways. Indeed, it is incapable of even talking to other machines on the Internet because these other machines do not share the SSL encryption keys that the gateways use for communication. Similarly, gateways are not Web servers. They do not serve Web pages at all, and Web browsers cannot communicate usefully with them because the browsers do not know the SSL encryption keys that the gateways use.

Scaling the Active Network

The active network is designed to be easy to scale up to larger transaction volumes as the nature of the virus problem changes. Additional gateways can be added to the tree, and the gateways around them reconfigured to understand the addition of the new gateways. Nothing else needs to change. As an example, if it turns out that there is a particularly large amount of traffic coming from the Isle of Skye7, it would be easy to set up an additional gateway specifically to handle traffic from that location. If the computer-using population of Europe doubles overnight, doubling the number of gateways devoted to Europe would ensure balanced traffic. In practice, we expect that even high, peak traffic can be processed successfully with only a handful of gateways.

7 This example might seem far-fetched, but you get the idea.

Automated Virus Analysis Center

Overview

The job of the analysis center is to determine if the sample contains a virus by actually getting the virus to spread. If it does contain a virus, the analysis center analyzes it and produces a virus definition update that can detect, verify and disinfect the virus. This virus definition file is tested to ensure that it works correctly on all available samples of the virus. A problem in any phase of this process results in the virus sample being sent to human analysts for processing. Once it completes testing, the virus definition file is then sent out via the active network to all organizations that submitted samples of this virus.

As shown in Figure 5, the analysis center consists of a network of computers, isolated from the rest of the world by a firewall for security purposes. Pools of NT and IBM RISC/6000 worker systems are used to do each phase of sample processing. A supervisor system is in charge of coordinating all activity inside the analysis center.



Figure 5: The Virus Analysis Center. Samples come into the virus analysis center from the active network through a firewall, which isolates the virus analysis center from the rest of the Net. Samples are queued for processing under control of a supervisor system, which tracks priorities and status, assigning tasks to pools of worker systems, until analysis is complete and updated virus definition files are returned to the active network through the firewall. A server stores virus replication environments and contains an archive of everything done in the analysis center. The pools of worker systems can be expanded dynamically to scale the analysis center to larger workloads.

When a sample arrives in the virus analysis center, it is placed on a queue pending analysis. Since the active network is designed to be a flow-through system in most cases, we expect all submitted samples that require analysis to end up in this queue. If there are many submissions in a short period of time, which is characteristic of virus emergencies such as epidemics, this queue may need to be rather large. A priority is assigned to the sample, which is used to determine the resources allocated to that sample. While the priority system is very flexible, its current use is very simple. Urgent customer samples are put through with high priority, and get first use of any available worker machines for processing. Normal customer samples are put through with medium priority, getting use of machines that are not in use by urgent cases. “Zoo” samples are those submitted by virus lab personnel for routine analysis, usually from large collections of viruses not known to be in the wild anywhere. These are assigned low priority, and are analyzed when no customer samples need work. Multiple samples with the same priority are processed in the order in which they arrive at the analysis center.
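The queueing discipline described above (three priority classes, first-come-first-served within each class) can be captured in a few lines. The numeric encoding of the priorities is an assumption for the sketch; only the three classes themselves come from the text.

    import heapq
    import itertools

    # Hypothetical numeric encoding: lower number = served first.
    URGENT, NORMAL, ZOO = 0, 1, 2

    class SampleQueue:
        def __init__(self):
            self._heap = []
            self._arrival = itertools.count()  # tie-breaker that preserves arrival order

        def put(self, sample, priority):
            heapq.heappush(self._heap, (priority, next(self._arrival), sample))

        def get(self):
            _, _, sample = heapq.heappop(self._heap)
            return sample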

The Supervisor
A supervisor system oversees the flow of samples through the system. It is responsible for keeping track of which worker machines are available, parceling out work to them, noticing when an assigned task is complete, and noticing if something goes wrong during a task and intervention is needed.


Each sample goes through a number of processing stages, which we describe below. The supervisor keeps track of the current stage of each sample: it knows what must be done next, and what machine resources are needed to carry out that task. When a worker system becomes available, the supervisor selects the next sample on the queue that can use that particular system for its next analysis stage, and dispatches a task to that machine along with the virus sample and its history. When the task is complete, the supervisor adds the result to the sample's history and puts the sample back on the queue to await its next stage of processing. Once a sample has completed all of the processing stages, its history is archived and the resulting virus definition, if any, is sent out. The supervisor provides architectural isolation to the various tasks, separating machine-resource concerns, prioritization and queuing from the actual job of analyzing the virus. The resulting architecture is robust enough to run continuously, a critical feature of a system that must respond to virus emergencies. It is also modular enough that new methods for analyzing viruses, even analysis tasks for entirely new kinds of viruses, can be added both easily and dynamically.
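The following is a minimal sketch of one scheduling pass of a supervisor of this kind: match an idle worker to the next queued sample whose upcoming stage that worker can run, record the result in the sample's history, and re-queue the sample for its next stage. The stage names, dictionary fields, and the `run_stage` callable are assumptions made for the example.

```python
# Minimal sketch of the supervisor's dispatch step. Samples are dicts with
# "next_stage" (an index into STAGES) and "history" (a list of results);
# workers are dicts with a "can_run" set of stage names.

STAGES = ["classify", "replicate", "analyze", "build_definition", "test"]

def dispatch_once(queue, idle_workers, run_stage):
    """Assign at most one task per idle worker; mutates queue and idle_workers."""
    for worker in list(idle_workers):
        for sample in list(queue):
            stage = STAGES[sample["next_stage"]]
            if stage in worker["can_run"]:
                queue.remove(sample)
                idle_workers.remove(worker)
                result = run_stage(worker, sample, stage)   # do the work
                sample["history"].append((stage, result))   # record the outcome
                sample["next_stage"] += 1
                if sample["next_stage"] < len(STAGES):
                    queue.append(sample)                    # awaits its next stage
                break                                       # this worker is now busy
```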

Integration with Back Office Systems
The virus analysis center is integrated with back office systems that track customer incidents, build new virus definitions, and maintain a database of virus definitions. This is complicated by the fact that human virus analysts and the virus analysis center will be analyzing viruses (albeit different viruses) at the same time. Customer incident numbers must be assigned consistently so that technical support staff can respond properly to customer calls about the status of a submitted sample. Virus definition version numbers must be assigned sequentially so that it is clear that one set of definitions is a superset of previous definitions, and this must be done no matter who creates the new definition – human or machine.
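A minimal sketch of serialized version numbering follows: both humans and the automated analysis center obtain the next sequence number through a single, locked allocator, so no number is ever handed out twice. The in-process lock here stands in for whatever locking the real back-office systems use; names and the starting value are assumptions for the example.

```python
# Minimal sketch of serialized definition-sequence numbering shared by human
# analysts and the automated analysis center.
import threading

_lock = threading.Lock()
_last_sequence = 1000            # illustrative starting point

def next_definition_sequence():
    """Allocate the next definition version number, one requester at a time."""
    global _last_sequence
    with _lock:                  # humans and the analysis center share this lock
        _last_sequence += 1
        return _last_sequence
```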

Virus Analysis Tasks
As illustrated in Figure 6, a sample goes through a number of steps to determine whether it is infected and to analyze whatever virus might be present. The type of virus it might contain is first classified, then the virus is replicated enough times for analysis to be reliable. The virus is analyzed, and information to detect, verify and disinfect it is extracted. This information is used to create a virus definition, which is then tested against all of the samples of the virus. If all of these steps are successful, the updated virus definitions are returned. These processes are each implemented as modular, isolated tasks. The supervisor can dispatch any of them to any number of worker systems at any time. As a result, several viruses can be analyzed simultaneously, and several machines can be devoted to the analysis of a single, difficult virus. These tasks are now described in some detail.


Figure 6: Processes within the Analysis Center. The supervisor directs virus samples through the various processing stages. Computationally intensive stages, such as replication, can be done in parallel to complete the process more quickly. The fail-safe design of the analysis center defers problematic samples, at any stage, to a human analyst.

Classification
The first step in analyzing a virus is to determine what type of virus it is, so that specialized type-specific routines can be brought to bear. For Microsoft Word files, the classification task currently identifies the version of Word and determines, as best it can, the language of the file (English, French, etc.). For Microsoft Excel files, it determines the version of Excel. For DOS file viruses, it determines whether they are COM or EXE files. To ensure reliability, this classification is done by examining the structure of the file, rather than by looking at the filetype. As we develop additional virus analysis capabilities, it is easy to extend the classifier to recognize boot sectors, Win32 executables, and so on.

Creation of the replication environment
Once the sample has been classified, it must be replicated, both to determine that it is, in fact, a virus, and to create enough samples that it can be analyzed reliably. The first step in replication is to set up a virtual environment in which the virus is likely to replicate. For Microsoft Word and Excel viruses, we use the version (and language) of Word or Excel determined by the classifier, running in a Windows emulator on an IBM RISC System/6000 under AIX. For DOS file viruses, we use a DOS emulator running under Windows NT. Running viruses in an emulated environment provides two benefits. First, it is easy to ensure that the virus runs safely, and is incapable of infecting any real machine in the analysis center. Second, it allows us to instrument the environment so that the analysis center can sense what the virus is doing as it does it. This is a valuable aid in analysis. When an appropriate replication environment has been selected, an image of that environment is obtained from the server and installed on one or more worker machines of the proper type.

Replication
Replication tasks are now dispatched to the worker machines whose replication environments were set up in the previous step. Replication tasks run the virus in the emulated environment, trying to make it infect “goat” files (uninfected files of known structure put there for exactly this purpose). To try to infect the “goat” files, the system emulates the actions that an expert human virus analyst would try. For instance, DOS executable “goat” files are run as programs in the emulated environment after the virus has been run to infect that environment. Microsoft Word “goat” files are read into Microsoft Word as documents, modified and written back out. Key sequences are fed to the emulated machine to simulate a user typing. This might sound easy; in fact, many different techniques are needed to get various viruses to respond by infecting other files, and these techniques are under constant refinement. Automated replication is one of the more difficult problems to solve in order to analyze the vast majority of viruses automatically. The goal of replication is to obtain enough virus samples to permit analysis to be done reliably. If, for some reason, the first set of replication tasks does not generate enough samples, more can be dispatched, and the process can be repeated until enough samples are available.

Analysis
Once enough samples have been replicated, the virus can be analyzed. In fact, some of this analysis has already been done as part of the replication task, since it had to know enough about the virus to determine that it had replicated and that there were a sufficient number of good samples. If several different forms of a virus have been generated (e.g. upconversions of macro viruses), each form is analyzed separately and may result in additional virus definitions. Completing the analysis involves activities like extracting a good signature string for the virus, constructing a map of all of its regions for verification, and creating disinfection information. Extracting a “good” signature string means picking one that has a negligible chance of causing a false positive: an automated system whose responses might cause problems will not be trusted, and if customers require lengthy, manual testing of every updated virus definition, it will not be possible for them to update their clients quickly enough to protect them from a fast-moving virus. The analysis center uses very sophisticated statistical techniques to ensure much higher reliability of its signatures than can be obtained by humans, even when tested against a large corpus of clean files [21]. These turn out to be very challenging technical problems, which the anti-virus industry largely believed to be impossible when we first started working on them. Their solution is described elsewhere [21, 22]. The result of virus analysis is a set of source files from which virus definitions can be produced.
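To make one ingredient of this concrete, the sketch below picks a candidate signature that is present in every replicant but absent from a corpus of clean files. The actual system uses far more sophisticated statistical techniques [21]; the fixed window size, byte-substring matching and toy data here are assumptions made purely for illustration.

```python
# Minimal sketch of naive signature selection: find a byte string common to
# all replicant samples that never appears in a clean-file corpus.

def candidate_signatures(replicants, length=16):
    """Substrings of the given length shared by all replicant samples."""
    first = replicants[0]
    shared = set()
    for i in range(len(first) - length + 1):
        chunk = first[i:i + length]
        if all(chunk in r for r in replicants[1:]):
            shared.add(chunk)
    return shared

def pick_signature(replicants, clean_corpus, length=16):
    for sig in candidate_signatures(replicants, length):
        if not any(sig in clean for clean in clean_corpus):
            return sig           # no hit anywhere in the clean corpus
    return None                  # nothing suitable; defer to a human analyst

# Toy usage with byte strings standing in for infected and clean files.
replicants = [b"AAAA<virus body 0123456789ABCDEF>BBBB",
              b"CCCC<virus body 0123456789ABCDEF>DDDD"]
clean = [b"just an ordinary document", b"another clean file"]
print(pick_signature(replicants, clean))
```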

Definition generation
The definition generation task starts with the virus definition source produced by the analysis task and creates a complete set of virus definition files, including definitions for all viruses to date. Human analysts can also create updated virus definitions, of course, as a result of their manual analysis of viruses. In order to ensure consistency between definitions generated by humans and those generated automatically, and to ensure regularity of the sequence numbering of virus definitions, both humans and the analysis center use a single definition generation system. When a new set of definitions is to be generated, the definition generation system is locked by either the human or the analysis center, a new sequence number is created, the definition is generated and tested, and the definition generation system is then unlocked for subsequent use. As a result, the definition generation step is a serialized resource in the system, and becomes a bottleneck if it is not sufficiently fast.

Test
Once an updated virus definition file is available, the test task uses these definitions to ensure that all of the samples can be detected, and that all of the goat files can be returned to their original form by disinfecting them. The virus definition must properly detect, verify and disinfect all files; no exceptions are permitted. Once a virus definition file has passed test, it is packaged up and sent out by the supervisor system to the active network as the solution to the submitted virus.
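A minimal sketch of such a test step follows: every sample must be detected, and disinfecting each infected goat file must restore it byte for byte. The `detect` and `disinfect` callables stand in for the real scanner driven by the new definitions; all names are assumptions for the example.

```python
# Minimal sketch of the test task: fail if any sample is missed or any goat
# file is not restored exactly to its original contents.

def definitions_pass_test(samples, infected_goats, originals, detect, disinfect):
    if not all(detect(s) for s in samples):
        return False                          # every sample must be detected
    for goat, original in zip(infected_goats, originals):
        if disinfect(goat) != original:
            return False                      # disinfection must be exact
    return True                               # no exceptions permitted
```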

Deferring Problematic Samples
No matter how good we get at analyzing viruses automatically, some samples may lie beyond the current state of the art. The analysis center might decide at any stage that it cannot process a sample further: it might be unsuccessful at replicating the virus, it might be unable to analyze the replicants, or the definition might fail the test phase. The inability to process a sample can happen because it contains a new or complex type of virus that the analysis center cannot currently handle. Alternatively, the sample might be uninfected, in which case it will not replicate. In the former case, a human analyst will have to examine the virus, create a virus definition update, and help us understand how to enhance the analysis center to handle such viruses in the future. In the latter case, a human can verify that no virus is present, and update the active network so it recognizes this file as uninfected if it is ever seen again. In either case, problematic samples are deferred from the analysis center to a human analyst. (Our own biological immune systems handle the vast majority of germs to which we are exposed; when they cannot, human doctors are called in to help. Our cyberspace immune system will likewise require human experts to handle exception cases. By focusing automated response on the common problems, both immune systems allow the human experts to concentrate on the harder cases where their expertise is needed.) When a sample is deferred, the human analyst is provided with all of the results that were obtained by the analysis center, including classification information, replicants, analysis and the results of any testing. This information can give the human analyst a valuable head start. When the human analyst is finished, she returns information to the analysis center that allows sample processing to complete.

Safety and Reliability
The analysis center has been designed to operate 24 hours a day, producing safe, reliable virus definitions. It is fault tolerant: if a worker machine experiences a hardware failure, it is automatically removed from the pool of machines and any tasks assigned to it are reassigned. It is transaction based: it recovers from serious failures such as power outages by simply backing up to a previous good state and continuing its operation, without losing any submitted samples. It is isolated from outside interference by a strict firewall. Viruses are stored in non-executable form whenever possible, and are only executed on virtual machines from which they cannot escape. (There is no analog in this system of a virus like HIV, which infects the immune system itself. This is one of the important differences between the biological immune system and the system described here: the former consists of cells that can be infected by viruses, just like other cells in the body, while in the latter, viruses are run in protected virtual machines from which they cannot escape.) Detection signatures are extracted so as to ensure extremely low false positive rates. Full verification information is added to virus definitions, so that disinfection is only attempted if a virus exactly matches the one analyzed. This eliminates the risk that a (rare) false positive could lead to an improper attempt to disinfect. Virus definitions undergo rigorous testing before they are released. Finally, if a problem is encountered in any phase of the analysis, the sample is deferred to human analysts. This fail-safe policy ensures that the analysis center only produces dependable definitions.
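The verify-before-disinfect policy can be sketched as follows: a definition carries a map of byte regions that must match the analyzed virus exactly, and if any region differs, no disinfection is attempted. The region-map format and function names are assumptions for the example, not the actual definition format.

```python
# Minimal sketch of exact-match verification before disinfection.

def verify(sample: bytes, region_map) -> bool:
    """region_map is a list of (offset, expected_bytes) pairs to check."""
    return all(sample[off:off + len(expected)] == expected
               for off, expected in region_map)

def maybe_disinfect(sample, region_map, disinfect):
    if verify(sample, region_map):
        return disinfect(sample)
    return sample          # leave the file untouched and defer to a human
```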

Scaling the Analysis Center
The design of the analysis center anticipates the necessity of responding to an increased load on the system in the future. Individual virus incidents are handled in parallel, and each individual processing step can also be done in parallel. In particular, the most time-consuming step in sample processing is replication, in which several potential environments might have to be tried before one is found in which the virus spreads sufficiently to generate enough replicants. The replication step, too, can be done in parallel: replication in different environments can be attempted simultaneously. Because of this parallelism, simply reconfiguring the analysis center with more worker machines increases both the rate at which an individual virus can be analyzed and the overall throughput of the analysis center. Adding new worker machines can be done dynamically, without taking the system down: when a new worker machine is added, the supervisor notices the new resource, adds it to its resource pool, and starts using it immediately. Roughly, doubling the number of worker machines doubles the overall throughput of the analysis center, up to the point where definition generation becomes a bottleneck. Similarly, when a new type of virus arises, new analysis modules can be added to the system dynamically. If, for instance, we develop the capability to analyze a new class of Unix viruses, the analysis modules can be added to the system and can begin processing new Unix viruses without any interruption in the service of the analysis center.

How the Immune System Handles Loads
We have suggested that a system designed to respond to virus emergencies must handle average loads flawlessly. It must handle very heavy peak loads without denying service to any customer and should, at worst, inflict only slight delays. If something does cause an overload condition, the system should handle it gracefully, without requiring any action on the part of customers, and should at worst recover and continue to process requests as any backlog is cleared up. No commercial anti-virus system currently satisfies these criteria. The immune system does.

Average Loads
Under average conditions, the immune system is designed to handle the load of new viruses, and the demand for virus definition updates to deal with submitted samples, with ease. (See below for performance estimates of the pilot system.) In fact, there is so much spare capacity in the pilot system that we will use it to analyze the hundreds of new zoo viruses we expect to receive in collections during the pilot period, and still expect to have capacity left over. The active network and analysis center are designed to be robust, fault-tolerant systems that operate 24 hours a day.


Peak Loads
During peak loads, the immune system continues to operate as usual, but focuses its efforts on urgent customer incidents. During an epidemic, a few samples of the new virus from each administrator system will be transported to the analysis center by the active network. When they arrive at the analysis center, they take priority over lower-priority zoo viruses, which wait in the queue until customer incidents are completed. The first of these customer samples will be analyzed, typically in less than an hour. Once an updated virus definition is available, all of the remaining samples in the queue will be scanned with the new definitions, and the definitions will be sent out to all affected organizations. The pilot version of the analysis center could easily handle thousands of infected organizations in that first hour; the active network would deal with any that submitted samples subsequently. Similarly, during a widespread panic in which people think a particular (uninfected) file might be infected, a few samples of the file will be sent from each administrator system. The analysis center will recognize that it cannot find a virus in the file, and defer its analysis to a human. Once the human has determined that the file is clean, the analysis center sends that message to all organizations that have submitted it so far. Subsequent submissions of that same file are handled immediately at the lowest levels of the active network.

Overload
The immune system is designed so that overload is a rare exception. While it could occur during an extremely widespread epidemic of a very fast-spreading virus, we think it is more likely to happen because of a network outage or other failure. In the former case, the input queue in the analysis center could fill up before the first sample of the virus has completed analysis. In the latter case, the administrator systems might not be able to contact the gateway, or the gateway might not be able to contact the analysis center. In any of these cases, samples that cannot be transmitted are enqueued on the administrator system or the gateway, as appropriate. Once the backlog is cleared or the communications problem fixed, the enqueued samples are transported as usual. The only effect of overload is increased delay in transporting the samples. The same is true on the downward path, as updated virus definition files are sent to customers. No samples are lost, service is never denied to any customer, and customers are not required to intervene to ensure that their samples are processed and their updated virus definitions are returned.
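A minimal sketch of this store-and-forward behaviour follows: samples that cannot be transmitted are queued locally and flushed once the link is restored, so nothing is lost and no customer action is required. The `send` callable, exception type and queue structure are assumptions for the example.

```python
# Minimal sketch of store-and-forward queuing under overload or outage.
from collections import deque

pending = deque()

def submit(sample, send):
    """Try to send now; otherwise queue the sample for later delivery."""
    try:
        send(sample)
    except ConnectionError:
        pending.append(sample)      # nothing is lost, only delayed

def flush(send):
    """Drain the backlog once communication is restored."""
    while pending:
        sample = pending.popleft()
        try:
            send(sample)
        except ConnectionError:
            pending.appendleft(sample)
            break                   # link is still down; try again later
```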

Current Capabilities and Performance
We are currently working on a customer pilot of the immune system described here. In the pilot, we are working with a small number of large customers to validate the usefulness of the system in their environments and to understand what enhancements would be most valuable as we build towards a system that could be deployed as a product. This section discusses the capabilities of the pilot system, which might be substantially different from any eventual product implementation.

Active Network
While the active network is capable of being fully hierarchical, the pilot system uses a single gateway as the contact point for all pilot customers. This is consistent with the very small number of pilot customers and the expected peak loads on the pilot system. In a possible product version, we would expect at least two levels of hierarchy, with gateways deployed in several of the major geographies of the world.


Even so, we estimate that the active network in the pilot system is capable of supporting an upload rate of 100,000 virus samples per day, and a download rate of approximately 10,000 virus definitions per day. (In the pilot system, no attempt is made to minimize the size of the virus definition updates; by minimizing the amount of data transmitted for each update, we can improve download rates by a factor of 1,000 to 10,000.) The gateway’s database of results from previously analyzed samples, which it uses to handle any future submission of these same samples, will hold 10 million results in 1 GB of disk.
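The gateway's results database can be sketched as a content-addressed cache: previously analyzed samples are keyed by a cryptographic hash of their contents, so a repeat submission is answered at the gateway rather than travelling to the analysis center. The in-memory dictionary and function names below are assumptions standing in for the real on-disk store.

```python
# Minimal sketch of the gateway's cache of results for previously
# analyzed samples, keyed by a hash of the sample contents.
import hashlib

results = {}       # sample hash -> previously returned verdict / definition id

def lookup_or_forward(sample_bytes, forward):
    key = hashlib.sha256(sample_bytes).hexdigest()
    if key in results:
        return results[key]              # answered locally at the gateway
    verdict = forward(sample_bytes)      # sent on toward the analysis center
    results[key] = verdict
    return verdict
```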

Analysis Center
In the pilot configuration, the analysis center is equipped with three AIX worker machines (used primarily for macro virus replication) and three NT worker machines (used for all other analysis tasks). This already provides substantial benefits from parallelism, and additional worker machines can be added dynamically. The input queue in the analysis center is currently capable of holding approximately 8,000 samples awaiting analysis; it can be expanded easily by increasing the total disk space on the supervisor system.

Macro Viruses
The analysis center can currently analyze Microsoft Word and Microsoft Excel macro viruses in Office 95, Office 97 and Office 2000 formats. It can handle Microsoft Word documents in any of ten languages: English, French, German, Italian, Spanish, Polish, Dutch, Brazilian Portuguese, Japanese and Taiwanese. A Japanese virus will not spread in an English version of Microsoft Word, nor will an Office 2000 virus spread in an Office 95 version of Microsoft Word. A separate replication environment is therefore used for each format and language, to ensure that viruses in these formats and languages execute and spread properly in the virtual machines. This means that the analysis center can successfully replicate and analyze viruses that are specific to any of these versions of Microsoft Word or Excel, and specific to any of these languages.

Macro viruses written for Office 95 will often be converted to Office 97 format when the document they infect is modified with Office 97, and similarly for Office 2000. These are called “upconverted” viruses, and are converted to a different format that cannot be detected by anti-virus software that is looking only for their Office 95 version. Similarly, Microsoft Excel viruses in Office 97 documents can be “downconverted” to Office 95 format. The analysis center automatically performs all upconversions and downconversions, analyzing all of the resulting formats along with the original one. This helps ensure that all of the various forms of the virus will be handled by the virus definitions that are returned.

Polymorphic macro viruses change their appearance every time they spread. Devolving macro viruses can sometimes fail to copy all of their macros when they spread, losing pieces of themselves as they go. Mass-copying macro viruses will copy any macros that happen to be in any document they infect, picking up extra macros as they go. The analysis center currently recognizes each of these conditions, produces replicants of the virus, and defers them to a human for analysis.

In tests to date, the analysis center analyzes and produces complete definitions for over 80% of the macro viruses that are in the wild. If the analysis center is working on only a single macro virus, it can typically complete analysis from beginning to end in 30 minutes in its current configuration. If many macro viruses are queued up for analysis at the same time, so that the worker machines are used most efficiently, the analysis center can complete analysis of four viruses per hour on average. As the number of worker machines is increased, the turnaround time will continue to decrease and the throughput will continue to increase, though the increase is not linear.


DOS File Viruses
DOS file viruses are replicated in a virtual DOS machine under Windows NT. While a variety of DOS environments could be tried on a given virus, the current system uses only one virtual DOS environment. Polymorphic viruses are deferred for human analysis at present, though the technology for analyzing them automatically has been well understood for some time. In tests to date, the analysis center replicates over 80% of the DOS file viruses that are in the wild, though complete definitions are produced for only about 50% of the viruses in the wild. If the analysis center is working on only a single DOS file virus, it can typically complete analysis from beginning to end in 20 minutes. If many DOS file viruses are queued up for analysis at the same time, so that the worker machines are used most efficiently, the analysis center can complete analysis of seven viruses per hour on average. Increasing the number of worker machines will continue to increase the throughput, though not linearly; it will not have a significant effect on turnaround time.

Conclusions and Future Work
Solving the problem of epidemics of fast-spreading viruses requires a very different approach than the anti-virus industry has taken historically. The immune system that we have developed solves this problem, and does so safely and reliably, so it can be used by real customer organizations in day-to-day operation. We will be using the results of our customer pilot to understand any changes that might be needed for a commercial deployment. In addition to tidying up the existing technology, we have a list of useful technologies to add over time. Beyond simple DOS file viruses and Microsoft Word and Excel macro viruses, there are a number of important virus classes to add to the immune system. We have technology in place that analyzes boot viruses, but have not yet integrated it into the immune system. We are working on technology that handles bimodal and polymorphic viruses, as well as Access, PowerPoint and Win32 viruses. We have started work on the important class of inter-machine worms, which require the ability to emulate an entire network of machines on which the worm might spread. We expect the virus problem to continue to evolve, just as it has for the past decade or so, and sometimes in unexpected directions. The immune system is likely to be an important tool for controlling the spread of viruses for the foreseeable future.

Acknowledgements
The authors gratefully acknowledge the assistance of the Norton AntiVirus group at Symantec Corp., and especially the members of SARC and the people working on the Symantec Digital Immune System™. The authors thank all of the people who have, over the years, contributed to IBM AntiVirus and to IBM’s immune system technology, especially Igor Bazarov, Abhay Bhandarkar, Pascal Bizien, Jeff Boston, Jean-Michel Boulay, Voytek Chwilka, Anni Coden, Laura Copel, Galen Doak, Gleb Esman, John Evanson, Christian Falterer, Richard Ford, Donny Gilor, Sarah Gordon, Guner Gulyasar, Sanjeev Hatwal, Rob Herstein, Bruce Hicks, Srikant Jalan, Jeff Kephart, Robert B. King, Andy Klapper, Sophia Krasikov, Ken Lockhart, Claudia McGhee, Mahesh Moorthy, Alex Morin, Milosz Muszynski, Daniel Norton, Bill Palis, Charlie Parker, Raju Pavuluri, Frederic Perriot, August Petrillo, Alexey Polyakov, Sankar Ramalingam, Andrew Raybould, Martin Retsch, Rhonda Rosenbaum, Janet Savage, Alla Segal, Rich Segal, Bill Schneider, Bob Schultz, Gregory B. Sorkin, Riad Souissi, Glenn Stubbs, Till Teichmann, Gerry Tesauro, Stefan Tode, Kenny Tran, Hooman Vassef, Senthil Velayudham, Ian Whalley, Michael Wilson, Jonathan Woodbridge and Ahmad Ziadeh.

References
[1] The technology of an earlier, demonstration version of the immune system was described in: Jeffrey O. Kephart, Gregory B. Sorkin, Morton Swimmer, and Steve R. White, “Blueprint for a Computer Immune System”, Proceedings of the 1997 International Virus Bulletin Conference, San Francisco, California, October 1-3, 1997. Also http://www.av.ibm.com/InsideTheLab/Bookshelf/ScientificPapers/Kephart/VB97/
[2] Donn Seeley, “A Tour of the Worm”, USENIX Conference Proceedings, pp. 287-304, Winter 1989, San Diego, CA.
[3] Eugene H. Spafford, “The Internet Worm: Crisis and Aftermath”, Communications of the ACM, Vol. 32, No. 6, pp. 678-687, June 1989.
[4] Internet Software Consortium, http://www.isc.org/dsview.cgi?domainsurvey/host-count-history
[5] A description of the Michelangelo virus can be found on the Web at: http://www.symantec.com/avcenter/venc/data/stoned.michelangelo.html
[6] Steve R. White, Jeffrey O. Kephart, and David M. Chess, “The Changing Ecology of Computer Viruses”, Proceedings of the Sixth International Virus Bulletin Conference, Brighton, UK, 1996, pp. 189-202.
[7] Private communication from several anti-virus vendors.
[8] A description of the Melissa virus can be found on the Web at: http://www.symantec.com/avcenter/venc/data/mailissa.html
[9] CERT Advisory, http://www.cert.org/advisories/CA-99-04-Melissa-Macro-Virus.html
[10] Matt Richtel, “Super-Fast Computer Virus Heads Into the Workweek”, New York Times, Technology Section, March 29, 1999.
[11] Steve R. White, “All Aboard the Melissa Express”, antivirus online, http://www.av.ibm.com/BreakingNews/VirusAlert/Melissa/
[12] A description of the ExploreZip worm can be found on the Web at: http://www.symantec.com/avcenter/venc/data/worm.explore.zip.html
[13] CERT Advisory, http://www.cert.org/advisories/CA-99-06-explorezip.html
[14] David Chess, “PrettyPark and ExploreZip – More programs not to run!”, antivirus online, http://www.av.ibm.com/BreakingNews/VirusAlert/PrettyPark/
[15] Associated Press, “Worm Attack May Be Slowing, Experts”, New York Times, Technology Section, June 15, 1999.
[16] Steve White, “The Mother of All False Positives”, Virus Bulletin, December 1991, page 2.
[17] Steve R. White, Jeffrey O. Kephart, and David M. Chess, “Computer Viruses: A Global Perspective”, Proceedings of the Fifth International Virus Bulletin Conference, Boston, 1995, pp. 185-191.


[18] Symantec Corporation, “Understanding Heuristics: Symantec’s Bloodhound Technology”, Symantec White Paper Series, Volume XXXIV, http://www.symantec.com/avcenter/reference/heuristc.pdf
[19] Gerald Tesauro, Jeffrey O. Kephart, and Gregory B. Sorkin, “Neural Networks for Computer Virus Recognition”, IEEE Expert, Vol. 11, No. 4, Aug. 1996, pp. 5-6. Also http://www.av.ibm.com/InsideTheLab/Bookshelf/ScientificPapers/Tesauro/NeuralNets.html
[20] Fred Cohen, “Computer Viruses - Theory and Experiments”, Minutes of the 7th Department of Defense / NBS Computer Security Conference, pp. 240-263, Sept. 24-26, 1984.
[21] Jeffrey O. Kephart and William C. Arnold, “Automatic Extraction of Computer Virus Signatures”, Proceedings of the 4th International Virus Bulletin Conference, Jersey, UK, 1994, pp. 179-194.
[22] U.S. Patent 5,485,575, David M. Chess, Jeffrey O. Kephart, and Gregory B. Sorkin, “Automatic Analysis of a Computer Virus Structure and Means of Attachment to its Hosts”.

