How to Defeat Advanced Malware: New Tools for Protection and Forensics (2015)
Chapter 5. Micro-Virtualization
Micro-virtualization is a new system architecture that uses the hardware-virtualization features offered on current CPUs, together with a specialized hypervisor called a Microvisor, to hardware-isolate user-initiated activities or software programs running on an endpoint. Hardware-isolated activities (called micro-VMs) are given a virtualized file system and network stack whose access privileges are constrained according to the principle of least privilege: the task has access only to the specific files, IP services, websites, and subnets that it needs, and no more. In addition, the task has no access to system hardware or any other privileged system resource.
Micro-virtualization applies granular hardware isolation to individual activities to robustly enforce least privilege. Access to privileged resources (e.g., networks, devices, and the file system) takes place through a narrow hypercall interface.1 If a task is compromised, the malware is contained by the virtualization hardware, which ensures CPU, memory, and I/O isolation. To compromise the Microvisor, malware must attack it via the hypercall interface, which is implemented in approximately 10 KLOC and thoroughly hardened. The Microvisor enforces mandatory access control on any privileged system resource to prevent privilege escalation, and it also converts the format of untrusted content that is passed to privileged resources (printers, clipboard, etc.) to stop potentially harmful content from reaching the OS kernel.
Micro-virtualization offers numerous benefits:
• The Microvisor can be used on all modern CPU architectures, and can be stacked on a conventional hypervisor. Bromium’s implementation uses an extension of the open-source Xen Hypervisor2 called micro-Xen3 that runs on x86 and ARM platforms that support hardware virtualization. It has a small code base – about half the size of Xen – making it easier to harden.
• The Microvisor is a late-load hypervisor that is distributed to PC/laptop endpoints as with any program. It may also be simply incorporated by system manufacturers on tablet and mobile phone systems, without the need to modify the operating system.
• Micro-VMs are lightweight hardware-isolation containers that, unlike VMs, can be created and destroyed quickly (in about 10 ms), so that they can be applied at the granularity of a single user task (e.g., each tab in the browser) with negligible effect on the user experience.
• Micro-VMs execute with copy-on-write (CoW) semantics. All changes to memory or files are held in a throw-away cache that is discarded when the task ends, making the system inherently self-remediating.
• Finally, the granular makeup of a micro-VM facilitates per-task introspection, simplifying the identification and forensic monitoring of malware as it runs in isolation.
5.1. Related work
Operating systems designed with security as a primary goal, such as CAP4 and Multics,5 have already been built, but their inherent benefits have not been widely adopted in mainstream OSes. This is most likely as much a consequence of market expediencies (the growth of DOS to Windows, and thence NT, and the horizontal nature of the computer marketplace) as of the complexity that security normally imposes on systems management and end users. For example, Windows User Account Control (introduced in Vista) alerts the user whenever he/she opens an “untrusted” file downloaded from the web, or an attachment. Continued alerts for low-risk events have the same outcome as conventional false positives: they teach the user to disregard them. Outsourcing security-related responsibilities to the user results in lowered security, as users work around the bad user experience.
What we need is a technology for granular isolation of trust domains that can be easily deployed and controlled at scale, including on legacy systems, and that increases system security while reducing the impact on the end-user experience.
Isolation technologies abound:
• Classical OS structure achieves isolation by separating untrustworthy user processes from the system kernel, and recent studies have concentrated on improving OS design.6,7
• Sandboxes attempt to retrofit software-based isolation between user-space application processes and existing insecure operating system kernels.
• Hypervisors and virtualization have been used successfully as an isolation strategy to increase system security, for example, in the Xbox 360 console8 and in a variety of embedded systems.9
• Hypervisor-based isolation has been used to construct multilevel secure systems, using virtualization to ensure the needed separation of different run-time environments.10
• For both client- and server-class systems, multiple independent operating system instances in VMs can be mutually isolated by a hypervisor. On the client, this usually takes the form of desktop virtualization,11 where the user interacts with numerous remotely executing desktop VMs whose output is delivered to an endpoint via a remote desktop protocol.12
• Other static isolation approaches have been proposed, for example, Qubes OS,13 in which each application runs in its own VM, and Microsoft’s Drawbridge,14 which bundles a “library OS” with the application when it is created, similar to the open-source Docker project.15
We should examine these kinds of methods against our requirements: user empowerment, system security, and ease of deployment and management at scale. For example:
• On legacy systems, sandboxing is a proven strategy that is simple to deploy as an application component, but it has been shown to be ineffective against a motivated attacker. The well-designed sandboxes of contemporary OSes (including iOS and Windows 8) are significantly better. However, sandboxing an entire application is not sufficiently granular for applications (e.g., Word or a browser) that process content from various trust domains. Finally, no sandbox can protect against a kernel-level vulnerability in the OS on which it runs.
• A hypervisor delivers powerful inter-VM (inter-OS) isolation on a single device, but cannot protect code inside a VM itself (e.g., a virtual desktop) from attack. Moreover, deploying and operating a hypervisor as well as the endpoint OS image(s) and applications is onerous. It can also be impractical for end users because it degrades the user experience.
We define a trust domain based on the concept of a user-initiated task: all processing (in both user and kernel mode) for a user-initiated workflow related to a piece of application content (e.g., a document) or a remote web service constitutes a task. For example, each email attachment and each top-level domain (TLD) on the web is an independent task. We arrive at this definition through rigorous application of the principle of least privilege, which also allows us to reason about the security and privacy of the entire system under the assumption that any given task is compromised.
We seek to granularly and mutually isolate (according to least privilege) the execution of many trust domains on a single device, preserving contextual concurrency for the user, and securely permitting interdomain communication and sharing subject to privacy constraints (and in an enterprise context, protection policies).
We aim to ensure that information exposed and therefore vulnerable to theft is minimized:
• An attacker who successfully elevates his/her privileges in the OS context of a single micro-VM must be unable to access any other micro-VM or the host OS.
• The files and other configuration data (such as the Windows SAM and Registry16) available to a micro-VM are only those that it specifically needs to execute correctly. For example:
• The only files needed to render a website are its cookie and DOM storage.
• When a user is accessing a document, only the document itself is needed.
• Network services, sites, and networks available to a micro-VM are narrowed according to the privilege level of the task. For example:
• Remote sites and networks of value (e.g., corporate SaaS sites, the user’s bank, or a corporate intranet) should not be accessible from a task that only needs access to the untrusted web.
• High-value network infrastructure services such as a corporate DNS or a VPN should only be accessible to a task that requires access to them.
• No micro-VM is given access to privileged system services (clipboard, printers, devices, access to the display, or user input) without specific need, and then only under user control and subject to the additional safety controls explained below.
• A task may retain files that survive after the task is terminated, but they must be securely tagged with metadata that stores the trust domain of the task. For example:
• An isolated web site for TLD A might save a cookie, DOM storage, and a cache of its pages. These may be persisted, but only ever accessed by another isolated renderer for the same TLD A.
• The user might edit an untrusted, isolated Word document B. She/he can save changes to the document that will be stored in the file system together with metadata that records its provenance. The document can only ever be accessed again from another hardware-isolated Word instance with rights to access a document of that provenance.
• Upon termination, all execution state (both kernel- and user-mode memory, and all execution-related changes to files) is discarded, eliminating any malware.
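The persistence rule in the list above can be illustrated with a minimal Python sketch. It assumes a hypothetical host-side store (`PersistentStore`) and record type (`PersistedItem`); these names, and the use of a plain string as the trust-domain label, are illustrative assumptions and not part of any Bromium interface.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class PersistedItem:
    """A file that outlives its task, tagged with the task's trust domain."""
    name: str
    trust_domain: str  # provenance label, e.g. the site a cookie belongs to
    data: bytes


class PersistentStore:
    """Hypothetical host-side store for files that survive task termination."""

    def __init__(self) -> None:
        self._items: list[PersistedItem] = []

    def persist(self, name: str, trust_domain: str, data: bytes) -> None:
        # The provenance tag is written once, when the file is persisted.
        self._items.append(PersistedItem(name, trust_domain, data))

    def visible_to(self, trust_domain: str) -> list[PersistedItem]:
        # A new task only ever sees items whose provenance matches its own.
        return [item for item in self._items if item.trust_domain == trust_domain]
```

In this sketch a cookie saved by the task for one site is simply invisible to the task for any other site, because the lookup is keyed on the provenance tag rather than on the file name.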
5.2. A practical example
Least privilege dictates the minimum set of system resources (network, file system, desktop) that a given task needs to function correctly. In the context of the browser, for example, a task is an application context defined by the site's top-level domain. What resources does Facebook.com, for example, really need? It needs its cookie and DOM storage, and access to the untrusted web. If the browser tab for Facebook.com is compromised (e.g., it delivers a poisoned advertisement), we can tolerate loss of the cookie (which compromises user privacy, but not system security). We can live with the fact that malware will have access to the untrusted internet. The system will still be safe if malware cannot:
• see any user keystrokes or mouse input, or gain access to the screen (to copy pixels from the display, or display any content to the user),
• access any other privileged data, for example, files other than the Facebook cookie, or registry entries that might leak valuable information,
• gain access to valuable networks or sites (e.g., SaaS sites or the intranet),
• access any privileged devices (printers, webcam, the OS file system, or shares).
Least privilege dictates that the task must not have access to any other resources unless they are explicitly required, and then only under precise control, and only for the shortest possible duration. For example:
• If the user wants to upload a photo to Facebook, he/she can select the photo (in the usual way) on the desktop, and then (only) the selected file will be injected into the hardware-isolated task that is rendering the Facebook.com browser tab.
• If the user wants to download a file, it can be allowed to persist outside the confines of the isolated task, but only if we remember the fact that it is untrusted, so that it can only ever be opened in another hardware-isolated task.
5.3. Hardware-enforced task isolation
Hardware isolation of tasks is a core tenet because it offers the most robust barrier to attack. Moreover, it allows isolation of both user-mode and kernel mode execution for a user task, protecting the system from exploits that target the OS kernel directly. Specifically, although sandboxing is becoming popular in many applications, “security by design” vendors aim to bolster system-wide security by extending isolation properties to include kernel processing on behalf of the application. This is crucial because it is often easy to bypass a sandbox by compromising the kernel directly.
5.4. Hardware virtualization technology
In the early years of x86 virtualization, the device hardware was virtualized entirely in software, either by patching the binaries of guest VMs, or through a technique known as enlightenment, pioneered in Xen, and adopted in Microsoft Hyper-V.
Over the last few years Intel, AMD, and ARM have introduced hardware extensions to their CPUs and chipsets that accelerate and automate many low-level virtualization tasks and assist the hypervisor or Virtual Machine Manager (VMM) with dynamic control over hardware resources and increase the security of the hypervisor and the isolation between VMs. Hardware virtualization support today includes functions that virtualize the CPU, memory (including nested page tables), the I/O subsystem, and networking. Hardware virtualization for GPUs is in its infancy, but is expected to become more widely available as use cases for virtualized graphics become more prevalent. Peripheral interfaces, such as USB, can be easily virtualized in software.
Both Intel and AMD support device I/O virtualization and assignment (Intel VT-d, AMD IOMMU) that permits I/O devices to be safely directly assigned to guest VMs, and protects the hypervisor and other guests from device DMA into system memory. Memory used for device I/O is only visible to the guest that owns the device.
In addition, both Intel TXT and AMD SKINIT offer CPU extensions to permit secure system bootstrap and hardware-based attestation using a Trusted Platform Module (TPM) that securely stores signatures for whitelisted code (such as the hypervisor). In a measured boot, the hardware verifies that the hypervisor has not been modified, and the hypervisor can then in turn check that each guest VM is unmodified, prior to it being started. This permits IT to ensure that the system is in the intended state when booted.
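The measurement chain behind a measured boot can be sketched as follows. This is an illustrative Python model only: the function names are hypothetical, SHA-256 is an assumed hash choice, and a real TPM performs the extend operation in hardware. The sketch shows the general idea that each boot stage is hashed into a running register, so a change to any component, or to the order of components, changes the final value that is compared against the expected one.

```python
import hashlib
import hmac


def extend(register: bytes, component: bytes) -> bytes:
    """Fold the hash of one boot component into the running measurement."""
    measurement = hashlib.sha256(component).digest()
    return hashlib.sha256(register + measurement).digest()


def measure_chain(components: list) -> bytes:
    """Measure each component in boot order, starting from a zeroed register."""
    register = bytes(32)
    for component in components:
        register = extend(register, component)
    return register


def boot_is_trusted(components: list, expected: bytes) -> bool:
    """Compare the final measurement against the known-good value."""
    return hmac.compare_digest(measure_chain(components), expected)
```

Here `expected` stands in for the known-good value that would be recorded when the system is provisioned; how that value is stored and protected is outside the scope of this sketch.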
Hardware virtualization has played a crucial role in the broad adoption of virtualization. Without hardware guarantees of isolation between guest VMs and between guests and the hypervisor, it would be difficult to adopt virtual infrastructure for mission critical applications, or to comply with regulations that mandate infrastructure isolation, for example, those of the Payment Card Industry (PCI).
5.5. Micro-virtualization at work
Micro-virtualization is a second-generation virtualization technology that extends the isolation, control, and security principles of hypervisor-based virtualization into the OS and its applications. It does this by using hardware virtualization to dynamically isolate user tasks.
5.6. The microvisor
A traditional hypervisor hosts multiple independent guest VMs (each of which executes against a virtual hardware abstraction, and is an independent OS environment), whereas the Microvisor is a specialized, lightweight, late-load hypervisor that uses hardware virtualization to isolate tasks in micro-VMs. Unlike traditional VMs, micro-VMs have no virtual hardware interface; they do not boot; they cannot be paused, saved, suspended, resumed, moved, or snapshotted; they do not have an identity different from the desktop OS; and they are temporally dependent on it (they cannot survive a reboot of the host). They are simply user-mode OS tasks that run hardware-isolated from the Windows desktop: the OS schedules them for execution, and manages their performance and resource usage.
A traditional VM is “enlightened” with virtualized hardware abstractions (via device drivers for virtual hardware), whereas a micro-VM is enlightened using standard OS mechanisms at the file system and network layer. In addition, all access to the device (graphics, keyboard, mouse, and printing) is virtualized using a virtual access protocol. Each micro-VM renders into local memory, which is securely delivered to the desktop where the user experience is composited. Micro-VMs have no access to USB or other hardware devices.
The minimalist approach of micro-virtualization offers many advantages over traditional hypervisor-based and VM-centric virtualization:
• The Microvisor does not have hardware or device driver dependencies.
• It can rely on the OS for task scheduling and device and power management.
• As a result, it can easily be deployed like a typical application to any endpoint that supports hardware virtualization, and managed at scale using existing tools and skill sets.
• Finally, the task-centric nature of micro-virtualization permits an unchanged user experience.
Applications installed by the device vendor or enterprise IT run unchanged, but any application tasks that process content from untrusted domains are hardware-isolated from the privileged (and protected) host OS. A hardware-isolated task in a micro-VM will take a hardware trap (VM_EXIT) in order to request access to any privileged system service, including network access, file system read/write, copy/paste, input/output events, and all device access.
When a micro-VM is created, its only way to access these system resources is via “enlightened” service APIs, which use standard OS interface hooks to direct execution control. Whenever the isolated task attempts to access any of these resources, the enlightened service API invokes a hypercall, which in turn causes the virtualization hardware to force a CPU VM_EXIT, suspending execution and permitting the Microvisor to arbitrate access using a set of resource-access-control policies that are specific to both the task and its trust/privilege level. The Microvisor implements mandatory access control for access to any system resource (e.g., can the user print an untrusted file?) and also manages any data format changes between privilege domains (e.g., when printing a PDF, the document will be converted to a nonthreatening format such as XPS before being transferred to the host, and then to the printer). Files are exchanged between the host and a micro-VM via a simple shared folder mechanism, and all networking traffic is transferred using a hardened, efficient interdomain transport between a micro-VM and the Microvisor, called Xen v4v.
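The arbitration step described above can be sketched in a few lines of Python. Everything here is a hypothetical model, not the Microvisor's actual hypercall API: the `Resource` names, the `TaskPolicy` and `Arbiter` types, and in particular `to_safe_format`, which is only a placeholder marking where a real format conversion (such as PDF to XPS) would take place.

```python
from dataclasses import dataclass
from enum import Enum


class Resource(Enum):
    NETWORK = "network"
    FILE_WRITE = "file_write"
    CLIPBOARD = "clipboard"
    PRINT = "print"


@dataclass(frozen=True)
class TaskPolicy:
    """Resources a task may use; anything not listed is denied."""
    allowed: frozenset


def to_safe_format(payload: bytes) -> bytes:
    """Placeholder for a real converter; it only marks the payload."""
    return b"converted:" + payload


class Arbiter:
    """Hypothetical stand-in for the policy check made on each hypercall."""

    def __init__(self, policies: dict) -> None:
        self._policies = policies  # task id -> TaskPolicy

    def handle_hypercall(self, task_id: str, resource: Resource, payload: bytes) -> bytes:
        policy = self._policies.get(task_id)
        # Default deny: unknown tasks and unlisted resources are refused.
        if policy is None or resource not in policy.allowed:
            raise PermissionError(f"{task_id} may not use {resource.value}")
        # Content crossing to a privileged device is converted first.
        if resource is Resource.PRINT:
            return to_safe_format(payload)
        return payload
```

The design point the sketch tries to capture is that the check is mandatory and default-deny: the isolated task never decides for itself whether a request is allowed, and a request that matches no policy entry fails.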
5.7. Memory and CPU isolation
When a micro-VM is created, its memory map contains entries to the OS kernel, libraries, task-specific code, and state. However, when the task executes, all memory access is “Copy on Write” or CoW – any changes the task makes to memory (both user and kernel space) are to a separate, local copy stored in hardware-isolated memory, and not to the original. Notably, if the task is compromised by malware that modifies the kernel or user-space libraries, the malware will only succeed in modifying a locally cached difference against the original, and not the running host.
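A toy Python model may help make the copy-on-write behavior concrete. It is a sketch under simplifying assumptions: pages are dictionary entries rather than hardware page-table mappings, and the class name `CowMemory` is hypothetical. Reads fall through to the shared golden image until the task writes a page, after which the task sees only its private copy.

```python
from types import MappingProxyType


class CowMemory:
    """Per-task view of memory: shared golden pages plus private copies."""

    def __init__(self, golden: dict) -> None:
        self._golden = MappingProxyType(golden)  # read-only view of the original
        self._private = {}                       # pages this task has modified

    def read(self, page: int) -> bytes:
        if page in self._private:
            return self._private[page]
        return self._golden[page]

    def write(self, page: int, data: bytes) -> None:
        if page not in self._golden:
            raise KeyError(page)
        # The write lands in the private copy; the golden page is untouched.
        self._private[page] = data

    def dirty_pages(self) -> list:
        return sorted(self._private)

    def discard(self) -> None:
        """Drop all private copies, as happens when the task terminates."""
        self._private.clear()
```

The set returned by `dirty_pages` also hints at why per-task introspection is simple in this design: the difference between a task's state and the golden image is exactly the set of pages it has written.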
5.8. Virtualized file system (VFS)
Each micro-VM is presented with a virtualized file system (VFS) abstraction that provides a view of a golden OS installation and OS configuration state, and a dramatically reduced user file system that contains only the files needed for correct execution of the task. These are determined through application of the “principle of least privilege.” Files that need to persist beyond the lifetime of a single task are tagged with metadata that preserves their provenance and untrusted nature. Untrusted files can only be accessed by a micro-VM with the appropriate privilege. For example:
• For a browser renderer (e.g., Internet Explorer, Chrome, or Firefox), the files required include the cookie for the relevant TLD, its DOM storage, and browser cache. These files are also the only files that a browser task can modify and persist.
• For a Multi-Purpose Internet Mail Extensions (MIME)-type handler for a particular untrusted file type, the only file required is the untrusted file itself. This can be persisted beyond the lifetime of the application (e.g., across editing sessions) as an untrusted file.
The virtualized file system implements CoW semantics for any modifications to files, with CoW differences saved at the block level in hardware-isolated memory, for performance and security reasons. So, if malware modifies a file, the Microvisor stores in-memory cached differences between the file and the original (a logical copy) that efficiently records only block deltas against the original. The actual file in the host file system is unchanged.
If a micro-VM needs to save a file (e.g., the user downloads a file in the browser and wishes to save it), the file is securely passed from the micro-VM via the VFS to a user-selected location on the host OS, with metadata tags indicating lack of trust. The user experience is unchanged. Similarly, any file that the user wishes to inject into a micro-VM (e.g., attach a file to a web-mail) is passed via the VFS. Rich policies can be applied to file export from a micro-VM or import into a micro-VM to ensure data-loss prevention and to otherwise control user workflow.
When the micro-VM exits (the user closes the application, the task terminates or the Microvisor terminates the task), the Microvisor discards the task’s memory image and uses a persistence policy to determine what modified, task-relevant file-system state is persisted. For example, a browser renderer is permitted to modify and save cookie state, DOM storage, and its own cache.
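The two mechanisms described in this section, block-level CoW deltas and a persistence policy applied at exit, can be sketched together in Python. The names `CowFile` and `files_to_persist`, the tiny block size, and the representation of the policy as a set of paths are all assumptions made for illustration; the source does not specify these details.

```python
BLOCK_SIZE = 4  # unrealistically small, to keep the example readable


class CowFile:
    """A file view that records block deltas instead of changing the original."""

    def __init__(self, original: bytes) -> None:
        if len(original) % BLOCK_SIZE:
            raise ValueError("original length must be a multiple of BLOCK_SIZE")
        self.original = original
        self._deltas = {}  # block index -> replacement block

    def write_block(self, index: int, data: bytes) -> None:
        if len(data) != BLOCK_SIZE:
            raise ValueError("data must be exactly one block")
        if not 0 <= index < len(self.original) // BLOCK_SIZE:
            raise IndexError("block index out of range")
        self._deltas[index] = data

    @property
    def modified(self) -> bool:
        return bool(self._deltas)

    def read(self) -> bytes:
        """Return the task's view: the original with the deltas applied."""
        blocks = [self.original[i:i + BLOCK_SIZE]
                  for i in range(0, len(self.original), BLOCK_SIZE)]
        for index, data in self._deltas.items():
            blocks[index] = data
        return b"".join(blocks)


def files_to_persist(files: dict, persistable: set) -> dict:
    """At task exit, keep only modified files that the policy allows to persist."""
    return {path: f.read() for path, f in files.items()
            if path in persistable and f.modified}
```

In this model a modification to a file outside the persistable set is never written back: its deltas are simply dropped along with the rest of the task's state.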
5.9. Virtualized IP networking – the mobile SDN
The Microvisor implements the virtualized network service (VNS) stack in user mode on the device host OS. If the task in a micro-VM attempts to use the network, a CPU-enforced VM_EXIT hands control to the Microvisor, which enforces security constraints before delegating processing to the VNS. The VNS virtualizes and controls access to all IP network services on a per-micro-VM basis. Each micro-VM is assigned an anonymous IP address and its connections are NAT-ted. IP services, including DNS, are under the control of the Microvisor, which can enforce the use of encryption (SSL or host-based VPN) where necessary (e.g., for access to high-value sites), block IP services that are not permitted, and manage authentication and access control on behalf of each task, including single sign-on. Finally, the VNS manages security functions typically found on an enterprise network, including the proxy, firewall, and traffic introspection and logging.
The Microvisor enforces granular isolation and privacy on a per-TLD basis. This offers each application (e.g., browser tab, or document) a defensible microperimeter by enforcing least privilege for access to all networks or sites. The mobile SDN hardware isolates, individually virtualizes, and controls all network services independently for each micro-VM, permitting granular policy controls on a per site basis. To understand the power of this capability we provide a detailed example as follows:
The user attaches their PC to the enterprise LAN and visits Facebook.com. The browser tab for Facebook.com will be invisibly isolated in a micro-VM, and by the rules of least privilege it will be granted access to only a single local file – the cookie for Facebook.com. What of its network services? Least privilege demands that the task for Facebook.com should:
• never be allowed to find or query the enterprise DNS, or access any intranet sites,
• never be allowed to resolve or access any high-value enterprise cloud/SaaS sites, such as Salesforce.com or AWS.Amazon.com,
• never be able to resolve or access the user’s high-value sites (if configured) – such as their bank,
• never be able to find or communicate with any other application or micro-VM on the device, any devices on the LAN (including printers), or any other enterprise application or infrastructure service or components – including the proxy, routers, switches or security appliances, networked file shares, and so forth.
• When attached to the LAN, the host manages all authentications, including NTLM authentication to proxies and shares, so that untrusted micro-VMs never have access to credentials.
These networking controls implement least privilege and give the browser tab for Facebook.com the same privileges as an application running in the DMZ. If malware compromises the micro-VM, it can only access the untrusted internet.
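The per-task network rules in the example above could be modeled along the following lines. This Python sketch is an assumption-laden illustration: the `NetworkPolicy` type, the `intranet.example` suffix, and the choice to express the rules as blocked name suffixes and blocked address ranges are all hypothetical, and a real implementation would sit in the network path rather than be called by the task.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class NetworkPolicy:
    """Names and address ranges that an untrusted task must never reach."""
    blocked_suffixes: tuple  # e.g. intranet or high-value site names
    blocked_networks: tuple  # ipaddress network objects, e.g. LAN ranges

    def may_resolve(self, hostname: str) -> bool:
        host = hostname.lower().rstrip(".")
        return not any(host == suffix or host.endswith("." + suffix)
                       for suffix in self.blocked_suffixes)

    def may_connect(self, ip: str) -> bool:
        address = ipaddress.ip_address(ip)
        return not any(address in network for network in self.blocked_networks)


# Hypothetical policy for a task that only needs the untrusted web.
UNTRUSTED_WEB = NetworkPolicy(
    blocked_suffixes=("intranet.example", "salesforce.com"),
    blocked_networks=(
        ipaddress.ip_network("10.0.0.0/8"),
        ipaddress.ip_network("172.16.0.0/12"),
        ipaddress.ip_network("192.168.0.0/16"),
    ),
)
```

With a policy like this, a compromised task for an untrusted site can neither look up intranet names nor open connections to private LAN addresses, which is the "defensible microperimeter" idea expressed as two simple checks.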
5.10. Virtualized desktop services
Micro-VMs access the user desktop via a virtual desktop service (VDS). The VDS provides an interface that does the following:
• Enables a micro-VM to deliver a 2-D display frame buffer to the host for compositing into the desktop user interface, and to deliver audio for delivery to host-controlled speakers. With increasing hardware capabilities, including virtualization-safe GPUs, it will shortly be possible to permit the micro-VM to directly render into its own hardware-isolated virtual GPU and portion of the frame-buffer, considerably accelerating graphics support and smoothly permitting a transition to 4K displays.
• Enforces printer redirection to allow the user to print an untrusted document, and
• Offers a virtual clipboard with modified semantics that prevent programmatic access – the user is required to interact with the system to complete any copy/paste action.
Whenever any information is passed to the VDS from a micro-VM, it is always flattened to ensure that the content does not contain latent executable code that could compromise the VDS or the host operating system.
In addition, the VDS manages user interaction with application menus (e.g., clicking “File/Save As” in a Word document that is rendered in a micro-VM invokes a workflow that manages the export of an untrusted file from the micro-VM to the host). The system therefore requires an understanding of the application menu structure for all applications that are expected to run in a micro-VM. This is achieved through an XML annotation for each such application. Bromium Labs envisages such annotations becoming standard in application virtualization environments in the near future.17
5.11. Creation and management of micro-VMs
In a traditional hypervisor, virtualization-management tools are used to manage the lifecycle of VM instances. By contrast, because micro-VMs are application tasks run by the OS in response to user-initiated workflows, the lifecycle and resource management for micro-VMs needs to be fully integrated into the user experience of the device OS. This is a key requirement, since it permits us to use virtualization to deliver enhanced security and resilience without modifying the end-user experience.
Users should never be aware of the technology. Fortunately, integrating control into today’s OSes is straightforward, for example, utilizing standard MIME-type handling interfaces. Each micro-VM is small when created (a few tens of MB), and grows over time on the basis of CoW differences between its state and the state of the golden host operating system – in both kernel and user mode. Most micro-VMs are short lived, but the system must remain fully functional even in the presence of long-running tasks that become large. To ensure consistent system performance under load, the Microvisor relies on the scheduler of the host OS, but occasionally enforces its own resource optimizations, optionally forcing micro-VMs to swap to disk (if idle) and pinning commonly accessed pages used by multiple micro-VMs in memory.
5.12. Reducing the attack surface
Many use cases of micro-virtualization are security or trust related. It is therefore important to understand the vulnerability of the Microvisor, since compromise of the Microvisor would make it possible for an attacker to attempt to compromise Windows.
The Microvisor attack surface is narrow. Any access to system services outside the micro-VM (such as the file system or network services) occurs via enlightened service APIs. The enlightened services are simply DLL calls triggering a CPU VM_EXIT that allows the Microvisor to enforce access policies for the task. The Microvisor does not trust calls to this API (which is called the hypercall API), and the interface is resilient to attack and checkable by third parties. The Microvisor implements the hypercall API in about 10,000 lines of hardened code.
In summary, the Microvisor implements a Least Privilege Separation Kernel18 between untrusted tasks and the desktop OS. It is the only Separation Kernel that takes advantage of the tiny code base of a specialized hypervisor to dynamically apply Least Privilege at a granular level between tasks within a single running OS instance. Moreover, it is the first general purpose Separation Kernel that can protect existing, widely deployed OSes and their applications, and that can be deployed and managed using today’s management tools (Microsoft System Center, Active Directory, or a security management console).
1 The Linux Foundation, “Hypercall”
2 The Linux Foundation, “The Xen Project”
3 I. Pratt, “Micro-Xen,” Bromium Inc, September 2012
4 Wikipedia, “CAP computer”
5 T. Van Vleck, “Multics”
6 Gordon College, Computer Science Department, “Operating System Organization (CPS312),” 2014.
7 J. N. Herder, H. Bos, B. Gras, P. Homburg and A. S. Tanenbaum, “Isolating Operating System Extensions in User-mode Processes,” Computer Science Dept., Vrije Universiteit, Amsterdam, The Netherlands, 2008.
8 J. Lees, “The hypervisor and its implications,” Joystiq
9 D. Kleidermacher and M. Kleidermacher, “Embedded Systems Security - Part 3: Hypervisors and system virtualization,” February 2013.
10 Intel Corporation, “SecureView Delivers More Security, Performance, and Savings”
11 Wikipedia, “Desktop virtualization”
12 Microsoft Corporation, “Microsoft Remote Desktop Services (RDS) Explained”
13 The Invisible Things Lab, “Qubes OS”
14 Microsoft Corporation, “Drawbridge”
15 Docker, Inc., “What is Docker?”
16 Wikipedia, “Security Accounts Manager”
17 Wikipedia, “Microsoft Application Virtualization”
18 T. E. Levin, C. E. Irvine and T. D. Nguyen, “Least Privilege in Separation Kernels,” in Security and Cryptography - SECRYPT, 2006.