
Thinking Outside The Box [dusted off draft from 2017]

16 December 2025 at 10:00

Preface

Hello from the future!

This is a blogpost I originally drafted in early 2017. I wrote what I intended to be the first half of this post (about escaping from the VM to the VirtualBox host userspace process with CVE-2017-3558), but I never got around to writing the second half (going from the VirtualBox host userspace process to the host kernel), and eventually sorta forgot about this old post draft… But it seems a bit sad to just leave this old draft rotting around forever, so I decided to put it in our blogpost queue now, 8 years after I originally drafted it. I’ve very lightly edited it now (added some links, fixed some grammar), but it’s still almost as I drafted it back then.

When you read this post, keep in mind that unless otherwise noted, it is describing the situation as of 2017. Though a lot of the described code seems to not have changed much since then…

Introduction

VM software typically offers multiple networking modes, including a NAT mode that causes traffic from the VM to appear as normal traffic from the host system. Both QEMU and VirtualBox use forks of Slirp for this. Slirp is described as follows on its homepage:

Slirp emulates a PPP or SLIP connection over a normal terminal. This is an actual PPP or SLIP link, firewalled for people’s protection. It makes a quick way to connect your Palm Pilot over the Internet via your Unix or Linux box!!! You don’t need to mess around with your /etc/inetd.conf or your /etc/ppp/options on your system.

Slirp is a useful basis for VM networking because it can parse raw IP packets (coming from the emulated network adapter) and forward their contents to the network using the host operating system’s normal, unprivileged networking APIs. Therefore, Slirp can run in the host’s userspace and doesn’t need any special kernel support.

Neither QEMU nor VirtualBox uses the upstream Slirp code directly; instead, both use patched versions in which, for example, the feature for setting up port forwards by talking to a magic IP address has been removed. In VirtualBox especially, the Slirp code has been altered a lot.

This post describes an issue in VirtualBox and how it can be exploited. Some parts are specific to the host operating system; in those cases, this post focuses on the situation on Linux.

The packet heap in VirtualBox

The VirtualBox version of Slirp uses a custom zone allocator for storing packet data, in particular, incoming ethernet frames. Each NAT network interface has its own zone (zone_clust) with nmbclusters=1024+32*64=3072 chunks of size MCLBYTES=2048. The initial freelist of each zone starts at the high-address end of the zone and linearly progresses towards the low-address end.

The heap uses inline metadata; each chunk is prefixed with the following structure:

struct item {
    uint32_t magic; // (always 0xdead0001)
    uma_zone_t zone; // (pointer to the zone; uma_zone_t is struct uma_zone *)
    uint32_t ref_count;
    struct {
        struct type *le_next; // (next element)
        struct type **le_prev; // (address of previous le_next)
    } list; // (entry in the freelist or in used_items, the list of used heap chunks)
};

Chunks are freed through the call chain m_freem -> m_free -> mb_free_ext -> uma_zfree -> uma_zfree_arg -> slirp_uma_free. The uma_zfree_arg() function takes pointers to the real zone structure and to the chunk data as arguments and checks some assertions before calling slirp_uma_free() as zone->pfFree():

void uma_zfree_arg(uma_zone_t zone, void *mem, void *flags) {
    struct item *it;
    [...]
    it = &((struct item *)mem)[-1];
    Assert((it->magic == ITEM_MAGIC));
    Assert((zone->magic == ZONE_MAGIC && zone == it->zone));

    zone->pfFree(mem,  0, 0); // (zone->pfFree is slirp_uma_free)
    [...]
}

Unfortunately, Assert() is #define’d to do nothing in release builds - only “strict” builds check for the condition. The builds that are offered on the VirtualBox download page are normal, non-strict release builds.

Next, slirp_uma_free() is executed:

static void slirp_uma_free(void *item, int size, uint8_t flags) {
    struct item *it;
    uma_zone_t zone;
    [...]
    it = &((struct item *)item)[-1];
    [...]
    zone = it->zone;
    [...]
    LIST_REMOVE(it, list);
    if (zone->pfFini)
    {
        zone->pfFini(zone->pData, item, (int /*sigh*/)zone->size);
    }
    if (zone->pfDtor)
    {
        zone->pfDtor(zone->pData, item, (int /*sigh*/)zone->size, NULL);
    }
    LIST_INSERT_HEAD(&zone->free_items, it, list);
}

slirp_uma_free() grabs the zone pointer from the chunk header. Because Assert() is compiled out, there is no validation to ensure that this zone pointer points to the actual zone - an attacker who can overwrite the chunk header could cause this method to use an arbitrary zone pointer. Then, the member pfFini of the zone is executed, which, for an attacker who can point it->zone to controlled data, means that an arbitrary method call like this can be executed:

{controlled pointer}({controlled pointer}, {pointer to packet data}, {controlled u32});

Because the VirtualBox binary, at least on Linux, is not relocatable and has memcpy() in its PLT, this can be used as a write primitive by using the static address of the PLT entry for memcpy() as the function address:

memcpy(dest={controlled pointer}, src={packet data}, n={controlled u32})

This means that, even though the packet heap doesn’t contain much interesting data, a heap memory corruption that affects chunk headers could still be used to compromise the VirtualBox process rather easily.

The Vulnerability

In changeset 23155, the following code was added at the top of ip_input(), the function that handles incoming IP packets coming from the VM, before any validation has been performed on the IP headers. Here, m points to the buffer structure containing the packet data pointer and the actual length of the packet data; ip points to the IP header inside the untrusted packet data; RT_N2H_U16() performs an endianness conversion.

if (m->m_len != RT_N2H_U16(ip->ip_len))
    m->m_len = RT_N2H_U16(ip->ip_len);

This overwrites the trusted buffer length with the contents of the untrusted length field from the IP packet. This is particularly bad because all safety checks assume that m->m_len is correct - these two added lines basically make all following length checks useless.

Later, in changeset 59063, the following comment was added on top of those lines:

/*
* XXX: TODO: this is most likely a leftover spooky action at
* a distance from alias_dns.c host resolver code and can be
* g/c'ed.
*/
if (m->m_len != RT_N2H_U16(ip->ip_len))
    m->m_len = RT_N2H_U16(ip->ip_len);

One straightforward way to abuse this issue is to send a small ICMP_ECHO packet with a large ip_len to the address 10.0.2.3, causing Slirp to send back a larger ICMP_ECHOREPLY with out-of-bounds heap data. However, Slirp validates the correctness of the ICMP checksum, meaning that the attacker has to guess the 16-bit checksum of the out-of-bounds heap data that the attacker is trying to leak. While it is possible to brute-force this checksum, doing so is inelegant.
An easier way to leak heap data is to use UDP with the help of a helper machine on the other side of the NAT, e.g. on the internet. UDP has a 16-bit checksum over packet data as well, but unlike ICMP, UDP treats the checksum value 0 as “don’t check the checksum”. Therefore, by sending a UDP packet with checksum 0 and a bogus length in the IP header, it is possible to reliably leak out-of-bounds heap data. Since ip_len can be bigger than the chunk size, this also permits leaking the headers (and contents) of following chunks, disclosing information about the heap state, the heap location and the location of the struct uma_zone.

The next step is to somehow use the bug to corrupt chunk headers. Most of the code only reads from incoming packets; however, when a packet with IP options arrives in udp_input() or tcp_input(), the IP payload (meaning the TCP or UDP packet header and everything following it) is moved over the IP options using ip_stripoptions():

void ip_stripoptions(struct mbuf *m, [...])
{
    register int i;
    struct ip *ip = mtod(m, struct ip *);
    register caddr_t opts;
    int olen;
    NOREF(mopt); /** @todo do we really will need this options buffer? */

    olen = (ip->ip_hl<<2) - sizeof(struct ip);
    opts = (caddr_t)(ip + 1);
    i = m->m_len - (sizeof(struct ip) + olen);
    memcpy(opts, opts  + olen, (unsigned)i);
    m->m_len -= olen;

    ip->ip_hl = sizeof(struct ip) >> 2;
}

This means that, by sending a TCP or UDP packet with IP options and a bogus length that is bigger than a heap chunk, it is possible to move the packet payload of the following heap chunk over the corresponding heap chunk header.

Exploitation: Going up to host userspace

In this part of the post, I’m going to show how it’s possible to break out of the VM and run arbitrary shell commands on the host system using system().

Assuming that a sufficiently big portion of the packet heap is unused, the behavior of the allocator can be simplified by allocating all fragmented heap memory, leaving only a pristine freelist that linearly allocates downwards (as shown at the top of the post). Heap chunks can be allocated by sending IP packets with the “more fragments” bit set; such IP packets have to be stored in memory until either the remaining fragments have been received or the maximum number of pending fragments is reached. An attack that is optimized for maximum reliability would probably go a more complex route and use an approach that still works with an arbitrarily fragmented heap.

The first step is to place the command that should be given to system() in memory and determine at which address it was placed. To do this, assuming that the freelist grows downwards linearly, the attacker can first send an IP fragment containing the shell command (causing the IP fragment to be stored), then send a crafted UDP packet to leak data:

(Note: le_prev and le_next are now pointers on the list of used heap chunks (used_items), not the freelist, and therefore the le_next pointer points upwards.)

While the leaked data does not contain a pointer to the chunk containing the shell command, it contains pointers to the adjacent chunk headers, which can be used to calculate the address of the shell command.

The next big step is to figure out the address of system(). Because there is no PLT entry for system(), there is no fixed address the attacker can call to invoke the function. However, using the contents of the global offset table, an attacker can first compute the offsets between libc symbols and use them to identify the libc version, then use a GOT entry and the known offset of system() relative to the address that GOT entry points to in that libc version to compute the address of system(). Unfortunately, there seems to be no nice way to directly read from the GOT using the bug, so this has to be done in a somewhat ugly way.

It is possible to use the bug as a write primitive by calling memcpy() as described in the section “The packet heap in VirtualBox”. In general, functions can be called using the bug as follows:

First, the attacker places a fake struct uma_zone (zone header) in memory and determines the address of the fake struct uma_zone, just like the shell command was placed in memory. Next, the attacker sends a packet containing a fake struct item (chunk header) and moves it over the real chunk header using an adjacent UDP packet with a bogus length field and with IP options:


The result is a chunk with an attacker-controlled header that points to the fake struct uma_zone:


Next, this chunk can be freed by sending a corresponding second IP fragment, causing the member pfFini of the fake uma_zone to be called with arguments zone->pData (attacker-controlled), item (the data directly behind the fake chunk header) and zone->size (again attacker-controlled).

In the case of memcpy(), one issue here is that the fake IP header must be valid; otherwise, the packet might not be recognized during fragment reassembly. Therefore, only the space that would normally be occupied by the ethernet header (14 bytes long) can be used to store the payload; to write larger payloads, multiple function calls must be made.

At this point, using the write primitive, it is possible to leak the GOT contents by overwriting memory as follows (red parts are modified):

First, a fake heap chunk header is placed at the start of the GOT, which is writable and at a fixed address. Because only library code is executed after the VirtualBox process has started, the corruption of the start of the GOT is not a problem. The le_next pointer of the fake chunk header points to a legitimate chunk that is currently in a pristine area of the original freelist. Now, the attacker can overwrite the freelist head pointer free_items.lh_first in the zone header, causing the fake chunk in the GOT to be returned by a legitimate future allocation.

At this point, the attacker can send another UDP packet with a bogus length field in the IP header. This UDP packet will be placed at the start of the GOT, and out-of-bounds data behind the packet will leak - in other words, the remaining normal GOT entries.

Finally, the attacker can determine the location of system() and call it with a fully controlled argument.

Conclusion [from the future]

As I noted in the introduction, none of the relevant code seems to have changed much since I found this bug in 2017 - I think if you found a similar bug in the VirtualBox networking code today, it would likely still be exploitable in a similar way.

VirtualBox uses a separate memory region for packet memory allocations - that’s probably intended as a performance optimization. As a side effect, this implementation choice should also make it harder to exploit packet memory UAF bugs, since packets contain no pointers, kind of like PartitionAlloc or kalloc_type. However, it might still be possible to exploit a packet memory UAF as a TOCTOU by making use of an already-validated length value or such.
This could have also made it harder to exploit packet memory linear OOB write bugs - but the choice of using inline metadata, and not protecting against corruption of this metadata at all, makes OOB write bugs in this allocator region highly exploitable.

Pointer leaks through pointer-keyed data structures

26 September 2025 at 09:00

Introduction

Some time in 2024, during a Project Zero team discussion, we were talking about how remote ASLR leaks would be helpful or necessary for exploiting some types of memory corruption bugs, specifically in the context of Apple devices. Coming from the angle of “where would be a good first place to look for a remote ASLR leak”, this led to the discovery of a trick that could potentially be used to leak a pointer remotely, without any memory safety violations or timing attacks, in scenarios where an attack surface can be reached that deserializes attacker-provided data, re-serializes the resulting objects, and sends the re-serialized data back to the attacker.

The team brainstormed, and we couldn’t immediately come up with any specific attack surface on macOS/iOS that would behave this way, though we did not perform extensive analysis to test whether such attack surface exists. Instead of targeting a real attack surface, I tested the technique described here on macOS with an artificial test case that uses NSKeyedArchiver serialization as the target. Because of the lack of demonstrated real-world impact, I reported the issue to Apple without filing it in our bugtracker. It was fixed in the 31 Mar 2025 security releases. Links to Apple code in this post go to an outdated version of the code that hasn’t been updated in years, and descriptions of how the code works refer to the old unfixed version.

I decided to write about the technique since it is kind of intriguing and novel, and some of the ideas in it might generalize to other contexts. It is closely related to a partial pointer leak and another pointer ordering leak that I discovered in the past, and shows how pointer-keyed data structures can be used to leak addresses under ideal circumstances.

Background - the tech tree

hashDoS

To me, the story of this issue begins in 2011, when the hashDoS attack was presented at 28C3 (slides, recording). In essence, hashDoS is a denial-of-service attack on services (in particular web servers) that populate hash tables with lots of attacker-controlled keys (like POST parameters). It is based on the observation that many hash table implementations have O(1) complexity per insert/lookup operation in the average case, but O(n) complexity for the same operations in the worst case (where the hashes of all keys land in the same hash bucket, and the hash table essentially turns into something like a linked list or an unsorted array, depending on how it is implemented). In particular, if the hash function used for keys is known to the attacker, then by constructing a request full of parameters whose keys all map to the same hash bucket, an attacker can cause the server to spend O(n²) time processing such a request; this turned out to be enough to keep a web server’s CPU saturated using ridiculously small amounts of network traffic.

There is also much older prior work on the idea of deliberately creating hash table collisions to leak addresses, as pointed out in a 29C3 talk about the same topic. Solar Designer wrote in Phrack issue 53 back in 1998:

---[ Data Structures and Algorithm Choice

When choosing a sorting or data lookup algorithm to be used for a normal application, people are usually optimizing the typical case. However, for IDS [intrusion detection systems] the worst case scenario should always be considered: an attacker can supply our IDS with whatever data she likes. If the IDS is fail-open, she would then be able to bypass it, and if it’s fail-close, she could cause a DoS for the entire protected system.

Let me illustrate this by an example. In scanlogd, I’m using a hash table to lookup source addresses. This works very well for the typical case as long as the hash table is large enough (since the number of addresses we keep is limited anyway). The average lookup time is better than that of a binary search. However, an attacker can choose her addresses (most likely spoofed) to cause hash collisions, effectively replacing the hash table lookup with a linear search. Depending on how many entries we keep, this might make scanlogd not be able to pick new packets up in time. This will also always take more CPU time from other processes in a host-based IDS like scanlogd.

[…]

It is probably worth mentioning that similar issues also apply to things like operating system kernels. For example, hash tables are widely used there for looking up active connections, listening ports, etc. There’re usually other limits which make these not really dangerous though, but more research might be needed.

hashDoS as a timing attack

From a slightly different perspective, the central observation of hashDoS is: If an attacker can insert a large number of chosen keys into a hash table (or hash set) and knows which hash buckets these keys hash to, then the attacker can (depending on hash table implementation details) essentially slow down future accesses to a chosen hash bucket.

This becomes interesting if the attacker can cause the insertion of other keys whose hashes are secret into the same hash table. In practice, this can for example happen with hash tables which support mixing multiple key types together, like JavaScript’s Map. Back in 2016, in the Firefox implementation, int32 numbers were hashed with a fixed hash function ScrambleHashCode(number), while strings were atomized/interned and then hashed based on their virtual address. That made it possible to first fill an attacker-chosen hash table bucket with lots of elements, then insert a string, observe whether its insertion is fast or slow, and determine from that whether the string’s hash matches the attacker-chosen hash bucket.

With some tricks relying on a pattern in the addresses of interned single-character strings in Firefox, that made it possible to leak the lower 32 bits of a heap address through Map insertions and timing measurements. For more details, see the original writeup and bug report. Of course, nowadays that kind of timing-based in-process partial pointer leak from JavaScript would be considered less interesting, since it is generally assumed that JavaScript can read all memory in the same process anyway.

A takeaway from this is: When pointers are used as the basis for object hash codes, this can leak pointers through side channels in keyed data structures.

Linux: object ordering leak through in-order listing of a pointer-keyed tree

As I noted in a blog post a few years ago, on Linux, it is possible for unprivileged userspace to discover in what order struct file instances are stored in kernel virtual memory by reading from /proc/self/fdinfo/<epoll fd> - this file lists all files that are watched by an epoll instance by iterating through a red-black tree that is (essentially) sorted by the virtual address of the referenced struct file, so the data given to userspace is sorted in the same way.

(As I noted in that post, this could be particularly interesting for breaking probabilistic memory safety mitigations that rely on pointer tagging. If the highest bits of pointers are secret tag bits, and an attacker can determine the order of the addresses (including tag bits) of objects, the attacker can infer whether an object’s tag changed after reallocation.)

A takeaway from this is: Keyed data structures don’t just leak information about object hash codes through timing; iterating over a keyed data structure can also generate data whose ordering reveals information about object hash codes.

Serialization attacks

There are various approaches to serializing an object graph. On one side of the spectrum is schema-based serialization, where ideally:

  • serializable types with their members are declared separately from other types
  • fields explicitly declare which other types they can point to (there are no generic pointers that can point to anything)
  • deserialization starts from a specific starting type

On the other side of the serialization spectrum are things like classic Java serialization (without serialization filters), where essentially any class marked as Serializable can be deserialized, serialized fields can often flexibly point to lots of different types, and therefore serialized data can also have a lot of control over the shape of the resulting object graph. There is a lot of public research on the topic of “serialization gadget chains” in Java, where objects can be combined such that deserializing them results in things like remote code execution. This type of serialization is generally considered to be unsafe for use across security boundaries, though Android exposes it across local security boundaries.

Somewhere in the middle of this spectrum is serialization that is fundamentally built like unsafe deserialization, but adds some coarse filters that only allow deserialized objects to have types from an allowlist to make it safe. In Java, that is called “serialization filtering”. This is also approximately the behavior of Apple’s NSKeyedUnarchiver.unarchivedObjectOfClasses, which this post focuses on.

An artificial test case

The goal of the technique described in this post is to leak a pointer to the “shared cache” (a large mapping which is at the same virtual address across all processes on the system, whose address only changes on reboot) through a single execution of the following test case, which uses NSKeyedUnarchiver.unarchivedObjectOfClasses to deserialize an attacker-supplied object graph consisting of the types NSDictionary, NSNumber, NSArray and NSNull, re-serializes the result, and writes back the resulting serialized data:

@import Foundation;
int main() {
  @autoreleasepool {
    NSArray *args = [[NSProcessInfo processInfo] arguments];
    if (args.count != 3) {
      NSLog(@"bad invocation");
      return 1;
    }
    NSString *in_path = args[1];
    NSString *out_path = args[2];

    NSError *error = NULL;

    NSData *input_binary = [NSData dataWithContentsOfFile:in_path];

    /* decode */
    NSArray<Class> *allowed_classes = @[ [NSDictionary class], [NSNumber class], [NSArray class], [NSString class], [NSNull class] ];
    NSObject *decoded_data = [NSKeyedUnarchiver unarchivedObjectOfClasses:[NSSet setWithArray:allowed_classes] fromData:input_binary error:&error];
    if (error) {
      NSLog(@"Error %@ decoding", error);
      return 1;
    }
    NSLog(@"decoded");

    NSData *encoded_binary = [NSKeyedArchiver archivedDataWithRootObject:decoded_data requiringSecureCoding:true error:&error];
    if (error) {
      NSLog(@"Error %@ encoding", error);
      return 1;
    }
    NSLog(@"reencoded");

    [encoded_binary writeToFile:out_path atomically:NO];
  }
  return 0;
}

(The test case also allows NSString but I think that was irrelevant.)

Building blocks

The NSNull / CFNull singleton

The CFNull type is special: There is only one singleton instance of it, kCFNull, implemented in CFBase.c, which is stored in the shared cache. When you deserialize an NSNull object, this doesn’t actually create a new object - instead, the singleton is used.

In the CFRuntimeClass for CFNull, __CFNullClass, no hash handler is provided. When CFHash is called on an object with a type like __CFNullClass that does not implement a ->hash handler, the address of the object is used as the hash code.

Pointer-based hashing is not specific to NSNull, but there probably aren’t many other types for which deserialization uses singletons in the shared cache. There are probably many more types for which instances’ hashes are heap addresses.

NSNumber

The NSNumber type encapsulates a number and supports several types of numbers; its hash handler __CFNumberHash hashes 32-bit integers with _CFHashInt, which pretty much just performs a multiplication with some big prime number.

NSDictionary

Instances of the NSDictionary type are immutable hash tables and can contain arbitrarily-typed keys. Key hashes are mapped to hash table buckets using a simple modulo operation: hash_code % num_buckets. The number of hash buckets in an NSDictionary is always a prime number (see __CFBasicHashTableSizes); hash table sizes are chosen based on __CFBasicHashTableCapacities such that hash tables are normally roughly half full (around 38%-62%), though the sizing is a bit different for small sizes. These are probing-style hash tables: rather than hanging a linked list off each hash bucket, collisions are handled by finding alternate buckets to store colliding elements in, using the policy __kCFBasicHashLinearHashingValue / FIND_BUCKET_HASH_STYLE == 1, under which insertion scans forward through the hash table buckets.

I haven’t found source code for serialization of NSDictionary, but it appears to happen in the obvious way, by iterating through the hash buckets in order.

The attack

The basic idea: Infoleak through key ordering in serialized NSDictionary

If a targeted process fills an NSDictionary with attacker-chosen NSNumber keys (through deserialization), the attacker can control which hash buckets will be used by using numbers for which the number’s hash modulo the hash table size results in the desired bucket index. If the targeted process then inserts an NSNull key (still as part of the same deserialization), and then serializes the resulting NSDictionary, the location of the NSNull key in the dictionary’s serialized keys will reveal information about the hash of NSNull.

In particular, the attacker can create a pattern like this using NSNumber keys (where # is a bucket occupied by an NSNumber, and _ is a bucket left empty), where even-numbered buckets are occupied and odd-numbered buckets are empty, here with the example of a hash table of size 7:

bucket index:    0123456
bucket contents: #_#_#_#

This leaves three spots where the NSNull could be inserted (marked with !):

  • At index 1 (#!#_#_#). This happens if hash_code % num_buckets is 6, 0, or 1. (For 6 and 0, insertion would scan linearly through the buckets until finding the free bucket at index 1.) This would result in NSNull being second in the serialized data.
  • At index 3 (#_#!#_#). This happens if hash_code % num_buckets is 2 or 3. This would result in NSNull being third in the serialized data.
  • At index 5 (#_#_#!#). This happens if hash_code % num_buckets is 4 or 5. This would result in NSNull being fourth in the serialized data.

If the serialized data is then sent back to the attacker, the attacker can distinguish between these three states (based on the index of the NSNull key in the serialized data), and learn in which range hash_code % num_buckets is.

Extending it: Leaking the entire bucket index

If the attack from the last section is repeated with the following pattern (occupying odd-numbered buckets and leaving even-numbered ones empty), this yields more information about hash_code % num_buckets:

0123456
_#_#_#_

(Caveat: Don’t think too hard about how a hash table with 3 elements would use only 3 buckets and therefore wouldn’t look like this. The actual reproducer uses hash tables with >=23 buckets.)

Now we have four spots where the NSNull could be inserted:

  • At index 0, if hash_code % num_buckets is 0.
  • At index 2, if hash_code % num_buckets is 1 or 2.
  • At index 4, if hash_code % num_buckets is 3 or 4.
  • At index 6, if hash_code % num_buckets is 5 or 6.

By combining the information from an NSDictionary that uses the even-buckets-occupied pattern and an NSDictionary that uses the odd-buckets-occupied pattern, the exact value of hash_code % num_buckets can be determined; for example, if the first pattern results in #_#!#_# and the second pattern results in _#!#_#_, then hash_code % num_buckets is 2.

So by sending a serialized NSArray containing two NSDictionary instances with these patterns of NSNumber and NSNull keys to some targeted process, and then receiving a re-serialized copy from the victim, an attacker can determine hash_code % num_buckets for NSNull.

Some math: Leaking the entire hash_code

To leak even more information about the hash_code, this can be repeated with different hash table sizes. The attack from the last section leaks hash_code % num_buckets, where num_buckets is a prime number that the attacker can pick from the possible sizes __CFBasicHashTableSizes based on how many elements are in each NSDictionary.

A useful math trick here is: Based on the values resulting from calculating hash_code modulo a bunch of different prime numbers, hash_code modulo the product of all those prime numbers can be calculated via the Chinese remainder theorem (using the extended Euclidean algorithm for the modular inverses involved). Therefore, based on knowing hash_code % num_buckets for the hash table sizes 23, 41, 71, 127, 191, 251, 383, 631 and 1087, it is possible to determine hash_code modulo 23*41*71*127*191*251*383*631*1087 = 0x5'ce23'017b'3bd5'1495. Because 0x5'ce23'017b'3bd5'1495 is bigger than the biggest value hash_code can have (since hash_code is 64-bit), that will be the actual value of hash_code - the address of the NSNull singleton.

Putting it together

So to leak the address of the NSNull singleton in the shared cache, an attacker has to send serialized data consisting of one large container (such as an NSArray) that, for each prime number of interest, contains two NSDictionary instances with the even-indices and odd-indices patterns. (The NSNull keys should come last in the attacker-provided serialized NSDictionary instances, so my reproducer constructs the serialized data manually as an XML plist, and I then convert it to a binary plist with plutil.)

This attacker-provided serialized data is about 50 KiB in size.

The targeted process then has to deserialize this data, serialize it again, and send it back to the attacker.

Afterwards, the attacker can determine in which buckets NSNull was stored in each NSDictionary, use the bucket indices from pairs of NSDictionary to determine hash_code % num_buckets for each hash table size, and then use the extended Euclidean algorithm to obtain hash_code, the address of the NSNull singleton.

The reproducer

I wrote a reproducer for this issue, consisting of my own victim program that runs on the target machine and attacker programs that provide serialized data to the target machine and receive re-serialized data from the target. (For easy reproduction, you can test this on a single machine; that's also what I did, though I rebooted between "attacker" and "target" runs to make sure the attacker isn't using the same shared cache address as the target.)

First, on the attacker machine, generate serialized data:

% clang -o attacker-input-generator attacker-input-generator.c
% ./attacker-input-generator > attacker-input.plist
% plutil -convert binary1 attacker-input.plist

Then, on the target machine, deserialize and re-serialize this data:

% clang round-trip-victim.m -fobjc-arc -fmodules -o round-trip-victim
% ./round-trip-victim attacker-input.plist reencoded.plist
2024-11-25 22:29:44.043 round-trip-victim[1257:11287] decoded
2024-11-25 22:29:44.049 round-trip-victim[1257:11287] reencoded

For validation, you can also use this helper on the target machine to see the real address of the CFNull singleton:

% clang debug-nsnull-hash.m -fobjc-arc -fmodules -o debug-nsnull-hash
% ./debug-nsnull-hash
null singleton pointer = 0x1eb91ab60, null_hash = 0x00000001eb91ab60

Then, on the attacker machine, process the re-serialized data:

% plutil -convert xml1 reencoded.plist
% clang -o extract-pointer extract-pointer.c
% ./extract-pointer < reencoded.plist
serialized data with 1111 objects
NSNull class is 12, NSNull object is 11
NSNull is elem 8 out of 13
NSNull is elem 7 out of 12
NSNull is elem 7 out of 22
NSNull is elem 7 out of 21
NSNull is elem 6 out of 37
NSNull is elem 5 out of 36
NSNull is elem 61 out of 65
NSNull is elem 60 out of 64
NSNull is elem 32 out of 97
NSNull is elem 31 out of 96
NSNull is elem 95 out of 127
NSNull is elem 95 out of 126
NSNull is elem 175 out of 193
NSNull is elem 175 out of 192
NSNull is elem 188 out of 317
NSNull is elem 188 out of 316
NSNull is elem 214 out of 545
NSNull is elem 214 out of 544

NSNull mod 23 = 14
NSNull mod 41 = 13
NSNull mod 71 = 10
NSNull mod 127 = 120
NSNull mod 191 = 62
NSNull mod 251 = 189
NSNull mod 383 = 349
NSNull mod 631 = 375
NSNull mod 1087 = 427

NSNull mod 0x000000000000000000000000000003af =
0x0000000000000000000000000000017e
NSNull mod 0x00000000000000000000000000010589 =
0x000000000000000000000000000059e6
NSNull mod 0x0000000000000000000000000081bef7 =
0xfffffffffffffffffffffffffff4177a
NSNull mod 0x00000000000000000000000060cd7a49 =
0x000000000000000000000000078e47f3
NSNull mod 0x00000000000000000000005ee976e593 =
0x000000000000000000000001eb91ab60
NSNull mod 0x000000000000000000008dff48e176ed =
0x000000000000000000000001eb91ab60
NSNull mod 0x0000000000000000015e003ca3bc222b =
0x000000000000000000000001eb91ab60
NSNull mod 0x0000000000000005ce23017b3bd51495 =
0x000000000000000000000001eb91ab60

NSNull = 0x1eb91ab60

Conclusion

This is a fairly theoretical attack; but I think it demonstrates that using pointers as object hashes for keyed data structures can lead to pointer leaks if everything lines up right, even without using timing attacks.

My example relies on the victim re-serializing the data; but a timing attack version of this might be possible too, with significantly more requests and sufficiently precise measurements.

In my testcase, NSDictionary made it possible to essentially leak information about the ordering of pointers and hashes of numbers by mixing keys of different types; but it is probably possible to leak some amount of information even from data structures that only use pointer keys without mixing key types, especially when the attacker can guess roughly how far apart heap objects are allocated and/or can reference the same objects repeatedly across multiple containers.

The most robust mitigation against this is to avoid using object addresses as lookup keys, or alternatively hash them with a keyed hash function (which should reduce the potential address leak to a pointer equality oracle). However, that could come with negative performance effects - in particular, using an ID stored inside an object instead of the object’s address could add a memory load to the critical path of lookups.

From Chrome renderer code exec to kernel with MSG_OOB

8 August 2025 at 09:00

Introduction

In early June, I was reviewing a new Linux kernel feature when I learned about the MSG_OOB feature supported by stream-oriented UNIX domain sockets. I reviewed the implementation of MSG_OOB, and discovered a security bug (CVE-2025-38236) affecting Linux >=6.9. I reported the bug to Linux, and it got fixed. Interestingly, while the MSG_OOB feature is not used by Chrome, it was exposed in the Chrome renderer sandbox. (Since then, sending MSG_OOB messages has been blocked in Chrome renderers in response to this issue.)

The bug is pretty easy to trigger; the following sequence results in UAF:

char dummy;
int socks[2];
socketpair(AF_UNIX, SOCK_STREAM, 0, socks);
send(socks[1], "A", 1, MSG_OOB);
recv(socks[0], &dummy, 1, MSG_OOB);
send(socks[1], "A", 1, MSG_OOB);
recv(socks[0], &dummy, 1, MSG_OOB);
send(socks[1], "A", 1, MSG_OOB);
recv(socks[0], &dummy, 1, 0);
recv(socks[0], &dummy, 1, MSG_OOB);

I was curious to explore how hard it is to actually exploit such a bug from inside the Chrome Linux Desktop renderer sandbox on an x86-64 Debian Trixie system, escalating privileges directly from native code execution in the renderer to the kernel. Even if the bug is reachable, how hard is it to find useful primitives for heap object reallocation, delay injection, and so on?

The exploit code is posted on our bugtracker; you may want to reference it while following along with this post.

Backstory: The feature

Support for using MSG_OOB with AF_UNIX stream sockets was added in 2021 with commit 314001f0bf92 (“af_unix: Add OOB support”, landed in Linux 5.15). With this feature, it is possible to send a single byte of “out-of-band” data that the recipient can read ahead of the rest of the data. The feature is very limited - out-of-band data is always a single byte, and there can only be a single pending byte of out-of-band data at a time. (Sending two out-of-band messages one after another causes the first one to be turned into a normal in-band message.) This feature is used almost nowhere except in Oracle products, as discussed on an email thread from 2024 where removal of the feature was proposed; yet it is enabled by default when AF_UNIX socket support is enabled in the kernel config, and it wasn’t even possible to disable MSG_OOB support until commit 5155cbcdbf03 (“af_unix: Add a prompt to CONFIG_AF_UNIX_OOB”) landed in December 2024.

Because the Chrome renderer sandbox allows stream-oriented UNIX domain sockets and didn’t filter the flags arguments of send()/recv() functions, this esoteric feature was usable inside the sandbox.

When a message (represented by a socket buffer / struct sk_buff, short SKB) is sent between two connected stream-oriented sockets, the message is added to the ->sk_receive_queue of the receiving socket, which is a linked list. An SKB has a length field ->len describing the length of data contained within it (counting both data in the SKB’s “head buffer” as well as data indirectly referenced by the SKB in other ways). An SKB also contains some scratch space that can be used by the subsystem currently owning the SKB (char cb[48] in struct sk_buff); UNIX domain sockets access this scratch space with the helper #define UNIXCB(skb) (*(struct unix_skb_parms *)&((skb)->cb)), and one of the things they store in there is a field u32 consumed which stores the number of bytes of the SKB that have already been read from the socket. UNIX domain sockets count the remaining length of an SKB with the helper unix_skb_len(), which returns skb->len - UNIXCB(skb).consumed.

MSG_OOB messages (sent with something like send(sockfd, &message_byte, 1, MSG_OOB), which goes through queue_oob() in the kernel) are also added to the ->sk_receive_queue just like normal messages; but to allow the receiving socket to access the latest out-of-band message ahead of the rest of the queue, the ->oob_skb pointer of the receiving socket is updated to point to this message. When the receiving socket receives an OOB message with something like recv(sockfd, &received_byte, 1, MSG_OOB) (implemented in unix_stream_recv_urg()), the corresponding socket buffer stays on the ->sk_receive_queue, but its consumed field is incremented, causing its remaining length (unix_skb_len()) to become 0, and the ->oob_skb pointer is cleared; the normal receive path will have to deal with this when encountering the remaining-length-0 SKB.

This means that the normal recv() path (unix_stream_read_generic()), which runs when recv() is called without MSG_OOB, must be able to deal with remaining-length-0 SKBs and must take care to clear the ->oob_skb pointer when it deletes an OOB SKB. manage_oob() is supposed to take care of this. Essentially, when the normal receive path obtains an SKB from the ->sk_receive_queue, it calls manage_oob() to take care of all the fixing-up required to deal with the OOB mechanism; manage_oob() will then return the first SKB that contains at least 1 byte of remaining data, and manage_oob() ensures that this SKB is no longer referenced as ->oob_skb. unix_stream_read_generic() can then proceed as if the OOB mechanism didn’t exist.

Backstory: The bug, and what led to it

In mid-2024, a userspace API inconsistency was discovered, where recv() could spuriously return 0 (which normally signals end-of-file) when trying to read from a socket with a receive queue that contains a remaining-length-0 SKB left behind by receiving an OOB SKB. The fix for this issue introduced two closely related security issues that can lead to UAF; it was marked as fixing a bug introduced by the original MSG_OOB implementation, but luckily was actually only backported to Linux 6.9.8, so the buggy fix did not land in older LTS kernel branches.

After the buggy fix, manage_oob() looked as follows:

static struct sk_buff *manage_oob(struct sk_buff *skb, struct sock *sk,
                                  int flags, int copied)
{
        struct unix_sock *u = unix_sk(sk);

        if (!unix_skb_len(skb)) {
                struct sk_buff *unlinked_skb = NULL;

                spin_lock(&sk->sk_receive_queue.lock);

                if (copied) {
                        skb = NULL;
                } else if (flags & MSG_PEEK) {
                        skb = skb_peek_next(skb, &sk->sk_receive_queue);
                } else {
                        unlinked_skb = skb;
                        skb = skb_peek_next(skb, &sk->sk_receive_queue);
                        __skb_unlink(unlinked_skb, &sk->sk_receive_queue);
                }

                spin_unlock(&sk->sk_receive_queue.lock);

                consume_skb(unlinked_skb);
        } else {
                struct sk_buff *unlinked_skb = NULL;

                spin_lock(&sk->sk_receive_queue.lock);

                if (skb == u->oob_skb) {
                        if (copied) {
                                skb = NULL;
                        } else if (!(flags & MSG_PEEK)) {
                                if (sock_flag(sk, SOCK_URGINLINE)) {
                                        WRITE_ONCE(u->oob_skb, NULL);
                                        consume_skb(skb);
                                } else {
                                        __skb_unlink(skb, &sk->sk_receive_queue);
                                        WRITE_ONCE(u->oob_skb, NULL);
                                        unlinked_skb = skb;
                                        skb = skb_peek(&sk->sk_receive_queue);
                                }
                        } else if (!sock_flag(sk, SOCK_URGINLINE)) {
                                skb = skb_peek_next(skb, &sk->sk_receive_queue);
                        }
                }

                spin_unlock(&sk->sk_receive_queue.lock);

                if (unlinked_skb) {
                        WARN_ON_ONCE(skb_unref(unlinked_skb));
                        kfree_skb(unlinked_skb);
                }
        }
        return skb;
}

After this change, syzbot (the public syzkaller instance operated by Google) reported that a use-after-free occurs in the following scenario, as described by the fix commit for the syzbot-reported issue:

  1. send(MSG_OOB)
  2. recv(MSG_OOB)
     -> The consumed OOB remains in recv queue
  3. send(MSG_OOB)
  4. recv()
     -> manage_oob() returns the next skb of the consumed OOB
     -> This is also OOB, but unix_sk(sk)->oob_skb is not cleared
  5. recv(MSG_OOB)
     -> unix_sk(sk)->oob_skb is used but already freed

In other words, the issue is that when the receive queue looks like this (shown with the oldest message at the top):

  • SKB 1: unix_skb_len()=0
  • SKB 2: unix_skb_len()=1 <--OOB pointer

and a normal recv() happens, then manage_oob() takes the !unix_skb_len(skb) branch, which deletes the SKB with remaining length 0 and skips forward to the following SKB; but it then doesn’t go through the skb == u->oob_skb check as it otherwise would, which means it doesn’t clear out the ->oob_skb pointer before the SKB is consumed by the normal receive path, creating a dangling pointer that will lead to UAF on a subsequent recv(... MSG_OOB).

This issue was fixed, making the checks for remaining-length-0 SKBs and ->oob_skb in manage_oob() independent:

static struct sk_buff *manage_oob(struct sk_buff *skb, struct sock *sk,
                                  int flags, int copied)
{
        struct sk_buff *read_skb = NULL, *unread_skb = NULL;
        struct unix_sock *u = unix_sk(sk);

        if (likely(unix_skb_len(skb) && skb != READ_ONCE(u->oob_skb)))
                return skb;

        spin_lock(&sk->sk_receive_queue.lock);

        if (!unix_skb_len(skb)) {
                if (copied && (!u->oob_skb || skb == u->oob_skb)) {
                        skb = NULL;
                } else if (flags & MSG_PEEK) {
                        skb = skb_peek_next(skb, &sk->sk_receive_queue);
                } else {
                        read_skb = skb;
                        skb = skb_peek_next(skb, &sk->sk_receive_queue);
                        __skb_unlink(read_skb, &sk->sk_receive_queue);
                }

                if (!skb)
                        goto unlock;
        }

        if (skb != u->oob_skb)
                goto unlock;

        if (copied) {
                skb = NULL;
        } else if (!(flags & MSG_PEEK)) {
                WRITE_ONCE(u->oob_skb, NULL);

                if (!sock_flag(sk, SOCK_URGINLINE)) {
                        __skb_unlink(skb, &sk->sk_receive_queue);
                        unread_skb = skb;
                        skb = skb_peek(&sk->sk_receive_queue);
                }
        } else if (!sock_flag(sk, SOCK_URGINLINE)) {
                skb = skb_peek_next(skb, &sk->sk_receive_queue);
        }

unlock:
        spin_unlock(&sk->sk_receive_queue.lock);

        consume_skb(read_skb);
        kfree_skb(unread_skb);

        return skb;
}

But a remaining issue is that when this function discovers a remaining-length-0 SKB left behind by recv(..., MSG_OOB), it skips ahead to the next SKB and assumes that it is not also a remaining-length-0 SKB. If this assumption is broken, manage_oob() can return a pointer to the second remaining-length-0 SKB, which is bad because the caller unix_stream_read_generic() does not expect to see remaining-length-0 SKBs:

static int unix_stream_read_generic(struct unix_stream_read_state *state,
                                    bool freezable)
{
[...]
        int flags = state->flags;
[...]
        int skip;
[...]
        skip = max(sk_peek_offset(sk, flags), 0); // 0 if MSG_PEEK isn't set

        do {
                struct sk_buff *skb, *last;
[...]
                last = skb = skb_peek(&sk->sk_receive_queue);
                last_len = last ? last->len : 0;

again:
#if IS_ENABLED(CONFIG_AF_UNIX_OOB)
                if (skb) {
                        skb = manage_oob(skb, sk, flags, copied);
                        if (!skb && copied) {
                                unix_state_unlock(sk);
                                break;
                        }
                }
#endif
                if (skb == NULL) {
[...]
                }

                while (skip >= unix_skb_len(skb)) {
                        skip -= unix_skb_len(skb);
                        last = skb;
                        last_len = skb->len;
                        skb = skb_peek_next(skb, &sk->sk_receive_queue);
                        if (!skb)
                                goto again;
                }
[...]
                /* Mark read part of skb as used */
                if (!(flags & MSG_PEEK)) {
                        UNIXCB(skb).consumed += chunk;
[...]
                        if (unix_skb_len(skb))
                                break;

                        skb_unlink(skb, &sk->sk_receive_queue);
                        consume_skb(skb); // frees the SKB

                        if (scm.fp)
                                break;
                } else {

If MSG_PEEK is not set (which is the only case in which SKBs can actually be freed), skip is always 0, and the while (skip >= unix_skb_len(skb)) loop condition should always be false; but when a remaining-length-0 SKB unexpectedly gets here, the condition turns into 0 >= 0, and the loop skips ahead to the first SKB that does not have remaining length 0. That SKB could be the ->oob_skb; in which case this again bypasses the logic in manage_oob() that is supposed to set ->oob_skb to NULL before the current ->oob_skb can be freed.

So the remaining bug can be triggered by first doing the following twice, creating two remaining-length-0 SKBs in the ->sk_receive_queue:

send(socks[1], "A", 1, MSG_OOB);
recv(socks[0], &dummy, 1, MSG_OOB);

If another OOB SKB is then sent with send(socks[1], "A", 1, MSG_OOB), the ->sk_receive_queue will look like this:

  • SKB 1: unix_skb_len()=0
  • SKB 2: unix_skb_len()=0
  • SKB 3: unix_skb_len()=1 <--OOB pointer

Now, recv(socks[0], &dummy, 1, 0) will trigger the bug and free SKB 3 while leaving ->oob_skb pointing to it; making it possible for subsequent recv() syscalls with MSG_OOB to use the dangling pointer.

The initial primitive

This bug yields a dangling ->oob_skb pointer. Pretty much the only way to use that dangling pointer is the recv() syscall with MSG_OOB, either with or without MSG_PEEK, which is implemented in unix_stream_recv_urg(). (There are other codepaths that touch it, but they’re mostly just pointer comparisons, with the exception of the unix_ioctl() handler for SIOCATMARK, which is blocked in Chrome’s seccomp sandbox.)

unix_stream_recv_urg() does this:

static int unix_stream_recv_urg(struct unix_stream_read_state *state)
{
        struct socket *sock = state->socket;
        struct sock *sk = sock->sk;
        struct unix_sock *u = unix_sk(sk);
        int chunk = 1;
        struct sk_buff *oob_skb;

        mutex_lock(&u->iolock);
        unix_state_lock(sk);
        spin_lock(&sk->sk_receive_queue.lock);

        if (sock_flag(sk, SOCK_URGINLINE) || !u->oob_skb) {
[...]
        }

        // read dangling pointer
        oob_skb = u->oob_skb;

        if (!(state->flags & MSG_PEEK))
                WRITE_ONCE(u->oob_skb, NULL);

        spin_unlock(&sk->sk_receive_queue.lock);
        unix_state_unlock(sk);

        // read primitive
        // ->recv_actor() is unix_stream_read_actor()
        chunk = state->recv_actor(oob_skb, 0, chunk, state);

        if (!(state->flags & MSG_PEEK))
                UNIXCB(oob_skb).consumed += 1; // write primitive

        mutex_unlock(&u->iolock);

        if (chunk < 0)
                return -EFAULT;

        state->msg->msg_flags |= MSG_OOB;
        return 1;
}

At a high level, the call to state->recv_actor() (which goes down the call path unix_stream_read_actor -> skb_copy_datagram_msg -> skb_copy_datagram_iter -> __skb_datagram_iter(cb=simple_copy_to_iter)) gives a read primitive: it is trying to copy one byte of data referenced by the oob_skb to userspace, so by replacing the memory pointed to by oob_skb with controlled, repeatedly writable data, it is possible to repeatedly cause copy_to_user(<userspace pointer>, <kernel pointer>, 1) with arbitrary kernel pointers. As long as MSG_PEEK is set, this can be repeated; only when MSG_PEEK is clear is the ->oob_skb pointer cleared.

The only write primitive this bug yields is the increment UNIXCB(oob_skb).consumed += 1 that happens when MSG_PEEK is not set. In the build I’m looking at, the consumed field that is incremented is located 0x44 bytes into the oob_skb, an object which is effectively allocated with an alignment of 0x100 bytes. This means that, if the write primitive is applied to a 64-bit length value or a pointer, it would have to do an increment at offset 4 relative to the 8-byte aligned overwrite target, and it would effectively increment the 64-bit pointer/length by 4 GiB.

My exploit for this issue

Discarded strategy for using the write primitive: Pointer increment

It would be possible to free the sk_buff and reallocate it as some structure containing a pointer at offset 0x40. The write primitive would effectively increment this pointer by 4 GiB (because it would increment by 1 at an offset 4 bytes into the pointer). But this would fundamentally rely on the machine having significantly more than 4 GiB of RAM, which feels gross and a bit like cheating.

Overall strategy

Since this issue relatively straightforwardly leads to a semi-arbitrary read (subject to usercopy hardening restrictions), but the write primitive is much more gnarly, I decided to go with the general approach of: first get the read primitive working; then use the read primitive to assist in exploiting the write primitive. This way, ideally everything after the read primitive bootstrapping can be made reliable with enough work.

Dealing with per-cpu state

Lots of things in this exploit rely on per-cpu kernel data structures and will fail if a task is migrated between CPUs at the wrong time. In some places in the exploit, I repeatedly check which CPU the exploit is running on with sched_getcpu(), and retry if the CPU number changed; though I was too lazy to do that everywhere perfectly, and this could be done even better by relying more directly on the “restartable sequences” subsystem.

Note that the Chrome sandbox policy forbids __NR_getcpu; but that has no effect at all on sched_getcpu(), in particular on x86-64, because there are two faster alternatives to the getcpu() syscall that glibc prefers to use instead:

  • The kernel’s rseq subsystem maintains a struct rseq in userspace for each thread, which contains the cpu_id that the thread is currently running on; if rseq is available, glibc will read from the rseq struct.
  • On x86-64, the vDSO contains a pure-userspace implementation of the getcpu() syscall which relies on either the RDPID instruction or, if that is not available, the LSL instruction to determine the ID of the current CPU without having to perform a syscall. (This is implemented in vdso_read_cpunode() in the kernel sources, which is compiled into the vDSO that is mapped into userspace.)

Setting up the read primitive - mostly boring spraying

On the targeted Debian kernel, struct sk_buff is in the skbuff_head_cache SLUB cache, which normally uses order-1 unmovable pages. I had trouble finding a good reallocation primitive that also uses order-1 pages (though maple_node might have been an option); so I went for reallocation as a pipe page (order-0 unmovable), though that means that the reallocation will go through the buddy allocator and requires the order-0 unmovable list to become empty so that an order-1 page is split up.

This is not very novel, so I will only describe a few interesting aspects of the strategy here - if you want a better understanding of how to free a SLUB page and reallocate it as something else, there are plenty of existing writeups, including one I wrote a while ago (section “Attack stage: Freeing the object’s page to the page allocator”), though that one does not discuss the buddy allocator.

To make it more likely for a reallocation of an order-1 page as an order-0 page to succeed, the exploit starts by allocating a large number of order-0 unmovable pages to drain the order-0 and order-1 unmovable freelists. Most ways of allocating large amounts of kernel memory are limited in the sandbox; in particular, the default file descriptor table size soft limit (RLIMIT_NOFILE) is 4096 on Debian (Chrome leaves this limit as-is), and I can neither use setrlimit() to bump that number up (due to seccomp) nor create subprocesses with separate file descriptor tables. (A real exploit might be able to work around this by exploiting several renderer processes, though that seems like a pain.) The one primitive I have for allocating large amounts of unmovable pages are page tables: by creating a gigantic anonymous VMA (read-only to avoid running into Chrome’s RLIMIT_DATA restrictions) and then triggering read faults all over this VMA, an unlimited number of page tables can be allocated. I use this to spam around 10% of total RAM with page tables. (To figure out how much RAM the machine has, I’m testing whether mmap() works with different sizes, relying on the OVERCOMMIT_GUESS behavior of __vm_enough_memory(); though that doesn’t actually work precisely in the sandbox due to the RLIMIT_DATA limit. A cleaner and less noisy way might be to actually fill up RAM and use mincore() to figure out how large the working set can get before pages get swapped out or discarded.)

Afterwards, I create 41 UNIX domain sockets and use them to spam 256 SKB allocations each; since each SKB uses 0x100 bytes, this allocates a bit over 2.5 MiB of kernel memory. That is enough to later flush a slab page out of both SLUB’s per-cpu partial list as well as the page allocator’s per-cpu freelist, all the way into the buddy allocator.

Then I set up a SLUB page containing a dangling pointer, try to flush this page all the way into the buddy allocator, and reallocate it as a pipe page by using 256 pipes to each allocate 2 pages (which is the minimum size that a pipe always has, see PIPE_MIN_DEF_BUFFERS). This allocates 256 * 2 * 4 KiB = 2 MiB worth of order-0 pages.

At this point, I have probably reallocated the SKB as a pipe page; but I don’t know in which pipe the SKB is located, or at which offset. To figure that out, I store fake SKBs in the pipe pages that point to different data; then, by triggering the bug with recv(..., MSG_OOB|MSG_PEEK), I can read one byte at the pointed-to location and narrow down in which pipe, and where in it, the SKB is. I don’t know the addresses of any kernel objects yet; but the x86-64 implementation of copy_to_user() is symmetric and also works if you pass a userspace pointer as the source, so I can simply use userspace data pointers in the crafted SKBs for now. (SMAP is not an issue here - SMAP is disabled for all memory accesses in copy_to_user(). On x86-64, copy_to_user() is actually implemented as a wrapper around copy_user_generic(), which is a helper that accepts both kernel and userspace addresses as source and destination.)

Afterwards, I have the ability to call copy_to_user(..., 1) on arbitrary kernel pointers through recv(..., MSG_OOB|MSG_PEEK) using the controlled SKB.

Properties of the read primitive

One really cool aspect of a copy_to_user()-based read primitive on x86-64 is that it doesn’t crash even when called on invalid kernel pointers - if the kernel memory access fails, the recv() syscall will simply return an error (-EFAULT).

The main limitation is that usercopy hardening (__check_object_size()) will catch attempts to read from some specific memory ranges:

  • Ranges that wrap around - not an issue here, only ranges of length 1 can be used anyway.
  • Addresses <=16 - not an issue here.
  • The kernel stack of the current process, if some other criteria are met. Not an issue here - even if I want to read from a kernel stack, I’ll probably want to read the kernel stack of another thread, which isn’t protected.
  • The kernel .text section - all of .data and such is accessible, just .text is restricted. When targeting a specific kernel build, that’s not really relevant.
  • kmap() mappings - those don’t exist on x86-64.
  • Freed vmalloc allocations, or ranges that straddle the bounds of a vmalloc allocation. Not an issue here.
  • Ranges in the direct mapping, or in the kernel image address range, that straddle the bounds of a high-order folio. Not an issue here, only ranges of length 1 can be used anyway.
  • Ranges in the direct mapping, or in the kernel image address range, that are used as SLUB pages in non-kmalloc slab caches, at offsets not allowed by usercopy allowlisting (see __check_heap_object()). This is the most annoying part.

(There might be other ways of using this bug to read memory with different constraints, like by using the frag_iter->len read in __skb_datagram_iter() to influence an offset from which known data is subsequently read, but that seems like a pain to work with.)

Locating the kernel image

To break KASLR of the kernel image at this point, there are lots of options, partially thanks to copy_to_user() not crashing on access to invalid addresses; but one nice option is to read an Interrupt Descriptor Table (IDT) entry through the read-only IDT mapping at the fixed address 0xfffffe0000000000 (CPU_ENTRY_AREA_RO_IDT_VADDR), which yields the address of a kernel interrupt handler.

Using the read primitive to observe allocator state and other things

From here on, my goal is to use the read primitive to assist in exploiting the write primitive; I would like to be able to answer questions like:

  • What is the mapping between struct page */struct ptdesc */struct slab * and the corresponding region in the direct mapping? (This is easy and just requires reading some global variables out of the .data/.bss sections.)
  • At which address will the next sk_buff allocation be?
  • What is the current state of this particular page?
  • Where are my page tables located, and which physical address does a given virtual address map to?

Because usercopy hardening blocks access to objects in specialized slabs, reading the contents of a struct kmem_cache is not possible: kmem_cache instances are allocated from a specialized slab cache which does not allow usercopy. But there are many important pieces of kernel memory that are readable, so it is possible to work around that:

  • The kernel .data/.bss sections, which contain things like pointers to kmem_cache instances.
  • The vmemmap region, which contains all instances of struct page/struct folio/struct ptdesc/struct slab (these types all together effectively form a union) which describe the status of each page. These also contain things like a SLUB freelist head pointer; a pointer to the kmem_cache associated with a given SLUB page; or an intrusive linked list element tying together the root page tables of all processes.
  • Kernel stacks of other threads (located in vmalloc memory).
  • Per-CPU memory allocations (located in vmalloc memory), which are used in particular for memory allocation fastpaths in SLUB and the page allocator; and also the metadata describing where the per-cpu memory ranges are located.
  • Page tables.

So to observe the state of the SLUB allocator for a given slab cache: first, read the corresponding kmem_cache* from the kernel .data/.bss section; then scan through all per-cpu memory for objects that look like a struct kmem_cache_cpu (with a struct slab * and a freelist pointer pointing into the corresponding direct mapping range); finally, check which kmem_cache the struct slab’s kmem_cache* points to, to determine whether the kmem_cache_cpu belongs to the right slab cache. Afterwards, the read primitive can be used to read the slab cache’s per-cpu freelist head pointer out of the struct kmem_cache_cpu.

To observe the state of a struct page/struct slab/…, the read primitive can be used to simply read the page’s refcount and mapcount (which contains type information). This makes it possible to observe things like “has this page been freed yet or is it still allocated” and “as what type of page has this page been reallocated”.

To locate the page table root of the current process, it is similarly not possible to directly go through the mm_struct because that is allocated from a specialized slab type which does not allow usercopy (except in the saved_auxv field). But one way to work around this is to instead walk the global linked list of all root page tables (pgd_list), which stores its elements inside struct ptdesc, and search for a struct ptdesc which has a pt_mm field that points to the mm_struct of the current process. The address of this mm_struct can be obtained from the per-cpu variable cpu_tlbstate.loaded_mm. Afterwards, the page tables can be walked through the read primitive.

Finding a reallocation target: The magic of CONFIG_RANDOMIZE_KSTACK_OFFSET

Having already discarded the “bump a pointer by 4 GiB” and “reallocate as a maple tree node” strategies, I went looking for some other allocation which would place an object such that incrementing the value at address 0x…44 leads to a nice primitive. It would be nice to have something there like an important flags field, or a length specifying the size of a pointer array, or something like that. I spent a lot of time looking at various object types that can be allocated on the kernel heap from inside the Chrome sandbox, but found nothing great.

Eventually, I realized that I had been going down the wrong path. Clearly trying to target a heap object was foolish, because there is something much better: It is possible to reallocate the target page as the topmost page of a kernel stack!

That might initially sound like a silly idea; but Debian’s kernel config enables CONFIG_RANDOMIZE_KSTACK_OFFSET=y and CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT=y, causing each syscall invocation to randomly shift the stack pointer down by up to 0x3f0 bytes, with 0x10 bytes granularity. That is supposed to be a security mitigation, but works to my advantage when I already have an arbitrary read: instead of having to find an overwrite target that is at a 0x44-byte distance from the preceding 0x100-byte boundary, I effectively just have to find an overwrite target that is at a 0x4-byte distance from the preceding 0x10-byte boundary, and then keep doing syscalls and checking at what stack depth they execute until I randomly get lucky and the stack lands in the right position.

With that in mind, I went looking for an overwrite target on the stack, strongly inspired by Seth’s exploit that overwrote a spilled register containing a length used in copy_from_user. Targeting a normal copy_from_user() directly wouldn’t work here - if I incremented the 64-bit length used inside copy_from_user() by 4 GiB, then even if the copy failed midway through due to a userspace fault, copy_from_user() would try to memset() the remaining kernel memory to zero.

I discovered that, on the codepath pipe_write -> copy_page_from_iter -> copy_from_iter, the 64-bit length variable bytes of copy_page_from_iter() is stored in register R14, which is spilled to the stack frame of copy_from_iter(); and this stack spill is in a stack location where I can clobber it.

When userspace calls write() on a pipe, the kernel constructs an iterator (struct iov_iter) that encapsulates the userspace memory range passed to write(). (There are different types of iterators that can encapsulate a single userspace range, a set of userspace ranges, or various types of kernel memory.) Then, pipe_write() (which is called anon_pipe_write() in newer kernels) essentially runs a loop which allocates a new pipe_buffer slot in the pipe, places a new page allocation in this pipe buffer slot, and copies up to a page worth of data (PAGE_SIZE bytes) from the iov_iter to the pipe buffer slot’s page using copy_page_from_iter(). copy_page_from_iter() effectively receives two length values: The number of bytes that fit into the caller-provided page (bytes, initially set to PAGE_SIZE here) and the number of bytes available in the struct iov_iter encapsulating the userspace memory range (i->count). The amount of data that will actually be copied is limited by both.

If I manage to increment the spilled register R14 which contains bytes by 4 GiB while copy_from_iter() is busy copying data into the kernel, then after copy_from_iter() returns, copy_page_from_iter() will effectively no longer be bounded by bytes, only by i->count (based on the length userspace passed to write()); so it will do a second iteration, which copies into out-of-bounds memory behind the pipe buffer page. If userspace calls write(fd, buf, 0x3000), and the overwrite happens in the middle of copying bytes 0x1000-0x1fff of the userspace buffer into the second pipe buffer page, then bytes 0x2000-0x2fff will be written out-of-bounds behind the second pipe buffer page, at which point i->count will drop to 0, terminating the operation.

Reallocating a SLUB page as a stack page, with arb-read assistance

So to get the ability to increment-after-free a value in a stack page, I again start by draining the low-order page allocator caches. But this time, the arb-read can be used to determine when an object at the right in-page offset is at the top of the SLUB freelist for the sk_buff slab cache; and the arb-read can also determine whether I managed to allocate an entire slab page worth of objects, with no other objects mixed in.

Then, to reallocate the page, I run a loop that first allocates a pipe page, then checks the refcount field of the target page. If the refcount of the target page goes up, I probably found the target page, and can exit the loop; otherwise, I free the pipe page again, reallocate it as a page table to drain the page away, and try again. (Directly allocating as a page table would be cumbersome because page tables have RCU lifetime, so once a page has been allocated as a page table, it is hard to reallocate it. Keeping drained pages in pipe buffers might not work well due to the low file descriptor table size, and each pipe FD pair potentially only being able to reference two pages.)

Once I have reallocated the target page as a pipe buffer, I free it again, then free three more pages (from other helper pipes), and then create a new thread with the clone() syscall. If everything goes well, clone() will allocate four pages for the new kernel stack: First the three other pages I freed last, and then the target page as the last page of the stack. By walking the page tables, I can verify that the target page really got reused as the last page of the target stack.

Remaining prerequisites for using the write primitive

At this point, I have the write primitive set up such that I can trigger it on a specific stack memory location. The write primitive essentially first reads some surrounding (stack) memory (in unix_stream_read_actor() and its callees skb_copy_datagram_msg -> skb_copy_datagram_iter) and expects that memory to have a certain structure before incrementing the value at a specific stack location.

I also know what stack allocation I want to overwrite.

The remaining issues are:

  1. I need to ensure that an OOB copy_from_user() behind a pipe buffer page will overwrite some data that helps in compromising the kernel.
  2. I need to be able to detect at what stack depth pipe_write() is running, and depending on that either try again or proceed to trigger the bug.
  3. The UAF reads preceding the UAF increment need to see the right kind of data to avoid crashing.
  4. copy_from_iter() needs to take enough time to allow me to increment a value in its stack frame.

Selecting an OOB overwrite target

Page tables have several nice properties here:

  • It is easy for me to cause allocation of as many page tables as I want.
  • I can easily determine the physical and kernel-virtual addresses of page tables that the kernel has allocated for my process (by walking the page tables with the arb read).
  • They are order-0 unmovable allocations, just like pipe buffers, so the page allocator will allocate them in the same 2MiB pageblocks.

So I am choosing to use the OOB copy_from_user() to overwrite a page table.

This requires that I can observe where my pipe buffer pages are located; for that, I again use the SLUB per-cpu freelist observing trick, this time on the kmalloc-cg-192 slab cache, to figure out where a newly created pipe’s pipe_inode_info is located. From there, I can walk to the pipe’s pipe_buffer array, which contains pointers to the pages used by the pipe.

With the ability to observe both where my page tables are located and where pipe buffer pages are allocated, I can essentially alternatingly allocate page tables and pipe buffer pages until I get two that are adjacent.

Detecting pipe_write() stack depth

To run pipe_write() with a write() syscall such that I can reliably determine at which depth the function is running and decide whether to go ahead with the corruption, without having to race, I can prepare a pipe such that it initially only has space for one more pipe_buffer, and then call write() with a length of 0x3000. This will cause pipe_write() to first store 0x1000 bytes in the last free pipe_buffer slot, then wait for space to become available again. From another thread, it is possible to detect when pipe_write() has used the last free pipe_buffer slot by repeatedly calling poll() on the pipe: When poll() stops reporting that the pipe is ready for writing (POLLOUT), pipe_write() must have used up the last free pipe_buffer slot.

At that point, I know that the syscall entry part of the kernel stack is no longer changing. To check whether the syscall is executing at a specific depth, it is enough to check whether the return address for the return from x64_sys_call to do_syscall_64 is at the expected position on the kernel stack using the arb read - it can’t be a return address left from a preceding syscall because the same stack location where that return address is stored is always clobbered by a subsequent call to syscall_exit_to_user_mode at the end of a syscall.

If the stack randomization is the correct one, I can then do more setup and resume pipe_write() by using read() to clear pipe buffer entries; otherwise, I will use read() to clear pipe buffer entries, let pipe_write() run to completion, and try again.

Letting the reads in the increment primitive see the right data

The increment primitive happens on this call graph:

unix_stream_recv_urg
  [read dangling pointer from ->oob_skb]
  unix_stream_read_actor [called as state->recv_actor]
    [UAF read UNIXCB(skb).consumed]
    skb_copy_datagram_msg
      skb_copy_datagram_iter
        __skb_datagram_iter
          skb_headlen
            [UAF read skb->len]
            [UAF read skb->data_len]
          skb_frags_readable
            [UAF read skb->unreadable]
          skb_shinfo [for reading nr_frags]
            skb_end_pointer
              [UAF read skb->head]
              [UAF read skb->end]
          skb_walk_frags
            skb_shinfo [for reading frag_list]
            [forward iteration starting at skb_shinfo(skb)->frag_list along ->next pointers]
  [UAF increment of UNIXCB(oob_skb).consumed]

A promising aspect here is that this codepath first does all the reads; then it does a linked list walk through attacker-controlled pointers with skb_walk_frags(); and then it does the write. skb_walk_frags() is defined as follows:

#define skb_walk_frags(skb, iter)	\
	for (iter = skb_shinfo(skb)->frag_list; iter; iter = iter->next)

and is used like this in __skb_datagram_iter():

	skb_walk_frags(skb, frag_iter) {
		int end;

		WARN_ON(start > offset + len);

		end = start + frag_iter->len;
		if ((copy = end - offset) > 0) {
			if (copy > len)
				copy = len;
			if (__skb_datagram_iter(frag_iter, offset - start,
						to, copy, fault_short, cb, data))
				goto fault;
			if ((len -= copy) == 0)
				return 0;
			offset += copy;
		}
		start = end;
	}

So if I run recv(..., MSG_OOB) on the UNIX domain socket while the dangling ->oob_skb pointer points to data I control, and craft that fake SKB such that its skb_shinfo(skb)->frag_list points to another fake SKB with ->len=0 and a ->next pointer pointing back to itself, I can cause the syscall to get stuck in an infinite loop. It will keep looping until I replace the ->next pointer with NULL, at which point it will perform just the UAF increment.

This is great news: instead of needing to ensure that the stack contains the right data for the UAF reads and the overwrite target for the UAF increment at the same time, I can first place controlled data on the stack, and then afterwards separately place the overwrite target on the stack.

To place controlled data on the stack, I initially considered using select() or poll(), since I know that those syscalls copy large-ish amounts of data from userspace onto the stack; however, those have the disadvantage of immediately validating the supplied data, and it would be hard to make them actually stay in the syscall, rather than immediately returning out of the syscall with an error and often clobbering the on-stack data array in the process.

Eventually I discovered that sendmsg() on a datagram-oriented UNIX domain socket works great for this: ___sys_sendmsg(), which implements the sendmsg() syscall, will import the destination address pointed to by msg->msg_name into a stack buffer (struct sockaddr_storage address), then call into the protocol-specific ->sendmsg handler - in the case of datagram-oriented UNIX domain sockets, unix_dgram_sendmsg(). This function coarsely validates the structure of the destination address (checking that it specifies the AF_UNIX family and is no larger than struct sockaddr_un), then waits for space to become available in the socket’s queue before doing anything else with the destination address. This makes it possible to place 108 bytes of controlled data on a kernel stack, and that data will stay there until the syscall can continue or bail out when space becomes available in the socket queue or the socket is shut down.

I actually need a bit more data on the stack, but luckily the struct iovec iovstack[UIO_FASTIOV] is directly in front of the address, and unused elements at the end of the iovstack are guaranteed to be zeroed thanks to CONFIG_INIT_STACK_ALL_ZERO=y, which happens to be exactly what I need.

It would be helpful to be able to reliably wait for the sendmsg() syscall to enter the kernel and copy the destination address onto the kernel stack before inspecting the state of its stack; this is luckily possible by supplying a single-byte “control message” via msg->msg_control and msg->msg_controllen, which will mostly be ignored because it is too small to be a legitimate control message, but will be copied onto the kernel stack in ____sys_sendmsg() after the destination address has been copied onto the stack. It is possible to detect from userspace when this kernel access to msg->msg_control happens by pointing it to a userspace address which is not yet populated with a page table entry, then polling mincore() on this userspace address.

So now my strategy is roughly:

  1. In a loop, call sendmsg() on the thread whose stack the dangling ->oob_skb pointer points to, placing a fake SKB on the stack, until the fake SKB lands at the right stack offset thanks to CONFIG_RANDOMIZE_KSTACK_OFFSET; that fake SKB’s skb_shinfo(skb)->frag_list points to a second fake SKB with a ->next pointer that refers back to itself. (This second fake SKB can be placed anywhere I want, so I’m putting it in a userspace-owned page, so that userspace can directly write into it.)
  2. On a second thread, call recv(..., MSG_OOB) on the UNIX domain socket to use the dangling ->oob_skb pointer; this will start looping endlessly, following the ->next pointer.
  3. On the thread that called sendmsg() before, now call write(..., 0x3000) on a pipe with one free pipe_buffer slot in a loop until the syscall handler lands at the right stack offset thanks to CONFIG_RANDOMIZE_KSTACK_OFFSET.
  4. Let the pipe write() continue, and wait until it is in the middle of copying data from userspace memory to a pipe buffer page.
  5. Set the ->next pointer in the second fake SKB to NULL, so that the recv(..., MSG_OOB) on the UNIX domain socket stops looping, performs the UAF increment, and returns.
  6. Wait for the pipe write() to finish, at which point the page table behind the pipe data page should have been overwritten with controlled data.

Slowing down copy_from_iter()

I need to slow down a copy_from_iter() call. There are several strategies for this that don’t work (or don’t work well) in a Chrome renderer sandbox:

  • userfaultfd: not accessible in the Chrome Desktop renderer sandbox, and nowadays usually restricted anyway such that only root can use it to intercept usercopy operations
  • FUSE: not accessible in the Chrome Desktop renderer sandbox
  • causing lots of major page faults: I’m not sure if there is some indirect way to get a file descriptor to a writable on-disk file; but either way, this seems like it would be a pain from a renderer.

But as long as only a single userspace memory read needs to be delayed, there is another option: I can create a very large anonymous VMA; fill it with mappings of the 4KiB zeropage; ensure that no page is mapped at one specific location in the VMA (for example with madvise(..., MADV_DONTNEED), which zaps page table entries in the specified range); and then have one thread run an mprotect() operation on this large anonymous VMA while another thread tries to access the part of the userspace region where no page is currently mapped. The mprotect() operation will keep the VMA write-locked while it walks through all the associated page table entries, modifies the page table entries as required, and performs TLB flushes if necessary; so a concurrent page fault in this VMA will have to wait until the mprotect() has finished. One limitation of this technique is that the part of the accessed userspace range that causes the slowdown will be filled with zeroes; but that can just be a single byte at the start or end of the range being copied, so it’s not a major limitation.

Based on some rough testing on my machine, if mprotect() has to iterate through 128 MiB of page tables populated with zeropage mappings, it takes something like 500-1000ms depending on which way the page table entries are changed.

Page table control

Putting all this together, I can overwrite the contents of a page table with controlled data. I’m using that controlled write to place a new entry in the page table that points back to the page table, effectively creating a userspace mapping of the page table; and then I can use this to map arbitrary kernel memory writably into userspace.

My exploit demonstrates its ability to modify kernel memory with this by using it to overwrite the UTS information printed by uname.

Takeaway: Chrome sandbox attack surface

One thing that stood out to me about this is that I was able to use a somewhat large number of kernel interfaces in this exploit; in particular:

interface                                    usecase
anonymous VMA creation                       page table allocations
madvise()                                    fast VMA splitting and merging
AF_UNIX SOCK_STREAM sockets                  triggering the bug; SKB allocation and freeing
sched_getcpu() (via syscall-less fastpaths)  interacting with per-cpu kernel structures
eventfd()                                    synchronization between threads
pipe()                                       allocation and freeing of order-0 unmovable pages with controlled contents
pipe()                                       stack overwrite target
AF_UNIX SOCK_DGRAM sockets                   placing controlled data on the stack
sendmsg()                                    placing controlled data on the stack
mprotect()                                   slowing down copy_from_user()
munmap()                                     TLB flushing
madvise(..., MADV_DONTNEED)                  zapping PTEs for slowing down subsequent copy_from_user() or subsequently detecting copy_from_user()
mincore()                                    detecting copy_from_user()
clone()                                      racing operations on multiple threads; reallocating pages as kernel stack
poll()                                       detecting progress of concurrent pipe_write()

Some of these are obviously needed to implement necessary features of the sandboxed renderer; others seem like unnecessary attack surface. I hope to look at this more systematically in the future.

Takeaway: Esoteric kernel features in core interfaces are an issue for browser sandboxes

One thing I’ve noticed, not just with this issue, but several issues before that, is that core kernel subsystems (which are exposed in renderer sandbox policies and such) sometimes have flags that trigger esoteric ancillary features that are unintentionally exposed by Chrome’s renderer sandbox. Such features seem to often be more buggy than the core feature that the policy intended to expose. Examples of this from Chrome’s past include:

  • futex() was broadly exposed in the sandbox, making it possible to reach a bug in Priority Inheritance futexes from the renderer sandbox.
  • memfd_create() was exposed in the sandbox without checking its flags, making it possible to create HugeTLB mappings using the MFD_HUGETLB flag. There have been several bugs in HugeTLB, which is to my knowledge almost exclusively used by some server applications that use large amounts of RAM, such as databases.
  • pipe2() was exposed in the sandbox without checking its flags, making it possible to create “notification pipes” using the O_NOTIFICATION_PIPE flag, which behave very differently from normal pipes and are used exclusively for posting notifications from the kernel “keys” subsystem to userspace.

Takeaway: probabilistic mitigations against attackers with arbitrary read

When faced with an attacker who already has an arbitrary read primitive, probabilistic mitigations that randomize something differently on every operation can be ineffective: the attacker can simply keep retrying until the arbitrary read confirms that the randomization picked a suitable value. Such mitigations can even work to the attacker’s advantage by lining up memory locations that could otherwise never overlap, as done here using the kernel stack randomization feature.

Picking per-syscall random stack offsets at boot time might avoid this issue, since to retry with different offsets, the attacker would have to wait for the machine to reboot or try again on another machine. However, that would break the protection for cases where the attacker wants to line up two syscalls that use the same syscall number (such as different ioctl() calls); and it could also weaken the protection in cases where the attacker just needs to know what the randomization offset for some syscall will be.

Somewhat relatedly, Blindside demonstrated that this style of attack can be pulled off without a normal arbitrary read primitive, by “exploiting” a real kernel memory corruption bug during speculative execution in order to leak information needed for subsequently exploiting the same memory corruption bug for real.

Takeaway: syzkaller fuzzing and complex data structures

The first memory corruption bug described in this post was introduced in late June 2024, and discovered by syzkaller in late August 2024. Hitting that bug required 6 syscalls: One to set up a socket pair, four send()/recv() calls to set up a dangling pointer, and one more recv() call to actually trigger UAF by accessing the dangling pointer.

Hitting the second memory corruption bug, which I found by code review, required 8 syscalls: One to set up a socket pair, six send()/recv() calls to set up a dangling pointer, and one more recv() to cause UAF.

This was not a racy bug; in a KASAN build, running the buggy syscall sequence once would be enough to get a kernel splat. But when a fuzzer chains together syscalls more or less at random, the chance of running the right sequence of syscalls drops exponentially with each syscall required…

The most important takeaway from this is that data structures with complex safety rules (in this case, rules about the ordering of different types of SKBs in the receive queues of UNIX domain stream sockets) don’t just make it hard for human programmers to keep track of safety rules, they also make it hard for fuzzers to construct inputs that explore all relevant state patterns. This might be an area for fuzzer improvement - perhaps fuzzers could reach deeper into specific subsystems by generating samples that focus on interaction with a single kernel subsystem, or by monitoring whether additional syscalls chained to the end of a base sample cause additional activity in a particular subsystem.

Takeaway: copy_from_user() delays don’t require FUSE or userfaultfd

FUSE and userfaultfd are the most effective and reliable ways to inject delays on copy_from_user() calls because they can set up separate delays for multiple memory regions, provide precise control over the timing of the injected delay, don’t require large allocations or slow preparation, and allow placing arbitrary data in the page that is eventually installed. However, applying mprotect() to a large anonymous VMA filled with zeropage mappings (with 128 MiB of page tables) turns out to be sufficient to delay kernel execution by around a second. In the past, I have pushed for restricting userfaultfd because of how it can delay operations like copy_from_user(), but perhaps userfaultfd was not actually significantly more useful in this regard than mprotect().

Takeaway: Usercopy hardening

The hardening checks I encountered when calling copy_to_user() on arbitrary kernel addresses were a major annoyance, but could be worked around, since access to almost anything except type-specific SLUB pages is allowed. That said, I’m not sure how important improving these checks is - trying to protect against an attacker who can pass arbitrary kernel pointers to copy_to_user() might be futile, and the major focus of this hardening is guarding against out-of-bounds or use-after-free copy_to_user() calls.

Conclusions

Even in somewhat constrained environments, it is possible to pull off moderately complex Linux kernel exploits.

Chrome’s Linux desktop renderer sandbox exposes kernel attack surface that is never legitimately used in the sandbox. This needless functionality doesn’t just allow attackers to exercise vulnerabilities they otherwise couldn’t; it also exposes kernel interfaces that are useful for exploitation, enabling heap grooming, delay injection and more. The Linux kernel contributes to this issue by exposing esoteric features through the same syscalls as commonly-used core kernel functionality. I hope to do a more in-depth analysis of Chrome’s renderer sandbox on Linux in a follow-up blogpost.
