
Bypassing Windows Administrator Protection

26 January 2026 at 09:00

A headline feature introduced in the latest release of Windows 11, 25H2, is Administrator Protection. The goal of this feature is to replace User Account Control (UAC) with a more robust and, importantly, securable system that allows a local user to access administrator privileges only when necessary.

This blog post will give a brief overview of the new feature, how it works, and how it differs from UAC. I'll then describe some of the security research I undertook while it was in the insider preview builds of Windows 11. Finally, I'll detail one of the nine separate vulnerabilities that I found to bypass the feature and silently gain full administrator privileges. All the issues that I reported to Microsoft have been fixed, either prior to the feature being officially released (in optional update KB5067036) or in subsequent security bulletins.

Note: As of 1st December 2025 the Administrator Protection feature has been disabled by Microsoft while an application compatibility issue is dealt with. The issue is unlikely to be related to anything described in this blog post so the analysis doesn’t change.

The Problem Administrator Protection is Trying to Solve

UAC was introduced in Windows Vista to facilitate granting a user administrator privileges temporarily, while the majority of the user's processes run with limited privileges. Unfortunately, due to the way it was designed, it quickly became apparent it didn't represent a hard security boundary, and Microsoft downgraded it to a security feature. This was an important change as it made it no longer a priority to fix bypasses of UAC which allowed a limited process to silently gain administrator privileges.

The main issue with the design of UAC was that both the limited user and the administrator user were the same account, just with different sets of groups and privileges. This meant they shared profile resources such as the user directory and registry hive. It was also possible to open an administrator process's access token and impersonate it to gain administrator privileges, as the impersonation permission checks didn't originally consider whether an access token was "elevated" or not; they only considered the user and the integrity level.

Even so, on Vista it wasn’t that easy to silently acquire administrator privileges as most routes still showed a prompt to the user. Unfortunately, Microsoft decided to reduce the number of elevation prompts a user would see when modifying system configuration and introduced an “auto-elevation” feature in Windows 7. Select Microsoft binaries could be opted in to be automatically elevated. However, it also meant that in some cases it was possible to repurpose the binaries to silently gain administrator privileges. It was possible to configure UAC to always show a prompt, but the default, which few people change, would allow the auto-elevation.

A good repository of known bypasses is the UACMe tool which currently lists 81 separate techniques for gaining administrator privileges. A proportion of those have been fixed through major updates to the OS, even though Microsoft never officially acknowledges when a UAC bypass is fixed. However, there still exist silent bypasses that impact the latest version of Windows 11 that remain unfixed.

The fact that malware is regularly using known bypasses to gain administrator privileges is what Administrator Protection aims to solve. If the weaknesses in UAC can be mitigated then it can be made a secure boundary, one which not only requires more work to bypass but also ensures that any vulnerabilities in the implementation get fixed as security issues.

In fact there is already a more secure mechanism that UAC can use which doesn't suffer from many of the problems of the so-called "admin approval" elevation. This mechanism, used when the user is not a member of the administrators group, is referred to as "over-the-shoulder" elevation. It requires a user to know the credentials of a local administrator user, which must be input into the UAC elevation prompt. It's more secure than admin approval elevation for the following reasons:

  • The profile data is no longer shared, which prevents the limited user from modifying files or registry keys which might be used by an elevated administrator process.
  • It’s no longer possible to get an access token for the administrator user and impersonate it as limited users cannot impersonate other user accounts.
  • Auto-elevation of Microsoft binaries is not supported; all elevation requests require confirmation through a prompt.

Unfortunately, the mechanism is difficult to use securely in practice, as sharing the credentials of another local administrator account would be a big risk. Thus it's primarily useful as a means for technical support, where a sysadmin types in the credentials over the user's shoulder.

Administrator Protection improves on over-the-shoulder elevation by using a separate shadow administrator account that is automatically configured by the UAC service. This has all the benefits of over-the-shoulder elevation plus the following:

  • The user does not need to know the credentials for the shadow administrator as there aren’t any. Instead UAC can be configured to prompt for the limited user’s credentials, including using biometrics if desired.
  • A separate local administrator account isn’t required, only the user needs to be configured to be a member of the administrators group making deployment easier.

While Microsoft is referring to Administrator Protection as a separate feature, it can really be considered a third UAC mechanism, as it uses the same infrastructure and code to perform elevation, just with some tweaks. However, the feature replaces admin-approval mode, so you can't use the "legacy" mode and Administrator Protection at the same time. There's currently no UI to enable it, but you can do so by modifying the local security policy.

The big question: will this make UAC a securable boundary so malware no longer has a free ride? I guess we'd better take a look and find out.

Researching Administrator Protection

I typically avoid researching new Windows features before they're released. It hasn't been a good use of time in the past, where I've found a security issue in a new feature during the insider preview stages only for that bug to be due to temporary code that was subsequently removed. Also, if security issues are fixed in the insider preview stage they do not result in a security bulletin, making it harder to track when something is fixed. Therefore, there's little incentive to research features until they are released, when I can be confident any bugs that are discovered are real security issues and will be fixed in a timely manner.

This case was slightly different: Microsoft reached out to me to see if I wanted to help them find issues in the implementation during the insider preview stage. No doubt part of the reason they reached out was my history of finding complex logical UAC bypasses. Also, I'd already taken a brief look and noted that the feature was still vulnerable to a few well-known public bypasses, such as my abuse of loopback Kerberos.

I agreed to look at a design document and provide feedback without doing a full "pentest". However, if I did find issues, considering the goal was for Administrator Protection to be a securable boundary, I was assured that they would be fixed through a bulletin, or at least would be remediated before the final release of the feature.

The Microsoft document provided an overview, but not all design details. For example, I did have a question about what the developers considered to be the security boundary. In keeping with the removal of auto-elevation, I made the assumption that bypassing the boundary would require one or more of the following:

  • Compromising the shadow administrator's profile, such as writing arbitrary files or registry keys.
  • Hijacking an existing process running as the shadow administrator.
  • Getting a process to execute as an administrator without showing a prompt.

The prompt being a boundary is important: there are a number of UAC bypasses, such as those which rely on elevated COM objects, that would still work in Administrator Protection. However, as auto-elevation is no longer permitted they will always show a prompt, and therefore these are not considered bypasses. Of course, what is shown in the prompt, such as the executable being elevated, doesn't necessarily correlate with the operation that is about to be performed with administrator rights.

The document lacked consideration of some associated UAC features such as UI Access processes (this will be discussed in part 2 of this series), but even so some descriptions stuck out to me. Therefore, I couldn't help myself and decided to at least take a look at the current implementation in the canary insider preview build. This research was a mix of reverse engineering the UAC service code in appinfo.dll and behavioral analysis.

At the end of the research I had found nine separate ways to bypass the feature and silently gain administrator privileges. Some of the bypasses were long-standing UAC issues with publicly available test cases. Others were due to implementation flaws in the feature itself. But the most interesting bug class was where there wasn't a bug at all, until the rest of the OS got involved.

Let's dive into the most interesting bypass I identified during the research. If you want to skip ahead you can read the full details on the issue tracker. This issue is interesting not just because it allowed me to bypass the protection, but also because it was a potential UAC bypass that I had known about for many years which only became practically exploitable with the introduction of this feature.

Logon Sessions

First a little bit of background knowledge to understand the vulnerability. When a user authenticates to a Windows system successfully they’re assigned a unique logon session. This session is used to control the information about the user, for example it keeps a copy of the user’s credentials so that they can be used for network authentication.

The logon session is added as a reference in the access token created during the logon process, so that it can be easily referred to during any kernel operations using the token. You can find the unique 64-bit authentication ID for the session by querying the token using the NtQueryInformationToken system call. In UAC, separate logon sessions are assigned to the limited and the linked administrator access tokens as shown in the following script where you can observe that the limited token and linked token have distinct authentication ID LUID values:

# Get authentication ID of current token
PS> Get-NtTokenId -Authentication
LUID
----
00000000-11457F17

# Query linked administrator token and get its authentication ID.
PS> $t = Get-NtToken -Linked
PS> Get-NtTokenId -Authentication -Token $t
LUID
----
00000000-11457E9E
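An authentication ID is a 64-bit LUID split into 32-bit high and low parts, and the values above are just those two parts printed as zero-padded hex. A minimal Python sketch of that rendering (illustrative only, not the real tooling):

```python
def format_luid(high: int, low: int) -> str:
    """Render a LUID the way the output above displays it: the 32-bit
    high and low parts as zero-padded 8-digit hex, dash-separated."""
    return f"{high:08X}-{low:08X}"

# The limited token's authentication ID from the output above.
assert format_luid(0, 0x11457F17) == "00000000-11457F17"
# The linked administrator token has a distinct LUID.
assert format_luid(0, 0x11457E9E) == "00000000-11457E9E"
```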

One important place the logon session is referenced by the kernel is when looking up DOS drive letters. From the kernel's perspective, drive letters are stored in a special object directory, \??. When this path is looked up, the kernel will first see if there's a logon-session-specific directory to check; this is stored under the path \Sessions\0\DosDevices\X-Y, where X-Y is the hexadecimal representation of the authentication ID for the logon session. If the drive letter symbolic link isn't found in that directory the kernel falls back to checking the \GLOBAL?? directory. You can observe this behavior by opening the \?? object directory using the NtOpenDirectoryObject system call as shown:

PS> $d = Get-NtDirectory "\??"
PS> $d.FullPath
\Sessions\0\DosDevices\00000000-11457f17

It’s well known that if you can write a symbolic link to a DOS device object directory you can hijack the C: drive of any process running with that access token in that logon session. Even though the C: drive is defined in the global object directory, the logon session specific directory is checked first and so it can be overridden.

If a user can write into another logon session’s DOS device object directory they can redirect any file access to the system drive. For example you could redirect system DLL loading to force arbitrary code to run in the context of a process running in that logon session. In the case of UAC this isn’t an issue as the separate DOS device object directories have different access control and therefore the limited user can’t hijack the C: drive of an administrator process. The access control for the administrator’s DOS device object directory is shown below:

PS> Get-NtTokenSid
Name           Sid
----           ---
DOMAIN\user    S-1-5-21-5242245-89012345-3239842-1001

PS> $d = Get-NtDirectory "\??"
PS> Format-NtSecurityDescriptor $d -Summary
<Owner> : BUILTIN\Administrators
<Group> : DOMAIN\Domain Users
<DACL>
NT AUTHORITY\SYSTEM: (Allowed)(ObjectInherit, ContainerInherit)(Full Access)
BUILTIN\Administrators: (Allowed)(ObjectInherit, ContainerInherit)(Full Access)
BUILTIN\Administrators: (Allowed)(None)(Full Access)
CREATOR OWNER: (Allowed)(ObjectInherit, ContainerInherit, InheritOnly)(GenericAll)
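Stepping back, the drive letter lookup order described earlier, per-logon-session directory first with a fall back to the global one, can be modeled with a couple of Python dictionaries (a simplified sketch, not real object manager code):

```python
# The global drive letter directory (\\GLOBAL??), shared by everyone.
GLOBAL_DOS_DEVICES = {"C:": "\\Device\\HarddiskVolume3"}

def resolve_drive(letter: str, session_dos_devices: dict) -> str:
    """Resolve a DOS drive letter the way the kernel resolves \\??:
    consult the logon session's DosDevices directory first and only
    fall back to \\GLOBAL?? if the letter isn't found there."""
    if letter in session_dos_devices:
        return session_dos_devices[letter]
    return GLOBAL_DOS_DEVICES[letter]

# Normally the per-session directory has no C: link, so the global
# definition wins.
assert resolve_drive("C:", {}) == "\\Device\\HarddiskVolume3"

# If we can write a C: symbolic link into a logon session's directory,
# we hijack the system drive for every process using that session.
assert resolve_drive("C:", {"C:": "\\Device\\Attacker"}) == "\\Device\\Attacker"
```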

Creating a DOS Device Object Directory

A question you might have is who creates this DOS device object directory? It turns out the kernel creates it on demand when the directory is first accessed. The code to do the creation is in SeGetTokenDeviceMap, which looks roughly like the following:

NTSTATUS SeGetTokenDeviceMap(PTOKEN Token, PDEVICE_MAP *ppDeviceMap) {
  *ppDeviceMap = Token->LogonSession->pDeviceMap;
  if (*ppDeviceMap) {
    return STATUS_SUCCESS;
  }
  WCHAR path[64];
  swprintf_s(path, 64, L"\\Sessions\\0\\DosDevices\\%08x-%08x",
             Token->AuthenticationId.HighPart,
             Token->AuthenticationId.LowPart);
  PUNICODE_STRING PathString;
  RtlInitUnicodeString(&PathString, path);
  OBJECT_ATTRIBUTES ObjectAttributes;
  InitializeObjectAttributes(&ObjectAttributes, 
                             &PathString, 
                             OBJ_CASE_INSENSITIVE |
                             OBJ_OPENIF |
                             OBJ_KERNEL_HANDLE |
                             OBJ_PERMANENT, 0, NULL);
  HANDLE Handle;
  NTSTATUS status = ZwCreateDirectoryObject(&Handle, 
                                            0xF000F, 
                                            &ObjectAttributes);
  if (NT_ERROR(status)) {
    return status;
  }
  status = ObpSetDeviceMap(Token->LogonSession, Handle);
  if (NT_ERROR(status)) {
    return status;
  }
  *ppDeviceMap = Token->LogonSession->pDeviceMap;
  return STATUS_SUCCESS;
}

One thing you might notice is that the object directory is created using the ZwCreateDirectoryObject system call. An important security detail of using a Zw system call in the kernel is that it disables security access checking unless the optional OBJ_FORCE_ACCESS_CHECK flag is set in the OBJECT_ATTRIBUTES, which isn't the case here.

Bypassing access checking is necessary for this code to function correctly; let’s look at the access control of the \Sessions\0\DosDevices directory.

PS> Format-NtSecurityDescriptor -Path \Sessions\0\DosDevices -Summary
<Owner> : BUILTIN\Administrators
<Group> : NT AUTHORITY\SYSTEM
<DACL>
NT AUTHORITY\SYSTEM: (Allowed)(ObjectInherit, ContainerInherit)(Full Access)
BUILTIN\Administrators: (Allowed)(ObjectInherit, ContainerInherit)(Full Access)
CREATOR OWNER: (Allowed)(ObjectInherit, ContainerInherit, InheritOnly)(GenericAll)

The directory cannot be written to by a non-administrator user, but as this code is called in the security context of the user, it needs to disable access checking to create the directory because it can't be sure the user is an administrator. Importantly, the access control of the directory has an inheritable entry for the special CREATOR OWNER group granting full access. This placeholder is automatically replaced by the assigned owner of the access token used during object creation.

Therefore, even though access checking has been disabled, the final directory that's created can still be accessed by the caller. This also explains how the UAC administrator DOS device object directory blocks access to the limited user: the administrator token is created with the local administrators group set as its owner, and so that's what CREATOR OWNER is replaced with. However, the limited user can only set their own SID as the owner, so the directory just grants access to that user.
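The CREATOR OWNER substitution can be sketched as a toy model (illustrative Python; the real logic lives in the kernel's security descriptor assignment code):

```python
def inherit_aces(parent_aces, owner_sid):
    """Compute a new object's ACEs from the parent's inheritable ones.
    The special CREATOR OWNER placeholder is replaced with the owner
    SID of the access token used to create the object."""
    return [
        (owner_sid if sid == "CREATOR OWNER" else sid, access)
        for sid, access, inheritable in parent_aces
        if inheritable
    ]

# Inheritable ACEs on \\Sessions\\0\\DosDevices, per the listing above.
PARENT = [
    ("NT AUTHORITY\\SYSTEM", "Full Access", True),
    ("BUILTIN\\Administrators", "Full Access", True),
    ("CREATOR OWNER", "GenericAll", True),
]

# A UAC admin token's owner is the Administrators group, so only
# administrators get full access to the new directory.
assert ("BUILTIN\\Administrators", "GenericAll") in inherit_aces(
    PARENT, "BUILTIN\\Administrators")

# A limited token's owner can only be the user's own SID, so the
# directory it creates is accessible to that user.
assert ("DOMAIN\\user", "GenericAll") in inherit_aces(PARENT, "DOMAIN\\user")
```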

How is this useful? I noticed a long time ago that this behavior is a potential UAC bypass; in fact it's a potential EoP, but a UAC bypass was the most likely outcome. Specifically, it's possible to get a handle to the access token for the administrator user by calling NtQueryInformationToken with the TokenLinkedToken information class. For security reasons this token is limited to the SecurityIdentification impersonation level, so it can't be used to grant access to any resources.

However, if you impersonate the token and open the \?? directory, the kernel will call SeGetTokenDeviceMap using the identification token, and if the DOS device object directory doesn't currently exist it'll use ZwCreateDirectoryObject to create it. As access checking is disabled, the creation will still succeed; however, once it's created the kernel will perform an access check on the directory itself, which will fail due to the identification-level token being impersonated.

This might not seem to get us very much: while the directory is created, it'll use the owner from the identification token, which would be the local administrators group. But we can change the token's owner SID to the user's SID before impersonation, as that's a permitted operation. Now the final DOS device object directory will be owned by the user and can be written to. As there's only a single logon session used for the administrator side of UAC, any elevated process can now have its C: drive hijacked.

There's just one problem with this as a UAC bypass: I could never find a scenario where the limited user got code running before any administrator process was created. Once the process was created and running, it was almost certain that some code would open a file and therefore access the \?? directory. By the time the limited user had control, the DOS device object directory had already been created and assigned the expected access control. Still, as UAC is not a security boundary there was no point reporting it, so I filed this behavior away for another day in case it ever became relevant.

Bypassing Administrator Protection

Fast forward to today, and along comes Administrator Protection. For reasons of compatibility, Microsoft made sure that calling NtQueryInformationToken with the TokenLinkedToken information class still returns an identification-level handle to the administrator token. In this case it's the shadow administrator's token instead of the administrator version of the user's token. A crucial difference is that while for UAC this token is the same every time, in Administrator Protection the kernel calls into the LSA and authenticates a new instance of the shadow administrator. This results in every token returned from TokenLinkedToken having a unique logon session, which does not yet have a DOS device object directory created, as can be seen below:

PS> $t = Get-NtToken -Linked
PS> $auth_id = Get-NtTokenId -Authentication -Token $t
PS> $auth_id
LUID
----
00000000-01C23BB3

PS> Get-NtDirectory "\Sessions\0\DosDevices\$auth_id"
Get-NtDirectory : (0xC0000034) - Object Name not found.

While in theory we can now force the creation of the DOS device object directory, unfortunately this doesn't help us much. As the UAC service also uses TokenLinkedToken to get the token it creates new processes with, every administrator process, whether currently running or created in the future, has its own logon session. The processes therefore don't share DOS device object directories, and we can't hijack their C: drives using the token we queried in our own process.

To exploit this we'd need to use the token of an actual running process. This is possible because an elevated process can be created suspended. With the suspended process we can open its token for read access, duplicate it as an identification token, then create the DOS device object directory while impersonating it. The process can then be resumed with its hijacked C: drive.

There are only two problems with this as a bypass. First, creating an elevated process suspended will require clicking through an elevation prompt. For UAC with auto-elevation this wasn't a problem, but Administrator Protection will always prompt, and showing a prompt isn't considered to be crossing the security boundary. There are ways around this; for example, the UAC service exposes the RAiProcessRunOnce API which will run an elevated binary silently. The only problem is the process isn't suspended, so you'd have to win a race condition to open the process and perform the bypass before any code runs in it. This should be doable, say by playing with thread priorities to prevent the new process' main thread from being scheduled.

The second issue seems more of a deal breaker. When setting the owner for an access token, you can only set a SID that's either the user SID for the token or a member group that has the SE_GROUP_OWNER flag set. The only group with the owner flag is the local administrators group, and of course the shadow administrator's SID differs from the limited user's. Therefore setting either of these SIDs as the owner doesn't help us when it comes to accessing the directory after creation.
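The owner check described above can be modeled as follows (SE_GROUP_OWNER is the real flag value from the Windows SDK; the SIDs and logic are illustrative Python, not kernel code):

```python
SE_GROUP_OWNER = 0x00000008  # group attribute flag from the Windows SDK

def can_set_owner(new_owner, user_sid, groups):
    """Model of the token owner check: the owner can only be set to
    the token's own user SID, or to a member group carrying the
    SE_GROUP_OWNER attribute."""
    if new_owner == user_sid:
        return True
    return any(sid == new_owner and (attrs & SE_GROUP_OWNER)
               for sid, attrs in groups)

GROUPS = [("BUILTIN\\Administrators", SE_GROUP_OWNER),
          ("BUILTIN\\Users", 0)]

# For the shadow admin token, its own user SID or the admin group
# are valid owners...
assert can_set_owner("SHADOW-ADMIN-SID", "SHADOW-ADMIN-SID", GROUPS)
assert can_set_owner("BUILTIN\\Administrators", "SHADOW-ADMIN-SID", GROUPS)
# ...but the limited user's SID is neither, so it can't be assigned.
assert not can_set_owner("DOMAIN\\user", "SHADOW-ADMIN-SID", GROUPS)
```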

It turns out this isn't a problem, as I wasn't telling the whole truth about the owner assignment process. When building the access control for a new object, the kernel doesn't trust the impersonation token if it's at identification level. This is for a good security reason: an identification token is not supposed to be usable to make access control decisions, so it makes no sense to use its owner when creating the object. Instead the kernel uses the primary token of the process to make that decision, and so the assigned owner is the limited user's SID. In fact, setting the owner SID for the UAC bypass was never necessary; it was never used. You can verify this behavior by creating an object without a name, so that it can be created while impersonating an identification token, and checking the assigned owner SID:

PS> $t = Get-NtToken -Anonymous
# Impersonate anonymous token and create directory
PS> $d = Invoke-NtToken $t { New-NtDirectory }
PS> $d.SecurityDescriptor.Owner.Sid.Name
NT AUTHORITY\ANONYMOUS LOGON
# Impersonate at identification level
PS> $d = Invoke-NtToken $t -ImpersonationLevel Identification {
      New-NtDirectory
}
PS> $d.SecurityDescriptor.Owner.Sid.Name
DOMAIN\user

One final question you might have: why doesn't creating a process with the shadow admin's token end up accessing some file resource via a DOS drive as that user, causing the DOS device object directory to be created? The implementation of the CreateProcessAsUser API runs all its code in the security context of the caller, regardless of what access token is being assigned, so by default it would never open a file under the new logon session.

However, if you know how to securely create a process in a system service, you'll know that you're supposed to impersonate the new token over the call to CreateProcessAsUser to ensure you don't allow a user to create a process from an executable file they can't access. The UAC service does this correctly, so surely it must have accessed a drive while creating the process, and the DOS device object directory should have been created. Why wasn't it?

In a small irony, what's happening is that the UAC service is tripping over a recently introduced security mitigation designed to prevent the hijack of the C: drive when impersonating a low-privileged user in a system service. This mitigation kicks in if the caller of a system call is the SYSTEM user and it's trying to access the C: drive. It was added by Microsoft in response to multiple vulnerabilities in manifest file parsing; if you want an overview, here's a video of the talk Maddie Stone and I gave at OffensiveCon 23 describing some of the attack surface.

It just so happens that the UAC service runs as SYSTEM, and as long as the elevated executable is on the C: drive, which is very likely, the mitigation ignores the impersonated token's DOS device object directory entirely. Thus SeGetTokenDeviceMap never gets called, and the first time a file is accessed under the logon session is once the process is up and running. As long as we can perform the exploit before the new process touches a file, we can create the DOS device object directory and redirect the process' C: drive.
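The mitigation's effect can be modeled roughly as follows (a simplification of the behavior just described; the return values are illustrative labels, not real kernel identifiers):

```python
def dos_directory_used(caller_is_system: bool, drive: str) -> str:
    """Rough model of which DOS device directory a drive lookup
    consults while impersonating: the mitigation makes a SYSTEM
    caller skip the impersonated token's per-session directory for
    the system drive and go straight to \\GLOBAL??."""
    if caller_is_system and drive == "C:":
        return "\\GLOBAL??"
    return "per-logon-session"

# The UAC service runs as SYSTEM and launches the executable from C:,
# so it never touches the shadow admin's per-session directory.
assert dos_directory_used(True, "C:") == "\\GLOBAL??"
# The elevated process itself isn't SYSTEM; its first file access
# will create and use the per-session directory we want to control.
assert dos_directory_used(False, "C:") == "per-logon-session"
```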

To conclude, the steps to exploit this bypass are as follows:

  1. Spawn a shadow admin process through RAiProcessRunOnce, which will run runonce.exe from the C: drive.
  2. Open the new process before it has accessed a file resource, and query the primary token.
  3. Duplicate the token to an identification token.
  4. Force the DOS device object directory to be created while impersonating the shadow admin token. This can be done by opening \?? through a call to NtOpenDirectoryObject.
  5. Create a C: drive symlink in the new DOS device directory to hijack the system drive.
  6. Let the process run and wait for a redirected DLL to be loaded.

Final Thoughts

The bypass was interesting because it's hard to point to the specific bug that causes it. The vulnerability is a result of five separate OS behaviors:

  • The Administrator Protection changes to the TokenLinkedToken query generate a new logon session for every shadow admin token.
  • The per-token DOS device directory is lazily initialized for each new logon session, meaning that when the linked token is first created the directory does not yet exist.
  • The kernel creates the DOS device directory when it’s accessed by using Zw functions, which disables access checking. This allows a limited user to impersonate the shadow admin token at identification level and create the directory by opening \??.
  • If a thread impersonates a token at identification level any security descriptor assignment takes the owner SID from the primary token, not the impersonation token. This results in the limited user being granted full access to the shadow admin token’s DOS device object directory.
  • The DOS device object directory isn’t already created once the low-privileged user gets access to the process token because of the security mitigation which disables the impersonated DOS device object directory when opening files from the C: drive in a SYSTEM process.

I don’t necessarily blame Microsoft for not finding this issue during testing. It’s a complex vulnerability with many moving pieces. It’s likely I only found it because I knew about the weird behavior when creating the DOS device object directory.

The fix Microsoft implemented was to prevent creating the DOS device object directory when impersonating a shadow administrator token at identification level. As this fix was added into the final released build as part of the optional update KB5067036 it doesn’t have a security bulletin associated with it. I would like to thank the Administrator Protection team and MSRC for the quick response in fixing all the issues and demonstrating that this feature will be taken seriously as a security boundary. I’d also like to thank them for providing additional information such as the design document which aided in the research.

As for my views on Administrator Protection as a feature, I feel that Microsoft have not been as bold as they could have been. Making small tweaks to UAC resulted in carrying along the almost 20 years of unfixed bypasses which manifest as security vulnerabilities in the feature. What I would have liked to have seen was something more configurable and controllable, perhaps a proper version of sudo or Linux capabilities where a user can be granted specific additional access for certain tasks.

I guess app compatibility is ultimately the problem here; Windows isn't designed for such a radical change. I'd have also liked to have seen this as a separate configurable mode rather than a complete replacement for admin-approval. That way a sysadmin could choose when people are opted in to the new model rather than requiring everyone to use it.

I do think it improves security over admin-approval UAC assuming it becomes enabled by default. It presents a more significant security boundary that should be defendable unless more serious design issues are discovered. I expect that malware will still be able to get administrator privileges even if that’s just by forcing a user to accept the elevation prompt, but any silent bypasses they might use should get fixed which would be a significant improvement on the current situation. Regardless of all that, the safest way to use Windows is to never run as an administrator, with any version of UAC. And ideally avoid getting malware on your machine in the first place.

A 0-click exploit chain for the Pixel 9 Part 3: Where do we go from here?

14 January 2026 at 19:01

While our previous two blog posts provided technical recommendations for increasing the effort required by attackers to develop 0-click exploit chains, our experience finding, reporting and exploiting these vulnerabilities highlighted some broader issues in the Android ecosystem. This post describes the problems we encountered and recommendations for improvement.

Audio Attack Surface

The Dolby UDC is part of the 0-click attack surface of most Android devices because of audio transcription in the Google Messages application. Incoming audio messages are transcribed before a user interacts with the message. On the Pixel 9, a second process, com.google.android.tts, also decodes incoming audio. Its purpose is not completely clear, but it seems to be related to making incoming messages searchable.

Both processes decode audio using all decoders available on the device, including the UDC, which is integrated by the OEMs of most devices, though the bulk of incoming messages use a small number of audio formats. In particular, it is very unlikely that an incoming message will contain audio in formats supported by the Dolby UDC, as Android devices do not provide encoders for these formats, and they are mostly used by commercial media, such as movies and TV shows. Removing the UDC and other uncommonly-used decoders from the 0-click attack surface of Android would protect users from the worst consequences of vulnerabilities in these codecs.

The explosion of AI-powered features on mobile phones has the potential to greatly increase their 0-click attack surface. While this trade-off can sometimes benefit users, it is important for mobile vendors to be aware of the impact on security. It is not uncommon for software changes to unintentionally increase the amount of code that can be exercised by attackers remotely. Ongoing review of how new features affect 0 and 1-click attack surfaces coupled with deliberate decisions are necessary to protect users.

Bug Discovery Time Frames

One surprising aspect of this research was how quickly we found both vulnerabilities used in the exploit chain. Project Zero reviewed the Dolby UDC as a part of a one-week team hackathon, and it took less than two days for Ivan to find CVE-2025-54957. Likewise, Seth found CVE-2025-36934 after less than one day of reviewing the BigWave driver.

Of course, it's easy to forget the effort that went into finding these attack surfaces: the Dolby hackathon required roughly three weeks of preparation to study the entry points of the codec and set up tooling to debug it, and likewise, reviewing the BigWave driver involved a driver analysis tool that took roughly four weeks to develop. We also reviewed other audio codecs with mixed results before reviewing the Dolby UDC.

Still, the time investment required to find the necessary vulnerabilities was small compared to the impact of this exploit, especially for the privilege escalation stage. Moreover, a lot of the time we spent finding the UDC bug was a one-time cost that we expect to enable future research. The time needed to find the bugs for a 0-click exploit chain on Android can almost certainly be measured in person-weeks for a well-resourced attacker.

Android has invested a fair amount in the security of media codecs through vulnerability rewards programs and by fuzzing them with tools like OSS-Fuzz. While it is unlikely that fuzzing would have uncovered this particular UDC bug, as far as we know, Pixel’s fuzzing efforts do not cover the UDC. Gaps in vendors’ understanding of their attack surface are a common source of 0-click vulnerabilities. While bugs occur in heavily-secured components, it can be easier for attackers to focus on areas that are overlooked. Android and OEMs could benefit from a rigorous analysis of their 0-click attack surfaces, and comprehensive efforts to fuzz and review them.

Drivers, on the other hand, continue to be a ‘soft target’ on Android. While Android, and its upstream driver vendors such as Samsung, Qualcomm, ARM and Imagination have made some efforts to improve driver security, they have been outpaced by attackers’ ability to find and exploit these bugs. Google’s Threat Intelligence Group (GTIG) has detected and reported 16 Android driver vulnerabilities being used by attackers in the wild since 2023. Driver security remains an urgent problem affecting Android’s users that will likely require multiple approaches to improve. Rewriting the most vulnerable drivers in managed languages such as Rust, performing consistent security reviews on new drivers, reducing driver access from unprivileged contexts and making driver code more easily updatable on Android devices are likely all necessary to counter attackers’ extensive capabilities in this area.

Ease of Exploitability

We estimate that exploiting the Dolby UDC vulnerability in the exploit chain took eight person-weeks, and exploiting the BigWave driver vulnerability took three weeks for a basic proof-of-concept. This is not a lot of time considering the vast capabilities this type of exploit chain gives attackers. While many Android security features increased the challenge we faced in exploiting these issues, we were also surprised by two mitigations that did not provide their documented protection.

The Dolby UDC decoder process on the Pixel 9 lacked a seccomp policy, though this policy is implemented in AOSP and several other Android 16 devices we tested. If the policy in AOSP had been enforced on the Pixel 9, it likely would have added at least one person-month to the time spent developing this exploit. For security features to be effective, it is important that they are verified on a regular basis, ideally for every release, otherwise it is possible that regressions go unnoticed.

We also discovered that kASLR is not effective on Pixel devices, due to a problem that has been known since 2016, detailed in this blog post. Both Android and Linux made a decision to deprioritize development work that would have restored its effectiveness. This decision made exploiting the BigWave vulnerability easier; we estimate it would have taken roughly six weeks longer to exploit this vulnerability with effective kASLR, and with the additional time required, we may not have pursued it.

It is also notable that we have not been able to successfully exploit the Dolby UDC vulnerability on Mac or iPhone so far, as it was compiled with the -fbounds-safety compiler flag, which added a memory bounds check that prevents the bug from writing out of bounds. Dolby should consider providing such compiler-based protections across all platforms. Apple also recently implemented MIE, a hardware-based memory-protection technology similar to the Memory Tagging Extension (MTE), on new devices. While MIE would not prevent the Dolby UDC vulnerability from being exploited in the absence of -fbounds-safety, due to the UDC using a custom allocator, it would probabilistically hinder an iOS kernel vulnerability similar to the BigWave driver bug from being exploitable.

Pixel 8 onwards shipped with MTE, but unfortunately, the feature has not been enabled except for users who opt into Advanced Protection mode, to the detriment of Pixel’s other users. Apple’s inclusion of memory protection features, despite their financial and performance cost, clearly paid off with regards to protecting its users from the UDC exploit as well as possible kernel privilege escalation. There is the potential to protect Android users similarly.

Another remarkable aspect of this exploit chain is how few bugs it contains. Gaining kernel privileges from a 0-click context required only two software defects. Longer exploit chains are typically required on certain platforms because of effective sandboxing and other privilege limitation features. To bypass these, attackers need to find multiple bugs to escalate privileges through multiple contexts. This suggests potential sandboxing opportunities on Android, especially with regards to reducing the privileges of the frequently-targeted media decoding processes.

Patch Timeframe

Both vulnerabilities in this exploit chain were public and unfixed on Pixel for some time. The UDC vulnerability was reported to Dolby on June 26, 2025, and the first binary fixes were pushed to ChromeOS on September 18, 2025. Pixel shared with us that they did not receive binary patches from Dolby until October 8, 2025. We disclosed the bug publicly on October 15, 2025, after 30 days patch adoption time, as per Project Zero’s disclosure policy. Samsung was the first mobile vendor to patch the vulnerability, on November 12, 2025. Pixel did not ship a patch for the vulnerability until January 5, 2026.

It is alarming that it took 139 days for a vulnerability exploitable in a 0-click context to get patched on any Android device, and it took Pixel 54 days longer. The vulnerability was public for 82 days before it was patched by Pixel.

One cause of the slow fix time was likely Dolby’s advisory. We informed Dolby that this issue was highly exploitable when we filed the bug, and provided status updates, including technical details of our exploit, as the work progressed. Despite this, the advisory describes the vulnerability’s impact as follows:

We are aware of a report found with Google Pixel devices indicating that there is a possible increased risk of vulnerability if this bug is used alongside other known Pixel vulnerabilities. Other Android mobile devices could be at risk of similar vulnerabilities.

This is not an accurate assessment of the risk this vulnerability poses. As shown in Part 1 of this blog post, the vulnerability is exploitable on its own, with no additional bugs. Dolby is likely referring to the fact that additional vulnerabilities are required to escalate privileges from the mediacodec context on Android, but almost all modern vulnerabilities require this, and we informed them that there is strong evidence that exploit vendors have access to kernel privilege escalation vulnerabilities on most Android devices. No other vendor we’ve encountered has described a vulnerability allowing code execution in a sandboxed context as requiring the bug to be “used alongside other known […] vulnerabilities.”

Dolby’s advisory also says:

For other device classes, we believe the risk of using this bug maliciously is low and the most commonly observed outcome is a media player crash or restart.

We believe this understates the risk of the vulnerability to other platforms. It’s difficult to determine the “risk of [attackers] using this bug maliciously”; even well-resourced threat analysis teams like GTIG have difficulty determining this for a particular bug with any accuracy. Moreover, “most commonly observed outcome is a media player crash or restart” is true of even the most severe memory corruption vulnerabilities. This is why most security teams classify vulnerabilities based on the maximum access an attacker could achieve with them. Except on Apple devices, where the UDC is compiled with -fbounds-safety, this bug enables code execution in the context that the UDC runs in. The impact of this bug on users is also platform-dependent; for example, it presents a higher risk on Android, where untrusted audio files are processed without user interaction, than on a smart TV which only plays audio from a small number of trusted streaming sources. But this doesn’t change the fact that an attacker can generally achieve code execution by exploiting this bug. Ideally, Dolby would have provided its integrators with this information, and allowed them to make risk decisions depending on how they use and sandbox the UDC.

It’s not clear what information Dolby provided Android and Pixel, but Android publishes its priority matrix here. Since mediacodec is considered a constrained context, when we reported it, the UDC bug fell into the category of “remote arbitrary code execution in a constrained context”, and it was rated Moderate. Conversely, Samsung rated this bug as Critical. Android shared with us they recently updated their priority matrix, and future vulnerabilities of this type will be classified as Critical.

We reported the BigWave vulnerability to Pixel on June 20, 2025, and it was also rated Moderate. As per the matrix above, “Local arbitrary code execution in a privileged context, the bootloader chain, THB, or the OS kernel” makes this bug High base severity, but the severity modifier “Requires running as a privileged context to execute the attack” was applied. While the modifier text states a “privileged context”, our experience is that the modifier is frequently applied to vulnerabilities that are not directly accessible from an unprivileged context, including those accessible from constrained contexts like mediacodec. The severity was changed to High on September 18, 2025, and a fix was shipped to devices on January 6, 2026. We shared the bug publicly after 90 days, on September 19, 2025, in accordance with our disclosure policy.

While different software vendors and projects have different philosophies with regards to vulnerability prioritization, deprioritizing both of these bugs left users vulnerable to a 0-click exploit chain. Some vendors make bugs in 0-click entrypoints high priority, while others choose to prioritize bugs in the sandboxes that isolate these entrypoints. There are benefits and downsides to each approach, but vendors need to prioritize at least one bug in the chain in order to provide users with a basic level of protection against 0-click exploits.

This type of diffusion of responsibility isn’t uncommon in vulnerability management. Series of bugs that can be combined to cause severe user harm are often individually deprioritized, and codec vendors like Dolby often consider it largely the platform’s responsibility to mitigate the impact of memory corruption vulnerabilities, while platforms like Android rely too heavily on their supply chain being bug-free. Developers of software with the best security posture tend to take the stance that all external software should be considered compromised, and invest in protecting against this eventuality. This and other defense-in-depth approaches are what make exploit chains difficult for attackers, and have the best chance of protecting users.

Patch Propagation

Even though the Dolby UDC vulnerability was eventually patched by Pixel, it will take some time for all other Android users to receive an update. This is because mobile updates are gated on a variety of factors, including carrier approval, and not every OEM provides security updates in a timely manner, if at all.

Android has a mechanism, called APEX, to update specific system libraries in a way that circumvents this process. Libraries packaged with APEX can be updated by Google directly through the Google Play Store, leading to a much faster update cycle. Since the UDC does not ship as part of Android, it does not have this capability, though that could change with significant licensing and shipping-ownership changes.

Conclusion

It’s easy to look at a 0-click exploit chain like the one we developed and see a unique technical feat, when what it really reveals is capabilities currently available to many attackers. While developing the exploit was time-consuming, and required certain technical knowledge, it involved nothing that isn’t achievable with sufficient investment. All considered, we were surprised by how small that investment turned out to be.

It can also be tempting to see this exploit as a series of esoteric, difficult-to-detect errors, but there are actions that can reduce the risk of such exploits, including analysis and reduction of 0-click attack surface, consistent testing of security mitigations, rapid patching and investment in memory mitigations.

Most humans alive today trust their privacy, financial well-being and sometimes personal safety to a mobile device. Many measures are available that could protect them against the most dangerous adversaries. Vendors should take action to reduce the risk of memory-corruption vulnerabilities to the platform and deliver security patches to users in a reasonable timeframe.

A 0-click exploit chain for the Pixel 9 Part 2: Cracking the Sandbox with a Big Wave

14 January 2026 at 19:00

With the advent of a potential Dolby Unified Decoder RCE exploit, it seemed prudent to see what kind of Linux kernel drivers might be accessible from the resulting userland context, the mediacodec context. As per the AOSP documentation, the mediacodec SELinux context is intended to be a constrained (a.k.a. sandboxed) context where non-secure software decoders are utilized. Nevertheless, using my DriverCartographer tool, I discovered an interesting device driver, /dev/bigwave, that was accessible from the mediacodec SELinux context. BigWave is hardware present on the Pixel SoC that accelerates AV1 decoding tasks, which explains why it is accessible from the mediacodec context. As previous research has copiously affirmed, Android drivers for hardware devices are prime places to find powerful local privilege escalation bugs. The BigWave driver was no exception: across a couple of hours of auditing the code, I discovered three separate bugs, including one that was powerful enough to escape the mediacodec sandbox and get kernel arbitrary read/write on the Pixel 9.

The (Very Short) Bug Hunt

The first bug I found was a duplicate that was originally reported in February of 2024 but remained unfixed at the time of re-discovery in June of 2025, over a year later, despite the bugfix being a transposition of two lines of code. The second bug presented a really fascinating bug class, analogous to the double-free kmalloc exploitation primitive, but with a different linked list entirely. However, it was the third bug I discovered that created the nicest exploitation primitive. Fixes were made available for all three bugs on January 5, 2026.

The Nicest Bug

Every time the /dev/bigwave device is opened, the driver allocates a new kernel struct called inst which is stored in the private_data field of the fd. Within the inst is a sub-struct called job, which tracks the register values and status associated with an individual invocation of the BigWave hardware to perform a task. In order to submit some work to the bigo hardware, a process uses the ioctl BIGO_IOCX_PROCESS, which fetches BigWave register values from the ioctl caller in AP userland, and places the job on a queue that gets picked up and used by a separate thread, the bigo worker thread. That means that an object whose lifetime is inherently bound to a file descriptor is transiently accessed on a separate kernel thread that isn’t explicitly synced to the existence of that file descriptor.

During BIGO_IOCX_PROCESS ioctl handling, after submitting a job to get executed on bigo_worker_thread, the ioctl call enters wait_for_completion_timeout with a timeout of 16 seconds, waiting for bigo_worker_thread to complete the job. After those 16 seconds, if bigo_worker_thread has not signaled job completion, the timeout period ends and the ioctl dequeues the job from the priority queue. However, if a sufficient number of previous jobs were stacked onto the bigo_worker_thread, it is possible that bigo_worker_thread was so delayed that it has only just dequeued, and is concurrently processing, the very job that the ioctl has considered to have timed out and is trying to dequeue.

The syscall context in this case simply returns back to userland, and if at this point userland closes the fd associated with the BigWave instance, the inst (and thus the job) is destroyed while bigo_worker_thread continues to reference the job.

The comments below indicate the accesses to the UAF’d object:

static int bigo_worker_thread(void *data)
{
	...

	while(1) {
		rc = wait_event_timeout(core->worker,
			dequeue_prioq(core, &job, &should_stop),
			msecs_to_jiffies(BIGO_IDLE_TIMEOUT_MS)); //The job is fetched from the queue
		...

		inst = container_of(job, struct bigo_inst, job); //The job is an inline struct inside of the inst which gets UAF'd

		...

		rc = bigo_run_job(core, job);

		...
		job->status = rc;
		complete(&inst->job_comp);
	}
	return 0;
}

...

static int bigo_run_job(struct bigo_core *core, struct bigo_job *job)
{
	...

	inst = container_of(job, struct bigo_inst, job);
	bigo_bypass_ssmt_pid(core, inst->is_decoder_usage);
	bigo_push_regs(core, job->regs); //The register values of the bigwave processor are set (defined by userland)
	bigo_core_enable(core);
	ret = wait_for_completion_timeout(&core->frame_done,
			msecs_to_jiffies(core->debugfs.timeout)); //pause for 1 second
	...
        //At this point inst/job have been freed
	bigo_pull_regs(core, job->regs); //A pointer is taken directly from the freed object
	*(u32 *)(job->regs + BIGO_REG_STAT) = status;
	if (rc || ret)
		rc = -ETIMEDOUT;
	return rc;
}
void bigo_pull_regs(struct bigo_core *core, void *regs)
{
	memcpy_fromio(regs, core->base, core->regs_size); //And the current register values of the bigwave processor are written to that location
}

By spraying attacker-controlled kmalloc allocations (for example, via Unix domain socket messages) we can control the underlying UAF pointer job->regs, so we can control the destination of our write. Additionally, since we set the registers at the beginning of execution, by setting them in such a way that the BigWave processor does not execute at all, we can ensure that the end register state is nearly identical to the original register state, and hence we can control what is written as well. And just like that, we have a half-decent 2144-byte arbitrary write! And all without leaking the KASLR slide!

Defeating KASLR (by doing nothing at all)

Exploiting this issue with KASLR enabled would normally involve reallocating some other object over the bigo inst with a pointer at the location of inst->job.regs, leading to memory corruption of the object pointed to by that overlapped pointer. That would require finding some allocatable object with a pointer at that location, and also finding a way to take advantage of being able to overwrite the sub-object. Finding such an object is difficult but not impossible, especially if you consider cross-cache attacks. It is, however, quite tedious and is not really my idea of a fun time. Thankfully, I found a much simpler strategy which essentially allows the generic bypass of KASLR on Pixel in its entirety, the details of which you can read about in my previous blog post. The end result of that sidequest is the discovery that instead of needing to leak the KASLR base, you can just use 0xffffff8000010000 instead, particularly when it comes to overwriting .data in the kernel. This dramatically simplifies the exploit, and substantially improves the exploit’s potential reliability.

Creating an arbitrary read/write

At this point, I have a mostly-arbitrary write primitive anywhere in kernel .data; I have an aliased location for, and can modify, any kernel globals I want. However, the complete call at the end of the bigo_worker_thread job execution loop serves to complicate exploitation a little bit. complete calls swake_up_locked, which performs a set of list operations on a list_head node inside of the bigo inst:

static inline int list_empty(const struct list_head *head)
{
	return READ_ONCE(head->next) == head;
}

void swake_up_locked(struct swait_queue_head *q) //The q is located at &inst->job_comp.wait (so attacker controlled)
{
	struct swait_queue *curr;

	if (list_empty(&q->task_list))
		return;

	curr = list_first_entry(&q->task_list, typeof(*curr), task_list);
	wake_up_process(curr->task);
	list_del_init(&curr->task_list);
}

While the first list_empty call would be the simplest to forge, it would also require knowing the location of the inst in kernel memory, as q is an inline struct inside of inst. Unfortunately, our KASLR bypass does not give us this, nor is it particularly easy to acquire, as the inst is in kernel heap, not kernel .data.

That means we need to instead forge a valid list entry for the q to point to, as well as know the location of a task to pass to wake_up_process(). Finally, we need to actually forge enough of a list to survive a list_del_init on an entry in the q->task_list, which involves list nodes, and second list nodes that point to the first list node. This might sound quite difficult to forge given the limitation we’ve previously noted about our KASLR bypass, but in fact, it’s not so bad, since our arbitrary write has already happened by this point, so we know the location of memory that we control somewhere in kernel .data. This means we can forge arbitrary list nodes within that space in .data, and we can place pointers to those future forged list nodes in the original heap spray we use to replace the inst.

We ALSO know the location of a single task struct in the kernel virtual address space: the init task! init’s task struct is in the kernel .data, so we can reference it through the linear map. A spurious wake_up_process on the init_task will be entirely inconsequential while avoiding a crash. You can see the code to set up these linked list nodes in setup_linked_list in the exploit.

With that roadblock resolved, it’s time to figure out what in .data to target with our arbitrary write. Our goal is to change our unreliable arbitrary write of 2144 bytes to a reliable arbitrary read/write that causes significantly less collateral damage to the memory around it. I decided to try reimplementing the strategy I reversed from an ITW exploit a couple of years ago. This technique involves creating a type confusion by replacing some of the VFS/fops handlers in the ashmem_misc data structure with other VFS handlers for other file types. In fact, because of CFI you cannot replace the handler function pointers with pointers to just any location in the kernel .text. You must replace the VFS handlers with other VFS handlers. Rather conveniently, however, I can use configfs VFS handlers for my exploit, just like the ITW exploit. The final layout of the fops table and private_data of the struct file looks like this:

The remaining ashmem fops handlers access the private_data structure as a struct ashmem_area, or asma, while the swapped-in configfs fops handlers access the same private_data structure as a configfs buffer. For the configfs fops handlers, the memory pointed to by page will be accessed; that is where we will want our arbitrary read/write to read or write. We will set our target using the ASHMEM_SET_NAME ioctl.

One additional complication, however, is that the linear mapping of the kernel .text is not executable, so I can’t use linear-map addresses into the .text region for the VFS handlers when forging my ashmem_misc data structure. In practice, it’s not particularly difficult to leak the actual KASLR slide. Before targeting ashmem_misc, I first use my arbitrary write to target the sel_fs_type object in the kernel .data. This structure has a string, name, that is printed when reading /proc/self/mounts. By replacing that string pointer using my arbitrary write, and then reading /proc/self/mounts, I can turn my unreliable arbitrary write into an arbitrary read instead! Using this arbitrary read, I can read the ashmem_fops structure (also through the linear map), which gives me pointers at an offset from the kernel base, allowing me to calculate the KASLR slide.

I then perform my arbitrary write again to overwrite the ashmem_misc structure with a pointer to a new forged ashmem_fops table that I construct at the same time - such is the perk of overwriting far more data than I need.

However, the astute among you may have realized that this massive 2144-byte arbitrary write has a major drawback too: such a large write will clobber all of the data surrounding whatever I’m actually targeting, which could lead to all sorts of extraneous crashes and kernel panics. In practice, spurious crashing can occur, but the phone is surprisingly quite stable. My experience was that it seemed to crash upon toggling the wifi on/off, but otherwise the phone seems to work mostly fine.

Once the forged ashmem_misc structure has been inserted, we now have a perfectly reliable arbitrary read/write, albeit with the phone extraneously crashing sometimes. Upon getting arb read/write, I set SELinux to permissive (just flip the flag in the selinux_state kernel object), fork off a new process, then use my arb read/write to point the new process’s task creds to init_cred. At this point, I now have a process with root credentials, and SELinux disabled.

Integrating into the Dolby exploit

Combining two exploits into one chain requires a fair amount of engineering effort from both exploits. The Dolby exploit will be delivering the BigWave exploit as a shellcode payload (patched into the process using /proc/self/mem), so I need to convert my exploit to work as a binary blob. It also needs to be much smaller than my static compilation environment supported. The lowest-hanging fruit was to remove the static libc requirement and have the exploit include wrappers for all the syscalls and libc functions it needs. When I set about completing this rather tedious task, I realized that this is something an LLM would probably be quite good at. So instead of implementing the syscall wrappers myself, I simply copy-pasted my source code into Gemini and asked it to create the needed header file of syscall wrappers for me. Naturally, the AI-generated header file caused many compilation errors (as it surely would have if I had written it by hand). I took those compilation errors, gave them back to the same Gemini window, and asked it to amend the header file to resolve them. The amended header file caused gcc to emit whole new and exciting compilation failures, but the errors looked different than before, so I simply repeated the process. After 4 or 5 attempts, Gemini was able to generate a header file that not only compiled, it worked perfectly. This provides some insight into how attackers might be able to use (or, more likely, are already using) LLMs to make their exploit process more efficient.

This effort results in a much smaller ELF than before (7 KB instead of 500 KB), but just an ELF is not enough: I need the generated blob to work if the Dolby exploit simply starts executing from the top of the shellcode. The good news, however, is that my exploit can operate entirely without a linker; all that is necessary is to prepend a jump to the ELF that sets the PC to the entrypoint. I also include “-mcmodel=tiny -fPIC -pie” in the gcc arguments so that the generated code will work agnostic to the shellcode’s location or alignment in memory.

Finalizing the exploit

Kernel arbitrary read/write is motivating enough for a security researcher to demonstrate the impact of the vulnerability, but it seemed worthwhile to create a more accessible demo to show the impact more broadly. I added code so that the exploit executed an included shell script, then wrote a shell script that took a picture and sent it back to an arbitrary IP address.

In the final part of this blog series, we will discuss what lessons we learned from this research.

A 0-click exploit chain for the Pixel 9 Part 1: Decoding Dolby

14 January 2026 at 18:59

Over the past few years, several AI-powered features have been added to mobile phones that allow users to better search and understand their messages. One effect of this change is increased 0-click attack surface, as efficient analysis often requires message media to be decoded before the message is opened by the user. One such feature is audio transcription. Incoming SMS and RCS audio attachments received by Google Messages are now automatically decoded with no user interaction. As a result, audio decoders are now in the 0-click attack surface of most Android phones.

I’ve spent a fair bit of time investigating these decoders, first reporting CVE-2025-49415 in the Monkey’s Audio codec on Samsung devices. Based on this research, the team reviewed the Dolby Unified Decoder, and Ivan Fratric and I reported CVE-2025-54957. This vulnerability is likely in the 0-click attack surface of most Android devices in use today. In parallel, Seth Jenkins investigated a driver accessible from the sandbox the decoder runs in on a Pixel 9, and reported CVE-2025-36934.

As I’ve shared this research, vendors as well as members of the security community have questioned whether such vulnerabilities are exploitable, as well as whether 0-click exploits are possible for all but the most well-resourced attackers in the modern Android Security environment. We were also asked whether code execution in the context of a media decoder is practically useful to an attacker and how platforms can reduce the risks such a capability presents to users.

To answer these questions, Project Zero wrote a 0-click exploit chain targeting the Pixel 9. We hope this research will help defenders better understand how these attacks work in the wild, the strengths and weaknesses of Android’s security features with regards to preventing such attacks, and the importance of remediating media and driver vulnerabilities on mobile devices.

The exploit will be detailed in three blog posts.

Part 1 of this series will describe how we exploited CVE-2025-54957 to gain arbitrary code execution in the mediacodec context of a Google Pixel 9.

Part 2 of this series will describe how we exploited CVE-2025-36934 to escalate privileges from mediacodec to kernel on this device.

Part 3 will discuss lessons learned and recommendations for preventing similar exploits on mobile devices.

The vulnerabilities discussed in these posts were fixed as of January 5, 2026.

The Dolby Unified Decoder

The Dolby Unified Decoder component (UDC) is a library that provides support for the Dolby Digital (DD) and Dolby Digital Plus (DD+) audio formats. These formats are also known as AC-3 and EAC-3 respectively. A public specification is available for these formats. The UDC is integrated into a variety of hardware and platforms, including Android, iOS, Windows and media streaming devices. It is shipped to most OEMs as a binary ‘blob’ with limited symbols, which is then statically linked into a shared library. On the Pixel 9, the UDC is integrated into /vendor/lib64/libcodec2_soft_ddpdec.so.

The Bug

DD+ audio is processed from a bitstream, which consists of independently decodable syncframes, each representing a series of audio samples. During normal operation, the UDC consecutively decodes each syncframe from the bitstream.

One element of a syncframe is the audio block, which, according to the specification, can contain the following fields. A syncframe can contain up to 6 audio blocks.

Syntax            Number of bits
skiple            1
if(skiple) {
  skipl           9
  skipfld         skipl * 8
}

This means the decoder can copy up to 0x1FF (skipl) bytes per audio block from the bitstream into a buffer we’ll call the ‘skip buffer’.

The skip buffer contains data in a format called Extensible Metadata Delivery Format (EMDF). This format is synchronized, meaning that the UDC looks for a specific series of bytes in the skip buffer, then processes the data afterwards as EMDF. The EMDF in a single syncframe is called an ‘EMDF container’. This is represented in the specifications as:

Syntax                    Number of bits
emdf_sync() {
  syncword                16
  emdf_container_length   16
}

The EMDF syncword is ‘X8’ (0x5838).

An EMDF container is defined as follows:

Syntax                           Number of bits
emdf_container() {
  emdf_version                   2
  if (emdf_version == 3) {
    emdf_version += variable_bits(2)
  }
  key_id                         3
  if (key_id == 7) {
    key_id += variable_bits(3)
  }
  emdf_payload_id                5
  while (emdf_payload_id != 0x0) {
    if (emdf_payload_id == 0x1F) {
      emdf_payload_id += variable_bits(5)
    }
    emdf_payload_config()
    emdf_payload_size            variable_bits(8)
    for (i = 0; i < emdf_payload_size; i++) {
      emdf_payload_byte          8
    }
    emdf_payload_id              5
  }
  emdf_protection()
}

variable_bits is defined as:

Syntax                     Number of bits
variable_bits(n_bits) {
  value = 0;
  do {
    value += read            n_bits
    read_more                1
    if (read_more) {
      value <<= n_bits;
      value += (1 << n_bits);
    }
  } while (read_more);
  return value
}
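To make this concrete, here is a small compilable C transcription of the variable_bits() pseudocode above. The bitreader helper is illustrative (it is not the UDC’s internal reader); only the variable_bits logic follows the specification table.

```c
#include <stdint.h>
#include <stddef.h>

/* Minimal MSB-first bit reader over a byte buffer (illustrative helper,
 * not the UDC's internal reader). */
typedef struct {
    const uint8_t *data;
    size_t bit_pos;
    size_t bit_len;
} bitreader;

static uint64_t read_bits(bitreader *r, unsigned n) {
    uint64_t v = 0;
    while (n-- && r->bit_pos < r->bit_len) {
        uint8_t bit = (r->data[r->bit_pos >> 3] >> (7 - (r->bit_pos & 7))) & 1;
        v = (v << 1) | bit;
        r->bit_pos++;
    }
    return v;
}

/* Direct transcription of the variable_bits() pseudocode. Each read_more
 * round shifts the accumulated value up by n_bits, so the result can grow
 * without any upper bound. */
uint64_t variable_bits(bitreader *r, unsigned n_bits) {
    uint64_t value = 0;
    uint64_t read_more;
    do {
        value += read_bits(r, n_bits);
        read_more = read_bits(r, 1);
        if (read_more) {
            value <<= n_bits;
            value += (1ull << n_bits);
        }
    } while (read_more);
    return value;
}
```

With n_bits = 8, each extension round multiplies the running value by 256, so a handful of bytes in the bitstream encodes an arbitrarily large number.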

If you’ve spent time looking for vulnerabilities in this type of specification, a problem might already be apparent. There is no stated limit for the size of emdf_payload_size, meanwhile the output of variable_bits could be very large, essentially any numeric value.

Indeed, this is the root of the problem Ivan Fratric found while analyzing the Android UDC binary. In pseudocode, the decoder reads each EMDF payload into a custom ‘evo’ heap as follows:

result = read_variable_bits(this, 8, &payload_length);
if ( !result )
{
  if ( evo_heap )
  {
    buffer = ddp_udc_int_evo_malloc(evo_heap, payload_length, param.extra_len);
    outstruct.buf = buffer;
    if ( !buffer )
      return 2;
    if ( payload_length )
    {
      index = 0;
      while ( !ddp_udc_int_evo_brw_read(this, 8, &byte_read) )
      {
        outstruct.buf[index++] = byte_read;
        if ( index >= payload_length )
          goto ERROR;
      }
      return 10;
    }
  }
}

So, memory is allocated, then the bytes of the payload are copied into the allocated memory. How does this allocation work?

void *ddp_udc_int_evo_malloc(heap *h, size_t alloc_size, size_t extra)
{
  size_t total_size;
  unsigned __int8 *mem;

  total_size = alloc_size + extra;
  if ( alloc_size + extra < alloc_size )
    return 0;
  if ( total_size % 8 )
    total_size += 8 - (total_size % 8);
  if ( total_size > h->remaining )
    return 0;
  mem = h->curr_mem;
  h->remaining -= total_size;
  h->curr_mem += total_size;
  return mem;
}

The evo heap is a single slab with a single tracking pointer that is incremented when memory is allocated; there is no way to free individual allocations. The heap is only used to process the EMDF payloads of a single syncframe (the specification places no limit on the number of payloads a syncframe can contain, beyond the limits on the size of the skip buffer). Once that frame is processed, the entire evo heap is cleared and reused for the next frame, with no persistence between syncframes.

While evo_malloc performs a fair number of length checks on allocations, this check is flawed, as it lacks an integer overflow check:

if ( total_size % 8 )
  total_size += 8 - (total_size % 8);

If the total allocation size on a 64-bit platform is between 0xFFFFFFFFFFFFFFF9 and 0xFFFFFFFFFFFFFFFF, the value of total_size will wrap, leading to a small allocation. Meanwhile, the loop that writes to the buffer uses the original payload_length as its bounds.
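The flaw can be demonstrated with a compilable model of the allocator. The struct and names here are mine, reduced from the decompilation above:

```c
#include <stdint.h>
#include <stddef.h>

/* Model of the decompiled evo allocator. The sum of alloc_size and extra is
 * checked for overflow, but the alignment padding added afterwards is not,
 * so a total in [0xFFFFFFFFFFFFFFF9, 0xFFFFFFFFFFFFFFFF] wraps to 0 and
 * sails past the remaining-space check. */
typedef struct {
    uint8_t *curr_mem;
    uint64_t remaining;
} model_heap;

void *model_evo_malloc(model_heap *h, uint64_t alloc_size, uint64_t extra) {
    uint64_t total_size = alloc_size + extra;
    if (total_size < alloc_size)
        return NULL;                         /* overflow check on the sum only */
    if (total_size % 8)
        total_size += 8 - (total_size % 8);  /* unchecked: can wrap to 0 */
    if (total_size > h->remaining)
        return NULL;
    uint8_t *mem = h->curr_mem;
    h->remaining -= total_size;
    h->curr_mem += total_size;
    return mem;
}
```

A near-SIZE_MAX request therefore "succeeds" without reserving any space, while the caller's copy loop still believes the allocation is enormous.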

Integer overflow bugs are often challenging to exploit because they lead to very large writes, but this code has a feature that avoids that problem. Each byte that is written is read from the skip buffer using ddp_udc_int_evo_brw_read, and that function checks read bounds based on emdf_container_length, which is also read from the skip buffer. If the read bounds check fails, the loop exits, and no more data is written to the buffer allocated by evo_malloc. This means that the size of the overflow is controllable, as are the values of the bytes written out of bounds, up to the limit of the total skip data (0x1FF bytes × 6 audio blocks).
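A toy model, with illustrative names reduced from the pseudocode above, shows why the overflow size is attacker-bounded rather than runaway:

```c
#include <stdint.h>
#include <stddef.h>

/* Toy model of the copy loop. The write is bounded by payload_length
 * (effectively unbounded after the wrap), but each byte comes from a
 * reader bounded by emdf_container_length, so the reader bound decides
 * how many out-of-bounds bytes are actually written. */
size_t overflow_copy(uint8_t *dst, uint64_t payload_length,
                     const uint8_t *reader_data, size_t reader_len) {
    size_t index = 0;
    while (index < reader_len) {       /* models evo_brw_read succeeding */
        dst[index] = reader_data[index];
        index++;
        if (index >= payload_length)   /* write-side bound: huge after wrap */
            break;
    }
    return index;
}
```

When payload_length has wrapped, the number of bytes written is exactly the reader's remaining length, which the attacker sets via emdf_container_length.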

This is a powerful primitive that I will refer to as the ‘buffer overrun capability’ of this vulnerability. But if you look closely, this bug also contains a leak.

EMDF content is written to the skip buffer with length skipl, but the EMDF container also has a size, emdf_container_length. What happens when emdf_container_length is larger than skipl?

if ( skipflde && ... )
{
  int skip_copy_len = 0;
  for ( int block_num = 0; block_num < total_blocks; ++block_num )
  {
    if ( skiple )
    {
      ...
      for ( ; skip_copy_len < skipl; skip_copy_len++ )
      {
        b = read_byte_from_syncframe();
        skip_buffer[skip_copy_len] = b;
      }
    }
  }
  int i = 0;
  for ( i = 0; i < skip_copy_len; i += 2 )
  {
    int16_t word = (skip_buffer[i] << 8) | skip_buffer[i + 1];
    if ( word == 0x5838 ) // 'X8'
    {
      has_syncword = 1;
      break;
    }
  }
  if ( has_syncword )
  {
    emdf_container_length = skip_buffer[i + 1] | (skip_buffer[i] << 8);
    bit_reader.size = emdf_container_length;
    bit_reader.data = &skip_buffer[i + 2];
  }
}

So while the skip buffer data is written based on skipl, the bit reader used to process the EMDF container has its length set to emdf_container_length. This means that EMDF data can be read outside of the initialized skip buffer. I will refer to this as the ‘leak capability’ of this vulnerability going forward.

We didn’t report the leak capability as a separate vulnerability from CVE-2025-54957, as it doesn’t have a security impact independent of the bug. The skip buffer is initialized to all zeros when the decoder starts, and afterwards, only syncframe data (i.e. the contents of the media being processed) is written to it. So in normal circumstances, an attacker couldn’t use the leak capability to leak anything they don’t already know. Only when combined with the buffer overrun capability does the leak capability become useful.
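A toy model of the leak capability, with illustrative names, shows how the mismatched bounds expose stale buffer contents:

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Toy model of the leak capability: only skipl bytes of the skip buffer
 * are written from the syncframe, but the bit reader's bound comes from
 * emdf_container_length, so parsing can consume stale bytes left in the
 * buffer by earlier activity. Names are mine, not the UDC's. */
size_t read_emdf_bytes(uint8_t *skip_buffer, size_t skipl,
                       size_t emdf_container_length,
                       const uint8_t *frame_data, uint8_t *out) {
    memcpy(skip_buffer, frame_data, skipl);     /* only skipl bytes written */
    for (size_t i = 0; i < emdf_container_length; i++)
        out[i] = skip_buffer[i];                /* reader bound is larger   */
    return emdf_container_length;
}
```

Any byte past skipl is whatever the skip buffer already contained, which is why the capability only matters once the overrun has planted interesting data there.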

Decoder Memory Layout

The next step in exploiting this bug was understanding what structures in memory it can overwrite. This required understanding the memory layout of the UDC. The UDC performs a total of four system heap allocations when decoding DD+ audio, all occurring when the decoder is created, before any syncframes are processed. These allocations are freed and re-allocated between processing each media file. This is fairly typical of media decoders, as system heap allocations have non-deterministic timing, which can cause lag when the media is played.

One buffer that is allocated is the ‘static buffer’. This buffer contains a large struct, which supports all the functionality of the decoder. The evo heap is part of this buffer. On Android, the size of the static buffer is 692855 bytes. Another buffer that is allocated is the ‘dynamic buffer’. This buffer is used as ‘scratch space’ for a variety of calculations, and is also the location of the skip buffer. It is 85827 bytes long. The other two allocations are for input parameters and output data, and aren’t relevant to this exploit.

The terms ‘static buffer’ and ‘dynamic buffer’ are somewhat confusing, as there are other static and dynamic buffers used by the decoder, and both buffers are dynamically allocated. However, these are the names used by Android when integrating the UDC. Throughout this post, the term ‘static buffer’ will always refer to the 692855-byte buffer allocated by the UDC on initialization, and the term ‘dynamic buffer’ will always refer to the 85827-byte buffer allocated by the UDC on initialization, and no other static or dynamic buffers.

The following diagram shows where the skip buffer and evo heap are located in relation to these buffers:


The evo heap is located at offset 0x61d28 in the static buffer, and immediately afterwards is the pointer used to write to the skip buffer when processing EMDF, which I will call the ‘skip pointer’. It points 0x1000 below the skip buffer, and 0x1000 is added to its value to calculate the address that skip data (skipfld) is written to each time a syncframe is processed.

This means the vulnerability has the potential to overwrite a pointer that is later written to with attacker-controllable content, the skip data of the next syncframe. Unfortunately, this is not as simple as using the buffer overrun capability to overwrite the pointer, as the evo heap is 0x1f08 bytes long, and the maximum value of skipl is 3066 (0xbfa = 0x1ff * 6 audio blocks), meaning that the value the skip pointer would be overwritten with is not immediately controllable by simply decoding an EMDF payload that contains the bug.

This behavior is demonstrated by the original proof-of-concept attached to CVE-2025-54957. This file causes the buffer overrun to occur, but because the skip pointer is more than 3066 bytes away from the evo heap allocation that is overwritten, data is copied from outside the skip buffer. Since this memory is always zero, the skip pointer is overwritten with 0, and a null pointer crash occurs when the skip data from the next syncframe is written.

To get around this, the buffer overrun needs to be triggered on an evo heap allocation when the heap is partially filled. Fortunately, an EMDF container can contain multiple EMDF payloads, and parsing each payload allocates memory on the evo heap. Analyzing ddp_udc_int_evo_parse_bitstream, the function that performs this parsing and allocation, I found that the smallest possible payload consumes 19 bits from the skip buffer, while every EMDF payload processed causes 96 bytes to be allocated on the evo heap. This means it would take roughly 99 payloads to fill up the evo heap, which translates to 235 bytes of skip data. This is well within the available skip data space. Using this technique, it was possible to overwrite the skip pointer with a controllable absolute value, then write arbitrary data to it.

Write what where?

While this is a useful primitive, its utility is limited by ASLR, as an attacker would need to know the absolute value of a pointer to write to, which is unlikely in a 0-click context. Another possibility is partially overwriting the skip pointer, for example, 0x7AAAAA00A0 could be overwritten to be 0x7AAAAA1234. Since the skip pointer originally points to the dynamic buffer, this allows most of the dynamic buffer to be overwritten. Unfortunately, the dynamic buffer is only used to store temporary numeric data and does not contain any pointers or other structures that would be helpful for exploitation, but there is one useful aspect of this primitive. Normally, only 3066 bytes of skip data can be written to the skip buffer, but it can allow an attacker to write more.

For example, imagine the following series of syncframes:

  1. Sets skip pointer to 0x7XXXXX4000
  2. Writes 3066 bytes of skip data to skip pointer
  3. Sets skip pointer to 0x7XXXXX3800
  4. Writes 0x800+ bytes of skip data to skip pointer

Now the length of the available data in the skip buffer is 3066 + 0x800, and this can be chained with more syncframes to write up to 0xFFFF bytes into the dynamic buffer. This isn’t on its own a path to exploitation, but it is a primitive that will become useful later. I will refer to it as WRITE DYNAMIC in future sections.

There is one subtlety that is important to notice. Why does syncframe 3 only move the skip pointer back 0x800 (2048) bytes when it could move it back 3066 bytes? This is because setting the skip pointer overwrites the data in the skip buffer. So syncframe 2 writes 3066 bytes, but syncframe 3 overwrites, for example, 200 bytes of that, then syncframe 4 needs to write 0x800+200 bytes to ‘fix’ the overwritten data. So to accurately write a long buffer to the dynamic buffer, the memory overwritten by each syncframe needs to overlap. But never fear, with enough syncframes, it is possible to fill almost the entire dynamic buffer with attacker controlled data. It is also possible to set the skip pointer to process the written data without modifying it by setting the skip pointer to the start of the data to be processed in one syncframe, then processing a second syncframe with skipl of 2, which will only write the syncword (‘X8’). The skip data will then be processed based on the emdf_container_length already written.
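The chaining described above can be sketched with a toy simulation. The 3066-byte chunk and 0x800 step come from the examples in the text; the coverage bookkeeping is mine:

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Toy model of chaining WRITE DYNAMIC: each syncframe writes up to 3066
 * bytes at the skip pointer, then the next frame moves the pointer down
 * by 0x800 and rewrites the overlap, extending the contiguous controlled
 * region at the top. */
enum { REGION = 0x10000, CHUNK = 3066, STEP = 0x800 };

size_t chained_coverage(int frames) {
    static uint8_t covered[REGION];
    memset(covered, 0, sizeof covered);
    size_t ptr = REGION - CHUNK;          /* first write ends at the top */
    for (int f = 0; f < frames; f++) {
        for (size_t i = 0; i < CHUNK; i++)
            covered[ptr + i] = 1;
        if (ptr < STEP)
            break;
        ptr -= STEP;                      /* next frame steps downwards */
    }
    size_t run = 0;                       /* contiguous run ending at top */
    for (size_t i = REGION; i-- > 0 && covered[i]; )
        run++;
    return run;
}
```

Because the 3066-byte chunk is larger than the 0x800 step, the covered region stays contiguous and grows by 0x800 bytes per additional syncframe.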

Regardless, the WRITE DYNAMIC primitive was clearly not sufficient for exploitation, so I decided to take a step back and figure out what memory I could overwrite to gain code execution, even if I didn’t have an immediate strategy for overwriting it. Analyzing the static buffer, I learned that my options were fairly limited. There are only two function pointers in the entire static buffer, called very frequently by the function DLB_CLqmf_analysisL, at offsets 0x8a410 and 0x8a438. This appears to be the only dynamically allocated memory used by the UDC that contains any function pointers.

Note that 0x8a410 and 0x8a438 are absolutely gargantuan offsets. They are more than 0x20000 bytes past the end of the evo heap, which is at offset 0x63c30. A typical exploitation approach might be to directly overflow the heap to overwrite one of these pointers, but this offset is far too large. Even if the above primitive was used to fill the entire dynamic buffer (writable length 0xFFFF) with EMDF container data, it would still not be enough data to overwrite these pointers.

Extending the evo heap

A different approach was needed, so I revisited the static buffer, looking for other fields I could overflow near the end of the evo_heap. One looked interesting:

The heap_len is used to set the allocation limit of the evo heap during the processing of each syncframe. If it could be overwritten, it would be possible for the evo heap to allocate memory outside of its original bounds. This was a very promising possibility, as it had the potential to enable a primitive that would allow relative writes within the static buffer. For example, if I overwrote the heap length with a very large value, then allocated 0x286e8 bytes, since the evo heap starts at offset 0x61d28 and I am able to allocate and write to evo heap memory, would I then be able to write to offset 0x61d28 + 0x286e8 = 0x8a410?

Of course, this is still limited by the available size of the skip data, which is now 0xFFFF due to the WRITE DYNAMIC primitive. But since payloads use skip buffer memory at a ratio of 19 bits to 96 bytes, the function pointer could theoretically be overwritten using 0x286e8 / 96 * 19 / 8 = ~0x1000 bytes of skip data, which is well within the available 0xFFFF bytes.

Overwriting heap_len presents a challenge, though, as a write that reaches it will also overwrite the skip pointer, and if the skip pointer is invalid, it will cause a crash before the new value of heap_len is processed. One way to get around this would be to know the absolute value of a writable pointer and include it in the data that overwrites the memory, but without an information leak, this isn’t practical on a Pixel. Another would be if there was a valid pointer in the dynamic buffer, as using the leak capability, it would be possible to embed it in the skip data for a frame and use it for the overwrite, but the dynamic buffer only contains numeric data.

Then I realized that the dynamic buffer does contain pointers. Not in the allocated portion, but in the contiguous metadata included in the allocation by Android’s scudo allocator. Inspecting the dynamic buffer in a debugger, the pointer always has the address format 0x000000XXXXXXX0A0. The offset of 0xa0 leaves space for the heap header.

The heap header of the dynamic buffer is as follows:

The memory between offset 0x00 and 0x50 is unused by the scudo heap because this is a secondary (large) allocation, but unfortunately, there is a guard page before the header, and 0x50 bytes is not enough space for the EMDF container needed to overwrite the skip pointer and heap length, so I investigated ways to increase the unused memory between the guard page and allocation header. I discovered:

  • If a secondary allocation is freed, and a chunk that is up to 0x2000 bytes smaller is then allocated, the freed chunk will be reallocated to satisfy the request. More importantly, the heap header will be shifted upwards. For example, if a heap chunk of size 0x17000 is allocated at 0x7f00000000 then freed, and then an allocation of size 0x15000 is made, then the chunk will be reused, but the heap header will now be at 0x7f00002000.
  • When a secondary chunk is freed, scudo determines the size entirely based on the “curr chunk len” field shown above.

It’s also important to note that the dynamic and static buffers are such large allocations with such unusual sizes that scudo always allocates them in the same location in a specific process, allocating the memory when the decoder is initialized and freeing it when it is uninitialized, as once the chunks are created by the heap, they are the only suitable existing chunks to fulfill an allocation request of that size. (Note that the UDC runs in a separate process from other codecs on Android.)

Putting this all together, it is possible to point the skip pointer to the ‘curr chunk len’ field of the dynamic buffer’s header, then overwrite it, so the chunk’s length is 0x17000 instead of 0x15000. Then, when the decoder is reset (i.e. when a new file is played), the buffer will be reallocated, with an extra 0x2000 bytes of writable space before the heap header. This means the exploit will require decoding multiple files, but that isn’t a problem when exploiting this bug via transcription, as multiple audio attachments to a single message are decoded in sequence.

There is a small ASLR problem with this step. As mentioned above, the dynamic buffer is allocated at a pointer with the format 0x000000XXXXXXY0a0, with X and Y being bits randomized by ASLR. The desired value to be written to is 0x000000XXXXXXY065. But remember, the skip buffer is actually at an offset of 0x1000 from the address the skip pointer references. So to perform the write, the skip pointer needs to be set to 0x000000XXXXXXZ065, where Z is one less than Y. This means the exploit needs to overwrite the nibble Y, and therefore know the value of Y, which is randomized by ASLR.

I did an experiment on a Pixel to see how this value was randomized, and it appeared to be fairly evenly distributed.

So the only option here is to guess this value, which means this exploit would work 1 out of every 16 times. This isn’t prohibitive, though, as an attacker could send the exploit repeatedly until it works, and if the heap nibble value is wrong, the decoding process crashes and respawns after roughly three seconds, which means the exploit would succeed on average in 24 seconds.

My exploit assumes the nibble value is 3. With this, and the shifting of the scudo heap header described above, it’s possible to insert an EMDF container before the heap header and use the leak capability of the bug to copy it over the skip pointer, then continue the copy to set the heap length. The heap length ends up being overwritten by audio data from early in the dynamic buffer (bit allocation pointers to be specific), which for the syncframe I used, is a value of 0x77007700770077.

Controlling PC

Now everything is ready to go: we can write an EMDF container with roughly 2070 EMDF payloads into the dynamic buffer, and when it’s processed, ~0x28000 bytes of the evo heap get allocated, and the final payload overwrites the function pointer at 0x8a410. Unfortunately, this didn’t work.

It turns out that there are some other fields after the heap length in the static buffer.

To understand what these are, and why they are causing problems, we need to look more closely at how evo memory is allocated when EMDF payloads are processed. In highly simplified pseudocode, it works something like this.

int num_payloads = 0;

while (true) {
  int error = evo_parse_payload_id(&reader, &payload_id);
  if (payload_id == 0 || error)
    break;
  num_payloads++;
  error = evo_parse_payload(reader, payload_id, 0, 0, &payload, 0); // allocates no memory
  if (error)
    break;
}

void **payload_array = evo_malloc(evo_heap, 8 * num_payloads, 8 * array_extra);

for (int i = 0; i < num_payloads; i++) {
  payload_array[i] = evo_alloc(88, 0);
}

reader.seek(0);

for (int i = 0; i < num_payloads; i++) {
  int error = evo_parse_payload_id(&reader, &payload_id);
  if (payload_id == 0 || error)
    break;
  error = evo_parse_payload(reader, payload_id, evo_heap, 0, payload_array[i], 0);
  if (error)
    break;
}

Within the second call to evo_parse_payload, a single allocation (the same one which can overflow when the bug occurs) is performed as follows:

void* payload_mem = evo_alloc(payload_size, payload_extra);

On a high level, this code counts the number of EMDF payloads, then allocates an array of that size to hold pointers to a struct for each payload, then allocates a struct to represent each payload, and sets the corresponding pointer in the array to the struct allocation, then reads each EMDF object into its payload struct, optionally allocating payload memory if it contains payload bytes.

Two fields in the code above come from the static buffer: array_extra and payload_extra are both integrator-configurable parameters that cause specific calls to evo_alloc to allocate extra memory.

So why does this cause my attempt to overwrite the function pointer in the static buffer to fail? When the decoder processes the EMDF container with a large number of payloads, it starts to allocate memory outside of the evo heap, because the heap length was overwritten with a very large size. The first evo heap memory allocated is the payload_array, an array of pointers that are later set to 88-byte evo heap allocations, one for each payload. With 2070 EMDF payloads, this array is very large, 0x40B0 bytes. It overlaps payload_extra and many other fields in the static buffer, setting them to pointer values. For fields that are interpreted as integers, like payload_extra, the end result is that they now contain numeric values that are very large.

Soon after payload_extra is overwritten, evo_parse_payload is called, which attempts the allocation:

void* payload_mem = evo_alloc(payload_size, payload_extra);

The allocation size is calculated by adding payload_size + payload_extra (with an integer overflow check) before the buggy alignment-padding addition that leads to the vulnerability occurs. Since pointers are tagged on Android, this will end up being something like:

total_size = payload_size + 0xB400007XXXXXXXXX;

Meanwhile, the heap length was overwritten to be 0x77007700770077, which is always smaller than total_size, so this allocation fails. Even worse, the overwritten payload_extra persists across syncframes, meaning that no payload_mem allocation will ever succeed again. This prevents the bug from ever triggering again, as it requires a successful allocation, so there is no possibility of correcting these values in the static buffer.

But maybe it isn’t necessary to ever trigger the bug again, as the skip pointer is one of the many fields that gets overwritten by the huge payload_array allocation, causing it to point into the static buffer, above the evo heap. I’m going to skip over some details here, because I ended up not using this strategy in the final exploit, but by writing data to the altered skip pointer, it was possible to overwrite the function pointer, which demonstrated that this vulnerability could set the program counter!

Non-contiguous Overwrites

Controlling the PC showed this bug has excellent exploitability, but the above strategy had a serious downside: it prevented the bug from being triggered again, so I could only perform one overwrite, which would make achieving shellcode execution challenging. So my next step was to find a way to perform multiple non-contiguous writes to the static buffer.

When setting the PC, the unavoidable corruption of payload_extra prevented future overwrites, but I eventually realized that I could use the ability to set this field to my advantage.

The layout of allocations on the evo heap is as follows:

If an EMDF container contained two EMDF payloads, the data for the second payload would be allocated at num_payloads × 96 + payload_1_size + payload_extra. This allows payload_extra bytes to be allocated in the static buffer, but not overwritten by the payload. Since the length and contents of payload data are controllable by the attacker, it would be possible to write basically any data at any relative location in the static buffer if I could find some way to overwrite payload_extra with controlled data. The fact that payload_1_size is also set from syncframe data makes this even more convenient. Since all the writes this exploit requires are fairly close to each other in memory, payload_extra only needs to be written once, chosen so that heap_base + num_payloads × 96 + payload_1_size + payload_extra is equal to the X0 parameter of DLB_CLqmf_analysisL (more on why this is a good choice later). Then, by modifying payload_1_size, the address of individual writes can be shifted by that many bytes. For example, if payload_1_size is 14 × 8, the function pointer in the static buffer discussed above will be overwritten.

Overwriting payload_extra

Unfortunately, the method used for overwriting the heap length is not sufficient to overwrite payload_extra as well, and the corruption that occurred while gaining PC control did not provide adequate control of the values overwriting payload_extra to perform the steps above. Remember, the heap length was overwritten by audio data in the dynamic buffer that happened to be written at an address soon after the dynamic buffer’s scudo heap header, and payload_extra was overwritten by a pointer. For just extending the heap length, setting the value to ‘random garbage’ was enough, but for multiple overwrites via payload_extra, a specific value is needed.

A simple solution would be to use WRITE DYNAMIC to write the data after the heap header to the needed value, but this isn’t possible, because this address is written by the decoder while decoding a portion of the audio blocks called bit allocation pointers (baps), between when attacker-controlled data is written and when it is processed by the next syncframe. So even if the needed values are written with WRITE DYNAMIC, they are overwritten before they can be used to set payload_extra and nearby fields. I tried stopping the write from happening by including erroneous data in the syncframe that prevented baps from being written, but this also stopped EMDF data from being processed. I also tried altering an audio block to write controlled data in this location, but the possible values of baps are fairly limited, only low 16-bit integers.

I eventually wondered if it would be possible to get the scudo heap to write an ‘inactive’ header, i.e. one that contains pointer values, but isn’t currently in use. I experimented with scudo, and discovered that if a secondary chunk is the first one of that size ever allocated by a process (like the dynamic buffer is), its previous pointer will point to itself, and if the previous pointer is partially overwritten (for example, so the last two bytes are 0x5000 instead of 0x3000), the next time the chunk is allocated, the address returned by the allocator will be at the 0x5000 address, but the scudo header at 0x3000 will not be cleared. This only works because the dynamic buffer is the only buffer anywhere near its size that is allocated by the process, otherwise, there would be a risk that this buffer would be allocated again, leading to memory corruption that could cause a crash before the exploit is finished running.

Since the decoder needs to be reset to cause the dynamic buffer to be reallocated, implementing this required adding a third media file to the exploit, but this isn’t a big cost in a fully-remote exploit, as three attachments can easily be added to the same SMS or RCS message. Now the exploit has three files:

  • first.mp4 -- Using WRITE DYNAMIC, writes dynamic_base + 0x3061 to 0x48, causing the dynamic buffer to be reallocated at dynamic_base + 0x4800 when second.mp4 is loaded
  • second.mp4 -- Using WRITE DYNAMIC, writes dynamic_base + 0x4861 to 0x50, causing the dynamic buffer to be reallocated at dynamic_base + 0x5000 when third.mp4 is loaded
  • third.mp4 -- contains the rest of the exploit

Note that dynamic_base is the location of the dynamic buffer with the lower two bytes cleared, i.e. dynamic_buffer & 0xFFFFFFFFFFFF0000. When the ASLR state needed for the exploit to work is correct, the dynamic buffer is at dynamic_base + 0x3000.

Now, there is a scudo heap header at dynamic_base + 0x4800 that is not actively in use and does not get overwritten by baps, which can be used to create an EMDF container that will overwrite payload_extra. But there is one problem. I explained earlier that, when filling a buffer using WRITE DYNAMIC, the exploit needs to perform overlapping writes downwards, because the next EMDF container, which is needed to move the skip pointer for the next step, overwrites some data at the start of the write. This doesn’t matter when writing a long run of data, because the next write can fix the previous one, but it does in this case. The layout of the heap header is as follows:

I needed to write specific data at exactly offset 0xc8, but couldn’t corrupt the ‘prev chunk ptr’ because it was needed to overwrite the skip pointer during the copy. There are 0x60 bytes between these, which is not enough for a payload that moves the skip pointer.

So I needed a new primitive. Thankfully, the way the decoder handles the EMDF syncword provides this. Basically, once skip data is copied into the skip buffer, the buffer is searched for the syncword (‘X8’), and EMDF container parsing starts after the syncword. So it is possible to put some data before the syncword, which gets written to the skip pointer, then put the container that moves the skip pointer after that. This allows the data to be written to the skip pointer, and then the skip pointer to be moved in a single syncframe, so that data doesn’t get corrupted by a future skip pointer write. I will call this primitive WRITE DYNAMIC FAST. There are two downsides to this primitive compared to WRITE DYNAMIC. One is that since the EMDF container that moves the skip pointer and the data written are in the same syncframe, a smaller amount of data can be written. The other is that it is more difficult to debug. In a WRITE DYNAMIC syncframe, the address written to is always at the same offset, so it is easy to visually inspect many syncframes and determine where they are writing, but this is not the case with WRITE DYNAMIC FAST. So, my exploit uses WRITE DYNAMIC wherever possible, and only uses WRITE DYNAMIC FAST for writes that can’t be accomplished with WRITE DYNAMIC.
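The syncword handling that enables this primitive can be modeled as follows. The search logic mirrors the earlier pseudocode (two-byte steps, 16-bit compare); the function name is mine:

```c
#include <stdint.h>
#include <stddef.h>

/* Toy model of the decoder's syncword scan: the skip buffer is searched
 * two bytes at a time for the 'X8' syncword, and EMDF container parsing
 * starts after it. Any bytes placed before the syncword are written to
 * the skip pointer but never parsed as EMDF. */
ptrdiff_t emdf_parse_offset(const uint8_t *skip, size_t len) {
    for (size_t i = 0; i + 1 < len; i += 2) {
        if (skip[i] == 'X' && skip[i + 1] == '8')
            return (ptrdiff_t)(i + 2);   /* container parsing starts here */
    }
    return -1;                           /* no syncword: nothing parsed */
}
```

Everything before the returned offset is pure payload that lands at the skip pointer, which is exactly what WRITE DYNAMIC FAST exploits.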

With this primitive, I could create a syncframe that overwrites the skip pointer with a valid pointer to the dynamic buffer, then overwrites the heap length and payload_extra. This created a new primitive, which I will call WRITE STATIC. This allows a write to any offset in the static buffer larger than 0x63c30 relative to the static buffer’s base!

Calling Controllable Functions

Now that I had the ability to perform multiple writes to the static buffer, it was time to figure out a path to shellcode execution. This required analyzing how the function pointers in the static buffer are called. It happens in the following function:

void *DLB_CLqmf_analysisL(void **static_buffer, __int64 *output_index, __int64 in_param)
{
  // static_buffer is the static buffer at offset 0x8a3c8
  void *result;
  int loop_times = *((int *)static_buffer + 5);
  int index = *(int *)static_buffer;
  do
  {
    __int64 index_val = *output_index++;
    void *param_X0 = static_buffer[12];
    void *param_val = (char *)param_X0 + 8 * index;
    ((void (*)(void *, void *, void *, void *, __int64))static_buffer[14])(
        param_X0,
        static_buffer[5],
        static_buffer[1],
        static_buffer[7],
        in_param);

    result = dlb_forwardModulationComplex(
        param_X0,
        index_val,
        param_val,
        *static_buffer,
        static_buffer[13],
        static_buffer[8],
        static_buffer[9]);

    index = *(unsigned int *)static_buffer;
    --loop_times;
  }
  while ( loop_times );
  return result;
}

The function dlb_forwardModulationComplex contains the following condition:

  if ( a7 )
  {
    result = ((__int64 (__fastcall *)(__int64, __int64, _QWORD))*a7)(a3, a1, a4);
  }

This function’s behavior is extremely promising with regards to exploitation. It reads a function pointer and parameters out of memory that can be written with WRITE STATIC, then calls the function pointer with those parameters. There is also an option to make an indirect function call using dlb_forwardModulationComplex, if there happens to be a situation where a pointer to a function pointer is available instead of the function pointer itself. Finally, the call is repeated a specific number of times, based on a controllable value read out of the static buffer. Combining DLB_CLqmf_analysisL with WRITE STATIC, I could partially overwrite function pointers to run ROP with controllable parameters.

What’s the plan, (Seth and) Jann?

As I developed this exploit, Jann Horn asked several times how I was planning to get from ROP to code execution in the mediacodec context, as Android has several security features intended to make this step difficult. I put this off as a ‘future problem’, but I had now reached the point where it needed to be solved.

Normally, my strategy would be to write a shared library to the filesystem and then call dlopen on it, or to write shellcode to a buffer and call mprotect via ROP to make it executable. SELinux prevented both of these. It turns out the mediacodec SELinux context has no allow rule that permits it to both open and write the same file, so dlopen was a non-starter. Additionally, mediacodec does not have the execmem permission, so making memory executable was also out. Making matters worse, libcodec2_soft_ddpdec.so makes limited calls to libc, so not very many functions were available for ROP purposes. For example, the library imports fopen and fread, but not fwrite or fseek.

Eventually, I got together with Jann Horn and Seth Jenkins to figure out a strategy to get from ROP to arbitrary instruction execution. Jann had the idea to write to /proc/self/mem. This procfs file allows any memory in a process to be overwritten for debugging purposes (e.g. to support software breakpoints), and could potentially be used to overwrite a function, which is then executed.

After investigating the mediacodec context’s permissions, we came up with the following strategy:

  1. Map shellcode into memory using WRITE DYNAMIC

  2. Call fopen on /proc/self/mem many times, so a file descriptor number associated with /proc/self/mem can be easily guessed

  3. Call pwrite to write the shellcode to a function that can later be executed. (Note that pwrite is not imported by libcodec2_soft_ddpdec.so, but nothing else that can write to a file handle is either).
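The two properties this plan relies on can be demonstrated in an ordinary Linux process (no exploit involved): consecutive opens yield predictable, consecutive file descriptor numbers, and pwrite through /proc/self/mem modifies the process's own memory. This is a stand-alone sketch, not exploit code:

```python
# Demonstrate the /proc/self/mem technique: spray file descriptors so one is
# guessable, then pwrite through the file to overwrite our own (heap) memory.
import ctypes
import os

fds = [os.open("/proc/self/mem", os.O_RDWR) for _ in range(8)]
# With no intervening opens/closes, descriptors are handed out consecutively,
# so an attacker who sprays enough handles can guess one of them.
assert fds[-1] - fds[0] == len(fds) - 1

buf = ctypes.create_string_buffer(b"AAAA")
# The pwrite offset into /proc/self/mem is a virtual address in this process.
os.pwrite(fds[0], b"BBBB", ctypes.addressof(buf))
assert buf.raw[:4] == b"BBBB"
```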

Translating this sequence into ROP calls made by WRITE STATIC was more difficult than expected. One problem was that partially overwriting the function pointers in DLB_CLqmf_analysisL provided less functionality than I’d imagined. If you recall, DLB_CLqmf_analysisL makes two function calls that can be overwritten. The first is a direct call to analysisPolyphaseFiltering_P4 at 0x26BDEC (note this isn’t symbolized in the Android version of the library). The second is an indirect call to DLB_r8_fft_64 via a pointer at offset 0x2A7B60.

The upper nibble of the second byte of where these functions are loaded is randomized by ASLR on Android. I tested this, and saw the behavior below, which is fairly uniform.

So my only options were to use ROP gadgets that involve only overwriting the first byte of the function pointers, or add additional unreliability to the exploit. The available gadgets weren’t promising, so I decided to just guess this offset in my exploit, which adds another 1/16 probability, meaning the exploit will work one out of 256 times total. Considering the decoder process takes three seconds to respawn, this means the exploit would take on average around six minutes to succeed, which isn’t prohibitive.

Guessing this nibble expands the available ROP gadgets to a span of 0xFFFF bytes, and it’s possible to shift this span somewhat, depending on what value the exploit guesses this nibble to be. Still, this is only about 5% of the 1.3 MB of code in libcodec2_soft_ddpdec.so. For the indirect call, 0xFFFF spans almost the entire export table, as well as the global offset table (GOT), so there’s some options there, but the library exports only about 40 functions from libc.

But it wasn’t hopeless. For one, it is possible to call memcpy with these limitations, and if the parameters are unmodified, dst is a location in the dynamic buffer, and src is a location in the static buffer. Also, there was a promising ROP gadget in the accessible range:

0x000000000026ae38:

ldr w8, [x1]
add w8, w8, #0x157
str w8, [x1]
ret

I will call this the “increment gadget”.
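The gadget's effect can be modeled in a few lines. This is my own illustration of the semantics, using a dict as stand-in memory:

```python
# Model of the "increment gadget": load a 32-bit word through the address in
# X1, add 0x157, and store it back (wrapping at 32 bits, as the w8 register
# does).
def increment_gadget(mem: dict, x1: int) -> None:
    mem[x1] = (mem[x1] + 0x157) & 0xFFFFFFFF

mem = {0x1000: 0x10}
increment_gadget(mem, 0x1000)
assert mem[0x1000] == 0x167
```

Repeated calls step a stored function pointer forward in fixed 0x157 increments, which is what makes the later pointer-walk from fopen to pwrite possible.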

With this, I had a plan:

  1. Change the indirect call to the fopen pointer in the GOT, and call it several times on /proc/self/mem

  2. Change the indirect call to memcpy, and copy the fopen GOT entry to the dynamic buffer

  3. Set the dst parameter of memcpy to the location of the GOT pointer in the dynamic buffer and call it again, causing a pointer to the fopen function in libc to be copied to the dynamic buffer

  4. Use WRITE DYNAMIC to overwrite the last byte of the function pointer, so the distance between the pointer and pwrite is a multiple of 0x157

  5. Call the increment gadget over and over to increment the function pointer in the dynamic buffer by 0x157 until its value is pwrite

  6. Call pwrite

  7. Profit?

This plan obviously glosses over a lot, most of which will be explained in the next section, but it is the plan I wrote up at the time.

One immediate question is “does the math work”? It seems to. In the version of the library I looked at, fopen is at 0x92E90 and pwrite is 0xDD6C0. A one-byte overwrite could change a fopen pointer to 0x92E4A, then:

0x157 × 890 + 0x92E4A = 0xDD6C0

Another question is whether this math would work generally, even on devices that have libc compiled with different offsets. I believe it would. In each version of libc, there are at least four call locations that will end up calling pwrite: pwrite, pwrite’s PLT, pwrite64 and pwrite64’s PLT. If those don’t work, there’s combinations of seek and write or fseek and fwrite. Worst case, the exploit could change the GOT entry that’s read, so the math starts with a different function pointer than fopen. There are a very large number of possibilities and more than one is likely to work on every libc compilation.
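The arithmetic above can be checked mechanically. This snippet uses the offsets from the libc version examined in the post (treat them as example values) and searches for the one-byte overwrite that makes pwrite reachable in whole 0x157 steps:

```python
# Verify the increment-gadget math: find a low byte for the fopen pointer
# (0x92E90 in the examined libc) from which pwrite (0xDD6C0) is reachable in
# whole 0x157-sized increments.
FOPEN, PWRITE, STEP = 0x92E90, 0xDD6C0, 0x157

# A one-byte overwrite can only change the lowest byte of the pointer.
candidates = [
    (FOPEN & ~0xFF) | b
    for b in range(0x100)
    if (PWRITE - ((FOPEN & ~0xFF) | b)) % STEP == 0
]
assert candidates == [0x92E4A]                 # exactly one workable low byte
assert (PWRITE - 0x92E4A) // STEP == 890       # 0x157 × 890 + 0x92E4A = 0xDD6C0
```

Since STEP (343) is larger than 256, at most one low-byte value can work for a given pair of offsets, which is why the "does the math work generally" question matters and why having several pwrite-equivalent targets helps.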

The Exploit

Now it was time to write the third file of the exploit. This turned out to be fairly complicated, with some unexpected problems. To explain them, this section goes through the third file of the exploit one syncframe at a time. You can follow along here. Note that files whose names begin with numbers, for example 10_write_x0, contain the actual syncframe data for that syncframe, while files with names like make_10_write_x0.py contain the Python that generates the frame, often created with Gemini. Files with no corresponding Python were either hand-forged or exact copies of previous syncframes. Files with the suffix _special were generated with the corresponding Python and then altered by hand. The syncframes can be combined into a single MP4 file with correct checksums by running combine_frames.py.

longmem

The third exploit MP4 starts with the 36 syncframes in the longmem directory, containing the shellcode that the exploit eventually runs. The shellcode is copied to the dynamic buffer at descending addresses using WRITE DYNAMIC. As the exploit progresses, it performs actions that break WRITE DYNAMIC, so it’s easiest to get this into memory now.

1_adjust_write_heap

This syncframe sets the skip pointer to dynamic_base + 0xF000.

2_adjust_write_heap_special

This syncframe uses WRITE DYNAMIC FAST to write ‘wb’ and “/proc/self/mem” to the address above, so they are available as parameters for a future fopen call, then moves the skip pointer to dynamic_base + 0xD000 so they aren’t immediately corrupted.

3_adjust_write_heap

This syncframe sets the skip pointer to dynamic_base + 0x48c8, an offset that will correspond to the evo heap length and payload_extra once the memory is copied. (In hindsight, this could have been done in the previous frame, but too late now.)

4_adjust_write_heap_special

This syncframe uses WRITE DYNAMIC FAST to set the memory at the offset corresponding to the evo heap length to 0xFFFFFFFFFFFFFFFF and the offset corresponding to payload_extra to 0x28530. It then sets the skip pointer to dynamic_base + 0x473a.

5_do_heap_write

This syncframe writes the start of an EMDF container to the address set in the previous frame, so that the data written by 3_adjust_write_heap, 4_adjust_write_heap_special and this syncframe together form a valid EMDF container, which is then parsed, triggering the bug and setting the heap length to 0xFFFFFFFFFFFFFFFF and payload_extra to 0x28530. This makes the WRITE STATIC primitive available, but also makes WRITE DYNAMIC and WRITE DYNAMIC FAST no longer function, as evo heap allocations no longer take up the same amount of space on the heap.

6_write_pc

To understand this and future syncframes, it’s important to understand the functionality of WRITE STATIC in a bit more detail. The memory this primitive can write, which eventually becomes the X0 parameter to DLB_CLqmf_analysisL, is laid out as follows:

The function pointer for the direct call is available to be overwritten, as are its parameters, ARM64 registers X0 through X3. The indirect function parameters are also calculated from values in this structure, which I will explain in more detail later.

Each 64-bit slot can be considered an ‘entry’ that needs to be individually overwritten in order to do non-contiguous partial overwrites. WRITE STATIC can alter a single entry per syncframe. Unfortunately, DLB_CLqmf_analysisL also executes once per syncframe, which can cause crashes or undesired behavior if the exploit is in the process of setting parameters when the call occurs.
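Collecting the slot numbers named throughout this walkthrough, the reachable entries can be summarized as below. The names and the byte-offset helper are my reconstruction from the text, not symbols from the binary; unlisted slots are unknown:

```python
# Map of the WRITE STATIC-reachable 64-bit entries, as named in the
# walkthrough. Entry numbers index 8-byte slots in the structure.
ENTRIES = {
    0:  "index",             # indirect_call_x0 = direct_call_X0 + 8 * index
    1:  "direct_call_X2",
    2:  "loop_count",
    5:  "direct_call_X1",
    9:  "indirect_fptr",     # pointer to a function pointer
    10: "indirect_call_X1",
    12: "direct_call_X0",
    14: "direct_call_fptr",  # direct function pointer
}

def entry_offset(n: int) -> int:
    """Byte offset of an entry within the structure (8 bytes per slot)."""
    return 8 * n

assert entry_offset(14) == 0x70
```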

This syncframe sets direct_call_fptr at entry 14 to a gadget that contains only the instruction ret, by doing a partial overwrite of the existing pointer. This prevents the direct function call from causing unexpected behavior.

7_garbage

Executing any frame with a valid EMDF header caused a crash after the previous frame, due to an out-of-bounds memset. Based on its parameters, this call is obviously intended to zero the evo heap, but since the heap length is now larger than the static buffer, it writes out of bounds. I performed a minimal analysis of what triggers this call and discovered that it requires processing two syncframes containing EMDF containers in a row, so I added in a syncframe that contains random invalid data to reset this. This ‘garbage’ syncframe is now required after every valid syncframe to avoid crashes. I will omit it as I continue through the exploit, but note that every future frame is even-numbered, because all the odd-numbered frames are ‘garbage’.

8_write_str_str

Similar to syncframe 6, it is necessary to overwrite the indirect function pointer at entry 9 to avoid crashes as parameters are set. However, it is not possible to use ROP here, as the entry needs to be set to a pointer to a function pointer. This syncframe sets entry 9 to the GOT entry pointing to strstr by doing a partial overwrite. While this isn’t ideal, for the time being X0 and X1 of the indirect call will always be pointers, and strstr doesn’t modify any memory, so running it repeatedly won’t cause crashes or other problems.

10_write_x0

This syncframe prepares the X0 parameter for the indirect call to fopen. For this call, X0’s value will be the pointer at entry 12 (direct_call_X0) plus an offset calculated from entry 0 (index). The entire calculation is:

indirect_call_x0 = direct_call_X0 + 8 * index;

In syncframe 2, “/proc/self/mem” was already loaded into the dynamic buffer, and this syncframe sets index to 1, so X0 references this string, 8 bytes away from the string ‘wb’.

12_write_x1

This syncframe partially overwrites entry 10, which is currently a pointer to the dynamic buffer so that its value is dynamic_base + 0xF000, making it point to the string ‘wb’.

14_write_fopen

This syncframe partially overwrites entry 9, so the indirect function pointer now references fopen. fopen will immediately be called four times, the default value of loop_count.

16_garbage to 23_garbage

The exploit now processes a few garbage syncframes to run fopen repeatedly to ‘spray’ the file handle so it can be guessed. This works because the UDC process opens very few files, so handle numbers above a certain value are predictable.

24_write_str_str

Returns entry 9 (the indirect function pointer) to strstr, so fopen stops being called.

26_write_x2

This syncframe sets direct_call_X2 (entry 1) to 0xb8 in preparation for a call to memcpy.

28_write_x0

This syncframe partially overwrites the dynamic buffer pointer in direct_call_X0 (entry 12) to dynamic_base + 0xEC00, in preparation for a call to memcpy.

30_loop_count

This syncframe sets the loop_count in entry 2 to 1, so future function calls do not execute multiple times per syncframe.

32_memcpy

This syncframe sets the direct function pointer (entry 14) to a memcpy gadget at 0x26cc2c, which is then called, causing the static buffer to be copied to the dynamic buffer, including an indirect pointer to strstr, set at entry 9 above. Note that the copy will occur every syncframe until entry 14 is overwritten again.

34_write_x0

The previously set value of direct_call_X0 was a dummy value, to keep the copy away from the skip buffer while the previous, especially large, EMDF container was being processed. This syncframe sets it to the actual copy destination, dynamic_base + 0x5F83.

36_zero_page and 38_copy_x1_special

The next two syncframes copy the newly written strstr GOT entry pointer to direct_call_X1 using the leak capability of the vulnerability, so it can be the src parameter of the next memcpy.

36_zero_page writes zeros, followed by the end of an EMDF container to the skip pointer.

The memcpy then occurs, copying the GOT pointer into the middle of the EMDF container.

38_copy_x1_special writes the head of the EMDF container to the skip pointer, then the container is parsed, causing direct_call_X1 (entry 5) to be set to the GOT pointer.

40_write_x0 and 42_write_x0

Syncframe 40 sets direct_call_X0 (entry 12) to dynamic_base + 0xEF00. memcpy is then called, causing a direct pointer to strstr to be copied to that address. Syncframe 42 sets it to dynamic_base + 0x6043, so the copied memory doesn’t get corrupted, and to set up the next memcpy call.

44_write_x2, 46_write_scf, 48_zero_page and 50_write_x3_special

Though it wasn’t strictly necessary at this point, I wanted to set direct_call_X3 to strstr, so it would be available as offset, the fourth parameter to the eventual pwrite call. This made sense because the pointer was already available in the dynamic buffer, and all other direct calls needed by the exploit had fewer than four parameters. Flash forward to the future: this was a bad idea.

The offset parameter specifies the location pwrite writes to, which for /proc/self/mem in this exploit is the address of a function that will be overwritten with shellcode. strstr seemed perfect, because I could already make controlled calls to it, and it otherwise doesn’t get called a lot, but when I ran the finished exploit, it didn’t work, because getpid, munlock and several other frequently-called functions were located immediately after it in libc. They usually got called first, causing the exploit to jump into the middle of the shellcode.

It was easiest just to use memcpy to copy a different function pointer, and after some testing, I selected __stack_chk_fail, as it doesn’t get called during normal operation and the functions after it in libc aren’t used by the UDC either. So this combination of syncframes uses the same trick as was used to copy the strstr GOT into direct_call_X1 to copy a pointer to __stack_chk_fail into direct_call_X3. Note that this only takes one ‘round’ of using the leak capability to copy a pointer, versus two for strstr, because I was able to partially overwrite the pointer to the strstr GOT entry in direct_call_X1 so it pointed to the __stack_chk_fail GOT entry, and so didn’t need to copy the static buffer a second time.

52_set_pc_back

This syncframe sets the direct function call back to the ret gadget, so it stops calling memcpy.

54_write_skip, 56_write_x1_end_special and 58_write_x1_start_special

When starting this exploit, I genuinely believed it would be possible to get shellcode execution without WRITE DYNAMIC once WRITE STATIC was unlocked. This turned out to be wrong. In the plan I wrote up for the exploit, I missed the fact that direct_call_X1 was set to the GOT at this point in the exploit, but needed to be set to the dynamic buffer.

Some nice pointers to the dynamic buffer were already in the dynamic buffer from when I had copied the static buffer there to get the address of the GOT, and I could use the same trick to copy one into direct_call_X1 that I’d used to copy the other pointers, but I’d need to move the skip pointer to their address and write there. I decided at this point the easiest path forward would be to regain the WRITE DYNAMIC primitive.

This was really just a math problem. The original WRITE DYNAMIC primitive would allocate a lot of EMDF payloads to exhaust the heap, then trigger the buffer overwrite capability to alter the skip pointer. With payload_extra overwritten, this now fails, because an integer overflow check fails when payload_extra is added to the payload size. But it’s not actually necessary to trigger the vulnerability once the heap length is overwritten, as the evo heap no longer accurately checks whether heap writes are out of bounds.

As a refresher, the evo heap is laid out as follows:

The new WRITE DYNAMIC allocates the perfect number of payloads so that the allocation size of the pointer array plus the payload structs is exactly even with the skip pointer, and then the first payload’s data overlaps with the pointer, and can be used to overwrite it.

These syncframes use a series of WRITE DYNAMIC and WRITE DYNAMIC FAST calls to set direct_call_X1 to the dynamic buffer.

60_write_skip, 62_write_single_byte and 64_move_skip

The first two syncframes use WRITE DYNAMIC to overwrite the final byte of the pointer to strstr, so it is a multiple of 0x157 away from pwrite. The final syncframe moves the skip pointer to another address so it doesn’t write the byte a second time.

66_write_index

The exploit is about to call the increment gadget a large number of times, which will also increment the variable index at entry 0 in DLB_CLqmf_analysisL. This syncframe sets its value to zero, so that these future increments don’t lead to reads out of bounds.

68_loop_count

This syncframe sets the loop_count in entry 2 to 0x7B, so that the increment gadget runs the correct number of times. Note that DLB_CLqmf_analysisL will run twice, causing the gadget to run 0xF6 times.

70_write_x1

direct_call_X1 currently points somewhere in the dynamic buffer. This syncframe makes it point exactly to the modified pointer to strstr.

72_inc_157

This syncframe sets the direct function pointer to the increment gadget, which is then called 0xF6 times, causing the function pointer in the dynamic buffer to point to pwrite.

74_set_pc_back

Sets the direct call pointer back to the ret gadget, so incrementing stops.

76_set_malloc

The indirect function pointer is currently set to strstr. This will become a problem as its parameters are prepared for calling pwrite: pwrite‘s first parameter is a file handle (i.e. an integer), which will crash strstr if passed as its first parameter. This syncframe sets the indirect function pointer to malloc, as its GOT entry is within range and the call will succeed with a single integer parameter.

78_write_x0

This syncframe writes direct_call_X0 with 40, the estimated handle to /proc/self/mem.

80_write_x1

This syncframe partially overwrites direct_call_X1 so it points to the shellcode in the dynamic buffer.

82_write_x2

This syncframe writes direct_call_X2 with the integer length of the shellcode.

84_write_end_special and 86_write_start_special

These syncframes copy the pointer to pwrite to the direct_call_fptr (entry 14), using the same method as other pointer copies from the dynamic buffer. pwrite is immediately called, overwriting __stack_chk_fail with the shellcode.

88_write_scf

This syncframe partially overwrites the indirect call register, so it points to the GOT entry for __stack_chk_fail. __stack_chk_fail immediately executes, running the shellcode!

How reliable is this exploit?

Due to ASLR guessing, this exploit works roughly 1 in 256 times. There is one other source of unreliability: occasionally, binder performs a secondary allocation while the exploit is running, in which case header checks fail and it crashes. This happens a lot when a debugger is attached, but I observed it less than 10% of the time when the process is running normally.

Another question is whether the exploit could be made more reliable. I have two ideas in this regard, both of which would require substantial development effort.

To remove the 1/16 probability when guessing the dynamic buffer location, it might be possible to overwrite the second lowest byte of the prev pointer in the dynamic buffer allocation before exploitation starts. As discussed previously, this causes the buffer to be reallocated at that address, so this would have the end result of moving the allocation to a consistent offset from the dynamic_base before the exploit runs.

The challenge here would be to find a way to write to the header of the dynamic buffer while only overwriting the lowest byte of the pointer, as this is the only byte that can be overwritten without knowing the ASLR bits. One possibility is using the bap write feature of the decoder, as it writes data close to the skip pointer, but very limited data can be written. The evod_process function also writes to low addresses of the skip buffer after the EMDF container is parsed, so it might be possible to use this write as well.

This strategy would not make determining the dynamic buffer allocation 100% reliable, because the location where the dynamic buffer is reallocated needs to be mapped. For example, if an allocation at dynamic_base + 0x3000 has its prev pointer overwritten to be dynamic_base + 0xF000, it will be shifted to that address, but if an allocation at dynamic_base + 0xF000 is overwritten to be dynamic_base + 0x3000, it will crash when scudo attempts to write a heap header to the lower address, because that memory is not mapped. Overwriting the prev pointer to dynamic_base + 0xF000 would theoretically always work, but that would limit WRITE DYNAMIC to addresses between dynamic_base + 0xF000 and dynamic_base + 0xFFFF, because the primitive can only overwrite bytes in the address it writes to; it cannot increment the third lowest byte to extend this range. So this strategy would require reducing the amount of memory in the dynamic buffer that the exploit needs, but if that’s possible, it could potentially remove the unreliability caused by the second nibble randomization of the dynamic buffer.

To remove the 1/16 probability when guessing the load address of libcodec2_soft_ddpdec.so: if it were possible to copy a pointer into the dynamic buffer, the second nibble of that pointer could then be used as the emdf_container_length of a syncframe. For most lengths, it’s then possible to craft an EMDF container that does not trigger the bug if the length is too short, because the bytes triggering the bug aren’t processed, and does not trigger the bug if the length is too long (evo_parse_payload is called twice and the bug triggers on the second call, so an invalid payload occurring after the trigger point prevents the trigger from running). Then a series of syncframes that work with all 16 possible library offsets could be crafted, and only the correct ones would be processed.

The real challenge here would be copying from the static buffer to the dynamic buffer without guessing the library location, as both the direct and indirect calls available are quite limited. But if this was possible, the unreliability due to not knowing the library load address could be avoided, at the cost of substantial development effort.

Overall, I suspect it’s possible to substantially improve the reliability of this exploit, though it would likely require several months more development effort.

Reflections on Mitigations

My progress writing this exploit was impeded by several Android platform mitigations, while others were not as effective as I expected, so I want to take this chance to reflect on what worked and what can be improved.

ASLR was by far the most challenging mitigation to bypass; this exploit would have been substantially easier to write without it. Partially overwriting pointers to bypass ASLR is a common exploit strategy, and I was surprised by how much more difficult randomization of the low bits of a pointer made it. While it’s also important that pointers have enough overall randomization that they can’t be guessed, my takeaway is that randomization of low address bits does a lot more to increase exploit development time than randomization of high bits.

I also performed a lot of testing of Android ASLR, and I did not find any areas that were not randomized enough to prevent exploitation. This has not always been true of Android in the past, and I was pleased to see that Android ASLR appears to be well implemented and tested.

SELinux also made exploitation more difficult, as a lot of ‘classic’ techniques for running shellcode didn’t work, and I was lucky to have access to experts like Seth and Jann who could help me understand the restrictions on the system and how to get around them. That said, that is likely a one-time cost for attackers: once they learn strategies for bypassing SELinux, those strategies will work for multiple exploits.

The mediacodec context usually has seccomp rules that prevent a process from executing syscalls that aren’t needed for its normal functionality. A policy is implemented in AOSP, and I tested that the Samsung S24 enforces this policy on its media decoding processes. However, this was somehow left out of the Pixel 9. A seccomp policy similar to Samsung’s would have prevented the call to pwrite used by the exploit. This wouldn’t have prevented exploitation, as every syscall needed to access the BigWave vulnerability this exploit chains into must be callable by the decoder process for decoding to function correctly, but it likely would have forced the exploit to be written entirely in ROP, versus jumping to shellcode. This would have added at least a few more weeks of exploit development effort.

Likewise, the accessibility of /proc/self/mem was a big shortcut to exploitation. Since it is only needed during debugging, I wonder if it is possible to implement some sort of mitigation that makes it inaccessible when a device is not being debugged.

scudo also lacked mitigations that could have made this exploit much more difficult, or even impossible. It was surprisingly easy to modify secondary headers to ‘trick’ the allocator into moving an allocation, whereas in the primary partition this would have been prevented by checksums. While vulnerabilities that allow a scudo secondary header to be modified are fairly rare, as every scudo secondary allocation is preceded by a guard page, the performance cost of adding checksums to the secondary partition would likely be limited, as most applications make far fewer secondary allocations than primary allocations.

It’s also important to note that part of why this vulnerability was exploitable in a 0-click context is that it is an exceptionally high quality bug. It contained both the ability to leak memory and to overwrite it, provided a high level of control over each, and the structures that could be corrupted by the overwrite were unusually fortuitous. That said, the memory layout that enabled this isn’t unusual among media decoders. For example, the H264 decoder that I reported this 2022 vulnerability in has a similar layout, with large structs, and could potentially be prone to similar exploitation techniques involving overflows between struct members.

On Mac and iOS devices we tested, the UDC is compiled with -fbounds-safety, a compiler mitigation which injects bounds checks into a compiled binary, including the bounds of arrays within C structs. We believe CVE-2025-54957 is not exploitable on binaries compiled with this mitigation. While there is a performance cost, compiling all media libraries with this flag would greatly reduce the number of exploitable vulnerabilities of this type. Even in situations where this is not practical in production, testing and fuzzing media libraries with -fbounds-safety enabled could make it easier to find and fix this type of exceptionally exploitable vulnerability.

The Next Step

Now that we’ve gained code execution in the mediacodec context, it is time to escalate to kernel! Stay tuned for Part 2: Cracking the Sandbox with a Big Wave.

Welcome to the new Project Zero Blog

16 December 2025 at 11:00

While on Project Zero, we aim for our research to be leading-edge, our blog design was … not so much. We welcome readers to our shiny new blog!

For the occasion, we asked members of Project Zero to dust off old blog posts that never quite saw the light of day. And while we wish we could say the techniques they cover are no longer relevant, there is still a lot of work that needs to be done to protect users against zero days. Our new blog will continue to shine a light on the capabilities of attackers and the many opportunities that exist to protect against them.

From 2016: Windows Exploitation Techniques: Race conditions with path lookups by James Forshaw
From 2017: Thinking Outside The Box by Jann Horn

Thinking Outside The Box [dusted off draft from 2017]

16 December 2025 at 10:00

Preface

Hello from the future!

This is a blogpost I originally drafted in early 2017. I wrote what I intended to be the first half of this post (about escaping from the VM to the VirtualBox host userspace process with CVE-2017-3558), but I never got around to writing the second half (going from the VirtualBox host userspace process to the host kernel), and eventually sorta forgot about this old post draft… But it seems a bit sad to just leave this old draft rotting around forever, so I decided to put it in our blogpost queue now, 8 years after I originally drafted it. I’ve very lightly edited it now (added some links, fixed some grammar), but it’s still almost as I drafted it back then.

When you read this post, keep in mind that unless otherwise noted, it is describing the situation as of 2017. Though a lot of the described code seems to not have changed much since then…

Introduction

VM software typically offers multiple networking modes, including a NAT mode that causes traffic from the VM to appear as normal traffic from the host system. Both QEMU and VirtualBox use forks of Slirp for this. Slirp is described as follows on its homepage:

Slirp emulates a PPP or SLIP connection over a normal terminal. This is an actual PPP or SLIP link, firewalled for people’s protection. It makes a quick way to connect your Palm Pilot over the Internet via your Unix or Linux box!!! You don’t need to mess around with your /etc/inetd.conf or your /etc/ppp/options on your system.

Slirp is a useful basis for VM networking because it can parse raw IP packets (coming from the emulated network adapter) and forward their contents to the network using the host operating system’s normal, unprivileged networking APIs. Therefore, Slirp can run in the host’s userspace and doesn’t need any special kernel support.

Neither QEMU nor VirtualBox uses the upstream Slirp code directly; instead they use patched versions where, for example, the feature for setting up port forwards by talking to a magic IP address is removed. Especially in VirtualBox, the Slirp code has been altered a lot.

This post describes an issue in VirtualBox and how it can be exploited. Some parts are specific to the host operating system; in those cases, this post focuses on the situation on Linux.

The packet heap in VirtualBox

The VirtualBox version of Slirp uses a custom zone allocator for storing packet data, in particular, incoming ethernet frames. Each NAT network interface has its own zone (zone_clust) with nmbclusters=1024+32*64=3072 chunks of size MCLBYTES=2048. The initial freelist of each zone starts at the high-address end of the zone and linearly progresses towards the low-address end.

The heap uses inline metadata; each chunk is prefixed with the following structure:

struct item {
    uint32_t magic; // (always 0xdead0001)
    uma_zone_t zone; // (pointer to the zone; uma_zone_t is struct uma_zone *)
    uint32_t ref_count;
    struct {
        struct type *le_next; // (next element)
        struct type **le_prev; // (address of previous le_next)
    } list; // (entry in the freelist or in used_items, the list of used heap chunks)
};

Chunks are freed through the methods m_freem -> m_free -> mb_free_ext -> uma_zfree -> uma_zfree_arg -> slirp_uma_free. The uma_zfree_arg() function takes pointers to the real zone structure and to the chunk data as arguments and checks some assertions before calling slirp_uma_free() as zone->pfFree():

void uma_zfree_arg(uma_zone_t zone, void *mem, void *flags) {
    struct item *it;
    [...]
    it = &((struct item *)mem)[-1];
    Assert((it->magic == ITEM_MAGIC));
    Assert((zone->magic == ZONE_MAGIC && zone == it->zone));

    zone->pfFree(mem,  0, 0); // (zone->pfFree is slirp_uma_free)
    [...]
}

Unfortunately, Assert() is #define’d to do nothing in release builds - only “strict” builds check for the condition. The builds that are offered on the VirtualBox download page are normal, non-strict release builds.

Next, slirp_uma_free() is executed:

static void slirp_uma_free(void *item, int size, uint8_t flags) {
    struct item *it;
    uma_zone_t zone;
    [...]
    it = &((struct item *)item)[-1];
    [...]
    zone = it->zone;
    [...]
    LIST_REMOVE(it, list);
    if (zone->pfFini)
    {
        zone->pfFini(zone->pData, item, (int /*sigh*/)zone->size);
    }
    if (zone->pfDtor)
    {
        zone->pfDtor(zone->pData, item, (int /*sigh*/)zone->size, NULL);
    }
    LIST_INSERT_HEAD(&zone->free_items, it, list);
}

slirp_uma_free() grabs the zone pointer from the chunk header. Because Assert() is compiled out, there is no validation to ensure that this zone pointer points to the actual zone - an attacker who can overwrite the chunk header could cause this method to use an arbitrary zone pointer. Then, the member pfFini of the zone is executed, which, for an attacker who can point it->zone to controlled data, means that an arbitrary method call like this can be executed:

{controlled pointer}({controlled pointer}, {pointer to packet data}, {controlled u32});

Because the VirtualBox binary, at least for Linux, is not relocatable and has memcpy() in its PLT section, this can be used as a write primitive by using the static address of the PLT entry for memcpy() as the function address:

memcpy(dest={controlled pointer}, src={packet data}, n={controlled u32})

This means that, even though the packet heap doesn’t contain much interesting data, a heap memory corruption that affects chunk headers could still be used to compromise the VirtualBox process rather easily.

The Vulnerability

In changeset 23155, the following code was added at the top of ip_input(), the method that handles incoming IP packets coming from the VM, before any validation has been performed on the IP headers. m points to the buffer structure containing the packet data pointer and the actual length of the packet data, ip points to the IP header inside the untrusted packet data. RT_N2H_U16() performs an endianness conversion.

if (m->m_len != RT_N2H_U16(ip->ip_len))
    m->m_len = RT_N2H_U16(ip->ip_len);

This overwrites the trusted buffer length with the contents of the untrusted length field from the IP packet. This is particularly bad because all safety checks assume that m->m_len is correct - these two added lines basically make all following length checks useless.

Later, in changeset 59063, the following comment was added on top of those lines:

/*
* XXX: TODO: this is most likely a leftover spooky action at
* a distance from alias_dns.c host resolver code and can be
* g/c'ed.
*/
if (m->m_len != RT_N2H_U16(ip->ip_len))
    m->m_len = RT_N2H_U16(ip->ip_len);

One straightforward way to abuse this issue is to send a small ICMP_ECHO packet with a large ip_len to the address 10.0.2.3, causing Slirp to send back a larger ICMP_ECHOREPLY with out-of-bounds heap data. However, Slirp validates the correctness of the ICMP checksum, meaning that the attacker has to guess the 16-bit checksum of the out-of-bounds heap data that the attacker is trying to leak. While it is possible to bruteforce this checksum, it is inelegant.
An easier way to leak heap data is to use UDP with the help of a helper machine on the other side of the NAT, e.g. on the internet. UDP has a 16-bit checksum over packet data as well, but unlike ICMP, UDP treats the checksum value 0 as “don’t check the checksum”. Therefore, by sending a UDP packet with checksum 0 and a bogus length in the IP header, it is possible to reliably leak out-of-bounds heap data. Since ip_len can be bigger than the chunk size, this also permits leaking the headers (and contents) of following chunks, disclosing information about the heap state, the heap location and the location of the struct uma_zone.
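
This trick is easy to prototype outside VirtualBox. The following sketch builds such a leak packet; the addresses, ports and claimed length are illustrative, and the layout simply follows RFC 791/768:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Ones'-complement checksum over 16-bit words (RFC 1071), used for the
// IP header checksum, which Slirp does verify.
uint16_t Checksum16(const uint8_t* data, size_t len) {
    uint32_t sum = 0;
    for (size_t i = 0; i + 1 < len; i += 2)
        sum += (data[i] << 8) | data[i + 1];
    if (len & 1) sum += data[len - 1] << 8;
    while (sum >> 16) sum = (sum & 0xffff) + (sum >> 16);
    return static_cast<uint16_t>(~sum);
}

// Builds a minimal IPv4+UDP packet whose IP total-length field claims
// `claimed_len` bytes even though only 28 header bytes are actually sent.
// The UDP checksum is 0 ("don't check the checksum"), so nothing has to be
// guessed about the out-of-bounds heap data echoed back via the helper.
std::vector<uint8_t> BuildLeakPacket(uint16_t claimed_len) {
    std::vector<uint8_t> pkt(28, 0);
    pkt[0] = 0x45;                     // version 4, header length 20 bytes
    pkt[2] = claimed_len >> 8;         // bogus ip_len
    pkt[3] = claimed_len & 0xff;
    pkt[8] = 64;                       // TTL
    pkt[9] = 17;                       // protocol = UDP
    const uint8_t src[4] = {10, 0, 2, 15};  // guest address in the NAT
    const uint8_t dst[4] = {192, 0, 2, 1};  // helper machine (illustrative)
    std::memcpy(&pkt[12], src, 4);
    std::memcpy(&pkt[16], dst, 4);
    uint16_t ipsum = Checksum16(pkt.data(), 20);
    pkt[10] = ipsum >> 8;
    pkt[11] = ipsum & 0xff;
    pkt[20] = 0x13; pkt[21] = 0x37;    // source port (arbitrary)
    pkt[22] = 0x00; pkt[23] = 0x35;    // destination port (arbitrary)
    uint16_t udp_len = claimed_len - 20;
    pkt[24] = udp_len >> 8;            // UDP length inflated to match
    pkt[25] = udp_len & 0xff;
    pkt[26] = 0; pkt[27] = 0;          // checksum 0: "don't check"
    return pkt;
}
```

Sent through the emulated NIC, everything past the 28 real bytes that Slirp forwards to the helper is out-of-bounds heap data.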

The next step is to somehow use the bug to corrupt chunk headers. Most of the code only reads from incoming packets; however, when a packet with IP options arrives in udp_input() or tcp_input(), the IP payload (meaning the TCP or UDP packet header and everything following it) is moved over the IP options using ip_stripoptions():

void ip_stripoptions(struct mbuf *m, [...])
{
    register int i;
    struct ip *ip = mtod(m, struct ip *);
    register caddr_t opts;
    int olen;
    NOREF(mopt); /** @todo do we really will need this options buffer? */

    olen = (ip->ip_hl<<2) - sizeof(struct ip);
    opts = (caddr_t)(ip + 1);
    i = m->m_len - (sizeof(struct ip) + olen);
    memcpy(opts, opts  + olen, (unsigned)i);
    m->m_len -= olen;

    ip->ip_hl = sizeof(struct ip) >> 2;
}

This means that, by sending a TCP or UDP packet with IP options and a bogus length that is bigger than a heap chunk, it is possible to move the packet payload of the following heap chunk over the corresponding heap chunk header.
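
To get a feel for the scale of the corruption, plug illustrative numbers into the size computation from ip_stripoptions(): with a forged m_len near the 16-bit maximum, the memcpy() reaches far beyond the 2048-byte chunk holding the packet:

```cpp
#include <cstddef>

// Mirrors i = m->m_len - (sizeof(struct ip) + olen) from ip_stripoptions().
// The concrete values below are illustrative.
constexpr size_t kIpHeaderLen = 20;      // sizeof(struct ip)
constexpr size_t kOptionsLen  = 4;       // olen: a single 4-byte IP option
constexpr size_t kForgedMLen  = 0xffff;  // m_len after the ip_len overwrite
constexpr size_t kChunkSize   = 2048;    // MCLBYTES

constexpr size_t kCopyLen = kForgedMLen - (kIpHeaderLen + kOptionsLen);

static_assert(kCopyLen == 65511, "bytes moved by the memcpy");
static_assert(kCopyLen / kChunkSize == 31, "spans roughly 32 adjacent chunks");
```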

Exploitation: Going up to host userspace

In this part of the post, I’m going to show how it’s possible to break out of the VM and run arbitrary shell commands on the host system using system().

Assuming that a sufficiently big portion of the packet heap is unused, the behavior of the allocator can be simplified by allocating all fragmented heap memory, leaving only a pristine freelist that linearly allocates downwards (as shown at the top of the post). Heap chunks can be allocated by sending IP packets with the “more fragments” bit set; such IP packets have to be stored in memory until either the remaining fragments have been received or the maximum number of pending fragments is reached. An attack that is optimized for maximum reliability would probably go a more complex route and use an approach that still works with an arbitrarily fragmented heap.

The first step is to place the command that should be given to system() in memory and determine at which address it was placed. To do this, assuming that the freelist grows downwards linearly, the attacker can first send an IP fragment containing the shell command (causing the IP fragment to be stored), then send a crafted UDP packet to leak data:

(Note: le_prev and le_next are now pointers on the list of used heap chunks (free_items), not the freelist, and therefore the le_next pointer points upwards.)

While the leaked data does not contain a pointer to the chunk containing the shell command, it contains pointers to the adjacent chunk headers, which can be used to calculate the address of the shell command.

The next big step is to figure out the address of system(). Because there is no PLT entry for system(), there is no fixed address the attacker can jump to to invoke the function. However, using the contents of the global offset table, an attacker can first compute the offsets between libc symbols and use them to identify the libc version, then use a GOT entry and the known offset of system() relative to the address the GOT entry points to in that libc version to compute the address of system(). Unfortunately, there seems to be no nice way to directly read from the GOT using the bug, so this has to be done in a somewhat ugly way.

It is possible to use the bug as a write primitive by calling memcpy() as described in the section “The packet heap in VirtualBox”. In general, functions can be called using the bug as follows:

First, the attacker places a fake struct uma_zone (zone header) in memory and determines the address of the fake struct uma_zone, just like the shell command was placed in memory. Next, the attacker sends a packet containing a fake struct item (chunk header) and moves it over the real chunk header using an adjacent UDP packet with a bogus length field and with IP options:


The result is a chunk with an attacker-controlled header that points to the fake struct uma_zone:


Next, this chunk can be freed by sending a corresponding second IP fragment, causing the member pfFini of the fake uma_zone to be called with arguments zone->pData (attacker-controlled), item (the data directly behind the fake chunk header) and zone->size (again attacker-controlled).

In the case of memcpy(), one issue here is that the fake IP header must be valid; otherwise, the packet might not be recognized during fragment reassembly. Therefore, only the space that would normally be occupied by the ethernet header (14 bytes long) can be used to store the payload; to write larger payloads, multiple function calls must be made.

At this point, using the write primitive, it is possible to leak the GOT contents by overwriting memory as follows (red parts are modified):

First, a fake heap chunk header is placed at the start of the GOT, which is writable and at a fixed address. Because after the VirtualBox process has started, only library code is executed, the corruption of the start of the GOT is not a problem. The le_next pointer of the fake chunk header points to a legitimate chunk that is currently in a pristine area of the original freelist. Now, the attacker can overwrite the freelist head pointer free_items.lh_first in the zone header, causing the fake chunk in the GOT to be returned by a legitimate future allocation.

At this point, the attacker can send another UDP packet with a bogus length field in the IP header. This UDP packet will be placed at the start of the GOT, and out-of-bounds data behind the packet will leak - in other words, the remaining normal GOT entries.

At this point, the attacker can determine the location of system() and call system() with a fully controlled argument.

Conclusion [from the future]

As I noted in the introduction, none of the relevant code seems to have changed much since I found this bug in 2017 - I think if you found a similar bug in the VirtualBox networking code today, it would likely still be exploitable in a similar way.

VirtualBox uses a separate memory region for packet memory allocations - that’s probably intended as a performance optimization. This implementation choice should also make it harder to exploit packet memory UAF bugs as a side effect, since no packets contain pointers, kind of like PartitionAlloc or kalloc_type. However, it might still be possible to exploit a packet memory UAF as TOCTOU by making use of an already-validated length value or such.
This could have also made it harder to exploit packet memory linear OOB write bugs - but the choice of using inline metadata, and not protecting against corruption of this metadata at all, makes OOB write bugs in this allocator region highly exploitable.

Windows Exploitation Techniques: Winning Race Conditions with Path Lookups

16 December 2025 at 09:00

This post was originally written in 2016 for the Project Zero blog. However, in the end it was published separately in the journal PoC||GTFO issue #13 as well as in the second volume of the printed version. In honor of our new blog we’re republishing it on this blog and included an updated analysis to see if it still works on a modern Windows 11 system.

During my Windows research I tend to find quite a few race condition vulnerabilities. A fairly typical exploitable form looks something like this:

  1. Do some security check
  2. Access some resource
  3. Perform secure action

If you can change the state of the system between steps 1 and 3 you might be able to bypass a security check or cause other security issues. The big problem is that the race window is generally extremely short. In some cases it might be exploitable by running an exploit enough times and hoping you hit it at least once. In other cases you might have one shot at success; if you can’t guarantee you’ll win the race every time, it might be effectively unexploitable (however, that’s not to say you shouldn’t report it to the vendor anyway).

Over the years I’ve come up with various techniques to expand the race window, including file Opportunistic Locks and trapping virtual memory access. However, those techniques are not always appropriate, so I wanted to find a way of increasing the time window to win the race in cases where the code accesses a resource we control. Specifically, we’re going to attack the lookup process for a named resource. The following is an overview of my thought process to come up with a working solution.

Investigating Object Manager Lookup Performance

Hidden under the hood of Windows NT is the Object Manager Namespace (OMNS). You wouldn’t typically interact with it directly, the Win32 API for the most part hides it away. The NT kernel defines a set of objects, such as Files, Events, Registry Keys, which can all have a name associated with the object. The OMNS provides the means to lookup these named objects. It acts like a file system, so for example you can specify a path to an NT system call such as \BaseNamedObjects\MyEvent and an event object can be looked up and opened.

There are two special object types which are for use in the OMNS, Object Directories and Symbolic Links. Object Directories act as named containers for other objects, whereas Symbolic Links allow a name to be redirected to another OMNS path. Symbolic Links are used quite a lot, for example the Windows drive letters are really a symbolic link to the real volume device object. When we call an NT system call the kernel must lookup the entire path, following any symbolic links until it reaches the named object, or fails to find a match.

To create a useful exploitation technique, we want to make the process of looking up a resource we control as slow as possible. For example, if we could make it take 1 or 2 seconds, then we’ve got a massive window of opportunity to win the race condition. Therefore, I want to find a way of manipulating the Object Manager lookup process in such a way that we achieve this goal.

A note about the testing setup: all tests will open a named event object, which is simulating step 2 in the previous list of exploitable operations. The system used is a new Surface Pro 11th Edition CoPilot+ PC with a Snapdragon X Elite running at 3.40GHz. This system has Windows 11 24H2 installed, however from what I can tell, no AI feature was harmed in the making of these results.

First, let’s just measure the time it takes to do a normal lookup. To try and minimize overhead, we’ll write the test in C++ as follows. It creates a named event, then opens the event with a specified number of iterations. Finally it’ll return the time in μs that a single iteration took based on the measurement from the QueryPerformanceCounter API. I’ve not included the support classes in the listing, that’ll be available in the project I’ll link to later.

static double RunTest(const wstring name, int iterations, 
        wstring create_name = L"", HANDLE root = nullptr) {
    if (create_name.empty()) {
        create_name = name;
    }
    ScopedHandle event_handle = CreateEvent(create_name, root);
    ObjectAttributes obja(name);
    vector<ScopedHandle> handles;
    Timer timer;
    for (int i = 0; i < iterations; ++i) {
        HANDLE open_handle;
        Check(NtOpenEvent(&open_handle, MAXIMUM_ALLOWED, &obja));
        handles.emplace_back(open_handle);
    }
    return timer.GetTime(iterations);
}

For the test I’ll pick a simple unique name, such as \BaseNamedObjects\MyEvent. With an iteration count of 1000, the results on my test system are probably what we’d expect: the lookup process for a simple named event takes approximately 2μs. That includes the system call transition, the lookup process and the access check on the event object.

While, in theory, you could win a race with this amount of time, it seems pretty unlikely, even on a multicore processor. So let’s think about a way of improving the lookup time (and when I say “improve” I mean making the lookup time slower). We can immediately consider two similar approaches:

  1. Make a path which contains one very long name. The lookup process would have to compare the entire name using a string comparison operation to verify it’s accessing the correct object. This should take linear time relative to the length of the string, even if the comparison operation is heavily optimized.
  2. Make multiple small named directories and recurse. E.g. \A\A\A\A\…\EventName. The assumption here is that each lookup takes a fixed amount of time to complete. The operation should again be linear time relative to the depth of recursion of the directories.

At this point we’ve not had to look at any actual kernel code, and we’ll not start quite yet, so instead more empirical testing seems the way to go. Let’s start with the first approach, making a long string and performing a lookup on it.

How long can the path string be? An object manager path is limited to the maximum string size afforded by the UNICODE_STRING structure.

struct UNICODE_STRING {
  USHORT Length;
  USHORT MaximumLength;
  PWSTR  Buffer;
};

We can see that the Length member is a USHORT, an unsigned 16-bit integer, which limits the maximum length to 2^16 - 1. This, however, is a byte count, so in fact it limits us to 2^15 - 1, or 32767, wide characters. We’ll need to be able to make the object in a writable directory such as \BaseNamedObjects, which reduces the length slightly, but not enough to make a significant impact. Therefore we’ll open the event object through names between 1 character and 32000 characters in length using the following code:

std::wstring path;
while (path.size() <= 32000) {
    auto result = RunTest(L"\\BaseNamedObjects\\A" + path, 1000);
    printf("%zu,%f\n", path.size(), result);
    path += std::wstring(500, 'A');
}

The results are shown below:

While it’s a little noisy it seems like the assumption of a linear lookup time is correct. The longer the string, the longer it takes to look it up. For a 32000 character long string this seems to top out at approximately 35μs. Still not enough in my opinion for a useful primitive, but it’s certainly a start.

Now let’s look at the recursive directory approach. In this case, the upper bound is around 16000 directories. This is because each path component must contain at least two characters, a backslash and a single character name (e.g. \A\A\A…). Therefore our maximum path limit is halved. Of course we’d make the assumption that the time to go through the lookup process is going to be greater than the time it takes to compare 4 unicode characters, but let’s test to make sure.

ScopedHandle base_dir = OpenDirectory(L"\\BaseNamedObjects");
HANDLE last_dir = base_dir.get();
std::vector<ScopedHandle> dirs;
for (int i = 0; i < 16000; i++) {
    dirs.emplace_back(CreateDirectory(L"A", last_dir));
    last_dir = dirs.back().get();
    if ((i % 500) == 0)
    {
        auto result = RunTest(GetName(last_dir) + L"\\X", iterations);
        printf("%d,%f\n", i + 1, result);
    }
}

The results are shown below:


The results are what we might expect, it seems linear, at least until around 13000 recursive directories where there is a disjoint transition. I ran the test multiple times on the same machine and always got the same issue, however running it on an x64 machine didn’t show the same artifact so I don’t think it’s a problem with the code.

Still, it’s unequivocal that the time to lookup an object is linear based on the number of recursive directories. For a 16000 recursive depth the average lookup time is around 1300μs or approximately 40 times larger than the long path name lookup result. Now of course this comes with downsides. For a start you need to create 16000 or so directory objects in the kernel, each directory takes up some amount of kernel pool memory. On a 64 bit platform this is unlikely to be a problem.

We also have the setup time to consider, too long and we might still miss the race condition. We can speed up the process of creating the directories by using the ability of Windows system calls to create an object relative to an existing directory. This allows us to avoid parsing the full path for every new directory, which is after all what we’re trying to make slow.

Also, the process must maintain a handle to each of those directories, otherwise they’d be deleted, as a normal user can’t make kernel objects permanent. Fortunately, the handle limit for a single process is on the order of 16 million, so 16000 handles leaves us orders of magnitude below that limit.

Now is 1300μs going to be enough for us? Maybe, it’s certainly orders of magnitude greater than 2μs for a normal lookup. But can we do better? We’ve run out of path space now, we’ve filled the absolute maximum allowed string length with recursive directory names. What we need is a method of multiplying that effect without requiring a longer path.

Here we can use object manager symbolic links. By placing the symbolic link as the last component of the long path we can force the kernel to reparse, and start the lookup all over again. On the final lookup we’ll just point the symbolic link to the target.

Through testing, we can only redirect using symbolic links 64 times before receiving an error. Why can’t we do this indefinitely? Well, for a fairly obvious reason: each time a symbolic link is encountered the kernel restarts the parsing process, so if you pointed a symbolic link at itself you’d end up in an infinite loop. The 64-reparse limit prevents that from becoming a problem. The following code will do this test for us:

ScopedHandle base_dir = OpenDirectory(L"\\BaseNamedObjects");
HANDLE last_dir = base_dir.get();
std::vector<ScopedHandle> dirs;
for (int i = 0; i < 16000; i++) {
    dirs.emplace_back(CreateDirectory(L"A", last_dir));
    last_dir = dirs.back().get();
}
std::vector<ScopedHandle> links;
std::wstring last_dir_name = GetName(last_dir);
for (int i = 0; i < 63; ++i) {
    links.emplace_back(CreateLink(IntToString(i), last_dir, 
                       last_dir_name + L"\\" + IntToString(i + 1)));
}
printf("%f\n", RunTest(links.front().name(), 10, L"63", last_dir));

We only do 10 test iterations to minimize the time we need to run. The results are as we expected, time taken to look up our event is proportional to both the number of symbolic links and the number of recursive directories. For 64 symbolic links and 16000 directories it takes approximately 4.5ms to lookup the event (note I’ve had to change the scale of the result now to milliseconds). That should be enough, right? Maybe, but I’m greedy, I want more. How can we make the lookup time even worse?

At this point, it’s time to break out the disassembler and see how the lookup process works under the kernel. First off, let’s see what an object directory structure looks like. We can dump it from a kernel debugging session using WinDBG with the command dt nt!_OBJECT_DIRECTORY. Converted back to a C style structure it looks something like the following:

struct OBJECT_DIRECTORY {
     POBJECT_DIRECTORY_ENTRY HashBuckets[37];
     EX_PUSH_LOCK Lock;
     PDEVICE_MAP DeviceMap;
     ULONG SessionId;
     PVOID NamespaceEntry;
     ULONG Flags;
     POBJECT_DIRECTORY ShadowDirectory;
};

Based on the presence of the HashBuckets field, it’s safe to assume that the kernel is using a hash table to store directory entries. This makes some sense: if the kernel just maintained a list of directory entries it’d be pretty poor for performance, whereas with a hash table the lookup time is reduced, as long as the hashing algorithm does a good job of avoiding collisions. That only holds, though, if the algorithm isn’t being actively exploited. As we’re trying to increase the cost of lookups, we can intentionally add colliding entries to make the lookup process take the worst-case time, which is linear in the number of entries in a directory. This provides us with another scaling factor, and in this case the number of entries is only limited by available memory, as we never need to put these names into the path.

So what’s the algorithm for the hash? The main function of interest is ObpLookupObjectName, which is referenced by functions such as ObReferenceObjectByName. The directory entry logic is buried somewhere in this large function; fortunately, however, there’s a helper function, ObpLookupDirectoryEntry, with the same logic (it isn’t actually called by ObpLookupObjectName, but that doesn’t matter) which is smaller and easier to reverse engineer. The following is a simplified version of it:

POBJECT_DIRECTORY ObpLookupDirectoryEntry(POBJECT_DIRECTORY Directory,
                                          PUNICODE_STRING Name,
                                          ULONG AttributeFlags) {
  BOOLEAN CaseInSensitive = (AttributeFlags & OBJ_CASE_INSENSITIVE) != 0;
  SIZE_T CharCount = Name->Length / sizeof(WCHAR);
  WCHAR* Buffer = Name->Buffer;
  ULONG Hash = 0;  
  while (CharCount) {
    Hash = (Hash / 2) + 3 * Hash;
    Hash += RtlUpcaseUnicodeChar(*Buffer);
    Buffer++;
    CharCount--;
  }

  OBJECT_DIRECTORY_ENTRY* Entry = Directory->HashBuckets[Hash % 37];
  while(Entry) {
    if (Entry->HashValue == Hash) {
      if (RtlEqualUnicodeString(Name, 
            ObpGetObjectName(Entry->Object), CaseInSensitive)) {
        ObReferenceObject(Entry->Object);
        return Entry->Object;
      }
    }
    Entry = Entry->ChainLink;
  }
  
  return NULL;
}

So the hashing algorithm is pretty simple: it repeatedly mixes the bits of the current hash value, then adds the uppercased unicode character to the hash. We could work out a clever way of getting hash collisions from this, but actually it’s easy: the object manager allows us to specify names containing NUL characters. Therefore, if we take our target name, say ‘A’, and prefix it with increasingly long strings containing only NULs, we get both hash and bucket collisions. Due to the path character limit we can only create 32000 or so colliding entries, but as we’ll see that’s not a problem. The following code will test this behavior:

int collision_count = 32000;
ScopedHandle base_dir = CreateDirectory(L"\\BaseNamedObjects\\A");
ScopedHandle test_dir = CreateDirectory(L"A", base_dir.get());
vector<ScopedHandle> dirs;
for (int i = 0; i < collision_count - 1; i++) {
    wstring name = MakeCollisionName(collision_count - i);
    dirs.emplace_back(CreateDirectory(name, base_dir.get()));
    if ((i % 500) == 0) {
        Timer timer;
        for (int j = 0; j < iterations; ++j) {
            OpenDirectory(L"A", base_dir.get());
        }
        printf("%d,%f\n", i, timer.GetTime(iterations));
    }
}
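
The collision construction can be sanity-checked outside the kernel with a portable reimplementation of the hash loop from the decompiled ObpLookupDirectoryEntry above (upcasing is approximated with ASCII, which is exact for these names; MakeCollisionName mirrors the helper used in the test code):

```cpp
#include <cstdint>
#include <string>

// The hash from ObpLookupDirectoryEntry: mix the current hash, then add
// the upcased character. A NUL character maps hash 0 -> (0/2) + 3*0 + 0,
// so any run of leading NULs leaves the hash unchanged.
uint32_t OmnsHash(const std::u16string& name) {
    uint32_t hash = 0;
    for (char16_t c : name) {
        hash = (hash / 2) + 3 * hash;
        hash += (c >= u'a' && c <= u'z') ? c - 0x20 : c;
    }
    return hash;
}

// The target name prefixed with an increasing number of NULs: distinct
// names that all land in the same hash bucket with the same HashValue.
std::u16string MakeCollisionName(size_t nul_count) {
    return std::u16string(nul_count, u'\0') + u"A";
}
```

Every MakeCollisionName(n) hashes to the same value (and so the same HashBuckets[hash % 37] slot) as the plain name, while the string comparison still treats them as distinct entries, forcing a full walk of the chain on every lookup.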

Let’s look at the results of doing this for a single directory:

The chart shows a more or less linear graph. For a given collision count it’s nowhere near as good as the recursive directory approach, around 100μs versus 1300μs but it is a multiplicative factor in the lookup time which we can abuse.

We can apply this additional factor to all our 16000 recursive directories, add in symbolic links, and we’ll probably get an insane lookup time. However there’s a problem: insertion time. Every time we add a new entry to a directory the kernel must do a lookup to check that the entry doesn’t already exist. This means that for every new directory entry we add we must do n-1 checks in the hash table just to find that we don’t already have the entry before we insert it, so the total time to fill a directory is approximately proportional to the square of the number of entries. Sure, it’s not a cubic or exponential increase, but that’s hardly a consolation. On the test machine it takes approximately 2.5s (yes, seconds) to create a single collision directory with 32000 entries. If we wanted to do that for all 16000 recursive directory entries it would take around 12 hours!
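
The quadratic cost is just counting: inserting the k-th colliding entry has to walk the k-1 entries already in the bucket before concluding the name is absent, so filling one directory costs n(n-1)/2 comparisons in total (a back-of-the-envelope sketch):

```cpp
#include <cstdint>

// Total hash-chain comparisons needed to build a directory of n entries
// that all land in the same bucket: sum over k=1..n of (k - 1) = n*(n-1)/2.
uint64_t TotalInsertComparisons(uint64_t n) {
    return n * (n - 1) / 2;
}
```

For a 32000-entry directory that is roughly 512 million comparisons, consistent with the ~2.5s observed on the test machine; repeating it for 16000 directories multiplies that cost by 16000.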

Okay I think we’re going a bit over the top here, by fiddling with the values we can get something which doesn’t take too long to set up and gives us a long lookup time. But I’m still greedy. I want to see how far I can push the lookup time, is there any way we can get the best of all worlds?

The final piece of the puzzle is to bring in Shadow Directories, which give the Object Manager a fallback path if it can’t find an entry in a directory. You can use almost any other Object Manager directory as a shadow, which allows us to control the lookup behavior. A Shadow Directory has a crucial difference from a symbolic link: it doesn’t cause a reparse to occur in the lookup process. This means shadow directories aren’t restricted by the 64-reparse limit. This doesn’t result in an infinite loop, as each lookup consumes a path component; eventually there’ll be no more path to look up. If we put together two directories in the following arrangement we can pass a similar path to our recursive directory lookup, without actually creating all the directories.

[Diagram: two object directories in a shadow configuration]
So how does this actually work? If we open a path of the form \A\A\A\A\A… the kernel will first look up the initial A directory, the one on the left of the diagram. It will then try to open the next A directory, the one on the right, which again it will find. Next the kernel again looks up A, but in this case it doesn’t exist; as the directory has a shadow link to its parent, the kernel looks there instead, finds the same A directory and repeats the process. This continues until we run out of path elements to look up.

So let’s determine the performance of this approach. We’d perhaps expect it to be less performant relative to actually creating all those directories but hopefully it won’t be too far behind. We can use the following code to do the test:

wstring dir_name = L"\\BaseNamedObjects\\A";
ScopedHandle shadow_dir = CreateDirectory(dir_name);
ScopedHandle target_dir = CreateDirectory(L"A", shadow_dir.get(), shadow_dir.get());
for (int i = 0; i < 16000; i += 500) {
    wstring open_name = dir_name;
    for (int j = 0; j < i; j++) {
        open_name += L"\\A";
    }
    open_name += L"\\X";
    printf("%d,%f\n", i, RunTest(open_name, iterations, L"X", 
                                 shadow_dir.get()));
}

And the results are as follows; for comparison, the chart includes the original test for the normal recursive lookup as well.

Looks good; interestingly, based on this test the lookup time is longer for shadow directories than for recursive directories. We still get a weird disjoint region, but in this case it starts earlier; perhaps it’s a cache effect based on the length of the string, or something like that?

So the final result is that instead of creating 16000 directories with 16000 collisions we can do it with just 2 directories, which is far more manageable and only takes around 5 seconds on my workstation. So to sign off, let’s combine everything together with the following code, which uses these parameters:

  • 16000 path components using 2 object directories in a shadow configuration
  • 16000 collisions per directory
  • 64 symbolic link reparses
wstring dir_name = L"\\BaseNamedObjects\\A";
ScopedHandle shadow_dir = CreateDirectory(dir_name);
ScopedHandle target_dir = CreateDirectory(L"A", shadow_dir.get(), shadow_dir.get());
vector<ScopedHandle> dirs;
CreateCollidingEntries(shadow_dir, 16000, dirs);
CreateCollidingEntries(target_dir, 16000, dirs);

wstring last_dir_name = dir_name;
for (int i = 0; i < 16000; i++) {
    last_dir_name += L"\\A";
}
const int symlink_count = 63;
vector<ScopedHandle> links;
for (int i = 0; i < symlink_count; ++i) {
    links.emplace_back(CreateLink(IntToString(i), shadow_dir.get(),
                       last_dir_name + L"\\" + IntToString(i + 1)));
}
printf("%f\n", RunTest(last_dir_name + L"\\0", 1, 
                       IntToString(symlink_count), shadow_dir.get()));

And the resulting time for a single lookup on the test system is *drum roll please* 3 minutes. I think we might just be able to win the race condition with that.

Conclusion

So after all that effort we can make the kernel take around 3 minutes to look up a single controlled resource path. That’s pretty impressive. We have many options to get the kernel to start the lookup process. Both the file system and the registry end up interacting with the object manager namespace, so for example you could plant an NTFS mount point with the initiating path to cause any process which opens that file to lock up for 3 minutes.

After 8 years it’s probably not surprising that Microsoft haven’t tried to do anything about this exploit technique. It’s a typical tale of unexpected behavior when facing pathological input; it’s probably not worth the impact on the object manager code to improve performance meaningfully.

Just a final point to note on performance. The timings presented here are going to vary wildly based on the performance of the machine so they should only be taken as guidelines. If you look back at the original publication of this post in PoC||GTFO you’ll find the timings are substantially longer. For example, the final test took 19 minutes on the Xeon workstation I used for testing rather than 3 minutes. I don’t know if this is an indication that the ARM64 CPU used in the Surface Pro was substantially faster than the Xeon, or if it was just the amount of cruft which runs on a typical workstation versus a freshly installed Windows 11 Microsoft PC. Regardless, if you can’t exploit the race condition in 3 or 19 minutes then your bug might truly be unexploitable.

You can find the full test code on Github.

Defeating KASLR by Doing Nothing at All

3 November 2025 at 09:00

Introduction

I’ve recently been researching Pixel kernel exploitation and, as part of this research, I found myself with an excellent arbitrary write primitive…but without a KASLR leak. As necessity is the mother of invention, on a hunch, I started researching the Linux kernel linear mapping.

The Linux Linear Mapping

The linear mapping is a region in the kernel virtual address space that is a direct 1:1 unstructured representation of physical memory. Working with Jann Horn, I learned how the kernel decides where to place this region in the virtual address space. To make it possible to analyze kernel internals on a rooted phone, Jann wrote a tool that calls tracing BPF’s privileged BPF_FUNC_probe_read_kernel helper, which by design permits arbitrary kernel reads. The code for this is available here. The linear mapping virtual address for a given physical address is calculated by the following macro:

#define phys_to_virt(x)    ((unsigned long)((x) - PHYS_OFFSET) | PAGE_OFFSET)

On Arm64 PAGE_OFFSET is simply:

#define VA_BITS			(CONFIG_ARM64_VA_BITS)
#define _PAGE_OFFSET(va)	(-(UL(1) << (va)))
#define PAGE_OFFSET		(_PAGE_OFFSET(VA_BITS))

As CONFIG_ARM64_VA_BITS is 39 on Android, it’s easy to calculate PAGE_OFFSET = 0xffffff8000000000.
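That arithmetic is easy to sanity check (Python standing in for the preprocessor, with the 64-bit truncation made explicit):

```python
# PAGE_OFFSET = -(1 << VA_BITS), interpreted as a 64-bit unsigned value.
VA_BITS = 39
PAGE_OFFSET = (-(1 << VA_BITS)) & 0xFFFFFFFFFFFFFFFF
assert hex(PAGE_OFFSET) == "0xffffff8000000000"
```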
PHYS_OFFSET is calculated by:

extern s64			memstart_addr;
/* PHYS_OFFSET - the physical address of the start of memory. */
#define PHYS_OFFSET		({ VM_BUG_ON(memstart_addr & 1); memstart_addr; })

memstart_addr is an exported variable that can be looked up in /proc/kallsyms. Using Jann’s bpf_arb_read program, it’s easy to see what this value is:

tokay:/ # grep memstart /proc/kallsyms                                         
ffffffee6d3b2b20 D memstart_addr
ffffffee6d3f2f80 r __ksymtab_memstart_addr
ffffffee6dd86cc8 D memstart_offset_seed
tokay:/ # cd /data/local/tmp
tokay:/data/local/tmp # ./bpf_arb_read ffffffee6d3b2b20 8
ffffffee6d3b2b20  00 00 00 80 00 00 00 00                          |........|
tokay:/data/local/tmp #

This value (0x80000000) doesn’t look particularly random. memstart_addr was theoretically randomized on every boot, but in practice this hasn’t happened for a while on arm64, and as of commit 1db780bafa4c it’s no longer even theoretical - virtual address randomization of the linear map is no longer a supported feature in the arm64 Linux kernel.

The systemic issue is that memory can (theoretically) be hot-plugged in Linux, and on Android in particular because of CONFIG_MEMORY_HOTPLUG=y; this feature is enabled on Android due to its usage in VM memory sharing. When new memory is plugged into an already running system, it must be possible for the Linux kernel to address this new memory, including adding it onto the linear map. Android on arm64 uses a page size of 4 KiB and 3-level paging, which means virtual addresses in the kernel are limited to 39 bits, unlike typical x86-64 desktops which use 4-level paging and have 48 bits of virtual address space (for kernel and userspace combined); the linear map has to fit within this space, further shrinking the area available for it. Given that the maximum amount of theoretical physical memory is far larger than the entire possible linear map region range, the kernel places the linear map at the lowest possible virtual address so it can theoretically be prepared to handle exorbitant quantities (up to 256GB) of hypothetical future hot-plugged physical memory. While it is not technically necessary to choose between memory hot-plugging support and linear map randomization, the Linux kernel developers decided not to invest the engineering effort to implement memory hot-plugging in a way that preserves linear map randomization.

So we now know that PHYS_OFFSET will always be 0x80000000, and thus the phys_to_virt calculation becomes purely static - given any physical address, you can calculate the corresponding linear map virtual address with the following formula:

#define phys_to_virt(x)    ((unsigned long)((x) - 0x80000000) | 0xffffff8000000000)
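As a sketch, here is a Python transliteration of that now-static macro (with the fixed PHYS_OFFSET and PAGE_OFFSET constants plugged in; the second example address is arbitrary):

```python
# Static phys_to_virt on Android arm64 once PHYS_OFFSET is known to be
# fixed at 0x80000000 and PAGE_OFFSET at 0xffffff8000000000 (VA_BITS=39).
PHYS_OFFSET = 0x80000000
PAGE_OFFSET = 0xFFFFFF8000000000

def phys_to_virt(phys):
    return (phys - PHYS_OFFSET) | PAGE_OFFSET

# The start of physical memory maps to the bottom of the linear map:
assert phys_to_virt(0x80000000) == 0xFFFFFF8000000000
# Any other physical address maps at the same fixed offset:
assert phys_to_virt(0x81FF2398) == 0xFFFFFF8001FF2398
```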

Kernel physical address non-randomization

Compounding this issue, it also happens that on Pixel phones, the bootloader decompresses the kernel itself at the same physical address every boot: 0x80010000.

tokay:/ # grep Kernel /proc/iomem
  80010000-81baffff : Kernel code
  81fc0000-8225ffff : Kernel data

Theoretically, the bootloader can place the kernel at a random physical address every boot, and many (but not all) other phones, such as the Samsung S25, do this. Unfortunately, Pixel phones are an example of a device that simply decompresses the kernel at a static physical address.

Calculating static kernel virtual addresses

This means that we can statically calculate a kernel virtual address for any kernel .data entry. Here’s an example of computing the linear map address for the modprobe_path string in kernel .data on a Pixel 9:

tokay:/ # grep modprobe_path /proc/kallsyms                                    
ffffffee6ddf2398 D modprobe_path
tokay:/ # grep stext /proc/kallsyms                                            
ffffffee6be10000 T _stext
//Offset from kernel base will be 0xffffffee6ddf2398 - 0xffffffee6be10000 = 0x1fe2398
//Physical address will be 0x80010000 + 0x1fe2398 = 0x81ff2398
//phys_to_virt(0x81ff2398) = 0xffffff8001ff2398

tokay:/ # /data/local/tmp/bpf_arb_read ffffff8001ff2398 64                     
ffffff8001ff2398  00 73 79 73 74 65 6d 2f 62 69 6e 2f 6d 6f 64 70  |.system/bin/modp|
ffffff8001ff23a8  72 6f 62 65 00 00 00 00 00 00 00 00 00 00 00 00  |robe............|
[ zeroes ]
tokay:/ # reboot
sethjenkins@sethjenkins91:~$ adb shell
tokay:/ $ su
tokay:/ # /data/local/tmp/bpf_arb_read ffffff8001ff2398 64
ffffff8001ff2398  00 73 79 73 74 65 6d 2f 62 69 6e 2f 6d 6f 64 70  |.system/bin/modp|
ffffff8001ff23a8  72 6f 62 65 00 00 00 00 00 00 00 00 00 00 00 00  |robe............|
[ zeroes ]
tokay:/ #

So modprobe_path will always be accessible at the kernel virtual address 0xffffff8001ff2398, in addition to its normal mapping, even with KASLR enabled. In practice, on Pixel devices you can derive a valid virtual address for a kernel symbol by calculating its offset and simply adding a hardcoded static kernel base address of 0xffffff8000010000. In short, instead of breaking the KASLR slide, it is possible to just use 0xffffff8000010000 as a kernel base instead.

The linear mapping memory is even mapped rw for any kernel .data regions. The only consolation that makes using this address slightly less effective than the traditional method of leaking the KASLR slide is that .text regions are not mapped executable - so an attacker cannot use this base for e.g. ROP gadgets or more generally PC control. But oftentimes, a Linux kernel attacker’s goal isn’t arbitrary code execution in kernel context anyway - arbitrary read-write is the more frequently desired primitive.

Impact on devices with kernel physical address randomization

Even on devices where the kernel location is randomized in the physical address space, linear mapping non-randomization still softens the kernel considerably to attempts at exploitation. This is particularly because techniques that involve spraying memory (either kernel structures or even userland mmaps!) can land at predictable physical addresses - and those physical addresses are easily referenceable in kernel virtual address space through the linear map. That potentially gives an attacker a methodology for placing kernel data structures or even simply attacker-controlled userland memory at a known kernel virtual address. I created a simple program that allocated (via mmap and page fault) a substantial quantity (~5 GB) of physical memory on a Samsung S23, then used /proc/pagemap to create a list of which physical page frame numbers (pfns) were allocated. I ran this program 100 times (rebooting in between each time), then counted how often each pfn appeared across the 100 execution cycles. The set of pfns and their counts for how often they appeared were then converted into an image where each pfn is represented by a single pixel. The brighter the green of a pixel, the more often that page was attacker controlled, with a white pixel representing a pfn that was allocated every time. A black pixel represents a pfn that was never allocated - often because those pfn numbers are not mapped to physical memory or because they are used every time in a deterministic way. A big thank you to Jann Horn for developing the tool to create this image from the data that I collected.

This data exemplifies the non-homogeneous reliability of pfn allocation to userland mappings, albeit on a device that was only just rebooted. There are ranges of pfns that are allocated quite reliably, and other ranges that are quite unreliable (but still occasionally used). For example, here’s a range of pfns surrounding one of the pages that was allocated 100 times in a row. I suspect this sample is representative of the practical reliability of this technique for placing desired data at a known kernel address, at least on a newly rebooted device.

While reliability may suffer on a device that hasn’t rebooted in some time, it remains high enough to be inviting to real-world attackers. Being able to place arbitrarily readable and writable data at a known kernel virtual address is a powerful exploitation primitive as an attacker can much more easily forge kernel data structures or objects and, for example, emplace pointers to those objects in heap sprays attacking UAF issues.

The Prognosis

I reported these two separate issues (the lack of linear map randomization, and the kernel landing at a static physical address on Pixel) to the Linux kernel team and the Google Pixel team respectively. However, both of these issues are considered intended behavior. While Pixel may introduce randomized physical kernel load addresses at some later point as a feature, there are no immediate plans to resolve the lack of randomization of the Linux kernel’s linear map on arm64.

Conclusion

Three years ago, I wrote on the state of x86 KASLR and noted how “it is probably time to accept that KASLR is no longer an effective mitigation against local attackers and to develop defensive code and mitigations that accept its limitations.” While it remains true that KASLR should not be trusted to prevent exploitation, particularly in local contexts, it is regrettable that the attitude around Linux KASLR is so fatalistic that putting in the engineering effort to preserve its remaining integrity is not considered to be worthwhile. The joint effect of these two issues dramatically simplified what might otherwise have been a more complicated and likely less reliable exploit. While side-channel attacks do impact the long-term viability of KASLR on all architectures, it is notable that Project Zero and the Google Threat Intelligence Group have yet to see a hardware side-channel attack for bypassing KASLR on Android in the wild. Additionally, KASLR still plays an important role in mitigating any remote kernel exploitation attempts. It is valuable from a defense-in-depth perspective to recognize the impact KASLR has on exploit complexity and reliability in real-world scenarios. In the future, we hope to see changes to the Linux kernel linear mapping and memory hot-plugging implementation to make this a less inviting target for attackers. Randomizing the location of the linear map in the virtual address space, increasing the entropy in physical page allocation, and randomizing the location of the kernel in the physical address space are all concrete steps that would improve the overall security posture of Android, the Linux kernel, and Pixel.

Pointer leaks through pointer-keyed data structures

26 September 2025 at 09:00

Introduction

Some time in 2024, during a Project Zero team discussion, we were talking about how remote ASLR leaks would be helpful or necessary for exploiting some types of memory corruption bugs, specifically in the context of Apple devices. Coming from the angle of “where would be a good first place to look for a remote ASLR leak”, this led to the discovery of a trick that could potentially be used to leak a pointer remotely, without any memory safety violations or timing attacks, in scenarios where an attack surface can be reached that deserializes attacker-provided data, re-serializes the resulting objects, and sends the re-serialized data back to the attacker.

The team brainstormed, and we couldn’t immediately come up with any specific attack surface on macOS/iOS that would behave this way, though we did not perform extensive analysis to test whether such attack surface exists. Instead of targeting a real attack surface, I tested the technique described here on macOS with an artificial test case that uses NSKeyedArchiver serialization as the target. Because of the lack of demonstrated real-world impact, I reported the issue to Apple without filing it in our bugtracker. It was fixed in the 31 Mar 2025 security releases. Links to Apple code in this post go to an outdated version of the code that hasn’t been updated in years, and descriptions of how the code works refer to the old unfixed version.

I decided to write about the technique since it is kind of intriguing and novel, and some of the ideas in it might generalize to other contexts. It is closely related to a partial pointer leak and another pointer ordering leak that I discovered in the past, and shows how pointer-keyed data structures can be used to leak addresses under ideal circumstances.

Background - the tech tree

hashDoS

To me, the story of this issue begins in 2011, when the hashDoS attack was presented at 28C3 (slides, recording). In essence, hashDoS is a denial-of-service attack on services (in particular web servers) that populate hash tables with lots of attacker-controlled keys (like POST parameters). It is based on the observation that many hash table implementations have O(1) complexity per insert/lookup operation in the average case, but O(n) complexity for the same operations in the worst case (where the hashes of all keys land in the same hash bucket, and the hash table essentially turns into something like a linked list or an unsorted array depending on how it is implemented). In particular if the hash function used for keys is known to the attacker, then by constructing a request full of parameters whose keys all map to the same hash bucket, an attacker can cause the server to spend O(n²) time processing such a request; this turned out to be enough to keep a web server’s CPU saturated using ridiculously small amounts of network traffic.

There is also much older prior work on the idea of deliberately creating hash table collisions to leak addresses, as pointed out in a 29C3 talk about the same topic. Solar Designer wrote in Phrack issue 53 back in 1998:

—-[ Data Structures and Algorithm Choice

When choosing a sorting or data lookup algorithm to be used for a normal application, people are usually optimizing the typical case. However, for IDS [intrusion detection systems] the worst case scenario should always be considered: an attacker can supply our IDS with whatever data she likes. If the IDS is fail-open, she would then be able to bypass it, and if it’s fail-close, she could cause a DoS for the entire protected system.

Let me illustrate this by an example. In scanlogd, I’m using a hash table to lookup source addresses. This works very well for the typical case as long as the hash table is large enough (since the number of addresses we keep is limited anyway). The average lookup time is better than that of a binary search. However, an attacker can choose her addresses (most likely spoofed) to cause hash collisions, effectively replacing the hash table lookup with a linear search. Depending on how many entries we keep, this might make scanlogd not be able to pick new packets up in time. This will also always take more CPU time from other processes in a host-based IDS like scanlogd.

[…]

It is probably worth mentioning that similar issues also apply to things like operating system kernels. For example, hash tables are widely used there for looking up active connections, listening ports, etc. There’re usually other limits which make these not really dangerous though, but more research might be needed.

hashDoS as a timing attack

From a slightly different perspective, the central observation of hashDoS is: If an attacker can insert a large number of chosen keys into a hash table (or hash set) and knows which hash buckets these keys hash to, then the attacker can (depending on hash table implementation details) essentially slow down future accesses to a chosen hash bucket.

This becomes interesting if the attacker can cause the insertion of other keys whose hashes are secret into the same hash table. In practice, this can for example happen with hash tables which support mixing multiple key types together, like JavaScript’s Map. Back in 2016, in the Firefox implementation, int32 numbers were hashed with a fixed hash function ScrambleHashCode(number), while strings were atomized/interned and then hashed based on their virtual address. That made it possible to first fill an attacker-chosen hash table bucket with lots of elements, then insert a string, observe whether its insertion is fast or slow, and determine from that whether the string’s hash matches the attacker-chosen hash bucket.

With some tricks relying on a pattern in the addresses of interned single-character strings in Firefox, that made it possible to leak the lower 32 bits of a heap address through Map insertions and timing measurements. For more details, see the original writeup and bug report. Of course, nowadays that kind of timing-based in-process partial pointer leak from JavaScript would be considered less interesting, since it is generally assumed that JavaScript can read all memory in the same process anyway.

A takeaway from this is: When pointers are used as the basis for object hash codes, this can leak pointers through side channels in keyed data structures.

Linux: object ordering leak through in-order listing of a pointer-keyed tree

As I noted in a blog post a few years ago, on Linux, it is possible for unprivileged userspace to discover in what order struct file instances are stored in kernel virtual memory by reading from /proc/self/fdinfo/<epoll fd> - this file lists all files that are watched by an epoll instance by iterating through a red-black tree that is (essentially) sorted by the virtual address of the referenced struct file, so the data given to userspace is sorted in the same way.

(As I noted in that post, this could be particularly interesting for breaking probabilistic memory safety mitigations that rely on pointer tagging. If the highest bits of pointers are secret tag bits, and an attacker can determine the order of the addresses (including tag bits) of objects, the attacker can infer whether an object’s tag changed after reallocation.)

A takeaway from this is: Keyed data structures don’t just leak information about object hash codes through timing; iterating over a keyed data structure can also generate data whose ordering reveals information about object hash codes.

Serialization attacks

There are various approaches to serializing an object graph. On one side of the spectrum is schema-based serialization, where ideally:

  • serializable types with their members are declared separately from other types
  • fields explicitly declare which other types they can point to (there are no generic pointers that can point to anything)
  • deserialization starts from a specific starting type

On the other side of the serialization spectrum are things like classic Java serialization (without serialization filters), where essentially any class marked as Serializable can be deserialized, serialized fields can often flexibly point to lots of different types, and therefore serialized data can also have a lot of control over the shape of the resulting object graph. There is a lot of public research on the topic of “serialization gadget chains” in Java, where objects can be combined such that deserializing them results in things like remote code execution. This type of serialization is generally considered to be unsafe for use across security boundaries, though Android exposes it across local security boundaries.

Somewhere in the middle of this spectrum is serialization that is fundamentally built like unsafe deserialization, but adds some coarse filters that only allow deserialized objects to have types from an allowlist to make it safe. In Java, that is called “serialization filtering”. This is also approximately the behavior of Apple’s NSKeyedUnarchiver.unarchivedObjectOfClasses, which this post focuses on.

An artificial test case

The goal of the technique described in this post is to leak a pointer to the “shared cache” (a large mapping which is at the same virtual address across all processes on the system, whose address only changes on reboot) through a single execution of the following test case, which uses NSKeyedUnarchiver.unarchivedObjectOfClasses to deserialize an attacker-supplied object graph consisting of the types NSDictionary, NSNumber, NSArray and NSNull, re-serializes the result, and writes back the resulting serialized data:

@import Foundation;
int main() {
  @autoreleasepool {
    NSArray *args = [[NSProcessInfo processInfo] arguments];
    if (args.count != 3) {
      NSLog(@"bad invocation");
      return 1;
    }
    NSString *in_path = args[1];
    NSString *out_path = args[2];

    NSError *error = NULL;

    NSData *input_binary = [NSData dataWithContentsOfFile:in_path];

    /* decode */
    NSArray<Class> *allowed_classes = @[ [NSDictionary class], [NSNumber class], [NSArray class], [NSString class], [NSNull class] ];
    NSObject *decoded_data = [NSKeyedUnarchiver unarchivedObjectOfClasses:[NSSet setWithArray:allowed_classes] fromData:input_binary error:&error];
    if (error) {
      NSLog(@"Error %@ decoding", error);
      return 1;
    }
    NSLog(@"decoded");

    NSData *encoded_binary = [NSKeyedArchiver archivedDataWithRootObject:decoded_data requiringSecureCoding:true error:&error];
    if (error) {
      NSLog(@"Error %@ encoding", error);
      return 1;
    }
    NSLog(@"reencoded");

    [encoded_binary writeToFile:out_path atomically:NO];
  }
  return 0;
}

(The test case also allows NSString but I think that was irrelevant.)

Building blocks

The NSNull / CFNull singleton

The CFNull type is special: There is only one singleton instance of it, kCFNull, implemented in CFBase.c, which is stored in the shared cache. When you deserialize an NSNull object, this doesn’t actually create a new object - instead, the singleton is used.

In the CFRuntimeClass for CFNull, __CFNullClass, no hash handler is provided. When CFHash is called on an object with a type like __CFNullClass that does not implement a ->hash handler, the address of the object is used as the hash code.

Pointer-based hashing is not specific to NSNull; but there probably aren’t many other types for which deserialization uses singletons in the shared cache. There are probably way more types for which instances’ hashes are heap addresses.

NSNumber

The NSNumber type encapsulates a number and supports several types of numbers; its hash handler __CFNumberHash hashes 32-bit integers with _CFHashInt, which pretty much just performs a multiplication with some big prime number.

NSDictionary

Instances of the NSDictionary type are immutable hash tables and can contain arbitrarily-typed keys. Key hashes are mapped to hash table buckets using a simple modulo operation: hash_code % num_buckets. The number of hash buckets in a NSDictionary is always a prime number (see __CFBasicHashTableSizes); hash table sizes are chosen based on __CFBasicHashTableCapacities such that hash tables are normally roughly half-full (around 38%-62%), though the sizing is a bit different for small sizes. These are probing-style hash tables; so rather than having a linked list off each hash bucket, collisions are handled by finding alternate buckets to store colliding elements in using the policy __kCFBasicHashLinearHashingValue / FIND_BUCKET_HASH_STYLE == 1, under which insertion scans forward through the hash table buckets.
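A minimal sketch of that probing behavior (my own simplified model, not Apple’s CFBasicHash code): the starting bucket is hash_code % num_buckets, and on a collision the insert scans forward, wrapping around, until it finds a free bucket.

```python
# Simplified model of a linear-probing hash table in the spirit of
# CFBasicHash with __kCFBasicHashLinearHashingValue.
def insert(buckets, hash_code, value):
    n = len(buckets)
    idx = hash_code % n
    while buckets[idx] is not None:
        idx = (idx + 1) % n  # collision: scan forward, wrapping
    buckets[idx] = value
    return idx

buckets = [None] * 7
assert insert(buckets, 9, "a") == 2  # 9 % 7 == 2, bucket is free
assert insert(buckets, 2, "b") == 3  # bucket 2 now taken, scans to 3
```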

I haven’t found source code for serialization of NSDictionary, but it appears to happen in the obvious way, by iterating through the hash buckets in order.

The attack

The basic idea: Infoleak through key ordering in serialized NSDictionary

If a targeted process fills an NSDictionary with attacker-chosen NSNumber keys (through deserialization), the attacker can control which hash buckets will be used by using numbers for which the number’s hash modulo the hash table size results in the desired bucket index. If the targeted process then inserts an NSNull key (still as part of the same deserialization), and then serializes the resulting NSDictionary, the location of the NSNull key in the dictionary’s serialized keys will reveal information about the hash of NSNull.

In particular, the attacker can create a pattern like this using NSNumber keys (where # is a bucket occupied by an NSNumber, and _ is a bucket left empty), where even-numbered buckets are occupied and odd-numbered buckets are empty, here with the example of a hash table of size 7:

bucket index:    0123456
bucket contents: #_#_#_#

This leaves three spots where the NSNull could be inserted (marked with !):

  • At index 1 (#!#_#_#). This happens if hash_code % num_buckets is 6, 0, or 1. (For 6 and 0, insertion would scan linearly through the buckets until finding the free bucket at index 1.) This would result in NSNull being second in the serialized data.
  • At index 3 (#_#!#_#). This happens if hash_code % num_buckets is 2 or 3. This would result in NSNull being third in the serialized data.
  • At index 5 (#_#_#!#). This happens if hash_code % num_buckets is 4 or 5. This would result in NSNull being fourth in the serialized data.

If the serialized data is then sent back to the attacker, the attacker can distinguish between these three states (based on the index of the NSNull key in the serialized data), and learn in which range hash_code % num_buckets is.

Extending it: Leaking the entire bucket index

If the attack from the last section is repeated with the following pattern (occupying odd-numbered buckets and leaving even-numbered ones empty), this yields more information about hash_code % num_buckets:

0123456
_#_#_#_

(Caveat: Don’t think too hard about how a hash table with 3 elements would use only 3 buckets and therefore wouldn’t look like this. The actual reproducer uses hash tables with >=23 buckets.)

Now we have four spots where the NSNull could be inserted:

  • At index 0, if hash_code % num_buckets is 0.
  • At index 2, if hash_code % num_buckets is 1 or 2.
  • At index 4, if hash_code % num_buckets is 3 or 4.
  • At index 6, if hash_code % num_buckets is 5 or 6.

By combining the information from an NSDictionary that uses the even-buckets-occupied pattern and an NSDictionary that uses the odd-buckets-occupied pattern, the exact value of hash_code % num_buckets can be determined; for example, if the first pattern results in #_#!#_# and the second pattern results in _#!#_#_, then hash_code % num_buckets is 2.
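Putting the two patterns together end-to-end can be sketched as follows (again my own simplification of the probing behavior; the attacker replays both known patterns offline to find the residue consistent with the observed landing spots):

```python
# Toy model: fill even buckets (pattern 1), then odd buckets (pattern 2),
# insert a key with a secret hash into each, observe where it lands, and
# recover secret_hash % num_buckets. Linear probing as described above.
def insert(buckets, hash_code, value):
    idx = hash_code % len(buckets)
    while buckets[idx] is not None:
        idx = (idx + 1) % len(buckets)
    buckets[idx] = value
    return idx

def landing_slot(residue, occupied, num_buckets):
    buckets = [None] * num_buckets
    for i in occupied:
        buckets[i] = "#"  # attacker-chosen NSNumber keys
    return insert(buckets, residue, "NSNull")

def leak_residue(secret_hash, num_buckets):
    evens = range(0, num_buckets, 2)
    odds = range(1, num_buckets, 2)
    # What the attacker observes in the two re-serialized dictionaries:
    observed = (landing_slot(secret_hash, evens, num_buckets),
                landing_slot(secret_hash, odds, num_buckets))
    # Offline: try every residue until both patterns match.
    for r in range(num_buckets):
        if (landing_slot(r, evens, num_buckets),
            landing_slot(r, odds, num_buckets)) == observed:
            return r

assert leak_residue(0x9E3779B9, 23) == 0x9E3779B9 % 23
```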

So by sending a serialized NSArray containing two NSDictionary instances with these patterns of NSNumber and NSNull keys to some targeted process, and then receiving a re-serialized copy from the victim, an attacker can determine hash_code % num_buckets for the NSNull singleton.

Some math: Leaking the entire hash_code

To leak even more information about the hash_code, this can be repeated with different hash table sizes. The attack from the last section leaks hash_code % num_buckets, where num_buckets is a prime number that the attacker can pick from the possible sizes __CFBasicHashTableSizes based on how many elements are in each NSDictionary.

A useful math trick here is: Based on the values resulting from calculating hash_code modulo a bunch of different prime numbers, hash_code modulo the product of all those prime numbers can be calculated using the extended Euclidean algorithm. Therefore, based on knowing hash_code % num_buckets for the hash table sizes 23, 41, 71, 127, 191, 251, 383, 631 and 1087, it is possible to determine hash_code modulo 23*41*71*127*191*251*383*631*1087 = 0x5'ce23'017b'3bd5'1495. Because 0x5'ce23'017b'3bd5'1495 is bigger than the biggest value hash_code can have (since hash_code is 64-bit), that will be the actual value of hash_code, which is the address of the NSNull singleton.

Putting it together

So to leak the address of the NSNull singleton in the shared cache, an attacker has to send serialized data consisting of one large container (such as an NSArray) that, for each prime number of interest, contains two NSDictionary instances with the even-indices and odd-indices patterns. (The NSNull keys should come last in the attacker-provided serialized NSDictionary instances, so my reproducer constructs the serialized data manually as an XML plist, and I then convert it to a binary plist with plutil.)

This attacker-provided serialized data is about 50 KiB in size.

The targeted process then has to deserialize this data, serialize it again, and send it back to the attacker.

Afterwards, the attacker can determine in which buckets NSNull was stored in each NSDictionary, use the bucket indices from pairs of NSDictionary to determine hash_code % num_buckets for each hash table size, and then use the extended Euclidean algorithm to obtain hash_code, the address of the NSNull singleton.

The reproducer

I wrote a reproducer for this issue, consisting of my own victim program that runs on the target machine and attacker programs that provide serialized data to the target machine and receive re-serialized data from the target. (For easy reproduction, you can test this on a single machine, that’s also what I did; though I rebooted between “attacker” and “target” to make sure the attacker isn’t using the same shared cache address as the target.)

First, on the attacker machine, generate serialized data:

% clang -o attacker-input-generator attacker-input-generator.c
% ./attacker-input-generator > attacker-input.plist
% plutil -convert binary1 attacker-input.plist

Then, on the target machine, deserialize and re-serialize this data:

% clang round-trip-victim.m -fobjc-arc -fmodules -o round-trip-victim
% ./round-trip-victim attacker-input.plist reencoded.plist
2024-11-25 22:29:44.043 round-trip-victim[1257:11287] decoded
2024-11-25 22:29:44.049 round-trip-victim[1257:11287] reencoded

For validation, you can also use this helper on the target machine to see the real address of the CFNull singleton:

% clang debug-nsnull-hash.m -fobjc-arc -fmodules -o debug-nsnull-hash
% ./debug-nsnull-hash
null singleton pointer = 0x1eb91ab60, null_hash = 0x00000001eb91ab60

Then, on the attacker machine, process the re-serialized data:

% plutil -convert xml1 reencoded.plist
% clang -o extract-pointer extract-pointer.c
% ./extract-pointer < reencoded.plist
serialized data with 1111 objects
NSNull class is 12, NSNull object is 11
NSNull is elem 8 out of 13
NSNull is elem 7 out of 12
NSNull is elem 7 out of 22
NSNull is elem 7 out of 21
NSNull is elem 6 out of 37
NSNull is elem 5 out of 36
NSNull is elem 61 out of 65
NSNull is elem 60 out of 64
NSNull is elem 32 out of 97
NSNull is elem 31 out of 96
NSNull is elem 95 out of 127
NSNull is elem 95 out of 126
NSNull is elem 175 out of 193
NSNull is elem 175 out of 192
NSNull is elem 188 out of 317
NSNull is elem 188 out of 316
NSNull is elem 214 out of 545
NSNull is elem 214 out of 544

NSNull mod 23 = 14
NSNull mod 41 = 13
NSNull mod 71 = 10
NSNull mod 127 = 120
NSNull mod 191 = 62
NSNull mod 251 = 189
NSNull mod 383 = 349
NSNull mod 631 = 375
NSNull mod 1087 = 427

NSNull mod 0x000000000000000000000000000003af =
0x0000000000000000000000000000017e
NSNull mod 0x00000000000000000000000000010589 =
0x000000000000000000000000000059e6
NSNull mod 0x0000000000000000000000000081bef7 =
0xfffffffffffffffffffffffffff4177a
NSNull mod 0x00000000000000000000000060cd7a49 =
0x000000000000000000000000078e47f3
NSNull mod 0x00000000000000000000005ee976e593 =
0x000000000000000000000001eb91ab60
NSNull mod 0x000000000000000000008dff48e176ed =
0x000000000000000000000001eb91ab60
NSNull mod 0x0000000000000000015e003ca3bc222b =
0x000000000000000000000001eb91ab60
NSNull mod 0x0000000000000005ce23017b3bd51495 =
0x000000000000000000000001eb91ab60

NSNull = 0x1eb91ab60

Conclusion

This is a fairly theoretical attack; but I think it demonstrates that using pointers as object hashes for keyed data structures can lead to pointer leaks if everything lines up right, even without using timing attacks.

My example relies on the victim re-serializing the data; but a timing attack version of this might be possible too, with significantly more requests and sufficiently precise measurements.

In my testcase, NSDictionary made it possible to essentially leak information about the ordering of pointers and hashes of numbers by mixing keys of different types; but it is probably possible to leak some amount of information even from data structures that only use pointer keys without mixing key types, especially when the attacker can guess how far apart heap objects are allocated or such and/or can reference the same objects repeatedly across multiple containers.

The most robust mitigation against this is to avoid using object addresses as lookup keys, or alternatively hash them with a keyed hash function (which should reduce the potential address leak to a pointer equality oracle). However, that could come with negative performance effects - in particular, using an ID stored inside an object instead of the object’s address could add a memory load to the critical path of lookups.

From Chrome renderer code exec to kernel with MSG_OOB

8 August 2025 at 09:00

Introduction

In early June, I was reviewing a new Linux kernel feature when I learned about the MSG_OOB feature supported by stream-oriented UNIX domain sockets. I reviewed the implementation of MSG_OOB, and discovered a security bug (CVE-2025-38236) affecting Linux >=6.9. I reported the bug to Linux, and it got fixed. Interestingly, while the MSG_OOB feature is not used by Chrome, it was exposed in the Chrome renderer sandbox. (Since then, sending MSG_OOB messages has been blocked in Chrome renderers in response to this issue.)

The bug is pretty easy to trigger; the following sequence results in UAF:

char dummy;
int socks[2];
socketpair(AF_UNIX, SOCK_STREAM, 0, socks);
send(socks[1], "A", 1, MSG_OOB);
recv(socks[0], &dummy, 1, MSG_OOB);
send(socks[1], "A", 1, MSG_OOB);
recv(socks[0], &dummy, 1, MSG_OOB);
send(socks[1], "A", 1, MSG_OOB);
recv(socks[0], &dummy, 1, 0);
recv(socks[0], &dummy, 1, MSG_OOB);

I was curious to explore how hard it is to actually exploit such a bug from inside the Chrome Linux Desktop renderer sandbox on an x86-64 Debian Trixie system, escalating privileges directly from native code execution in the renderer to the kernel. Even if the bug is reachable, how hard is it to find useful primitives for heap object reallocation, delay injection, and so on?

The exploit code is posted on our bugtracker; you may want to reference it while following along with this post.

Backstory: The feature

Support for using MSG_OOB with AF_UNIX stream sockets was added in 2021 with commit 314001f0bf92 (“af_unix: Add OOB support”, landed in Linux 5.15). With this feature, it is possible to send a single byte of “out-of-band” data that the recipient can read ahead of the rest of the data. The feature is very limited - out-of-band data is always a single byte, and there can only be a single pending byte of out-of-band data at a time. (Sending two out-of-band messages one after another causes the first one to be turned into a normal in-band message.) This feature is used almost nowhere except in Oracle products, as discussed on an email thread from 2024 where removal of the feature was proposed; yet it is enabled by default when AF_UNIX socket support is enabled in the kernel config, and it wasn’t even possible to disable MSG_OOB support until commit 5155cbcdbf03 (“af_unix: Add a prompt to CONFIG_AF_UNIX_OOB”) landed in December 2024.

Because the Chrome renderer sandbox allows stream-oriented UNIX domain sockets and didn’t filter the flags arguments of send()/recv() functions, this esoteric feature was usable inside the sandbox.

When a message (represented by a socket buffer / struct sk_buff, short SKB) is sent between two connected stream-oriented sockets, the message is added to the ->sk_receive_queue of the receiving socket, which is a linked list. An SKB has a length field ->len describing the length of data contained within it (counting both data in the SKB’s “head buffer” as well as data indirectly referenced by the SKB in other ways). An SKB also contains some scratch space that can be used by the subsystem currently owning the SKB (char cb[48] in struct sk_buff); UNIX domain sockets access this scratch space with the helper #define UNIXCB(skb) (*(struct unix_skb_parms *)&((skb)->cb)), and one of the things they store in there is a field u32 consumed which stores the number of bytes of the SKB that have already been read from the socket. UNIX domain sockets count the remaining length of an SKB with the helper unix_skb_len(), which returns skb->len - UNIXCB(skb).consumed.

MSG_OOB messages (sent with something like send(sockfd, &message_byte, 1, MSG_OOB), which goes through queue_oob() in the kernel) are also added to the ->sk_receive_queue just like normal messages; but to allow the receiving socket to access the latest out-of-band message ahead of the rest of the queue, the ->oob_skb pointer of the receiving socket is updated to point to this message. When the receiving socket receives an OOB message with something like recv(sockfd, &received_byte, 1, MSG_OOB) (implemented in unix_stream_recv_urg()), the corresponding socket buffer stays on the ->sk_receive_queue, but its consumed field is incremented, causing its remaining length (unix_skb_len()) to become 0, and the ->oob_skb pointer is cleared; the normal receive path will have to deal with this when encountering the remaining-length-0 SKB.

This means that the normal recv() path (unix_stream_read_generic()), which runs when recv() is called without MSG_OOB, must be able to deal with remaining-length-0 SKBs and must take care to clear the ->oob_skb pointer when it deletes an OOB SKB. manage_oob() is supposed to take care of this. Essentially, when the normal receive path obtains an SKB from the ->sk_receive_queue, it calls manage_oob() to take care of all the fixing-up required to deal with the OOB mechanism; manage_oob() will then return the first SKB that contains at least 1 byte of remaining data, and manage_oob() ensures that this SKB is no longer referenced as ->oob_skb. unix_stream_read_generic() can then proceed as if the OOB mechanism didn’t exist.

Backstory: The bug, and what led to it

In mid-2024, a userspace API inconsistency was discovered, where recv() could spuriously return 0 (which normally signals end-of-file) when trying to read from a socket with a receive queue that contains a remaining-length-0 SKB left behind by receiving an OOB SKB. The fix for this issue introduced two closely related security issues that can lead to UAF; it was marked as fixing a bug introduced by the original MSG_OOB implementation, but luckily was actually only backported to Linux 6.9.8, so the buggy fix did not land in older LTS kernel branches.

After the buggy fix, manage_oob() looked as follows:

static struct sk_buff *manage_oob(struct sk_buff *skb, struct sock *sk,
                                  int flags, int copied)
{
        struct unix_sock *u = unix_sk(sk);

        if (!unix_skb_len(skb)) {
                struct sk_buff *unlinked_skb = NULL;

                spin_lock(&sk->sk_receive_queue.lock);

                if (copied) {
                        skb = NULL;
                } else if (flags & MSG_PEEK) {
                        skb = skb_peek_next(skb, &sk->sk_receive_queue);
                } else {
                        unlinked_skb = skb;
                        skb = skb_peek_next(skb, &sk->sk_receive_queue);
                        __skb_unlink(unlinked_skb, &sk->sk_receive_queue);
                }

                spin_unlock(&sk->sk_receive_queue.lock);

                consume_skb(unlinked_skb);
        } else {
                struct sk_buff *unlinked_skb = NULL;

                spin_lock(&sk->sk_receive_queue.lock);

                if (skb == u->oob_skb) {
                        if (copied) {
                                skb = NULL;
                        } else if (!(flags & MSG_PEEK)) {
                                if (sock_flag(sk, SOCK_URGINLINE)) {
                                        WRITE_ONCE(u->oob_skb, NULL);
                                        consume_skb(skb);
                                } else {
                                        __skb_unlink(skb, &sk->sk_receive_queue);
                                        WRITE_ONCE(u->oob_skb, NULL);
                                        unlinked_skb = skb;
                                        skb = skb_peek(&sk->sk_receive_queue);
                                }
                        } else if (!sock_flag(sk, SOCK_URGINLINE)) {
                                skb = skb_peek_next(skb, &sk->sk_receive_queue);
                        }
                }

                spin_unlock(&sk->sk_receive_queue.lock);

                if (unlinked_skb) {
                        WARN_ON_ONCE(skb_unref(unlinked_skb));
                        kfree_skb(unlinked_skb);
                }
        }
        return skb;
}

After this change, syzbot (the public syzkaller instance operated by Google) reported that a use-after-free occurs in the following scenario, as described by the fix commit for the syzbot-reported issue:

  1. send(MSG_OOB)
  2. recv(MSG_OOB)
     -> The consumed OOB remains in recv queue
  3. send(MSG_OOB)
  4. recv()
     -> manage_oob() returns the next skb of the consumed OOB
     -> This is also OOB, but unix_sk(sk)->oob_skb is not cleared
  5. recv(MSG_OOB)
     -> unix_sk(sk)->oob_skb is used but already freed

In other words, the issue is that when the receive queue looks like this (shown with the oldest message at the top):

  • SKB 1: unix_skb_len()=0
  • SKB 2: unix_skb_len()=1 <--OOB pointer

and a normal recv() happens, then manage_oob() takes the !unix_skb_len(skb) branch, which deletes the SKB with remaining length 0 and skips forward to the following SKB; but it then doesn’t go through the skb == u->oob_skb check as it otherwise would, which means it doesn’t clear out the ->oob_skb pointer before the SKB is consumed by the normal receive path, creating a dangling pointer that will lead to UAF on a subsequent recv(... MSG_OOB).

This issue was fixed, making the checks for remaining-length-0 SKBs and ->oob_skb in manage_oob() independent:

static struct sk_buff *manage_oob(struct sk_buff *skb, struct sock *sk,
                                  int flags, int copied)
{
        struct sk_buff *read_skb = NULL, *unread_skb = NULL;
        struct unix_sock *u = unix_sk(sk);

        if (likely(unix_skb_len(skb) && skb != READ_ONCE(u->oob_skb)))
                return skb;

        spin_lock(&sk->sk_receive_queue.lock);

        if (!unix_skb_len(skb)) {
                if (copied && (!u->oob_skb || skb == u->oob_skb)) {
                        skb = NULL;
                } else if (flags & MSG_PEEK) {
                        skb = skb_peek_next(skb, &sk->sk_receive_queue);
                } else {
                        read_skb = skb;
                        skb = skb_peek_next(skb, &sk->sk_receive_queue);
                        __skb_unlink(read_skb, &sk->sk_receive_queue);
                }

                if (!skb)
                        goto unlock;
        }

        if (skb != u->oob_skb)
                goto unlock;

        if (copied) {
                skb = NULL;
        } else if (!(flags & MSG_PEEK)) {
                WRITE_ONCE(u->oob_skb, NULL);

                if (!sock_flag(sk, SOCK_URGINLINE)) {
                        __skb_unlink(skb, &sk->sk_receive_queue);
                        unread_skb = skb;
                        skb = skb_peek(&sk->sk_receive_queue);
                }
        } else if (!sock_flag(sk, SOCK_URGINLINE)) {
                skb = skb_peek_next(skb, &sk->sk_receive_queue);
        }

unlock:
        spin_unlock(&sk->sk_receive_queue.lock);

        consume_skb(read_skb);
        kfree_skb(unread_skb);

        return skb;
}

But a remaining issue is that when this function discovers a remaining-length-0 SKB left behind by recv(..., MSG_OOB), it skips ahead to the next SKB and assumes that it is not also a remaining-length-0 SKB. If this assumption is broken, manage_oob() can return a pointer to the second remaining-length-0 SKB, which is bad because the caller unix_stream_read_generic() does not expect to see remaining-length-0 SKBs:

static int unix_stream_read_generic(struct unix_stream_read_state *state,
                                    bool freezable)
{
[...]
        int flags = state->flags;
[...]
        int skip;
[...]
        skip = max(sk_peek_offset(sk, flags), 0); // 0 if MSG_PEEK isn't set

        do {
                struct sk_buff *skb, *last;
[...]
                last = skb = skb_peek(&sk->sk_receive_queue);
                last_len = last ? last->len : 0;

again:
#if IS_ENABLED(CONFIG_AF_UNIX_OOB)
                if (skb) {
                        skb = manage_oob(skb, sk, flags, copied);
                        if (!skb && copied) {
                                unix_state_unlock(sk);
                                break;
                        }
                }
#endif
                if (skb == NULL) {
[...]
                }

                while (skip >= unix_skb_len(skb)) {
                        skip -= unix_skb_len(skb);
                        last = skb;
                        last_len = skb->len;
                        skb = skb_peek_next(skb, &sk->sk_receive_queue);
                        if (!skb)
                                goto again;
                }
[...]
                /* Mark read part of skb as used */
                if (!(flags & MSG_PEEK)) {
                        UNIXCB(skb).consumed += chunk;
[...]
                        if (unix_skb_len(skb))
                                break;

                        skb_unlink(skb, &sk->sk_receive_queue);
                        consume_skb(skb); // frees the SKB

                        if (scm.fp)
                                break;
                } else {

If MSG_PEEK is not set (which is the only case in which SKBs can actually be freed), skip is always 0, and the while (skip >= unix_skb_len(skb)) loop condition should always be false; but when a remaining-length-0 SKB unexpectedly gets here, the condition turns into 0 >= 0, and the loop skips ahead to the first SKB that does not have remaining length 0. That SKB could be the ->oob_skb; in which case this again bypasses the logic in manage_oob() that is supposed to set ->oob_skb to NULL before the current ->oob_skb can be freed.

So the remaining bug can be triggered by first doing the following twice, creating two remaining-length-0 SKBs in the ->sk_receive_queue:

send(socks[1], "A", 1, MSG_OOB);
recv(socks[0], &dummy, 1, MSG_OOB);

If another OOB SKB is then sent with send(socks[1], "A", 1, MSG_OOB), the ->sk_receive_queue will look like this:

  • SKB 1: unix_skb_len()=0
  • SKB 2: unix_skb_len()=0
  • SKB 3: unix_skb_len()=1 <--OOB pointer

Now, recv(socks[0], &dummy, 1, 0) will trigger the bug and free SKB 3 while leaving ->oob_skb pointing to it; making it possible for subsequent recv() syscalls with MSG_OOB to use the dangling pointer.

The initial primitive

This bug yields a dangling ->oob_skb pointer. Pretty much the only way to use that dangling pointer is the recv() syscall with MSG_OOB, either with or without MSG_PEEK, which is implemented in unix_stream_recv_urg(). (There are other codepaths that touch it, but they’re mostly just pointer comparisons, with the exception of the unix_ioctl() handler for SIOCATMARK, which is blocked in Chrome’s seccomp sandbox.)

unix_stream_recv_urg() does this:

static int unix_stream_recv_urg(struct unix_stream_read_state *state)
{
        struct socket *sock = state->socket;
        struct sock *sk = sock->sk;
        struct unix_sock *u = unix_sk(sk);
        int chunk = 1;
        struct sk_buff *oob_skb;

        mutex_lock(&u->iolock);
        unix_state_lock(sk);
        spin_lock(&sk->sk_receive_queue.lock);

        if (sock_flag(sk, SOCK_URGINLINE) || !u->oob_skb) {
[...]
        }

        // read dangling pointer
        oob_skb = u->oob_skb;

        if (!(state->flags & MSG_PEEK))
                WRITE_ONCE(u->oob_skb, NULL);

        spin_unlock(&sk->sk_receive_queue.lock);
        unix_state_unlock(sk);

        // read primitive
        // ->recv_actor() is unix_stream_read_actor()
        chunk = state->recv_actor(oob_skb, 0, chunk, state);

        if (!(state->flags & MSG_PEEK))
                UNIXCB(oob_skb).consumed += 1; // write primitive

        mutex_unlock(&u->iolock);

        if (chunk < 0)
                return -EFAULT;

        state->msg->msg_flags |= MSG_OOB;
        return 1;
}

At a high level, the call to state->recv_actor() (which goes down the call path unix_stream_read_actor -> skb_copy_datagram_msg -> skb_copy_datagram_iter -> __skb_datagram_iter(cb=simple_copy_to_iter)) gives a read primitive: it is trying to copy one byte of data referenced by the oob_skb to userspace, so by replacing the memory pointed to by oob_skb with controlled, repeatedly writable data, it is possible to repeatedly cause copy_to_user(<userspace pointer>, <kernel pointer>, 1) with arbitrary kernel pointers. As long as MSG_PEEK is set, this can be repeated; only when MSG_PEEK is clear is the ->oob_skb pointer cleared.

The only write primitive this bug yields is the increment UNIXCB(oob_skb).consumed += 1 that happens when MSG_PEEK is not set. In the build I’m looking at, the consumed field that is incremented is located 0x44 bytes into the oob_skb, an object which is effectively allocated with an alignment of 0x100 bytes. This means that, if the write primitive is applied to a 64-bit length value or a pointer, it would have to do an increment at offset 4 relative to the 8-byte aligned overwrite target, and it would effectively increment the 64-bit pointer/length by 4 GiB.

My exploit for this issue

Discarded strategy for using the write primitive: Pointer increment

It would be possible to free the sk_buff and reallocate it as some structure containing a pointer at offset 0x40. The write primitive would effectively increment this pointer by 4 GiB (because it would increment by 1 at an offset 4 bytes into the pointer). But this would fundamentally rely on the machine having significantly more than 4 GiB of RAM, which feels gross and a bit like cheating.

Overall strategy

Since this issue relatively straightforwardly leads to a semi-arbitrary read (subject to usercopy hardening restrictions), but the write primitive is much more gnarly, I decided to go with the general approach of: first get the read primitive working; then use the read primitive to assist in exploiting the write primitive. This way, ideally everything after the read primitive bootstrapping can be made reliable with enough work.

Dealing with per-cpu state

Lots of things in this exploit rely on per-cpu kernel data structures and will fail if a task is migrated between CPUs at the wrong time. In some places in the exploit, I repeatedly check which CPU the exploit is running on with sched_getcpu(), and retry if the CPU number changed; though I was too lazy to do that everywhere perfectly, and this could be done even better by relying more directly on the “restartable sequences” subsystem.

Note that the Chrome sandbox policy forbids __NR_getcpu; but that has no effect at all on sched_getcpu(), in particular on x86-64, because there are two faster alternatives to the getcpu() syscall that glibc prefers to use instead:

  • The kernel’s rseq subsystem maintains a struct rseq in userspace for each thread, which contains the cpu_id that the thread is currently running on; if rseq is available, glibc will read from the rseq struct.
  • On x86-64, the vDSO contains a pure-userspace implementation of the getcpu() syscall which relies on either the RDPID instruction or, if that is not available, the LSL instruction to determine the ID of the current CPU without having to perform a syscall. (This is implemented in vdso_read_cpunode() in the kernel sources, which is compiled into the vDSO that is mapped into userspace.)

Setting up the read primitive - mostly boring spraying

On the targeted Debian kernel, struct sk_buff is in the skbuff_head_cache SLUB cache, which normally uses order-1 unmovable pages. I had trouble finding a good reallocation primitive that also uses order-1 pages (though maple_node might have been an option); so I went for reallocation as a pipe page (order-0 unmovable), though that means that the reallocation will go through the buddy allocator and requires the order-0 unmovable list to become empty so that an order-1 page is split up.

This is not very novel, so I will only describe a few interesting aspects of the strategy here - if you want a better understanding of how to free a SLUB page and reallocate it as something else, there are plenty of existing writeups, including one I wrote a while ago (section “Attack stage: Freeing the object’s page to the page allocator”), though that one does not discuss the buddy allocator.

To make it more likely for a reallocation of an order-1 page as an order-0 page to succeed, the exploit starts by allocating a large number of order-0 unmovable pages to drain the order-0 and order-1 unmovable freelists. Most ways of allocating large amounts of kernel memory are limited in the sandbox; in particular, the default file descriptor table size soft limit (RLIMIT_NOFILE) is 4096 on Debian (Chrome leaves this limit as-is), and I can neither use setrlimit() to bump that number up (due to seccomp) nor create subprocesses with separate file descriptor tables. (A real exploit might be able to work around this by exploiting several renderer processes, though that seems like a pain.) The one primitive I have for allocating large amounts of unmovable pages is page tables: by creating a gigantic anonymous VMA (read-only to avoid running into Chrome’s RLIMIT_DATA restrictions) and then triggering read faults all over this VMA, an unlimited number of page tables can be allocated. I use this to spam around 10% of total RAM with page tables. (To figure out how much RAM the machine has, I’m testing whether mmap() works with different sizes, relying on the OVERCOMMIT_GUESS behavior of __vm_enough_memory(); though that doesn’t actually work precisely in the sandbox due to the RLIMIT_DATA limit. A cleaner and less noisy way might be to actually fill up RAM and use mincore() to figure out how large the working set can get before pages get swapped out or discarded.)

Afterwards, I create 41 UNIX domain sockets and use them to spam 256 SKB allocations each; since each SKB uses 0x100 bytes, this allocates a bit over 2.5 MiB of kernel memory. That is enough to later flush a slab page out of both SLUB’s per-cpu partial list as well as the page allocator’s per-cpu freelist, all the way into the buddy allocator.

Then I set up a SLUB page containing a dangling pointer, try to flush this page all the way into the buddy allocator, and reallocate it as a pipe page by using 256 pipes to each allocate 2 pages (which is the minimum size that a pipe always has, see PIPE_MIN_DEF_BUFFERS). This allocates 256 * 2 * 4 KiB = 2 MiB worth of order-0 pages.

At this point, I have probably reallocated the SKB as a pipe page; but I don’t know in which pipe the SKB is located, or at which offset. To figure that out, I store fake SKBs in the pipe pages that point to different data; then, by triggering the bug with recv(..., MSG_OOB|MSG_PEEK), I can read one byte at the pointed-to location and narrow down where in which pipe the SKB is. I don’t know the addresses of any kernel objects yet; but the X86-64 implementation of copy_to_user() is symmetric and also works if you pass a userspace pointer as the source, so I can simply use userspace data pointers in the crafted SKBs for now. (SMAP is not an issue here - SMAP is disabled for all memory accesses in copy_to_user(). On x86-64, copy_to_user() is actually implemented as a wrapper around copy_user_generic(), which is a helper that accepts both kernel and userspace addresses as source and destination.)

Afterwards, I have the ability to call copy_to_user(..., 1) on arbitrary kernel pointers through recv(..., MSG_OOB|MSG_PEEK) using the controlled SKB.

Properties of the read primitive

One really cool aspect of a copy_to_user()-based read primitive on x86-64 is that it doesn’t crash even when called on invalid kernel pointers - if the kernel memory access fails, the recv() syscall will simply return an error (-EFAULT).

The main limitation is that usercopy hardening (__check_object_size()) will catch attempts to read from some specific memory ranges:

  • Ranges that wrap around - not an issue here, only ranges of length 1 can be used anyway.
  • Addresses <=16 - not an issue here.
  • The kernel stack of the current process, if some other criteria are met. Not an issue here - even if I want to read from a kernel stack, I’ll probably want to read the kernel stack of another thread, which isn’t protected.
  • The kernel .text section - all of .data and such is accessible, just .text is restricted. When targeting a specific kernel build, that’s not really relevant.
  • kmap() mappings - those don’t exist on x86-64.
  • Freed vmalloc allocations, or ranges that straddle the bounds of a vmalloc allocation. Not an issue here.
  • Ranges in the direct mapping, or in the kernel image address range, that straddle the bounds of a high-order folio. Not an issue here, only ranges of length 1 can be used anyway.
  • Ranges in the direct mapping, or in the kernel image address range, that are used as SLUB pages in non-kmalloc slab caches, at offsets not allowed by usercopy allowlisting (see __check_heap_object()). This is the most annoying part.

(There might be other ways of using this bug to read memory with different constraints, like by using the frag_iter->len read in __skb_datagram_iter() to influence an offset from which known data is subsequently read, but that seems like a pain to work with.)

Locating the kernel image

To break KASLR of the kernel image at this point, there are lots of options, partially thanks to copy_to_user() not crashing on access to invalid addresses; but one nice option is to read an Interrupt Descriptor Table (IDT) entry through the read-only IDT mapping at the fixed address 0xfffffe0000000000 (CPU_ENTRY_AREA_RO_IDT_VADDR), which yields the address of a kernel interrupt handler.

Using the read primitive to observe allocator state and other things

From here on, my goal is to use the read primitive to assist in exploiting the write primitive; I would like to be able to answer questions like:

  • What is the mapping between struct page */struct ptdesc */struct slab * and the corresponding region in the direct mapping? (This is easy and just requires reading some global variables out of the .data/.bss sections.)
  • At which address will the next sk_buff allocation be?
  • What is the current state of this particular page?
  • Where are my page tables located, and which physical address does a given virtual address map to?

Because usercopy hardening blocks access to objects in specialized slabs, reading the contents of a struct kmem_cache is not possible: a kmem_cache is itself allocated from a specialized slab type which does not allow usercopy. But there are many important pieces of kernel memory that are readable, so it is possible to work around that:

  • The kernel .data/.bss sections, which contain things like pointers to kmem_cache instances.
  • The vmemmap region, which contains all instances of struct page/struct folio/struct ptdesc/struct slab (these types all together effectively form a union) which describe the status of each page. These also contain things like a SLUB freelist head pointer; a pointer to the kmem_cache associated with a given SLUB page; or an intrusive linked list element tying together the root page tables of all processes.
  • Kernel stacks of other threads (located in vmalloc memory).
  • Per-CPU memory allocations (located in vmalloc memory), which are used in particular for memory allocation fastpaths in SLUB and the page allocator; and also the metadata describing where the per-cpu memory ranges are located.
  • Page tables.

So to observe the state of the SLUB allocator for a given slab cache, it is possible to first read the corresponding kmem_cache* from the kernel .data/.bss section, then scan through all per-cpu memory for objects that look like a struct kmem_cache_cpu (with a struct slab * and a freelist pointer pointing into the corresponding direct mapping range), and check which kmem_cache the struct slab’s kmem_cache* points to, to determine whether the kmem_cache_cpu is for the right slab cache. Afterwards, the read primitive can be used to read the slab cache’s per-cpu freelist head pointer out of the struct kmem_cache_cpu.

To observe the state of a struct page/struct slab/…, the read primitive can be used to simply read the page’s refcount and mapcount (which contains type information). This makes it possible to observe things like “has this page been freed yet or is it still allocated” and “as what type of page has this page been reallocated”.

To locate the page table root of the current process, it is similarly not possible to directly go through the mm_struct because that is allocated from a specialized slab type which does not allow usercopy (except in the saved_auxv field). But one way to work around this is to instead walk the global linked list of all root page tables (pgd_list), which stores its elements inside struct ptdesc, and search for a struct ptdesc which has a pt_mm field that points to the mm_struct of the current process. The address of this mm_struct can be obtained from the per-cpu variable cpu_tlbstate.loaded_mm. Afterwards, the page tables can be walked through the read primitive.

Finding a reallocation target: The magic of CONFIG_RANDOMIZE_KSTACK_OFFSET

Having already discarded the “bump a pointer by 4 GiB” and “reallocate as a maple tree node” strategies, I went looking for some other allocation which would place an object such that incrementing the value at address 0x…44 leads to a nice primitive. It would be nice to have something there like an important flags field, or a length specifying the size of a pointer array, or something like that. I spent a lot of time looking at various object types that can be allocated on the kernel heap from inside the Chrome sandbox, but found nothing great.

Eventually, I realized that I had been going down the wrong path. Clearly trying to target a heap object was foolish, because there is something much better: It is possible to reallocate the target page as the topmost page of a kernel stack!

That might initially sound like a silly idea; but Debian’s kernel config enables CONFIG_RANDOMIZE_KSTACK_OFFSET=y and CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT=y, causing each syscall invocation to randomly shift the stack pointer down by up to 0x3f0 bytes, with 0x10 bytes granularity. That is supposed to be a security mitigation, but works to my advantage when I already have an arbitrary read: instead of having to find an overwrite target that is at a 0x44-byte distance from the preceding 0x100-byte boundary, I effectively just have to find an overwrite target that is at a 0x4-byte distance from the preceding 0x10-byte boundary, and then keep doing syscalls and checking at what stack depth they execute until I randomly get lucky and the stack lands in the right position.

With that in mind, I went looking for an overwrite target on the stack, strongly inspired by Seth’s exploit that overwrote a spilled register containing a length used in copy_from_user. Targeting a normal copy_from_user() directly wouldn’t work here - if I incremented the 64-bit length used inside copy_from_user() by 4 GiB, then even if the copy failed midway through due to a userspace fault, copy_from_user() would try to memset() the remaining kernel memory to zero.

I discovered that, on the codepath pipe_write -> copy_page_from_iter -> copy_from_iter, the 64-bit length variable bytes of copy_page_from_iter() is stored in register R14, which is spilled to the stack frame of copy_from_iter(); and this stack spill is in a stack location where I can clobber it.

When userspace calls write() on a pipe, the kernel constructs an iterator (struct iov_iter) that encapsulates the userspace memory range passed to write(). (There are different types of iterators that can encapsulate a single userspace range, a set of userspace ranges, or various types of kernel memory.) Then, pipe_write() (which is called anon_pipe_write() in newer kernels) essentially runs a loop which allocates a new pipe_buffer slot in the pipe, places a new page allocation in this pipe buffer slot, and copies up to a page worth of data (PAGE_SIZE bytes) from the iov_iter to the pipe buffer slot’s page using copy_page_from_iter(). copy_page_from_iter() effectively receives two length values: The number of bytes that fit into the caller-provided page (bytes, initially set to PAGE_SIZE here) and the number of bytes available in the struct iov_iter encapsulating the userspace memory range (i->count). The amount of data that will actually be copied is limited by both.

If I manage to increment the spilled register R14 which contains bytes by 4 GiB while copy_from_iter() is busy copying data into the kernel, then after copy_from_iter() returns, copy_page_from_iter() will effectively no longer be bounded by bytes, only by i->count (based on the length userspace passed to write()); so it will do a second iteration, which copies into out-of-bounds memory behind the pipe buffer page. If userspace calls write(fd, buf, 0x3000), and the overwrite happens in the middle of copying bytes 0x1000-0x1fff of the userspace buffer into the second pipe buffer page, then bytes 0x2000-0x2fff will be written out-of-bounds behind the second pipe buffer page, at which point i->count will drop to 0, terminating the operation.

Reallocating a SLUB page as a stack page, with arb-read assistance

So to get the ability to increment-after-free a value in a stack page, I again start by draining the low-order page allocator caches. But this time, the arb-read can be used to determine when an object at the right in-page offset is at the top of the SLUB freelist for the sk_buff slub cache; and the arb-read can also determine whether I managed to allocate an entire slab page worth of objects, with no other objects mixed in. Then, when flushing the page out of the SLUB allocator, the arb-read helps to verify that the page really has been freed (its refcount field should drop to 0); and afterwards, the page is flushed out of the page allocator’s per-cpu freelist.

Then, to reallocate the page, I run a loop that first allocates a pipe page, then checks the refcount field of the target page. If the refcount of the target page goes up, I probably found the target page, and can exit the loop; otherwise, I free the pipe page again, reallocate it as a page table to drain the page away, and try again. (Directly allocating as a page table would be cumbersome because page tables have RCU lifetime, so once a page has been allocated as a page table, it is hard to reallocate it. Keeping drained pages in pipe buffers might not work well due to the low file descriptor table size, and each pipe FD pair potentially only being able to reference two pages.)

Once I have reallocated the target page as a pipe buffer, I free it again, then free three more pages (from other helper pipes), and then create a new thread with the clone() syscall. If everything goes well, clone() will allocate four pages for the new kernel stack: First the three other pages I freed last, and then the target page as the last page of the stack. By walking the page tables, I can verify that the target page really got reused as the last page of the target stack.

Remaining prerequisites for using the write primitive

At this point, I have the write primitive set up such that I can trigger it on a specific stack memory location. The write primitive essentially first reads some surrounding (stack) memory (in unix_stream_read_actor() and its callees skb_copy_datagram_msg -> skb_copy_datagram_iter) and expects that memory to have a certain structure before incrementing the value at a specific stack location.

I also know what stack allocation I want to overwrite.

The remaining issues are:

  1. I need to ensure that an OOB copy_from_user() behind a pipe buffer page will overwrite some data that helps in compromising the kernel.
  2. I need to be able to detect at what stack depth pipe_write() is running, and depending on that either try again or proceed to trigger the bug.
  3. The UAF reads preceding the UAF increment need to see the right kind of data to avoid crashing.
  4. copy_from_iter() needs to take enough time to allow me to increment a value in its stack frame.

Selecting an OOB overwrite target

Page tables have several nice properties here:

  • It is easy for me to cause allocation of as many page tables as I want.
  • I can easily determine the physical and kernel-virtual addresses of page tables that the kernel has allocated for my process (by walking the page tables with the arb read).
  • They are order-0 unmovable allocations, just like pipe buffers, so the page allocator will allocate them in the same 2MiB pageblocks.

So I am choosing to use the OOB copy_from_user() to overwrite a page table.

This requires that I can observe where my pipe buffer pages are located; for that, I again use the SLUB per-cpu freelist observing trick, this time on the kmalloc-cg-192 slab cache, to figure out where a newly created pipe’s pipe_inode_info is located. From there, I can walk to the pipe’s pipe_buffer array, which contains pointers to the pages used by the pipe.

With the ability to observe both where my page tables are located and where pipe buffer pages are allocated, I can essentially alternatingly allocate page tables and pipe buffer pages until I get two that are adjacent.

Detecting pipe_write() stack depth

To run pipe_write() with a write() syscall such that I can reliably determine at which depth the function is running and decide whether to go ahead with the corruption, without having to race, I can prepare a pipe such that it initially only has space for one more pipe_buffer, and then call write() with a length of 0x3000. This will cause pipe_write() to first store 0x1000 bytes in the last free pipe_buffer slot, then wait for space to become available again. From another thread, it is possible to detect when pipe_write() has used the last free pipe_buffer slot by repeatedly calling poll() on the pipe: When poll() stops reporting that the pipe is ready for writing (POLLOUT), pipe_write() must have used up the last free pipe_buffer slot.

At that point, I know that the syscall entry part of the kernel stack is no longer changing. To check whether the syscall is executing at a specific depth, it is enough to check whether the return address for the return from x64_sys_call to do_syscall_64 is at the expected position on the kernel stack using the arb read - it can’t be a return address left from a preceding syscall because the same stack location where that return address is stored is always clobbered by a subsequent call to syscall_exit_to_user_mode at the end of a syscall.

If the stack randomization is the correct one, I can then do more setup and resume pipe_write() by using read() to clear pipe buffer entries; otherwise, I will use read() to clear pipe buffer entries, let pipe_write() run to completion, and try again.

Letting the reads in the increment primitive see the right data

The increment primitive happens on this call graph:

unix_stream_recv_urg
  [read dangling pointer from ->oob_skb]
  unix_stream_read_actor [called as state->recv_actor]
    [UAF read UNIXCB(skb).consumed]
    skb_copy_datagram_msg
      skb_copy_datagram_iter
        __skb_datagram_iter
          skb_headlen
            [UAF read skb->len]
            [UAF read skb->data_len]
          skb_frags_readable
            [UAF read skb->unreadable]
          skb_shinfo [for reading nr_frags]
            skb_end_pointer
              [UAF read skb->head]
              [UAF read skb->end]
          skb_walk_frags
            skb_shinfo [for reading frag_list]
            [forward iteration starting at skb_shinfo(skb)->frag_list along ->next pointers]
  [UAF increment of UNIXCB(oob_skb).consumed]

A promising aspect here is that this codepath first does all the reads; then it does a linked list walk through attacker-controlled pointers with skb_walk_frags(); and then it does the write. skb_walk_frags() is defined as follows:

#define skb_walk_frags(skb, iter)	\
	for (iter = skb_shinfo(skb)->frag_list; iter; iter = iter->next)

and is used like this in __skb_datagram_iter():

	skb_walk_frags(skb, frag_iter) {
		int end;

		WARN_ON(start > offset + len);

		end = start + frag_iter->len;
		if ((copy = end - offset) > 0) {
			if (copy > len)
				copy = len;
			if (__skb_datagram_iter(frag_iter, offset - start,
						to, copy, fault_short, cb, data))
				goto fault;
			if ((len -= copy) == 0)
				return 0;
			offset += copy;
		}
		start = end;
	}

So if I run recv(..., MSG_OOB) on the UNIX domain socket while the dangling ->oob_skb pointer points to data I control, and craft that fake SKB such that its skb_shinfo(skb)->frag_list points to another fake SKB with ->len=0 and a ->next pointer pointing back to itself, I can cause the syscall to get stuck in an infinite loop. It will keep looping until I replace the ->next pointer with NULL, at which point it will perform just the UAF increment.

This is great news: instead of needing to ensure that the stack contains the right data for the UAF reads and the overwrite target for the UAF increment at the same time, I can first place controlled data on the stack, and then afterwards separately place the overwrite target on the stack.

To place controlled data on the stack, I initially considered using select() or poll(), since I know that those syscalls copy large-ish amounts of data from userspace onto the stack; however, those have the disadvantage of immediately validating the supplied data, and it would be hard to make them actually stay in the syscall, rather than immediately returning out of the syscall with an error and often clobbering the on-stack data array in the process.

Eventually I discovered that sendmsg() on a datagram-oriented UNIX domain socket works great for this: ___sys_sendmsg(), which implements the sendmsg() syscall, will import the destination address pointed to by msg->msg_name into a stack buffer (struct sockaddr_storage address), then call into the protocol-specific ->sendmsg handler - in the case of datagram-oriented UNIX domain sockets, unix_dgram_sendmsg(). This function coarsely validates the structure of the destination address (checking that it specifies the AF_UNIX family and is no larger than struct sockaddr_un), then waits for space to become available in the socket’s queue before doing anything else with the destination address. This makes it possible to place 108 bytes of controlled data on a kernel stack, and that data will stay there until the syscall can continue or bail out when space becomes available in the socket queue or the socket is shut down.

I actually need a bit more data on the stack, but luckily the struct iovec iovstack[UIO_FASTIOV] is directly in front of the address, and unused elements at the end of the iovstack are guaranteed to be zeroed thanks to CONFIG_INIT_STACK_ALL_ZERO=y, which happens to be exactly what I need.

It would be helpful to be able to reliably wait for the sendmsg() syscall to enter the kernel and copy the destination address onto the kernel stack before inspecting the state of its stack; this is luckily possible by supplying a single-byte “control message” via msg->msg_control and msg->msg_controllen, which will mostly be ignored because it is too small to be a legitimate control message, but will be copied onto the kernel stack in ____sys_sendmsg() after the destination address has been copied onto the stack. It is possible to detect from userspace when this kernel access to msg->msg_control happens by pointing it to a userspace address which is not yet populated with a page table entry, then polling mincore() on this userspace address.

So now my strategy is roughly:

  1. In a loop, call sendmsg() on the thread with the stack the dangling ->oob_skb pointer points to to place a fake SKB on the stack until the fake SKB lands at the right stack offset thanks to CONFIG_RANDOMIZE_KSTACK_OFFSET, and have that fake SKB’s skb_shinfo(skb)->frag_list point to a second fake SKB with a ->next pointer that refers back to itself. (This second fake SKB can be placed anywhere I want, so I’m putting it in a userspace-owned page, so that userspace can directly write into it.)
  2. On a second thread, use write() on a UNIX domain socket to use the dangling ->oob_skb pointer, which will start looping endlessly, following the ->next pointer.
  3. On the thread that called sendmsg() before, now call write(..., 0x3000) on a pipe with one free pipe_buffer slot in a loop until the syscall handler lands at the right stack offset thanks to CONFIG_RANDOMIZE_KSTACK_OFFSET.
  4. Let the pipe write() continue, and wait until it is in the middle of copying data from userspace memory to a pipe buffer page.
  5. Set the ->next pointer in the second fake SKB to NULL, so that the write() on the UNIX domain socket stops looping, performs the UAF increment, and returns.
  6. Wait for the pipe write() to finish, at which point the page table behind the pipe data page should have been overwritten with controlled data.

Slowing down copy_from_iter()

I need to slow down a copy_from_iter() call. There are several strategies for this that don’t work (or don’t work well) in a Chrome renderer sandbox:

  • userfaultfd: not accessible in the Chrome Desktop renderer sandbox, and nowadays usually nerfed anyway such that only root can use it to intercept usercopy operations
  • FUSE: not accessible in the Chrome Desktop renderer sandbox
  • causing lots of major page faults: I’m not sure if there is some indirect way to get a file descriptor to a writable on-disk file; but either way, this seems like it would be a pain from a renderer.

But as long as only a single userspace memory read needs to be delayed, there is another option: I can create a very large anonymous VMA; fill it with mappings of the 4KiB zeropage; ensure that no page is mapped at one specific location in the VMA (for example with madvise(..., MADV_DONTNEED), which zaps page table entries in the specified range); and then have one thread run an mprotect() operation on this large anonymous VMA while another thread tries to access the part of the userspace region where no page is currently mapped. The mprotect() operation will keep the VMA write-locked while it walks through all the associated page table entries, modifies the page table entries as required, and performs TLB flushes if necessary; so a concurrent page fault in this VMA will have to wait until the mprotect() has finished. One limitation of this technique is that the part of the accessed userspace range that causes the slowdown will be filled with zeroes; but that can just be a single byte at the start or end of the range being copied, so it’s not a major limitation.

Based on some rough testing on my machine, if mprotect() has to iterate through 128 MiB of page tables populated with zeropage mappings, it takes something like 500-1000ms depending on which way the page table entries are changed.

Page table control

Putting all this together, I can overwrite the contents of a page table with controlled data. I’m using that controlled write to place a new entry in the page table that points back to the page table, effectively creating a userspace mapping of the page table; and then I can use this to map arbitrary kernel memory writably into userspace.

My exploit demonstrates its ability to modify kernel memory with this by using it to overwrite the UTS information printed by uname.

Takeaway: Chrome sandbox attack surface

One thing that stood out to me about this is that I was able to use a somewhat large number of kernel interfaces in this exploit; in particular:

  • anonymous VMA creation: page table allocations
  • madvise(): fast VMA splitting and merging
  • AF_UNIX SOCK_STREAM sockets: triggering the bug; SKB allocation and freeing
  • sched_getcpu() (via syscall-less fastpaths): interacting with per-cpu kernel structures
  • eventfd(): synchronization between threads
  • pipe(): allocation and freeing of order-0 unmovable pages with controlled contents; stack overwrite target
  • AF_UNIX SOCK_DGRAM sockets: placing controlled data on the stack
  • sendmsg(): placing controlled data on the stack
  • mprotect(): slowing down copy_from_user()
  • munmap(): TLB flushing
  • madvise(..., MADV_DONTNEED): zapping PTEs for slowing down subsequent copy_from_user() or subsequently detecting copy_from_user()
  • mincore(): detecting copy_from_user()
  • clone(): racing operations on multiple threads; reallocating pages as kernel stack
  • poll(): detecting progress of concurrent pipe_write()

Some of these are obviously needed to implement necessary features of the sandboxed renderer; others seem like unnecessary attack surface. I hope to look at this more systematically in the future.

Takeaway: Esoteric kernel features in core interfaces are an issue for browser sandboxes

One thing I’ve noticed, not just with this issue, but several issues before that, is that core kernel subsystems (which are exposed in renderer sandbox policies and such) sometimes have flags that trigger esoteric ancillary features that are unintentionally exposed by Chrome’s renderer sandbox. Such features seem to often be more buggy than the core feature that the policy intended to expose. Examples of this from Chrome’s past include:

  • futex() was broadly exposed in the sandbox, making it possible to reach a bug in Priority Inheritance futexes from the renderer sandbox.
  • memfd_create() was exposed in the sandbox without checking its flags, making it possible to create HugeTLB mappings using the MFD_HUGETLB flag. There have been several bugs in HugeTLB, which is to my knowledge almost exclusively used by some server applications that use large amounts of RAM, such as databases.
  • pipe2() was exposed in the sandbox without checking its flags, making it possible to create “notification pipes” using the O_NOTIFICATION_PIPE flag, which behave very differently from normal pipes and are used exclusively for posting notifications from the kernel “keys” subsystem to userspace.

Takeaway: probabilistic mitigations against attackers with arbitrary read

When faced with an attacker who already has an arbitrary read primitive, probabilistic mitigations that randomize something differently on every operation can be ineffective, because the attacker can keep retrying until the arbitrary read confirms that the randomization picked a suitable value. Such randomization can even work to the attacker’s advantage by lining up memory locations that could otherwise never overlap, as done here using the kernel stack randomization feature.

Picking per-syscall random stack offsets at boottime might avoid this issue, since to retry with different offsets, the attacker would have to wait for the machine to reboot or try again on another machine. However, that would break the protection for cases where the attacker wants to line up two syscalls that use the same syscall number (such as different ioctl() calls); and it could also weaken the protection in cases where the attacker just needs to know what the randomization offset for some syscall will be.

Somewhat relatedly, Blindside demonstrated that this style of attack can be pulled off without a normal arbitrary read primitive, by “exploiting” a real kernel memory corruption bug during speculative execution in order to leak information needed for subsequently exploiting the same memory corruption bug for real.

Takeaway: syzkaller fuzzing and complex data structures

The first memory corruption bug described in this post was introduced in late June 2024, and discovered by syzkaller in late August 2024. Hitting that bug required 6 syscalls: One to set up a socket pair, four send()/recv() calls to set up a dangling pointer, and one more recv() call to actually trigger UAF by accessing the dangling pointer.

Hitting the second memory corruption bug, which I found by code review, required 8 syscalls: One to set up a socket pair, six send()/recv() calls to set up a dangling pointer, and one more recv() to cause UAF.

This was not a racy bug; in a KASAN build, running the buggy syscall sequence once would be enough to get a kernel splat. But when a fuzzer chains together syscalls more or less at random, the chance of running the right sequence of syscalls drops exponentially with each syscall required…

The most important takeaway from this is that data structures with complex safety rules (in this case, rules about the ordering of different types of SKBs in the receive queues of UNIX domain stream sockets) don’t just make it hard for human programmers to keep track of safety rules, they also make it hard for fuzzers to construct inputs that explore all relevant state patterns. This might be an area for fuzzer improvement - perhaps fuzzers could reach deeper into specific subsystems by generating samples that focus on interaction with a single kernel subsystem, or by monitoring whether additional syscalls chained to the end of a base sample cause additional activity in a particular subsystem.

Takeaway: copy_from_user() delays don’t require FUSE or userfaultfd

FUSE and userfaultfd are the most effective and reliable ways to inject delays on copy_from_user() calls because they can set up separate delays for multiple memory regions, provide precise control over the timing of the injected delay, don’t require large allocations or slow preparation, and allow placing arbitrary data in the page that is eventually installed. However, applying mprotect() to a large anonymous VMA filled with zeropage mappings (with 128 MiB of page tables) turns out to be sufficient to delay kernel execution by around a second. In the past, I have pushed for restricting userfaultfd because of how it can delay operations like copy_from_user(), but perhaps userfaultfd was not actually significantly more useful in this regard than mprotect().

Takeaway: Usercopy hardening

The hardening checks I encountered when calling copy_to_user() on arbitrary kernel addresses were a major annoyance, but could be worked around, since access to almost anything except type-specific SLUB pages is allowed. That said, I’m not sure how important improving these checks is - trying to protect against an attacker who can pass arbitrary kernel pointers to copy_to_user() might be futile, and guarding against out-of-bounds/use-after-free copy_to_user() or such is the major focus of this hardening.

Conclusions

Even in somewhat constrained environments, it is possible to pull off moderately complex Linux kernel exploits.

Chrome’s Linux desktop renderer sandbox exposes kernel attack surface that is never legitimately used in the sandbox. This needless functionality doesn’t just allow attackers to exercise vulnerabilities they otherwise couldn’t; it also exposes kernel interfaces that are useful for exploitation, enabling heap grooming, delay injection and more. The Linux kernel contributes to this issue by exposing esoteric features through the same syscalls as commonly-used core kernel functionality. I hope to do a more in-depth analysis of Chrome’s renderer sandbox on Linux in a follow-up blogpost.

Policy and Disclosure: 2025 Edition

29 July 2025 at 09:00

Posted by Tim Willis, Google Project Zero

In 2021, we updated our vulnerability disclosure policy to the current "90+30" model. Our goals were to drive faster yet thorough patch development, and improve patch adoption. While we’ve seen progress, a significant challenge remains: the time it takes for a fix to actually reach an end-user's device.

This delay, often called the "patch gap," is a complex problem. Many consider the patch gap to be the time between a fix being released for a security vulnerability and the user installing the relevant update. However, our work has highlighted a critical, earlier delay: the "upstream patch gap". This is the period where an upstream vendor has a fix available, but downstream dependents, who are ultimately responsible for shipping fixes to users, haven’t yet integrated it into their end product.

As Project Zero's recent work has focused on foundational, upstream technologies like chipsets and their drivers, we've observed that this upstream gap significantly extends the vulnerability lifecycle.

For the end user, a vulnerability isn't fixed when a patch is released from Vendor A to Vendor B; it's only fixed when they download the update and install it on their device. To shorten that entire chain, we need to address the upstream delay.

To address this, we're announcing a new trial policy: Reporting Transparency.

The Trial: Reporting Transparency

Our core 90-day disclosure deadline will remain in effect. However, we're adding a new step at the beginning of the process.

Beginning today, within approximately one week of reporting a vulnerability to a vendor, we will publicly share that a vulnerability was discovered. We will share:

  • The vendor or open-source project that received the report.
  • The affected product.
  • The date the report was filed, and when the 90-day disclosure deadline expires.

This trial maintains our existing 90+30 policy, meaning vendors still have 90 days to fix a bug before it is disclosed, with a 30-day period for patch adoption if the bug is fixed before the deadline.

Google Big Sleep, a collaboration between Google DeepMind and Google Project Zero, will also be trialling this policy for their vulnerability reports. The issue tracker for Google Big Sleep is at goo.gle/bigsleep.

Why the Change? Increased Transparency to Close the Gap

The primary goal of this trial is to shrink the upstream patch gap by increasing transparency. By providing an early signal that a vulnerability has been reported upstream, we can better inform downstream dependents. For our small set of issues, they will have an additional source of information to monitor for issues that may affect their users. 

We hope that this trial will encourage the creation of stronger communication channels between upstream vendors and downstream dependents relating to security, leading to faster patches and improved patch adoption for end users.

This data will make it easier for researchers and the public to track how long it takes for a fix to travel from the initial report, all the way to a user's device (which is especially important if the fix never arrives!).

Will this help attackers?

We don't believe so. While there may be increased public attention on unfixed bugs in the initial phase of this trial, we want to be clear: no technical details, proof-of-concept code, or information that we believe would materially assist discovery will be released until the deadline. Reporting Transparency is an alert, not a blueprint for attackers.

We understand that for some vendors without a downstream ecosystem, this policy may create unwelcome noise and attention for vulnerabilities that only they can address. However, these vendors now represent the minority of vulnerabilities reported by Project Zero. We believe the benefits of a fair, simple, consistent and transparent policy outweigh the risk of inconvenience to a small number of vendors.

That said, in 2025, we hope that the industry consensus is that the mere existence of vulnerabilities in software is neither surprising nor alarming. End users are more aware of the importance of security updates than ever before. It's widely accepted as fact that any system of moderate complexity will have vulnerabilities, and systems that were considered impenetrable in the past have been shown to be vulnerable in retrospect.

This is a trial, and we will be closely monitoring its effects. We hope it achieves our ultimate goal: a safer ecosystem where vulnerabilities are remediated not just in an upstream code repository, but on the devices, systems and services that people use every day. We look forward to sharing our findings and continuing to evolve our policies to meet the challenges of the ever-changing security landscape.

The Windows Registry Adventure #8: Practical exploitation of hive memory corruption

28 May 2025 at 09:00

Posted by Mateusz Jurczyk, Google Project Zero

In the previous blog post, we focused on the general security analysis of the registry and how to effectively approach finding vulnerabilities in it. Here, we will direct our attention to the exploitation of hive-based memory corruption bugs, i.e., those that allow an attacker to overwrite data within an active hive mapping in memory. This is a class of issues characteristic of the Windows registry, but universal enough that the techniques described here are applicable to 17 of my past vulnerabilities, as well as likely any similar bugs in the future. As we know, hives exhibit a very special behavior in terms of low-level memory management (how and where they are mapped in memory), handling of allocated and freed memory chunks by a custom allocator, and the nature of data stored there. All this makes exploiting this type of vulnerability especially interesting from the offensive security perspective, which is why I would like to describe it here in detail.

Similar to any other type of memory corruption, the vast majority of hive memory corruption issues can be classified into two groups: spatial violations (such as buffer overflows):

A diagram showing a corrupted memory cell overflowing an adjacent cell

and temporal violations, such as use-after-free conditions:

A diagram showing multiple invalid references to a freed cell

In this write-up, we will aim to select the most promising vulnerability candidate and then create a step-by-step exploit for it that will elevate the privileges of a regular user in the system, from Medium IL to system-level privileges. Our target will be Windows 11, and an additional requirement will be to successfully bypass all modern security mitigations. I have previously presented on this topic at OffensiveCon 2024 in a talk titled "Practical Exploitation of Registry Vulnerabilities in the Windows Kernel", and this blog post can be considered a supplement and expansion of the information shown there. Those deeply interested in the subject are encouraged to review the slides and recording of that presentation.

Where to start: high-level overview of potential options

Let's start with a recap of some key points. As you may recall, the Windows registry cell allocator (i.e., the internal HvAllocateCell, HvReallocateCell, and HvFreeCell functions) operates in a way that is very favorable for exploitation. Firstly, it completely lacks any safeguards against memory corruption, and secondly, it has no element of randomness, making its behavior entirely predictable. Consequently, there is no need to employ any "hive spraying" or other similar techniques known from typical heap exploitation – if we manage to achieve the desired cell layout on a test machine, it will be reproducible on other computers without any additional steps. A potential exception could be carrying out attacks on global, shared hives within HKLM and HKU, as we don't know their initial state, and some randomness may arise from operations performed concurrently by other applications. Nevertheless, even this shouldn't pose a particularly significant challenge. We can safely assume that arranging the memory layout of a hive is straightforward, and if we have some memory corruption capability within it, we will eventually be able to overwrite any type of cell given some patience and experimentation.

The exploitation of classic memory corruption bugs typically involves the following steps:

  1. Initial memory corruption primitive
  2. ???
  3. ???
  4. ???
  5. Profit (in the form of arbitrary code execution, privilege escalation, etc.)

The task of the exploit developer is to fill in the gaps in this list, devising the intermediate steps leading to the desired goal. There are usually several such intermediate steps because, given the current state of security and mitigations, vulnerabilities rarely lead directly from memory corruption to code execution in a single step. Instead, a strategy of progressively developing stronger and stronger primitives is employed, where the final chain might look like this, for instance:

A flowchart depicting exploit development strategy, starting with "Memory corruption" which leads to "Information leak". This is followed by "Arbitrary vtable call", then "ROP" (Return-Oriented Programming). "ROP" leads to "Allocation of executable payload", which ultimately results in "Arbitrary code execution".

In this model, the second/third steps are achieved by finding another interesting object, arranging for it to be allocated near the overwritten buffer, and then corrupting it in such a way as to create a new primitive. However, in the case of hives, our options in this regard seem limited: we assume that we can fully control the representation of any cell in the hive, but the problem is that there is no immediately interesting data in them from an exploitation point of view. For example, the regf format does not contain any data that directly influences control flow (e.g., function pointers), nor any other addresses in virtual memory that could be overwritten in some clever way to improve the original primitive. The diagram below depicts our current situation:

A diagram showing a box labeled "Hive memory corruption" with an arrow pointing to a second box with a dashed outline and a question mark inside, indicating an unknown next step resulting from hive memory corruption.

Does this mean that hive memory corruption is non-exploitable, and the only thing it allows for is data corruption in an isolated hive memory view? Not quite. In the following subsections, we will carefully consider various ideas of how taking control of the internal hive data can have a broader impact on the overall security of the system. Then, we will try to determine which of the available approaches is best suited for use in a real-world exploit.

Intra-hive corruption

Let's start by investigating whether overwriting internal hive data is as impractical as it might initially seem.

Performing hive-only attacks in privileged system hives

To be clear, it's not completely accurate to say that hives don't contain any data worth overwriting. If you think about it, it's quite the opposite – the registry stores a vast amount of system configuration, information about registered services, user passwords, and so on. The only issue is that all this critical data is located in specific hives, namely those mounted under HKEY_LOCAL_MACHINE, and some in HKEY_USERS (e.g., HKU\.Default, which corresponds to the private hive of the System user). To be able to perform a successful attack and elevate privileges by corrupting only regf format data (without accessing other kernel memory or achieving arbitrary code execution), two conditions must be met:

  1. The vulnerability must be triggerable solely through API/system calls and must not require binary control over the hive, as we obviously don't have that over any system hive.
  2. The target hive must contain at least one key with permissive enough access rights that allow unprivileged users to create values (KEY_SET_VALUE permission) and/or new subkeys (KEY_CREATE_SUB_KEY). Some other access rights might also be necessary, depending on the prerequisites of the specific bug.

Of the two points above, the first is definitely more difficult to satisfy. Many hive memory corruption bugs result from a strange, unforeseen state in the hive structures that can only be generated "offline", starting with full control over the given file. API-only vulnerabilities seem to be relatively rare: for instance, of my 17 hive-based memory corruption cases, less than half (specifically 8 of them) could theoretically be triggered solely by operations on an existing hive. Furthermore, a closer look reveals that some of them do not meet other conditions needed to target system hives (e.g., they only affect differencing hives), or are highly impractical, e.g., require the allocation of more than 500 GB of memory, or take many hours to trigger. In reality, out of the wide range of vulnerabilities, there are really only two that would be well suited for directly attacking a system hive: CVE-2023-23420 (discussed in the "Operating on subkeys of transactionally renamed keys" section of the report) and CVE-2023-23423 (discussed in "Freeing a shallow copy of a key node with CmpFreeKeyByCell").

Regarding the second issue – the availability of writable keys – the situation is much better for the attacker. There are three reasons for this:

  • To successfully carry out a data-only attack on a system key, we are usually not limited to one specific hive, but can choose any that suits us. Exploiting hive corruption in most, if not all, hives mounted under HKLM would enable an attacker to elevate privileges.
  • The Windows kernel internally implements the key opening process by first doing a full path lookup in the registry tree, and only then checking the required user permissions. The access check is performed solely on the security descriptor of the specific key, without considering its ancestors. This means that setting overly permissive security settings for a key automatically makes it vulnerable to attacks, as according to this logic, it receives no additional protection from its ancestor keys, even if they have much stricter access controls.
  • There are a large number of user-writable keys in the HKLM\SOFTWARE and HKLM\SYSTEM hives. They do not exist in HKLM\BCD00000000, HKLM\SAM, or HKLM\SECURITY, but as I mentioned above, only one such key is sufficient for successful exploitation.

To find specific examples of such publicly accessible keys, it is necessary to write custom tooling. This tooling should first recursively list all existing keys within the low-level \Registry\Machine and \Registry\User paths, while operating with the highest possible privileges, ideally as the System user. This will ensure that the process can see all the keys in the registry tree – even those hidden behind restricted parents. It is not worth trying to enumerate the subkeys of \Registry\A, as any references to it are unconditionally blocked by the Windows kernel. Similarly, \Registry\WC can likely be skipped unless one is interested in attacking differencing hives used by containerized applications. Once we have a complete list of all the keys, the next step is to verify which of them are writable by unprivileged users. This can be accomplished either by reading their security descriptors (using RegGetKeySecurity) and manually checking their access rights (using AccessCheck), or by delegating this task entirely to the kernel and simply trying to open every key with the desired rights while operating with regular user privileges. In either case, we should ultimately be able to obtain a list of potential keys that can be used to corrupt a system hive.

Based on my testing, there are approximately 1678 keys within HKLM that grant subkey creation rights to normal users on a current Windows 11 system. Out of these, 1660 are located in HKLM\SOFTWARE, and 18 are in HKLM\SYSTEM. Some examples include:

HKLM\SOFTWARE\Microsoft\CoreShell

HKLM\SOFTWARE\Microsoft\DRM

HKLM\SOFTWARE\Microsoft\Input\Locales          (and some of its subkeys)

HKLM\SOFTWARE\Microsoft\Input\Settings         (and some of its subkeys)

HKLM\SOFTWARE\Microsoft\Shell\Oobe

HKLM\SOFTWARE\Microsoft\Shell\Session

HKLM\SOFTWARE\Microsoft\Tracing                (and some of its subkeys)

HKLM\SOFTWARE\Microsoft\Windows\UpdateApi

HKLM\SOFTWARE\Microsoft\WindowsUpdate\UX

HKLM\SOFTWARE\WOW6432Node\Microsoft\DRM

HKLM\SOFTWARE\WOW6432Node\Microsoft\Tracing

HKLM\SYSTEM\Software\Microsoft\TIP             (and some of its subkeys)

HKLM\SYSTEM\ControlSet001\Control\Cryptography\WebSignIn\Navigation

HKLM\SYSTEM\ControlSet001\Control\MUI\StringCacheSettings

HKLM\SYSTEM\ControlSet001\Control\USB\AutomaticSurpriseRemoval

HKLM\SYSTEM\ControlSet001\Services\BTAGService\Parameters\Settings

As we can see, there are quite a few possibilities. The second key on the list, HKLM\SOFTWARE\Microsoft\DRM, has been somewhat popular in the past, as it was previously used by James Forshaw to demonstrate two vulnerabilities he discovered in 2019–2020 (CVE-2019-0881, CVE-2020-1377). Subsequently, I also used it as a way to trigger certain behaviors related to registry virtualization (CVE-2023-21675, CVE-2023-21748, CVE-2023-35357), and as a potential avenue to fill the SOFTWARE hive to its capacity, thereby causing an OOM condition as part of exploiting another bug (CVE-2023-32019). The main advantage of this key is that it exists in all modern versions of the system (since at least Windows 7), and it grants broad rights to all users (the Everyone group, also known as World, or S-1-1-0). The other keys mentioned above also allow regular users to perform write operations, but they often do so through other, potentially more restricted groups such as Interactive (S-1-5-4), Users (S-1-5-32-545), or Authenticated Users (S-1-5-11), which may be something to keep in mind.

Apart from global system hives, I also discovered the curious case of the HKCU\Software\Microsoft\Input\TypingInsights key being present in every user's hive, which permits read and write access to all other users in the system. I reported it to Microsoft in December 2023 (link to report), but it was deemed low severity and hasn't been fixed so far. This decision is somewhat understandable, as the behavior doesn't have direct, serious consequences for system security, but it still can work as a useful exploitation technique. Since any user can open a key for writing in the user hive of any other user, they gain the ability to:

  • Fill the entire 2 GiB space of that hive, resulting in a DoS condition (the user and their applications cannot write to HKCU) and potentially enabling exploitation of bugs related to mishandling OOM conditions within the hive.
  • Write not just to the "TypingInsights" key in the HKCU itself, but also to any of the corresponding keys in the differencing hives overlaid on top of it. This provides an opportunity to attack applications running within app/server silos with that user's permissions.
  • Perform hive-based memory corruption attacks not only on system hives, but also on the hives of specific users, allowing for a more lateral privilege escalation scenario.


As demonstrated, even a seemingly minor weakness in the security descriptor of a single registry key can have significant consequences for system security.

In summary, attacking system hives with hive memory corruption is certainly possible, but requires finding a very good vulnerability that can be triggered on existing keys, without the need to load a custom hive. This is a good starting point, but perhaps we can find a more universal technique.

Abusing regf inconsistency to trigger kernel pool corruption

While hive mappings in memory are isolated and self-contained to some extent, they do not exist in a vacuum. The Windows kernel allocates and manages many additional registry-related objects within the kernel pool space, as discussed in blog post #6. These objects serve as optimization through data caching, and help implement certain functionalities that cannot be achieved solely through operations on the hive space (e.g., transactions, layered keys). Some of these objects are long-lived and persist in memory as long as the hive is mounted. Other buffers are allocated and immediately freed within the same syscall, serving only as temporary data storage. The memory safety of all these objects is closely tied to the consistency of the corresponding data within the hive mapping. After the kernel meticulously verifies the hive validity in CmCheckRegistry and related functions, it assumes that the registry hive's data maintains consistency with its own structure and associated auxiliary structures.

For a potential attacker, this means that hive memory corruption can be potentially escalated to some forms of pool corruption. This provides a much broader spectrum of options for exploitation, as there are a variety of pool allocations used by various parts of the kernel. In fact, I even took advantage of this behavior in my reports to Microsoft: in every case of a use-after-free on a security descriptor, I would enable Special Pool and trigger a reference to the cached copy of that descriptor in the pool through the _CM_KEY_CONTROL_BLOCK.CachedSecurity field. I did this because it is much easier to generate a reliably reproducible crash by accessing a freed allocation on the pool than when accessing a freed but still mapped cell in the hive.

However, this is certainly not the only way to cause pool memory corruption by modifying the internal data of the regf format. Another idea would be, for example, to create a very long "big data" value in the hive (over ~16 KiB in a hive with version ≥ 1.4) and then cause _CM_KEY_VALUE.DataLength to be inconsistent with the _CM_BIG_DATA.Count field, which denotes the number of 16-kilobyte chunks in the backing buffer. If we look at the implementation of the internal CmpGetValueData function, it is easy to see that it allocates a paged pool buffer based on the former value, and then copies data to it based on the latter one. Therefore, if we set _CM_KEY_VALUE.DataLength to a number less than 16344 × (_CM_BIG_DATA.Count - 1), then the next time the value's data is requested, a linear pool buffer overflow will occur.
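To make the size mismatch concrete, here is a minimal Python sketch (illustrative constants and names, not kernel code) of when the copy performed by CmpGetValueData exceeds the DataLength-sized allocation:

```python
# Illustrative model of the "big data" size mismatch. Each _CM_BIG_DATA
# chunk carries 16344 bytes of value payload; the pool buffer is sized
# from _CM_KEY_VALUE.DataLength, while the copy walks Count chunks.
CHUNK_PAYLOAD = 16344

def big_data_overflows(data_length: int, chunk_count: int) -> bool:
    # A linear pool overflow occurs once the full chunks alone exceed
    # the allocation, i.e. DataLength < 16344 * (Count - 1).
    return data_length < CHUNK_PAYLOAD * (chunk_count - 1)

# Consistent value: 40000 bytes of data spans 3 chunks, no overflow.
assert not big_data_overflows(40000, 3)
# Corrupted hive: DataLength shrunk to 1000 with Count still at 3.
assert big_data_overflows(1000, 3)
```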

This type of primitive is promising, as it opens the door to targeting a much wider range of objects in memory than was previously possible. The next step would likely involve finding a suitable object to place immediately after the overwritten buffer (e.g., pipe attributes, as mentioned in this article from 2020), and then corrupting it to achieve a more powerful primitive like arbitrary kernel read/write. In short, such an attack would boil down to a fairly generic exploitation of pool-based memory corruption, a topic widely discussed in existing resources. We won't explore this further here, and instead encourage interested readers to investigate it on their own.

Inter-hive memory corruption

So far in our analysis, we have assumed that with a hive-based memory corruption bug, we can only modify data within the specific hive we are operating on. In practice, however, this is not necessarily the case, because there might be other data located in the immediate vicinity of our bin's mapping in memory. If that happens, it might be possible to seamlessly cross the boundary between the original hive and some more interesting objects at higher memory addresses using a linear buffer overflow. In the following sections, we will look at two such scenarios: one where the mapping of the attacked hive is in the user-mode space of the "Registry" process, and one where it resides in the kernel address space.

Other hive mappings in the user space of the Registry process

Mapping the section views of hives in the user space of the Registry process is the default behavior for the vast majority of the registry. The layout of individual mappings in memory can be easily observed from WinDbg. To do this, find the Registry process (usually the second in the system process list), switch to its context, and then issue the !vad command. An example of performing these operations is shown below.

0: kd> !process 0 0

**** NT ACTIVE PROCESS DUMP ****

PROCESS ffffa58fa069f040

    SessionId: none  Cid: 0004    Peb: 00000000  ParentCid: 0000

    DirBase: 001ae002  ObjectTable: ffffe102d72678c0  HandleCount: 3077.

    Image: System

PROCESS ffffa58fa074a080

    SessionId: none  Cid: 007c    Peb: 00000000  ParentCid: 0004

    DirBase: 1025ae002  ObjectTable: ffffe102d72d1d00  HandleCount: <Data Not Accessible>

    Image: Registry

[...]

0: kd> .process ffffa58fa074a080

Implicit process is now ffffa58f`a074a080

WARNING: .cache forcedecodeuser is not enabled

0: kd> !vad

VAD             Level         Start             End              Commit

ffffa58fa207f740  5        152e7a20        152e7a2f               0 Mapped       READONLY           \Windows\System32\config\SAM

ffffa58fa207dbc0  4        152e7a30        152e7b2f               0 Mapped       READONLY           \Windows\System32\config\DEFAULT

ffffa58fa207dc60  5        152e7b30        152e7b3f               0 Mapped       READONLY           \Windows\System32\config\SECURITY

ffffa58fa207d940  3        152e7b40        152e7d3f               0 Mapped       READONLY           \Windows\System32\config\SOFTWARE

ffffa58fa207dda0  5        152e7d40        152e7f3f               0 Mapped       READONLY           \Windows\System32\config\SOFTWARE

[...]

ffffa58fa207e840  5        152ec940        152ecb3f               0 Mapped       READONLY           \Windows\System32\config\SOFTWARE

ffffa58fa207b780  3        152ecb40        152ecd3f               0 Mapped       READONLY           \Windows\System32\config\SOFTWARE

ffffa58fa0f98ba0  5        152ecd40        152ecd4f               0 Mapped       READONLY           \EFI\Microsoft\Boot\BCD

ffffa58fa3af5440  4        152ecd50        152ecd8f               0 Mapped       READONLY           \Windows\ServiceProfiles\NetworkService\NTUSER.DAT

ffffa58fa3bfe9c0  5        152ecd90        152ecdcf               0 Mapped       READONLY           \Windows\ServiceProfiles\LocalService\NTUSER.DAT

ffffa58fa3ca3d20  1        152ecdd0        152ece4f               0 Mapped       READONLY           \Windows\System32\config\BBI

ffffa58fa2102790  6        152ece50        152ecf4f               0 Mapped       READONLY           \Users\user\NTUSER.DAT

ffffa58fa4145640  5        152ecf50        152ed14f               0 Mapped       READONLY           \Windows\System32\config\DRIVERS

ffffa58fa4145460  6        152ed150        152ed34f               0 Mapped       READONLY           \Windows\System32\config\DRIVERS

ffffa58fa412a520  4        152ed350        152ed44f               0 Mapped       READONLY           \Windows\System32\config\DRIVERS

ffffa58fa412c5a0  6        152ed450        152ed64f               0 Mapped       READONLY           \Users\user\AppData\Local\Microsoft\Windows\UsrClass.dat

ffffa58fa4e8bf60  5        152ed650        152ed84f               0 Mapped       READONLY           \Windows\appcompat\Programs\Amcache.hve

In the listing above, the "Start" and "End" columns show the starting and ending addresses of each mapping divided by the page size, which is 4 KiB. In practice, this means that the SAM hive is mapped at 0x152e7a20000 – 0x152e7a2ffff, the DEFAULT hive is mapped at 0x152e7a30000 – 0x152e7b2ffff, and so on. We can immediately see that all the hives are located very close to each other, with practically no gaps in between them.
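The arithmetic behind those two examples can be sketched as follows (Python, assuming the standard 4 KiB page size):

```python
PAGE_SIZE = 0x1000  # 4 KiB pages, as reflected in the !vad output

def vad_to_range(start_vpn: int, end_vpn: int) -> tuple[int, int]:
    # "Start" and "End" are virtual page numbers; the mapping covers
    # [Start * 0x1000, (End + 1) * 0x1000 - 1].
    return start_vpn * PAGE_SIZE, (end_vpn + 1) * PAGE_SIZE - 1

# SAM and DEFAULT hive mappings from the listing above:
assert vad_to_range(0x152E7A20, 0x152E7A2F) == (0x152E7A20000, 0x152E7A2FFFF)
assert vad_to_range(0x152E7A30, 0x152E7B2F) == (0x152E7A30000, 0x152E7B2FFFF)
```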

However, this example does not directly demonstrate whether it's possible to place, for instance, the mapping of the SOFTWARE hive directly after the mapping of an app hive loaded by a normal user. The addresses of the system hives appear to be already determined, and there isn't much space between them to inject our own data. Fortunately, hives can grow dynamically, especially when you start writing long values to them. This leads to the creation of new bins and mapping them at new addresses in the Registry process's memory.

For testing purposes, I wrote a simple program that creates consecutive values of 0x3FD8 bytes within a given key. This triggers the allocation of new bins of exactly 0x4000 bytes: 0x3FD8 bytes of data plus 0x20 bytes for the _HBIN structure, 4 bytes for the cell size, and 4 bytes for padding. Next, I ran two instances of it in parallel on an app hive and HKLM\SOFTWARE, filling the former with the letter "A" and the latter with the letter "B". The result of the test was immediately visible in the memory layout:
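The 0x3FD8-byte value size is chosen so that the numbers add up to exactly one 16 KiB bin; the accounting, as a quick sanity check in Python:

```python
HBIN_HEADER = 0x20  # sizeof(_HBIN) at the start of every bin
CELL_HEADER = 4     # signed 32-bit size field preceding each cell
PADDING = 4         # trailing padding bytes

VALUE_DATA = 0x3FD8  # size of each value written by the test program
assert VALUE_DATA + HBIN_HEADER + CELL_HEADER + PADDING == 0x4000
```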

0: kd> !vad

VAD             Level         Start             End              Commit

ffffa58fa67b44c0  8        15280000        152801ff               0 Mapped       READONLY           \Windows\System32\config\SOFTWARE

ffffa58fa67b5b40  7        15280200        152803ff               0 Mapped       READONLY           \Users\user\Desktop\test.dat

ffffa58fa67b46a0  8        15280400        152805ff               0 Mapped       READONLY           \Windows\System32\config\SOFTWARE

ffffa58fa67b6540  6        15280600        152807ff               0 Mapped       READONLY           \Users\user\Desktop\test.dat

ffffa58fa67b5dc0  8        15280800        152809ff               0 Mapped       READONLY           \Windows\System32\config\SOFTWARE

ffffa58fa67b4560  7        15280a00        15280bff               0 Mapped       READONLY           \Users\user\Desktop\test.dat

ffffa58fa67b6900  8        15280c00        15280dff               0 Mapped       READONLY           \Windows\System32\config\SOFTWARE

ffffa58fa67b5280  5        15280e00        15280fff               0 Mapped       READONLY           \Users\user\Desktop\test.dat

ffffa58fa67b5e60  8        15281000        152811ff               0 Mapped       READONLY           \Windows\System32\config\SOFTWARE

ffffa58fa67b7800  7        15281200        152813ff               0 Mapped       READONLY           \Users\user\Desktop\test.dat

ffffa58fa67b8de0  8        15281400        152815ff               0 Mapped       READONLY           \Windows\System32\config\SOFTWARE

ffffa58fa67b8840  6        15281600        152817ff               0 Mapped       READONLY           \Users\user\Desktop\test.dat

ffffa58fa67b8980  8        15281800        152819ff               0 Mapped       READONLY           \Windows\System32\config\SOFTWARE

[...]

What we have here are interleaved mappings of trusted and untrusted hives, each 2 MiB in length and tightly packed with 128 bins of 16 KiB each. Importantly, there are no gaps between the end of one mapping and the start of another, which means that it is indeed possible to use memory corruption within one hive to influence the internal representation of another. Take, for example, the boundary between the test.dat and SOFTWARE hives at address 0x15280400000. If we dump the memory area encompassing a few dozen bytes before and after this page boundary, we get the following result:

0: kd> db 0x15280400000-30

00000152`803fffd0  41 41 41 41 41 41 41 41-41 41 41 41 41 41 41 41  AAAAAAAAAAAAAAAA

00000152`803fffe0  41 41 41 41 41 41 41 41-41 41 41 41 41 41 41 41  AAAAAAAAAAAAAAAA

00000152`803ffff0  41 41 41 41 41 41 41 41-41 41 41 41 00 00 00 00  AAAAAAAAAAAA....

00000152`80400000  68 62 69 6e 00 f0 bf 0c-00 40 00 00 00 00 00 00  hbin.....@......

00000152`80400010  00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00  ................

00000152`80400020  20 c0 ff ff 42 42 42 42-42 42 42 42 42 42 42 42   ...BBBBBBBBBBBB

00000152`80400030  42 42 42 42 42 42 42 42-42 42 42 42 42 42 42 42  BBBBBBBBBBBBBBBB

00000152`80400040  42 42 42 42 42 42 42 42-42 42 42 42 42 42 42 42  BBBBBBBBBBBBBBBB

We can clearly see that the bytes belonging to both hives in question exist within a single, continuous memory area. This, in turn, means that memory corruption could indeed spread from one hive into the other. However, to successfully achieve this result, one would also need to ensure that the specific fragment of the target hive is marked as dirty. Otherwise, this memory page would be marked as PAGE_READONLY, which would lead to a system crash when attempting to write data, despite both regions being directly adjacent to each other.
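The "hbin" signature visible at the boundary can be decoded directly from the dump; a small Python sketch (field names per the public regf documentation) confirms the bin's file offset and size:

```python
import struct

# First 12 bytes of the bin mapped at 0x15280400000, from the dump above.
raw = bytes.fromhex("6862696e00f0bf0c00400000")

# _HBIN begins with: Signature ("hbin"), FileOffset, Size (32-bit LE).
sig, file_offset, size = struct.unpack("<4sII", raw)
assert sig == b"hbin"
assert file_offset == 0x0CBFF000  # this bin's offset within the SOFTWARE hive
assert size == 0x4000             # one 16 KiB bin, matching the test program
```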

After successfully corrupting data in a global, system hive, the remainder of the attack would likely involve either modifying a security descriptor to grant oneself write permissions to specific keys, or directly changing configuration data to enable the execution of one's own code with administrator privileges.

Attacking adjacent memory in pool-based hive mappings

Although hive file views are typically mapped in the user-mode space of the Registry process (which contains nothing else but these mappings), there are a few circumstances where this data is stored directly in kernel-mode pools. These cases are as follows:

  1. All volatile hives, which have no persistent representation as regf files on disk. Examples include the virtual hive rooted at \Registry, as well as the HKLM\HARDWARE hive.
  2. The entire HKLM\SYSTEM hive, including both its stable and volatile parts.
  3. All hives that have been recently created by calling one of the NtLoadKey* syscalls on a previously non-existent file, including newly created app hives.
  4. Volatile storage space of every active hive in the system.

The first point is not useful to a potential attacker because these types of hives do not grant unprivileged users write permissions. The second and third points are also quite limited, as they could only be exploited through memory corruption that doesn't require binary control over the input hive. However, the fourth point makes it possible to exploit vulnerabilities in any hive in the system, including app hives. This is because creating volatile keys does not require any special permissions compared to regular keys. Additionally, if we have a memory corruption primitive within one storage type, we can easily influence data within the other. For example, in the case of stable storage memory corruption, it is enough to craft a value for which the cell index _CM_KEY_VALUE.Data has the highest bit set, and thus points to the volatile space. From this point, we can arbitrarily modify regf structures located in that space, and directly read/write out-of-bounds pool memory by setting a sufficiently long value size (exceeding the bounds of the given bin). Such a situation is shown in the diagram below:

A diagram illustrating memory corruption, divided into two sections. The top section, labeled "Kernel-mode paged pool," shows a memory bar containing "test.dat (volatile)" followed by several "Pool chunk" blocks and a dotted "Pool chunks..." block. The pool chunks are shown being overflowed. The bottom section, labeled "Registry process address space," shows a memory bar with a small corrupted area at the beginning, followed by "test.dat (stable)" and a dotted "... Other hives ..." block. An arrow from a "Corrupted value node" label points to the corrupted area. A red arrow labeled "Volatile cell index" connects the "Pool corruption" in the kernel-mode paged pool to the "Corrupted value node" in the registry process address space, indicating a relationship between the two corrupted areas.

This behavior can be further verified with a specific example. Let's consider the HKCU hive for a user logged into a Windows 11 system – it will typically have some data stored in the volatile storage due to the existence of the "HKCU\Volatile Environment" key. Let's first find the hive in WinDbg using the !reg hivelist command:

0: kd> !reg hivelist

---------------------------------------------------------------------------------------------------------------------------------------------

|     HiveAddr     |Stable Length|    Stable Map    |Volatile Length|    Volatile Map    |     BaseBlock     | FileName 

---------------------------------------------------------------------------------------------------------------------------------------------

[...]

| ffff82828fc1a000 |      ee000  | ffff82828fc1a128 |       5000    |  ffff82828fc1a3a0  | ffff82828f8cf000  | \??\C:\Users\user\ntuser.dat

[...]

As can be seen, the hive has a volatile space of 0x5000 bytes (5 memory pages). Let's try to find the second page of this hive region in memory by translating its corresponding cell index:

0: kd> !reg cellindex ffff82828fc1a000 80001000

Map = ffff82828fc1a3a0 Type = 1 Table = 0 Block = 1 Offset = 0

MapTable     = ffff82828fe6a000 

MapEntry     = ffff82828fe6a018 

BinAddress = ffff82828f096009, BlockOffset = 0000000000000000

BlockAddress = ffff82828f096000 

pcell:  ffff82828f096004

It is a kernel-mode address, as expected. We can dump its contents to verify that it indeed contains registry data:

0: kd> db ffff82828f096000

ffff8282`8f096000  68 62 69 6e 00 10 00 00-00 10 00 00 00 00 00 00  hbin............

ffff8282`8f096010  00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00  ................

ffff8282`8f096020  38 ff ff ff 73 6b 00 00-20 10 00 80 20 10 00 80  8...sk.. ... ...

ffff8282`8f096030  01 00 00 00 b0 00 00 00-01 00 04 88 98 00 00 00  ................

ffff8282`8f096040  a4 00 00 00 00 00 00 00-14 00 00 00 02 00 84 00  ................

ffff8282`8f096050  05 00 00 00 00 03 24 00-3f 00 0f 00 01 05 00 00  ......$.?.......

ffff8282`8f096060  00 00 00 05 15 00 00 00-dc be 84 0b 6c 21 35 39  ............l!59

ffff8282`8f096070  b9 d0 84 88 ea 03 00 00-00 03 14 00 3f 00 0f 00  ............?...

Everything looks good. At the start of the page, there is a bin header, and at offset 0x20, we see the first cell corresponding to a security descriptor ('sk'). Now, let's see what the !pool command tells us about this address:

0: kd> !pool ffff82828f096000

Pool page ffff82828f096000 region is Paged pool

*ffff82828f096000 : large page allocation, tag is CM16, size is 0x1000 bytes

                Pooltag CM16 : Internal Configuration manager allocations, Binary : nt!cm

We are dealing with a paged pool allocation of 0x1000 bytes requested by the Configuration Manager. And what is located right behind it?

0: kd> !pool ffff82828f096000+1000

Pool page ffff82828f097000 region is Paged pool

*ffff82828f097000 : large page allocation, tag is Obtb, size is 0x1000 bytes

                Pooltag Obtb : object tables via EX handle.c, Binary : nt!ob

0: kd> !pool ffff82828f096000+2000

Pool page ffff82828f098000 region is Paged pool

*ffff82828f098000 : large page allocation, tag is Gpbm, size is 0x1000 bytes

                Pooltag Gpbm : GDITAG_POOL_BITMAP_BITS, Binary : win32k.sys

The next two memory pages correspond to other, completely unrelated pool allocations: one associated with the NT Object Manager, and the other with the win32k.sys graphics driver. This clearly demonstrates that in kernel space, areas containing volatile hive data are mixed with various other allocations used by other parts of the system. Moreover, this technique is attractive because it enables not only out-of-bounds writes of controlled data, but also out-of-bounds reads of the same memory beforehand. Thanks to this, the exploit does not have to operate "blindly": it can precisely verify whether the memory is arranged exactly as expected before proceeding with the next stage of the attack. With these kinds of capabilities, writing the rest of the exploit should be a matter of properly grooming the pool layout and finding some good candidate objects for corruption.

The ultimate primitive: out-of-bounds cell indexes

The situation is clearly not as hopeless as it might have seemed earlier, and there are quite a few ways to convert memory corruption in one's own hive space into taking control of other types of memory. All of them, however, have one minor flaw: they rely on prearranging a specific layout of objects in memory (e.g., hive mappings in the Registry process, or allocations on the paged pool), which means they cannot be said to be 100% stable or deterministic. The randomness of the memory layout carries the inherent risk that either the exploit simply won't work, or worse, it will crash the operating system in the process. For lack of better alternatives, these techniques would be sufficient, especially for demonstration purposes. However, I found a better method that guarantees 100% effectiveness by completely eliminating the element of randomness. I have hinted at or even directly mentioned this many times in previous blog posts in this series, and I am, of course, referring to out-of-bounds cell indexes.

As a quick reminder, cell indexes are the hive's equivalent of pointers: they are 32-bit values that allow allocated cells to reference each other. The translation of cell indexes into their corresponding virtual addresses is achieved using a special 3-level structure called a cell map, which resembles a CPU page table:

A diagram of a cell map

The C-like pseudocode of the internal HvpGetCellPaged function responsible for performing the cell map walk is presented below:

_CELL_DATA *HvpGetCellPaged(_HHIVE *Hive, HCELL_INDEX Index) {

  _HMAP_ENTRY *Entry = &Hive->Storage[Index >> 31].Map

                            ->Directory[(Index >> 21) & 0x3FF]

                            ->Table[(Index >> 12) & 0x1FF];

  return (Entry->PermanentBinAddress & (~0xF)) + Entry->BlockOffset + (Index & 0xFFF) + 4;

}
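The bit-level decomposition performed by HvpGetCellPaged can be modeled in a few lines of Python, which is handy for sanity-checking hand-crafted indexes (the helper is mine, but the masks and shifts come straight from the pseudocode above):

```python
def decode_cell_index(index: int):
    """Split a 32-bit cell index into the four components consumed by
    the cell map walk (mirrors the bit operations in HvpGetCellPaged)."""
    storage = index >> 31              # 0 = stable, 1 = volatile
    directory = (index >> 21) & 0x3FF  # 10-bit directory index
    table = (index >> 12) & 0x1FF      # 9-bit table index
    offset = index & 0xFFF             # 12-bit offset within the block
    return storage, directory, table, offset

# The volatile index 0x80001000 resolved earlier with !reg cellindex:
# the "Type = 1 Table = 0 Block = 1 Offset = 0" line in the WinDbg
# output corresponds exactly to this decomposition.
assert decode_cell_index(0x80001000) == (1, 0, 1, 0)
```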

The structures corresponding to the individual levels of the cell map are _DUAL, _HMAP_DIRECTORY, _HMAP_TABLE and _HMAP_ENTRY, and they are accessible through the _CMHIVE.Hive.Storage field. From an exploitation perspective, two facts are crucial here. First, the HvpGetCellPaged function does not perform any bounds checks on the input index. Second, for hives smaller than 2 MiB, Windows applies an additional optimization called "small dir". In that case, instead of allocating the entire Directory array of 1024 elements and only using one of them, the kernel sets the _CMHIVE.Hive.Storage[...].Map pointer to the address of the _CMHIVE.Hive.Storage[...].SmallDir field, which simulates a single-element array. In this way, the number of logical cell map levels remains the same, but the system uses one less pool allocation to store them, saving about 8 KiB of memory per hive. This behavior is shown in the screenshot below:

Screenshot

What we have here is a hive that has a stable storage area of 0xEE000 bytes (952 KiB) and a volatile storage area of 0x5000 bytes (20 KiB). Both of these sizes are smaller than 2 MiB, and consequently, the "small dir" optimization is applied in both cases. As a result, the Map pointers (marked in orange) point directly to the SmallDir fields (marked in green).
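The 2 MiB threshold follows directly from the cell index layout: one map table covers 512 blocks of 0x1000 bytes each, so any smaller storage space fits within a single table. A quick model of this condition (my own helper, following the description above):

```python
# One _HMAP_TABLE covers 512 entries (9-bit table index), each
# mapping a 0x1000-byte block (12-bit offset) = 2 MiB total.
BLOCKS_PER_TABLE = 512
BLOCK_SIZE = 0x1000
SMALL_DIR_LIMIT = BLOCKS_PER_TABLE * BLOCK_SIZE  # 0x200000

def uses_small_dir(storage_length: int) -> bool:
    """True if a storage space is small enough for the single-entry
    SmallDir optimization, per the behavior described above."""
    return storage_length < SMALL_DIR_LIMIT

# Both spaces of the example HKCU hive qualify:
assert uses_small_dir(0xEE000)  # stable, 952 KiB
assert uses_small_dir(0x5000)   # volatile, 20 KiB
```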

This situation is interesting because if the kernel attempts to resolve an invalid cell index with a value of 0x200000 or greater (i.e., with the "Directory index" part being non-zero) in the context of such a hive, then the first step of the cell map walk will reference the out-of-bounds Guard, FreeDisplay, etc. fields as pointers. This situation is illustrated in the diagram below:

Diagram described above

In other words, by fully controlling the 32-bit value of the cell index, we can make the translation logic jump through two pointers fetched from out-of-bounds memory, and then add a controlled 12-bit offset to the result. An additional consideration is that in the first step, we reference OOB indexes of an "array" located inside the larger _CMHIVE structure, which always has the same layout on a given Windows build. Therefore, by choosing a directory index that references a specific pointer in _CMHIVE, we can be sure that it will always work the same way on a given version of the system, regardless of any random factors.

On the other hand, a small inconvenience is that the _HMAP_ENTRY structure (i.e., the last level of the cell map) has the following layout:

0: kd> dt _HMAP_ENTRY

nt!_HMAP_ENTRY

   +0x000 BlockOffset      : Uint8B

   +0x008 PermanentBinAddress : Uint8B

   +0x010 MemAlloc         : Uint4B

And the final returned value is the sum of the BlockOffset and PermanentBinAddress fields. Therefore, if one of these fields contains the address we want to reference, the other must be NULL, which may slightly narrow down our options.

If we were to create a graphical representation of the relationships between structures based on the pointers they contain, starting from _CMHIVE, it would look something like the following:

A diagram illustrating the relationships between various system components, with "CMHIVE" as the central element in a rectangular box. Several components interact directly with "CMHIVE": A box labeled "CM_KEY_SECURITY_CACHE_ENTRY" has an arrow pointing to "CMHIVE". A box labeled "CMP_VOLUME_CONTEXT" has a two-way arrow connecting it to "CMHIVE". A box labeled "CM_KEY_CONTROL_BLOCK" has a two-way arrow connecting it to "CMHIVE". A box labeled "CM_RM" has a two-way arrow connecting it to "CMHIVE". Other components are connected as follows: A box labeled "CM_KEY_SECURITY_CACHE" points to "CM_KEY_SECURITY_CACHE_ENTRY". A box labeled "FILE_OBJECT" points to "CMP_VOLUME_CONTEXT". A box labeled "CMP_VOLUME_MANAGER" has a two-way arrow with "CMP_VOLUME_CONTEXT". A box labeled "CM_NAME_CONTROL_BLOCK" has a two-way arrow with "CM_KEY_CONTROL_BLOCK". A box labeled "CM_KCB_LAYER_INFO" has a two-way arrow with "CM_KEY_CONTROL_BLOCK". "CM_KEY_CONTROL_BLOCK" points to boxes labeled "CM_KEY_BODY", "CM_TRANS", and "CM_KCB_UOW". A box labeled "KRESOURCEMANAGER" points to "CM_RM". A box labeled "KTM" points to "CM_RM".

The diagram is not necessarily complete, but it shows an overview of some objects that can be reached from _CMHIVE with a maximum of two pointer dereferences. However, it is important to remember that not every edge in this graph will be traversable in practice. There are two reasons for this: first, the layout of the _HMAP_ENTRY structure (i.e. 0x18-byte alignment and the need for a 0x0 value adjacent to the given pointer), and second, the fact that not every pointer in these objects is always initialized. For example, the _CMHIVE.RootKcb field is only valid for app hives (but not for normal hives), while _CMHIVE.CmRm is only set for standard hives, as app hives never have KTM transaction support enabled. So, the idea provides a good foundation for our exploit, but it does require additional experimentation to get every technical detail right.

Moving on, the !reg cellindex command in WinDbg is perfect for testing out-of-bounds cell indexes, because it uses the exact same cell map walk logic as HvpGetCellPaged, and it doesn't perform any additional bounds checks either. So, let's stick with the HKCU hive we were working with earlier, and try to create a cell index that points back to its _CMHIVE structure. We'll use the _CMHIVE → _CM_RM → _CMHIVE path for this. The first decision we need to make is to choose the storage type for this index: stable (0) or volatile (1). In the case of HKCU, both storage types are non-empty and use the "small dir" optimization, so we can choose either one; let's say volatile. Next, we need to calculate the directory index, which will be equal to the difference between the offsets of the _CMHIVE.CmRm and _CMHIVE.Hive.Storage[1].SmallDir fields:

0: kd> dx (&((nt!_CMHIVE*)0xffff82828fc1a000)->Hive.Storage[1].SmallDir)

(&((nt!_CMHIVE*)0xffff82828fc1a000)->Hive.Storage[1].SmallDir) : 0xffff82828fc1a3a0 [Type: _HMAP_TABLE * *]

    0xffff82828fe6a000 [Type: _HMAP_TABLE *]

0: kd> dx (&((nt!_CMHIVE*)0xffff82828fc1a000)->CmRm)

(&((nt!_CMHIVE*)0xffff82828fc1a000)->CmRm)                     : 0xffff82828fc1b038 [Type: _CM_RM * *]

    0xffff82828fdcc8e0 [Type: _CM_RM *]

In this case, it is (0xffff82828fc1b038 - 0xffff82828fc1a3a0) ÷ 8 = 0x193. The next step is to calculate the table index, which will be the offset of the _CM_RM.CmHive field from the beginning of the structure, divided by the size of _HMAP_ENTRY (0x18).

0: kd> dx (&((nt!_CM_RM*)0xffff82828fdcc8e0)->CmHive)

(&((nt!_CM_RM*)0xffff82828fdcc8e0)->CmHive)                 : 0xffff82828fdcc930 [Type: _CMHIVE * *]

    0xffff82828fc1a000 [Type: _CMHIVE *]

So, the calculation is (0xffff82828fdcc930 - 0xffff82828fdcc8e0) ÷ 0x18 = 3. Next, we can verify where the CmHive pointer falls within the _HMAP_ENTRY structure.

0: kd> dt _HMAP_ENTRY 0xffff82828fdcc8e0+3*0x18

nt!_HMAP_ENTRY

   +0x000 BlockOffset      : 0

   +0x008 PermanentBinAddress : 0xffff8282`8fc1a000

   +0x010 MemAlloc         : 0

The _CM_RM.CmHive pointer aligns with the PermanentBinAddress field, which is good news. Additionally, the BlockOffset field is zero, which is also desirable. Internally, it corresponds to the ContainerSize field, which is zeroed out if no KTM transactions have been performed on the hive during this session – this will suffice for our example.
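Putting the two calculations together (the addresses are the ones from this specific WinDbg session and will differ on every boot; only the relative offsets matter):

```python
# Addresses from the WinDbg session above.
small_dir = 0xffff82828fc1a3a0      # &_CMHIVE.Hive.Storage[1].SmallDir
cm_rm_field = 0xffff82828fc1b038    # &_CMHIVE.CmRm
cm_rm = 0xffff82828fdcc8e0          # the _CM_RM object itself
cm_hive_field = 0xffff82828fdcc930  # &_CM_RM.CmHive

# Directory index: how many pointer-sized slots past SmallDir
# the CmRm pointer lives.
directory = (cm_rm_field - small_dir) // 8
assert directory == 0x193

# Table index: which fake _HMAP_ENTRY (0x18 bytes each) the CmHive
# pointer falls into, and at what offset inside that entry.
table, entry_offset = divmod(cm_hive_field - cm_rm, 0x18)
assert table == 3
assert entry_offset == 8  # +0x008, i.e. the PermanentBinAddress field
```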

We have now calculated three of the four cell index elements, and the last one is the offset, which we will set to zero, as we want to access the _CMHIVE structure from the very beginning. It is time to gather all this information in one place; we can build the final cell index using a simple Python function:

>>> def MakeCellIndex(storage, directory, table, offset):

...     print("0x%x" % ((storage << 31) | (directory << 21) | (table << 12) | offset))

...

And then pass the values we have established so far:

>>> MakeCellIndex(1, 0x193, 3, 0)

0xb2603000

>>>

So the final out-of-bounds cell index pointing to the _CMHIVE structure of a given hive is 0xB2603000. It is now time to verify in WinDbg whether this magic index actually works as intended.

0: kd> !reg cellindex ffff82828fc1a000 b2603000

Map = ffff82828fc1a3a0 Type = 1 Table = 193 Block = 3 Offset = 0

MapTable     = ffff82828fdcc8e0 

MapEntry     = ffff82828fdcc928 

BinAddress = ffff82828fc1a000, BlockOffset = 0000000000000000

BlockAddress = ffff82828fc1a000 

pcell:  ffff82828fc1a004

Indeed, the _CMHIVE address passed as the input of the command was also printed in its output, which means that our technique works (the extra 0x4 in the output address is there to account for the cell size). If we were to insert this index into the _CM_KEY_VALUE.Data field, we would gain the ability to read from and write to the _CMHIVE structure in kernel memory through the registry value. This represents a very powerful capability in the hands of a local attacker.
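To see why the translated pcell address lands exactly where it does, we can replay the final step of the cell map walk in Python, using the values from the session above (resolve_map_entry is my model of the HvpGetCellPaged arithmetic shown earlier):

```python
def resolve_map_entry(permanent_bin_address: int, block_offset: int,
                      cell_index: int) -> int:
    """Final address computation of the cell map walk: the low flag
    bits of the bin address are masked off, and the trailing +4
    accounts for the cell size header."""
    return (permanent_bin_address & ~0xF) + block_offset \
        + (cell_index & 0xFFF) + 4

# The fake _HMAP_ENTRY overlapping _CM_RM: PermanentBinAddress
# aliases the CmHive pointer, BlockOffset aliases a zeroed field.
pcell = resolve_map_entry(0xffff82828fc1a000, 0, 0xB2603000)
assert pcell == 0xffff82828fc1a004  # matches the !reg cellindex output
```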

Writing the exploit

At this stage, we already have a solid plan for how to leverage the initial primitive of hive memory corruption for further privilege escalation. It's time to choose a specific vulnerability and begin writing an actual exploit for it. This process is described in detail below.

Step 0: Choosing the vulnerability

Faced with approximately 17 vulnerabilities related to hive memory corruption, the immediate challenge is selecting one for a demonstration exploit. While any of these bugs could eventually be exploited with time and experimentation, they vary in difficulty. There is also an aesthetic consideration: for demonstration purposes, it would be ideal if the exploit's actions were visible within Regedit, which narrows our options. Nevertheless, with a significant selection still available, we should be able to identify a suitable candidate. Let's briefly examine two distinct possibilities.

CVE-2022-34707

The first vulnerability that always comes to my mind in the context of the registry is CVE-2022-34707. This is partly because it was the first bug I manually discovered as part of this research, but mainly because it is incredibly convenient to exploit. The essence of this bug is that it was possible to load a hive with a security descriptor containing a refcount very close to the maximum 32-bit value (e.g., 0xFFFFFFFF), and then overflow it by creating a few more keys that used it. This resulted in a very powerful UAF primitive, as the incorrectly freed cell could be subsequently filled with new objects and then freed again any number of times. In this way, it was possible to achieve type confusion of several different types of objects, e.g., by reusing the same cell subsequently as a security descriptor → value node → value data backing cell, we could easily gain control over the _CM_KEY_VALUE structure, allowing us to continue the attack using out-of-bounds cell indexes.

Due to its characteristics, this bug was also the first vulnerability in this research for which I wrote a full-fledged exploit. Many of the techniques I describe here were discovered while working on this bug. Furthermore, the screenshot showing the privilege escalation at the end of blog post #1 illustrates the successful exploitation of CVE-2022-34707. However, in the context of this blog post, it has one fundamental flaw: to set the initial refcount to a value close to overflowing the 32-bit range, it is necessary to manually craft the input regf file. This means that the target can only be an app hive, and thus we wouldn't be able to directly observe the exploitation in the Registry Editor. This would greatly reduce my ability to visually demonstrate the exploit, which is what ultimately led me to look for a better bug.

CVE-2023-23420

This brings us to the second vulnerability, CVE-2023-23420. This is also a UAF condition within the hive, but it concerns a key node cell instead of a security descriptor cell. It was caused by certain issues in the transactional key rename operation. These problems were so deep and affected such fundamental aspects of the registry that this and the related vulnerabilities CVE-2023-23421, CVE-2023-23422 and CVE-2023-23423 were fixed by completely removing support for transacted key rename operations.

In terms of exploitation, this bug is particularly unique because it can be triggered using only API/system calls, making it possible to corrupt any hive the attacker has write access to. This makes it an ideal candidate for writing an exploit whose operation is visible to the naked eye using standard Windows registry utilities, so that's what we'll do. Although the details of massaging the hive layout into the desired state may be slightly more difficult here than with CVE-2022-34707, it's nothing we can't handle. So let's get to work!

Step 1: Abusing the UAF to establish dynamically-controlled value cells

Let's start by clarifying that our attack will target the HKCU hive, and more specifically its volatile storage space. This will hopefully make the exploit a bit more reliable, as the volatile space resets each time the hive is reloaded, and there generally isn't much activity occurring there. The exploitation process begins with a key node use-after-free, and our goal is to take full control over the _CM_KEY_VALUE representation of two registry values by the end of the first stage (why two – we'll get to that in a moment). Once we achieve this goal, we will be able to arbitrarily set the _CM_KEY_VALUE.Data field, and thus gain read/write access to any chosen out-of-bounds cell index. There are many different approaches to how to achieve this, but in my proof-of-concept, I started with the following data layout:

At the top left, a box labeled "Exploit" is designated as a "Key node," with a dotted line extending upwards from its "Key node" label. An arrow from "Exploit" points to a box labeled "TmpKeyName," also designated as a "Key node." From "TmpKeyName," two arrows point downwards to two separate "Key node" boxes: "SubKey1" and "SubKey2." Another arrow extends to the right from "TmpKeyName" to a vertically stacked group of four rectangular elements, collectively referred to as a "Value list" via a label to their left. From this "Value list," four separate arrows point to the right, each connecting to a distinct container box. Each of these container boxes has a "Value node" label above it: The first container is "FakeKeyContainer." The second is "ValueListContainer." The third is "KernelAddrContainer." The fourth is "KernelDataContainer."

At the top of the hierarchy is the HKCU\Exploit key, which is the root of the entire exploit subtree. Its only role is to work as a container for all the other keys and values we create. Below it, we have the "TmpKeyName" key, which is important for two reasons: first, it stores four values that will be used at a later stage to fill freed cells with controlled data (but are currently empty). Second, this is the key on which we will perform the "rename" operation, which is the basis of the CVE-2023-23420 vulnerability. Below it are two more keys, "SubKey1" and "SubKey2", which are also needed in the exploitation process for transactional deletion, each through a different view of their parent.

Once we have this data layout arranged in the hive, we can proceed to trigger the memory corruption. We can do it exactly as described in the original report in section "Operating on subkeys of transactionally renamed keys", and demonstrated in the corresponding InconsistentSubkeyList.cpp source code. In short, it involves the following steps:

  1. Creating a lightweight transaction by calling the NtCreateRegistryTransaction syscall.
  2. Opening two different handles to the HKCU\Exploit\TmpKeyName key within our newly created transaction.
  3. Performing a transactional rename operation on one of these handles, changing the name to "Scratchpad".
  4. Transactionally deleting the "SubKey1" and "SubKey2" keys, each through a different parent handle (one renamed, the other not).
  5. Committing the entire transaction by calling the NtCommitRegistryTransaction syscall.

After successfully executing these operations on a vulnerable system, the layout of our objects within the hive should change accordingly:

At the top left, a box labeled "Exploit" is designated as a "Key node," with a dotted line extending upwards from its "Key node" label. An arrow from "Exploit" points to a box labeled "Scratchpad," also designated as a "Key node." From "Scratchpad," a red arrow points downwards to a dashed-outline box labeled "Free." Another arrow extends to the right from "Scratchpad" to a vertically stacked group of four rectangular elements, collectively referred to as a "Value list" via a label to their left. From this "Value list," four separate arrows point to the right, each connecting to a distinct container box. Each of these container boxes has a "Value node" label above it: The first container is "FakeKeyContainer." The second is "ValueListContainer." The third is "KernelAddrContainer." The fourth is "KernelDataContainer."

We see that the "TmpKeyName" key has been renamed to "Scratchpad", and both its subkeys have been released, but the freed cell of the second subkey still appears on its parent's subkey list. At this point, we want to use the four values of the "Scratchpad" key to create our own fake data structure. According to this fake structure, the freed subkey will still appear to exist and will contain two values named "KernelAddr" and "KernelData". Each of the "Container" values is responsible for imitating one type of object, and the most crucial role is played by the "FakeKeyContainer" value. Its backing buffer must perfectly align with the memory previously associated with the "SubKey1" key node. The diagram below illustrates the desired outcome:

A diagram illustrates a complex data structure and flow, likely related to a system exploit. At the top left, a box labeled "Exploit," designated as a "Key node" with a dotted line extending upwards, points to a box labeled "Scratchpad," also a "Key node." "Scratchpad" points to the right to a vertically stacked group of four rectangular elements, labeled "Value list." This "Value list" has four arrows pointing to four "Value node" container boxes on the far right: "FakeKeyContainer," "ValueListContainer," "KernelAddrContainer," and "KernelDataContainer." An arrow extends downwards and to the right from "Scratchpad" to a box labeled "FakeKey," which is also designated as "Data cell / fake key node." From "FakeKey," an arrow points right to a stack of two horizontal elements labeled "Data cell / fake value list," and another thin arrow points upwards and right to "FakeKeyContainer." From the "Data cell / fake value list," its top element has an arrow pointing right to "KernelAddr" (labeled "Data cell / fake value node"), and its bottom element has an arrow pointing downwards and right to "KernelData" (labeled "Data cell / fake value node"). "KernelAddr" has a thin arrow pointing upwards and right to "KernelAddrContainer." "KernelData" has a thin arrow pointing upwards and right to "KernelDataContainer." A wavy line connects the right side of "KernelAddrContainer" to the left side of "KernelDataContainer," and another wavy line extends from the right side of "KernelDataContainer" off to the right.

All the highlighted cells contain attacker-controlled data, which represent valid regf structures describing the HKCU\Exploit\Scratchpad\FakeKey key and its two values. Once this data layout is achieved, it becomes possible to open a handle to the "FakeKey" using standard APIs such as RegOpenKeyEx, and then operate on arbitrary cell indexes through its values. In reality, the process of crafting these objects after triggering the UAF is slightly more complicated than just setting data for four different values and requires the following steps:

  1. Writing to the "FakeKeyContainer" value with an initial, basic representation of the "FakeKey" key. At this stage, it is not important that the key node is entirely correct, but it must be of the appropriate length, and thus precisely cover the freed cell currently pointed to by the subkey list of the "Scratchpad" key.
  2. Setting the data for the other three container values – again, not the final ones yet, but those that have the appropriate length and are filled with unique markers, so that they can be easily recognized later on.
  3. Launching an info-leak loop to find the three cell indexes corresponding to the data cells of the "ValueListContainer", "KernelAddrContainer" and "KernelDataContainer" values, as well as a cell index of a valid security descriptor. This logic relies on abusing the _CM_KEY_NODE.Class and _CM_KEY_NODE.ClassLength fields of the "FakeKey" to point them to the data in the hive that we want to read. Specifically, the ClassLength member is set to 0xFFC, and the Class member is set to indexes 0x80000000, 0x80001000, 0x80002000, ... in subsequent loop iterations. This enables a kind of "arbitrary hive read" primitive, and the reading can be achieved by calling the NtEnumerateKey syscall on the "Scratchpad" key with the KeyNodeInformation class, which returns, among other things, the class property for a given subkey. This way, we get all the information about the internal hive layout needed to construct the final form of each of the imitated cells.
  4. Using the above information to set the correct data for each of the four cells: the key node of the "FakeKey" key with a valid security descriptor and index to the value list, the value list itself, and the value nodes of "KernelAddr" and "KernelData". This makes "FakeKey" a full-fledged key as seen by Windows, but with all of its internal regf structures fully controlled by us.
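The page-by-page scan from step 3 can be sketched as follows – this models only the sequence of probed cell indexes, not the actual NtEnumerateKey calls:

```python
# The info-leak loop reads the hive's volatile space one page at a
# time through the fake key's Class field (with ClassLength = 0xFFC).
VOLATILE_BIT = 0x80000000
PAGE_SIZE = 0x1000
CLASS_LENGTH = 0xFFC

def volatile_scan_indexes(volatile_length: int):
    """Successive Class cell indexes probed by the info-leak loop,
    one per page of the hive's volatile storage."""
    return [VOLATILE_BIT | offset
            for offset in range(0, volatile_length, PAGE_SIZE)]

# For the 0x5000-byte volatile space of the example HKCU hive:
assert volatile_scan_indexes(0x5000) == [
    0x80000000, 0x80001000, 0x80002000, 0x80003000, 0x80004000]
```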

If all of these steps are successful, we should be able to open the HKCU\Exploit\Scratchpad key in Regedit and see the current exploitation progress. An example from my test system is shown in the screenshot below. The extra "Filler" value is used to fill the space occupied by the old "TmpKeyName" key node freed during the rename operation. This is necessary so that the data of the "FakeKeyContainer" value correctly aligns with the freed cell of the "SubKey1" key, but I skipped this minor implementation detail in the above high-level description of the logic for the sake of clarity.

Example successful exploit

Step 2: Getting read/write access to the CMHIVE kernel object

Since we now have full control over some registry values, the next logical step would be to initialize them with a specially crafted OOB cell index and then check if we can actually access the kernel structure it represents. Let's say that we set the type of the "KernelData" value to REG_BINARY, its length to 0x100, and the data cell index to the previously calculated value of 0xB2603000, which should point back at the hive's _CMHIVE structure on the kernel pool. If we do this, and then browse to the "FakeKey" key in the Registry Editor, we will encounter an unpleasant surprise:

Bluescreen!

This is definitely not the result we expected, and something must have gone wrong. If we investigate the system crash in WinDbg, we will get the following information:

Break instruction exception - code 80000003 (first chance)

A fatal system error has occurred.

Debugger entered on first try; Bugcheck callbacks have not been invoked.

A fatal system error has occurred.

nt!DbgBreakPointWithStatus:

fffff800`8061ff20 cc              int     3

0: kd> !analyze -v

*******************************************************************************

*                                                                             *

*                        Bugcheck Analysis                                    *

*                                                                             *

*******************************************************************************

REGISTRY_ERROR (51)

Something has gone badly wrong with the registry.  If a kernel debugger

is available, get a stack trace. It can also indicate that the registry got

an I/O error while trying to read one of its files, so it can be caused by

hardware problems or filesystem corruption.

It may occur due to a failure in a refresh operation, which is used only

in by the security system, and then only when resource limits are encountered.

Arguments:

Arg1: 0000000000000001, (reserved)

Arg2: ffffd4855dc36000, (reserved)

Arg3: 00000000b2603000, depends on where Windows BugChecked, may be pointer to hive

Arg4: 000000000000025d, depends on where Windows BugChecked, may be return code of

        HvCheckHive if the hive is corrupt.

[...]

0: kd> k

 # Child-SP          RetAddr               Call Site

00 ffff828b`b100be68 fffff800`80763642     nt!DbgBreakPointWithStatus

01 ffff828b`b100be70 fffff800`80762e81     nt!KiBugCheckDebugBreak+0x12

02 ffff828b`b100bed0 fffff800`80617957     nt!KeBugCheck2+0xa71

03 ffff828b`b100c640 fffff800`80a874d5     nt!KeBugCheckEx+0x107

04 ffff828b`b100c680 fffff800`8089dfd5     nt!HvpReleaseCellPaged+0x1ec1a5

05 ffff828b`b100c6c0 fffff800`808a29be     nt!CmpQueryKeyValueData+0x1a5

06 ffff828b`b100c770 fffff800`808a264e     nt!CmEnumerateValueKey+0x13e

07 ffff828b`b100c840 fffff800`80629e75     nt!NtEnumerateValueKey+0x31e

08 ffff828b`b100ca70 00007ff8`242c4114     nt!KiSystemServiceCopyEnd+0x25

09 00000008`c747dc38 00000000`00000000     0x00007ff8`242c4114

We are seeing bugcheck code 0x51 (REGISTRY_ERROR), which indicates that it was triggered intentionally rather than through a bad memory access. Additionally, the direct caller of KeBugCheckEx is HvpReleaseCellPaged, a function that we haven't really mentioned so far in this blog post series.

To better understand what is actually happening here, we need to take a step back and look at the general scheme of cell operations as implemented in the Windows kernel. It typically follows a common pattern:

  _HV_GET_CELL_CONTEXT Context;

  //

  // Translate the cell index to virtual address

  //

  PVOID CellAddress = Hive->GetCellRoutine(Hive, CellIndex, &Context);

  //

  // Operate on the cell view using the CellAddress pointer

  //

  ...

  //

  // Release the cell

  //

  Hive->ReleaseCellRoutine(Hive, &Context);

There are three stages here: translating the cell index to a virtual address, performing operations on that cell, and releasing it. We are already familiar with the first two, and they are both obvious, but what is the release about? Based on a historical analysis of various Windows kernel builds, it turns out that in some versions, a get+release function pair was not only used for translating cell indexes to virtual addresses, but also to ensure that the memory view of the cell would not be accidentally unmapped between these two calls.

The presence or absence of the "release" function in consecutive Windows versions is shown below:

  • Windows NT 3.1 – 2000: ❌
  • Windows XP – 7: ✅
  • Windows 8 – 8.1: ❌
  • Windows 10 – 11: ✅

Let's take a look at the decompiled HvpReleaseCellPaged function from Windows 10, 1507 (build 10240), where it first reappeared after a hiatus in Windows 8.x:

VOID HvpReleaseCellPaged(_CMHIVE *CmHive, _HV_GET_CELL_CONTEXT *Context) {

  _HCELL_INDEX RealCell;

  _HMAP_ENTRY *MapEntry;

  RealCell = Context->Cell & 0xFFFFFFFE;

  MapEntry = HvpGetCellMap(&CmHive->Hive, RealCell);

  if (MapEntry == NULL) {

    KeBugCheckEx(REGISTRY_ERROR, 1, (ULONG_PTR)CmHive, RealCell, 0x291);

  }

  if ((Context->Cell & 1) != 0) {

    HvpMapEntryReleaseBinAddress(MapEntry);

  }

  HvpGetCellContextReinitialize(Context);

}

_HMAP_ENTRY *HvpGetCellMap(_HHIVE *Hive, _HCELL_INDEX CellIndex) {

  DWORD StorageType = CellIndex >> 31;

  DWORD StorageIndex = CellIndex & 0x7FFFFFFF;

  if (StorageIndex < Hive->Storage[StorageType].Length) {

    return &Hive->Storage[StorageType].Map

                                     ->Directory[(CellIndex >> 21) & 0x3FF]

                                     ->Table[(CellIndex >> 12) & 0x1FF];

  } else {

    return NULL;

  }

}

VOID HvpMapEntryReleaseBinAddress(_HMAP_ENTRY *MapEntry) {

  ExReleaseRundownProtection(&MapEntry->TemporaryBinRundown);

}

VOID HvpGetCellContextReinitialize(_HV_GET_CELL_CONTEXT *Context) {

  Context->Cell = -1;

  Context->Hive = NULL;

}

As we can see, the main task of HvpReleaseCellPaged and its helper functions was to find the _HMAP_ENTRY structure that corresponded to a given cell index, and then potentially call the ExReleaseRundownProtection API on the _HMAP_ENTRY.TemporaryBinRundown field. This behavior was coordinated with the implementation of HvpGetCellPaged, which called ExAcquireRundownProtection on the same object. An additional side effect was that during the lookup of the _HMAP_ENTRY structure, a bounds check was performed on the cell index, and if it failed, a REGISTRY_ERROR bugcheck was triggered.
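As an aside, the cell index decomposition performed by HvpGetCellMap can be modeled in a few lines of Python. This is just a sketch; the field widths follow the masks in the decompiled listing above. Decomposing the index seen in Arg3 of the bugcheck shows how its individual parts are derived:

```python
def split_cell_index(cell_index):
    # Bit layout used by HvpGetCellMap: bit 31 selects the storage type,
    # bits 21-30 the directory index, bits 12-20 the table index, and
    # bits 0-11 the offset within the block.
    storage_type = cell_index >> 31            # 0 = stable, 1 = volatile
    directory    = (cell_index >> 21) & 0x3FF
    table        = (cell_index >> 12) & 0x1FF
    offset       = cell_index & 0xFFF
    return storage_type, directory, table, offset

# The index from Arg3 of the bugcheck above:
assert split_cell_index(0xB2603000) == (1, 0x193, 0x3, 0x0)
```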

This state of affairs persisted for about two years, until Windows 10 1803 (build 17134). In that version, the code was greatly simplified: the TemporaryBinAddress and TemporaryBinRundown members were removed from _HMAP_ENTRY, and the call to ExReleaseRundownProtection was eliminated from HvpReleaseCellPaged. This effectively meant that there was no longer any reason for this function to retrieve a pointer to the map entry (as it was not used for anything), but for some unclear reason, this logic has remained in the code to this day. In most modern kernel builds, the auxiliary functions have been inlined, and HvpReleaseCellPaged now takes the following form:

VOID HvpReleaseCellPaged(_HHIVE *Hive, _HV_GET_CELL_CONTEXT *Context) {

  _HCELL_INDEX Cell = Context->Cell;

  DWORD StorageIndex = Cell & 0x7FFFFFFF;

  DWORD StorageType = Cell >> 31;

  if (StorageIndex >= Hive->Storage[StorageType].Length ||

      &Hive->Storage[StorageType].Map->Directory[(Cell >> 21) & 0x3FF]->Table[(Cell >> 12) & 0x1FF] == NULL) {

    KeBugCheckEx(REGISTRY_ERROR, 1, (ULONG_PTR)Hive, Cell, 0x267);

  }

  Context->Cell = -1;

  Context->BinContext = 0;

}

The bounds check on the cell index is clearly still present, but it doesn't serve any real purpose. Based on this, we can assume that this is more likely a historical relic rather than a mitigation deliberately added by the developers. Still, it interferes with our carefully crafted exploitation technique. Does this mean that OOB cell indexes are not viable because their use will always result in a forced BSoD, and we have to look for other privilege escalation methods instead?

As it turns out, not necessarily. Indeed, if the bounds check was located in the HvpGetCellPaged function, there wouldn't be much to discuss – a blue screen would always occur right before using any OOB index, completely neutralizing this idea's usefulness. However, as things stand, resolving such an index works without issues, and we can perform a single invalid memory operation before a crash occurs in the release call. In many ways, this sounds like a "pwn" task straight out of a CTF, where the attacker is given a memory corruption primitive that is theoretically exploitable, but somehow artificially limited, and the goal is to figure out how to cleverly bypass this limitation. Let's take another look at the if statement that stands in our way:

if (StorageIndex >= Hive->Storage[StorageType].Length || /* ... */) {

  KeBugCheckEx(REGISTRY_ERROR, 1, (ULONG_PTR)Hive, Cell, 0x267);

}

The index is compared against the value of the long-lived _HHIVE.Storage[StorageType].Length field, which is located at a constant offset from the beginning of the _HHIVE structure. On the Windows 11 system I tested, this offset is 0x118 for stable storage and 0x390 for volatile storage:

0: kd> dx (&((_HHIVE*)0)->Storage[0].Length)

(&((_HHIVE*)0)->Storage[0].Length)                 : 0x118

0: kd> dx (&((_HHIVE*)0)->Storage[1].Length)

(&((_HHIVE*)0)->Storage[1].Length)                 : 0x390

As we established earlier, the special out-of-bounds index 0xB2603000 points to the base address of the _CMHIVE / _HHIVE structure. By adding one of the offsets above, we can obtain an index that points directly to the Length field. Let's test this in practice:

0: kd> dx (&((nt!_CMHIVE*)0xffff810713f82000)->Hive.Storage[1].Length) 

(&((nt!_CMHIVE*)0xffff810713f82000)->Hive.Storage[1].Length)                  : 0xffff810713f82390

0: kd> !reg cellindex 0xffff810713f82000 0xB2603390-4

Map = ffff810713f823a0 Type = 1 Table = 193 Block = 3 Offset = 38c

MapTable     = ffff810713debe90 

MapEntry     = ffff810713debed8 

BinAddress = ffff810713f82000, BlockOffset = 0000000000000000

BlockAddress = ffff810713f82000 

pcell:  ffff810713f82390
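Written out explicitly, the arithmetic behind the index passed to the command above is simple (the -4 compensates for the +4 that HvpGetCellPaged adds to skip the 4-byte cell size field):

```python
HIVE_BASE_INDEX = 0xB2603000  # special OOB index resolving to the _CMHIVE base
LENGTH_OFFSET   = 0x390       # offset of _HHIVE.Storage[1].Length on the tested build

# HvpGetCellPaged returns cell_base + 4 (to skip the cell size field),
# so an index 4 less than the target offset lands exactly on Length:
length_index = HIVE_BASE_INDEX + LENGTH_OFFSET - 4
assert length_index == 0xB260338C
```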

So, indeed, index 0xB260338C points to the field representing the length of the volatile space in the HKCU hive. This is very good news for an attacker, because it means that they are able to neutralize the bounds check in HvpReleaseCellPaged by performing the following steps:

  1. Crafting a controlled registry value with a data index of 0xB260338C.
  2. Setting this value programmatically to a very large number, such as 0xFFFFFFFF, and thus overwriting the _HHIVE.Storage[1].Length field with it.
  3. During the NtSetValueKey syscall in step 2, when HvpReleaseCellPaged is called on index 0xB260338C, the Length member has already been corrupted. As a result, the condition checked by the function is not satisfied, and the KeBugCheckEx call never occurs.
  4. Since the _HHIVE.Storage[1].Length field is located in a global hive object and does not change very often (unless the storage space is expanded or shrunk), all future checks performed in HvpReleaseCellPaged against this hive will no longer pose any risk to the exploit stability.
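The ordering described in steps 2-3 can be illustrated with a toy model (hypothetical names, no real kernel interaction): inside NtSetValueKey, the memmove that copies the new value data runs before the bounds check in HvpReleaseCellPaged, so by the time the check executes, it already reads the corrupted Length:

```python
class HiveModel:
    """Toy model of the volatile storage length and the release-time check."""

    def __init__(self):
        self.volatile_length = 0x5000  # sane initial _HHIVE.Storage[1].Length

    def bounds_check_ok(self, cell_index):
        # Mirrors the check in HvpReleaseCellPaged for a volatile index.
        return (cell_index & 0x7FFFFFFF) < self.volatile_length

    def set_value(self, data_index, data):
        # memmove: the value data is copied into the cell that the index
        # resolves to; with index 0xB260338C, that cell aliases
        # volatile_length itself.
        self.volatile_length = data
        # ...a few instructions later, HvpReleaseCellPaged runs its check:
        if not self.bounds_check_ok(data_index):
            raise RuntimeError("KeBugCheckEx(REGISTRY_ERROR)")

hive = HiveModel()
assert not hive.bounds_check_ok(0xB260338C)  # releasing now would bugcheck
hive.set_value(0xB260338C, 0xFFFFFFFF)       # no bugcheck: check sees new Length
assert hive.bounds_check_ok(0xB260338C)      # future OOB indexes pass the check
```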

To better realize just how close the overwriting of the Length field is to its use in the bounds check, we can have a look at the disassembly of the CmpSetValueKeyExisting function, where this whole logic takes place.

Disassembly output

The technique succeeds by a hair's breadth – the memmove and HvpReleaseCellPaged calls are separated by only a few instructions. Nevertheless, it works: if we first perform a write to the 0xB260338C index (or its equivalent) after gaining binary control over the hive, we will subsequently be able to read from and write to any OOB indexes without restriction.

For completeness, I should mention that after corrupting the Length field, it is worthwhile to set a few additional flags in the _HHIVE.HiveFlags field using the same trick as before. This prevents the kernel from crashing due to the unexpectedly large hive length. Specifically, the flags are (as named in blog post #6):

  • HIVE_COMPLETE_UNLOAD_STARTED (0x40): This prevents a crash during potential hive unloading in the CmpLateUnloadHiveWorker → CmpCompleteUnloadKey → HvHiveCleanup → HvpFreeMap → CmpFree function.
  • HIVE_FILE_READ_ONLY (0x8000): This prevents a crash that could occur in the CmpFlushHive → HvStoreModifiedData → HvpTruncateBins path.

Of course, these are just conclusions drawn from writing a demonstration exploit, so I don't guarantee that the above flags are sufficient to maintain system stability in every configuration. Nevertheless, repeated tests have shown that it works in my environment, and if we subsequently set the data cell index of the controlled value back to 0xB2603000, and the Type/DataLength fields to something like REG_BINARY and 0x100, we should be finally able to see the following result in the Registry Editor:

Result in registry editor

It is easy to verify that this is indeed a "live view" into the _CMHIVE structure in kernel memory:

0: kd> dt _HHIVE ffff810713f82000

nt!_HHIVE

   +0x000 Signature        : 0xbee0bee0

   +0x008 GetCellRoutine   : 0xfffff801`8049b370     _CELL_DATA*  nt!HvpGetCellPaged+0

   +0x010 ReleaseCellRoutine : 0xfffff801`8049b330     void  nt!HvpReleaseCellPaged+0

   +0x018 Allocate         : 0xfffff801`804cae30     void*  nt!CmpAllocate+0

   +0x020 Free             : 0xfffff801`804c9100     void  nt!CmpFree+0

   +0x028 FileWrite        : 0xfffff801`80595e00     long  nt!CmpFileWrite+0

   +0x030 FileRead         : 0xfffff801`805336a0     long  nt!CmpFileRead+0

   +0x038 HiveLoadFailure  : (null)

   +0x040 BaseBlock        : 0xffff8107`13f9a000 _HBASE_BLOCK

[...]

Unfortunately, the hive signature 0xBEE0BEE0 is not visible in the screenshot, because the first four bytes of the cell are treated as its size, and only the subsequent bytes as actual data. For this reason, the entire view of the structure is shifted by 4 bytes. Nevertheless, it is immediately apparent that we have gained direct access to function addresses within the kernel image, as well as many other interesting pointers and data. We are getting very close to our goal!

Step 3: Getting arbitrary read/write access to the entire kernel address space

At this point, we can both read from and write to the _CMHIVE structure through our magic value, and also operate on any other out-of-bounds cell index that resolves to a valid address. This means that we no longer need to worry about kernel ASLR, as _CMHIVE readily leaks the base address of ntoskrnl.exe, as well as many other addresses from kernel pools. The question now is how, with these capabilities, to execute our own payload in kernel-mode or otherwise elevate our process's privileges in the system. What may immediately come to mind based on the layout of the _CMHIVE / _HHIVE structure is the idea of overwriting one of the function pointers located at the beginning. In practice, this is less useful than it seems. As I wrote in blog post #6, the vast majority of operations on these pointers have been devirtualized, and in the few cases where they are still used directly, the Control Flow Guard mitigation is enabled. Perhaps something could be ultimately worked out to bypass CFG, but with the primitives currently available to us, I decided that this sounds more difficult than it should be.

If not that, then what else? Experienced exploit developers would surely find dozens of different ways to complete the privilege escalation process. However, I had a specific goal in mind that I wanted to achieve from the start. I thought it would be elegant to create an arrangement of objects where the final stage of exploitation could be performed interactively from within Regedit. This brings us back to the selection of our two fake values, "KernelAddr" and "KernelData". My goal with these values was to be able to enter any kernel address into KernelAddr, and have KernelData automatically—based solely on how the registry works—contain the data from that address, available for both reading and writing. This would enable a very unique situation where the user could view and modify kernel memory within the graphical interface of a tool available in a default Windows installation—something that doesn't happen very often. 🙂

The crucial observation that allows us to even consider such a setup is the versatility of the cell maps mechanism. In order for such an obscure arrangement to work, KernelData must utilize a _HMAP_ENTRY structure controlled by KernelAddr at the final stage of the cell walk. Referring back to the previous diagram illustrating the relationships between the _CMHIVE structure and other objects, this implies that if KernelAddr reaches an object through two pointer dereferences, KernelData must be configured to reach it with a single dereference, so that the second dereference then occurs through the data stored in KernelAddr.

In practice, this can be achieved as follows: KernelAddr will function similarly as before, pointing to an offset within _CMHIVE using a series of pointer dereferences:

  • _CMHIVE.CmRm → _CM_RM.Hive → _CMHIVE: for normal hives (e.g., HKCU).
  • _CMHIVE.RootKcb → _CM_KEY_CONTROL_BLOCK.KeyHive → _CMHIVE: for app hives.

For KernelData, we can use any self-referencing pointer in the first step of the cell walk. These are plentiful in _CMHIVE, due to the fact that there are many LIST_ENTRY objects initialized as an empty list.

The next step is to select the appropriate offsets and indexes based on the layout of the _CMHIVE structure, so that everything aligns with our plan. Starting with KernelAddr, the highest 20 bits of the cell index remain the same as before, which is 0xB2603???. The lower 12 bits will correspond to an offset within _CMHIVE where we will place our fake _HMAP_ENTRY object. This should be a 0x18 byte area that is generally unused and located after a self-referencing pointer. For demonstration purposes, I used offset 0xB70, which corresponds to the following fields:

_CMHIVE layout                                          _HMAP_ENTRY layout

+0xb70 UnloadEventArray : Ptr64 Ptr64 _KEVENT           +0x000 BlockOffset         : Uint8B
+0xb78 RootKcb          : Ptr64 _CM_KEY_CONTROL_BLOCK   +0x008 PermanentBinAddress : Uint8B
+0xb80 Frozen           : UChar                         +0x010 MemAlloc            : Uint4B

On my test Windows 11 system, all these fields are zeroed out and unused for the HKCU hive, which makes them well-suited for acting as the _HMAP_ENTRY structure. The final cell index for the KernelAddr value will, therefore, be 0xB2603000 + 0xB70 - 0x4 = 0xB2603B6C. If we set its type to REG_QWORD and its length to 8 bytes, then each write to it will result in setting the _CMHIVE.UnloadEventArray field (or _HMAP_ENTRY.BlockOffset in the context of the cell walk) to the specified 64-bit number.

As for KernelData, we will use _CMHIVE.SecurityHash[3].Flink, located at offset 0x798, as the aforementioned self-referencing pointer. To calculate the directory index value, we need to subtract the offset of _CMHIVE.Hive.Storage[1].SmallDir from it and then divide by 8, which gives us: (0x798 - 0x3A0) ÷ 8 = 0x7F. Next, we will calculate the table index by subtracting the offset of the self-referencing pointer from the offset of the fake _HMAP_ENTRY structure and then dividing the result by the size of _HMAP_ENTRY: (0xB70 - 0x798) ÷ 0x18 = 0x29. If we assume that the 12-bit offset part is zero (we don't want to add any offsets at this point), then we have all the elements needed to compose the full cell index. We will use the MakeCellIndex helper function defined earlier for this purpose:

>>> MakeCellIndex(1, 0x7F, 0x29, 0)

0x8fe29000
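The MakeCellIndex helper itself was defined in an earlier part of this series; a reconstruction consistent with the output above (and with the bit layout used during the cell map walk) would be:

```python
def MakeCellIndex(storage_type, directory, table, offset):
    # bit 31 = storage type, bits 21-30 = directory index,
    # bits 12-20 = table index, bits 0-11 = cell offset
    return (storage_type << 31) | (directory << 21) | (table << 12) | offset

# The index components derived in the text:
assert (0x798 - 0x3A0) // 8 == 0x7F     # directory index
assert (0xB70 - 0x798) // 0x18 == 0x29  # table index
assert MakeCellIndex(1, 0x7F, 0x29, 0) == 0x8FE29000
```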

So, the cell index for the KernelData value will be 0x8FE29000, and with that, we have all the puzzle pieces needed to assemble our intricate construction. This is illustrated in the diagram below:

Diagram described below

The cell map walk for the KernelAddr value is shown on the right side of the _CMHIVE structure, and the cell map walk for KernelData is on the left. The dashed arrows marked with numbers ①, ②, and ③ correspond to the consecutive elements of the cell index (i.e., directory index, table index, and offset), while the solid arrows represent dereferences of individual pointers. As you can see, we successfully managed to select indexes where the data of one value directly influences the target virtual address to which the other one is resolved.

We could end this section right here, but there is one more minor issue I'd like to mention. As you may recall, the HvpGetCellPaged function ends with the following statement:

return (Entry->PermanentBinAddress & (~0xF)) + Entry->BlockOffset + (Index & 0xFFF) + 4;

Our current assumption is that the PermanentBinAddress and the lower 12 bits of the index are both zero, and BlockOffset contains the exact value of the address we want to access. Unfortunately, the expression ends with the extra "+4". Normally, this skips the cell size and directly returns a pointer to the cell's data, but in our exploit, it means we would see a view of the kernel memory shifted by four bytes. This isn't a huge issue in practical terms, but it doesn't look perfect in a demonstration.

So, can we do anything about this? It turns out, we can. What we want to achieve is to subtract 4 from the final result using the other controlled addends in the expression (PermanentBinAddress and BlockOffset). Individually, each of them has some limitations:

  • The PermanentBinAddress is a fully controlled 64-bit field, but only its upper 60 bits are used when constructing the cell address. This means we can only use it to subtract multiples of 0x10, but not exactly 4.
  • The cell offset is a 12-bit unsigned number, so we can use it to add any number in the 1–4095 range, but we can't subtract anything.
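To see how these two addends can be combined, here is a small model of the final expression in HvpGetCellPaged (64-bit wraparound simulated by masking); the concrete values below are the ones the text goes on to derive:

```python
MASK64 = (1 << 64) - 1

def cell_address(permanent_bin_address, block_offset, index):
    # The tail of HvpGetCellPaged, as quoted earlier in this post.
    return ((permanent_bin_address & ~0xF) + block_offset +
            (index & 0xFFF) + 4) & MASK64

target = 0xFFFFF80350800000                # arbitrary kernel address in BlockOffset
addr = cell_address(0xFFFFFFFFFFFFFFF0,    # PermanentBinAddress = -0x10
                    target,
                    0x8FE2900C)            # cell index with offset 0xC
assert addr == target                      # -0x10 + 0xC + 4 == 0
```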

However, we can combine the two to achieve the desired goal. If we set PermanentBinAddress to 0xFFFFFFFFFFFFFFF0 (-0x10 in 64-bit representation) and the cell offset to 0xC, their sum will be -4, which cancels out the unconditionally added +4 and causes HvpGetCellPaged to return exactly Entry->BlockOffset. For our exploit, this means one additional write to the _CMHIVE structure to properly initialize the fake PermanentBinAddress field, and a slight change in the cell index of the KernelData value from the previous 0x8FE29000 to 0x8FE2900C. If we perform all these steps correctly, we should be able to read and write arbitrary kernel memory via Regedit. For example, let's dump the data at the beginning of the ntoskrnl.exe kernel image using WinDbg:

0: kd> ? nt

Evaluate expression: -8781857554432 = fffff803`50800000

0: kd> db /c8 fffff803`50800004

fffff803`50800004  03 00 00 00 04 00 00 00  ........

fffff803`5080000c  ff ff 00 00 b8 00 00 00  ........

fffff803`50800014  00 00 00 00 40 00 00 00  ....@...

fffff803`5080001c  00 00 00 00 00 00 00 00  ........

fffff803`50800024  00 00 00 00 00 00 00 00  ........

fffff803`5080002c  00 00 00 00 00 00 00 00  ........

fffff803`50800034  00 00 00 00 00 00 00 00  ........

fffff803`5080003c  10 01 00 00 0e 1f ba 0e  ........

fffff803`50800044  00 b4 09 cd 21 b8 01 4c  ....!..L

fffff803`5080004c  cd 21 54 68 69 73 20 70  .!This p

fffff803`50800054  72 6f 67 72 61 6d 20 63  rogram c

fffff803`5080005c  61 6e 6e 6f 74 20 62 65  annot be

And then let's browse to the same address using our FakeKey in Regedit:

Fake key in registry editor

The data from both sources match, and the KernelData value displays them correctly without any additional offset. A keen observer will note that the expected "MZ" signature is not there, because I entered an address 4 bytes greater than the kernel image base. I did this because, even though we can "peek" at any virtual address X through the special registry value, the kernel still internally accesses address X-4 for certain implementation reasons. Since there isn't any data mapped directly before the ntoskrnl.exe image in memory, using the exact image base would result in a system crash while trying to read from the invalid address 0xFFFFF803507FFFFC.

An even more attentive reader will also notice that the exploit has jokingly changed the window title from "Registry Editor" to "Kernel Memory Editor", as that's what the program has effectively become at this point. 🙂

Step 4: Elevating process security token

With an arbitrary kernel read/write primitive and the address of ntoskrnl.exe at our disposal, escalating privileges is a formality. The simplest approach is perhaps to iterate through the linked list of all processes (made of _EPROCESS structures) starting from nt!KiProcessListHead, find both the "System" process and our own process on the list, and then copy the security token from the former to the latter. This method is illustrated in the diagram below.

Diagram described above
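The list walk itself is straightforward; below is a toy model of it in Python, with kernel memory simulated by a dictionary of qword cells and placeholder structure offsets (the real ActiveProcessLinks/UniqueProcessId/Token offsets vary between builds and would normally be obtained from symbols):

```python
# Hypothetical _EPROCESS field offsets, purely for illustration:
PID_OFF, LINKS_OFF, TOKEN_OFF = 0x440, 0x448, 0x4B8

def read64(mem, addr):
    return mem[addr]

def write64(mem, addr, value):
    mem[addr] = value

def steal_token(mem, list_head, my_pid):
    system = me = None
    link = read64(mem, list_head)           # first Flink
    while link != list_head:                # walk the circular list
        eproc = link - LINKS_OFF            # _EPROCESS base from the link field
        pid = read64(mem, eproc + PID_OFF)
        if pid == 4:                        # the "System" process always has PID 4
            system = eproc
        if pid == my_pid:
            me = eproc
        link = read64(mem, link)            # follow the next Flink
    # Copy the System token over our own process's token:
    write64(mem, me + TOKEN_OFF, read64(mem, system + TOKEN_OFF))

# Simulate two processes on the list: System (PID 4) and ours (PID 1234).
mem = {}
head = 0x1000
p1, p2 = 0x10000, 0x20000  # fake _EPROCESS addresses
mem[head] = p1 + LINKS_OFF
mem[p1 + LINKS_OFF] = p2 + LINKS_OFF
mem[p2 + LINKS_OFF] = head
mem[p1 + PID_OFF], mem[p1 + TOKEN_OFF] = 4, 0xFFFFC608B4C8A943
mem[p2 + PID_OFF], mem[p2 + TOKEN_OFF] = 1234, 0xDEADBEEF

steal_token(mem, head, 1234)
assert mem[p2 + TOKEN_OFF] == mem[p1 + TOKEN_OFF]
```

In the real exploit, read64/write64 would of course be backed by the KernelAddr/KernelData primitive (RegSetValueEx on the former, RegQueryValueEx/RegSetValueEx on the latter) rather than a dictionary.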

This entire procedure could be easily performed programmatically, using only RegQueryValueEx and RegSetValueEx calls. However, it would be a shame not to take advantage of the fact that we can modify kernel memory through built-in Windows tools. Therefore, my exploit performs most of the necessary steps automatically, except for the final stage – overwriting the process security token. For that part, it creates a .reg file on disk that refers to our fake key and its two registry values. The first is KernelAddr, which points to the address of the security token within the _EPROCESS structure of a newly created command prompt, followed by KernelData, which contains the actual value of the System token. The invocation and output of the exploit looks as follows:

C:\Users\user\Desktop\exploits>Exploit.exe C:\users\user\Desktop\become_admin.reg

[+] Found kernel base address: fffff80350800000

[+] Spawning a command prompt...

[+] Found PID 6892 at address ffff8107b3864080

[+] System process: ffff8107ad0ed040, security token: ffffc608b4c8a943

[+] Exploit succeeded, enjoy!

C:\Users\user\Desktop\exploits>

Then, a new command prompt window appears on the screen. There, we can manually perform the final step of the attack, applying changes from the newly created become_admin.reg file using the reg.exe tool, thus overwriting the appropriate field in kernel memory and granting ourselves elevated privileges:

It works!

As we can see, the attack was indeed successful, and our cmd.exe process is now running as NT AUTHORITY\SYSTEM. A similar effect could be achieved from the graphical interface by double-clicking the .reg file and applying it using the Regedit program associated with this extension. This is exactly how I finalized my attack during the exploit demonstration at OffensiveCon 2024, which can be viewed in the recording of the presentation:

Final thoughts

Since we have now fully achieved our intended goal, we can return to our earlier, incomplete diagram, and fill it in with all the intermediate steps we have taken:

A flowchart illustrating a multi-step attack chain leading to privilege escalation. The process begins with “Hive memory corruption”, which leads to “Construction of a controlled registry value”. This enables “Disabling the cell index bounds check”, followed by a “Kernel image base leak”. The leak is then used for “Construction of self-referential values for arbitrary kernel r/w” (read/write), ultimately resulting in “Privilege escalation by stealing the system token”.

To conclude this blog post, I would like to share some final thoughts regarding hive-based memory corruption vulnerabilities.

Exploit mitigations

The above exploit shows that out-of-bounds cell indexes in the registry are a powerful exploitation technique, whose main strength lies in its determinism. Within a specific version of the operating system, a given OOB index will always result in references to the same fields of the _CMHIVE structure, which eliminates the need to use any probabilistic exploitation methods such as kernel pool spraying. Of all the available hive memory corruption exploitation methods, I consider this one to be the most stable and practical.

Therefore, it should come as no surprise that I would like Microsoft to mitigate this technique for the security of all Windows users. I already emphasized this in my previous blog post #7, but now the benefit of this mitigation is even more apparent: since the cell index bounds check is already present in HvpReleaseCellPaged, moving it to HvpGetCellPaged should be completely neutral in terms of system performance, and it would fully prevent the use of OOB indexes for any malicious purposes. I suggested this course of action in November 2023, but it hasn't been implemented by the vendor yet, so all the techniques described here still work at the time of publication.

False File Immutability

So far in this blog, we have mostly focused on a scenario where we can control the internal regf data of an active hive through memory corruption. This is certainly the most likely reason why someone would take control of registry structures, but not necessarily the only one. As I already mentioned in the previous posts, Windows uses section objects and their corresponding section views to map hive files into memory. This means that the mappings are backed by the corresponding files, and if any of them are ever evicted from memory (e.g., due to memory pressure in the system), they will be reloaded from disk the next time they are accessed. Therefore, it is crucial for system security to protect actively loaded hives from being simultaneously written to. This guarantee is achieved in the CmpOpenHiveFile function through the ShareAccess argument passed to ZwCreateFile, which takes a value of 0 or at most FILE_SHARE_READ, but never FILE_SHARE_WRITE. This causes the operating system to ensure that no application can open the file for writing as long as the handle remains open.

As I write these words, the research titled False File Immutability, published by Gabriel Landau in 2024, naturally comes to my mind. He effectively demonstrated that for files opened from remote network shares (e.g., via the SMB protocol), guarantees regarding their immutability may not be upheld in practice, as the local computer simply lacks physical control over them. However, the registry implementation is generally prepared for this eventuality: for hives loaded from locations other than the system partition, the HIVE_FILE_PAGES_MUST_BE_KEPT_LOCAL and VIEW_MAP_MUST_BE_KEPT_LOCAL flags are used, as discussed in blog post #6. These flags instruct the kernel to keep local copies of each memory page for such hives, never allowing them to be completely evicted and, as a result, having to be read again from remote storage. Thus, the attack vector seems to be correctly addressed.

However, during my audit of the registry's memory management implementation last year, I discovered two related vulnerabilities: CVE-2024-43452 and CVE-2024-49114. The second one is particularly noteworthy because, by abusing the Cloud Filter API functionality and its "placeholder files", it was possible to arbitrarily modify active hive files in the system, including those loaded from the C:\ drive. This completely bypassed the sharing access right checks and their associated security guarantees. With this type of issue, the hive corruption exploitation techniques can be used without any actual memory corruption taking place, by simply replacing the memory in question with controlled data. I believe that vulnerabilities of this class can be a real treat for bug hunters, and they are certainly worth remembering for the future.

Conclusion

Dear reader, if you've made it to the end of this blog post, and especially if you've read all the posts in this series, I'd like to sincerely congratulate you on your perseverance. 🙂 Through these write-ups, I hope I've managed to document as many implementation details of the registry as possible; details that might otherwise have never seen the light of day. My goal was to show how interesting and internally complex this mechanism is, and in particular, what an important role it plays in the security of Windows as a whole. Thank you for joining me on this adventure, and see you next time!

The Windows Registry Adventure #7: Attack surface analysis

23 May 2025 at 09:00

Posted by Mateusz Jurczyk, Google Project Zero

In the first three blog posts of this series, I sought to outline what the Windows Registry actually is, its role, history, and where to find further information about it. In the subsequent three posts, my goal was to describe in detail how this mechanism works internally – from the perspective of its clients (e.g., user-mode applications running on Windows), the regf format used to encode hives, and finally the kernel itself, which contains its canonical implementation. I believe all these elements are essential for painting a complete picture of this subsystem, and in a way, it shows my own approach to security research. One could say that going through this tedious process of getting to know the target unnecessarily lengthens the total research time, and to some extent, they would be right. On the other hand, I believe that to conduct complete research, it is equally important to answer the question of how certain things are implemented, as well as why they are implemented that way – and the latter part often requires a deeper dive into the subject. And since I have already spent the time reverse engineering and understanding various internal aspects of the registry, there are great reasons to share the information with the wider community. There is a lack of publicly available materials on how various mechanisms in the registry work, especially the most recent and most complicated ones, so I hope that the knowledge I have documented here will prove useful to others in the future.

In this blog post, we get to the heart of the matter, the actual security of the Windows Registry. I'd like to talk about what made a feature that was initially meant to be just a quick test of my fuzzing infrastructure draw me into manual research for the next 1.5 ~ 2 years, and result in Microsoft fixing (so far) 53 CVEs. I will describe the various areas that are important in the context of low-level security research, from very general ones, such as the characteristics of the codebase that allow security bugs to exist in the first place, to more specific ones, like all possible entry points to attack the registry, the impact of vulnerabilities and the primitives they generate, and some considerations on effective fuzzing and where more bugs might still be lurking.

Let's start with a quick recap of the registry's most fundamental properties as an attack surface:

  • Local attack surface for privilege escalation: As we already know, the Windows Registry is a strictly local attack surface that can potentially be leveraged by a less privileged process to gain the privileges of a higher privileged process or the kernel. It doesn't have any remote components except for the Remote Registry service, which is relatively small and not accessible from the Internet on most Windows installations.
  • Complex, old codebase in a memory-unsafe language: The Windows Registry is a vast and complex mechanism, entirely written in C, most of it many years ago. This means that both logic and memory safety bugs are likely to occur, and many such issues, once found, would likely remain unfixed for years or even decades.
  • Present in the core NT kernel: The registry implementation resides in the core Windows kernel executable (ntoskrnl.exe), which means it is not subject to mitigations like the win32k lockdown. Of course, the reachability of each registry bug needs to be considered separately in the context of specific restrictions (e.g., sandbox), as some of them require file system access or the ability to open a handle to a specific key. Nevertheless, being an integral part of the kernel significantly increases the chances that a given bug can be exploited.
  • Most code reachable by unprivileged users: The registry is a feature that was created for use by ordinary user-mode applications. It is therefore not surprising that the vast majority of registry-related code is reachable without any special privileges, and only a small part of the interface requires administrator rights. Privilege escalation from medium IL (Integrity Level) to the kernel is probably the most likely scenario of how a registry vulnerability could be exploited.
  • Manages sensitive information: In addition to the registry implementation itself being complex and potentially prone to bugs, it's important to remember that the registry inherently stores security-critical system information, including various global configurations, passwords, user permissions, and other sensitive data. This means that not only low-level bugs that directly allow code execution are a concern, but also data-only attacks and logic bugs that permit unauthorized modification or even disclosure of registry keys without proper permissions.
  • Not trivial to fuzz, and not very well documented: Overall, it seems that the registry is not a very friendly target for bug hunting without any knowledge of its internals. At the same time, obtaining the information is not easy either, especially for the latest registry mechanisms, which are not publicly documented and learning about them basically boils down to reverse engineering. In other words, the entry bar into this area is quite high, which can be an advantage or a disadvantage depending on the time and commitment of a potential researcher.

Security properties

The above cursory analysis seems to indicate that the registry may be a good audit target for someone interested in EoP bugs on Windows. Let's now take a closer look at some of the specific low-level reasons why the registry has proven to be a fruitful research objective.

Broad range of bug classes

Due to the registry being both complex and a central mechanism in the system operating with kernel-mode privileges, numerous classes of bugs can occur within it. An example vulnerability classification is presented below:

  • Hive memory corruption: Every invasive operation performed on the registry (i.e., a "write" operation) is reflected in changes made to the memory-mapped view of the hive's structure. Considering that objects within the hive include variable-length arrays, structures with counted references, and references to other cells via cell indexes (hives' equivalent of memory pointers), it's natural to expect common issues like buffer overflows or use-after-frees.
  • Pool memory corruption: In addition to hive memory mappings, the Configuration Manager also stores a significant amount of information on kernel pools. Firstly, there are cached copies of certain hive data, as described in my previous blog post. Secondly, there are various auxiliary objects, such as those allocated and subsequently released within a single system call. Many of these objects can fall victim to memory management bugs typical of the C language.
  • Information disclosure: Because the registry implementation is part of the kernel, and it exchanges large amounts of information with unprivileged user-mode applications, it must be careful not to accidentally disclose uninitialized data from the stack or kernel pools to the caller. This can happen both through output data copied to user-mode memory and through other channels, such as data leakage to a file (hive file or related log file). Therefore, it is worthwhile to keep an eye on whether all arrays and dynamically allocated buffers are fully populated or carefully filled with zeros before passing them to a lower-privileged context.
  • Race conditions: As a multithreaded environment, Windows allows for concurrent registry access by multiple threads. Consequently, the registry implementation must correctly synchronize access to all shared kernel-side objects and be mindful of "double fetch" bugs, which are characteristic of user-mode client interactions.
  • Logic bugs: In addition to being memory-safe and free of low-level bugs, a secure registry implementation must also enforce correct high-level security logic. This means preventing unauthorized users from accessing restricted keys and ensuring that the registry operates consistently with its documentation under all circumstances. This requires a deep understanding of both the explicit documentation and the implicit assumptions that underpin the registry's security from the kernel developers. Ultimately, any behavior that deviates from expected logic, whether documented or assumed, could lead to vulnerabilities.
  • Inter-process attacks: The registry can serve as a security target, but also as a means to exploit flaws in other applications on the system. It is a shared database, and a local attacker has many ways to indirectly interact with more privileged programs and services. A simple example is when privileged code sets overly permissive permissions on its keys, allowing unauthorized reading or modification. More complex cases can occur when there is a race condition between key creation and setting its restricted security descriptor, or when a key modification involving several properties is not performed transactionally, potentially leading to an inconsistent state. The specifics depend on how the privileged process uses the registry interface.

If I were to depict the Windows Registry in a single Venn diagram, highlighting its various possible bug classes, it might look something like this:

A Venn diagram illustrates the intersection of different bug categories within the Windows Registry. Four overlapping circles represent Kernel-specific bugs, File parsing bugs, Object lifetime bugs, and Logic bugs. The central area where all circles overlap is highlighted, indicating vulnerabilities that combine all these bug types.

Manual reference counting

As I have mentioned multiple times, security descriptors in registry hives are shared by multiple keys, and therefore, must be reference counted. The field responsible for this is a 32-bit unsigned integer, and any situation where it's set to a value lower than the actual number of references can result in the release of that security descriptor while it's still in use, leading to a use-after-free condition and hive-based memory corruption. So, we see that it's absolutely critical that this refcounting is implemented correctly, but unfortunately, there are (or were until recently) many reasons why this mechanism could be prone to bugs:

  • Usually, a reference count is a construct that exists strictly in memory, where it is initialized with a value of 1, then incremented and decremented some number of times, and finally drops to zero, causing the object to be freed. However, with registry hives, the initial refcount values are loaded from disk, from a file that we assume is controlled by the attacker. Therefore, these values cannot be trusted in any way, and the first necessary step is to actually compare and potentially adjust them according to the true number of references to each descriptor. Even though this is done in theory, bugs can creep into this logic in practice (CVE-2022-34707, CVE-2023-38139).
  • For a long time, all operations on reference counts were performed by directly referencing the _CM_KEY_SECURITY.ReferenceCount field, instead of using a secure wrapper. As a result, none of these incrementations were protected against integer overflow. This meant that not only a too small, but also a too large refcount value could eventually overflow and lead to a use-after-free situation (CVE-2023-28248, CVE-2024-43641). This weakness was gradually addressed in various places in the registry code between April 2023 and November 2024. Currently, all instances of refcount incrementation appear to be secure and involve calling the special helper function CmpKeySecurityIncrementReferenceCount, which protects against integer overflow. Its counterpart for refcount decrementation is CmpKeySecurityDecrementReferenceCount.
  • It seems that there is a lack of clarity and understanding of how certain special types of keys, such as predefined keys and tombstone keys, behave in relation to security descriptors. In theory, the only type of key that does not have a security descriptor assigned to it is the exit node (i.e., a key with the KEY_HIVE_EXIT flag set, found solely in the virtual hive rooted at \Registry\), while all other keys do have a security descriptor assigned to them, even if it is not used for anything. In practice, however, there have been several vulnerabilities in Windows that resulted either from incorrect security refresh in KCB for special types of keys (CVE-2023-21774), from releasing the security descriptor of a predefined key without considering its reference count (CVE-2023-35356), or from completely forgetting the need for reference counting the descriptors of tombstone keys in the "rename" operation (CVE-2023-35382).
  • When the reference count of a security descriptor reaches zero and is released, this operation is irreversible. There is no guarantee that upon reallocation, the descriptor would have the same cell index, or even that it could be reallocated at all. This is crucial for multi-step operations where individual actions could fail, necessitating a full rollback to the original state. Ideally, releasing security descriptors should always be the final step, only when the kernel can be certain that the entire operation will succeed. A vulnerability exemplifying this is CVE-2023-21772, where the registry virtualization code first released the old security descriptor and then attempted to allocate a new one. If the allocation failed, the key was left without any security properties, violating a fundamental assumption of the registry and potentially having severe consequences for system memory safety.
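
The overflow-safe increment pattern mentioned above can be sketched in a few lines of C. This is an illustrative model only: the function names echo the CmpKeySecurityIncrement/DecrementReferenceCount helpers, but the real kernel implementations are not public and certainly differ in detail:

```c
#include <stdbool.h>
#include <stdint.h>

// Illustrative model of an overflow-checked 32-bit refcount, in the spirit
// of the CmpKeySecurity* helpers described above. Not actual kernel code.
bool SafeIncrementRefCount(uint32_t *RefCount) {
    if (*RefCount == UINT32_MAX) {
        return false;  // Refuse to wrap to 0, which would enable a later UAF.
    }
    (*RefCount)++;
    return true;
}

bool SafeDecrementRefCount(uint32_t *RefCount) {
    if (*RefCount == 0) {
        return false;  // Refuse to wrap around to 0xFFFFFFFF.
    }
    (*RefCount)--;
    return true;
}

// Small demonstration drivers: return the refcount after one attempted
// increment/decrement, saturating instead of wrapping.
uint32_t AfterIncrement(uint32_t Value) {
    SafeIncrementRefCount(&Value);
    return Value;
}

uint32_t AfterDecrement(uint32_t Value) {
    SafeDecrementRefCount(&Value);
    return Value;
}
```

The key property is that a hostile hive can no longer drive the counter past its maximum and back to zero through repeated legitimate operations.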

Aggressive self-healing and recovery

As I described in blog post #5, one of the registry's most interesting features, which distinguishes it from many other file format implementations, is that it is self-healing. The entire hive loading process, from the internal CmCheckRegistry function downwards, is focused on loading the database at all costs, even if some corrupted fragments are encountered. Only if the file damage is so extensive that recovering any data is impossible does the entire loading process fail. Of course, given that the registry stores critical system data such as its basic configuration, and the lack of access to this data virtually prevents Windows from booting, this decision made a lot of sense from the system reliability point of view. It's probably safe to assume that it has prevented the need for system reinstallation on numerous computers, simply because it did not reject hives with minor damage that might have appeared due to random hardware failure.

However, from a security perspective, this behavior is not necessarily advantageous. Firstly, it seems obvious that upon encountering an error in the input data, it is simpler to unconditionally halt its processing rather than attempt to repair it. In the latter case, it is possible for the programmer to overlook an edge case – forget to reset some field in some structure, etc. – and thus instead of fixing the file, allow for another unforeseen, inconsistent state to materialize within it. In other words, the repair logic constitutes an additional attack surface, and one that is potentially even more interesting and error-prone than other parts of the implementation. A classic example of a vulnerability associated with this property is CVE-2023-38139.

Secondly, in my view, the existence of this logic may have negatively impacted the secure development of the registry code, perhaps by leading to a discrepancy between what it guaranteed and what other developers thought it had guaranteed. For example, in 1991–1993, when the foundations of the Configuration Manager subsystem were being created in their current form, probably no one considered hive loading a potential attack vector. At that time, the registry was used only to store system configuration, and controlled hive loading was privileged and required admin rights. Therefore, I suspect that the main goal of hive checking at that time was to detect simple data inconsistencies due to hardware problems, such as single bit flips. No one expected a hive to contain a complex, specially crafted multi-kilobyte data structure designed to trigger a security flaw. Perhaps the rest of the registry code was written under the assumption that since data sanitization and self-healing occurred at load time, its state was safe from that point on and no further error handling was needed (except for out-of-memory errors). Then, in Windows Vista, a decision was made to open access to controlled hive loading by unprivileged users through the app hive mechanism, and it suddenly turned out that the existing safeguards were not entirely adequate. Attackers now became able to devise data constructs that were structurally correct at the low level, but completely beyond the scope of what the actual implementation expected and could handle.

Finally, self-healing can adversely affect system security by concealing potential registry bugs that could trigger during normal Windows operation. These problems might only become apparent after a period of time and with a "build-up" of enough issues within the hive. Because hives are mapped into memory, and the kernel operates directly on the data within the file, there exists a category of errors known as "inconsistent hive state". This refers to a data structure within the hive that doesn't fully conform to the file format specification. The occurrence of such an inconsistency is noteworthy in itself and, for someone knowledgeable about the registry, it could be a direct clue for finding vulnerabilities. However, such instances rarely cause an immediate system crash or other visible side effects. Consider security descriptors and their reference counting: as mentioned earlier, any situation where the active number of references exceeds the reference count indicates a serious security flaw. However, even if this were to happen during normal system operation, it would require all other references to that descriptor to be released and then for some other data to overwrite the freed descriptor. Then, a dangling reference would need to be used to access the descriptor. The occurrence of all these factors in sequence is quite unlikely, and the presence of self-healing further decreases these chances, as the reference count would be restored to its correct value at the next hive load. This characteristic can be likened to wrapping the entire registry code in a try/except block that catches all exceptions and masks them from the user. This is certainly helpful in the context of system reliability, but for security, it means that potential bugs are harder to spot during system run time and, for the same reason, quite difficult to fuzz. This does not mean that they don't exist; their detection just becomes more challenging.

Unclear boundaries between hard and conventional format requirements

This point is related to the previous section. In the regf format, there are certain requirements that are fairly obvious and must always be met for a file to be considered valid. Likewise, there are many elements that are permitted to be formatted arbitrarily, at the discretion of the format user. However, there is a third category, a gray area of requirements that seem reasonable and probably would be good if they were met, but it is not entirely clear whether they are formally required. Another way to describe this category is as the set of states that are never generated by the Windows kernel itself, but are still not obviously incorrect. From a researcher's perspective, it would be worthwhile to know which parts of the format are actually required by the specification and which are only a convention adopted by the Windows code.

We might never find out, as Microsoft hasn't published an official format specification and it seems unlikely that they will in the future. The only option left for us is to rely on the implementation of the CmpCheck* functions (CmpCheckKey, CmpCheckValueList, etc.) as a sort of oracle and assume that everything there is enforced as a hard requirement, while all other states are permissible. If we go down this path, we might be in for a big surprise, as it turns out that there are many logical-sounding requirements that are not enforced in practice. This could allow user-controlled hives to contain constructs that are not obviously problematic, but are inconsistent with the spirit of the registry and its rules. In many cases, they allow encoding data in a less-than-optimal way, leading to unexpected redundancy. Some examples of such constructs are presented below:

  • Values with duplicate names within a single key: Under normal conditions, only one value with a given name can exist in a key, and if there is a subsequent write to the same name, the new data is assigned to the existing value. However, the uniqueness of value names is not required in input hives, and it is possible to load a hive with duplicate values.
  • Duplicate identical security descriptors within a single hive: Similar to the previous point, it is assumed that security descriptors within a hive are unique, and if an existing descriptor is assigned to another key, its reference count is incremented rather than allocating a new object. However, there is no guarantee that a specially crafted hive will not contain multiple duplicates of the same security descriptor, and this is accepted by the loader.
  • Uncompressed key names consisting solely of ASCII characters: Under normal circumstances, if a given key has a name comprising only ASCII characters, it will always be stored in a compressed form, i.e., by writing two bytes of the name in each element of the _CM_KEY_NODE.Name array of type uint16, and setting the KEY_COMP_NAME flag (0x20) in _CM_KEY_NODE.Flags. However, once again, optimal representation of names is not required when loading the hive, and this convention can be ignored without issue.
  • Allocated but unused cells: The Windows registry implementation deallocates objects within a hive when they are no longer needed, making space for new data. However, the loader does not require every cell marked "allocated" to be actively used. Similarly, security descriptors with a reference count of zero are typically deallocated. However, until a November 2024 refactor of the CmpCheckAndFixSecurityCellsRefcount function, it was possible to load a hive with unused security descriptors still present in the linked list. This behavior has since been changed, and unused security descriptors encountered during loading are now automatically freed and removed from the list.
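
The name-compression convention from the third point above can be expressed as a simple predicate. Below is a hedged sketch in C: the KEY_COMP_NAME value (0x20) comes from the description above, the ASCII-only test follows the text rather than the actual kernel logic, and the sample names are invented for demonstration:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Sketch: decides whether a key name (given as UTF-16 code units, one per
// array element) could use the compressed on-disk form, i.e. the convention
// where two name bytes are packed into each uint16 of _CM_KEY_NODE.Name and
// KEY_COMP_NAME is set in _CM_KEY_NODE.Flags. The real kernel predicate may
// differ in edge cases; this only mirrors the "ASCII-only" rule as described.
#define KEY_COMP_NAME 0x0020

bool CanCompressName(const uint16_t *Name, size_t Count) {
    for (size_t i = 0; i < Count; i++) {
        if (Name[i] > 0x7F) {
            return false;  // Non-ASCII code unit: keep the uncompressed form.
        }
    }
    return true;
}

// Invented sample names for demonstration.
static const uint16_t kAsciiName[]   = { 'S', 'o', 'f', 't' };
static const uint16_t kUnicodeName[] = { 0x017C, 'x' };  // Starts with U+017C.
```

Since the loader never enforces this convention, a crafted hive may freely store `kAsciiName` in uncompressed form, which is exactly the kind of "legal but never self-generated" state discussed here.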

These examples illustrate the issue well, but none of them (as far as I know) have particularly significant security implications. However, there were also a few specific memory corruption vulnerabilities that stemmed from the fact that the registry code made theoretically sound assumptions about the hive structure that were not enforced by the loader:

  • CVE-2022-37988: This bug is closely related to the fact that cells larger than 16 KiB are aligned to the nearest power of two in Windows, but this condition doesn't need to be satisfied during loading. This caused the shrinking of a cell to fail, even though it should always succeed in-place, "surprising" the client of the allocator and resulting in a use-after-free condition.
  • CVE-2022-37956: As I described in blog post #5, Windows has some logic to ensure that no leaf-type subkey list (li, lf, or lh) exceeds 511 or 1012 elements, depending on its specific type. If a list is expanded beyond this limit, it is automatically split into two lists, each half the original length. Another reasonable assumption is that the root index length would never approach the maximum value of _CM_KEY_INDEX.Count (uint16) under normal circumstances. This would require an unrealistically large number of subkeys or a very specific sequence of millions of key creations and deletions with specific names. However, it was possible to load a hive containing a subkey list of any of the four types with a length equal to 0xFFFF, and trigger a 16-bit integer overflow on the length field, leading to memory corruption. Interestingly, this is one of the few bugs that could be triggered solely with a single .bat file containing a long sequence of reg.exe command invocations.
  • CVE-2022-38037: In this case, the kernel code assumed that the hive version defined in the header (_HBASE_BLOCK.Minor) always corresponded to the type of subkey lists used in a given hive. For example, if the file version is regf 1.3, it should be impossible for it to contain lists in a format introduced in version 1.5. However, for some reason, the hive loader doesn't enforce the proper relationship between the format version and the structures used in it, which in this case led to a serious hive-based memory corruption vulnerability.
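
The 16-bit truncation at the core of CVE-2022-37956 is easy to reproduce in isolation. A toy model, not the kernel's actual code:

```c
#include <stdint.h>

// Toy model of the CVE-2022-37956 pattern: _CM_KEY_INDEX.Count is a uint16,
// so incrementing it with no range check wraps from 0xFFFF back to 0 -- a
// list that logically holds 65536 entries then reports a length of zero,
// breaking every piece of code that trusts the Count field.
uint16_t IncrementSubkeyCount(uint16_t Count) {
    return (uint16_t)(Count + 1);  // 0xFFFF + 1 truncates to 0.
}
```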

As we can see, it is crucial to differentiate between format elements that are conventions adopted by a specific implementation, and those actually enforced during the processing of the input file. If we encounter some code that makes assumptions from the former group that don't belong to the latter one, this could indicate a serious security issue.

Susceptibility to mishandling OOM conditions

Generally speaking, the implementation of any function in the Windows kernel is built roughly according to the following scheme:

NTSTATUS NtHighLevelOperation(...) {
  NTSTATUS Status;

  Status = HelperFunction1(...);
  if (!NT_SUCCESS(Status)) {
    //
    // Clean up...
    //
    return Status;
  }

  Status = HelperFunction2(...);
  if (!NT_SUCCESS(Status)) {
    //
    // Clean up...
    //
    return Status;
  }

  //
  // More calls...
  //
  return STATUS_SUCCESS;
}


Of course, this is a significant simplification, as real-world code contains keywords and constructs such as if statements, switch statements, various loops, and so on. The key point is that a considerable portion of higher-level functions call internal, lower-level functions specialized for specific tasks. Handling potential errors signalled by these functions is an important aspect of kernel code (or any code, for that matter). In low-level Windows code, error propagation occurs using the NTSTATUS type, which is essentially a signed 32-bit integer. A value of 0 signifies success (STATUS_SUCCESS), positive values indicate success but with additional information, and negative values denote errors. The sign of the number is checked by the NT_SUCCESS macro. During my research, I dedicated significant time to analyzing the error handling logic. Let's take a moment to think about the types of errors that could occur during registry operations, and the conditions that might cause them.
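
The sign convention can be captured in a few lines. Below is a minimal model of these definitions; the macro values match the publicly documented ones (STATUS_INSUFFICIENT_RESOURCES is 0xC000009A, STATUS_WAIT_1 is an example of an informational success), while everything else about the real kernel headers is omitted:

```c
#include <stdint.h>

// Minimal model of the NTSTATUS convention described above: a signed 32-bit
// value where 0 is STATUS_SUCCESS, positive values indicate success with
// additional information, and negative values denote errors.
typedef int32_t NTSTATUS;

#define STATUS_SUCCESS                 ((NTSTATUS)0x00000000)
#define STATUS_WAIT_1                  ((NTSTATUS)0x00000001)  // Informational success.
#define STATUS_INSUFFICIENT_RESOURCES  ((NTSTATUS)0xC000009A)  // Error: sign bit set.

// NT_SUCCESS only checks the sign: zero and positive values count as success.
#define NT_SUCCESS(Status) (((NTSTATUS)(Status)) >= 0)
```

Note that this collapse of many possible codes into a single yes/no test is exactly what makes the "partial success" bugs discussed later in this post possible.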

A common trait of all actions that modify data in the registry is that they allocate memory. The simplest example is the allocation of auxiliary buffers from kernel pools, requested through functions from the ExAllocatePool group. If there is very little available memory at a given point in time, one of the allocation requests may return the STATUS_INSUFFICIENT_RESOURCES error code, which will be propagated back to the original caller. And since we assume that we take on the role of a local attacker who has the ability to execute code on the machine, artificially occupying all available memory is potentially possible in many ways. So this is one way to trigger errors while performing operations on the registry, but admittedly not an ideal way, as it largely depends on the amount of RAM and the maximum pagefile size. Additionally, in a situation where the kernel has so little memory that single allocations start to fail, there is a high probability of the system crashing elsewhere before the vulnerability is successfully exploited. And finally, if several allocations are requested in nearby code in a short period of time, it seems practically impossible to take precise control over which of them will succeed and which will not.

Nonetheless, the overall concept of out-of-memory conditions is a very promising avenue for attack, especially considering that the registry primarily operates on memory-mapped hives using its own allocator, in addition to objects from kernel pools. The situation is even more favorable for an attacker due to the 2 GiB size limitation of each of the two storage types (stable and volatile) within a hive. While this is a relatively large value, it can be filled in under a minute on today's machines. The situation is even easier if it is the volatile space that needs to be occupied, as it resides solely in memory and is not flushed to disk – so filling two gigabytes of memory is then a matter of seconds. It can be accomplished, for example, by creating many long registry values, which is a straightforward task when dealing with a controlled hive. However, even in system hives, this is often feasible. To perform data spraying on a given hive, we only need a single key granting us write permissions. For instance, both HKLM\Software and HKLM\System contain numerous keys that allow write access to any user in the system, effectively permitting them to fill it to capacity. Additionally, the "global registry quota" mechanism, implemented by the internal CmpClaimGlobalQuota and CmpReleaseGlobalQuota functions, ensures that the total memory occupied by registry data in the system does not exceed 4 GiB. Besides filling the entire space of a specific hive, this is thus another way to trigger out-of-memory conditions in the registry, especially when targeting a hive without write permissions. A concrete example where this mechanism could have been employed to corrupt the HKLM\SAM system hive is the CVE-2024-26181 vulnerability.

Considering all this, it is a fair assumption that a local attacker can cause any call to ExAllocatePool*, HvAllocateCell, and HvReallocateCell (with a length greater than the existing cell) to fail. This opens up a large number of potential error paths to analyze. The HvAllocateCell calls are a particularly interesting starting point for analysis, as there are quite a few of them and almost all of them belong to the attack surface accessible to a regular user:

A screenshot shows a debugger window titled xrefs to HvAllocateCell. The window lists numerous functions and their memory addresses under columns Direction, Type, Address, and Text. Nearly all entries show different system functions making a call to HvAllocateCell.

There are two primary reasons why focusing on the analysis of error paths can be a good way to find security bugs. First, it stands to reason that on regular computers used by users, it is extremely rare for a given hive to grow to 2 GiB and run out of space, or for all registry data to simultaneously occupy 4 GiB of memory. This means that these code paths are practically never executed under normal conditions, and even if there were bugs in them, there is a very small chance that they would ever be noticed by anyone. Such rarely executed code paths are always a real treat for security researchers.

The second reason is that proper error handling in code is inherently difficult. Many operations involve numerous steps that modify the hive's internal state. If an issue arises during these operations, the registry code must revert all changes and restore the registry to its original state (at least from the macro-architectural perspective). This requires the developer to be fully aware of all changes applied so far when implementing each error path. Additionally, proper error handling must be considered during the initial design of the control flow as well, because some registry actions are irreversible (e.g., freeing cells). The code must thus be structured so that all such operations are placed at the very end of the logic, where errors cannot occur anymore and successful execution is guaranteed.

One example of such a vulnerability is CVE-2023-23421, which boiled down to the following code:

NTSTATUS CmpCommitRenameKeyUoW(_CM_KCB_UOW *uow) {
  // ...
  if (!CmpAddSubKeyEx(Hive, ParentKey, NewNameKey) ||
      !CmpRemoveSubKey(Hive, ParentKey, OldNameKey)) {
    CmpFreeKeyByCell(Hive, NewNameKey);
    return STATUS_INSUFFICIENT_RESOURCES;
  }
  // ...
}


The issue here was that if the CmpRemoveSubKey call failed, the corresponding error path should have reversed the effect of the CmpAddSubKeyEx function in the previous line, but in practice it didn't. As a result, it was possible to end up with a dangling reference to a freed key in the subkey list, which was a typical use-after-free condition.
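
The correct shape of this logic can be illustrated with a runnable toy model, assuming a flat array in place of the real subkey list structures (all names below are invented): once the insertion of the new name has succeeded, a failure to remove the old one must roll the insertion back before the new key cell is freed.

```c
#include <stdbool.h>

// Toy stand-in for a subkey list; the small capacity makes ListAdd failable.
#define MAX_SUBKEYS 4

typedef struct {
    int Keys[MAX_SUBKEYS];
    int Count;
} SubkeyList;

static bool ListAdd(SubkeyList *List, int Key) {
    if (List->Count == MAX_SUBKEYS) {
        return false;  // Models CmpAddSubKeyEx failing (e.g., OOM).
    }
    List->Keys[List->Count++] = Key;
    return true;
}

static bool ListRemove(SubkeyList *List, int Key) {
    for (int i = 0; i < List->Count; i++) {
        if (List->Keys[i] == Key) {
            List->Keys[i] = List->Keys[--List->Count];
            return true;
        }
    }
    return false;  // Models CmpRemoveSubKey failing.
}

// Fixed-style commit: on any failure the list is left exactly as it was, so
// no dangling reference to the (about to be freed) new key can remain.
bool CommitRename(SubkeyList *List, int OldKey, int NewKey) {
    if (!ListAdd(List, NewKey)) {
        return false;
    }
    if (!ListRemove(List, OldKey)) {
        ListRemove(List, NewKey);  // Roll back the insertion before bailing.
        return false;
    }
    return true;
}

// Demonstration driver: returns the list length after a rename attempt where
// the old key either exists (MissingOld == false) or does not.
int RenameScenarioCount(bool MissingOld) {
    SubkeyList List = { { 1, 2 }, 2 };
    CommitRename(&List, MissingOld ? 42 : 1, 9);
    return List.Count;
}
```

In the vulnerable code, the rollback step was missing, so the second scenario left the freed key's cell index behind in the list.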

A second interesting example of this type of bug was CVE-2023-21747, where an out-of-memory error could occur during a highly sensitive operation, hive unloading. As there was no way to revert the state at the time of the OOM, the vulnerability was fixed by Microsoft by refactoring the CmpRemoveSubKeyFromList function and other related functions so that they no longer allocate memory from kernel pools and thus there is no longer a physical possibility of them failing.

Finally, I'll mention CVE-2023-38154, where the problem wasn't incorrect error handling, but a complete lack of it – the return value of the HvpPerformLogFileRecovery function was ignored, even though there was a real possibility it could end with an error. This is a fairly classic type of bug that can occur in any programming language, but it's definitely worth keeping in mind when auditing the Windows kernel.

Susceptibility to mishandling partial successes

The previous section discusses bugs in error handling where each function is responsible for reversing the state it has modified. However, some functions don't adhere to this operational model. Instead of operating on an "all-or-nothing" basis, they work on a best-effort basis, aiming to accomplish as much of a given task as possible. If an error occurs, they leave any changes made in place, e.g., because this result is still preferable to not making any changes. This introduces a third possible output state for such functions: complete success, partial success, and complete failure.

This might be problematic, as the approach is incompatible with the typical usage of the NTSTATUS type, which is best suited for conveying one of two (not three) states. In theory, it is a 32-bit integer type, so it could store the additional information of the status being a partial success, and not being unambiguously positive or negative. In practice, however, the convention is to directly propagate the last error encountered within the inner function, and the outer functions very rarely "dig into" specific error codes, instead assuming that if NT_SUCCESS returns FALSE, the entire operation has failed. Such confusion at the cross-function level may have security implications if the outer function should take some additional steps in the event of a partial success of the inner function, but due to the binary interpretation of the returned error code, it ultimately does not execute them.
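
The collapse from three states to two can be made concrete with a small model (the enum and function names here are invented for illustration):

```c
#include <stdbool.h>

// Toy model of the partial-success trap. The inner routine really has three
// outcomes, but a caller that only applies an NT_SUCCESS-style boolean test
// treats a partial success (shared state already modified) exactly like a
// clean failure.
typedef enum { INNER_FAILED, INNER_PARTIAL, INNER_SUCCESS } InnerResult;

// What a typical outer function sees after the boolean collapse.
bool CollapsedToSuccess(InnerResult Result) {
    return Result == INNER_SUCCESS;
}

// The information the boolean can no longer convey: whether hive state was
// actually touched and may require extra cleanup by the caller.
bool StateWasModified(InnerResult Result) {
    return Result != INNER_FAILED;
}
```

Whenever `CollapsedToSuccess` is false but `StateWasModified` is true, the outer function is operating on stale assumptions, which is precisely the window exploited by bugs of this class.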

A classic example of such a bug is CVE-2024-26182, which occurred at the intersection of the CmpAddSubKeyEx (outer) and CmpAddSubKeyToList (inner) functions. The problem here was that CmpAddSubKeyToList implements complex, potentially multi-step logic for expanding the subkey list, which could perform a cell reallocation and subsequently encounter an OOM error. On the other hand, the CmpAddSubKeyEx function assumed that the cell index in the subkey list should only be updated in the hive structures if CmpAddSubKeyToList fully succeeds. As a result, the partial success of CmpAddSubKeyToList could lead to a classic use-after-free situation. An attentive reader will probably notice that the return value type of the CmpAddSubKeyToList routine was BOOL and not NTSTATUS, but the bug pattern is identical.
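The pattern behind CVE-2024-26182 can be reduced to a few lines. The sketch below is a deliberately simplified, self-contained C++ model, with invented names and two booleans standing in for the real hive structures; it is not the kernel's code, only an illustration of how NT_SUCCESS flattens a partial success into a total failure:

```cpp
#include <cassert>
#include <cstdint>

// Simplified stand-ins for the kernel-side definitions.
typedef int32_t NTSTATUS;
#define NT_SUCCESS(Status) (((NTSTATUS)(Status)) >= 0)
constexpr NTSTATUS STATUS_SUCCESS = 0;
constexpr NTSTATUS STATUS_INSUFFICIENT_RESOURCES = (NTSTATUS)0xC000009A;

// Invented structure: two booleans model the relevant hive state.
struct HiveState {
  bool list_reallocated = false;  // mutation performed by the inner function
  bool index_updated = false;     // follow-up step owned by the outer function
};

// Inner function (loosely modeled after CmpAddSubKeyToList): best-effort
// logic that first mutates shared state (the subkey-list reallocation) and
// may only then hit an OOM error, which it reports without rolling the
// mutation back: a partial success.
NTSTATUS InnerAddSubKey(HiveState* s, bool oom) {
  s->list_reallocated = true;
  return oom ? STATUS_INSUFFICIENT_RESOURCES : STATUS_SUCCESS;
}

// Outer function (loosely modeled after CmpAddSubKeyEx): interprets the
// status as strictly binary, so on "failure" it skips the index update
// that the reallocation made necessary, leaving a stale reference behind.
bool OuterAddSubKey(HiveState* s, bool oom) {
  if (!NT_SUCCESS(InnerAddSubKey(s, oom))) {
    return false;
  }
  s->index_updated = true;
  return true;
}
```

In the OOM case, the list has been reallocated but the index is never refreshed; in the real bug, that stale index was what turned into a use-after-free.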

Overall complexity introduced over time

One of the biggest problems with the modern implementation of the registry is that over the decades of developing this functionality, many changes and new features have been introduced. This has caused the level of complexity of its internal state to increase so much that it seems difficult to grasp for one person, unless they are a dedicated registry expert who has worked on it full-time for months or years. I personally believe that the registry existed in its most elegant form somewhere around Windows NT 3.1 – 3.51 (i.e. in the years 1993–1996). At the time, the mechanism was intuitive and logical for both developers and its users. Each object (key, value) either existed or not, each operation ended in either success or failure, and when it was requested on a particular key, you could be sure that it was actually performed on that key. Everything was simple, and black and white. However, over time, more and more shades of gray were continuously added, departing from the basic assumptions:

  • The existence of predefined keys meant that every operation could no longer be performed on every key, as this special type of key was unsafe for many internal registry functions to use due to its altered semantics.
  • Due to symbolic links, opening a specific key doesn't guarantee that it will be the intended one, as it might be a different key that the original one points to.
  • Registry virtualization has introduced further uncertainty into key operations. When an operation is performed on a key, it is unclear whether the operation is actually executed on that specific key or redirected to a different one. Similarly, with read operations, a client cannot be entirely certain that it is reading from the intended key, as the data may be sourced from a different, virtualized location.
  • Transactions in the registry mean that a given state is no longer considered solely within the global view of the registry. At any given moment, there may also be changes that are visible only within a certain transaction (when they are initiated but not yet committed), and this complex scenario must be correctly handled by the kernel.
  • Layered keys have transformed the nature of hives, making them interdependent rather than self-contained database units. This is due to the introduction of differencing hives, which function solely as "patch diffs" and cannot exist independently without a base hive. Additionally, the semantics of certain objects and their fields have been altered. Previously, a key's existence was directly tied to the presence of a corresponding key node within the hive. Layered keys have disrupted this dependency. Now, a key with a key node can be non-existent if marked as a Tombstone, and a key without a corresponding key node can logically exist if its semantics are Merge-Unbacked, referencing a lower-level key with the same name.
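To illustrate the last point, here is a small, purely conceptual C++ model (invented names and states; it does not mirror the actual kernel structures) of how a key's logical existence could be resolved across a stack of layered hives, walking from the topmost differencing hive down to the base hive:

```cpp
#include <cassert>
#include <vector>

// Conceptual per-layer state of a key, as seen from one hive in the stack.
// These names are an invented simplification, not kernel structures.
enum class LayerState {
  NotPresent,     // no key node in this layer; defer to lower layers
  Present,        // a regular key node exists in this layer
  Tombstone,      // a key node exists, but it marks the key as deleted
  MergeUnbacked,  // no key node here, yet the key logically exists,
                  // referencing a lower-level key with the same name
};

// Walks the layers from the topmost differencing hive (index 0) down to
// the base hive and returns whether the key exists in the merged view.
bool KeyLogicallyExists(const std::vector<LayerState>& layers) {
  for (LayerState state : layers) {
    switch (state) {
      case LayerState::Present:
      case LayerState::MergeUnbacked:
        return true;   // this layer authoritatively says "exists"
      case LayerState::Tombstone:
        return false;  // key node present, but the key is deleted
      case LayerState::NotPresent:
        break;         // no answer here; keep descending the stack
    }
  }
  return false;        // no layer defines the key at all
}
```

A Tombstone wins even if a Present key sits below it, and a Merge-Unbacked key exists despite having no key node of its own: exactly the two departures from the old "key node implies key exists" invariant.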

Of course, all of these mechanisms were designed and implemented for a specific purpose: either to make life easier for developers/applications using the Registry API, or to introduce some new functionality that is needed today. The problem is not that they were added, but that it seems that the initial design of the registry was simply not compatible with them, so they were sort of forced into the registry, and where they didn't fit, an extra layer of tape was added to hold it all together. This ultimately led to a massive expansion of the internal state that needs to be maintained within the registry. This is evident both in the significant increase in the size of old structures (like KCB) and in the number of new objects that have been added over the years. But the most unfortunate aspect is that each of these more advanced mechanisms seems to have been designed to solve one specific problem, assuming that they would operate in isolation. And indeed, they probably do under typical conditions, but a particularly malicious user could start combining these different mechanisms and making them interact. Given the difficulty in logically determining the expected behavior of some of these combinations, it is doubtful that every such case was considered, documented, implemented, and tested by Microsoft.

The relationships between the various advanced mechanisms in the registry are humorously depicted in the image below:

An image from a Pirates of the Caribbean movie shows a standoff with characters pointing pistols at each other. Text labels overlay the scene, metaphorically linking pirate actions to Windows Registry concepts. These concepts include Predefined Keys, Layered Keys, Transactions, Symbolic Links, and Registry Virtualization.

Some examples of bugs caused by incorrect interactions between these mechanisms include CVE-2023-21675, CVE-2023-21748, CVE-2023-35356, CVE-2023-35357 and CVE-2023-35358.

Entry points

This section describes the entry points that a local attacker can use to interact with the registry and exploit any potential vulnerabilities.

Hive loading

Let's start with the operation of loading user-controlled hives. Since hive loading is only possible from disk (and not, for example, from a memory buffer), this means that to actually trigger this attack surface, the process must be able to create a file with controlled content, or at least a controlled prefix of several kilobytes in length. Regular programs operating at Medium IL generally have this capability, but write access to disk may be restricted for heavily sandboxed processes (e.g. renderer processes in browsers).

When it comes to the typical type of bugs that can be triggered in this way, what primarily comes to mind are issues related to binary data parsing, and memory safety violations such as out-of-bounds buffer accesses. It is possible to encounter more logical-type issues, but they usually rely on certain assumptions about the format not being sufficiently verified, causing subsequent operations on such a hive to run into problems. It is very rare to find a vulnerability that can be both triggered and exploited by just loading the hive, without performing any follow-up actions on it. But as CVE-2024-43452 demonstrates, it can still happen sometimes.
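To give a flavor of what this parsing attack surface looks like, the sketch below models two of the most basic base-block invariants described in the unofficial regf documentation: the "regf" signature and the header checksum, computed as the XOR of the first 127 dwords. This is a simplified, little-endian illustration, not the kernel's actual validation code:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// The first 512 bytes of a hive file form the base block; the dword at
// offset 0x1FC stores a checksum over the preceding 0x1FC bytes.
constexpr size_t kBaseBlockSize = 512;
constexpr uint32_t kRegfSignature = 0x66676572;  // "regf" in little-endian

// XOR of the first 127 dwords, with the two edge cases described in the
// unofficial regf documentation (0 becomes 1, 0xFFFFFFFF becomes 0xFFFFFFFE).
uint32_t BaseBlockChecksum(const uint8_t* block) {
  uint32_t checksum = 0;
  for (size_t off = 0; off < 0x1FC; off += 4) {
    uint32_t dword;
    std::memcpy(&dword, block + off, sizeof(dword));
    checksum ^= dword;
  }
  if (checksum == 0) return 1;
  if (checksum == 0xFFFFFFFF) return 0xFFFFFFFE;
  return checksum;
}

// Checks the two most basic invariants of a candidate hive file.
bool ValidateBaseBlock(const std::vector<uint8_t>& file) {
  if (file.size() < kBaseBlockSize) return false;
  uint32_t signature, stored;
  std::memcpy(&signature, file.data(), sizeof(signature));
  std::memcpy(&stored, file.data() + 0x1FC, sizeof(stored));
  return signature == kRegfSignature && stored == BaseBlockChecksum(file.data());
}
```

The real loader performs vastly more validation than this (bin headers, cell lists, the security descriptor tree, and so on), which is precisely where the bug-hunting opportunities lie.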

App hives

The introduction of Application Hives in Windows Vista caused a significant shift in the registry attack surface. It allowed unprivileged processes to directly interact with kernel code that was previously only accessible to system services and administrators. Attackers gained access to much of the NtLoadKey syscall logic, including hive file operations, hive parsing at the binary level, hive validation logic in the CmpCheckRegistry function and its subfunctions, and so on. In fact, of the 53 serious vulnerabilities I discovered during my research, 16 (around 30%) either required loading a controlled hive as an app hive, or were significantly easier to trigger using this mechanism.

It's important to remember that while app hives do open up a broad range of new possibilities for attackers, they don't offer exactly the same capabilities as loading normal (non-app) hives due to several limitations and specific behaviors:

  • They must be loaded under the special path \Registry\A, which means an app hive cannot be loaded just anywhere in the registry hierarchy. Furthermore, the kernel blocks references to this path by fully qualified name, which also reduces their usefulness in some offensive applications.
  • The logic for unloading app hives differs from unloading standard hives because the process occurs automatically when all handles to the hive are closed, rather than manually unloading the hive through the RegUnLoadKeyW API or its corresponding syscall from the NtUnloadKey family.
  • Operations on app hive security descriptors are very limited: any calls to the RegSetKeySecurity function or RegCreateKeyExW with a non-default security descriptor will fail, which means that new descriptors cannot be added to such hives.
  • KTM transactions are unconditionally blocked for app hives.

Despite these minor restrictions, the ability to load arbitrary hives remains one of the most useful tools when exploiting registry bugs. Even if binary control of the hive is not strictly required, it can still be valuable. This is because it allows the attacker to clearly define the initial state of the hive where the attack takes place. By taking advantage of the cell allocator's determinism, it is often possible to achieve 100% exploitation success.

User hives and Mandatory User Profiles

Sometimes, triggering a specific bug requires both binary control over the hive and certain features that app hives lack, such as the ability to open a key via its full path. In such cases, an alternative to app hives exists, which might be slightly less practical but still allows for exploiting these more demanding bugs. It involves directly modifying one of the two hives assigned to every user in the system: the user hive (C:\Users\<username>\NTUSER.DAT mounted under \Registry\User\<SID>, or in other words, HKCU) or the user classes hive (C:\Users\<username>\AppData\Local\Microsoft\Windows\UsrClass.dat mounted under \Registry\User\<SID>_Classes). Naturally, when these hives are actively used by the system, access to their backing files is blocked, preventing simultaneous modification, which complicates things considerably. However, there are two ways to circumvent this problem.

The first scenario involves a hypothetical attacker who has two local accounts on the targeted system, or similarly, two different users collaborating to take control of the computer (let's call them users A and B). User A can grant user B full rights to modify their hive(s), and then log out. User B then makes all the required binary changes to the hive and finally notifies user A that they can log back in. At this point, the Profile Service loads the modified hive on behalf of that user, and the initial goal is achieved.

The second option is more practical as it doesn't require two different users. It abuses Mandatory User Profiles, a system functionality that prioritizes the NTUSER.MAN file in the user's directory over the NTUSER.DAT file as the user hive, if it exists (it doesn't exist in the default system installation). This means that a single user can place a specially prepared hive under the NTUSER.MAN name in their home directory, then log out and log back in. Afterwards, NTUSER.MAN will be the user's active HKCU key, achieving the goal. However, the technique also has some drawbacks – it only applies to the user hive (not UsrClass.dat), and it is somewhat noisy. Once the NTUSER.MAN file has been created and loaded, the same user has no way to delete it, as it will always be loaded by the system upon login, effectively blocking access to it.

A few examples of bugs involving one of the two above techniques are CVE-2023-21675, CVE-2023-35356, and CVE-2023-35633. They all required the existence of a special type of key called a predefined key within a publicly accessible hive, such as HKCU. Even when predefined keys were still supported, they could not be created using the system API, and the only way to craft them was by directly setting a specific flag within the internal key node structure in the hive file.

Log file parsing: .LOG/.LOG1/.LOG2

One of the fundamental features of the registry is that it guarantees consistency at the level of interdependent cells that together form the structure of keys within a given hive. This refers to a situation where a single operation on the registry involves the simultaneous modification of multiple cells. Even if there is a power outage and the system restarts in the middle of performing this operation, the registry guarantees that all intermediate changes will either be applied or discarded. Such "atomicity" of operations is necessary in order to guarantee the internal consistency of the hive structure, which, as we know, is important to security. The mechanism is implemented by using additional files associated with the hive, where the intermediate state of registry modifications is saved with the granularity of a memory page (4 KiB), and which can be safely rolled forward or rolled back at the next hive load. Usually these are two files with the .LOG1 and .LOG2 extensions, but it is also possible to force the use of a single log file with the .LOG extension by passing the REG_HIVE_SINGLE_LOG flag to syscalls from the NtLoadKey family.
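Conceptually, rolling a hive forward from a log file amounts to overwriting whole 4 KiB pages of the hive image with their logged copies. The sketch below is a strongly simplified model of that idea (a plain page map instead of the real on-disk encoding); note how validating every record before applying any of them preserves the all-or-nothing behavior discussed here:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <map>
#include <vector>

constexpr size_t kPageSize = 4096;  // granularity of logged modifications

// Applies logged page copies to a hive image: each record overwrites one
// whole 4 KiB page. Every record is validated before any of them is
// applied, so the whole operation is all-or-nothing. This models only the
// roll-forward concept; the real on-disk encoding is far more involved.
bool RollForward(std::vector<uint8_t>& hive,
                 const std::map<size_t, std::vector<uint8_t>>& pages) {
  for (const auto& [index, data] : pages) {
    if (data.size() != kPageSize || (index + 1) * kPageSize > hive.size()) {
      return false;  // malformed record: reject the log, touch nothing
    }
  }
  for (const auto& [index, data] : pages) {
    std::memcpy(&hive[index * kPageSize], data.data(), kPageSize);
  }
  return true;
}
```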

Internally, each LOG file can be encoded in one of two formats. One is the "legacy log file", a relatively simple format that has existed since the first implementation of the registry in Windows NT 3.1. Another one is the "incremental log file", a slightly more modern and complex format introduced in Windows 8.1 to address performance issues that plagued the previous version. Both formats use the same header as the normal regf format (the first 512 bytes of the _HBASE_BLOCK structure, up to the CheckSum field), with the Type field set to 0x1 (legacy log file on Windows XP and newer), 0x2 (legacy log file on Windows 2000 and older), or 0x6 (incremental log file). Further on, at offset 0x200, legacy log files contain the signature 0x54524944 ("DIRT") followed by the "dirty vector", while incremental log files contain successive records represented by the magic value 0x454C7648 ("HvLE").
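Based on the header layout above, a minimal classifier for the two log formats could look as follows. This is a simplified, little-endian illustration of the documented on-disk values (assuming the Type field sits at offset 0x1C of _HBASE_BLOCK, per the unofficial regf documentation), not the kernel's parser:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

enum class LogFormat { Invalid, Legacy, Incremental };

constexpr uint32_t kDirtSignature = 0x54524944;  // "DIRT"
constexpr uint32_t kHvLESignature = 0x454C7648;  // "HvLE"

// Classifies a LOG file by the Type field of its regf-style header and
// the first dword after the 512-byte header.
LogFormat ClassifyLogFile(const uint8_t* data, size_t size) {
  if (size < 0x204) return LogFormat::Invalid;
  uint32_t type, magic;
  std::memcpy(&type, data + 0x1C, sizeof(type));    // _HBASE_BLOCK.Type
  std::memcpy(&magic, data + 0x200, sizeof(magic)); // first post-header dword
  if ((type == 0x1 || type == 0x2) && magic == kDirtSignature)
    return LogFormat::Legacy;       // "DIRT" + dirty vector follow
  if (type == 0x6 && magic == kHvLESignature)
    return LogFormat::Incremental;  // successive "HvLE" records follow
  return LogFormat::Invalid;
}
```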

These formats are well-documented in two unofficial regf documentations: GitHub: libyal/libregf and GitHub: msuhanov/regf. Additional information can be found in the "Stable storage" and "Incremental logging" subsections of the Windows Internals (Part 2, 7th Edition) book and its earlier editions.

From a security perspective, it's important to note that LOG files are processed for app hives, so their handling is part of the local attack surface. On the other hand, this attack surface isn't particularly large, as it boils down to just a few functions that are called by the two highest-level routines: HvAnalyzeLogFiles and HvpPerformLogFileRecovery. The potential types of bugs are also fairly limited, mainly consisting of shallow memory safety violations. Two specific examples of vulnerabilities related to this functionality are CVE-2023-35386 and CVE-2023-38154.

Log file parsing: KTM logs

Besides ensuring atomicity at the level of individual operations, the Windows Registry also provides two ways to achieve atomicity for entire groups of operations, such as creating a key and setting several of its values as part of a single logical unit. These mechanisms are based on two different types of transactions: KTM transactions (managed by the Kernel Transaction Manager, implemented by the tm.sys driver) and lightweight transactions, which were designed specifically for the registry. Notably, lightweight transactions exist in memory only and are never written to disk, so they do not represent an attack vector during hive loading, because there is no file recovery logic.

KTM transactions are available for use in any loaded hive that has neither the REG_APP_HIVE nor the REG_HIVE_NO_RM flag set. To utilize them, a transaction object must first be created using the CreateTransaction API. The resulting handle is then passed to the RegOpenKeyTransacted, RegCreateKeyTransacted, or RegDeleteKeyTransacted registry functions. Finally, the entire transaction is committed via CommitTransaction. Windows attempts to guarantee that active transactions that are caught mid-commit during a sudden system shutdown will be rolled forward when the hive is loaded again. To achieve this, the Windows kernel employs the Common Log File System interface to save serialized records detailing individual operations to the .blf files that accompany the main hive file. When a hive is loaded, the system checks for unapplied changes in these .blf files. If any are found, it deserializes the individual records and attempts to redo all the actions described within them. This logic is primarily handled by the internal functions CmpRmAnalysisPhase, CmpRmReDoPhase, and CmpRmUnDoPhase, as well as the functions surrounding them in the control flow graph.

Given that KTM transactions are never enabled for app hives, the possibility of an unprivileged user exploiting this functionality is severely limited. The only option is to focus on KTM log files associated with regular hives that a local user has some control over, namely the user hive (NTUSER.DAT) and the user classes hive (UsrClass.dat). If a transactional operation is performed on a user's HKCU hive, additional .regtrans-ms and .blf files appear in their home directory. Furthermore, if these files don't exist at first, they can be planted on the disk manually, and will be processed by the Windows kernel after logging out and logging back in. Interestingly, even when the KTM log files are actively in use, they have the read sharing mode enabled. This means that a user can write data to these logs by performing transactional operations, and read from them directly at the same time.

Historically, the handling of KTM logs has been affected by a significant number of security issues. Between 2019 and 2020, James Forshaw reported three serious bugs in this code: CVE-2019-0959, CVE-2020-1377, and CVE-2020-1378. Subsequently, during my research, I discovered three more: CVE-2023-28271, CVE-2023-28272, and CVE-2023-28293. However, the strangest thing is that, according to my tests, the entire logic for restoring the registry state from KTM logs stopped working due to code refactoring introduced in Windows 10 1607 (almost 9 years ago) and has not been fixed since. I described this observation in another report related to transactions, in a section called "KTM transaction recovery code". I'm not entirely sure whether I'm making a mistake in testing, but if this is truly the case, it means that the entire recovery mechanism currently serves no purpose and only needlessly increases the system's attack surface. Therefore, it could be safely removed or, at the very least, actually fixed.

Direct registry operations through standard syscalls

Direct operations on keys and values are the core of the registry and make up most of its associated code within the Windows kernel. These basic operations don't need any special permissions and are accessible by all users, so they constitute the primary attack surface available to a local attacker. These actions have been summarized at the beginning of blog post #2, and should probably be familiar by now. As a recap, here is a table of the available operations, including the corresponding high-level API function, system call name, and internal kernel function name if it differs from the syscall:

| Operation name | Registry API name(s) | System call(s) | Internal kernel handler (if different than syscall) |
|---|---|---|---|
| Load hive | RegLoadKey, RegLoadAppKey | NtLoadKey, NtLoadKey2, NtLoadKeyEx, NtLoadKey3 | - |
| Count open subkeys in hive | - | NtQueryOpenSubKeys | - |
| Flush hive | RegFlushKey | NtFlushKey | - |
| Open key | RegOpenKeyEx, RegOpenKeyTransacted | NtOpenKey, NtOpenKeyEx, NtOpenKeyTransacted, NtOpenKeyTransactedEx | CmpParseKey |
| Create key | RegCreateKeyEx, RegCreateKeyTransacted | NtCreateKey, NtCreateKeyTransacted | CmpParseKey |
| Delete key | RegDeleteKeyEx, RegDeleteKeyTransacted | NtDeleteKey | - |
| Rename key | RegRenameKey | NtRenameKey | - |
| Set key security | RegSetKeySecurity | NtSetSecurityObject | CmpSecurityMethod |
| Query key security | RegGetKeySecurity | NtQuerySecurityObject | CmpSecurityMethod |
| Set key information | - | NtSetInformationKey | - |
| Query key information | RegQueryInfoKey | NtQueryKey | - |
| Enumerate subkeys | RegEnumKeyEx | NtEnumerateKey | - |
| Notify on key change | RegNotifyChangeKeyValue | NtNotifyChangeKey, NtNotifyChangeMultipleKeys | - |
| Query key path | - | NtQueryObject | CmpQueryKeyName |
| Close key handle | RegCloseKey | NtClose | CmpCloseKeyObject, CmpDeleteKeyObject |
| Set value | RegSetValueEx | NtSetValueKey | - |
| Delete value | RegDeleteValue | NtDeleteValueKey | - |
| Enumerate values | RegEnumValue | NtEnumerateValueKey | - |
| Query value data | RegQueryValueEx | NtQueryValueKey | - |
| Query multiple values | RegQueryMultipleValues | NtQueryMultipleValueKey | - |

Some additional comments:

  • A regular user can directly load only application hives, using the RegLoadAppKey function or its corresponding syscalls with the REG_APP_HIVE flag. Loading standard hives, using the RegLoadKey function, is reserved for administrators only. However, this operation is still indirectly accessible to other users through the NTUSER.MAN hive and the Profile Service, which can load it as a user hive during system login.
  • When selecting API functions for the table above, I prioritized their latest versions (often with the "Ex" suffix, meaning "extended"). I also chose those that are the thinnest wrappers and closest in functionality to their corresponding syscalls on the kernel side. In the official Microsoft documentation, you'll also find many older/deprecated versions of these functions, which were available in early Windows versions and now exist solely for backward compatibility (e.g., RegOpenKey, RegEnumKey). Additionally, there are also helper functions that implement more complex logic on the user-mode side (e.g., RegDeleteTree, which recursively deletes an entire subtree of a given key), but they don't add anything in terms of the kernel attack surface.
  • There are several operations natively supported by the kernel that do not have a user-mode equivalent, such as NtQueryOpenSubKeys or NtSetInformationKey. The only way to use these interfaces is to call their respective system calls directly, which is most easily achieved by calling their wrappers with the same name in the ntdll.dll library. Furthermore, even when a documented API function exists, it may not expose all the capabilities of its corresponding system call. For example, the RegQueryInfoKey function returns some information about a key, but much more can be learned by using NtQueryKey directly with one of the supported information classes.

Moreover, there is a group of syscalls that do require administrator rights (specifically SeBackupPrivilege, SeRestorePrivilege, or PreviousMode set to KernelMode). These syscalls are used either for registry management by the kernel or system services, or for purely administrative tasks (such as performing registry backups). They are not particularly interesting from a security research perspective, as they cannot be used to elevate privileges, but it is worth mentioning them by name:

  • NtCompactKeys
  • NtCompressKey
  • NtFreezeRegistry
  • NtInitializeRegistry
  • NtLockRegistryKey
  • NtQueryOpenSubKeysEx
  • NtReplaceKey
  • NtRestoreKey
  • NtSaveKey
  • NtSaveKeyEx
  • NtSaveMergedKeys
  • NtThawRegistry
  • NtUnloadKey
  • NtUnloadKey2
  • NtUnloadKeyEx

Incorporating advanced features

Despite the fact that most power users are familiar with the basic registry operations (e.g., from using Regedit.exe), there are still some modifiers that can change the behavior of these operations, thereby complicating their implementation and potentially leading to interesting bugs. To use these modifiers, additional steps are often required, such as enabling registry virtualization, creating a transaction, or loading a differencing hive. When this is done, the information about the special key properties is encoded within the internal kernel structures, and the key handle itself is almost indistinguishable from other handles as seen by the user-mode application. When operating on such advanced keys, the logic for their handling is executed in the standard registry syscalls transparently to the user. The diagram below illustrates the general, conceptual control flow in registry-related system calls:

A flowchart outlines a system process beginning with input argument checks and referencing key handles. An internal operation handler then makes decisions based on whether a key is layered or transacted, leading to specific logic paths. The process concludes with copying output data and invoking post registry callbacks before stopping.

This is a very simplified outline of how registry syscalls work, but it shows that a function theoretically supporting one operation can actually hide many implementations that are dynamically chosen based on various factors. In terms of specifics, there are significant differences depending on the operation and whether it is a "read" or "write" one. For example, in "read" operations, the execution paths for transactional and non-transactional operations are typically combined into one that has built-in transaction support but can also operate without them. On the other hand, in "write" operations, normal and transactional operations are always performed differently, but there isn't much code dedicated to layered keys (except for the so-called key promotion operations), since when writing to a layered key, the state of keys lower on the stack is usually not as important. As for the "Internal operation handler" area marked within the large rectangle with the dotted line, these are internal functions responsible for the core logic of a specific operation, and whose names typically begin with "Cm" instead of "Nt". For example, for the NtDeleteKey syscall, the corresponding internal handler is CmDeleteKey, for NtQueryKey it is CmQueryKey, for NtEnumerateKey it is CmEnumerateKey, and so on.

In the following sections, we will take a closer look at each of the possible complications.

Predefined keys and symbolic links

Predefined keys were deprecated in 2023, so I won't spend much time on them here. It's worth mentioning that on modern systems, it wasn't possible to create them in any way using the API, or even directly using syscalls. The only way to craft such a key in the registry was to create it in binary form in a controlled hive file and have it loaded via RegLoadAppKey or as a user hive. These keys had very strange semantics, both at the key node level (unusual encoding of _CM_KEY_NODE.ValueList) and at the kernel key body object level (non-standard value of _CM_KEY_BODY.Type). Due to the need to filter out these keys at an early stage of syscall execution, there are special helper functions whose purpose is to open the key by handle and verify whether it is or isn't a predefined handle (CmObReferenceObjectByHandle and CmObReferenceObjectByName). Consequently, hunting for bugs related to predefined handles involved verifying whether each syscall used the above wrappers correctly, and whether there was some other way to perform an operation on this type of key while bypassing the type check. As I have mentioned, this is now just a thing of the past, as predefined handles in input hives are no longer supported and therefore do not pose a security risk to the system.

When it comes to symbolic links, this is a semi-documented feature that requires calling the RegCreateKeyEx function with the special REG_OPTION_CREATE_LINK flag to create them. Then, you need to set a value named "SymbolicLinkValue" and of type REG_LINK, which contains the target of the symlink as an absolute, internal registry path (\Registry\...) written using wide characters. From that point on, the link points to the specified path. However, it's important to remember that traversing symbolic links originating from non-system hives is heavily restricted: it can only occur within a single "trust class" (e.g., between the user hive and user classes hive of the same user). As a result, links located in app hives are never fully functional, because each app hive resides in its own isolated trust class, and they cannot reference themselves either, as references to paths starting with "\Registry\A" are blocked by the Windows kernel.
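To make the recipe concrete, the helper below builds the raw data blob for the "SymbolicLinkValue" value: the absolute target path encoded as 16-bit wide characters, with no trailing NUL terminator. It is a portable sketch that uses char16_t in place of the Windows WCHAR:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Builds the data blob for the "SymbolicLinkValue" value (type REG_LINK):
// the absolute registry path of the link target, encoded as 16-bit wide
// characters with no trailing NUL terminator. char16_t stands in for the
// Windows WCHAR so that this sketch stays portable.
std::vector<uint8_t> BuildSymbolicLinkValue(const std::u16string& target) {
  std::vector<uint8_t> blob(target.size() * sizeof(char16_t));
  std::memcpy(blob.data(), target.data(), blob.size());
  return blob;
}
```

On Windows, the returned buffer would be written with RegSetValueExW and the REG_LINK type to a key that was created with the REG_OPTION_CREATE_LINK flag.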

As for auditing symbolic links, they are generally resolved during the opening/creation of a key. Therefore, the analysis mainly involves the CmpParseKey function and lower-level functions called within it, particularly CmpGetSymbolicLinkTarget, which is responsible for reading the target of a given symlink and searching for it in existing registry structures. Issues related to symlinks can also be found in registry callbacks registered by third-party drivers, especially those that handle the RegNtPostOpenKey/RegNtPostCreateKey and similar operations. Correctly handling "reparse" return values and the multiple call loops performed by the NT Object Manager is not an easy feat to achieve.

Registry virtualization

Registry virtualization, introduced in Windows Vista, ensures backward compatibility for older applications that assume administrative privileges when using the registry. This mechanism redirects references between HKLM\Software and HKU\<SID>_Classes\VirtualStore subkeys transparently, allowing programs to "think" they write to the system hive even though they don't have sufficient permissions for it. The virtualization logic, integrated into nearly every basic registry syscall, is mostly implemented by three functions:

  • CmKeyBodyRemapToVirtualForEnum: Translates a real key inside a virtualized hive (HKLM\Software) to a virtual key inside the VirtualStore of the user classes hive during read-type operations. This is done to merge the properties of both keys into a single state that is then returned to the caller.
  • CmKeyBodyRemapToVirtual: Translates a real key to its corresponding virtual key, and is used in the key deletion and value deletion operations. This is done to delete the replica of a given key in VirtualStore or one of its values, instead of its real instance in the global hive.
  • CmKeyBodyReplicateToVirtual: Replicates the entire key structure that the caller wants to create in the virtualized hive, inside of the VirtualStore.

All of the above functions have a complicated control flow, both in terms of low-level implementation (e.g., they implement various registry path conversions) and logically – they create new keys in the registry, merge the states of different keys into one, etc. As a result, it doesn't really come as a big surprise that the code has been affected by many vulnerabilities. Triggering virtualization doesn't require any special rights, but it does need a few conditions to be met:

  • Virtualization must be specifically enabled for a given process. This is not the default behavior for 64-bit programs but can be easily enabled by calling the SetTokenInformation function with the TokenVirtualizationEnabled argument on the security token of the process.
  • Depending on the desired behavior, the appropriate combination of VirtualSource/VirtualTarget/VirtualStore flags should be set in _CM_KEY_NODE.Flags. This can be achieved either through binary control over the hive or by setting it at runtime using the NtSetInformationKey call with the KeySetVirtualizationInformation argument.
  • The REG_KEY_DONT_VIRTUALIZE flag must not be set in the _CM_KEY_NODE.VirtControlFlags field for a given key. This is usually not an issue, but if necessary, it can be adjusted either in the binary representation of the hive or using the NtSetInformationKey call with the KeyControlFlagsInformation argument.
  • In specific cases, the source key must be located in a virtualizable hive. In such scenarios, the HKLM\Software\Microsoft\DRM key becomes very useful, as it meets this condition and has a permissive security descriptor that allows all users in the system to create subkeys within it.

With regards to the first two points, many examples of virtualization-related bugs can be found in the Project Zero bug tracker. These reports include proof-of-concept code that correctly sets the appropriate flags. For simplicity, I will share that code here as well; the two C++ functions responsible for enabling virtualization for a given security token and registry key are shown below:

BOOL EnableTokenVirtualization(HANDLE hToken, BOOL bEnabled) {
  DWORD dwVirtualizationEnabled = bEnabled;

  return SetTokenInformation(hToken,
                             TokenVirtualizationEnabled,
                             &dwVirtualizationEnabled,
                             sizeof(dwVirtualizationEnabled));
}

BOOL EnableKeyVirtualization(HKEY hKey,
                             BOOL VirtualTarget,
                             BOOL VirtualStore,
                             BOOL VirtualSource) {
  KEY_SET_VIRTUALIZATION_INFORMATION VirtInfo;
  VirtInfo.VirtualTarget = VirtualTarget;
  VirtInfo.VirtualStore = VirtualStore;
  VirtInfo.VirtualSource = VirtualSource;
  VirtInfo.Reserved = 0;

  NTSTATUS Status = NtSetInformationKey(hKey,
                                        KeySetVirtualizationInformation,
                                        &VirtInfo,
                                        sizeof(VirtInfo));

  return NT_SUCCESS(Status);
}


And their example use:

HANDLE hToken;
HKEY hKey;

//
// Enable virtualization for the token.
//
if (!OpenProcessToken(GetCurrentProcess(), TOKEN_ALL_ACCESS, &hToken)) {
  printf("OpenProcessToken failed with error %u\n", GetLastError());
  return 1;
}

EnableTokenVirtualization(hToken, TRUE);

//
// Enable virtualization for the key.
//
RegOpenKeyExW(..., &hKey);

EnableKeyVirtualization(hKey,
                        /*VirtualTarget=*/TRUE,
                        /*VirtualStore=*/TRUE,
                        /*VirtualSource=*/FALSE);

Transactions

There are two types of registry transactions: KTM and lightweight. The former are transactions implemented on top of the tm.sys (Transaction Manager) driver, and they try to provide certain guarantees of transactional atomicity both during system run time and even across reboots. The latter, as the name suggests, are lightweight transactions that exist only in memory and whose task is to provide an easy and quick way to ensure that a given set of registry operations is applied atomically. As potential attackers, there are three parts of the interface that we are interested in the most: creating a transaction object, rolling back a transaction, and committing a transaction. The functions responsible for all three actions in each type of transaction are shown in the table below:

Operation            | KTM (API)           | KTM (system call)     | Lightweight (API) | Lightweight (system call)
---------------------|---------------------|-----------------------|-------------------|------------------------------
Create transaction   | CreateTransaction   | NtCreateTransaction   | -                 | NtCreateRegistryTransaction
Rollback transaction | RollbackTransaction | NtRollbackTransaction | -                 | NtRollbackRegistryTransaction
Commit transaction   | CommitTransaction   | NtCommitTransaction   | -                 | NtCommitRegistryTransaction

As we can see, the KTM has a public, documented API interface, which cannot be said for lightweight transactions that can only be used via syscalls. Their definitions, however, are not too difficult to reverse engineer, and they come down to the following prototypes:

NTSTATUS NtCreateRegistryTransaction(PHANDLE OutputHandle, ACCESS_MASK DesiredAccess, POBJECT_ATTRIBUTES ObjectAttributes, ULONG Reserved);
NTSTATUS NtRollbackRegistryTransaction(HANDLE Handle, ULONG Reserved);
NTSTATUS NtCommitRegistryTransaction(HANDLE Handle, ULONG Reserved);


Upon the creation of a transaction object, whether of type TmTransactionObjectType (KTM) or CmRegistryTransactionType (lightweight), its subsequent usage becomes straightforward. The transaction handle is passed to either the RegOpenKeyTransacted or the RegCreateKeyTransacted function, yielding a key handle. The key's internal properties, specifically the key body structure, will reflect its transactional nature. Operations on this key proceed identically to the non-transactional case, using the same functions. However, changes are temporarily confined to the transaction context, isolated from the global registry view. Upon the completion of all transactional operations, the user may elect either to discard the changes via a rollback, or apply them atomically through a commit. From the developer's perspective, this interface is undeniably convenient.

From an attack surface perspective, there's a substantial amount of code underlying the transaction functionality. Firstly, the handler for each base operation includes code to verify that the key isn't locked by another transaction, to allocate and initialize a UoW (unit of work) object, and then write it to the internal structures that describe the transaction. Secondly, to maintain consistency with the new functionality, the existing non-transactional code must first abort all transactions associated with a given key before it can be modified.

But that's not the end of the story. The commit process itself is also complicated, as it must cleverly circumvent various registry limitations resulting from its original design. In 2023, most of the code responsible for KTM transactions was removed as a result of CVE-2023-32019, but there is still a second engine that was initially responsible for lightweight transactions and now handles all of them. It consists of two stages: "Prepare" and "Commit". During the prepare stage, all steps that could potentially fail are performed, such as allocating all necessary cells in the target hive. Errors are allowed and correctly handled in the prepare stage, because the globally visible state of the registry does not change yet. This is followed by the commit stage, which is designed so that nothing can go wrong – it no longer performs any dynamic allocations or other complex operations, and its whole purpose is to update values in both the hive and the kernel descriptors so that transactional changes become globally visible. The internal prepare handlers for each individual operation have names starting with "CmpLightWeightPrepare" (e.g., CmpLightWeightPrepareAddKeyUoW), while the corresponding commit handlers start with "CmpLightWeightCommit" (e.g., CmpLightWeightCommitAddKeyUoW). These are the two main families of functions that are most interesting from a vulnerability research perspective. In addition to them, it is also worth analyzing the rollback functionality, which is used both when the rollback is requested directly by the user and when an error occurs in the prepare stage. This part is mainly handled by the CmpTransMgrFreeVolatileData function.
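The prepare/commit split can be illustrated outside the kernel. Below is a minimal, platform-neutral C++ sketch of the same idea (all types and names are hypothetical, not the kernel's real structures): the prepare stage performs every fallible step, here modeled as allocation, while the commit stage only publishes already-prepared state and therefore has no failure paths.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <new>
#include <string>
#include <vector>

// Hypothetical in-memory "hive": maps key paths to value data.
using Hive = std::map<std::string, std::string>;

// A staged "unit of work": the cell is allocated during prepare, and
// only linked into the globally visible hive during commit.
struct StagedSetValue {
  std::string Path;
  std::unique_ptr<std::string> Cell;  // allocated in the prepare stage
};

class Transaction {
 public:
  explicit Transaction(Hive& hive) : hive_(hive) {}

  // Prepare stage: every fallible operation (here, allocation) happens
  // now, while the globally visible hive is still untouched.
  bool PrepareSetValue(const std::string& path, const std::string& data) {
    std::unique_ptr<std::string> cell(new (std::nothrow) std::string(data));
    if (!cell) return false;  // failure here aborts cleanly, hive untouched
    staged_.push_back({path, std::move(cell)});
    return true;
  }

  // Commit stage: no allocations and no failure paths; just publish the
  // already-prepared cells into the globally visible hive.
  void Commit() {
    for (auto& uow : staged_) hive_[uow.Path] = std::move(*uow.Cell);
    staged_.clear();
  }

  // Rollback: discard the staged state; the hive was never modified.
  void Rollback() { staged_.clear(); }

 private:
  Hive& hive_;
  std::vector<StagedSetValue> staged_;
};
```

Until Commit() runs, readers of the hive never observe a partial update, and a failure during any prepare step can be handled by a clean rollback.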

Layered keys

Layered keys are the latest major change of this type in the Windows Registry, introduced in 2016. They overturned many fundamental assumptions that had been in place until then. A given logical key no longer consists solely of one key node and a maximum of one active KCB, but of a whole stack of these objects: from the layer height of the given hive down to layer zero, which is the base hive. A key that has a key node may in practice be non-existent (if marked as a tombstone), and vice versa, a key without a key node may logically exist if there is an existing key with the same name lower in its stack. In short, this whole containerization mechanism has doubled the complexity of every single registry operation, because:

  • Querying for information about a key has become more difficult, because instead of gathering information from just one key, it has to be potentially collected from many keys at once and combined into a coherent whole for the caller.
  • Performing any "write" operations has become more difficult because before writing any information to the key at a given nesting level, you first need to make sure that the key and all its ancestors in a given hive exist, which is done in a complicated process called "key promotion".
  • Deleting and renaming a key has become more difficult, because you always have to consider and correctly handle higher-level keys that rely on the one you are modifying. This is especially true for Merge-Unbacked keys, which do not have their own representation and only reflect the state of the keys at a lower level. This also applies to ordinary keys from hives under HKLM and HKU, which by themselves have nothing to do with differencing hives, but as an integral part of the registry hierarchy, they also have to correctly support this feature.
  • Performing security access checks on a key has become more challenging due to the need to accurately pinpoint the relevant security descriptor on the key stack first.
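A minimal sketch of the resolution rules above, under a drastically simplified model (the enum and function below are hypothetical, not kernel structures): each layer either contains a key node, a tombstone, or nothing at all, and a lookup walks the stack from the top layer down to the base hive.

```cpp
#include <cassert>
#include <vector>

// Hypothetical per-layer state of a single key path. In the real kernel
// this information is spread across key nodes in the hive stack.
enum class LayerState { NotPresent, Key, Tombstone };

// Walk the stack from the topmost layer (highest index) down to the
// base hive (index 0) and decide whether the key logically exists.
bool KeyLogicallyExists(const std::vector<LayerState>& stack) {
  for (auto it = stack.rbegin(); it != stack.rend(); ++it) {
    switch (*it) {
      case LayerState::Key:        return true;   // found a backing key node
      case LayerState::Tombstone:  return false;  // deletion masks lower layers
      case LayerState::NotPresent: break;         // fall through to lower layer
    }
  }
  return false;  // no layer backs this key
}
```

This captures the two counterintuitive cases from the list: a key with a key node may be logically non-existent (tombstone on top), and a layer with no key node may still expose an existing key from below.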

Overall, the layered keys mechanism is so complex that it could warrant an entire blog post (or several) on its own, so I won't be able to explain all of its aspects here. Nevertheless, its existence will quickly become clear to anyone who starts reversing the registry implementation. The code related to this functionality can be identified in many ways, for example:

  • By references to functions that initialize the key node stack / KCB stack objects (i.e., CmpInitializeKeyNodeStack, CmpStartKcbStack, and CmpStartKcbStackForTopLayerKcb),
  • By dedicated functions that implement a given operation specifically on layered keys that end with "LayeredKey" (e.g., CmDeleteLayeredKey, CmEnumerateValueFromLayeredKey, CmQueryLayeredKey),
  • By references to the KCB.LayerHeight field, which is very often used to determine whether the code is dealing with a layered key (height greater than zero) or a base key (height equal to zero).

I encourage those interested in further exploring this topic to read Microsoft's Containerized Configuration patent (US20170279678A1), the "Registry virtualization" section in Chapter 10 of Windows Internals (Part 2, 7th Edition), as well as my previous blog post #6, where I briefly described many internal structures related to layered keys. All of these references are great resources that can provide a good starting point for further analysis.

When it comes to layered keys in the context of attack entry points, it's important to note that loading custom differencing hives in Windows is not straightforward. As I wrote in blog post #4, loading this type of hive is not possible at all through any standard NtLoadKey-family syscall. Instead, it is done by sending an undocumented IOCTL 0x220008 to \Device\VRegDriver, which then passes this request on to an internal kernel function named CmLoadDifferencingKey. Therefore, the first obstacle is that in order to use this IOCTL interface, one would have to reverse engineer the layout of its corresponding input structure. Fortunately, I have already done it and published it in the blog post under the VRP_LOAD_DIFFERENCING_HIVE_INPUT name. However, a second, much more pressing problem is that communicating with the VRegDriver requires administrative rights, so it can only be used for testing purposes, but not in practical privilege escalation attacks.

So, what options are we left with? Firstly, there are potential scenarios where the exploit is packaged in a mechanism that legitimately uses differencing hives, e.g., an MSIX-packaged application running in an app silo, or a specially crafted Docker container running in a server silo. In such cases, we provide our own hives by design, which are then loaded on the victim’s system on our behalf when the malicious program or container is started. The second option is to simply ignore the inability to load our own hive and use one already present in the system. In a default Windows installation, many built-in applications use differencing hives, and the \Registry\WC key can be easily enumerated and opened without any problems (unlike \Registry\A). Therefore, if we launch a program running inside an app silo (e.g., Notepad) as a local user, we can then operate on the differencing hives loaded by it. This is exactly what I did in most of my proof-of-concept exploits related to this functionality. Of course, it is possible that a given bug will require full binary control over the differencing hive in order to trigger it, but this is a relatively rare case: of the 10 vulnerabilities I identified in this code, only two of them required such a high degree of control over the hive.

Alternative registry attack targets

The most crucial attack surface associated with the registry is obviously its implementation within the Windows kernel. However, other types of software interact with the registry in many ways and can be also prone to privilege escalation attacks through this mechanism. They are discussed in the following sections.

Drivers implementing registry callbacks

Another area where potential registry-related security vulnerabilities can be found is Registry Callbacks. This mechanism, first introduced in Windows XP and still present today, provides an interface for kernel drivers to log or interfere with registry operations in real-time. One of the most obvious uses for this functionality is antivirus software, which relies on registry monitoring. Microsoft, aware of this need but wanting to avoid direct syscall hooking by drivers, was compelled to provide developers with an official, documented API for this purpose.

From a technical standpoint, callbacks can be registered using either the CmRegisterCallback function or its more modern version, CmRegisterCallbackEx. The documentation for these functions serves as a good starting point for exploring the mechanism, as it seamlessly leads to the documentation of the callback function itself, and from there to the documentation of all the structures that describe the individual operations. Generally speaking, callbacks can monitor virtually any type of registry operation, both before ("pre" callbacks) and after ("post" callbacks) it is performed. They can be used to inspect what is happening in the system and log the details of specific events of interest. Callbacks can also influence the outcome of an operation. In "pre" notifications, they can modify input data or completely take control of the operation and return arbitrary information to the caller while bypassing the standard operation logic. During "post" notification handling, it is possible to influence both the status returned to the user and the output data. Overall, depending on the amount and types of operations supported in a callback, a completely error-free implementation can be really difficult to write. It requires excellent knowledge of the inner workings of the registry, as well as a very thorough reading of the documentation related to callbacks. The contracts that exist between the Windows kernel and the callback code can be very complicated, so in addition to the sources mentioned above, it's also worth reading the entire separate series of seven articles detailing various callback considerations, titled Filtering Registry Calls.

Here are some examples of things that can go wrong in the implementation of callbacks:

  • Standard user-mode memory access bugs. As per the documentation (refer to the table at the bottom of the Remarks section), pointers to output data received in "post" type callbacks contain the original user-mode addresses passed to the syscall by the caller. This means that if the callback wants to reference this data in any way, the only guarantee it has is that these pointers have been previously probed. However, it is still important to access this memory within a try/except block and to avoid potential double-fetch vulnerabilities by always copying the data to a kernel-mode buffer first before operating on it.
  • A somewhat related but higher-level issue is excessive trust in the output data structure within "post" callbacks. The problem is that some registry syscalls return data in a strictly structured way, and since the "post" callback executes before returning to user mode, it might seem safe to trust that the output data conforms to its documented format (if one wants to use or slightly modify it). An example of such a syscall is NtQueryKey, which returns a specific structure for each of the several possible information classes. In theory, it would appear that a malicious program has not yet had the opportunity to modify this data, and it should still be valid when the callback executes. In practice, however, this is not the case, because the output data has already been copied to user-mode, and there may be a parallel user thread modifying it concurrently. Therefore, it is very important that if one wants to use the output data in the "post" callback, they must first fully sanitize it, assuming that it may be completely arbitrary and is as untrusted as any other input data.
  • Moving up another level, it's important to prevent confused deputy problems that exploit the fact that callback code runs with kernel privileges. For example, if a callback wanted to redirect access to certain registry paths to another location, and it used the ZwCreateKey call without the OBJ_FORCE_ACCESS_CHECK flag to do so, it would allow an attacker to create keys in locations where they normally wouldn't have access.
  • Bugs in the emulation of certain operations in "pre"-type callbacks. If a callback decides to handle a given request on its own and signal this to the kernel by returning the STATUS_CALLBACK_BYPASS code, it is responsible for filling all important fields in the corresponding REG_XXX_KEY_INFORMATION structure so that, in accordance with the expected syscall behavior, the output data is correctly returned to the caller (source: "When a registry filtering driver's RegistryCallback routine receives a pre-notification [...]" and "Alternatively, if the driver changes a status code from failure to success, it might have to provide appropriate output parameters.").
  • Bugs in "post"-type callbacks that change an operation's status from success to failure. If we want to block an operation after it has already been executed, we must remember that it has already occurred, with all its consequences and side effects. To successfully pretend that it did not succeed, we would have to reverse all its visible effects for the user and release the resources allocated for this purpose. For some operations, this is very difficult or practically impossible to do cleanly, so I would personally recommend only blocking operations at the "pre" stage and refraining from trying to influence their outcome at the "post" stage (source: "If the driver changes a status code from success to failure, it might have to deallocate objects that the configuration manager allocated.").
  • Challenges presented by error handling within "post"-type callbacks. As per the documentation, the kernel only differentiates between a STATUS_CALLBACK_BYPASS return value and all others, which means that it doesn't really discern callback success or failure. This is somewhat logical since, at this stage, there isn't a good way to handle failures – the operation has already been performed. On the other hand, it may be highly unintuitive, as the Windows kernel idiom "if (!NT_SUCCESS(Status)) { return Status; }" becomes ineffective here. If an error is returned, it won't propagate to user mode, and will only cause premature callback exit, potentially leaving some important operations unfinished. To address this, you should design "post" callbacks to be inherently fail-safe (e.g., include no dynamic allocations), or if this isn't feasible, implement error handling cautiously, ensuring that minor operation failures don't compromise the callback's overall logical/security guarantees.
  • Issues surrounding the use of a key object pointer passed to the callback, in one of a few specific scenarios where it can have a non-NULL value but not point to a valid key object. This topic is explored in a short article in Microsoft Learn: Invalid Key Object Pointers in Registry Notifications.
  • Issues in open/create operation callbacks due to missing or incorrect handling of symbolic links and other redirections, which are characterized by the return values STATUS_REPARSE and STATUS_REPARSE_GLOBAL.
  • Bugs that result from a lack of transaction support where it is needed. This could be an incorrect assumption that every operation performed on the registry is non-transactional and its effect is visible immediately, and not only after the transaction is committed. The API function that is used to retrieve the transaction associated with a given key (if it exists) during callback execution is CmGetBoundTransaction.
  • Issues arising from using the older API version, CmCallbackGetKeyObjectID, instead of the newer CmCallbackGetKeyObjectIDEx. The older version has some inherent problems discussed in the documentation, such as returning an outdated key path if the key name has been changed by an NtRenameKey operation.
  • Issues stemming from an overreliance on the CmCallbackGetKeyObjectID(Ex) function to retrieve a key's full path. A local user can cause these functions to deterministically fail by creating and operating on a key with a path length exceeding 65535 bytes (the maximum length of a string represented by the UNICODE_STRING structure). This can be achieved using the key renaming trick described in CVE-2022-37990, and results in the CmCallbackGetKeyObjectID(Ex) function returning the STATUS_INSUFFICIENT_RESOURCES error code. This is problematic because the documentation for this function does not mention this error code, and there is no way to defend against it from the callback's perspective. The only options are to avoid relying on retrieving the full key path altogether, or to implement a defensive fallback plan if this operation fails.
  • Logical bugs arising from attempts to block access to certain registry keys by path, but neglecting the key rename operation, which can change the key's name dynamically and bypass potential filtering logic in the handling of the open/create operations. Notably, it's difficult to blame developers for such mistakes, as even the official documentation discourages handling NtRenameKey operations, citing its high complexity (quote: "Several registry system calls are not documented because they are rarely used [...]").
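The first two pitfalls above boil down to a single defensive pattern: fetch the untrusted buffer exactly once into a private copy, then validate and use only that copy. A platform-neutral C++ sketch of the pattern (the record layout and function names are hypothetical):

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Hypothetical wire format of an untrusted output buffer: a length
// field followed by that many bytes of payload.
struct UntrustedRecord {
  unsigned Length;
  char Data[64];
};

// Safe pattern: snapshot the whole record with a single fetch, validate
// only the snapshot, and never touch the shared copy again. Re-reading
// Length from the shared buffer after the check would be a classic
// double fetch: a concurrent writer could grow it between check and use.
bool CaptureRecord(const UntrustedRecord* shared, std::vector<char>* out) {
  UntrustedRecord local;
  // Single fetch: copy the record before performing any checks.
  std::memcpy(&local, shared, sizeof(local));
  // Validate only the private snapshot; the shared buffer may change
  // under our feet at any time.
  if (local.Length > sizeof(local.Data)) return false;
  out->assign(local.Data, local.Data + local.Length);
  return true;
}
```

In a real "post" callback the copy would additionally happen inside a try/except block, since the source pointer is user-mode memory.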

As we can see, developers using these types of callbacks can fall into many traps, and the probability of introducing a bug increases with the complexity of the callback's logic.

As a security researcher, there are two approaches to enumerating this attack surface to find vulnerable callbacks: static and dynamic. The static approach involves searching the file system (especially C:\Windows\system32\drivers) for the "CmRegisterCallback" string, as every driver that registers a callback must refer to this function or its "Ex" equivalent. As for the dynamic approach, the descriptors of all callbacks in the system are linked together in a doubly-linked list that begins in the global nt!CallbackListHead object. Although the structure of these descriptors is undocumented, my analysis indicates that the pointer to the callback function is located at offset 0x28 in Windows 11. Therefore, all callbacks registered in the system at a given moment can be listed using the following WinDbg command:

0: kd> !list -x "dqs @$extret+0x28 L1" CallbackListHead
fffff801`c42f6cd8  fffff801`c42f6cd0 nt!CmpPreloadedHivesList
ffffdc88`d377e418  fffff801`56a48df0 WdFilter!MpRegCallback
ffffdc88`d8610b38  fffff801`59747410 applockerfltr!SmpRegistryCallback
ffffdc88`d363e118  fffff801`57a05dd0 UCPD+0x5dd0
ffffdc88`ed11d788  fffff801`c3c2ba50 nt!VrpRegistryCallback
ffffdc88`d860c758  fffff801`597510c0 bfs!BfsRegistryCallback


As shown, even on a clean Windows 11 system, the operating system and its drivers register a substantial number of callbacks. In the listing above, the first line of output can be ignored, as it refers to the nt!CallbackListHead object, which is the beginning of the list and not a real callback descriptor. The remaining functions are associated with the following modules:

  • WdFilter!MpRegCallback: a callback registered by Windows Defender, the default antivirus engine running on Windows.
  • applockerfltr!SmpRegistryCallback: a callback registered by the Smartlocker Filter Driver, which is one of the drivers that implement the AppLocker/SmartLocker functionality at the kernel level.
  • UCPD+0x5dd0: a callback associated with the UCPD.sys driver, which expands to "User Choice Protection Driver". This is a module that prevents third-party software from modifying the default application settings for certain file types and protocols, such as web browsers and PDF readers. As we can infer from the format of this symbol and its unresolved name, Microsoft does not currently provide PDB debug symbols for the executable image, but some information online indicates that such symbols were once available for older builds of the driver.
  • nt!VrpRegistryCallback: a callback implemented by the VRegDriver, which is part of the core Windows kernel executable image, ntoskrnl.exe. It plays a crucial role in the system, as it is responsible for redirecting key references to their counterparts within differencing hives for containerized processes. It is likely the most interesting and complex callback registered by default in Windows.
  • bfs!BfsRegistryCallback: a callback registered by the Brokering File System driver. The driver is primarily responsible for supporting secure file access for applications running in an isolated environment (AppContainers). However, it also has a relatively simple registry callback that supports key opening/creation operations. It is not entirely clear why the functionality wasn't simply incorporated into the VrpRegistryCallback, which serves a very similar purpose.

In my research, I primarily focused on reviewing the callback invocations in individual registry operations (specifically calls to the CmpCallCallBacksEx function), and on the correctness of the VrpRegistryCallback function implementation. As a result, I discovered CVE-2023-38141 in the former area, and three further bugs in the VRegDriver (CVE-2023-38140, CVE-2023-36803 and CVE-2023-36576). These reports serve as a very good example of the many types of problems that can occur in registry callbacks.

Privileged registry clients: programs and drivers

The final registry-related attack targets are the highly privileged clients of this interface, that is, user-mode processes running with administrator/system rights, and kernel drivers that operate on the registry. The registry is a shared resource by design, and apart from app hives mounted under the special \Registry\A key, every program in the system can refer to any active key as long as it has the appropriate permissions. For a malicious user, this means two things: they can try to exploit weaknesses exhibited by other processes when interacting with the registry, and they can try to actively interfere with those processes. I can personally imagine two main types of issues related to incorrect use of the registry, and both of them are quite high-level by nature.

The first concern is related to the fact that the registry, as a part of the NT Object Manager model, undergoes standard access control through security access checks. Each registry key is mandatorily assigned a specific security descriptor. Therefore, as the name implies, it is crucial for system security that each key's descriptor has the minimum permissions required for proper functionality, while aligning with the author's intended security model for the application.

From a technical perspective, a specific security descriptor for a given key can be set either during its creation through the lpSecurityAttributes argument of RegCreateKeyExW, or separately by calling the RegSetKeySecurity API. If no descriptor is explicitly set, the key assumes a default descriptor based largely on the security settings of its parent key. This model makes sense from a practical standpoint. It allows most applications to avoid dealing with the complexities of custom security descriptors, while still maintaining a reasonable level of security, as high-level keys in Windows typically have well-configured security settings. Consider the well-known HKLM\Software tree, where Win32 applications have stored their global settings for many years. The assumption is that ordinary users have read access to the global configuration within that tree, but only administrators can write to it. If an installer or application creates a new subkey under HKLM\Software without explicitly setting a descriptor, it inherits the default security properties, which is sufficient in most cases.

However, certain situations require extra care to properly secure registry keys. For example, if an application stores highly sensitive data (e.g., user passwords) in the registry, it is important to ensure that both read and write permissions are restricted to the smallest possible group of users (e.g., administrators only). Additionally, when assigning custom security descriptors to keys in global system hives, you should exercise caution to avoid inadvertently granting write permissions to all system users. Furthermore, if a user has KEY_CREATE_LINK access to a global key used by higher-privileged processes, they can create a symbolic link within it, potentially resulting in a "confused deputy" problem and the ability to create registry keys under any path. In summary, for developers creating high-privilege code on Windows and utilizing the registry, it is essential to carefully handle the security descriptors of the keys they create and operate on. From a security researcher's perspective, it could be useful to develop tooling to list all keys that allow specific access types to particular groups in the system and run it periodically on different Windows versions and configurations. This approach can lead to some very easy bug discoveries, as it doesn't require any time spent on reverse engineering or code auditing.
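As a rough illustration of what such tooling would check, the sketch below models a DACL as an ordered list of simplified ACEs and asks whether a given group is granted a write mask. This is a deliberately reduced model of my own (real Windows access checks additionally involve SIDs, inheritance, generic-right mapping, empty-DACL semantics, and canonical ACE ordering).

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// A drastically simplified ACE: real SIDs and inheritance flags are
// omitted, and principals are plain strings.
struct Ace {
  bool Allow;             // access-allowed vs. access-denied ACE
  std::string Principal;  // e.g., "Everyone", "Administrators"
  uint32_t Mask;          // granted/denied rights bits
};

constexpr uint32_t kKeySetValue = 0x0002;  // mirrors KEY_SET_VALUE

// Evaluate ACEs in order: a deny ACE covering any requested bit fails
// the check immediately, while allow ACEs accumulate granted bits until
// the full requested mask is satisfied.
bool GrantsAccess(const std::vector<Ace>& dacl,
                  const std::string& principal, uint32_t requested) {
  uint32_t granted = 0;
  for (const Ace& ace : dacl) {
    if (ace.Principal != principal) continue;
    if (!ace.Allow && (ace.Mask & requested)) return false;
    if (ace.Allow) granted |= (ace.Mask & requested);
    if ((granted & requested) == requested) return true;
  }
  return false;
}
```

A scanner built on this idea would walk the registry, evaluate each key's DACL for broad groups such as Everyone, and flag keys where write access is granted unexpectedly.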

The second type of issue is more subtle and arises because a single "configuration unit" in the registry sometimes consists of multiple elements (keys, values) and must be modified atomically to prevent an inconsistent state and potential vulnerabilities. For such cases, the registry supports transactions. If a given process manages configuration that is critical to system security, and whose different elements must always be consistent with each other, then making use of the Transacted Registry (TxR) is practically mandatory. A significantly worse, though somewhat acceptable, solution may be to implement custom rollback logic, i.e., in the event of a failure of some individual operation, manually reversing the changes that have been applied so far. The worst case is a privileged program that doesn't appreciate the seriousness of introducing partial changes to the registry and uses the API in a typical best-effort manner: calling Win32 functions as long as they succeed, and when any of them returns an error, simply passing it up to the caller without any additional cleanup.

Let's consider this bug class on the example of a hypothetical service that, through some local inter-process communication interface, allows users to register applications for startup. It creates a key structure under the HKLM\Software\CustomAutostart\<Application Name> path, and for each such key it stores two values: the command line to run during system startup ("CommandLine"), and the username with whose privileges to run it ("UserName"). If the username value does not exist, it implicitly assumes that the program should start with system rights. Of course, the example service intends to be secure, so it only allows setting the username to the one corresponding to the security token of the requesting process. Operations on the registry take place in the following order:

  1. Create a new key named HKLM\Software\CustomAutostart\<Application Name>,
  2. Set the "CommandLine" value to the string provided by the client,
  3. Set the "UserName" value to the string provided by the client.

The issue with this logic is that it's not transactional – if an error occurs, the execution simply aborts, leaving the partial state behind. For example, if operation #3 fails for any reason, an entry will be added to the autostart indicating that a controlled path should be launched with system rights. This directly leads to privilege escalation and was certainly not the developer's intention. One might wonder why any of these operations would fail, especially in a way controlled by an attacker. The answer is simple and was explained in the "Susceptibility to mishandling OOM conditions" section. A local attacker has at least two ways of influencing the success or failure of registry operations in the system: by filling the space of the hive they want to attack (if they have write access to at least one of its keys) or by occupying the global registry quota in memory, represented by the global nt!CmpGlobalQuota variable. Unfortunately, finding such vulnerabilities is more complicated than simply scanning the entire registry for overly permissive security descriptors. It requires identifying candidates of registry operations in the system that have appropriate characteristics (high privilege process, lack of transactionality, sensitivity to a partial/incomplete state), and then potentially reverse-engineering the specific software to get a deeper understanding of how it interacts with the registry. Tools like Process Monitor may come in handy at least in the first part of the process.

One example of a vulnerability related to the incorrect guarantee of atomicity of system-critical structures is CVE-2024-26181. As a result of exhausting the global registry quota, it could lead to permanent damage to the HKLM\SAM hive, which stores particularly important information about users in the system, their passwords, group memberships, etc.

Vulnerability primitives

In this chapter, we will focus on classifying registry vulnerabilities based on the primitives they offer, and briefly discuss their practical consequences and potential exploitation methods.

Pool memory corruption

Pool memory corruption is probably the most common type of low-level vulnerability in the Windows kernel. In the context of the registry, this bug class is somewhat rarer than in other ring-0 components, but it certainly still occurs. It manifests in its most "pure" form when the corruption happens within an auxiliary object that is temporarily allocated on the pools to implement a specific operation. One such example is a report concerning three vulnerabilities—CVE-2022-37990, CVE-2022-38038, and CVE-2022-38039—all stemming from a fairly classic 16-bit integer overflow when calculating the length of a dynamically allocated buffer. Another example is CVE-2023-38154, where the cause of the buffer overflow was slightly more intricate and originated from a lack of error handling in one of the functions responsible for recovering the hive state from LOG files.

The second type of pool memory corruption that can occur in the registry involves problems with managing long-lived objects that are used to cache some information from the hive mapping in more readily accessible pool memory — such as those described in post #6. In this case, we are usually dealing with UAF-type conditions, like releasing an object while there are still some active references to it. If I had to point to one object most prone to this type of bug, it would probably be the Key Control Block, which is reference counted, used by the implementation of almost every registry syscall, and subject to some very strong invariants critical for memory safety (e.g., the existence of only one KCB for a particular key in the global KCB tree). One issue related to KCBs was CVE-2022-44683, which resulted from incorrect handling of predefined keys in the NtNotifyChangeMultipleKeys system call.

Another, slightly different category of UAFs on pools covers situations in which this type of condition is not a direct consequence of a vulnerability, but more of a side effect. Let's take security descriptors as an example: they are located in the hive space, but the kernel also maintains a cache reflecting the state of these descriptors on the kernel pools (in _CMHIVE.SecurityCache and related fields). Therefore, if for some reason a security descriptor in the hive is freed prematurely, this problem will also be automatically reflected in the cache, and some keys may start to have a dangling KCB.CachedSecurity pointer set to the released object. I have taken advantage of this fact many times in my reports to Microsoft, because it was very useful for reliably triggering crashes. While generating a bugcheck based on the UAF of the _CM_KEY_SECURITY structure in the hive is possible, it is much more convoluted than simply turning on the Special Pool mechanism and making the kernel refer to the cached copy of the security descriptor (a few examples: CVE-2023-23421, CVE-2023-35382, CVE-2023-38139). In some cases, exploiting memory corruption on pools may also offer some advantages over exploiting hive-based memory corruption, so it is definitely worth remembering this behavior for the future.

When it comes to the strictly technical aspects of kernel pool exploitation, I won't delve into it too deeply here. I didn't specifically focus on it in my research, and there aren't many interesting registry-specific details to mention in this context. If you are interested to learn more about this topic, please refer to the resources available online.

Hive memory corruption

The second type of memory corruption encountered in the registry is hive-based memory corruption. This class of bugs is unique to the registry and stems from the fact that data stored in hives serves a dual role: the same byte layout both persists the information on disk and acts as the in-memory representation of the hive. The data is then operated on using C code through pointers, helper functions like memcpy, and so on. Given all this, it doesn't come as a surprise that classic vulnerabilities such as buffer overflows or use-after-free can also occur within this region.

So far, during my research, I have managed to find 17 hive-based memory corruption issues, which constitutes approximately 32% of all 53 vulnerabilities that have been fixed by Microsoft in security bulletins. The vast majority of them were related to just two mechanisms – reference counting security descriptors and operating on subkey lists – but there were also cases of bugs related to other types of objects.

I have started using the term "inconsistent hive state" to refer to any situation where the regf format state either ceases to be internally consistent or stops accurately reflecting cached copies of the same data within other kernel objects. I described one such issue here, where the _CM_BIG_DATA.Count field stops correctly corresponding to the _CM_KEY_VALUE.DataLength field for the same registry value. However, despite this specific behavior being incorrect, according to both my analysis and Microsoft's, it doesn't have any security implications for the system. In this context, the term "hive-based memory corruption" denotes a slightly narrower group of issues that not only allow reaching any inconsistent state but specifically enable overwriting valid regf structures with attacker-controlled data.

The general scheme for exploiting hive-based memory corruption closely resembles the typical exploitation of any other memory corruption. The attacker's initial objective is to leverage the available primitive and manipulate memory allocations/deallocations to overwrite a specific object in a controlled manner. On modern systems, achieving this stage reliably within the heap or kernel pools can be challenging due to allocator randomization and enforced consistency checks. However, the cell allocator implemented by the Windows kernel is highly favorable for the attacker: it lacks any safeguards, and its behavior is entirely deterministic, which greatly simplifies this stage of exploit development. One could even argue that, given the properties of this allocator, virtually any memory corruption primitive within the regf format can be transformed into complete control of the hive in memory with some effort.

With this assumption, let's consider what to do next. Even if we have absolute control over all the internal data of the mapped hive, we are still limited to its mapping in memory, which in itself does not give us much. The question arises as to how we can "escape" from this memory region and use hive memory corruption to overwrite something more interesting, like an arbitrary address in kernel memory (e.g., the security token of our process).

First of all, it is worth noting that such an escape is not always necessary – if the attack is carried out in one of the system hives (SOFTWARE, SYSTEM, etc.), we may not need to corrupt the kernel memory at all. In this case, we could simply perform a data-only attack and modify some system configuration, grant ourselves access to important system keys, etc. However, with many bugs, attacking a highly privileged hive is not possible. Then, the other option available to the attacker is to modify one of the cells to break some invariant of the regf format, and cause a second-order side effect in the form of a kernel pool corruption. Some random ideas are:

  • Setting too long a key name or inserting the illegal character '\' into the name,
  • Creating a fake exit node key,
  • Corrupting the binary structure of a security descriptor so that the internal APIs operating on them start misbehaving,
  • Crafting a tree structure within the hive with a depth greater than the maximum allowed (512 levels of nesting),
  • ... and many, many others.

However, during experiments exploring practical exploitation, I discovered an even better method that grants an attacker the ability to perform reliable arbitrary read and write operations in kernel memory—the ultimate primitive. This method exploits the behavior of 32-bit cell index values, which exhibit unusual behavior when they exceed the hive's total size. I won't elaborate on the full technique here, but for those interested, I discussed it during my presentation at the OffensiveCon conference in May 2024. The subject of exploiting hive memory corruption will be also covered in detail in its own dedicated blog post in the future.

Invalid cell indexes

This is a class of bugs that manifests directly when an incorrect cell index appears in an object—either in a cell within the hive or in a structure on kernel pools, like KCB. These issues can be divided into three subgroups, depending on the degree of control an attacker can gain over the cell index.

Cell index 0xFFFFFFFF (HCELL_NIL)

This is a special marker that indicates that a given structure member/variable of type HCELL_INDEX doesn't point to any specific cell, which is equivalent to a NULL pointer in C. There are many situations where the value 0xFFFFFFFF (in other words, -1) is used and even desired, e.g. to signal that an optional object doesn't exist and shouldn't be processed. The kernel code is prepared for such cases and correctly checks whether a given cell index is equal to this marker before operating on it. However, problems can arise when the value ends up in a place where the kernel always expects a valid index. Any mandatory field in a specific object can be potentially subject to this problem, such as the _CM_KEY_NODE.Security field, which must always point to a valid descriptor and should never be equal to -1 (other than for exit nodes).

Some examples of such vulnerabilities include:

  • CVE-2023-21772: an unexpected value of -1 being set in _CM_KEY_NODE.Security due to faulty logic in the registry virtualization code, which first freed the old descriptor and only then attempted to allocate a new one, which could fail, leaving the key without any assigned security descriptor.
  • CVE-2023-35357: an unexpected value of -1 being set in KCB.KeyCell, because the code assumed that it was operating on a physically existing base key, while in practice it could operate on a layered key with Merge-Unbacked semantics, which does not have its own key node, but relies solely on key nodes at lower levels of the key stack.
  • CVE-2023-35358: another case of an unexpected value of -1 being set in KCB.KeyCell, while the kernel expected that at least one key in the given key node stack would have an allocated key node object. The source of the problem here was incorrect integration of transactions and differencing hives.

When such a problem occurs, it always manifests as the value -1 being passed as the cell index to the HvpGetCellPaged function. For decades, this function completely trusted its parameters, assuming that the input cell index would always be within the bounds of the given hive. Consequently, calling HvpGetCellPaged with a cell index of 0xFFFFFFFF would result in the execution of the following code:

_CELL_DATA *HvpGetCellPaged(_HHIVE *Hive, HCELL_INDEX Index) {
  _HMAP_ENTRY *Entry = &Hive->Storage[1].Map->Directory[0x3FF]->Table[0x1FF];
  return (_CELL_DATA *)((Entry->PermanentBinAddress & ~0xF) + Entry->BlockOffset + 0xFFF + 4);
}


In other words, the function would refer to the Volatile (1) map cell, and within it, to the last element of the Directory and then the Table arrays. Considering the "small dir" optimization described in post #6, it becomes clear that this cell map walk could result in an out-of-bounds memory access within the kernel pools (beyond the boundaries of the _CMHIVE structure). Personally, I haven't tried to transform this primitive into anything more useful, but it seems evident that with some control over the kernel memory around _CMHIVE, it should theoretically be possible to get the HvpGetCellPaged function to return any address chosen by the attacker. Further exploitation prospects would largely depend on the subsequent operations that would be performed on such a fake cell, and the extent to which a local user could influence them. In summary, I've always considered these types of bugs as "exploitable on paper, but quite difficult to exploit in practice."

Ultimately, none of this matters much, because it seems that Microsoft noticed a trend in these vulnerabilities and, in July 2023, added a special condition to the HvpGetCellFlat and HvpGetCellPaged functions:

  if (Index == HCELL_NIL) {
    KeBugCheckEx(REGISTRY_ERROR, 0x32, 1, Hive, 0xFFFFFFFF);
  }


This basically means that the specific case of index -1 has been completely mitigated, since rather than allowing any chance of exploitation, the system now immediately shuts down with a Blue Screen of Death. As a result, the bug class no longer has any security implications. However, I do feel a bit disappointed – if Microsoft deemed the check sufficiently important to add to the code, they could have made it just a tiny bit stronger, for example:

  if ((Index & 0x7FFFFFFF) >= Hive->Storage[Index >> 31].Length) {
    KeBugCheckEx(...);
  }


The above check would reject all cell indexes exceeding the length of the corresponding storage type, and it is exactly what the HvpReleaseCellPaged function currently does. Checking this slightly stronger condition in one fell swoop would handle invalid indexes of -1 and completely mitigate the previously mentioned technique of out-of-bounds cell indexes. While not introduced yet, I still secretly hope that it will happen one day... 🙂

Dangling (out-of-date) cell indexes

Another group of vulnerabilities related to cell indexes are cases where, after a cell is freed, its index remains in an active cell within the registry. Simply put, these are cell-specific use-after-free conditions, so the category overlaps very closely with the previously described hive-based memory corruption.

Notable examples of such bugs include:

  • CVE-2022-37988: Caused by the internal HvReallocateCell function potentially failing when shrinking an existing cell, which its caller assumed was impossible.
  • CVE-2023-23420: A bug in the transactional key rename operation could lead to a dangling cell index in a key's subkey list, pointing to a freed key node.
  • CVE-2024-26182: Caused by mishandling a partial success situation where an internal function might successfully perform some operations on the hive (reallocate existing subkey lists) but ultimately return an error code, causing the caller to skip updating the _CM_KEY_NODE.SubKeyLists[...] field accordingly.
  • All use-after-free vulnerabilities in security descriptors due to incorrect reference counting: CVE-2022-34707, CVE-2023-28248, CVE-2023-35356, CVE-2023-35382, CVE-2023-38139, and CVE-2024-43641.

In general, UAF bugs within the hive are powerful primitives that can typically be exploited to achieve total control over the hive's internal data. The fact that both exploits I wrote to demonstrate practical exploitation of hive memory corruption vulnerabilities fall into this category (CVE-2022-34707, CVE-2023-23420) can serve as anecdotal evidence of this statement.

Fully controlled/arbitrary cell indexes

The last type of issues where cell indexes play a major role are situations in which the user somehow obtains full control over the entire 32-bit index value, which is then referenced as a valid cell by the kernel. Notably, this is not about some second-order effect of hive memory corruption, but vulnerabilities where this primitive is the root cause of the problem. Such situations happen relatively rarely, but there have been at least two such cases in the past:

  • CVE-2022-34708: missing verification of the _CM_KEY_SECURITY.Blink field in the CmpValidateHiveSecurityDescriptors function for the root security descriptor in the hive,
  • CVE-2023-35356: referencing the _CM_KEY_NODE.ValueList.List field in a predefined key, in which the ValueList structure has completely different semantics, and its List field can be set to an arbitrary value.

Given that the correctness of cell indexes is a fairly obvious requirement known to Microsoft kernel developers, they pay close attention to verifying them thoroughly. For this reason, I think that the chance we will have many more such bugs in the future is slim. As for their exploitation, they may seem similar in nature to the way hive memory corruption can be exploited with out-of-bounds cell indexes, but in fact, these are two different scenarios. With hive-based memory corruption, we can dynamically change the value of a cell index multiple times as needed, and here, we would only have one specific 32-bit value at our disposal. If, in a hypothetical vulnerability, some interesting operations were performed on such a controlled index, I would probably still reduce the problem to the typical UAF case, try to obtain full binary control over the hive, and continue from there.

Low-level information disclosure (memory, pointers)

Since the registry code is written in C and operates with kernel privileges, and additionally has not yet been completely rewritten to use zeroing ExAllocatePool functions, it is natural that it may be vulnerable to memory disclosure issues when copying output data to user-mode. The most canonical example of such a bug was CVE-2023-38140, where the VrpPostEnumerateKey function (one of the sub-handlers of the VRegDriver registry callback) allocated a buffer on kernel pools with a user-controlled length, filled it with some amount of data – potentially less than the buffer size – and then copied the entire buffer back to user mode, including uninitialized bytes at the end of the allocation.

However, besides this typical memory disclosure scenario, it is worth noting two more things in the context of the registry. One of them is that, as we know, the registry operates not only on memory but also on various files on disk, and therefore the filesystem becomes another type of data sink where data leakage can also occur. And so, for example, in CVE-2022-35768, kernel pool memory could be disclosed directly to the hive file due to an out-of-bounds read vulnerability, and in CVE-2023-28271, both uninitialized data and various kernel-mode pointers were leaked to KTM transaction log files.

The second interesting observation is that the registry implementation does not have to be solely the source of the data leak, but can also be just a medium through which it happens. There is a certain group of keys and values that are readable by ordinary users and initialized with binary data by the kernel and drivers using ZwSetValueKey and similar functions. Therefore, there is a risk that some uninitialized data may leak through this channel, and indeed during my Bochspwn Reloaded research in 2018, I identified several instances of such leaks, such as CVE-2018-0898, CVE-2018-0899, and CVE-2018-0900.

Broken security guarantees, API contracts and common sense assumptions

Besides maintaining internal consistency and being free of low-level bugs, it's also important that the registry behaves logically and predictably, even under unusual conditions. It must adhere to the overall security model of Windows NT, operate in accordance with its public documentation, and behave in a way that aligns with common sense expectations. Failure to do so could result in various problems in the client software that interacts with it, but identifying such deviations from expected behavior can be challenging, as it requires deep understanding of the interface's high-level principles and the practical implications of violating them.

In the following subsections, I will discuss a few examples of issues where the registry's behavior was inconsistent with documentation, system architecture, or common sense.

Security access rights enforcement

The registry implementation must enforce security checks, meaning it must verify appropriate access rights to a key when opening it, and then again when performing specific operations on the obtained handle. Generally, the registry manages this well in most cases. However, there were two bugs in the past that allowed a local user to perform certain operations that they theoretically didn't have sufficient permissions for:

  • CVE-2023-21750: Due to a logic bug in the CmKeyBodyRemapToVirtual function (related to registry virtualization), it was possible to delete certain keys within the HKLM\Software hive with only KEY_READ and KEY_SET_VALUE rights, without the normally required DELETE right.
  • CVE-2023-36404: In this case, it was possible to gain access to the values of certain registry keys despite lacking appropriate rights. The attack itself was complex and required specific circumstances: loading a differencing hive overlaid on a system hive with a specially crafted key structure, and then having a system component create a secret key in that system hive. Because of the fact that the handle to the layered key would be opened earlier (and the security access check would be performed at that point in time), creating a new key at a lower level with more restricted permissions wouldn't be considered later, leading to potential information disclosure.

As shown, both these bugs were directly related to incorrect or missing permissions verification, but they weren't particularly attractive in terms of practical attacks. A much more appealing bug was CVE-2019-0881, discovered in registry virtualization a few years earlier by James Forshaw. That vulnerability allowed unprivileged users to read every registry value in the system regardless of the user's privileges, which is about as powerful as a registry infoleak can get.

Confused deputy problems with predefined keys

Predefined keys probably don't need any further introduction at this point in the series. In this specific case of the confused deputy problem, the bug report for CVE-2023-35633 captures the essence of the issue well: if a local attacker had binary control over a hive, they could cause the use of an API like RegOpenKeyExW on any key within that hive to return one of the predefined pseudo-handles like HKEY_LOCAL_MACHINE, HKEY_CURRENT_USER, etc., instead of a normal handle to that key. This behavior was undocumented and unexpected for developers using the registry in their code. Unsurprisingly, finding a privileged process that did something interesting with a user-controlled hive wasn't that hard, and it turned out that there was indeed a service in Windows that opened a key inside the HKCU of each logged-in user, and recursively set permissive access rights on that key. By abusing predefined handles, it was possible to redirect the operation and grant ourselves full access to one of the global keys in the system, leading to a fairly straightforward privilege escalation. If you are interested in learning more about the bug and its practical exploitation, please refer to my Windows Registry Deja Vu: The Return of Confused Deputies presentation from CONFidence 2024. In many ways, this attack was a resurrection of a similar confused deputy problem, CVE-2010-0237, which I had discovered together with Gynvael Coldwind. The main difference was that at that time, the redirection of access to keys was achieved via symbolic links, a more obvious and widely known mechanism.

Atomicity of KTM transactions

The main feature of any transaction implementation is that it should guarantee atomicity – that is, either apply all changes being part of the transaction, or none of them. Imagine my surprise then, when I discovered that the registry transaction implementation integrated with the KTM did not guarantee atomicity at all, but merely tried really hard to maintain it. The main problem was that it wasn't designed to handle OOM errors (for example, when a hive was completely full) and, as a result, when such a problem occurred in the middle of committing a transaction, there was no good way to reverse the changes already applied. The Configuration Manager falsely returned a success code to the caller, while retrying to commit the remaining part of the transaction every 30 seconds, hoping that some space would free up in the registry in the meantime, and the operations would eventually succeed. This type of behavior obviously contradicted both the documentation and common sense about how transactions should work.

I reported this issue as CVE-2023-32019, and Microsoft fixed it by completely removing a large part of the code that implemented this functionality, as it was simply impossible to fix correctly without completely redesigning it from scratch. Fortunately, in Windows 10, an alternative transaction implementation for the registry called lightweight transactions was introduced, which was designed correctly and did not have the same problem. As a result, a decision was made to internally redirect the handling of KTM transactions within the Windows kernel to the same engine that is responsible for lightweight transactions.

Containerized registry escapes

The general goal of differencing hives and layered keys is to implement registry containerization. This mechanism creates an isolated registry view for a specific group of processes, without direct access to the host registry (a sort of "chroot" for the Windows registry). Unfortunately, there isn't much official documentation on this topic, and it's particularly difficult to find information on whether this type of containerization is a Microsoft-supported security boundary that warrants fixes in the monthly security bulletins. I think it is reasonable to expect that since the mechanism is used to isolate the registry in well-supported use cases (such as running Docker containers), it should ideally not be trivial to bypass, but I was unable to find any official statement to support or refute this assumption.

When I looked further into it, I discovered that the redirection of registry calls within containerized environments was managed by registry callbacks, specifically one called VrpRegistryCallback. While callbacks do indeed seem well suited for this purpose, the devil is in the details – specifically, error handling. I found at least two ways a containerized application could trigger an error during the execution of the internal VrpPreOpenOrCreate/VrpPostOpenOrCreate handlers. This resulted in exiting the callback prematurely while an important part of the redirection logic still hadn't been executed, and consequently led to the process gaining access to the host's registry view. Additionally, I found that another logical bug allowed access to the host's registry through differencing hives associated with other active containers in the system.

As I mentioned, I wasn't entirely clear on the state of Microsoft's support for this mechanism, but luckily I didn't have to wonder for too long. It turned out that James Forshaw had a similar dilemma and managed to reach an understanding with the vendor on the matter, which he described in his blog post.

After much back and forth with various people in MSRC a decision was made. If a container escape works from a non-administrator user, basically if you can access resources outside of the container, then it would be considered a privilege escalation and therefore serviceable.

[...]

Microsoft has not changed the MSRC servicing criteria at the time of writing. However, they will consider fixing any issue which on the surface seems to escape a Windows Server Container but doesn’t require administrator privileges. It will be classed as an elevation of privilege.


Eventually, I reported all three bugs in one report, and Microsoft fixed them shortly after as CVE-2023-36576. I particularly like the first issue described in the report (the bug in VrpBuildKeyPath), as it makes a very interesting example of how a theoretically low-level issue like a 16-bit integer overflow can have the high-level consequences of a container escape, without any memory corruption being involved.

Adherence to official key and value name length limits

The constraints on the length of key and value names are quite simple. Microsoft defines the maximum values on a dedicated documentation page called Registry Element Size Limits:

Registry element    Size limit
Key name            255 characters. The key name includes the absolute path of the key in the registry, always starting at a base key, for example, HKEY_LOCAL_MACHINE.
Value name          16,383 characters. Windows 2000: 260 ANSI characters or 16,383 Unicode characters.

Admittedly, the way this is worded is quite confusing, and I think it would be better if the information in the second column simply ended after the first period. As it stands, the explanation for "key name" seems to suggest that the 255-character limit applies to the entire key path relative to the top-level key. In reality, the limit of 255 (or to be precise, 256) characters applies to the individual name of each registry key, and value names are indeed limited to 16,383 characters. These assumptions are the basis for the entire registry code.

Despite these being fundamental, documented values, it may come as a surprise that they weren't correctly verified in the hive loading code until October 2022. Specifically, it was possible to load a hive containing a key with a name of up to 1040 characters. Furthermore, the length of a value's name wasn't checked at all, meaning it could consist of up to 65535 characters, the maximum value of the uint16 type representing its length. In both cases, it was possible to exceed the theoretical limits set by the documentation by more than four times.

I reported these bugs as part of the CVE-2022-37991 report. On a default Windows installation, I found a way to potentially exploit the missing check on value name lengths (or at least trigger a reproducible crash with it), but I couldn't demonstrate the consequences of an overly long key name. Nevertheless, I'm convinced that with a bit more research, one could find an application or driver implementing a registry callback that assumes key names cannot be longer than 255 characters, leading to a buffer overflow or other memory corruption. This example clearly shows that even the official documentation cannot be trusted, and all assumptions, even the most fundamental ones, must be verified directly in the code during vulnerability research.

Creation of stable keys under volatile ones

Another rational behavior of the registry is that it doesn't allow you to create Stable keys under Volatile parent keys. This makes sense, as stable keys are stored on disk and persist through hive unload and system reboot, whereas volatile keys only exist in memory and vanish when the hive is unloaded. Consequently, a stable key under a volatile one wouldn't be practical, as its parent would disappear after a restart, severing its path to the registry tree root, causing the stable key to disappear as well. Therefore, under normal conditions, creating such a key is impossible, and any attempt to do so results in the ERROR_CHILD_MUST_BE_VOLATILE error being returned to the caller. While there's no official mention of this in the documentation (except for a brief description of the error code), Raymond Chen addressed it on his blog, providing at least some documentation of this behavior.
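To see why such a key would be nonsensical, consider a toy model in Python (purely illustrative, no real registry involved) in which a reboot discards every volatile key together with its entire subtree:

```python
class Key:
    """Minimal stand-in for a registry key node."""
    def __init__(self, name, volatile):
        self.name, self.volatile, self.children = name, volatile, []

def reboot(key):
    """Simulate a restart: volatile subtrees vanish, stable ones survive
    only if their whole ancestor chain is stable."""
    key.children = [c for c in key.children if not c.volatile]
    for c in key.children:
        reboot(c)

root = Key("HKLM\\Test", volatile=False)
vol = Key("Volatile", volatile=True)
stable_child = Key("Stable", volatile=False)  # hypothetical stable-under-volatile key
vol.children.append(stable_child)
root.children.append(vol)
reboot(root)
# The stable child is gone along with its volatile parent -- exactly the
# inconsistency that ERROR_CHILD_MUST_BE_VOLATILE is meant to prevent.
```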

During my research, I discovered two ways to bypass this requirement and create stable keys under volatile ones. These were issues CVE-2023-21748 and CVE-2024-26173, where the first one was related to registry virtualization, and the second to transaction support. Interestingly, in both of these cases, it was clear that a certain invariant in the registry design was being broken, but it was less clear whether this could have any real consequences for system security. After spending some time on analysis, I came to the conclusion that there was at least a theoretical chance of some security impact, due to the fact that security descriptors of volatile keys are not linked together into a global linked list in the same way stable security descriptors are. Long story short, if later in time some other stable keys in the hive started to share the security descriptor of the stable-under-volatile one, then their security would become invalidated and forcibly reset to their parent's descriptor on the next system reboot, violating the security model of the registry. Microsoft apparently shared my assessment of the situation, as they decided to fix both bugs as part of a security bulletin. Still, this is an interesting illustration of the complexity of the registry – sometimes finding an anomaly in the kernel logic can generate some kind of inconsistent state, but its implications might not be clear without further, detailed analysis.

Arbitrary key existence information leak

If someone were to ask me whether an unprivileged user should be able to check for the existence of a registry key without having any access rights to that key or its parent in a secure operating system, I would say absolutely not. However, this is possible on Windows, because the code responsible for opening keys first performs a full path lookup, and only then checks the access rights. This allows for differentiation between existing keys (return value STATUS_ACCESS_DENIED) and non-existing keys (return value STATUS_OBJECT_NAME_NOT_FOUND).
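The ordering problem can be reduced to a few lines of Python (a toy model with a made-up key path, not the actual kernel logic):

```python
# NTSTATUS codes involved in the existence leak.
STATUS_SUCCESS = 0
STATUS_ACCESS_DENIED = 0xC0000022
STATUS_OBJECT_NAME_NOT_FOUND = 0xC0000034

def open_key(existing_keys, path, caller_has_access):
    """Model of the flawed ordering: full path lookup first,
    access check only afterwards."""
    if path not in existing_keys:            # step 1: does the key exist?
        return STATUS_OBJECT_NAME_NOT_FOUND
    if not caller_has_access:                # step 2: is the caller allowed?
        return STATUS_ACCESS_DENIED
    return STATUS_SUCCESS

# Hypothetical key belonging to another user.
keys = {"HKU\\S-1-5-21-111\\Secret"}
```

Even with no access at all, the caller can tell an existing key (STATUS_ACCESS_DENIED) apart from a non-existing one (STATUS_OBJECT_NAME_NOT_FOUND).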

After discovering this behavior, I decided to report it to Microsoft in December 2023. The vendor's response was that it is indeed a bug, but its severity is not high enough to be fixed as an official vulnerability. I somewhat understand this interpretation, as the amount of information that can be disclosed in this way is quite low (i.e., only limited configuration details of other users), and fixing the issue would probably involve significant code refactoring and a potential performance decrease. It's also difficult to say whether this type of boundary is properly defensible, because after one fix it might turn out that there are many other ways to leak this type of information. Therefore, the technique described in my report still works at the time of writing this blog post.

Miscellaneous

In addition to the bug classes mentioned above, there are also many other types of issues that can occur in the registry. I certainly won't be able to name them all, but briefly, here are a few more primitives that come to mind when I think about registry vulnerabilities:

  • Low-severity security bugs: These include local DoS issues such as NULL pointer dereferences, infinite loops, direct KeBugCheckEx calls, as well as classic memory leaks, low-quality out-of-bounds reads, and others. The details of a number of such bugs can be found in the p0tools/WinRegLowSeverityBugs repository on GitHub.
  • Real, but unexploitable bugs: These are bugs that are present in the code, but cannot be exploited due to some mitigating factors. Examples include bugs in the CmpComputeComponentHashes and HvCheckBin internal functions.
  • Memory management bugs: These bugs are specifically related to the management of hive section views in the context of the Registry process. This especially applies to situations where the hive is loaded from a file on a removable drive, from a remote SMB share, or from a file on a local disk but with unusual semantics (e.g., a placeholder file created through the Cloud Filter API). Two examples of this vulnerability type are CVE-2024-43452 and CVE-2024-49114.
  • Unusual primitives: These are various non-standard primitives that are simply too difficult to categorize, such as CVE-2024-26177, CVE-2024-26178, WinRegLowSeverityBugs #19, or WinRegLowSeverityBugs #20.

Fuzzing considerations

Due to the Windows Registry's strictly defined format (regf) and interface (around a dozen specific syscalls that operate on it), automated testing in the form of fuzzing is certainly possible. We are dealing with kernel code here, so it's not as simple as taking any library that parses a file format and connecting it to a standard fuzzer like AFL++, Honggfuzz, or Jackalope – registry fuzzing requires a bit more work. But, in its simplest form, it could consist of just a few trivial steps: finding an existing regf file, writing a bit-flipping mutator, writing a short harness that loads the hive using RegLoadAppKey, and then running those two programs in an infinite loop and waiting for the system to crash.
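In its crudest form, the mutation step of such a loop might look like this (an illustrative sketch; on a real setup the output would be written to disk and fed to a harness calling RegLoadAppKey):

```python
import random

def flip_bits(data: bytes, n: int = 8) -> bytes:
    """Dumbest possible mutator: flip n random bits somewhere in a
    regf image and return the mutated copy."""
    buf = bytearray(data)
    for _ in range(n):
        pos = random.randrange(len(buf))
        buf[pos] ^= 1 << random.randrange(8)
    return bytes(buf)
```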

It's hard to argue that this isn't some form of fuzzing, and in many cases, these kinds of methods are perfectly sufficient for finding plenty of serious vulnerabilities. After all, my entire months-long research project started with this fairly primitive fuzzing, which did more or less what I described above, with just a few additional improvements:

  • Fixing the hash in the regf header,
  • Performing a few simple operations on the hive, like enumerating subkeys and values,
  • Running on multiple machines at once,
  • Collecting code coverage information from the Windows kernel.
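The first of these improvements, fixing the hash, is straightforward: the regf base block stores a checksum at offset 0x1FC, computed as the XOR of the 127 preceding dwords. A minimal sketch (skipping the kernel's special-casing of the 0 and 0xFFFFFFFF results) could look like this:

```python
import struct

def fix_regf_checksum(hive: bytearray) -> bytearray:
    """Recompute the base-block checksum so a mutated hive isn't rejected
    outright: XOR of the first 127 dwords (bytes 0..0x1FB), stored as a
    little-endian dword at offset 0x1FC."""
    csum = 0
    for (dword,) in struct.iter_unpack("<I", bytes(hive[:0x1FC])):
        csum ^= dword
    hive[0x1FC:0x200] = struct.pack("<I", csum)
    return hive
```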

Despite my best efforts, this type of fuzzing was only able to find one vulnerability (CVE-2022-35768), compared to over 50 that I later discovered manually by analyzing the Windows kernel code myself. This ratio doesn't speak well for fuzzing, and it stems from the fact that the registry isn't as simple a target for automated testing as it might seem. On the contrary, each individual element of such fuzzing is quite difficult and requires a large time investment if one wishes to do it effectively. In the following sections, I'll focus on each of these components (corpus, mutator, harness and bug detection), pointing out what I think could be improved in them compared to the most basic version discussed above.

Initial corpus

The first issue a potential researcher may encounter is gathering an initial corpus of input files. Sure, one can typically find dozens of regf files even on a clean Windows installation, but the problem is that they are all very simple and don't exhibit characteristics interesting from a fuzzing perspective. In particular:

  • All of these hives are generated by the same registry implementation, which means that their state is limited to the set of states produced by Windows, and not the wider set of states accepted by the hive loader.
  • The data structures within them are practically never even close to the limits imposed by the format itself, for example:
      • The maximum lengths of key and value names are 256 and 16,383 characters, but most names in standard hives are shorter than 30 characters.
      • The maximum nesting depth of the tree is 512 levels, but in most hives, the nesting doesn't exceed 10 levels.
      • The maximum number of keys and values in a hive is limited only by the maximum space of 2 GiB, but standard hives usually include at most a few subkeys and associated values – certainly not the quantities that could trigger any real bugs in the code.

This means that gathering a good initial corpus of hives is very difficult, especially considering that there aren't many interesting regf hives available on the Internet, either. The other options are as follows: either simply accept the poor starting corpus and hope that these shortcomings will be made up for by a good mutator (see next section), especially if combined with coverage-based fuzzing, or try to generate a better one yourself by writing a generator based on one of the existing interfaces (the kernel registry implementation, the user-mode Offline Registry Library, or some other open-source library). As a last resort, you could also write your own regf file generator from scratch, where you would have full control over every aspect of the format and could introduce any variance at any level of abstraction. The last approach is certainly the most ambitious and time-consuming, but could potentially yield the best results.

Mutator

Overall, the issue with the mutator is very similar to the issue with the initial corpus. In both cases, the goal is to generate the most "interesting" regf files possible, according to some metric. However, in this case, we can no longer ignore the problem and hope for the best. If the mutator doesn't introduce any high-quality changes to the input file, nothing else will. There is no way around it – we have to figure out how to make our mutator test as much state of the registry implementation as possible.

For simplicity, let's assume the simplest possible mutator that randomly selects N bits in the input data and flips them, and/or selects some M bytes and replaces them with other random values. Let's consider for a moment what logical types of changes this approach can introduce to the hive structure:

  • Enable or disable some flags, e.g., in the _CM_KEY_NODE.Flags field,
  • Change the value of a field indicating the length of an array or list, e.g., _CM_KEY_NODE.NameLength, _CM_KEY_VALUE.DataLength, or a 32-bit field indicating the size of a given cell,
  • Slightly change the name of a key or value, or the data in the backing cell of a value,
  • Corrupt a value sanitized during hive loading, causing the object to be removed from the hive during the self-healing process,
  • Change the value of some cell index, usually to an incorrect value,
  • Change/corrupt the binary representation of a security descriptor in some way.

This may seem like a broad range of changes, but in fact, each of them is very local and uncoordinated with other modifications in the file. This can be compared to binary mutation of an XML file – sometimes we may corrupt/remove some critical tag or attribute, or even change some textually encoded number to another valid number – but in general, we should not expect any interesting structural changes to occur, such as changing the order of objects, adding/removing objects, duplicating objects, etc. Hives are very similar in nature. For example, it is possible to set the KEY_SYM_LINK flag in a key node by pure chance, but for this key to actually become a valid symlink, it is also necessary to remove all its current values, and add a new value named "SymbolicLinkValue" of type REG_LINK containing a fully qualified registry path. With a mutator operating on single bits and bytes, the probability of this happening is effectively zero.

In my opinion, a dedicated regf mutator would need to operate simultaneously on four levels of abstraction, in order to be able to create the conditions necessary for triggering most bugs:

  1. On the high-level structure of a hive, where only logical objects matter: keys, values, security descriptors, and the relationships between them. Mutations could involve adding, removing, copying, moving, and changing the internal properties of these three main object types. These mutations should generally conform to the regf format, but sometimes push the boundaries by testing edge cases like handling long names, a large number of subkeys or values, or a deeply nested tree.
  2. On the level of specific cell types, which can represent the same information in many different ways. This primarily refers to all kinds of lists that connect higher-level objects, particularly subkey lists (index leaves, fast leaves, hash leaves, root indexes), value lists, and linked lists of security descriptors. Where permitted by the format (or sometimes even in violation of the format), the internal representation of these lists could be changed, and its elements could be rearranged or duplicated.
  3. On the level of cell and bin layout: taking the entire set of interconnected cells as input, they could be rearranged in different orders, in bins of different sizes, sometimes interspersed with empty (or artificially allocated) cells or bins. This could be used to find vulnerabilities specifically related to hive memory management, and also to potentially facilitate triggering/reproducing hive memory corruption issues more reliably.
  4. On the level of bits and bytes: although this technique is not very effective on its own, it can complement more intelligent mutations. You never know what additional problems can be revealed through completely random changes that may not have been anticipated when implementing the previous ideas. The only caveat is to be careful with the number of those bit flips, as too many of them could negate the overall improvement achieved through higher-level mutations.
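As a taste of what level 1 could look like, here is a sketch of structural mutations on a purely logical tree model (the dictionary layout is my own invention for illustration, not a real regf representation):

```python
import copy
import random

def mutate_tree(tree: dict) -> dict:
    """Level-1 structural mutation on a logical hive model, where each key
    is {'name': str, 'values': {...}, 'subkeys': [...]}. Pick a random
    subkey and either duplicate it, delete it, or stretch its name toward
    the 255-character limit. Returns a mutated copy, leaving the input
    intact so it can stay in the corpus."""
    tree = copy.deepcopy(tree)
    if not tree["subkeys"]:
        return tree
    i = random.randrange(len(tree["subkeys"]))
    op = random.choice(["dup", "del", "longname"])
    if op == "dup":
        tree["subkeys"].append(copy.deepcopy(tree["subkeys"][i]))
    elif op == "del":
        del tree["subkeys"][i]
    else:
        tree["subkeys"][i]["name"] = tree["subkeys"][i]["name"].ljust(255, "A")
    return tree
```

A serializer would then lower this logical model through levels 2 and 3 into concrete cells and bins.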

As you can see, developing a good mutator requires some consideration of the hive at many levels, and would likely be a long and tedious process. The question also remains whether the time spent in this way would be worth it compared to the effects that can be achieved through manual code analysis. This is an open question, but as a fan of the registry, I would be thrilled to see an open-source project equivalent to fonttools for regf files, i.e., a library that allows "decompiling" hives into XML (or similar) and enables efficient operation on it. One can only dream... 🙂

Finally, I would like to point out that regf files are not the only type of input for which a dedicated mutator could be created. As I've already mentioned before, there are also accompanying .LOG1/.LOG2 and .blf/.regtrans-ms files, responsible for the atomicity of individual registry operations and KTM transactions, respectively. Both types of files may not be as complex as the core hive files, but mutating them might still be worthwhile, especially since some bugs have been historically found in their handling. Additionally, other registry operations performed by the harness could also be treated as part of the input. This would resemble an architecture similar to Syzkaller, and storing registry call sequences as part of the corpus would require writing a special grammar-based mutator, or possibly adapting an existing one.

Harness

While having a good mutator for registry-related files is a great start, the vast majority of potential vulnerabilities do not manifest when loading a malformed hive, but only during further operations on said hive. These bugs are mainly related to some complex and unexpected state that has arisen in the registry, and triggering it usually requires a very specific sequence of system calls. Therefore, a well-constructed harness should support a broad range of registry operations in order to effectively test as many different internal states as possible. In particular, it should:

  • Perform all standard operations on keys (opening, creating, deleting, renaming, enumerating, setting properties, querying properties, setting notifications), values (setting, deleting, enumerating, querying data) and security descriptors (querying keys for security descriptors, setting new descriptors). For the best result, it would be preferable to randomize the values of their arguments (to a reasonable extent), as well as the order in which the operations are performed.
  • Support a "deferred close" mechanism, i.e. instead of closing key handles immediately, maintain a certain cache of such handles to refer to them at a later point in time. In particular, the idea is to sometimes perform an operation on a key that has been deleted, renamed or had its hive unloaded, in order to trigger potential bugs related to object lifetime or the verification that a given key actually exists prior to performing any action on it.
  • Load input hives with different flags. The main point here is to load hives with and without the REG_APP_HIVE flag, as the differences in the treatment of app hives and regular hives are sometimes significant enough to warrant testing both scenarios. Randomizing the states of the other few flags that can take arbitrary values could also yield positive results.
  • Support the registry virtualization mechanism, which can consist of several components:
      • Periodically enabling and disabling virtualization for the current process using the SetTokenInformation(TokenVirtualizationEnabled) call,
      • Setting various virtualization flags for individual keys using the NtSetInformationKey(KeySetVirtualizationInformation) call,
      • Creating an additional key structure under the HKU\<SID>_Classes\VirtualStore tree to exercise the mechanism of key replication / merging state in "query" type operations (e.g. in enumeration of the values of a virtualized key).
  • Use transactions, both KTM and lightweight. In particular, it would be useful to mix non-transactional calls with transactional ones, as well as transactional calls within different transactions. This way, we would be able to exercise the code paths responsible for making sure that no two transactions collide with each other, and that non-transactional operations always roll back the entire transactional state before making any changes to the registry. It would also be beneficial if some of these transactions were committed and some rolled back, to test as much of their implementation as possible.
  • Support layered keys. For many registry operations, the layered key implementation is completely different than the standard one, and almost always more complicated. However, adding differencing hive support to the fuzzer wouldn't be trivial, as it would require additional communication with VRegDriver to load/unload the hive. It would also require making some fundamental decisions: which hive(s) do we overlay our input hive on top of? Should we keep pairs of hives in the corpus and overlay them one on top of the other, in order to control the properties of all the keys on the layered key stack? Do we limit ourselves to a key stack of two elements, or create more complicated stacks consisting of three or more hives? These are all open questions to which I don't know the answer, but I am sure that implementing some form of layered key support would positively affect the number of vulnerabilities that could be found this way.
  • Potentially support multi-threading and execute the harness logic in multiple threads at once, allowing it to trigger potential race conditions. The downside of this idea is that unless we run the fuzzing in some special environment, it would probably be non-deterministic, making timing-related bugs difficult to reproduce.
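The deferred-close idea from the list above can be modeled with a tiny handle cache (illustrative only; the "handles" here are just opaque values):

```python
import random

class DeferredCloseCache:
    """Toy model of a 'deferred close' mechanism: instead of closing a
    handle right away, keep it around so the harness can occasionally
    replay an operation on it later -- possibly after its key has been
    deleted, renamed, or had its hive unloaded."""
    def __init__(self, capacity=16):
        self.capacity, self.handles = capacity, []

    def retire(self, handle):
        self.handles.append(handle)
        if len(self.handles) > self.capacity:
            self.handles.pop(0)  # the oldest handle finally gets closed

    def pick_stale(self):
        """Return a cached handle to poke at, or None if empty."""
        return random.choice(self.handles) if self.handles else None
```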

The final consideration for harness development is the prevalence of registry issues caused by improper error handling, particularly cell allocator out-of-memory errors. A potential harness feature could be to artificially trigger these circumstances, perhaps by aggressively filling almost all of the 2 GiB stable/volatile space, causing HvAllocateCell/HvReallocateCell functions to fail. However, this approach would waste significant disk space and memory, and substantially slow down fuzzing, so the net benefit is unclear. Alternative options include hooking the allocator functions to make them fail for a specific fraction of requests (e.g., using DTrace), or applying a runtime kernel modification to reduce the maximum hive space size from 2 GiB to some smaller value (e.g., 16 MiB). These ideas are purely theoretical and would require further testing.
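The hooking alternative can be prototyped as a simple probabilistic wrapper (purely a user-mode model; a real deployment would instrument HvAllocateCell itself, e.g. via DTrace):

```python
import random

def make_flaky(alloc_fn, fail_rate=0.05, rng=None):
    """Wrap an allocator so a fixed fraction of requests fail, mimicking
    cell-allocator OOM conditions (HvAllocateCell returning HNIL-style
    failure). rng is injectable for reproducibility."""
    rng = rng or random.Random()
    def flaky(size):
        if rng.random() < fail_rate:
            return None  # simulated allocation failure
        return alloc_fn(size)
    return flaky
```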

Bug detection

Alongside a good initial corpus, mutator and harness, the fourth and final pillar of an effective fuzzing session is bug detection. After all, what good is it to generate an interesting sample and trigger a problem with a series of complicated calls, if we don't even notice the bug occurring? In typical user-mode fuzzing, bug detection is assisted by tools such as AddressSanitizer, which are integrated into the build process and add extra instrumentation to the binary to enable the detection of all invalid memory references taking place in the code. In the case of the Windows kernel, a similar role is played by the Special Pool, which isolates individual allocations on kernel pools to maximize the probability of a crash when an out-of-bounds access/use-after-free condition occurs. Additionally, it may also be beneficial to enable the Low Resources Simulation mechanism, which can cause some pool allocations to fail and thus potentially help in triggering bugs related to handling OOM conditions.

The challenge with the registry lies in the fact that most bugs don't stem from memory corruption within the kernel pools. Typically, we're dealing with either hive-based memory corruption or its early stage—an inconsistent state within the registry that violates a crucial invariant. Reaching memory corruption in such a scenario necessitates additional steps from an attacker. For instance, consider a situation where the reference count of a security descriptor is decremented without removing a reference to it in a key node. To trigger a system bugcheck, one would need to remove all other references to that security descriptor (e.g., by deleting keys), overwrite it with different data (e.g., by setting a value), and then perform an operation on it or one of its adjacent descriptors that would lead to a system crash. Each extra step significantly decreases the likelihood of achieving the desired state. The fact that cells have their own allocator further hinders fuzzing, as there's no equivalent of the Special Pool available for it.

Here are a few ideas for addressing the problem, some more realistic than others:

  • If we had a special library capable of breaking down regf files at various levels of abstraction, we could have the mutator create the input hive in a way that maximizes the chances of a crash if a bug occurs during a cell operation. For example, we could assign each key a separate security descriptor with refcount=1 (which should make triggering UAFs easier) and place each cell at the end of a separate bin, followed by another, empty bin. This behavior would be very similar to how the Special Pool works, but at the bin and cell level.
  • Again, if we had a good regf file parser, we could open the hive saved on disk after each iteration of the harness and verify its internal consistency. This would allow us to catch inconsistent hive states early, even if they didn't lead to memory corruption or a system crash in a specific case.
  • Possibly, instead of implementing the hive parsing and verification mechanism from scratch, one could try to reuse an existing implementation. In particular, an interesting idea would be to use the self-healing property of the registry. Thanks to this, after each iteration, we could theoretically load the hive once again for a short period of time, unload it, and then compare the "before" and "after" representations to see if the loader fixed any parts of the hive during the loading process. We could potentially also try to use the user-mode offreg.dll library for this purpose, which seems to share much of the hive loading code with the Windows kernel, and which would likely be more efficient to call.
  • As part of testing a given hive in a harness, we could periodically fill the entire hive (or at least all its existing bins) with random data to increase the probability of detecting UAFs by overwriting freed objects with incorrect data.
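For instance, a consistency check for the security descriptor refcount invariant mentioned earlier could be as simple as the following (a toy model operating on plain dictionaries rather than real hive cells):

```python
from collections import Counter

def check_sd_refcounts(keys, descriptors):
    """Toy invariant check: every security descriptor's stored refcount
    must equal the number of key nodes actually pointing at it.
    'keys' maps key name -> descriptor id; 'descriptors' maps
    descriptor id -> stored refcount. Returns the ids that mismatch."""
    actual = Counter(keys.values())
    return [sd for sd, stored in descriptors.items()
            if stored != actual.get(sd, 0)]
```

Running such a check after every harness iteration would flag a desynchronized refcount long before it escalates into an actual use-after-free.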

Finally, as an optional step, one could consider implementing checks at the harness level to identify logical issues in registry behavior. For example, after each individual operation, the harness could verify whether the process security token and handle access rights actually allowed it – thereby checking if the kernel correctly performed security access checks. Another idea would be to examine whether all operations within a transaction have been applied correctly during the commit phase. As we can see, there are many potential ideas, but when evaluating their potential usefulness, it is important to focus on the registry behaviors and API contracts that are most relevant to system security.

Conclusion

This concludes our exploration of the Windows Registry's role in system security and effective vulnerability discovery techniques. In the next post, we'll stay on the topic of security, but we'll shift our focus from discovering bugs to developing specific techniques for exploiting them. We'll use case studies of some experimental exploits I wrote during my research to demonstrate their practical security implications. See you then!
