(Collaborative post by Mateusz “j00ru” Jurczyk and Gynvael Coldwind)
It was six weeks ago when we first introduced our effort to locate and eliminate the so-called double fetch (e.g. time-of-check-to-time-of-use during user-land memory access) vulnerabilities in operating system kernels through CPU-level operating system instrumentation, a project code-named “Bochspwn” as a reference to the x86 emulator used (bochs: The Open Source IA-32 Emulation Project). In addition to discussing the instrumentation itself in both our SyScan 2013 presentation and the whitepaper we released shortly thereafter, we also went to some lengths trying to explain the different techniques which could be chained together in order to successfully and optimally exploit kernel race conditions, on the example of an extremely constrained win32k!SfnINOUTSTYLECHANGE (CVE-2013-1254) double fetch fixed by Microsoft in March 2013.
The talk has yielded a few technical discussions involving a lot of smart guys, getting us to reconsider several aspects of race condition exploitation on x86, and resulting in plenty of new ideas and improvements to the techniques we originally came up with. In particular, we would like to thank Halvar Flake (@halvarflake) and Solar Designer (@solardiz) for their extremely insightful thoughts on the subject. While we decided against releasing another 70 page long LaTeX paper to cover the new material, this blog post is to provide you with a thorough follow-up on efficiently winning memory access race conditions on IA-32 and AMD64 CPUs, including all lessons learned during the recent weeks.
Please note that the information contained in this post is mostly relevant to timing-bound kernel security issues with really small windows (in the order of several instructions) and vulnerabilities which require a significant amount of race wins to succeed, such as limited memory disclosure bugs. For exploitation of all other problems, the CPU subtleties discussed here are probably not quite useful, as they can be usually exploited without going into this level of detail.
All experimental data presented in this post has been obtained using a hardware platform equipped with an Intel Xeon W3690 @3.47GHz CPU (with Hyper-Threading disabled) and DDR3 RAM, unless stated otherwise.