Escaping the sandbox: A bug that speaks for itself

2023-11-15 01:23:56

Introduction

In this blog post, we’ll share the story of how we discovered a critical stack corruption bug that has existed in Windows for more than 20 years (CVE-2023-36719). The bug was found in a core Windows OS library which is used by countless software products but is most notably reachable from within the sandbox of all Chromium-based browsers, exposing a sandbox escape which is reachable from the web via JavaScript.

We’ll walk through how we chose the target and the method that led us to discover this vulnerability, as well as some thoughts on how this could lead to browser and full system compromise.

In line with our goal to make Microsoft Edge the most secure browser for Windows, we decided to target a platform-specific feature in the Web Speech API, which offers speech synthesis functionality to aid user experience and accessibility and ends up using the internal Microsoft Speech API.

The Web Speech API

The Web Speech API is a standardized web API exposed to JavaScript. It is a set of interfaces which allow web developers to use speech synthesis and recognition features in their web applications.

For example, consider the following JavaScript code snippet:

let text = "Hello World!";

speechSynthesis.speak(new SpeechSynthesisUtterance(text));

The speak method interacts with the platform-specific Text-To-Speech (TTS) service, and the audio will be played on the device.

Chromium Multi-Process Architecture

We’ll briefly cover the sandboxing architecture of Chromium-based browsers. Readers who are already familiar with this can skip to the next section.

As you may know, Chromium-based browsers (and most widespread modern browsers) use a multi-process architecture. All the content of a site, including untrusted JavaScript code, is parsed and executed within a separate process with no access to the file system, network or many other system resources. Generically, we refer to these as “sandbox” processes, and there are several different types; in this case we are referring specifically to the Renderer Process.

Diagram showing the Chromium multi-process architecture. Reference

Following the principle of least privilege, most of the higher trust tasks and functionalities are managed by the more privileged Browser Process. It has permission to initiate many kinds of actions, including (but not limited to) accessing the filesystem, the device’s camera, geolocation data, and more.

Every time the Renderer Process wants to perform a privileged action, it sends a request to the Browser Process over an IPC (Inter-Process Communication) channel.

The IPC functionality is implemented by Mojo, a collection of runtime libraries integrated into Chromium that provides a platform-agnostic abstraction of common primitives, such as message passing, shared memory and platform handles.

The internal details of Mojo are not relevant for this blog post and are not necessary to follow the rest of the analysis. For more information, please refer to the official documentation.

Crossing the Boundary

In this case, the JavaScript call initiated by the web page will result in an IPC call from the renderer process to the SpeechSynthesis Mojo Interface, which is implemented by the browser process: speech_synthesis.mojom – Chromium Code Search

Effectively, calling the speak JavaScript API crosses a privilege boundary, and any bug that we trigger in the browser code may allow an attacker to gain code execution outside of the context of the sandbox.

Chromium does not have a custom TTS engine; as such, the SpeechSynthesis interface is just a wrapper around the platform-specific service.

In the case of Windows, the “speak” method will call into the ISpVoice COM interface, as can be seen here: tts_win.cc – Chromium Code Search

Microsoft::WRL::ComPtr<ISpVoice> speech_synthesizer_;

HRESULT result = speech_synthesizer_->Speak(merged_utterance.c_str(),
                                            SPF_ASYNC, &stream_number_);

The (untrusted) data that is passed as an argument to the JavaScript call is forwarded as-is* to the browser process and used as the argument for the ISpVoice::Speak call.

* Actually, it goes through some conversion process, from string to wide string, and some more characters are prepended to it, but it doesn’t really matter for the sake of this analysis.

The ISpVoice COM interface

Windows offers a framework, called COM (Component Object Model), which allows different software components to communicate and interoperate with each other, regardless of their programming language.

Without getting too much into details, whenever an instance of the ISpVoice object is used, like this:

speech_synthesizer_->Speak(...);

the COM system framework (combase.dll) finds the path of the library which implements such an interface and takes care of routing the call to the right function.

In the case of the ISpVoice interface, the library which implements it resides in C:\Windows\System32\Speech\Common\sapi.dll.

This library will be dynamically loaded within the Browser Process memory (for the readers who are familiar with COM: this is because the ISpVoice service is an in-process service).

The ISpVoice interface implements the Windows Speech API (SAPI) and is very well documented here: ISpVoice (SAPI 5.3) | Microsoft Learn

From the documentation we quickly learn that TTS engines are complex machines that support not only simple text but also a specific markup language based on XML that can be used to produce more accurate audio, specifying pitch, pronunciation, speaking rate, volume, and more. Such language is called SSML (Speech Synthesis Markup Language) and it is a standard for Text-to-Speech engines. Microsoft also extends this standard by offering an additional grammar, called SAPI XML grammar (XML TTS Tutorial (SAPI 5.3) | Microsoft Learn).

For example, one could use the SAPI grammar like this:

speechSynthesis.speak(new SpeechSynthesisUtterance(`<volume level="50">
This text should be spoken at volume level fifty.</volume>`));

The XML string passed as an argument ends up being parsed by the native library sapi.dll, which makes this a perfect target for vulnerability research.

This breaks the Chromium project Rule of 2 (indirectly, by using sapi.dll) because the OS library code is written in C++, it parses untrusted inputs, and it runs unsandboxed within the Browser process.

As emphasized before, a bug in this code could lead to a potential browser sandbox escape exploit. Moreover, being reachable directly from JavaScript, this would be one of those rare cases in which a sandbox escape could be achieved without compromising the Renderer process.

Fuzzing

Now that we had our target, we wanted to start searching for exploitable bugs. For our first approach, we decided to go for black box fuzzing to get something running as quickly as possible and get feedback on how to iterate later.

We found that, overall, the best choice for our case was Jackalope, a fuzzer developed by Google Project Zero which supports coverage-guided black box fuzzing without much overhead and comes with a grammar engine which is especially useful to fuzz our SSML/SAPI parser.

Since we didn’t have any experience with the COM framework and how to interact with the ISpVoice service, we asked Bing Chat to write a harness for us; it sped up our development and worked flawlessly.

Prompt: Create a simple C++ program using Windows SAPI API to speak an input string

Then, we modeled the SAPI grammar and ran our harness using Jackalope as the main engine.

This approach quickly resulted in our first bug, which was found in less than a day of fuzzing (!)

However, we noticed that the coverage measured by Jackalope (Offsets) reached a plateau very quickly, in just a few days, as you can see from the comparison screenshot, and consequently the corpus size also stopped growing.

Comparison of Jackalope coverage after 3 days.

This is probably because mutating the existing corpus could not yield new code paths without hitting the same bug, so we didn’t get any more interesting test cases after the first crash.

In parallel, we decided to try another approach: generation-based fuzzing with Dharma (shout out to our colleague Christoph, who created it!).

Dharma is also a powerful solution: it is extremely quick and easy to set up, and it is not coverage guided. As such, it can generate huge test cases that extensively test all the grammar features right from the start. With this approach, we found the same bug again within a night of fuzzing.

The Bug

The bug in question is reachable through the function ISpVoice::Speak (SAPI 5.3), which takes as input an XML string that can support both tags specific to the Microsoft SAPI format, and SSML (Speech Synthesis Markup Language), which is the standard for speech synthesis used by many other engines.

Looking at the source code of that function, we soon realized that the XML input goes through a 2-step processing:

First, the XML string is parsed using the standard MSXML Windows library (msxml6.dll), looking for SSML tags.
If the first step fails for whatever reason (e.g. broken XML string, or unrecognized SSML tags), the code falls back to SAPI parsing, which is implemented using a custom XML parser (written in C++).

Custom parsing of untrusted data is difficult and often results in exploitable security bugs. As you probably guessed, the bug we found resides in the latter part of the code.

When parsing a tag, the following struct is filled:

#define MAX_ATTRS 10
struct XMLTAG
{
    XMLTAGID  eTag;
    XMLATTRIB Attrs[MAX_ATTRS];
    int       NumAttrs;
    bool      fIsStartTag;
    bool      fIsGlobal;
};

You may notice that the maximum number of attributes for each tag is set to 10. This is because, according to the specification, there is no tag that requires that many.

Perhaps unsurprisingly, the bug happens when more than 10 attributes are used in a single XML tag, as no bounds checks are performed when adding them to the Attrs array, leading to a buffer overflow.

A simple PoC that could crash your browser process directly from JavaScript, on the vulnerable version of the library, is this:

let text = `<PARTOFSP PART="modifier" PART="modifier" PART="modifier" PART="modifier" PART="modifier" PART="modifier" PART="modifier" PART="modifier" PART="modifier" PART="modifier" PART="modifier" PART="modifier" ></PARTOFSP>`;
speechSynthesis.speak(new SpeechSynthesisUtterance(text));

The Overflow

Trying to reproduce the bug on a vulnerable version of sapi.dll results in a crash due to an invalid rip value and a completely corrupted stack trace. This immediately points to a possible stack buffer overflow.

WinDBG catching the browser crash.

The first thing we noticed was that the crash didn’t trigger the corruption of a stack canary. This discovery intrigued us, but we needed to gather more information before digging into this oddity.

To get to the root cause of the bug, we used Time Travel Debugging to easily step backwards to the exact point where the stack got corrupted. We found something similar to this:

Note: we are omitting the original code and using some equivalent snippets

HRESULT CSpVoice::ParseXML(…)
{
    ...
    XMLTAG Tag; // XMLTAG struct on the stack
    ...
    ParseTag(input_str, &Tag, ...); // Call to the vulnerable function
}

In the ParseTag function there is the loop that will parse every attribute and fill in the &Tag->Attrs[] array and eventually write past its end:

HRESULT ParseTag(WCHAR* pStr, XMLTAG* pTag, …) {
    // some code … //
    while (hr == S_OK && *pStr) {
        // Fill attribute name from the input string.
        // Note: NumAttrs is never checked against MAX_ATTRS.
        hr = FindAttr(pStr,
            &pTag->Attrs[pTag->NumAttrs].eAttr,
            &pStr);

        if( hr == S_OK )
        {
            // Fill attribute value
            hr = FindAttrVal(pStr,
                &pTag->Attrs[pTag->NumAttrs].Value.pStr,
                &pTag->Attrs[pTag->NumAttrs].Value.Len,
                &pStr);
            if( hr == S_OK )
            {
                ++pTag->NumAttrs;
            }
        }
    }
}
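For contrast, the missing bounds check can be sketched in a few lines of C++. This is not the fix that shipped in sapi.dll, which is not shown in this post; the `Attr` and `Tag` types and the `AppendAttr` helper are simplified, hypothetical stand-ins for XMLATTRIB, XMLTAG and the body of the loop above.

```cpp
constexpr int MAX_ATTRS = 10;

// Simplified stand-ins for XMLATTRIB and XMLTAG (hypothetical names).
struct Attr {
    int id;                // which attribute name was recognized
    const wchar_t* value;  // points into the input string
    int len;               // length of the value, in wide characters
};

struct Tag {
    Attr attrs[MAX_ATTRS];
    int num_attrs = 0;
};

// Stores one parsed attribute, refusing to write once the array is full.
// Returns false so the caller can stop parsing and report an error,
// instead of silently writing past the end of attrs.
bool AppendAttr(Tag& tag, const Attr& attr) {
    if (tag.num_attrs < 0 || tag.num_attrs >= MAX_ATTRS) {
        return false;
    }
    tag.attrs[tag.num_attrs++] = attr;
    return true;
}
```

With a check like this, the eleventh attribute of the PoC would be rejected before anything is written at index 10.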

Stack cookies? No thanks

At this point it was clear that we had a stack buffer overflow, and we had a few questions:

Why didn’t the process crash with a corrupted stack cookie?
We have control of RIP; is it exploitable? What other primitives does it give us?

Regarding the first question, we wanted to know why we hadn’t triggered a stack canary exception despite clearly overflowing and corrupting the call stack. We dug out the source code from the OS repository and began to investigate. From the build scripts it was clear that the library was compiled with /GS (Stack Canaries). In disbelief, we opened the binary in IDA Pro and saw the canary check in the relevant function.

It was clear that something fishy was going on and that more investigation was required.

Let’s start by recalling the structures involved here:

struct XMLTAG
{
    XMLTAGID  eTag;
    XMLATTRIB Attrs[MAX_ATTRS];
    int       NumAttrs;
    bool      fIsStartTag;
    bool      fIsGlobal;
};

struct XMLATTRIB
{
    XMLATTRID eAttr; // enum that identifies the attribute name
    SPLSTR    Value;
};

struct SPLSTR
{
    WCHAR*  pStr; // wstring pointing to the attribute value
    int     Len;  // length of the attribute value
};

Our overflow allows us to write additional XMLATTRIB structures past the end of the XMLTAG.Attrs array. Serializing the members of this structure means we can effectively write content in chunks of 24 bytes as shown:

XMLATTRID eAttr; // 4-byte enum (padded to 8)
WCHAR* pStr; // 8-byte pointer that points to our controlled string
int Len; // 4-byte controlled length (padded to 8)
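The 24-byte figure can be checked with a small C++ sketch that mirrors the three structures and asks the compiler for their layout. The enumerators are placeholders (the real ones are not shown in this post), `wchar_t*` stands in for `WCHAR*`, and the numbers in the comments assume a typical 64-bit build with 4-byte enums and 8-byte pointers.

```cpp
#include <cstddef>

constexpr int MAX_ATTRS = 10;

// Placeholder enums; the underlying type is fixed to int as an assumption.
enum XMLTAGID : int { TAG_PLACEHOLDER = 0 };
enum XMLATTRID : int { ATTR_PLACEHOLDER = 0 };

struct SPLSTR {
    wchar_t* pStr;  // 8 bytes on a 64-bit build
    int Len;        // 4 bytes, followed by 4 bytes of padding
};

struct XMLATTRIB {
    XMLATTRID eAttr;  // 4 bytes, followed by 4 bytes of padding
    SPLSTR Value;     // 16 bytes
};

struct XMLTAG {
    XMLTAGID eTag;
    XMLATTRIB Attrs[MAX_ATTRS];
    int NumAttrs;
    bool fIsStartTag;
    bool fIsGlobal;
};

// Size of one array element: 24 on the assumed 64-bit layout.
constexpr std::size_t kAttrSize = sizeof(XMLATTRIB);
// Where the array starts and where NumAttrs lives inside XMLTAG.
constexpr std::size_t kAttrsOffset = offsetof(XMLTAG, Attrs);
constexpr std::size_t kNumAttrsOffset = offsetof(XMLTAG, NumAttrs);
// Offset of the (non-existent) element Attrs[MAX_ATTRS].
constexpr std::size_t kOnePastLastAttr = kAttrsOffset + MAX_ATTRS * kAttrSize;
```

On that layout `kOnePastLastAttr` equals `kNumAttrsOffset`, so the eAttr field of an eleventh element lands exactly on NumAttrs.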

What’s the first thing we overwrite past the Attrs array?

It’s the NumAttrs field of the XMLTAG structure. We can overlap it with any value from the possible XMLATTRID values. These are chosen from the attribute name, and it can be any controlled value from 0 to 23.

Since NumAttrs is used by the vulnerable function to index the array, we can pick this number to cause a non-linear overflow and jump past the current frame’s stack cookie when overwriting the stack.

This is exactly what happened (accidentally) with our first PoC: the eleventh attribute PART gets parsed into an XMLATTRIB structure with an eAttr value higher than 10. That value overwrites NumAttrs, and then the next attribute (the twelfth) overwrites the saved rip directly without touching the stack cookie.

We tried to modify the PoC by crafting a plausible rip value and ran it on a machine with Intel CET enabled, and it resulted in a crash related to a shadow stack exception.

In fact, on most recent devices, overwriting rip wouldn’t be a feasible exploitation strategy because of the shadow stack (Understanding Hardware-enforced Stack Protection), which double checks the return address saved on the stack before jumping to it.

But it’s worth mentioning that most current desktop users aren’t protected by this mitigation, as it was introduced only recently.

To summarize: stack canaries are ineffective, and CET protection isn’t available on the average customer machine. How easy does it get for attackers?

Since this bug allows us to corrupt almost anything we want on the stack, even in deeper stack frames, we could look for more possible targets to overwrite, other than the saved rip. There are indeed some object pointers on the stack that we could overwrite, but we need to keep some limitations in mind.

There are two possible strategies here:

We can either overwrite a pointer with a full qword (the pStr value, which points to our controlled wide string), and in that case we would be able to replace an object with almost arbitrary content.

However, crafting data, and especially pointers, will not be very feasible, since we can’t include any sequence of two null bytes (\x00\x00), because that is a null wchar and the XML parsing will stop early.

This makes it very hard (impossible?) to craft valid pointers within the controlled content. Moreover, apart from null bytes, there are other invalid byte sequences that will not be accepted as valid wide strings and would break the parsing early.
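The null wchar constraint is easy to express in code. The following sketch checks whether a 64-bit value could be laid out inside a wide string without containing a null 16-bit unit. It assumes the value is stored little-endian at a wchar-aligned offset, and it models only the null-character restriction, not the other invalid sequences mentioned above; `FitsInWideString` is a hypothetical helper name.

```cpp
#include <cstdint>

// Returns true when none of the four 16-bit units of `value` is zero,
// i.e. when the value would not terminate a wide string early.
bool FitsInWideString(std::uint64_t value) {
    for (int i = 0; i < 4; ++i) {
        const std::uint16_t unit =
            static_cast<std::uint16_t>(value >> (16 * i));
        if (unit == 0) {
            return false;
        }
    }
    return true;
}
```

A typical 64-bit user-mode address such as 0x00007FF612345678 has its upper 16 bits set to zero, so it fails this check, while a value like 0x4141414142424242 passes but is not a usable pointer.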

Alternatively, we could overwrite a value on the stack that is aligned with our controlled Len field. In this case, since Len is of type int, only the least significant 4 bytes will be overwritten, leaving the most significant half untouched.

Even here, to craft a valid pointer, we would need to have a huge length, which doesn’t work in practice.
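A short C++ sketch shows why the partial overwrite is so limiting. It assumes a little-endian 64-bit target, where writing a 4-byte int over an 8-byte slot replaces only the low half; both function names and the sample addresses are hypothetical.

```cpp
#include <cstdint>

// Value of a 64-bit stack slot after its low 32 bits are replaced by `len`.
constexpr std::uint64_t OverwriteLow32(std::uint64_t original,
                                       std::uint32_t len) {
    return (original & 0xFFFFFFFF00000000ULL) | len;
}

// Attribute length an attacker would need so that the slot ends up
// pointing at `target` (the upper half must already match).
constexpr std::uint32_t LengthNeededFor(std::uint64_t target) {
    return static_cast<std::uint32_t>(target & 0xFFFFFFFFULL);
}
```

For a hypothetical target of 0x00007FF612345678, `LengthNeededFor` returns 0x12345678, which means an attribute value of roughly 305 million wide characters.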

Due to the time constraints of the project, we didn’t explore exploitation possibilities further, but we believe that starting from a corrupted renderer, exploitation would be much more feasible given these primitives.

Keep fuzzing the surface

After the initial quick success, and given the need to fix the first bug to keep fuzzing further, we decided to spend some time figuring out how to re-compile our target library sapi.dll with ASAN (Address SANitizer) and code coverage to test more effectively.

Having ASAN allows us to catch potential heap errors that would otherwise have been missed by the PageHeap sanitizer (which we used during our first iteration). Plus, it gives detailed information for stack corruptions right away as well, dramatically reducing debugging time in case we found more of those.

As mentioned earlier, the Windows Speech API is called from Chromium via the COM interface ISpVoice.

Our target DLL is not directly linked to the harness, but it is loaded at runtime by the COM framework, based on the path specified in the system registry for that COM CLSID.

To make our fuzzer simpler to deploy on other infrastructures such as OneFuzz, we decided to use Detours to hook LoadLibraryEx and replace the DLL path loaded from the system registry with the path of our instrumented DLL, instead of modifying the registry.

Having the library built with ASAN and SANCOV allowed us to use any other fuzzer, such as Libfuzzer, which is the de facto standard (although deprecated).
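For readers unfamiliar with libFuzzer, a harness is just a function with a fixed signature that the engine calls once per input. The sketch below is a plain byte-oriented harness, not the grammar-aware one described in this post. `LLVMFuzzerTestOneInput` is libFuzzer's documented entry point; `DecodeUtf16LE` and `SpeakForFuzzing` are hypothetical helpers, and the latter is a stub standing in for the code that would create an ISpVoice instance and call Speak on Windows.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Interprets the raw fuzzer bytes as little-endian UTF-16 code units,
// stopping at the first null unit (the parser would stop there anyway).
std::u16string DecodeUtf16LE(const std::uint8_t* data, std::size_t size) {
    std::u16string out;
    out.reserve(size / 2);
    for (std::size_t i = 0; i + 1 < size; i += 2) {
        const char16_t unit =
            static_cast<char16_t>(data[i] | (data[i + 1] << 8));
        if (unit == 0) break;
        out.push_back(unit);
    }
    return out;
}

// Hypothetical stand-in for the target: a real harness would pass the
// string to the speech API here instead of just measuring it.
std::size_t SpeakForFuzzing(const std::u16string& xml) {
    return xml.size();
}

// Entry point called by libFuzzer for every generated input.
extern "C" int LLVMFuzzerTestOneInput(const std::uint8_t* data,
                                      std::size_t size) {
    SpeakForFuzzing(DecodeUtf16LE(data, size));
    return 0;  // inputs are always accepted into the corpus
}
```

When the target is built with ASAN and coverage instrumentation, any memory error triggered inside the call is reported by the sanitizer and the offending input is saved by the engine.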

Therefore, our second iteration consisted of rewriting and expanding the grammar for the coverage-guided whitebox fuzzer. This was done using LibProtobuf-Mutator.

Nowadays, the fuzzer keeps running on our continuous fuzzing infrastructure using OneFuzz and keeps helping our engineers fix bugs on this surface.

Conclusion

In this blog post, we have shown how we discovered and reported a critical bug on Windows that was reachable from the browser using the Web Speech API. We have explained how we used the SAPI XML grammar to craft malicious inputs that could trigger memory corruption in the sapi.dll library, which runs unsandboxed in the browser process.

Bugs like the one discussed here are particularly rare since, when exploited, they lead to full compromise of the browser process without the need for additional bugs (normally you would need a chain of at least two bugs to gain control over the browser process).

Moreover, this bug would have been reachable from any Chromium-based browser (Edge, Chrome, Brave Browser, and so on) regardless of its version, which makes this case even more interesting.

One could wonder: were other browsers affected as well?
Normally, yes. Any browser that implements the speechSynthesis JavaScript interface and uses the Windows Speech API should be affected.

However, some browsers took slightly different approaches that may spare them from this bug. Firefox, for instance, does use sapi.dll, but it strips away XML tags from the input, at the cost of losing all the features that the SSML standard provides.

The need for modern browsers to interact with multiple OS components in order to offer rich functionalities (like speech synthesis, speech recognition, etc.) makes things harder for security, since third-party libraries are loaded in the browser process and often take inputs from untrusted sources.

In this case, we limited our analysis to the Windows platform and to the Speech API, but there might be other cases where a comparable situation occurs.

In the future, we plan to investigate more COM components used by browsers and perform audits of external libraries that are loaded in the Browser Process and may expose security risks.
