‘LEXSS’ injection: How to bypass lexical parsers by abusing HTML parsing logic

Researcher digs deeper into technique that uncovered flaws in popular WYSIWYG HTML text editors


A security researcher has penned a deep dive on bypassing lexical parsers with special HTML tags that leverage HTML parsing logic to ultimately execute arbitrary JavaScript code.

Chris Davis, security consultant at Bishop Fox, has previously deployed the hacking technique to unearth high-risk cross-site scripting (XSS) vulnerabilities in two popular What-You-See-Is-What-You-Get (WYSIWYG) HTML text editors.

The flaws in TinyMCE (disclosed in August 2020) and Froala (disclosed earlier this month) affected a combined 700,000 websites that incorporated the applications.

What is lexical parsing?

“Lexical parsing is a very sophisticated way of preventing XSS because it evaluates whether the data is instructions or plaintext before performing additional logic such as blocking or encoding the data,” says Davis in his technical write-up.

It separates “user data (i.e., non-dangerous textual content) from computer instructions (i.e., JavaScript and certain dangerous HTML tags)”, he continues. “In instances where the user is allowed a subset of HTML by design, this type of parsing can be used to determine what is allowed content and what will be blocked or sanitized.”
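As a rough illustration of the idea, the following minimal sketch tokenizes input with the `HTMLParser` class from Python's standard library and keeps only an allowlisted subset of tags. The allowlist, the `AllowlistSanitizer` class, and the `sanitize` helper are hypothetical names chosen for this example; this is not the logic used by TinyMCE, Froala, or any other product named in the research.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allowlist: the "subset of HTML" a user may supply.
ALLOWED_TAGS = {"b", "i", "em", "strong", "p"}


class AllowlistSanitizer(HTMLParser):
    """Classifies each token as markup or text, then decides what to keep."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Keep allowlisted tags but drop every attribute (no event handlers).
        if tag in ALLOWED_TAGS:
            self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Anything the tokenizer classified as text is re-encoded.
        self.out.append(escape(data))


def sanitize(markup):
    parser = AllowlistSanitizer()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


print(sanitize("<p>Hi <b>there</b><img src=x onerror=alert(1)></p>"))
# -> <p>Hi <b>there</b></p>
```

The safety of this approach rests entirely on the sanitizer and the browser agreeing on which bytes are tags and which are text.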

As well as WYSIWYG HTML editors, lexical sanitizing parsers are widely used in rich-text editors, email clients, and sanitization libraries such as DOMPurify to protect against XSS attacks.

However, Davis demonstrates how lexical parsers can be tricked into viewing dangerous content “as text data and not computer instructions”.

This is possible because “HTML is not designed to be parsed twice; slight variations in parsing can occur between the initial HTML parser and the sanitizing parser; and sanitizing parsers often implement their own processing logic”.

Context states and namespace confusion

Key to the research are context states: the tokenization states, such as the data state, that determine how the HTML parser interprets the characters that follow a given element. “Different supplied elements alter how data in those elements is parsed and rendered by switching the context state of the data,” says Davis.
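The effect of a context switch can be seen in a small sketch using Python's standard-library `HTMLParser`, which, like browsers, treats the contents of `<style>` as raw text. The `TagCollector` class, the `tokenize` helper, and the payload string are illustrative assumptions, not taken from Davis's write-up.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records the start tags and the text that the tokenizer reports."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)


def tokenize(markup):
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags, "".join(collector.text)


PAYLOAD = "<img src=x onerror=alert(1)>"

# Inside <p> the tokenizer is in the data state: the payload is a tag.
print(tokenize(f"<p>{PAYLOAD}</p>"))
# -> (['p', 'img'], '')

# Inside <style> the tokenizer has switched state: the same bytes are text.
print(tokenize(f"<style>{PAYLOAD}</style>"))
# -> (['style'], '<img src=x onerror=alert(1)>')
```

The same byte sequence is classified as an instruction in one context and as inert text in another, which is the property the technique sets out to abuse.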

The researcher’s ‘LEXSS’ technique also exploits namespace confusion, an area of research impressively furthered by Michał Bentkowski’s DOMPurify bypass in 2020. “HTML parser will context switch to separate namespaces when it encounters MathML or SVG elements, which can be used to confuse the parser,” says Davis.
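A hedged sketch of how such a disagreement might arise: Python's standard-library `HTMLParser` is not namespace-aware, so it treats `<style>` as a raw-text element even when it sits inside `<svg>`. Under the HTML Standard's rules for foreign content, a browser would instead be expected to keep tokenizing inside an SVG `<style>` element and to treat `<img>` as a real start tag. The denylist, the helper names, and the payload below are assumptions made for illustration; they are not the payloads reported against TinyMCE or Froala.

```python
from html.parser import HTMLParser

# Hypothetical denylist used by a naive, start-tag-only check.
FORBIDDEN = {"img", "script", "iframe"}


class StartTags(HTMLParser):
    """Collects the start tags this (namespace-unaware) parser reports."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append(tag)


def reported_start_tags(markup):
    parser = StartTags()
    parser.feed(markup)
    parser.close()
    return parser.seen


def naive_check_passes(markup):
    # True when no forbidden tag was reported by the sanitizing parser.
    return not (set(reported_start_tags(markup)) & FORBIDDEN)


SVG_PAYLOAD = "<svg><style><img src=x onerror=alert(1)></style></svg>"

# The bare tag is caught.
print(naive_check_passes("<img src=x onerror=alert(1)>"))

# Wrapped in <svg><style>, this parser sees the <img> only as text,
# so the check passes even though a browser may parse it differently.
print(reported_start_tags(SVG_PAYLOAD), naive_check_passes(SVG_PAYLOAD))
```

Any check that relies on the sanitizing parser's view of the markup inherits that parser's blind spots whenever the browser's view differs.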

Conceptualizing XSS risk

The potential impact of XSS attacks varies by context.

“In many cases the risk will be nominal and in others catastrophic,” Chris Davis tells The Daily Swig. In the most severe cases, XSS could be exploited “to do things like transfer of funds, execution of financial securities trades or exfiltration of top secret data”.

“One way to conceptualize the risk of XSS is to consider when you’re at any website, what could an attacker do if they controlled your actions? As XSS allows that level of control within a site’s origin, generally unbeknownst to the user.”

Prevention

As for preventative steps, “when implementing applications that allow some user-controlled HTML by design”, developers should “process the HTML as close to the original parse as possible”, explains Davis.

“For organizations that are not creating these types of solutions but rather including them in their applications, a good patch policy will go a long way in preventing exploitation.”

Organizations should also “consider implementing a content security policy (CSP) into the application” to “block JavaScript injection at a browser-defined level”.
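One possible shape for such a policy, sketched under the assumption that the application can attach a fresh nonce to each response and to its own `<script>` tags, is shown below. The `build_csp_header` helper and the specific directives chosen are illustrative only and would need tailoring to the application in question.

```python
import base64
import secrets


def build_csp_header():
    """Returns a per-response nonce and a matching CSP header pair."""
    # 16 random bytes, base64-encoded, generated anew for every response.
    nonce = base64.b64encode(secrets.token_bytes(16)).decode("ascii")
    policy = "; ".join(
        [
            "default-src 'self'",
            # Only scripts carrying this nonce may run; injected inline
            # scripts and event-handler attributes are refused.
            f"script-src 'nonce-{nonce}'",
            "object-src 'none'",
            "base-uri 'none'",
        ]
    )
    return nonce, ("Content-Security-Policy", policy)


nonce, (name, value) = build_csp_header()
print(f"{name}: {value}")
```

With a policy of this kind in place, an `onerror` handler that slips past a sanitizer should be blocked by the browser, since inline event handlers cannot carry a nonce.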

Future research

Asked why he pursued this research avenue, Davis tells The Daily Swig: “This type of context state parsing based analysis is so widespread yet relatively uncovered.

“So getting a better understanding of how HTML in general is parsed and how rich-text style editors or sanitization libraries then parse that data and how we can exploit that knowledge was, to me, fascinating.”

He adds that he expects similar flaws to surface in “some really impactful targets” such as email clients, and that digging further into HTML parsing could also be fruitful.

“I really hope this work aids other researchers in taking it to the next level,” he concludes.
