More on scripts with non-English characters

In my last post I discussed an interesting find where I was able to execute scripts on a page by submitting double-byte characters in the input form.  The execution happens in part due to the fact that the resulting page is set to use ISO-8859-1 character encoding which truncates the high order byte of the double-byte character. 

As I have been asked numerous questions since, I thought I would mock up a simple application to demonstrate this phenomenon.  The sample application uses the same technology that the application in my previous post uses - Struts 1.3.8, DHTMLSuite, and JSPs.  You can download this sample application from Google Code here. Some of you may have to recompile the war, as Eclipse optimized it to run on BEA WebLogic 9.2, but this should not cause any significant heartburn.

When you deploy the .war file, simply access the demo by typing http://servername<:port>/International/execute/DisplaySetup in your favorite browser.  The top form on the page does a normal form submit.  The resulting page should not execute any scripts contained in the submitted data.  Once you have had enough fun with that, go back to the form page and try out the bottom form, which uses DHTMLSuite to submit the form and display the results in a div located on the same page.

If you would like a small list of sample strings you can use in this application to demonstrate to others, check out my colleagues' blog over at http://pentesterconfessions.blogspot.com/.

Hope this explains a lot.

Executing scripts with non-English characters

I have been working on a medium-sized development project lately and came across a peculiar phenomenon where I could execute scripts on a page without the use of less-than (<) or greater-than (>) symbols. Instead I used double-byte characters. For a little detail on the project, the technologies being used include Apache Struts 1.3.8, the Commons Validator plug-in, DHTMLSuite, and some other AJAX style controls to make the UI interactive. So now on to the findings.

To start out, the character encoding on all of the JSPs was set to ISO-8859-1.  In addition, validation was in place for all form fields, although as it turns out, much to my chagrin, clever people can bypass anything. Since the field in question was free form, a user could enter anything they wanted, which we developers had decided was OK since we would be vigilant in our use of output encoding. For more info on why this should be OK, check out Jim Manico's blog article here. To accomplish our output encoding, we decided to use the <bean:write /> tag from the Struts 1.3.8 tag library as it has a fairly decent encoding practice, although it does leave a little - or quite a bit depending on your point of view - to be desired.

Now as I stated before, validation was being done on the field. The specific regex in use was ^[^<>]*$. This very simple regex accepts any input that does not contain the less-than or greater-than characters. As the value submitted was being displayed in a div as text and was being "properly encoded", we felt safe. Then we started discussing the possibility of users submitting foreign characters. By foreign characters I mean characters such as ܼ or ܾ or even ܯ. That is when things got a little interesting. I should also state here that due to the architecture of the site, all form submissions took place via AJAX calls instead of normal form submissions. This is important to mention because AJAX submissions do not format the data in the same manner that a normal form submission does. More on this in a minute.
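To make the validation step concrete, here is a minimal Java sketch of a check along those lines. The class and method names are made up for illustration, and the pattern assumes the rule was meant to accept any run of characters other than < and >, so treat it as an approximation of what the project used rather than a copy of it.

```java
import java.util.regex.Pattern;

final class AngleBracketValidator {
    // Assumed form of the rule: zero or more characters, none of them < or >.
    private static final Pattern NO_ANGLE_BRACKETS = Pattern.compile("^[^<>]*$");

    // Hypothetical helper: true when the input passes the validation rule.
    static boolean isValid(String input) {
        return NO_ANGLE_BRACKETS.matcher(input).matches();
    }

    public static void main(String[] args) {
        // The look-alike characters are written as Unicode escapes so the source stays ASCII.
        String attack = "\u073Cscript\u073Ealert(1);\u073C\u072Fscript\u073E";
        System.out.println(isValid("<script>alert(1);</script>")); // rejected
        System.out.println(isValid(attack));                        // accepted
    }
}
```

The second call is the interesting one: none of the characters in that string is a literal < or >, so a blacklist of those two characters has nothing to object to.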

ISO-8859-1 is an 8-bit, and therefore single-byte, character encoding. The foreign characters I just showed you are all double-byte characters. As it turns out, when you display a double-byte character on a page that is coded to display single-byte characters, the high-order byte of the character gets truncated. As such, the character ܼ (U+073C) becomes < (0x3C) when displayed. The key phrase here is when displayed. It passed the validation because ܼ is not <.
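The arithmetic behind that truncation can be sketched in a few lines of Java. This only models the effect described above by masking each character down to its low-order byte; it is not how any particular server or browser implements it, and the class name is invented for the example. (Java's own ISO-8859-1 encoder would substitute a ? for these characters instead, which is why the mask is written out by hand.)

```java
final class LowByteTruncation {
    // Keep only the low-order 8 bits of each UTF-16 code unit,
    // mimicking a double-byte character losing its high-order byte.
    static String truncateToLowByte(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            out.append((char) (s.charAt(i) & 0xFF));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // 0x073C -> 0x3C (<), 0x073E -> 0x3E (>), 0x072F -> 0x2F (/)
        String attack = "\u073Cscript\u073Ealert(1);\u073C\u072Fscript\u073E";
        System.out.println(truncateToLowByte(attack));
    }
}
```

Plain ASCII characters are untouched by the mask, so only the three look-alike characters change.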

Now you may be asking, what about the output encoding provided by <bean:write />? In this case it is no good. When one inspects the actual encoding implementation used by ResponseUtils, it only encodes the five major characters considered sensitive in HTML. These are <, >, ", ', and &. So, for the same reasons that the string passes validation, it was not being properly encoded.
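A stand-in for that kind of five-character filter might look like the following. This is my own sketch of the general approach, not the Struts source, and the class name and exact entity choices are assumptions.

```java
final class FiveCharFilter {
    // Hypothetical filter that escapes only the five characters listed above.
    static String filter(String value) {
        StringBuilder out = new StringBuilder(value.length() + 16);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '&':  out.append("&amp;");  break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);        // everything else passes through untouched
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(filter("<script>alert(1);</script>"));
    }
}
```

Feeding it the look-alike string returns the input unchanged, because none of its characters is one of the five.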

So what happened? We wrote a simple page that output the first 1,000 characters as numeric character references (&#1234;), then sampled them to see what would be displayed when the character was truncated due to the ISO-8859-1 setting in the JSP. The resulting attack string:

ܼscriptܾalert(1);ܼܯscriptܾ

This string, when echoed to the page, was displayed as <script>alert(1);</script>. Hello XSS!

Now I promised more info on the form encoding. First off, we discovered that if we submitted the form via an HTML submit button (the normal way), the attack did not take place. So we fired up Burp Suite and here is what was submitted:

%26%231852%3Bscript%26%231854%3Balert%281%29%3B%26%231852%3B%26%231839%3Bscript%26%231854%3B

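As you can see, this is a URL-encoded form of the attack string listed above, with one twist: because the form page was ISO-8859-1, the browser replaced each character it could not represent with a numeric character reference (&#1852; for ܼ, for example) before URL-encoding it. The server therefore received plain ASCII, and the resulting screen displayed harmless text. On the other hand, when we submitted the form through AJAX, the data sent was like this: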

%DC%BCscript%DC%BEalert(1)%3B%DC%BC%DC%AFscript%DC%BE

The difference in encoding is easy to see, and it is also simple to explain. When one submits a form through an HTML submit button, the browser encodes the form data, by default as application/x-www-form-urlencoded, using the character set of the page. The AJAX call does not go through that step; as the capture shows, it sent each character as percent-encoded UTF-8 bytes (%DC%BC for ܼ). As a result, the server on the backend decodes the information differently, thus returning different response data to the client for display. This different response, when used in conjunction with the display character encoding issues described earlier, results in potential XSS.
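Decoding the two captured request bodies side by side shows the difference. The sketch below uses java.net.URLDecoder with UTF-8; which charset the server actually applied is an assumption on my part, and the class name is invented for the example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

final class SubmissionDecoding {
    // Decode a percent-encoded value as UTF-8 (assumed server-side charset).
    static String decodeUtf8(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    public static void main(String[] args) {
        String formSubmit = "%26%231852%3Bscript%26%231854%3Balert%281%29%3B"
                + "%26%231852%3B%26%231839%3Bscript%26%231854%3B";
        String ajaxSubmit = "%DC%BCscript%DC%BEalert(1)%3B%DC%BC%DC%AFscript%DC%BE";

        // Normal submit: plain ASCII text containing numeric character references.
        System.out.println(decodeUtf8(formSubmit));
        // AJAX submit: the real double-byte characters; print only the first code point.
        System.out.println(Integer.toHexString(decodeUtf8(ajaxSubmit).charAt(0)));
    }
}
```

In the first case the server sees an ampersand, which the output encoder escapes, so the reference is shown as literal text. In the second case the server holds the actual double-byte characters, which pass both the validator and the encoder.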

So how did we fix this, if you are wondering? First off we changed all of our JSPs to utilize UTF-8 as the character-set encoding so we were safe there. In addition, we made our regex significantly stronger.

I hope this helps!

When you have to display HTML from the user


Everyone knows, or should know by now, about XSS and the potentially harmful effects it can have on one's site. So, how do you protect yourself? Simple - output encoding. This is the only way to guarantee that a user's input will not be executed on your site. This doesn't give you an excuse not to do input validation, but for those cases where crafty people bypass your filter, output encoding can still save you. It really is about defense in depth. There are some good packages out there that will help you do this encoding too. Check out StringEscapeUtils from Apache Commons Lang (personal favorite), hDiv, and (second favorite) if you are a fan of the Struts framework. The key to the success of any of these options is diligence. Use it consistently, and use it everywhere you display any input generated, or potentially generated, from an end user.

Well, you may be thinking that this is all well and good, but what about sites that actually allow users to write HTML that will then be displayed to other users - like in some forums or blogs. Obviously these sites need to protect the other users from potentially harmful markup and JavaScript while still allowing users some customization of the output's display. In this case, full-fledged output encoding would render this feature useless. So, what is a developer to do? Enter the concept of sanitizing output.

Sanitizing output is very similar to whitelist input validation in that both attempt to ensure that the data you get from the user is in the expected format, and that the data is "safe". I place the word safe in quotes here, because what is considered safe in one scenario may not be safe in another. In short, there is no one regex, method, or what-have-you that will save everyone. However, there is a methodology that one can apply that will handle this scenario very well.

Step 1: Explicitly define the set of allowed tags.
Step 2: For each tag defined above, explicitly define the set of allowable attributes.
Step 3: Define a set of regexes to test the input from the user against the defined tags and attributes.
Step 4: Remove anything that does not pass the regex test. (This is the sanitization part)
Step 5: Be diligent. (Just like always)
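To give a feel for how these steps fit together, here is a rough Java sketch. The tag list, attribute list, regexes, and the WhitelistSanitizer name are all hypothetical choices made for illustration; a real policy would need to be worked out for the site in question, and this is not a hardened implementation.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WhitelistSanitizer {
    // Steps 1 and 2: allowed tags, each mapped to its allowed attributes (example policy).
    private static final Map<String, Set<String>> ALLOWED = new HashMap<>();
    static {
        ALLOWED.put("b", Collections.<String>emptySet());
        ALLOWED.put("i", Collections.<String>emptySet());
        ALLOWED.put("p", Collections.<String>emptySet());
        ALLOWED.put("a", new HashSet<>(Arrays.asList("href", "title")));
    }

    // Step 3: regexes. ANY_TAG finds everything starting with <, WELL_FORMED
    // describes the only tag shape accepted, SAFE_URL restricts href values.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]*>?");
    private static final Pattern WELL_FORMED = Pattern.compile(
            "<(/?)([a-z][a-z0-9]*)((?:\\s+[a-z]+=\"[^\"<>]*\")*)\\s*>",
            Pattern.CASE_INSENSITIVE);
    private static final Pattern ATTRIBUTE = Pattern.compile(
            "([a-z]+)=\"([^\"<>]*)\"", Pattern.CASE_INSENSITIVE);
    private static final Pattern SAFE_URL = Pattern.compile(
            "^https?://[^\\s\"'<>]+$", Pattern.CASE_INSENSITIVE);

    // Step 4: remove every tag-like span that fails the tests above.
    static String sanitize(String input) {
        Matcher m = ANY_TAG.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String keep = isAllowed(m.group()) ? m.group() : "";
            m.appendReplacement(out, Matcher.quoteReplacement(keep));
        }
        m.appendTail(out);
        return out.toString();
    }

    private static boolean isAllowed(String tag) {
        Matcher t = WELL_FORMED.matcher(tag);
        if (!t.matches()) {
            return false;
        }
        Set<String> allowedAttrs = ALLOWED.get(t.group(2).toLowerCase(Locale.ROOT));
        if (allowedAttrs == null) {
            return false; // tag is not on the list
        }
        boolean closing = !t.group(1).isEmpty();
        Matcher a = ATTRIBUTE.matcher(t.group(3));
        while (a.find()) {
            String name = a.group(1).toLowerCase(Locale.ROOT);
            if (closing || !allowedAttrs.contains(name)) {
                return false; // closing tags carry no attributes; unknown attributes are refused
            }
            if (name.equals("href") && !SAFE_URL.matcher(a.group(2)).matches()) {
                return false; // only http(s) links in this example policy
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(sanitize("<b>hi</b><script>alert(1)</script>"));
    }
}
```

One design choice worth noting: a rejected opening tag is dropped on its own, so its closing tag can be left behind unbalanced. A fuller version would track open tags, but that is beyond what this sketch tries to show.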

For a good example of this technique in use, look at some of the code refactorings at Refactor My Code. In this case, Jeff Atwood is trying to allow some markup to be displayed, but at the same time protect his site. As one can see from the code, C# in this case, he follows all the steps listed above to do proper output sanitization. I cannot say how diligent he was because I do not have his site code, but you get the idea.

In any case, no matter what the language, this technique can be used reliably in the event that some markup is to be displayed from the end user. Hope this helps!