Whitelist input validation, where it becomes hairy

Everyone in the security field knows that the best way to protect your site from cross-site scripting (XSS) vulnerabilities is to implement proper whitelist input validation for all input fields, including hidden fields. This is not an easy task by any means, and it requires significant planning on the part of the developers and architects to determine which characters are acceptable for a given input. Some sets are easier than others, but I digress. One problem that I foresee, and sometimes have to address, arises when internationalized sites not only internationalize the text displayed to the user but must also accept text in various languages and encodings from the user. In this case the input is also internationalized, and your English whitelist just went out the window, at least for non-English-speaking countries.

Fortunately, websites are not completely out of luck. Although code samples such as if (inputString.matches("\\w*")) { /* do something */ } and Pattern pattern = Pattern.compile("^[a-zA-Z0-9]*$"); ... abound, they really only apply to the US-ASCII character set. So how does one validate against a larger character set or, better yet, create validation routines that are character set agnostic? Enter POSIX validation.
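
Before moving on, the limitation is easy to confirm. The sketch below wraps the two common patterns in a small class and runs them against a few city and country names. The class and method names are made up for illustration, and the non-ASCII letters are written as Unicode escapes so the result does not depend on the source file encoding.

```java
import java.util.regex.Pattern;

final class AsciiWhitelistDemo {
    // Compiled once; Pattern instances are immutable and safe to share.
    private static final Pattern ASCII_ALNUM = Pattern.compile("^[a-zA-Z0-9]*$");

    static boolean isAsciiAlnum(String input) {
        return ASCII_ALNUM.matcher(input).matches();
    }

    static boolean isAsciiWord(String input) {
        // By default the word class covers only [a-zA-Z_0-9].
        return input.matches("\\w*");
    }

    public static void main(String[] args) {
        // "Munchen", "Munchen" with u-umlaut, "Espana" with n-tilde
        String[] samples = {"Munchen", "M\u00fcnchen", "Espa\u00f1a"};
        for (String s : samples) {
            System.out.println(s + " alnum=" + isAsciiAlnum(s) + " word=" + isAsciiWord(s));
        }
    }
}
```

Both checks accept the plain ASCII spelling and reject the native spellings, which is exactly the problem for an internationalized form.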

Fortunately for many, Java's regular expression engine (java.util.regex) has supported POSIX-style character classes for some time. Here is a code snippet that is functionally equivalent to "^[a-zA-Z]*$":
String posixAlphaCharacters = "\\p{Alpha}*";
if (inputString.matches(posixAlphaCharacters)) {
    // do something cool
} else {
    // bomb horribly
}
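The problem with the above code is that it does not really give us anything. Many words that are spelled in their native way, such as München or España, will still fail this validation. So we need to tweak our code slightly to broaden the character set. Java can also match Unicode blocks, but a block is a range of code points rather than a set of letters: Basic Latin alone covers all of US-ASCII, including < and >. Intersecting the blocks with the Unicode letter category \p{L} keeps the whitelist to letters only. Take a look below.
String posixLatinCharacters = "[\\p{L}&&[\\p{InBasicLatin}\\p{InLatin-1Supplement}]]*";
if (inputString.matches(posixLatinCharacters)) {
    // do something cool
} else {
    // bomb a little less horribly
}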
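
To check how these patterns treat real input, here is a small comparison harness. It is only a sketch: the class and constant names are hypothetical, and the third pattern (letters restricted to the two Latin blocks) is one possible tightening, not the only one. The point it makes is that a pattern built from whole Unicode blocks accepts markup characters, because Basic Latin spans U+0000 to U+007F and Latin-1 Supplement spans U+0080 to U+00FF.

```java
import java.util.regex.Pattern;

final class LatinPatternComparison {
    // POSIX class: US-ASCII letters only.
    static final Pattern POSIX_ALPHA = Pattern.compile("\\p{Alpha}*");

    // Whole blocks: every code point from U+0000 to U+00FF, which
    // includes punctuation such as '<' and '>' and control characters.
    static final Pattern LATIN_BLOCKS =
            Pattern.compile("[\\p{InBasicLatin}\\p{InLatin-1Supplement}]*");

    // Letters only, restricted to the same two blocks.
    static final Pattern LATIN_LETTERS =
            Pattern.compile("[\\p{L}&&[\\p{InBasicLatin}\\p{InLatin-1Supplement}]]*");

    static boolean matches(Pattern pattern, String input) {
        return pattern.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"Munchen", "M\u00fcnchen", "Espa\u00f1a", "<script>"};
        for (String s : samples) {
            System.out.printf("%-10s alpha=%b blocks=%b letters=%b%n",
                    s,
                    matches(POSIX_ALPHA, s),
                    matches(LATIN_BLOCKS, s),
                    matches(LATIN_LETTERS, s));
        }
    }
}
```

Running a table like this against your own sample data before shipping a whitelist is a cheap way to catch a pattern that is broader than intended.
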
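For the complete set of possibilities, review the Unicode Regular Expressions definitions (Unicode Technical Standard #18) together with the Javadoc for java.util.regex.Pattern, which lists the blocks, categories, and POSIX classes that Java accepts.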
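
As one example of what those definitions allow, a field that has to accept names in any script could whitelist the general categories for letters (\p{L}) and combining marks (\p{M}) instead of naming blocks. The following is a minimal sketch under assumptions of my own: the class name, the 64-character limit, and the choice of permitted separators (space, hyphen, apostrophe) are all hypothetical and should be set per field. Input is normalized to NFC first so that composed and decomposed spellings of the same text are treated alike.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

final class UnicodeNameValidator {
    // Hypothetical policy: runs of letters and combining marks from any
    // script, joined by a single space, hyphen, or apostrophe.
    private static final Pattern NAME =
            Pattern.compile("[\\p{L}\\p{M}]+(?:[ '\\-][\\p{L}\\p{M}]+)*");

    // Assumed upper bound; choose a limit that fits the field.
    private static final int MAX_LENGTH = 64;

    static boolean isValidName(String raw) {
        if (raw == null) {
            return false;
        }
        // Normalize before validating so equivalent spellings compare equal.
        String normalized = Normalizer.normalize(raw, Normalizer.Form.NFC);
        if (normalized.length() > MAX_LENGTH) {
            return false;
        }
        return NAME.matcher(normalized).matches();
    }

    public static void main(String[] args) {
        // Latin with umlaut, Greek, Japanese, and a markup attempt
        String[] samples = {
            "M\u00fcnchen", "\u0395\u03bb\u03bb\u03ac\u03b4\u03b1",
            "\u65e5\u672c\u8a9e", "<script>"
        };
        for (String s : samples) {
            System.out.println(s + " valid=" + isValidName(s));
        }
    }
}
```

A category-based whitelist is broader than a block-based one, so it trades some strictness for coverage. Whether that trade is acceptable depends on what the field is used for downstream.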

I hope this helps you in your development. I know it helped me when I found out about it.

4 comments:

Security Retentive said...

You may wish to consider using the ESAPI from OWASP.

http://www.owasp.org/index.php/ESAPI

Doing input validation also isn't your only bet here. For a good/funny read on problems with just input validation see:

http://krvw.com/pipermail/sc-l/2008/001440.html

Matt Presson said...

I know about the ESAPI, but you would still need to implement the provided interfaces to get truly internationalized input validation. If you only use the reference implementation you will only be able to validate US-ASCII characters.

Rafal said...

Brilliant. It's hard enough to get developers to accept that they'll have to white-list - but inevitably the question arises "but what if my site is international input/output?" Thanks for the refresher - it's always better to have an answer handy!

Does the .Net side of the house have an equivalent?

Matt Presson said...

Thanks, Rafal. I am not familiar with .NET, but I am sure it does. For output you could just set the charset. Output in multiple languages is harder, but also look into internationalization using properties files, as this is the most common approach.