Fortunately, websites are not completely out of luck. Code samples such as
if (inputString.matches("\\w*")) { /* do something */ }
and
Pattern pattern = Pattern.compile("^[a-zA-Z0-9]*$");
abound, but they really only apply to the US-ASCII character set. So, how does one validate against a larger character set or, better yet, create validation routines that are character-set agnostic? Enter POSIX validation.
Fortunately for many, Java's regular expression library (java.util.regex) has supported POSIX-style character classes for some time. Here is a code snippet that you can use that is functionally equivalent to "^[a-zA-Z]*$".
String posixAlphaCharacters = "\\p{Alpha}*";
if (inputString.matches(posixAlphaCharacters)) {
    // do something cool
}
else {
    // bomb horribly
}
The problem with the above code is that it does not really give us anything new: in Java, \p{Alpha} matches only US-ASCII letters by default. Many words that are spelled in their native way, such as München or España, will still fail this validation. So, we need to slightly tweak our code to broaden the character set. Take a look below.
String posixLatinCharacters = "[\\p{InBasic Latin}\\p{InLatin-1 Supplement}]*";
if (inputString.matches(posixLatinCharacters)) {
    // do something cool
}
else {
    // bomb a little less horribly
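}
Keep in mind that these two blocks cover every code point from U+0000 to U+00FF, which includes digits, punctuation, and control characters as well as letters, so this pattern is broader than a letters-only check.
For the complete set of possibilities, review the Unicode Regular Expressions definitions.
I hope this helps you in your development. I know it helped me when I found out about it.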
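As a postscript, here is a minimal, self-contained sketch that runs the patterns discussed in this post side by side, so you can see which inputs each one accepts. The class name LatinValidationSketch, the isValid helper, and the letters-only \p{L} pattern are my own additions for illustration and are not part of the snippets above. The block names are written with underscores (the Character.UnicodeBlock constant names), which is one of the spellings Java accepts. Non-ASCII sample strings use Unicode escapes so the file compiles the same way regardless of source encoding.

```java
import java.util.regex.Pattern;

final class LatinValidationSketch {
    // POSIX class: US-ASCII letters only under Java's default regex flags.
    static final Pattern POSIX_ALPHA = Pattern.compile("\\p{Alpha}*");

    // Every code point in the Basic Latin and Latin-1 Supplement blocks
    // (U+0000 to U+00FF), which includes punctuation, not just letters.
    static final Pattern LATIN_BLOCKS =
            Pattern.compile("[\\p{InBasic_Latin}\\p{InLatin_1_Supplement}]*");

    // Assumption for illustration: any Unicode letter, from any script.
    static final Pattern ANY_LETTER = Pattern.compile("\\p{L}*");

    // Hypothetical helper: true only if the whole input matches the pattern.
    static boolean isValid(Pattern pattern, String input) {
        return input != null && pattern.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "Munchen",
            "M\u00FCnchen",           // u with diaeresis, U+00FC
            "Espa\u00F1a",            // n with tilde, U+00F1
            "\u0141\u00F3d\u017A",    // Polish city name, uses Latin Extended-A
            "<script>"
        };
        for (String sample : samples) {
            System.out.printf("%-10s posixAlpha=%b latinBlocks=%b anyLetter=%b%n",
                    sample,
                    isValid(POSIX_ALPHA, sample),
                    isValid(LATIN_BLOCKS, sample),
                    isValid(ANY_LETTER, sample));
        }
    }
}
```

Running it shows that the block-based pattern accepts the München and España samples but also accepts "<script>", while the letters-only pattern rejects the markup and accepts letters outside Latin-1. Which trade-off is right depends on what the field is supposed to hold, so treat this as a starting point rather than a finished white-list.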
4 comments:
You may wish to consider using the ESAPI from OWASP.
http://www.owasp.org/index.php/ESAPI
Doing input validation also isn't your only bet here. For a good/funny read on problems with just input validation see:
http://krvw.com/pipermail/sc-l/2008/001440.html
I know about the ESAPI, but you would still need to implement the provided interfaces to get truly internationalized input validation. If you only use the reference implementation you will only be able to validate US-ASCII characters.
Brilliant. It's hard enough to get developers to accept that they'll have to white-list - but inevitably the question arises "but what if my site is international input/output?" Thanks for the refresher - it's always better to have an answer handy!
Does the .Net side of the house have an equivalent?
Thanks Rafal. I am not familiar with .NET, but I am sure they would. For output you could just set the charset. Output in multiple languages is harder, but also look into internationalization using properties files, as this is the most common approach.