Everyone knows, or should know by now, about XSS and the potentially harmful effects it can have on one's site. So, how do you protect yourself? Simple - output encoding. This is the only way to guarantee that a user's input will not be executed on your site. This doesn't give you an excuse to skip input validation, but for those cases where crafty people bypass your filter, output encoding can still save you. It really is about defense in depth. There are some good packages out there that will help you do this encoding too. Check out StringEscapeUtils from Apache Commons Lang (personal favorite), hDiv, and
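To make the idea concrete, here is a minimal sketch of output encoding for an HTML body context. It uses Python's standard html.escape rather than any of the packages named above, purely for brevity; the function name encode_for_html is a made-up helper for illustration.

```python
import html


def encode_for_html(untrusted: str) -> str:
    """Encode untrusted text so the browser renders it as text, not markup."""
    # html.escape replaces &, < and >; with quote=True it also replaces
    # double and single quotes, which matters inside attribute values.
    return html.escape(untrusted, quote=True)


comment = '<script>alert("xss")</script>'
print(encode_for_html(comment))
# prints: &lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt;
```

Note that this only covers the HTML context. Data written into JavaScript, CSS, or URLs needs encoding appropriate to that context.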
Well, you may be thinking that this is all well and good, but what about sites that actually allow users to write HTML that will then be displayed to other users - like in some forums or blogs. Obviously these sites need to protect the other users from potentially harmful markup and JavaScript while still allowing users some customization of the output's display. In this case, full-fledged output encoding would render this feature useless. So, what is a developer to do? Enter the concept of sanitizing output.
Sanitizing output is very similar to whitelist input validation in that both attempt to ensure that the data you get from the user is in the expected format, and that the data is "safe". I place the word safe in quotes here, because what is considered safe in one scenario may not be safe in another. In short, there is no one regex, method, or what-have-you that will save everyone. However, there is a methodology that one can apply that will handle this scenario very well.
Step 1: Explicitly define the set of allowed tags.
Step 2: For each tag defined above, explicitly define the set of allowable attributes.
Step 3: Define a set of regexes to test the input from the user against the defined tags and attributes.
Step 4: Remove anything that does not pass the regex test. (This is the sanitization part)
Step 5: Be diligent. (Just like always)
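The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not a production sanitizer: the tag list, the attribute rules, and every name in it (ALLOWED, is_allowed, sanitize) are hypothetical choices, and it is written in Python only for brevity. It deliberately accepts only lowercase tags and double-quoted attribute values, so anything written differently is simply removed.

```python
import re

# Steps 1 and 2: allowed tags, each mapped to its allowed attributes.
# Every attribute carries a pattern that its value must match in full.
# (Hypothetical policy; pick whatever your site actually needs.)
ALLOWED = {
    "b": {},
    "i": {},
    "em": {},
    "strong": {},
    "p": {},
    "a": {"href": re.compile(r"https?://[^\s\"'<>]+")},
}

# Step 3: regexes used to test the input.
# TAG_RE finds anything that looks like a tag, including an unterminated
# "<..." running to the end of the input.
TAG_RE = re.compile(r"<[^>]*(?:>|$)")
OPEN_RE = re.compile(r'<([a-z][a-z0-9]*)((?:\s+[a-z]+="[^"<>]*")*)\s*>')
CLOSE_RE = re.compile(r"</([a-z][a-z0-9]*)>")
ATTR_RE = re.compile(r'([a-z]+)="([^"<>]*)"')


def is_allowed(tag: str) -> bool:
    """Return True only if the tag and all of its attributes are whitelisted."""
    closing = CLOSE_RE.fullmatch(tag)
    if closing:
        return closing.group(1) in ALLOWED
    opening = OPEN_RE.fullmatch(tag)
    if not opening or opening.group(1) not in ALLOWED:
        return False
    allowed_attrs = ALLOWED[opening.group(1)]
    for name, value in ATTR_RE.findall(opening.group(2)):
        pattern = allowed_attrs.get(name)
        if pattern is None or not pattern.fullmatch(value):
            return False
    return True


def sanitize(markup: str) -> str:
    """Step 4: remove every tag that does not pass the whitelist test."""
    return TAG_RE.sub(
        lambda m: m.group(0) if is_allowed(m.group(0)) else "", markup
    )


print(sanitize('<b>hi</b><script>alert(1)</script>'))
# prints: <b>hi</b>alert(1)
print(sanitize('<a href="javascript:alert(1)">x</a>'))
# prints: x</a>
```

As the second example shows, this sketch does not balance tags: a rejected opening tag can leave its closing tag behind, so a real implementation would need a balancing pass as well. It also removes any stray "<" together with the text that follows it up to the next ">", and it leaves the text between tags unencoded. That is where Step 5 comes in.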
For a good example of this technique in use, look at some of the code refactorings at Refactor My Code. In this case, Jeff Atwood is trying to allow some markup to be displayed, but at the same time protect his site. As one can see from the code, C# in this case, he follows all the steps listed above to do proper output sanitization. I cannot say how diligent he was because I do not have his site code, but you get the idea.
In any case, no matter what the language, this technique can be used reliably whenever markup supplied by end users has to be displayed. Hope this helps!
3 comments:
This is where AntiSamy really shines. Define in your policy what HTML you want to accept, and the rest is blocked.
Check out OWASP's ESAPI project for input validation (with canonicalization that nothing else does) and output encoding for lots of different output contexts.
Thanks, both of you, for the responses.
@marcin
I was aware of AntiSamy, but somehow forgot to mention it (getting old I guess).
@Jeff
I have looked over ESAPI, but how exactly would it work in this case if it encodes the output? The point of this scenario is not to encode the output, but still remain safe. Are there AntiSamy-like features in ESAPI now?