Getting Output Encoding Right

As you may know, one of the best ways to protect your application from cross-site scripting is to use proper output encoding when displaying user editable data on the screen.  Here, I attempt to show that although the above statement sounds simple enough, it can actually be a very daunting task.

When it comes to output encoding in a web application, there are two main contexts that the developer needs to be aware of.  These are the HTML context and the JavaScript context.  These contexts are special in that the encoding scheme used for one does not apply to the other.  As such, care should be taken to ensure that the proper scheme is applied correctly on every page.

So, for the sake of argument, let us say that we are doing an assessment of a Java-based web application and discover that the developer has written code similar to the following:

<%@ page language="java" contentType="text/html; charset=UTF-8"
pageEncoding="UTF-8"
%>


<!DOCTYPE html PUBLIC "-//W3C//DTD HT
ML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<h
tml>
<head>
<meta http-equiv="Content-Type"
content="text/html; charset=UTF-8">
<title>Test Encode</title>
<script>
var param = <%= request.getParameter("js") %>;
alert(param);
</script
>
</
head
>
<
body
>
There is JavaScript on this page.
</body
>
</
html>



Obviously, this example code is vulnerable to cross-scripting.  So, in our report we point this out, and afterwards the developer comes back with the following change to the code:



...
<script>
var param = <%= StringEscapeUtils.escapeJavaScript(

request.getParameter("js")) %>;
alert(param);
</script
>
...


Upon inspection, it appears as though the developer has done the right thing and found a widely distributed library that will do the output encoding for him.  This is usually always better than trying to write a library/function yourself unless you REALLY understand the subject matter.  In this case, the develop chose the Apache Commons Lang - StringEscapeUtils class.  This appears to be a good choice as it provides methods for both of the contexts mentioned above, namely escapeHtml and escapeJavaScript.  There is only one problem, the escapeJavaScript method has a very poor implementation.  Why?  Well, the code comments say it all in this case:



/**
* ...
* <p>The only difference between Java strings and JavaScript
* strings
is that in JavaScript, a single quote must be
* escaped.</p>

* ...
*/



What?!  Maybe if you are parsing, but in no way is that statement true when it comes to encoding.  According to the JavaScript reference, encoding in a JavaScript context should escape all characters with an ASCII value less than 256 in the format of \xHH where HH is the hex representation of the ASCII value of the character.  For example, the null character, ASCII value 0, becomes \x00, the new line character, ASCII value 10, becomes \x10, etc.  So where does StringEscapeUtils go wrong?  Lets look at the code.



The definition of the escapeJavaScript function is defined as follows:



public static String JavaDoc escapeJava(String str) {
return escapeJavaStyleString(str, false);
}



After more looking, we come to the function in question:



private static void escapeJavaStyleString(Writer out, 
String str, boolean escapeSingleQuote)
throws IOException {

if (out == null) {
throw new IllegalArgumentException("The Writer must not be
null"
);
}
if (str == null) {
return;
}

int sz;
sz = str.length();
for (int i = 0; i < sz; i++) {
char ch = str.charAt(i);

// handle unicode
if (ch > 0xfff) { out.write("\\u" + hex(ch)); }
else if (ch > 0xff) { out.write("\\u0" + hex(ch)); }
else if (ch > 0x7f) { out.write("\\u00" + hex(ch));}
else if (ch < 32) {
switch (ch) {
case '\b': out.write('\\'); out.write('b');
break;
case '
\n': out.write('\\'); out.write('n');
break;
case '
\t': out.write('\\'); out.write('t');
break;
case '
\f': out.write('\\'); out.write('f');
break;
case '
\r': out.write('\\'); out.write('r');
break;
default :
if (ch > 0xf) { out.write("\\u00" + hex(ch)); }
else { out.write("\\u000" + hex(ch)); }
break;
}
}
else {
switch (ch) {
case '
\'': if (escapeSingleQuote) {
out.write('\\');
}
out.write('
\'');
break;
case '"': out.write('\\'); out.write('"');
break;
case '
\\': out.write('\\'); out.write('\\');
break;
default : out.write(ch);
break;
}
}
}
}


If you look closely at this code, you will notice that it fails to follow the specification for JavaScript encoding.  This code is only worried about the ' (single quote), " (double quote), and / (slash) as well as a few of the control characters.  The problem is they fail to consider characters that have special meaning in JavaScript such as ; (semi-colon), () (parenthesis), etc.



So, what does this mean?  A call to the application with the js parameter set to 1; alert(String.fromCharCode(88,83,83)) results in the following code being generated on the client:



...

<script> 
  var param = 1
; alert(String.fromCharCode(88,83,83));

  alert(param);


</script>

...



Now, there are two alerts happening instead of one.  Change this attack from alert(something) to eval(something really bad) and you still face the same problem.



So what is the solution?  Don't use StringEscapeUtils.escapeJavaScript() and hope to be safe from XSS in the JavaScript context.  Use something more robust (and implemented correctly) like the OWASP Reform project or the OWASP ESAPI.

6 comments:

Beto said...

Hi! Thanks for the post!

I'm new to Ajax security, and I was browsing the web searching for reform implementation examples, or something like that.

Please write something about how to use reform, or add a few links where people can find more (trusted) information.

In the Owasp page, and the google code page says nothing. I want a few examples in order to understand the basics and then do my way.

(I'm following the comments in this post)

Thanks for your help :)

Matt Presson said...

Beto,
I will try and get you a post on using Reform in the next few days.

Thanks for reading.

Beto said...

Thanks a lot for your help Matt.

Christof Meerwald said...

To me it looks like you are misunderstanding the intended purpose of the escapeJavaScript function and complaing that it's not doing what you want it to do.

Note that it's part of the StringEscapeUtils class which should tell you that it's meant to be used for strings. And inside strings, a semicolon, etc doesn't have any special meaning.

So the intended use would be as:

var param = '<%= StringEscapeUtils.escapeJavaScript(
request.getParameter("js")) %>';
alert(param);

And if you use it that way it's absolutely fine - it's just not meant to be used to sanitise (or whatever you would want it to do) random text to be used as Javascript code.

Matt Presson said...

Christof,
I have to disagree with your assessment of the intended usage of the escapeJavaScript function.

The purpose of an encoding function is to translate characters that would otherwise be interpreted as special/control characters, for that context, into something that will not be interpreted as a special/control character. As a semicolon is seen as a control character by the JavaScript interpreter, it should be escaped by the called function. By your argument, nothing should be escaped because inside a String any character is valid. It really doesn't have anything to do with Strings per se.

I will agree that placing single quotes around a variable that will be used as a string would protect you, but this is only because the single quote itself is escaped by a backslash. If this were not the case, even your implementation would be vulnerable.

My point is that not all variables taken from a request will be used as Strings. For instance, what function would you use if you wanted to treat the parameter as a number?

var param = '<%= StringEscapeUtils.escapeJavaScript(
request.getParameter("js")) %>';
alert(param + 1);

Your implementation would result in concatenation instead of addition. I know you could add code to cast it to a number, but why if you don't have to. Just encode the characters that are treated as special by the interpreter. It just so happens that in this case some were missed.

Christof Meerwald said...

To me, there is a reason why the class is called StringEscapeUtils - most likely it's because it's supposed to be used with strings (and characters like single quotes need to be escaped in string literals, but semicolons don't need to be escaped).

If you don't want to work with strings, then a StringEscapeUtils class clearly isn't the most appropriate class to use.

If the intended purpose of a class/method is not obvious, then I suggest filing a bug report to either get the documentation clarified or the implementation fixed.