Concepts and Practices
Before analysing specific attacks and how to protect against them, it is necessary to have a foundation in some basic principles of Web application security. These principles are not difficult to grasp, but they require a particular mindset about data. Simply put, a security-conscious mindset assumes that all data received as input is tainted, that this data must be filtered before use, and that it must be escaped when leaving the application. Understanding and practising these concepts is essential to ensure the security of your applications.
All Input Is Tainted
Perhaps the most important concept in any transaction is that of trust. Do you trust the data being processed? Can you? The answer is easy if you know the origin of the data. In short, if the data originates from a foreign source such as user form input, the query string, or even an RSS feed, it cannot be trusted. It is tainted data.
Data from these sources, and many others, is tainted because it is not certain whether it contains characters that might be executed in the wrong context. For example, a user might manipulate a query string value so that it contains JavaScript that, when echoed to a Web browser, could have harmful consequences.
As a general rule of thumb, the data in all of PHP’s superglobal arrays should be considered tainted, because some or all of the data in each of them comes from an external source. Even the $_SERVER array is not fully safe, because it contains some data provided by the client. The one exception to this rule is the $_SESSION superglobal array, which is persisted on the server and never transmitted over the Internet; only the session identifier travels between client and server.
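As a rough illustration of why even $_SERVER deserves suspicion, the sketch below picks out a few entries that are filled in from the client's request. The helper name clientSuppliedServerKeys() and the sample array are hypothetical and exist only for this example; the key names themselves are standard $_SERVER entries.

```php
<?php
// Hypothetical helper: list which keys of a $_SERVER-like array are
// filled in from the client's request and so must be treated as tainted.
function clientSuppliedServerKeys(array $server)
{
    $requestDerived = array('QUERY_STRING', 'REQUEST_URI', 'PHP_SELF');
    $tainted = array();
    foreach (array_keys($server) as $key) {
        // Every HTTP_* entry is copied from a request header; the query
        // string, request URI and PHP_SELF are also shaped by the client.
        if (strpos($key, 'HTTP_') === 0 || in_array($key, $requestDerived, true)) {
            $tainted[] = $key;
        }
    }
    return $tainted;
}

// Sample data standing in for a real request (assumed values)
$sample = array(
    'HTTP_USER_AGENT' => '<script>alert(1)</script>',
    'HTTP_REFERER'    => 'http://example.org/',
    'QUERY_STRING'    => 'id=1',
    'SERVER_SOFTWARE' => 'Apache',
);
$taintedKeys = clientSuppliedServerKeys($sample);
```

The list of keys checked here is not exhaustive; treating the whole array as tainted, as recommended above, remains the safer default.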
Before processing tainted data, it is important to filter it. Once the data is filtered, then it is considered safe to use. There are two approaches to filtering data: the whitelist approach and the blacklist approach.
Whitelist vs. Blacklist Filtering
The blacklist approach is the less restrictive form of filtering; it assumes that the programmer knows everything that should not be allowed to pass through. For example, some forums filter profanity using a blacklist approach: there is a specific set of words that are considered inappropriate for that forum, and these words are filtered out. However, any word that is not in that list is allowed. Thus, it is necessary to add new words to the list from time to time, as moderators see fit. This example may not correlate directly to the specific problems faced by programmers attempting to mitigate attacks, but it exposes the inherent problem in blacklist filtering: blacklists must be modified continually and expanded as new attack vectors become apparent.
On the other hand, whitelist filtering is much more restrictive, yet it affords the programmer the ability to accept only the input they expect to receive. Instead of identifying data that is unacceptable, a whitelist identifies only the data that is acceptable. This is information you already have when developing an application; it may change in the future, but you maintain control over the parameters that change and are not left to the whims of would-be attackers. Since you control the data that you accept, attackers are unable to pass any data other than what your whitelist allows. For this reason, whitelists afford stronger protection against attacks than blacklists.
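The difference between the two approaches can be sketched in a few lines. The function names blacklistFilter() and whitelistFilter(), the banned substrings, and the sample attack string are all assumptions made for this illustration, not part of any library.

```php
<?php
// Blacklist: strip substrings known to be dangerous. Anything that is
// not on the list passes through untouched.
function blacklistFilter($value)
{
    $banned = array('<script', 'javascript:');
    return str_ireplace($banned, '', $value);
}

// Whitelist: accept the value only if it is one of the known-good
// choices; otherwise report failure with null.
function whitelistFilter($value, array $allowed)
{
    return in_array($value, $allowed, true) ? $value : null;
}

$colours = array('Red', 'Blue', 'Yellow', 'Green');
$attack  = '<img src=x onerror=alert(1)>';

// The attack uses a vector the blacklist author did not anticipate,
// so the blacklist returns it unchanged.
$viaBlacklist = blacklistFilter($attack);

// The whitelist rejects it simply because it is not an expected value.
$viaWhitelist = whitelistFilter($attack, $colours);
```

Here the blacklist would have to be extended to catch the new vector, whereas the whitelist needs no change.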
Filter Input
Since all input is tainted and cannot be trusted, it is necessary to filter your input to ensure that input received is input expected. To do this, use a whitelist approach, as described earlier. As an example, consider the following HTML form:
<form method="POST">
Username: <input type="text" name="username" /><br />
Password: <input type="password" name="password" /><br />
Favourite colour:
<select name="colour">
<option>Red</option>
<option>Blue</option>
<option>Yellow</option>
<option>Green</option>
</select><br />
<input type="submit" />
</form>
This form contains three input elements: username, password, and colour. For this example, username should contain only alphabetic characters, password should contain only alphanumeric characters, and colour should contain any of “Red,” “Blue,” “Yellow,” or “Green.” It is possible to implement some client-side validation code using JavaScript to enforce these rules, but, as described later in the section on spoofed forms, it is not always possible to force users to use only your form and, thus, your client-side rules. Therefore, server-side filtering is important for security, while client-side validation is important for usability.
To filter the input received with this form, start by initializing a blank array. It is important to use a name that sets this array apart as containing only filtered data; this example uses the name $clean. Later in your code, when encountering the variable $clean['username'], you can be certain that this value has been filtered. If, however, you see $_POST['username'] used, you cannot be certain that the data is trustworthy. Thus, avoid the raw $_POST value and use the one from the $clean array instead. The following code example shows one way to filter the input for this form:
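$clean = array();
// Accept the username only if it was submitted and is purely alphabetic
if (isset($_POST['username']) && ctype_alpha($_POST['username']))
{
    $clean['username'] = $_POST['username'];
}
// Accept the password only if it was submitted and is alphanumeric
if (isset($_POST['password']) && ctype_alnum($_POST['password']))
{
    $clean['password'] = $_POST['password'];
}
// Accept the colour only if it exactly matches one of the listed options
$colours = array('Red', 'Blue', 'Yellow', 'Green');
if (isset($_POST['colour']) && in_array($_POST['colour'], $colours, true))
{
    $clean['colour'] = $_POST['colour'];
}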
Filtering with a whitelist approach places the control firmly in your hands and ensures that your application will not receive bad data. If, for example, someone tries to pass a username or colour that is not allowed to the processing script, the worst that can happen is that the $clean array will not contain a value for username or colour. If username is required, then simply display an error message to the user and ask them to provide correct data. You should force the user to provide correct information rather than trying to clean and sanitize it on your own. If you attempt to sanitize the data, you may end up with bad data, and you’ll run into the same problems that result from the use of blacklists.
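One way to reject bad input and report it, rather than trying to repair it, is sketched below. The helper filterForm(), its return shape, and the error messages are hypothetical choices for this example; only the username and colour fields are shown to keep it short.

```php
<?php
// Hypothetical helper: returns array($clean, $errors) for the form above.
function filterForm(array $input)
{
    $clean  = array();
    $errors = array();

    // Username: must be present, a string, and purely alphabetic
    if (isset($input['username'])
        && is_string($input['username'])
        && ctype_alpha($input['username'])) {
        $clean['username'] = $input['username'];
    } else {
        $errors['username'] = 'Username is required and may contain letters only.';
    }

    // Colour: must exactly match one of the options offered in the form
    $colours = array('Red', 'Blue', 'Yellow', 'Green');
    if (isset($input['colour']) && in_array($input['colour'], $colours, true)) {
        $clean['colour'] = $input['colour'];
    } else {
        $errors['colour'] = 'Please choose one of the listed colours.';
    }

    return array($clean, $errors);
}

// A submission with an invalid username and a valid colour
list($clean, $errors) = filterForm(array(
    'username' => 'bob<script>',
    'colour'   => 'Blue',
));
```

The invalid username is never copied into $clean and is not altered to make it fit; the script can redisplay the form with the message stored in $errors.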
Escape Output
Output is anything that leaves your application, bound for a client. The client, in this case, is anything from a Web browser to a database server, and just as you should filter all incoming data, you should escape all outbound data. Whereas filtering input protects your application from bad or harmful data, escaping output protects the client and user from potentially damaging commands.
Escaping output should not be regarded as part of the filtering process, however. These two steps, while equally important, serve distinct purposes. Filtering ensures the validity of data coming into the application; escaping protects you and your users from potentially harmful attacks. Output must be escaped because clients (Web browsers, database servers, and so on) often take action when encountering special characters. For Web browsers, these special characters form HTML tags; for database servers, they may include quotation marks and SQL keywords. Therefore, it is necessary to know the intended destination of output and to escape accordingly: escaping output intended for a database will not suffice when sending that same output to a Web browser. Since most PHP applications deal primarily with the Web and databases, this section focuses on escaping output for these two destinations, but you should always be aware of the destination of your output and of any special characters or commands that destination may accept and act upon, and be ready to escape those characters or commands accordingly.
To escape output intended for a Web browser, PHP provides htmlspecialchars() and htmlentities(). The latter is the more exhaustive of the two, since it converts every character that has an HTML entity equivalent, and it is therefore the function recommended here. The following code example illustrates the use of htmlentities() to prepare output before sending it to the browser. Another concept illustrated is the use of an array specifically designed to store output. If you prepare output by escaping it and storing it in a specific array, you can then use that array’s contents without having to worry about whether the output has been escaped. If you encounter a variable in your script that is being output and is not part of this array, then it should be regarded with suspicion. This practice will help make your code easier to read and maintain. For this example, assume that the value for $user_message comes from a database result set.
$html = array();
$html['message'] = htmlentities($user_message, ENT_QUOTES, 'UTF-8');
echo $html['message'];
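To make the effect of escaping concrete, the following sketch runs the same call on a sample value. The contents of $user_message are an assumption chosen to include every character that ENT_QUOTES affects.

```php
<?php
// Assumed sample value, as if read from a database result set
$user_message = '<script>alert("hi")</script> & it\'s done';

// Escaped copies live in their own array so they are easy to recognise
$html = array();
$html['message'] = htmlentities($user_message, ENT_QUOTES, 'UTF-8');

echo $html['message'];
// prints: &lt;script&gt;alert(&quot;hi&quot;)&lt;/script&gt; &amp; it&#039;s done
```

The browser displays the escaped string as literal text instead of interpreting it as markup.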
Escape output intended for a database server, such as in an SQL statement, with the database-driver-specific *_escape_string() function or, preferably, use prepared statements. Because PHP 5.1 and later include PHP Data Objects (PDO), you may use prepared statements for all database engines for which there is a PDO driver. If the database engine does not natively support prepared statements, then PDO emulates this feature transparently for you.
The use of prepared statements allows you to specify placeholders in an SQL statement. This statement can then be executed multiple times throughout an application, substituting new values for the placeholders each time. The database engine keeps the bound values separate from the SQL text (or PDO, if emulating prepared statements, escapes them for you), so the values cannot alter the structure of the statement. The Database Programming chapter contains more information on prepared statements, but the following code provides a simple example of binding a parameter to a prepared statement.
// First, filter the input
$clean = array();
if (isset($_POST['username']) && ctype_alpha($_POST['username']))
{
    $clean['username'] = $_POST['username'];
}
// Set a named placeholder in the SQL statement for username
$sql = 'SELECT * FROM users WHERE username = :username';
// Assume the database handler exists; prepare the statement
$stmt = $dbh->prepare($sql);
// Bind a value to the parameter
$stmt->bindParam(':username', $clean['username']);
// Execute and fetch results
$stmt->execute();
$results = $stmt->fetchAll();
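The reuse of a single prepared statement with different values can be sketched as follows. This example assumes the pdo_sqlite driver is available and uses an in-memory database with a made-up users table; neither comes from the original text.

```php
<?php
// Assumption: the pdo_sqlite driver is installed. An in-memory database
// with sample rows stands in for the real one.
$dbh = new PDO('sqlite::memory:');
$dbh->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$dbh->exec('CREATE TABLE users (username TEXT)');
$dbh->exec("INSERT INTO users (username) VALUES ('alice')");
$dbh->exec("INSERT INTO users (username) VALUES ('bob')");

// Prepare once; the placeholder keeps the value apart from the SQL text
$stmt = $dbh->prepare('SELECT username FROM users WHERE username = :username');

// Execute the same statement several times with different values.
// The last value is a classic injection attempt; it is compared as a
// plain string and matches no row.
$found = array();
foreach (array('alice', 'bob', "' OR '1'='1") as $name) {
    $stmt->execute(array(':username' => $name));
    $found[$name] = count($stmt->fetchAll(PDO::FETCH_ASSOC));
}
```

The injection string is included only to show that a bound value cannot change the query; it does not replace filtering input first, as the example above does.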
Register Globals
When set to On, the register_globals configuration directive automatically injects variables into scripts. That is, all variables from the query string, posted forms, session store, cookies, and so on are available in what appear to be locally-named variables. Thus, if variables are not initialized before use, it is possible for a malicious user to set script variables and compromise an application.
Consider the following code used in an environment where register_globals is set to On. The $loggedin variable is not initialized, so a user for whom checkLogin() would fail can easily set $loggedin by passing loggedin=1 through the query string. In this way, anyone can gain access to a restricted portion of the site. To mitigate this risk, set $loggedin = FALSE at the top of the script or, preferably, turn off register_globals. Even with register_globals set to Off, it is a best practice to always initialize variables.
if (checkLogin())
{
$loggedin = TRUE;
}
if ($loggedin)
{
// do stuff only for logged in users
}
Note that a by-product of having register_globals turned on is that it is impossible to determine the origin of input. In the previous example, a user could set $loggedin from the query string, a posted form, or a cookie. Nothing restricts the scope in which the user can set it, and nothing identifies the scope from which it comes. A best practice for maintainable and manageable code is to use the appropriate superglobal array for the location from which you expect the data to originate—$_GET, $_POST, or $_COOKIE. This accomplishes two things: first of all, you will know the origin of the data; in addition, users are forced to play by your rules when sending data to your application.
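Because register_globals is not available in current PHP versions, the sketch below uses extract() to imitate its effect inside a function. The stub checkLogin(), which always fails here, and the two function names are hypothetical and serve only to contrast the uninitialized and initialized cases.

```php
<?php
// Hypothetical stand-in for the real login check; it always fails here
function checkLogin()
{
    return false;
}

// Vulnerable: extract() mimics register_globals by turning request
// parameters into local variables, and $loggedin is never initialized.
function isLoggedInVulnerable(array $query)
{
    extract($query);
    if (checkLogin()) {
        $loggedin = true;
    }
    return !empty($loggedin);
}

// Safer: $loggedin is initialized, and request data is never imported
// into the local scope, so the parameter cannot influence the result.
function isLoggedInSafe(array $query)
{
    $loggedin = false;
    if (checkLogin()) {
        $loggedin = true;
    }
    return $loggedin;
}

$forged = array('loggedin' => '1'); // as if ?loggedin=1 had been requested

$vulnerableResult = isLoggedInVulnerable($forged); // true: access granted
$safeResult       = isLoggedInSafe($forged);       // false: access denied
```

When request data is genuinely needed, reading it from the specific superglobal (for example $_GET['page']) keeps its origin visible in the code.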
Before PHP 4.2.0, the register_globals configuration directive was set to On by default. Since then, it has been set to Off by default. The directive was deprecated in PHP 5.3.0 and removed entirely in PHP 5.4.0.