Accessing The W3C HTML Validator API Using Streams


Are you generating HTML using PHP? Calling the W3C validator API from your PHP code lets you validate that HTML automatically before it goes live.

The Basics Of The API

The W3C HTML validator provides an API with several ways to submit a document. The two most useful are validating a file by URL and validating a chunk of HTML source. Requests are sent as an HTTP POST, and the details are documented at http://validator.w3.org/docs/api.html.

Using Streams Instead Of Other Methods

PHP offers several ways to send a POST request, e.g. cURL. One of the simplest, however, is to use streams, which let you specify the request parameters in an array and then initiate the request with standard file access functions such as file_get_contents. Stream wrappers are available for several protocols, e.g. FTP and HTTP; in this case the HTTP wrapper is used.

The POST data is specified as an array of options:

PHP

$exampleContent = "<!DOCTYPE html> ..string of HTML source..";
$postDataRequest = [
    'http' => [
        'method' => 'POST',
        'header' => "Content-type: application/x-www-form-urlencoded",
        'user_agent' => "Mozilla/5.0",
        'content' => http_build_query([
            'fragment' => $exampleContent,
            'output' => 'soap12',
        ]),
    ],
];
 

The 'method' is set to 'POST' (note that this must be upper case) and the header declares the body as URL-encoded form data. A user agent string must be supplied, otherwise the W3C validator rejects the request. It can be almost anything except blank, but it is probably easiest to pass along an existing one, e.g. the visiting browser's string from $_SERVER['HTTP_USER_AGENT'] (only set when the script runs in response to a web request).

Additional options specify how the input HTML source is provided and how the validation output is returned. The 'fragment' key lets the complete HTML source text be sent as part of the POST request, as in the example here. The other most useful key is 'uri', which takes the web URL of an HTML file for the validator to fetch.
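To make the difference between the two input keys concrete, here is a minimal sketch that builds the options array for either one. The function name buildValidatorOptions and the example.com URL are hypothetical, introduced only for illustration; the option keys are the ones already used in this article.

```php
<?php
// Hypothetical helper (not part of the W3C API): builds the stream context
// options for either a 'fragment' or a 'uri' validation request.
function buildValidatorOptions(string $inputKey, string $inputValue, string $userAgent): array
{
    if (!in_array($inputKey, ['fragment', 'uri'], true)) {
        throw new InvalidArgumentException("Unsupported input key: $inputKey");
    }
    return [
        'http' => [
            'method' => 'POST',
            'header' => "Content-type: application/x-www-form-urlencoded",
            'user_agent' => $userAgent,
            'content' => http_build_query([
                $inputKey => $inputValue,
                'output' => 'soap12',
            ]),
        ],
    ];
}

// Ask the validator to fetch and check a page itself (placeholder URL).
$uriRequest = buildValidatorOptions('uri', 'http://example.com/page.html', 'Mozilla/5.0');
echo $uriRequest['http']['content'], "\n";
```

Only the key inside http_build_query changes between the two modes, so the rest of the request handling can stay the same.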

The POST data array is turned into a stream context using stream_context_create, and this is sent to the validator URL using file_get_contents.

The output can either be the validator's standard output (an HTML results page) or an XML document containing all the necessary information; specifying 'output' as 'soap12' gives the XML.

If you aren't interested in the details of the errors and warnings, the response headers also include the number of errors and the number of warnings. These can be read from the array $http_response_header, which PHP populates automatically after the request.
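Because the position of a header within $http_response_header depends on what the server sends, looking it up by name may be safer than relying on a fixed index. The sketch below assumes the counts arrive in headers named X-W3C-Validator-Errors and X-W3C-Validator-Warnings (check these against the API document); the helper findHeaderValue and the sample header list are hypothetical.

```php
<?php
// Hypothetical helper: returns the value of the first header whose name
// matches $name (case-insensitively), or null when it is absent.
function findHeaderValue(array $headerLines, string $name): ?string
{
    foreach ($headerLines as $line) {
        // Header entries look like "Name: value"; lines without a colon
        // (such as the HTTP status line) are skipped.
        $parts = explode(':', $line, 2);
        if (count($parts) === 2 && strcasecmp(trim($parts[0]), $name) === 0) {
            return trim($parts[1]);
        }
    }
    return null;
}

// Hand-written sample in the shape of $http_response_header, not captured output.
$sampleHeaders = [
    'HTTP/1.1 200 OK',
    'Content-Type: application/soap+xml; charset=UTF-8',
    'X-W3C-Validator-Status: Invalid',
    'X-W3C-Validator-Errors: 3',
    'X-W3C-Validator-Warnings: 1',
];

// Header names are an assumption; see the lead-in.
$errorCount = (int) findHeaderValue($sampleHeaders, 'X-W3C-Validator-Errors');
$warningCount = (int) findHeaderValue($sampleHeaders, 'X-W3C-Validator-Warnings');
echo "errors: $errorCount, warnings: $warningCount\n";
```

In real use, $http_response_header would be passed in place of $sampleHeaders once file_get_contents has returned.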

PHP

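$validationUrl = "http://validator.w3.org/check";
$context = stream_context_create($postDataRequest);
$validatorResponse = file_get_contents($validationUrl, false, $context);
if ($validatorResponse === false) {
    // The request failed, e.g. a 403 because no user agent was sent
    exit("Validation request failed\n");
}
/*
 * A summary is returned in the response headers. The indices depend on
 * the order in which the validator sends its headers, so check them
 * against a real response.
 */
$errorsString = $http_response_header[5];
$warningString = $http_response_header[6];
/*
 * XML output can also be obtained
 */
$xmlResults = simplexml_load_string($validatorResponse);
/* and process away */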
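The XML returned for 'soap12' uses namespaces, so plain property access on the SimpleXML object will not find the elements. The following is a minimal sketch of one way to read the error list with XPath. The sample document is hand-written, and its element names and the namespace URI http://www.w3.org/2005/10/markup-validator are assumptions about the response structure that should be checked against a real response.

```php
<?php
// Hand-written sample that assumes the structure of a 'soap12' response;
// a real response carries more fields.
$sampleResponse = <<<'XML'
<env:Envelope xmlns:env="http://www.w3.org/2003/05/soap-envelope">
  <env:Body>
    <m:markupvalidationresponse xmlns:m="http://www.w3.org/2005/10/markup-validator">
      <m:validity>false</m:validity>
      <m:errors>
        <m:errorcount>1</m:errorcount>
        <m:errorlist>
          <m:error>
            <m:line>13</m:line>
            <m:col>6</m:col>
            <m:message>element "foo" undefined</m:message>
          </m:error>
        </m:errorlist>
      </m:errors>
    </m:markupvalidationresponse>
  </env:Body>
</env:Envelope>
XML;

// Assumed namespace URI of the validator's elements.
$validatorNs = 'http://www.w3.org/2005/10/markup-validator';

$xml = simplexml_load_string($sampleResponse);
// Bind a prefix of our own choosing ('v') to that namespace for XPath.
$xml->registerXPathNamespace('v', $validatorNs);

$errorMessages = [];
foreach ($xml->xpath('//v:errorlist/v:error') as $error) {
    // Child elements are namespaced too, so select them by namespace URI.
    $fields = $error->children($validatorNs);
    $errorMessages[] = sprintf(
        'line %d, col %d: %s',
        (int) $fields->line,
        (int) $fields->col,
        (string) $fields->message
    );
}
print_r($errorMessages);
```

Registering a prefix of your own means the code does not depend on whichever prefix the response happens to use.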
 

Pitfalls To Be Careful Of

1. A user agent string is necessary, otherwise a 403 Forbidden response is returned.

2. If you are processing lots of files, pause for one second between requests, otherwise W3C will get annoyed and may bar you from further use.

3. The XML output reports a warning total one higher than the number of warnings it lists (the extra, unlisted one is the warning that HTML5 validation is experimental).

4. The XML response is heavily (overly) structured and uses namespaces, which makes processing more complicated than it needs to be. Note that html5.validator.nu offers the same HTML5 validator as W3C, with a similar API (http://about.validator.nu/#api) and a much simpler XML output!
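The one-second pause from pitfall 2 can be sketched as a small loop. The helper validateWithPause and the stand-in callback are hypothetical; the callback only inspects the string so that the sketch runs offline, where a real one would send the POST request and return the parsed result.

```php
<?php
// Hypothetical helper: runs $validateOne on each item, sleeping between
// consecutive calls (but not before the first or after the last one).
function validateWithPause(array $items, callable $validateOne, int $pauseSeconds = 1): array
{
    $results = [];
    $isFirst = true;
    foreach ($items as $key => $item) {
        if (!$isFirst) {
            sleep($pauseSeconds);
        }
        $isFirst = false;
        $results[$key] = $validateOne($item);
    }
    return $results;
}

// Stand-in callback so the sketch runs without network access.
$fakeValidate = function (string $html): bool {
    return strpos($html, '<!DOCTYPE html>') === 0;
};

$results = validateWithPause(
    [
        'a.html' => '<!DOCTYPE html><title>ok</title>',
        'b.html' => '<p>no doctype</p>',
    ],
    $fakeValidate
);
var_dump($results);
```

Passing the validation step in as a callback keeps the throttling separate from the request code, so the same loop works for both 'fragment' and 'uri' requests.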