php-src/ext/tidy
John Coggeshall 3ab2935250 Tons of changes for Tidy 2.0 -- output buffering, multiple documents,
dual nature ($a->parse_file() and $a = tidy_parse_file()), etc.
2003-12-14 06:02:33 +00:00
examples Fixed a --enable-maintainer-zts glitch where TSRMLS_DC was being used 2003-09-25 09:32:55 +00:00
tests Updated test cases and examples and cleaned up the new OO code so it will 2003-09-22 18:40:38 +00:00
config.m4 Adding the tidy extension to PECL 2003-08-01 00:22:43 +00:00
CREDITS Tidy Hijacked! Read all about it ;) 2003-09-20 19:45:32 +00:00
libtidy.txt Added a note & fix for a memory leak in libtidy. 2003-09-21 20:21:39 +00:00
package.xml Tons of changes for Tidy 2.0 -- output buffering, multiple documents, 2003-12-14 06:02:33 +00:00
php_tidy.h Tons of changes for Tidy 2.0 -- output buffering, multiple documents, 2003-12-14 06:02:33 +00:00
README Updated test cases and examples and cleaned up the new OO code so it will 2003-09-22 18:40:38 +00:00
tidy.c Tons of changes for Tidy 2.0 -- output buffering, multiple documents, 2003-12-14 06:02:33 +00:00
tidy.dsp It builds much better when linked to the right lib 2003-08-08 12:15:17 +00:00
TODO Tons of changes for Tidy 2.0 -- output buffering, multiple documents, 2003-12-14 06:02:33 +00:00

README FOR ext/tidy by John Coggeshall <john@php.net>

Tidy Version: 0.7b

Tidy is an extension based on Libtidy (http://tidy.sf.net/) that lets a PHP developer
clean, repair, and traverse HTML, XHTML, and XML documents using OO constructs. This
includes documents with embedded scripting languages such as PHP or ASP.

---------------------------------------------------------------------------------------
!! Important Note !!
---------------------------------------------------------------------------------------
At this time libtidy has a small memory leak inside the ParseConfigFileEnc() function
used to load configuration from a file. If you intend to use this functionality, apply
the "libtidy.txt" patch (cd tidy/src/; patch -p0 < libtidy.txt) to the libtidy sources
and then recompile libtidy.
---------------------------------------------------------------------------------------

The Tidy extension has two separate APIs, one for general parsing, cleaning, and
repairing and another for document traversal. The general API is provided below:

  tidy_create()                     Reinitialize the tidy engine
  tidy_parse_file($file)            Parse the document stored in $file
  tidy_parse_string($str)           Parse the string stored in $str
  
  tidy_clean_repair()               Clean and repair the document
  tidy_diagnose()                   Diagnose a parsed document
  
  tidy_setopt($opt, $val)           Set a configuration option $opt to $val
  tidy_getopt($opt)                 Retrieve a configuration option
  
    ** note: $opt is a string naming the option. No formal documentation
    exists yet for the PHP extension, but many of the options are described
    at http://www.w3.org/People/Raggett/tidy/ and the supported options are
    listed in the phpinfo() output. **
  
  tidy_get_output()                 Return the cleaned tidy HTML as a string
  tidy_get_error_buffer()           Return a log of the errors and warnings
                                    returned by tidy
  
  tidy_get_release()                Return the Libtidy release date
  tidy_get_status()                 Return the status of the document
  tidy_get_html_ver()               Return the major HTML version detected for
                                    the document
                                    
  tidy_is_xhtml()                   Determines if the document is XHTML
  tidy_is_xml()                     Determines if the document is generic XML
  
  tidy_error_count()                Returns the number of errors in the document
  tidy_warning_count()              Returns the number of warnings in the document
  tidy_access_count()               Returns the number of accessibility-related
                                    warnings in the document.
  tidy_config_count()               Returns the number of configuration errors found
  
  tidy_load_config($file)           Loads the specified configuration file
  tidy_load_config_enc($file,
                       $enc)        Loads the specified config file using the specified
                                    character encoding
  tidy_set_encoding($enc)           Sets the current character encoding for the document
  tidy_save_config($file)           Saves the current config to $file
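
As a rough sketch of how the calls above fit together, the helper below parses a
string, repairs it, and returns the cleaned markup plus any diagnostics. It assumes
the argument lists exactly as shown in the table (each call acts on the current
document, with no handle passed around) and has not been checked against any other
release of the extension. sketch_repair_string() is a hypothetical name used only
for illustration.

```php
<?php
// Sketch only: uses the general API with the argument lists listed above.
// sketch_repair_string() is a hypothetical helper, not part of the extension.
function sketch_repair_string($html) {
    tidy_parse_string($html);   // load the markup into the tidy engine
    tidy_clean_repair();        // clean and repair the parsed document

    // Only fetch the log when tidy reported something.
    $report = '';
    if (tidy_error_count() > 0 || tidy_warning_count() > 0) {
        $report = tidy_get_error_buffer();
    }

    return array(
        'html'   => tidy_get_output(),   // cleaned markup as a string
        'report' => $report,             // errors and warnings, if any
    );
}
```

A caller would then do something like $result = sketch_repair_string($markup); and
echo $result['html'], printing $result['report'] only when it is non-empty.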
  
  
Beyond these general-purpose API functions, Tidy also supports the following
functions which are used to retrieve an object for document traversal:
  
  tidy_get_root()                   Returns an object starting at the root of the
                                    document
  tidy_get_head()                   Returns an object starting at the <HEAD> tag
  tidy_get_html()                   Returns an object starting at the <HTML> tag
  tidy_get_body()                   Returns an object starting at the <BODY> tag
  
All navigation of the parsed document is done through PHP 5 object constructs.
Tidy can create two types of objects. The first is TidyNode, which represents
HTML tags, text, and more (see the TidyNode_Type constants). The second is
TidyAttr, which represents an attribute within an HTML tag (TidyNode). The
functionality of these objects is described by the following schema:

class TidyNode {

    public $name;               // name of node (e.g. HEAD)
    public $value;              // value of node (everything between tags)
    public $type;               // type of node (text, php, asp, etc.)
    public $id;                 // id of node (e.g. TIDY_TAG_HEAD)

    public function attributes();   // an array of attributes (see TidyAttr)
    public function children();     // an array of child nodes
    
    function has_siblings();    // any sibling nodes?
    function has_children();    // any child nodes?
       
    function is_comment();      // is node a comment?
    function is_xhtml();        // is document XHTML?
    function is_xml();          // is document generic XML (not HTML/XHTML)?
    function is_text();         // is node text?
    function is_html();         // is node an HTML tag?
    
    function is_jste();         // is jste block?
    function is_asp();          // is Microsoft ASP block?
    function is_php();          // is PHP block?
    
    function next();            // returns next node
    function prev();            // returns prev node
        
    /* Searches for a particular attribute in the current node based
       on attribute ID (e.g. TIDY_ATTR_HREF). If found, returns a TidyAttr
       object for it. */
    function get_attr($attr_id);

}

class TidyAttr {

    public $name;           // attribute name, e.g. HREF
    public $value;          // attribute value
    public $id;             // attribute id, e.g. TIDY_ATTR_HREF

}

Examples of using these objects to navigate the tree can be found in the examples/
directory; urlgrab.php and dumpit.php are good starting points.
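
To illustrate the traversal idea without the extension loaded, here is a minimal
depth-first walk that relies only on the $name property and on the has_children()
and children() methods from the TidyNode schema. SketchNode is a hypothetical
stand-in class and sketch_dump_tree() a hypothetical helper; neither is part of
Tidy. With the extension, the object returned by tidy_get_root() would presumably
take the place of $tree.

```php
<?php
// Hypothetical stand-in that mimics only the parts of the TidyNode schema
// used below, so the walk can be tried without the extension.
class SketchNode {
    public $name;
    private $kids;

    public function __construct($name, $kids = array()) {
        $this->name = $name;
        $this->kids = $kids;
    }

    public function has_children() { return count($this->kids) > 0; }
    public function children()     { return $this->kids; }
}

// Depth-first walk: one line per node, indented two spaces per level.
function sketch_dump_tree($node, $depth = 0) {
    $out = str_repeat('  ', $depth) . $node->name . "\n";
    if ($node->has_children()) {
        foreach ($node->children() as $child) {
            $out .= sketch_dump_tree($child, $depth + 1);
        }
    }
    return $out;
}

// Small hand-built tree standing in for a parsed document.
$tree = new SketchNode('html', array(
    new SketchNode('head', array(new SketchNode('title'))),
    new SketchNode('body'),
));

echo sketch_dump_tree($tree);
```

For the hand-built tree this prints html, then head and body indented one level,
with title nested one level further under head.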

E-mail thoughts, suggestions, patches, etc. to <john@php.net>