RFC: HTML::Encoding

From: Bjoern Hoehrmann
Date: July 28, 2001 18:46
Subject: RFC: HTML::Encoding
Message ID: gjo6mtofr5nhlbmte8679qm2m8ud9q0dfg@4ax.com

Hi,

   I've written a module that collects encoding information for
(X)HTML files. (X)HTML files may carry encoding information in

  1. the higher-level protocol (e.g. the charset parameter of the
     Content-Type header in HTTP and MIME)

  2. the XML declaration (for XHTML documents)

  3. the byte order mark at the beginning of the file

  4. meta elements like 

      <meta http-equiv='Content-Type'
            content='text/html;charset=iso-8859-1'>
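
As a rough illustration of how two of these sources can be read, here is a
minimal sketch. It is not the module's code: sniff_bom and charset_param
are hypothetical helper names, and the regular expression is a
simplification that ignores quoted-string escapes.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of HTML::Encoding: map a leading byte
# order mark to an encoding name. Longer marks are tested first because
# the UTF-32LE mark begins with the same two bytes as the UTF-16LE mark.
sub sniff_bom {
    my ($bytes) = @_;
    my @marks = (
        [ "\x00\x00\xFE\xFF", 'UTF-32BE' ],
        [ "\xFF\xFE\x00\x00", 'UTF-32LE' ],
        [ "\xEF\xBB\xBF",     'UTF-8'    ],
        [ "\xFE\xFF",         'UTF-16BE' ],
        [ "\xFF\xFE",         'UTF-16LE' ],
    );
    for my $mark (@marks) {
        my ($bom, $name) = @$mark;
        return $name if substr($bytes, 0, length $bom) eq $bom;
    }
    return;    # no byte order mark found
}

# Hypothetical helper: pull the charset parameter out of a Content-Type
# value, as carried by an HTTP header or by a meta element.
sub charset_param {
    my ($content_type) = @_;
    return $content_type =~ /;\s*charset\s*=\s*["']?([^"'\s;]+)/i
        ? lc $1
        : undef;
}

print sniff_bom("\xFE\xFF\x00<\x00h"), "\n";
print charset_param('text/html;charset=iso-8859-1'), "\n";
```

Both helpers return undef when nothing is declared, so a caller can simply
skip sources that carry no information.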

At the user's option, the module tries to extract the explicitly given
encoding information from each of these sources. It then sorts the
results according to the order above. In list context it returns the
whole list; in scalar context it returns the first encoding in the list
(i.e. the encoding the user agent must use to parse the document).
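
The sort-then-return behaviour described above could be sketched roughly
as follows. This is an illustration under assumptions, not the module's
implementation: pick_encoding, the origin key and the %rank table are
hypothetical, and the rank numbers simply follow the numbered list rather
than the module's own constants.

```perl
use strict;
use warnings;

# Hypothetical priority ranks following the numbered list above
# (1 = higher-level protocol ... 4 = meta element).
my %rank = (header => 1, xmldecl => 2, bom => 3, meta => 4);

# Takes hash refs of the form { origin => ..., encoding => ... } and
# returns all of them, sorted, in list context, or only the winning
# encoding name in scalar context.
sub pick_encoding {
    my @found = sort { $rank{ $a->{origin} } <=> $rank{ $b->{origin} } } @_;
    return wantarray ? @found : ($found[0] ? $found[0]{encoding} : undef);
}

my @declared = (
    { origin => 'meta',   encoding => 'iso-8859-1' },
    { origin => 'header', encoding => 'utf-8' },
);

my $first = pick_encoding(@declared);    # scalar context
my @all   = pick_encoding(@declared);    # list context

print "$first\n";
print join(', ', map { "$_->{origin}=$_->{encoding}" } @all), "\n";
```

wantarray is what lets a single function serve both calling styles.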

This looks like

  #!perl -w
  use strict;
  use warnings;
  use LWP::UserAgent;
  use HTML::Encoding '';
  
  my $r = LWP::UserAgent->new->request(
    HTTP::Request->new(GET => 'http://www.w3.org/'));
  
  print scalar HTML::Encoding::get_encoding(
    check_bom     => 1,
    check_xmldecl => 1,
    check_meta    => 1,
    headers       => $r->headers,
    string        => $r->content
  );

This currently prints 'us-ascii', because http://www.w3.org/ returns
Content-Type: text/html;charset=us-ascii. In list context it would
return

  [
    { source => 4, encoding => 'us-ascii' },
    { source => 1, encoding => 'us-ascii' },
  ]

since the page also has a meta element

  <meta http-equiv='Content-Type'
        content='text/html;charset=us-ascii' />

The POD says:

[...]
  The source value is mapped to one of the constants FROM_META,
  FROM_BOM, FROM_XMLDECL and FROM_HEADER. You can import these
  constants solely into your namespace or using the ":constants"
  symbol, e.g.

    use HTML::Encoding ':constants';
[...]

This is useful if you want to check whether the declared encodings
disagree.
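
Such a mismatch check might look roughly like the sketch below. It works
on sample data in the shape shown earlier instead of calling the module;
distinct_encodings is a hypothetical helper, and the upper-case spelling
in the sample is only there to show the case-insensitive comparison.

```perl
use strict;
use warnings;

# Hypothetical helper: return the distinct encoding names found in a
# list of { source => ..., encoding => ... } hash refs. Names are
# compared case-insensitively.
sub distinct_encodings {
    my %seen;
    $seen{ lc $_->{encoding} }++ for @_;
    return sort keys %seen;
}

# Sample data; in real use this would be the list-context result.
my @found = (
    { source => 4, encoding => 'us-ascii' },
    { source => 1, encoding => 'US-ASCII' },
);

my @names = distinct_encodings(@found);
if (@names > 1) {
    print "mismatch: @names\n";
}
else {
    print "consistent: @names\n";
}
```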

Some issues that came to my mind while writing this module:

  * HTTP::Headers should indicate whether LWP::UserAgent has
    already parsed the header section of the HTML file, so that
    I wouldn't need to do the same thing again. Currently one
    cannot tell whether multiple Content-Type: headers were
    present in the original response or whether they come from
    meta elements.

  * HTML::Encoding currently uses HTML::Parser to extract the
    meta element if version 3.21 or later is available (maybe
    I'll switch to HTML::HeadParser ...). The problem is that
    HTML::Parser is, AFAIK, currently unable to process documents
    in an encoding that is not compatible with US-ASCII
    (UTF-16BE, for example).

    I think it is out of scope of HTML::Encoding to recode the
    given string to some US-ASCII compatible encoding (that'd
    be UTF-8) in order to parse the document; this should be
    done by HTML::Parser using some encoding parameter.

    Personally I'd say that HTML::Parser should only output
    UTF-8 encoded characters as XML::Parser does, but this
    will certainly clash with current users who expect to get
    ISO-8859-1 or something like that out of it...

    Is it likely that HTML::Parser incorporates such a feature
    using the Unicode::* modules or Text::Iconv or whatever is
    currently available?

  * Is there currently really no module that does what
    HTML::Encoding is supposed to do? In general you have to
    use such a module every time you try to do anything with an
    HTML document; hmm, maybe western people got too used to
    ISO-8859-1...
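
On the UTF-16BE point above: a caller could recode the document to UTF-8
before handing it to an ASCII-oriented parser. The sketch below assumes a
Perl that ships the Encode module (5.8 and later, so newer than the 5.6.0
this module targets); the sample document is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# A tiny UTF-16BE document without a byte order mark: "<p>ä</p>".
my $utf16be = "\x00<\x00p\x00>\x00\xE4\x00<\x00/\x00p\x00>";

# Decode the raw bytes to Perl characters, then encode them as UTF-8,
# which is compatible with US-ASCII for all markup characters.
my $text = decode('UTF-16BE', $utf16be);
my $utf8 = encode('UTF-8', $text);

print length($text), " characters, ", length($utf8), " bytes as UTF-8\n";
```

After this step the markup bytes are plain ASCII, so a meta element can be
located by a parser that knows nothing about UTF-16.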

The current version can be found at

  http://www.websitedev.de/perl/HTML-Encoding-0.01.tar.gz

You currently need Perl 5.6.0 to use it. The file still lacks a proper
README and test files...

Is the module name appropriate? Any other comments or suggestions? I
greatly appreciate them :-)

Thanks for your time,
-- 
Björn Höhrmann { mailto:bjoern@hoehrmann.de } http://www.bjoernsworld.de
am Badedeich 7 } Telefon: +49(0)4667/981028 { http://bjoern.hoehrmann.de
25899 Dagebüll { PGP Pub. KeyID: 0xA4357E78 } http://www.learn.to/quote/


