RFC: HTML::Encoding
From: Bjoern Hoehrmann
Date: July 28, 2001 18:46
Subject: RFC: HTML::Encoding
Message ID: gjo6mtofr5nhlbmte8679qm2m8ud9q0dfg@4ax.com
Hi,
I've written a module that collects encoding information for
(X)HTML files. (X)HTML files may carry encoding information in
  1. the higher-level protocol (e.g. the charset parameter of the
     Content-Type header in HTTP and MIME)
  2. the XML declaration (for XHTML documents)
  3. the byte order mark at the beginning of the file
  4. meta elements like
       <meta http-equiv='Content-Type'
             content='text/html;charset=iso-8859-1'>
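To make two of these sources concrete, here is a minimal sketch of how a byte order mark and a charset parameter could be recognised. The helper names bom_encoding and charset_param are hypothetical and are not part of HTML::Encoding; the BOM byte sequences are the standard Unicode ones.

```perl
use strict;
use warnings;

# Hypothetical helper: map a leading byte order mark to an encoding name.
# Longer signatures are tested first so that UTF-32LE (FF FE 00 00) is
# not mistaken for UTF-16LE (FF FE).
sub bom_encoding {
    my ($bytes) = @_;
    return 'UTF-32BE' if $bytes =~ /^\x00\x00\xFE\xFF/;
    return 'UTF-32LE' if $bytes =~ /^\xFF\xFE\x00\x00/;
    return 'UTF-8'    if $bytes =~ /^\xEF\xBB\xBF/;
    return 'UTF-16BE' if $bytes =~ /^\xFE\xFF/;
    return 'UTF-16LE' if $bytes =~ /^\xFF\xFE/;
    return;    # no byte order mark found
}

# Hypothetical helper: pull the charset parameter out of a Content-Type
# value, whether it comes from an HTTP header or a meta element.
sub charset_param {
    my ($value) = @_;
    return lc $1 if $value =~ /;\s*charset\s*=\s*"?([^\s;"]+)/i;
    return;    # no charset parameter present
}
```

Both helpers return nothing when no declaration is found, so the caller can tell "not declared" apart from any real encoding name.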
If the user asks for it, the module tries to extract the explicitly
declared encoding from each of these sources. It then sorts the
results in the order given above. In list context it returns the
whole list; in scalar context it returns the first encoding in the list
(i.e. the encoding the user agent must use to parse the document).
Usage looks like this:
    #!perl -w
    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTTP::Request;
    use HTML::Encoding ();    # import nothing; the function is called by its full name
    # Fetch the page, then ask for the encoding in scalar context.
    my $r = LWP::UserAgent->new->request(
        HTTP::Request->new(GET => 'http://www.w3.org/'));
    print scalar HTML::Encoding::get_encoding(
        check_bom     => 1,
        check_xmldecl => 1,
        check_meta    => 1,
        headers       => $r->headers,
        string        => $r->content,
    );
This currently prints 'us-ascii', since http://www.w3.org/ returns
Content-Type: text/html;charset=us-ascii. In list context it would
return
    [
      { source => 4, encoding => 'us-ascii' },
      { source => 1, encoding => 'us-ascii' },
    ]
since the page also has a meta element
    <meta http-equiv='Content-Type'
          content='text/html;charset=us-ascii' />
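The context-dependent return value described above can be sketched with Perl's built-in wantarray. This is only an illustration under stated assumptions: pick_encoding and the string source labels are hypothetical, and the ranks follow the numbered list of sources in this message rather than the numeric values of the module's constants.

```perl
use strict;
use warnings;

# Assumed rank of each source, following the numbered list in the post
# (protocol header first, meta element last). The labels are
# illustrative and are not the module's real constants.
my %rank = (header => 1, xmldecl => 2, bom => 3, meta => 4);

# Hypothetical stand-in for the module's return logic.
sub pick_encoding {
    my @sorted = sort { $rank{ $a->{source} } <=> $rank{ $b->{source} } } @_;
    return @sorted if wantarray;                      # list context: every entry
    return @sorted ? $sorted[0]{encoding} : undef;    # scalar context: winner only
}

my @found = (
    { source => 'meta',   encoding => 'iso-8859-1' },
    { source => 'header', encoding => 'us-ascii' },
);

my @all   = pick_encoding(@found);    # two hash refs, header entry first
my $first = pick_encoding(@found);    # 'us-ascii'
print "$first\n";
```

Returning the full list keeps lower-priority declarations available to callers that want to inspect them, while scalar callers get a single answer.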
The POD says:
[...]
The source value is mapped to one of the constants FROM_META,
FROM_BOM, FROM_XMLDECL and FROM_HEADER. You can import these
constants individually into your namespace or all together using
the ":constants" symbol, e.g.
    use HTML::Encoding ':constants';
[...]
This is useful if you want to check whether there is a mismatch between
the declared encodings.
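A mismatch check over the list-context result might look like the following sketch. It assumes only the hash layout shown in the example output (source and encoding keys); declared_encodings is a hypothetical helper, and it compares names case-insensitively without resolving charset aliases.

```perl
use strict;
use warnings;

# Hypothetical helper: return the distinct encoding names found in a
# list of { source => ..., encoding => ... } hash refs, ignoring case.
sub declared_encodings {
    my %seen;
    return grep { !$seen{$_}++ } map { lc $_->{encoding} } @_;
}

my @list = (
    { source => 4, encoding => 'us-ascii' },
    { source => 1, encoding => 'US-ASCII' },
);

my @distinct = declared_encodings(@list);
warn "conflicting declarations: @distinct\n" if @distinct > 1;
```

With the two entries above the names agree once case is ignored, so no warning is issued.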
Some issues that came to mind while writing this module:
  * HTTP::Headers should provide some indication of whether
    LWP::UserAgent has already parsed the head section of the
    HTML file, so that I wouldn't need to do the same thing again.
    Currently one cannot tell whether multiple Content-Type
    headers were present in the original response or whether
    they come from meta elements.
  * HTML::Encoding currently uses HTML::Parser to extract the
    meta element if version 3.21 or later is available (maybe
    I'll switch to HTML::HeadParser ...). The problem is that
    HTML::Parser is, AFAIK, currently unable to process documents
    in an encoding that is not compatible with US-ASCII
    (UTF-16BE, for example).
    I think it is out of scope for HTML::Encoding to recode the
    given string to some US-ASCII compatible encoding (that'd
    be UTF-8) in order to parse the document; this should be
    done by HTML::Parser using some encoding parameter.
    Personally I'd say that HTML::Parser should only output
    UTF-8 encoded characters, as XML::Parser does, but this
    will certainly clash with current users who expect to get
    ISO-8859-1 or something like that out of it...
    Is it likely that HTML::Parser will incorporate such a feature
    using the Unicode::* modules or Text::Iconv or whatever is
    currently available?
  * Is there currently really no module that does what
    HTML::Encoding is supposed to do? In general you have to
    use such a module every time you try to do anything with an
    HTML document; hmm, maybe western people got too used to
    ISO-8859-1...
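The recoding step raised in the second issue above can be sketched as follows. This uses the Encode module, which is bundled with Perl 5.8 and later and is therefore not one of the options named in this message; it is shown only to illustrate the idea. The helper name to_utf8_bytes is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: recode a byte string from a known source encoding
# to UTF-8 so that an ASCII-oriented parser can scan it for markup.
sub to_utf8_bytes {
    my ($bytes, $from) = @_;
    my $chars = decode($from, $bytes);    # bytes -> Perl characters
    return encode('UTF-8', $chars);       # characters -> UTF-8 bytes
}

# "<p>" followed by e-acute, encoded as UTF-16BE without a byte order mark.
my $utf16 = "\x00<\x00p\x00>\x00\xE9";
my $utf8  = to_utf8_bytes($utf16, 'UTF-16BE');
```

After the call, the markup characters occupy single ASCII-compatible bytes, which is the property a byte-oriented parser needs in order to find the meta element.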
The current version can be found at
http://www.websitedev.de/perl/HTML-Encoding-0.01.tar.gz
You'll currently need Perl 5.6.0 to use it. The file currently lacks
a proper README and test files...
Is the module name appropriate? Any other comments or suggestions? I
greatly appreciate them :-)
Thanks for your time,
--
Björn Höhrmann { mailto:bjoern@hoehrmann.de } http://www.bjoernsworld.de
am Badedeich 7 } Telefon: +49(0)4667/981028 { http://bjoern.hoehrmann.de
25899 Dagebüll { PGP Pub. KeyID: 0xA4357E78 } http://www.learn.to/quote/