Karl,
Thanks a lot for this tip. That's exactly what I needed.
The key phrase seems to be (for me): "If your document uses transitional
markup, make sure your DOCTYPE reflects that fact and does not have a URI".
Does that mean I can/should use the transitional doctype declaration:
<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
... but I have to delete the
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" part?
I tried, but the images still have white gaps under them.
Can you tell me the exact doctype I need?
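If I've understood the tip correctly, it would mean a declaration like the
following, with the system-identifier URI simply left off (this is only my
reading of the advice, not something I've confirmed):

```html
<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN">
<html xmlns="http://www.w3.org/1999/xhtml">
```

That's what I tried, so I may be missing something else.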
thanks in advance,
frank
Message: 3
Subject: Re: [Tidy-dev] White lines in Netcape 6
From: Karl Ove Hufthammer <***@bigfoot.com>
To: tidy-***@lists.sourceforge.net
Date: Fri, 22 Mar 2002 16:27:08 +0100
Frank Visser wrote:
This may be a known problem, but I just stumbled on it.
I have tidied a couple of sites which display well in IE
5.x and Netscape 4.x, but in Netscape 6 there are white
horizontal lines around images.
Please see:
<URL: http://developer.netscape.com/evangelism/docs/articles/img-table/ >
--
Karl Ove Hufthammer
--__--__--
Message: 4
Date: Fri, 22 Mar 2002 11:14:10 -0500
To: ***@interaccess.com
From: Charles Reitzel <***@rcn.com>
Subject: Re: [Tidy-dev] Tidy for html-xml parser and embedded C++.
Cc: tidy-***@lists.sourceforge.net
Hi Thaddeus,
First, if I read you right, you are asking for a library version of HTML
Tidy. A couple of folks have forged ahead with Tidy libraries. See
http://www.lemburg.com/files/python/mxTidy.html and
http://www.dysfunctionals.org/~lee/TidyCPP.zip, also
http://perso.wanadoo.fr/ablavier/TidyCOM/. Any of these will lag somewhat
behind the current version. For example, I have used TidyCOM with VB
successfully to do bulk Word-to-HTML conversions.
Otherwise, what you want to do _is_ doable, just not easily in C. In a
shell script or, better, Perl, it is not a problem. Simply use existing
Tidy options to send all the errors to a file, output to another file
and, perhaps, various informational messages to the standard output. The
non-informational messages (warnings and errors) are easily parsed with a
regular expression or even C strstr().
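As a sketch of that parsing step (assuming Tidy's usual
"line N column M - Warning/Error: ..." diagnostic format; check your
version's actual output), something like this would do:

```c
#include <stdio.h>
#include <string.h>

/* Parse one Tidy diagnostic line of the (assumed) usual form
 *   "line 12 column 5 - Warning: <img> lacks "alt" attribute"
 * On success, fills in line/col/is_error and returns 1;
 * returns 0 for informational lines that carry no position. */
static int parse_tidy_message(const char *msg, int *line, int *col, int *is_error)
{
    if (sscanf(msg, "line %d column %d", line, col) != 2)
        return 0;                 /* no "line N column M" prefix */
    if (strstr(msg, "- Error:") != NULL)
        *is_error = 1;
    else if (strstr(msg, "- Warning:") != NULL)
        *is_error = 0;
    else
        return 0;                 /* positioned but neither class */
    return 1;
}
```

Reading the error file line by line and running each line through this
gives you (line, column, severity) triples to act on.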
Also, if you are _documenting_ C code, you might try placing the C source
within either the <pre> or <code> tags.
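One caveat if you go that route: the C source needs its markup-significant
characters escaped first, or constructs like vector<T> will be read as
tags. A minimal escaping helper, just as a sketch:

```c
#include <string.h>

/* Copy src into dst, replacing the three characters that are
 * significant in HTML ('<', '>', '&') with entities, so C source
 * such as vector<T> can sit safely inside <pre>...</pre>.
 * dst_size must allow for expansion (worst case 5x, for "&amp;"). */
static void html_escape(const char *src, char *dst, size_t dst_size)
{
    size_t used = 0;
    for (; *src != '\0' && used + 6 < dst_size; ++src) {
        const char *rep = NULL;
        switch (*src) {
        case '<': rep = "&lt;";  break;
        case '>': rep = "&gt;";  break;
        case '&': rep = "&amp;"; break;
        default:
            dst[used++] = *src;   /* ordinary character: copy as-is */
            continue;
        }
        memcpy(dst + used, rep, strlen(rep));
        used += strlen(rep);
    }
    dst[used] = '\0';
}
```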
Hope this helps; send along any follow-up questions you may have.
thanks,
Charlie
Frank Visser wrote:
... be better off sending it to this group.
One addition to what I've written: I'm doing this on Linux.
First:
Someone suggested that I send my query to this mailing list.
I haven't been able to find any way to subscribe to this mailing list,
so please either send me the answer directly or show me how
to subscribe.
My problem:
I've written software which crawls through web pages, i.e. given a
web page, I find all the links (and all the images) on that page.
(The purpose of this is that I get a lot of manuals, books, etc. as
tar-gzipped sets of HTML documents [e.g. the Python documentation].
I then install these on my local web server [accessible only from my
LAN, of which I am the only user]. I download stuff faster than I can
add links, so the crawler finds all the files and adds links to them;
I try to be top-down and make a best guess at which pages are the
index pages.) Then I find all the links on the links, etc.
The main problem I have is parsing the page to find the links.
At first I tried using regular expressions, and it mostly worked, but:
1) It was fragile, and there seemed to be a growing list of exceptions
to the rules.
2) It was slow.
So then I used expat to parse the files. That was fine for the XML
files, but didn't work for the HTML files (of course). The solution
was: if expat choked on a file, run
    tidy -asxhtml -m $filename
on it and try again. Unfortunately, tidy chokes on some of the files.
Very few, though, so it looks worthwhile to handle them on a
case-by-case basis. The biggest offenders seem to be web pages that
contain embedded C++. For example: vector<T>. Tidy interprets this
as a tag <T>.
1) Instead of calling tidy via a system call, I would like to take the
tidy source, remove main(), and write a
    char *tidy(char *buffer, char *error);
where buffer is the file to be parsed, error is a buffer that receives
the error messages, and tidy returns an XHTML version of the buffer.
2) If this tidy function encounters an error, I would like some way of
being told at which character in the buffer the error first occurs, so
that I could do something like:
    int char_pos;
    memcpy(tidy_buffer, original_buffer, sizeof(file));
    tidy(tidy_buffer, error);
    while ((char_pos = error_is_bad_tag()) != 0)
    {
        fix_tag_at_pos(&original_buffer, char_pos);
        memcpy(tidy_buffer, original_buffer, sizeof(file));
        tidy(tidy_buffer, error);
    }
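To illustrate what an error_is_bad_tag()-style helper (hypothetical, like
the other helpers above) might look for, here is one crude heuristic:
treat a <X> whose "tag name" is a single capital letter as C++ template
syntax rather than markup. Real code would need more context than this.

```c
#include <stddef.h>
#include <ctype.h>

/* Scan the buffer for a "<X>" whose name is a single capital letter
 * (vector<T>, map<K>, ...), which is almost certainly C++ template
 * syntax rather than an HTML tag. Returns the character position of
 * the '<', or -1 if nothing suspect is found. */
static long find_suspect_tag(const char *buf)
{
    const char *p;
    for (p = buf; *p != '\0'; ++p) {
        if (p[0] == '<' && isupper((unsigned char)p[1]) && p[2] == '>')
            return (long)(p - buf);
    }
    return -1;
}
```

A fix-up pass could then escape that '<' as &lt; before re-running tidy.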
--__--__--
Message: 5
Date: Fri, 22 Mar 2002 17:33:23 +0100
To: ***@interaccess.com
From: Lee Goddard <***@LeeGoddard.com>
Subject: Re: [Tidy-dev] Tidy for html-xml parser and embedded C++.
Frank Visser wrote:
The main problem I have is parsing the page to find the links.
Your best bet is to use Perl; as it was designed for this, there are
modules for exactly this job. For example:
NAME
HTML::LinkExtor - Extract links from an HTML document
SYNOPSIS
require HTML::LinkExtor;
$p = HTML::LinkExtor->new(\&cb, "http://www.perl.org/");
sub cb {
    my($tag, %links) = @_;
    print "$tag @{[%links]}\n";
}
$p->parse_file("index.html");
DESCRIPTION
*HTML::LinkExtor* is an HTML parser that extracts links from an HTML
document. The *HTML::LinkExtor* is a subclass of *HTML::Parser*. This
means that the document should be given to the parser by calling the
$p->parse() or $p->parse_file() methods.
$p = HTML::LinkExtor->new([$callback[, $base]])
The constructor takes two optional arguments. The first is a
reference to a callback routine. It will be called as links are
found. If a callback is not provided, then links are just
accumulated internally and can be retrieved by calling the
$p->links() method.
The $base argument is an optional base URL used to absolutize all
URLs found. You need to have the *URI* module installed if you
provide $base.
The callback is called with the lowercase tag name as first
argument, and then all link attributes as separate key/value pairs.
All non-link attributes are removed.
$p->links
Returns a list of all links found in the document. The returned
values will be anonymous arrays with the following elements:
    [$tag, $attr => $url1, $attr2 => $url2,...]
The $p->links method will also truncate the internal link list. This
means that if the method is called twice without any parsing between
them, the second call will return an empty list.
Also note that $p->links will always be empty if a callback routine
was provided when the *HTML::LinkExtor* was created.
EXAMPLE
This is an example showing how you can extract links from a document
received using LWP:
use LWP::UserAgent;
use HTML::LinkExtor;
use URI::URL;
$url = "http://www.perl.org/"; # for instance
$ua = LWP::UserAgent->new;
# Set up a callback that collect image links
my @imgs = ();
sub callback {
    my($tag, %attr) = @_;
    return if $tag ne 'img';  # we only look closer at <img ...>
    push(@imgs, values %attr);
}
# Make the parser. Unfortunately, we don't know the base yet
# (it might be different from $url)
$p = HTML::LinkExtor->new(\&callback);
# Request document and parse it as it arrives
$res = $ua->request(HTTP::Request->new(GET => $url),
sub {$p->parse($_[0])});
# Expand all image URLs to absolute ones
my $base = $res->base;
@imgs = map { $_ = url($_, $base)->abs; } @imgs;
# Print them out
print join("\n", @imgs), "\n";
SEE ALSO
the HTML::Parser manpage, the HTML::Tagset manpage, the LWP manpage,
the URI::URL manpage
COPYRIGHT
Copyright 1996-2001 Gisle Aas.
This library is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.
hth
lee
--__--__--
_______________________________________________
Tidy-develop mailing list
Tidy-***@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tidy-develop
End of Tidy-develop Digest