NAME
    HTML::AsText::Fix - extends HTML::Element::as_text() to render text
    properly

VERSION
    version 0.001

SYNOPSIS
    # fix individual objects my $tree =
    HTML::TreeBuilder::XPath->new_from_content($html); my $guard =
    HTML::AsText::Fix::object($tree);

    # fix deeply nested objects use URI; use Web::Scraper;

        # First, create your scraper block
        my $tweets = scraper {
            process "li.status", "tweets[]" => scraper {
                process ".entry-content", body => 'TEXT';
                process ".entry-date", when => 'TEXT';
                process 'a[rel="bookmark"]', link => '@href';
            };
        };

        my $res;
        {
            my $guard = HTML::AsText::Fix::global();
            $res = $tweets->scrape( URI->new("http://twitter.com/creaktive") );
        }

DESCRIPTION
    Consider the following HTML sample:

        <p>
            <span>AAA</span>
            BBB
        </p>
        <h2>CCC</h2>
        DDD
        <br>
        EEE

    "HTML::Element::as_text()" method stringifies it as *AAABBBCCCDDDEEE*.
    Despite being correct, this is far from the actual renderization within
    a "real" browser. links(1), lynx(1) & w3m(1) break lines this way:

        AAABBB
        CCC
        DDD
        EEE

    This module tries to implement the same behavior in the method "as_text"
    in HTML::Element. By default, $/ value is inserted in place of line
    breaks, and "\x{200b}" (Unicode zero-width space) separates text from
    adjacent inline elements.

  Distinction between block/inline nodes
    "span", for instance, is an inline node:

        <p><span>A</span>pple</p>

    In that case, there really shouldn't be a space between "A" and "pple".
    To handle inline nodes properly, only block nodes are separated by line
    break. Following nodes are currently assumed being blocks:

    *   p

    *   h1 h2 h3 h4 h5 h6

    *   dl dt dd

    *   ol ul li

    *   dir

    *   address

    *   blockquote

    *   center

    *   del

    *   div

    *   hr

    *   ins

    *   noscript script

    *   pre

    *   br (just to make sense)

    (source: <http://en.wikipedia.org/wiki/HTML_element#Block_elements>)

FUNCTIONS
  as_text
    The replacement function. Not to be used separately. It is injected
    inside HTML::Element.

  global
    Hook into every HTML::Element within the lexical scope. Returns the
    guard object, destroying it will unhook safely.

    Accepts following options:

    *   lf_char: character inserted between block nodes (by default, $/);

    *   zwsp_char: character inserted between inline nodes (by default,
        "\x{200b}", Unicode zero-width space);

    *   trim: trim heading/trailing spaces (considers "\x{A0}" as space!);

    *   extra_chars: extra characters to trim;

    *   skip_dels: if true, then text content under "del" nodes is not
        included in what's returned.

    For example, to completely get rid of separation between inline nodes:

        my $guard = HTML::AsText::Fix::global(zwsp_char => '');

  object
    Hook object instance. Accepts the same options as "global":

        my $guard = HTML::AsText::Fix::object($tree, zwsp_char => '');

SEE ALSO
    *   HTML::Element

    *   HTML::Tree

    *   HTML::FormatText

    *   Monkey::Patch

ACKNOWLEDGEMENTS
    *   Αριστοτέλης Παγκαλτζής <https://metacpan.org/author/ARISTOTLE>

    *   Toby Inkster <https://metacpan.org/author/TOBYINK>

AUTHOR
    Stanislaw Pusep <stas@sysd.org>

COPYRIGHT AND LICENSE
    This software is copyright (c) 2012 by Stanislaw Pusep.

    This is free software; you can redistribute it and/or modify it under
    the same terms as the Perl 5 programming language system itself.