Huge tags can lead to a failure of preg_replace, thus erasing the whole
fetched content.
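A minimal sketch of the kind of guard this implies, not the actual code of this change. `safeReplace` is a hypothetical helper name; the only library fact relied on is that `preg_replace` returns `null` on error:

```php
<?php
// Hypothetical helper (assumption, not part of the library):
// keep the original content when preg_replace fails.
function safeReplace(string $pattern, string $replacement, string $html): string
{
    $result = preg_replace($pattern, $replacement, $html);

    // preg_replace returns null on error (e.g. backtrack limit hit on a
    // huge tag, or invalid UTF-8 with the /u modifier).
    return null === $result ? $html : $result;
}
```

Falling back to the input means a failed replacement leaves the content uncleaned rather than empty.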
Fixes https://github.com/wallabag/wallabag/issues/5847
Signed-off-by: Kevin Decherf <kevin@kdecherf.com>
Readability previously removed (or rather tried to remove, see next
section) invisible nodes using a pattern from `unlikelyCandidates`. This
was quite hacky and was dropped during a backport of logic from
mozilla/readability. These nodes still need to be removed, so this change
brings the feature back. We still use a pattern, but match it specifically
against the style attribute. We also remove nodes with the `hidden` attribute.
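Roughly, the idea can be sketched as follows. The function name and the exact style pattern are assumptions made for illustration; the pattern used by the library may differ:

```php
<?php
// Hypothetical sketch: drop nodes hidden via the `hidden` attribute or an
// inline style. The regex below is an assumed pattern, not the library's.
function removeInvisibleNodes(DOMDocument $dom): void
{
    $xpath = new DOMXPath($dom);

    // Only elements carrying one of the two attributes are candidates.
    $candidates = iterator_to_array($xpath->query('//*[@hidden or @style]'));

    foreach ($candidates as $node) {
        if (!$node instanceof DOMElement || null === $node->parentNode) {
            continue;
        }

        $hiddenByStyle = 1 === preg_match(
            '/display\s*:\s*none|visibility\s*:\s*hidden/i',
            $node->getAttribute('style')
        );

        if ($node->hasAttribute('hidden') || $hiddenByStyle) {
            $node->parentNode->removeChild($node);
        }
    }
}
```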
The clean feature of tidy actually replaces inline style attributes
with CSS classes, which prevents Readability from detecting invisible
nodes, see https://github.com/htacg/tidy-html5/blob/5.6.0/src/clean.c#L1488
We therefore set the clean configuration option to false.
Signed-off-by: Kevin Decherf <kevin@kdecherf.com>
This change backports several things from mozilla/readability:
- Add child score to all ancestors instead of the first parent only
- Check 5 top candidates and try to find alternative candidates within
ancestors; this can help to find a better parent and grab more content
- Reduce patterns from `unlikelyCandidates` to the ones used by Mozilla
as ours tend to remove useful nodes
- Score headers (h2 to h6) by default in addition to div, p, td and
section
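As an illustration of the first item, here is a sketch of propagating a child score to its ancestors. The dividers (1 for the parent, 2 for the grandparent, `level * 3` beyond) follow Mozilla's Readability.js; whether this port uses the same values and the same depth limit is an assumption, and `propagateScore` is a hypothetical name:

```php
<?php
// Hypothetical sketch: add a child's score to several ancestors,
// with a smaller share the further up the tree we go.
function propagateScore(DOMNode $node, float $score, SplObjectStorage $scores, int $maxDepth = 5): void
{
    $level = 0;

    for ($a = $node->parentNode; $a instanceof DOMElement && $level < $maxDepth; $a = $a->parentNode) {
        // Parent gets the full score, grandparent half, then score / (level * 3).
        $divider = 0 === $level ? 1 : (1 === $level ? 2 : $level * 3);

        $current = $scores->offsetExists($a) ? $scores[$a] : 0.0;
        $scores[$a] = $current + $score / $divider;

        ++$level;
    }
}
```

With the previous behaviour only the first parent would receive the score, so a good container two or three levels up could never win against a small inner node.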
Signed-off-by: Kevin Decherf <kevin@kdecherf.com>
> Method "Psr\Log\LoggerAwareInterface::setLogger()" might add "void" as a native return type declaration in the future. Do the same in implementation "Readability\Readability" now to avoid errors or add an explicit @return annotation to suppress this message.
HTML 4.01 Strict only allows block-level elements within noscript, form and
blockquote. The `enclose-block-text` option fixes the instances when those
elements contain inline elements or text by wrapping the children in paragraphs.
HTML 5 has a looser content model and allows noscript elements basically anywhere,
including paragraphs, making the noscript elements inherit the parent element’s
content model. This means that tidy will produce invalid HTML with nested paragraphs
for `p > noscript > text`, a structure that would be invalid on two counts
in the HTML 4 Strict profile but is completely valid in HTML 5.
Popular WordPress image lazy-loading code produces precisely that structure
so tidy “corrects” it to invalid code. In a proper HTML parser, the produced
code would force close the outer paragraph, making the noscript element
its sibling instead of a child. The only reason this does not break Graby’s code
for stripping the lazy-loading HTML is that libxml2 contains a bug
counteracting this:
https://gitlab.gnome.org/GNOME/libxml2/-/issues/205
Since all three elements allow flow content in HTML 5, it does not make much
sense to enable this option any more. The only issues that could occur are
producing HTML code not conforming to 4.01 Strict, which was never guaranteed
anyway, as our example shows, and having blockquotes contain text nodes not
wrapped in paragraphs, a wrapping that some ancient stylesheets might expect.
That is only a minor and easily fixable visual backwards incompatibility.
`electrolinux/php-html5lib` was quite old and incompatible with the upcoming Composer 2.0.
Switch to `masterminds/html5`, which gives the same result and is maintained.
Also:
- keep README in vendors
- use new Scrutinizer engine
- test with lower deps
- remove php-coveralls dev deps and download the phar during the CI build
A change released in tidy 5.6.0 breaks php-tidy when using
tidy_parse_string+tidy_clean_repair and wrap=0, incorrectly wrapping
every single word. Also, it seems that $tidy->value should not be used to
retrieve the repaired HTML, since it is undocumented and meant for
internal use.
We replace the call with tidy_repair_string which directly returns the
repaired string.
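A minimal sketch of the replacement call. `repairHtml` is a hypothetical wrapper, and the fallback to the unmodified input when the tidy extension is missing or the repair fails is an assumption for the example, not the library's behaviour:

```php
<?php
// Hypothetical wrapper around tidy_repair_string (assumption for
// illustration; only the tidy call itself comes from this change).
function repairHtml(string $html): string
{
    if (!function_exists('tidy_repair_string')) {
        return $html; // tidy extension not available
    }

    // Parses and repairs in one call, returning the repaired string
    // directly, so $tidy->value is never read.
    $repaired = tidy_repair_string($html, ['wrap' => 0], 'utf8');

    return false === $repaired ? $html : $repaired;
}
```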
Relates to https://github.com/htacg/tidy-html5/issues/673
Relates to https://bugs.php.net/bug.php?id=75947
Tests pass.
Signed-off-by: Kevin Decherf <kevin@kdecherf.com>
This isn't the best solution but the previous one using `@` wasn't really better.
Appending a string into a fragment might generate a warning if the string contains a bad entity.
For example `+`.
Some contents have an `infocontent` node (or something similar) that holds real content.
Using only `info` as the regex is too aggressive and removes legitimate content.
Matching the whole word `info` (or `infos`) should be a better choice.
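To illustrate the difference, a small sketch comparing the two approaches. Both patterns and the `matchesPattern` helper are simplified assumptions, not the full expressions used by the library:

```php
<?php
// Simplified patterns for illustration only (assumptions).
const LOOSE_INFO = '/info/i';          // matches anywhere in the string
const STRICT_INFO = '/\binfos?\b/i';   // matches only the whole word

// Hypothetical helper: does a class/id string match the pattern?
function matchesPattern(string $pattern, string $classAndId): bool
{
    return 1 === preg_match($pattern, $classAndId);
}
```

Here `matchesPattern(LOOSE_INFO, 'infocontent')` is true, so the node would be dropped, while `matchesPattern(STRICT_INFO, 'infocontent')` is false and the node is kept. A class such as `box info` still matches the strict pattern.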
When the node isn't a `div` or `p` (like when it's an `article`), we append a new node with the same id, which generates the DOM error "DOMElement::setAttribute(): ID blabla already defined".
Most of the time, they can be useful.
At worst, it'll be a link to something unrelated. But we won't lose a link inside the content.
Also, adding some extra space.