`DOMAttr::$value` must be a `string`.
Let’s add helpers for manipulating the `readability` attribute
so that we do not have to keep casting it from and to `string`
in order to appease `strict_types`.
Once we bump minimum PHP version, we will get newer PHP-CS-Fixer,
which will try to apply this cleanups.
Also manually tweak anonymous functions so that they are cleanly formatted
once we switch to `fn` syntax.
This is deprecated since PHP 8.2:
Deprecated: mb_convert_encoding(): Handling HTML entities via mbstring is deprecated; use htmlspecialchars, htmlentities, or mb_encode_numericentity/mb_decode_numericentity instead
It was used because `DOMDocument`, which uses libxml2 internally, will parse the HTML as ISO-8859-1, unless the document contains an XML encoding declaration or HTML meta tag setting character set.
Since first such element wins, putting the `meta[charset]` up front will ensure the parser uses the correct encoding, even if the document contains incorrect meta tag (e.g. when the document is converted to UTF-8 without also updating the metadata by the software passing it to Readability).
https://stackoverflow.com/a/39148511/160386
1) src/Readability.php (braces, no_unneeded_control_parentheses, single_line_comment_spacing, global_namespace_import, no_unused_imports, phpdoc_align)
2) src/JSLikeHTMLElement.php (phpdoc_separation)
Switch code blocks to Markdown syntax to work around `phpdoc_separation`, ApiGen uses Markdown these days anyway.
Huge tags can lead to a failure of preg_replace, thus erasing the whole
fetched content.
Fixes https://github.com/wallabag/wallabag/issues/5847
Signed-off-by: Kevin Decherf <kevin@kdecherf.com>
Readability was previously removing (was trying to actually, see next
section) invisible nodes using a pattern from `unlikelyCandidates`. This
was quite hacky and was removed during a backport of logics from
mozilla/readability. There is still a need to remove them so here we
are. We still use a pattern but specifically against the style
attribute. We also remove nodes with the attribute `hidden`.
The clean feature of tidy actually replaces inline style attributes
with css classes thus preventing readability to detect invisible nodes,
see https://github.com/htacg/tidy-html5/blob/5.6.0/src/clean.c#L1488
We therefore set clean configuration to false.
Signed-off-by: Kevin Decherf <kevin@kdecherf.com>
This change backports several things from mozilla/readability:
- Add child score to all ancestors instead of the first parent only
- Check 5 top candidates and try to find alternative candidates within
ancestors, this can help to find a better parent and grab more content
- Reduce patterns from `unlikelyCandidates` to the one used by Mozilla
as ours tend to remove useful nodes
- Score headers (h2 to h6) by default in addition to div, p, td and
section
Signed-off-by: Kevin Decherf <kevin@kdecherf.com>
> Method "Psr\Log\LoggerAwareInterface::setLogger()" might add "void" as a native return type declaration in the future. Do the same in implementation "Readability\Readability" now to avoid errors or add an explicit @return annotation to suppress this message.