Since we require PHP 7.4, contravariance in param types is supported,
so we do not need to worry about subclasses that widen the param type.
This is only a breaking change in the unlikely case that a subclass uses a type that
contradicts the PHPDoc type annotation and does not extend `DOMNode`.
Also fix the type annotation since some invocations pass it a `DOMText`,
an arbitrary sibling/child `DOMNode` or even `null`.
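A minimal sketch of the contravariance rule relied on above. The class and method names are hypothetical and not taken from this code base; the parent accepts `?DOMNode` and the subclass widens it to `?object`, which PHP 7.4 accepts.

```php
<?php
declare(strict_types=1);

// Hypothetical classes, for illustration only.
class BaseScorer
{
    /** @param DOMNode|null $node */
    public function describe(?DOMNode $node): string
    {
        return $node === null ? 'null' : $node->nodeName;
    }
}

class LooseScorer extends BaseScorer
{
    // Widening ?DOMNode to ?object is a contravariant change,
    // so the signature stays compatible with the parent.
    public function describe(?object $node): string
    {
        if ($node === null || $node instanceof DOMNode) {
            return parent::describe($node);
        }

        return 'other';
    }
}
```

A subclass that narrowed the type instead (for example to `DOMElement`) would be rejected by PHP at compile time.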
`DOMAttr::$value` must be a `string`.
Let’s add helpers for manipulating the `readability` attribute
so that we do not have to keep casting it from and to `string`
in order to appease `strict_types`.
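Roughly what such helpers could look like. The function names below are assumptions made for this sketch, not necessarily the ones added by the commit.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: read the `readability` attribute as a float.
function getReadabilityScore(DOMElement $element): float
{
    // getAttribute() returns '' for a missing attribute; casting '' gives 0.0.
    return (float) $element->getAttribute('readability');
}

// Hypothetical helper: store the score, casting to string in one place.
function setReadabilityScore(DOMElement $element, float $score): void
{
    $element->setAttribute('readability', (string) $score);
}
```

Keeping both casts in one place means callers only ever deal with floats.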
`DOMNodeList` implements `Traversable`.
There are some `for` loops left but we cannot simply replace those:
PHP follows the DOM specification, which requires that `NodeList`
objects in the DOM are live. As a result, any operation that removes
a node list member node from its parent (such as `removeChild`,
`replaceChild` or `appendChild`) will cause the next node
in the iterator to be skipped.
We could work around that by converting those node lists to static arrays
using `iterator_to_array`, but we are not sure it is worth it.
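For reference, a sketch of the `iterator_to_array` workaround on a toy document (the markup is made up for the example). The list is snapshotted before any node is removed, so the mutation cannot disturb the iteration.

```php
<?php
declare(strict_types=1);

$doc = new DOMDocument();
$doc->loadXML('<root><a/><b/><c/><d/></root>');
$root = $doc->documentElement;

// Copy the live DOMNodeList into a plain array first,
// then mutate the tree while iterating over the copy.
foreach (iterator_to_array($root->childNodes) as $child) {
    $root->removeChild($child);
}

$remaining = $root->childNodes->length;
```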
It would fail for e.g. `<div> <p>foo</p> </div>`.
mozilla/readability uses `children` for the tag lookup, which returns only elements.
PHP's DOM does not have a `children` property, so b580cf216d
mistakenly used `childNodes` instead, but that can return any node type.
Let’s filter the children ourselves.
Also add comments from mozilla/readability’s `_hasSingleTagInsideElement`.
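A sketch of the filtering on the failing input mentioned above. `elementChildren` is a hypothetical name used only for this example.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: emulate the DOM `children` property
 * by keeping only element nodes.
 *
 * @return DOMElement[]
 */
function elementChildren(DOMElement $node): array
{
    $elements = [];
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMElement) {
            $elements[] = $child;
        }
    }

    return $elements;
}

$doc = new DOMDocument();
$doc->loadXML('<div> <p>foo</p> </div>');
$div = $doc->documentElement;
```

Here `childNodes` also contains the two whitespace text nodes around `<p>`, while the filtered list contains only the `<p>` element.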
Once we bump the minimum PHP version, we will get a newer PHP-CS-Fixer,
which will try to apply these cleanups.
Also manually tweak anonymous functions so that they are cleanly formatted
once we switch to `fn` syntax.
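To illustrate the kind of rewrite meant here, with made-up variable names: a closure and the equivalent PHP 7.4 arrow function.

```php
<?php
declare(strict_types=1);

$factor = 3;
$scores = [1, 2, 3];

// Before: classic anonymous function with an explicit `use`.
$scaledOld = array_map(function (int $score) use ($factor): int {
    return $score * $factor;
}, $scores);

// After: arrow function, which captures $factor by value automatically.
$scaledNew = array_map(fn (int $score): int => $score * $factor, $scores);
```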
This is deprecated since PHP 8.2:
Deprecated: mb_convert_encoding(): Handling HTML entities via mbstring is deprecated; use htmlspecialchars, htmlentities, or mb_encode_numericentity/mb_decode_numericentity instead
It was used because `DOMDocument`, which uses libxml2 internally, will parse the HTML as ISO-8859-1 unless the document contains an XML encoding declaration or an HTML meta tag setting the character set.
Since the first such element wins, putting the `meta[charset]` up front will ensure the parser uses the correct encoding, even if the document contains an incorrect meta tag (e.g. when the software passing the document to Readability converted it to UTF-8 without also updating the metadata).
https://stackoverflow.com/a/39148511/160386
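A minimal sketch of the approach, assuming the input is already UTF-8. `loadUtf8Html` is a hypothetical name, and the `http-equiv` form of the meta tag is used here as an assumption; the actual change may emit a different tag.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: force libxml2 to treat the markup as UTF-8
// by prepending a meta tag, so that it is the first one the parser sees.
function loadUtf8Html(string $html): DOMDocument
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML(
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}

$doc = loadUtf8Html('<p>Příliš žluťoučký kůň</p>');
$text = $doc->getElementsByTagName('p')->item(0)->textContent;
```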
1) src/Readability.php (braces, no_unneeded_control_parentheses, single_line_comment_spacing, global_namespace_import, no_unused_imports, phpdoc_align)
2) src/JSLikeHTMLElement.php (phpdoc_separation)
Switch code blocks to Markdown syntax to work around `phpdoc_separation`; ApiGen uses Markdown these days anyway.
Huge tags can cause `preg_replace` to fail, thus erasing the whole
fetched content.
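For context, `preg_replace` returns `null` on failure, so using its result unchecked discards the input. Below is one defensive pattern, not necessarily what this commit does; `safeReplace` is a hypothetical name. The failure is provoked with invalid UTF-8 under the `u` modifier because that is deterministic; a failure caused by a huge tag returns `null` in the same way.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: keep the original subject when PCRE reports an error.
function safeReplace(string $pattern, string $replacement, string $subject): string
{
    $result = preg_replace($pattern, $replacement, $subject);

    // preg_replace() returns null on error; fall back to the input.
    return $result === null ? $subject : $result;
}

$ok = safeReplace('/<br\s*\/?>/i', "\n", 'a<br>b');
$kept = safeReplace('/a/u', '', "broken \xff input");
```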
Fixes https://github.com/wallabag/wallabag/issues/5847
Signed-off-by: Kevin Decherf <kevin@kdecherf.com>
Readability was previously removing (or rather trying to, see next
section) invisible nodes using a pattern from `unlikelyCandidates`. This
was quite hacky and was dropped during a backport of logic from
mozilla/readability. There is still a need to remove them, so here we
are. We still use a pattern, but specifically against the `style`
attribute. We also remove nodes with the `hidden` attribute.
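A sketch of the idea on a toy document. The regular expression and the function name are assumptions made for illustration; the pattern actually used may differ.

```php
<?php
declare(strict_types=1);

// Hypothetical pattern matching inline styles that hide a node.
const INVISIBLE_STYLE = '/display\s*:\s*none|visibility\s*:\s*hidden/i';

// Hypothetical helper: remove hidden nodes and return how many were removed.
function removeInvisibleNodes(DOMDocument $doc): int
{
    $removed = 0;
    $xpath = new DOMXPath($doc);

    // Snapshot the candidates before removing anything from the tree.
    foreach (iterator_to_array($xpath->query('//*[@style or @hidden]')) as $node) {
        $invisible = $node->hasAttribute('hidden')
            || preg_match(INVISIBLE_STYLE, $node->getAttribute('style')) === 1;

        if ($invisible && $node->parentNode !== null) {
            $node->parentNode->removeChild($node);
            ++$removed;
        }
    }

    return $removed;
}

$doc = new DOMDocument();
$doc->loadXML(
    '<div><p style="display: none">a</p><p hidden="">b</p><p style="color:red">c</p></div>'
);
$count = removeInvisibleNodes($doc);
```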
The `clean` feature of tidy actually replaces inline style attributes
with CSS classes, thus preventing Readability from detecting invisible nodes,
see https://github.com/htacg/tidy-html5/blob/5.6.0/src/clean.c#L1488
We therefore set clean configuration to false.
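As a sketch, the relevant setting when calling the tidy extension could look like this. Only the `clean` option comes from the text above; the wrapper function and its name are hypothetical, and it simply returns the input when the extension is not loaded.

```php
<?php
declare(strict_types=1);

// Only 'clean' => false is taken from the commit message.
$tidyConfig = [
    'clean' => false,
];

// Hypothetical wrapper around the tidy extension.
function tidyHtml(string $html, array $config): string
{
    if (!extension_loaded('tidy')) {
        return $html;
    }

    $tidy = tidy_parse_string($html, $config, 'utf8');
    if ($tidy === false) {
        return $html;
    }
    $tidy->cleanRepair();

    return tidy_get_output($tidy);
}
```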
Signed-off-by: Kevin Decherf <kevin@kdecherf.com>
This change backports several things from mozilla/readability:
- Add child score to all ancestors instead of the first parent only
- Check 5 top candidates and try to find alternative candidates within
  ancestors, this can help to find a better parent and grab more content
- Reduce patterns from `unlikelyCandidates` to the ones used by Mozilla
  as ours tend to remove useful nodes
- Score headers (h2 to h6) by default in addition to div, p, td and
  section
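
A rough sketch of the first item, propagating a child score to several ancestors. The dividers (1 for the parent, 2 for the grandparent, `level * 3` beyond that) and the depth limit of 5 are assumptions based on how mozilla/readability is understood to weight ancestors; the function name and the return shape are hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: compute the share of a child's score
 * that each ancestor would receive.
 *
 * @return array<string, float> share keyed by ancestor tag name (illustrative only)
 */
function ancestorScores(DOMElement $node, float $contentScore, int $maxDepth = 5): array
{
    $scores = [];
    $level = 0;

    for ($a = $node->parentNode; $a instanceof DOMElement && $level < $maxDepth; $a = $a->parentNode) {
        // Assumed dividers: parent gets the full score, grandparent half,
        // further ancestors progressively less.
        $divider = $level === 0 ? 1 : ($level === 1 ? 2 : $level * 3);
        $scores[$a->tagName] = $contentScore / $divider;
        ++$level;
    }

    return $scores;
}
```

A real implementation would accumulate these shares on the nodes themselves rather than key them by tag name.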
Signed-off-by: Kevin Decherf <kevin@kdecherf.com>