We use `DOMDocument::registerNodeClass()` to make DOM methods return
`JSLikeHTMLElement` instead of `DOMElement`. Unfortunately, PHPStan cannot
detect that, so we need to cast the results ourselves:
https://github.com/phpstan/phpstan/discussions/10748
We may want to deprecate it in the future just to get rid of this mess.
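A minimal sketch of the situation described above, for illustration only. `ExampleElement` is a hypothetical stand-in for `JSLikeHTMLElement`, and the `instanceof` check is just one possible way of doing the cast:

```php
<?php

// Hypothetical stand-in for JSLikeHTMLElement so the sketch is self-contained.
class ExampleElement extends DOMElement
{
    public function trimmedText(): string
    {
        return trim($this->textContent);
    }
}

$dom = new DOMDocument();
// From now on, element nodes of this document are created as ExampleElement.
$dom->registerNodeClass(DOMElement::class, ExampleElement::class);
$dom->loadXML('<root><p> hello </p></root>');

$node = $dom->getElementsByTagName('p')->item(0);

// Static analysis only knows the declared DOM return type here,
// so we narrow the type ourselves before calling the extra method.
if (!$node instanceof ExampleElement) {
    throw new LogicException('Expected the registered node class.');
}

$text = $node->trimmedText();
```

At runtime the check always passes once the class is registered; it exists only so that the analyser accepts the call to the subclass method.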
Also add PHPStan stubs for DOM classes so that we do not need to cast everything.
It is fine to do that globally as we only ever use DOM with `JSLikeHTMLElement` registered.
This patch also allows us to get rid of the assertions in tests.
Once we bump the minimum PHP version, we will get a newer PHP-CS-Fixer,
which will try to apply these cleanups.
Also manually tweak anonymous functions so that they are cleanly formatted
once we switch to `fn` syntax.
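As an illustration of the kind of rewrite involved (the variable names are made up for this sketch and are not taken from the code base), a closure whose body is a single `return` maps directly onto an arrow function:

```php
<?php

$factor = 3;
$values = [1, 2, 3];

// Classic anonymous function: outer variables must be imported with `use`.
$before = array_map(function (int $value) use ($factor): int {
    return $value * $factor;
}, $values);

// Arrow function (PHP 7.4+): outer variables are captured by value automatically.
$after = array_map(fn (int $value): int => $value * $factor, $values);
```

Both calls produce the same array; only the syntax differs.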
Readability previously removed (or rather tried to, see next section)
invisible nodes using a pattern from `unlikelyCandidates`. This was
quite hacky and was dropped during a backport of logic from
mozilla/readability. These nodes still need to be removed, so this
change brings the removal back. We still use a pattern, but match it
specifically against the `style` attribute. We also remove nodes that
have the `hidden` attribute.
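A rough sketch of that approach, not the actual implementation. The regular expression below is an assumption chosen for illustration; the pattern used by the library may differ:

```php
<?php

libxml_use_internal_errors(true); // keep parser warnings out of the output

$dom = new DOMDocument();
$dom->loadHTML(
    '<div><p>Visible</p>'
    . '<p style="display: none">Styled away</p>'
    . '<p hidden="hidden">Hidden attribute</p></div>'
);

// Assumed pattern: inline styles that make a node invisible.
$invisibleStyle = '/display\s*:\s*none|visibility\s*:\s*hidden/i';

// Snapshot the live node list first, since we remove nodes while looping.
foreach (iterator_to_array($dom->getElementsByTagName('*')) as $node) {
    $isInvisible = $node->hasAttribute('hidden')
        || 1 === preg_match($invisibleStyle, $node->getAttribute('style'));

    if ($isInvisible && null !== $node->parentNode) {
        $node->parentNode->removeChild($node);
    }
}

$remaining = $dom->getElementsByTagName('p');
```

Matching only the `style` attribute keeps the check narrow, so class names or ids that merely contain the word "hidden" are left alone.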
The `clean` feature of tidy actually replaces inline style attributes
with CSS classes, which prevents Readability from detecting invisible nodes;
see https://github.com/htacg/tidy-html5/blob/5.6.0/src/clean.c#L1488
We therefore set the `clean` option to false.
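For reference, a minimal sketch of passing that option to the tidy extension. Apart from `clean`, the option values shown are assumptions for illustration, and the call is guarded because the extension may not be installed:

```php
<?php

$tidyConfig = [
    'clean' => false,       // keep inline style attributes as they are
    'output-xhtml' => true, // assumption: not necessarily the library's setting
    'wrap' => 0,            // assumption: disable line wrapping
];

$html = '<p style="display:none">hidden</p><p>shown</p>';

if (function_exists('tidy_parse_string')) {
    $tidy = tidy_parse_string($html, $tidyConfig, 'utf8');
    if (false !== $tidy) {
        tidy_clean_repair($tidy);
        $html = tidy_get_output($tidy);
    }
}
```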
Signed-off-by: Kevin Decherf <kevin@kdecherf.com>
This change backports several things from mozilla/readability:
- Add child score to all ancestors instead of the first parent only
- Check the top 5 candidates and try to find alternative candidates among
their ancestors; this can help find a better parent and grab more content
- Reduce patterns from `unlikelyCandidates` to the ones used by Mozilla,
as ours tend to remove useful nodes
- Score headers (h2 to h6) by default in addition to div, p, td and
section
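The first item can be sketched roughly as follows. The function name, the depth limit and the dividers (1 for the parent, 2 for the grandparent, `level * 3` beyond that, as in Mozilla's Readability.js) are assumptions for this illustration, not a copy of the backported code:

```php
<?php

// Hypothetical helper: spread a node's score over its ancestors,
// with a smaller share the further up the tree we go.
function addScoreToAncestors(DOMElement $node, float $score, SplObjectStorage $scores, int $maxDepth = 5): void
{
    $level = 0;
    $ancestor = $node->parentNode;

    while ($ancestor instanceof DOMElement && $level < $maxDepth) {
        if (0 === $level) {
            $divider = 1;
        } elseif (1 === $level) {
            $divider = 2;
        } else {
            $divider = $level * 3;
        }

        $scores[$ancestor] = ($scores[$ancestor] ?? 0.0) + $score / $divider;

        $ancestor = $ancestor->parentNode;
        ++$level;
    }
}

$dom = new DOMDocument();
$dom->loadXML('<div><section><article><p>text</p></article></section></div>');

$scores = new SplObjectStorage();
$paragraph = $dom->getElementsByTagName('p')->item(0);
addScoreToAncestors($paragraph, 6.0, $scores);

$article = $dom->getElementsByTagName('article')->item(0);
$section = $dom->getElementsByTagName('section')->item(0);
$div = $dom->documentElement;
```

With a child score of 6, the `article` receives 6, the `section` 3 and the outer `div` 1, so distant ancestors can still become candidates without outranking closer ones.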
Signed-off-by: Kevin Decherf <kevin@kdecherf.com>
When the node isn't a `div` or `p` (for example when it's an `article`), we append a new node with the same id, which generates the DOM error "DOMElement::setAttribute(): ID blabla already defined".
Most of the time, they can be useful.
At worst, it will be a link to something unrelated, but we won't lose a link inside the content.
Also, add some extra space.