Because of a PHP 8.2 deprecation, in f14428e4c0 we stopped converting non-ASCII characters to HTML entities. Instead, we started explicitly inserting a `meta[charset]` tag at the start of the document.
Later, we discovered that this was breaking `html[lang]`, so in efbbc86df9 we made the insertion smarter. One of the improvements was to skip inserting the `meta[charset]` tag when one was already present.
That, however, broke websites that had a `title` tag before `meta[charset]`. On those, libxml2 would decode the `title` contents as ISO-8859-1.
We could improve the logic (e.g. check that there is no text content before `meta[charset]`) or insert the tag unconditionally, but it will probably be simplest to go back to converting the non-ASCII characters to entities, this time using a non-deprecated function.
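A minimal sketch of the entity conversion, assuming the deprecated call was the `HTML-ENTITIES` mode of `mb_convert_encoding` and that `mb_encode_numericentity` is the replacement. The helper name `encodeNonAscii` is hypothetical and the mbstring extension is required:

```php
<?php

/**
 * Replace every non-ASCII character with a numeric HTML entity, so the
 * result is pure ASCII and no longer depends on the parser's charset guess.
 * Hypothetical helper, not necessarily what the library does.
 */
function encodeNonAscii(string $html): string
{
    // Convmap layout: [first code point, last code point, offset, mask].
    $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];

    return mb_encode_numericentity($html, $convmap, 'UTF-8');
}

// The title comes before meta[charset], the case that broke decoding.
$encoded = encodeNonAscii('<title>Café</title><meta charset="utf-8">');
// $encoded is '<title>Caf&#233;</title><meta charset="utf-8">'
```

Since the output contains only ASCII bytes, the position of `meta[charset]` relative to the text no longer matters.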
We use `DOMDocument::registerNodeClass()` to make DOM methods return
`JSLikeHTMLElement` instead of `DOMElement`. Unfortunately, it is not
possible for PHPStan to detect that so we need to cast it ourselves:
https://github.com/phpstan/phpstan/discussions/10748
We may want to deprecate it in the future just to get rid of this mess.
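To illustrate the problem, here is a small sketch with a hypothetical `ExampleElement` class standing in for `JSLikeHTMLElement`. At runtime the DOM returns the registered class, but static analysis only sees `DOMElement`, so the type has to be narrowed by hand:

```php
<?php

// Hypothetical stand-in for JSLikeHTMLElement.
class ExampleElement extends DOMElement
{
    public function trimmedText(): string
    {
        return trim($this->textContent);
    }
}

$doc = new DOMDocument();
// From now on, element nodes of this document are ExampleElement instances.
$doc->registerNodeClass(DOMElement::class, ExampleElement::class);
$doc->loadHTML('<p> Hello </p>');

$node = $doc->getElementsByTagName('p')->item(0);

// Static analysis cannot follow registerNodeClass(), so narrow explicitly.
if (!$node instanceof ExampleElement) {
    throw new LogicException('Expected the registered element class.');
}

$text = $node->trimmedText();
```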
Also add PHPStan stubs for DOM classes so that we do not need to cast everything.
It is fine to do that globally as we only ever use DOM with `JSLikeHTMLElement` registered.
This patch also allows us to get rid of the assertions in tests.
`DOMDocument::loadHTML` will parse HTML documents as ISO-8859-1 if there is no `meta[charset]` tag. This means that UTF-8-encoded HTML fragments, such as those coming from the JSON-LD `articleBody` field, would be parsed with the wrong encoding.
In f14428e4c0, we tried to resolve it by putting a `meta[charset]` tag at the start of the HTML fragment. Unfortunately, it turns out that this causes the parser to auto-insert an `html` element, losing the attributes of the original `html` tag.
Let’s try to insert the `meta[charset]` tag into the proper place in the HTML document.
We do not need to use the same trick with `JSLikeHTMLElement::__set`.
That expects smaller HTML fragments, not `html` documents, so creating `html` and `head` elements will not be a problem.
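A rough sketch of what "the proper place" could look like, assuming a regex-based approach. The function name `insertMetaCharset` and the exact patterns are illustrative assumptions, not the code from the commit:

```php
<?php

/**
 * Insert a meta[charset] tag without displacing the original html tag.
 * Hypothetical helper; the patterns below are assumptions.
 */
function insertMetaCharset(string $html): string
{
    $meta = '<meta charset="utf-8">';

    // Leave the document alone when it already declares a charset.
    if (preg_match('/<meta\b[^>]*\bcharset\s*=/i', $html) === 1) {
        return $html;
    }

    // Prefer the spot right after the opening head tag, then after html.
    foreach (['/<head\b[^>]*>/i', '/<html\b[^>]*>/i'] as $pattern) {
        $result = preg_replace($pattern, '$0' . $meta, $html, 1, $count);
        if ($result !== null && $count > 0) {
            return $result;
        }
    }

    // Bare fragment: prepending is safe, there is no html tag to lose.
    return $meta . $html;
}

$withHead = insertMetaCharset('<html lang="fr"><head><title>x</title></head></html>');
$fragment = insertMetaCharset('<p>é</p>');
```

Because the tag is placed after the existing `html` or `head` tag, the parser has no reason to synthesize a new `html` element, so attributes such as `lang` survive.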
Once we bump the minimum PHP version, we will get a newer PHP-CS-Fixer,
which will try to apply these cleanups.
Also manually tweak anonymous functions so that they are cleanly formatted
once we switch to `fn` syntax.
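For illustration only, this is the kind of rewrite involved; the variable names are made up. Arrow functions require PHP 7.4 or later:

```php
<?php

$words = ['readability', 'dom', 'libxml'];

// Classic anonymous function.
$lengthsClosure = array_map(function (string $word): int {
    return strlen($word);
}, $words);

// Equivalent arrow function.
$lengthsArrow = array_map(fn (string $word): int => strlen($word), $words);
```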
Readability previously removed (or rather tried to remove, see next
section) invisible nodes using a pattern from `unlikelyCandidates`. This
was quite hacky and was dropped during a backport of logic from
mozilla/readability. There is still a need to remove them, so here we
are. We still use a pattern, but specifically against the `style`
attribute. We also remove nodes with the `hidden` attribute.
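A minimal sketch of the idea, assuming the pattern looks for `display: none` and `visibility: hidden` in the `style` attribute. The function name and the exact regex are assumptions, not the library's code:

```php
<?php

/**
 * Remove elements hidden through inline style or the hidden attribute.
 * Returns the number of removed nodes. Hypothetical helper.
 */
function removeInvisibleNodes(DOMDocument $doc): int
{
    $xpath = new DOMXPath($doc);
    $removed = 0;

    // Copy the matches into an array so removals do not disturb iteration.
    foreach (iterator_to_array($xpath->query('//*[@style or @hidden]')) as $node) {
        $hidden = $node->hasAttribute('hidden');
        $styledAway = preg_match(
            '/display\s*:\s*none|visibility\s*:\s*hidden/i',
            $node->getAttribute('style')
        ) === 1;

        if (($hidden || $styledAway) && null !== $node->parentNode) {
            $node->parentNode->removeChild($node);
            ++$removed;
        }
    }

    return $removed;
}

$doc = new DOMDocument();
$doc->loadHTML('<div><p style="display: none">a</p><p hidden>b</p><p>c</p></div>');
$count = removeInvisibleNodes($doc);
```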
The clean feature of tidy actually replaces inline style attributes
with CSS classes, thus preventing Readability from detecting invisible nodes,
see https://github.com/htacg/tidy-html5/blob/5.6.0/src/clean.c#L1488
We therefore set the `clean` configuration option to false.
Signed-off-by: Kevin Decherf <kevin@kdecherf.com>
This change backports several things from mozilla/readability:
- Add child score to all ancestors instead of the first parent only
- Check the top 5 candidates and try to find alternative candidates among
their ancestors; this can help find a better parent and grab more content
- Reduce patterns from `unlikelyCandidates` to the ones used by Mozilla,
as ours tend to remove useful nodes
- Score headers (h2 to h6) by default in addition to div, p, td and
section
Signed-off-by: Kevin Decherf <kevin@kdecherf.com>
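As a sketch of the first item in the list above: mozilla/readability divides a child's score by 1 for the parent, 2 for the grandparent, and `level * 3` for higher ancestors. Whether this backport uses the same dividers is an assumption, and `ancestorContributions` is a hypothetical helper:

```php
<?php

/**
 * Compute how much of a child's score each ancestor receives.
 * Index 0 is the parent, 1 the grandparent, and so on.
 * Dividers follow mozilla/readability; assumed here, not confirmed.
 */
function ancestorContributions(float $childScore, int $depth): array
{
    $contributions = [];

    for ($level = 0; $level < $depth; ++$level) {
        if (0 === $level) {
            $divider = 1;
        } elseif (1 === $level) {
            $divider = 2;
        } else {
            $divider = $level * 3;
        }

        $contributions[$level] = $childScore / $divider;
    }

    return $contributions;
}

$shares = ancestorContributions(6.0, 3);
// $shares is [6.0, 3.0, 1.0]
```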
When the node isn't a `div` or `p` (for example, when it's an `article`), we append a new node with the same id, which generates the DOM error "DOMElement::setAttribute(): ID blabla already defined".
Most of the time, they can be useful.
At worst, it'll be a link to something unrelated, but we won't lose a link inside the content.
Also, adding some extra space.