Because of PHP 8.2 deprecation, in f14428e4c0, we stopped converting non-ASCII characters to HTML entities. Instead, we started to explicitly insert `meta[charset]` tag at the start of the document.
Later, we discovered that was breaking `html[lang]` so, in efbbc86df9, we made the insertion smarter. One of the improvements was that it would not insert the `meta[charset]` tag when it was already present.
That, however, broke websites that had `title` tag before `meta[charset]`. On those, libxml2 would decode the `title` contents as ISO-8859-1.
We could improve the logic (e.g. check that there is not text content before `meta[charset]`) or insert the tag unconditionally but it will probably be simplest to just go back to converting the non-ASCII characters to entities, just using non-deprecated function variant.
`isset` does two things: it checks if the entity resulting from the chain of accessors is defined, and if it is not `null`.
We know that the properties are always defined so we can just replace those `isset`s with `null` checks.
This will allow us to reason about the code more clearly.
We use `DOMDocument::registerNodeClass()` to make DOM methods return
`JSLikeHTMLElement` instead of `DOMElement`. Unfortunately, it is not
possible for PHPStan to detect that so we need to cast it ourselves:
https://github.com/phpstan/phpstan/discussions/10748
We may want to deprecate it in the future just to get rid of this mess.
Also add PHPStan stubs for DOM classes so that we do not need to cast everything.
It is fine to do that globally as we only ever use DOM with `JSLikeHTMLElement` registered.
This patch also allows us to get rid of the assertions in tests.
Since we require PHP 7.4, contravariance in param types is supported,
so we do not need to worry about subclasses that widen the param type.
It will be only breaking in the unlikely case a subclass uses a type that
contradicts the PHPDoc type annotation and does not not extend `DOMNode`.
Also fix the type annotation since some invocations pass it a `DOMText`,
an arbitrary sibling/child `DOMNode` or even `null`.
`DOMAttr::$value` must be a `string`.
Let’s add helpers for manipulating the `readability` attribute
so that we do not have to keep casting it from and to `string`
in order to appease `strict_types`.
`DOMDocument::loadHTML` will parse HTML documents as ISO-8859-1 if there is no `meta[charset]` tag. This means that UTF-8-encoded HTML fragments such as those coming from JSON-LD `articleBody` field would be parsed with incorrect encoding.
In f14428e4c0, we tried to resolve it by putting `meta[charset]` tag at the start of the HTML fragment. Unfortunately, it turns out that causes parser to auto-insert a `html` element, losing the attributes of the original `html` tag.
Let’s try to insert the `meta[charset]` tag into the proper place in the HTML document.
We do not need to use the same trick with `JSLikeHTMLElement::__set`.
That expects smaller HTML fragments, not `html` documents, so creating `html` and `head` elements will not be a problem.
`parse_url($this->url, \PHP_URL_HOST)` will return `null` for local filesystem path.
Casting it to `string` will produce an empty regular expression,
which would match any link when computing link density.
`DOMNodeList` implements `Traversable`.
There are some `for` loops left but we cannot simply replace those:
PHP follows the DOM specification, which requires that `NodeList`
objects in the DOM are live. As a result, any operation that removes
a node list member node from its parent (such as `removeChild`,
`replaceChild` or `appendChild`) will cause the next node
in the iterator to be skipped.
We could work around that by converting those node lists to static arrays
using `iterator_to_array` but not sure if it is worth it.
It would fail for e.g. `<div> <p>foo</p> </div>`.
mozilla/readability uses children for the tag lookup, which return only elements.
PHP does not have children property so b580cf216d
mistakenly used `childNodes` instead, but that can return any node type.
Let’s filter the children ourselves.
Also add comments from mozilla/readability’s `_hasSingleTagInsideElement`.
Once we bump minimum PHP version, we will get newer PHP-CS-Fixer,
which will try to apply this cleanups.
Also manually tweak anonymous functions so that they are cleanly formatted
once we switch to `fn` syntax.