Because of PHP 8.2 deprecation, in f14428e4c0, we stopped converting non-ASCII characters to HTML entities. Instead, we started to explicitly insert `meta[charset]` tag at the start of the document.
Later, we discovered that was breaking `html[lang]` so, in efbbc86df9, we made the insertion smarter. One of the improvements was that it would not insert the `meta[charset]` tag when it was already present.
That, however, broke websites that had `title` tag before `meta[charset]`. On those, libxml2 would decode the `title` contents as ISO-8859-1.
We could improve the logic (e.g. check that there is not text content before `meta[charset]`) or insert the tag unconditionally but it will probably be simplest to just go back to converting the non-ASCII characters to entities, just using non-deprecated function variant.
`DOMDocument::loadHTML` will parse HTML documents as ISO-8859-1 if there is no `meta[charset]` tag. This means that UTF-8-encoded HTML fragments such as those coming from JSON-LD `articleBody` field would be parsed with incorrect encoding.
In f14428e4c0, we tried to resolve it by putting `meta[charset]` tag at the start of the HTML fragment. Unfortunately, it turns out that causes parser to auto-insert a `html` element, losing the attributes of the original `html` tag.
Let’s try to insert the `meta[charset]` tag into the proper place in the HTML document.
We do not need to use the same trick with `JSLikeHTMLElement::__set`.
That expects smaller HTML fragments, not `html` documents, so creating `html` and `head` elements will not be a problem.
(cherry picked from commit efbbc86df9)
Had to strip type hints since we still target PHP 5.6.
`parse_url($this->url, \PHP_URL_HOST)` will return `null` for local filesystem path.
Casting it to `string` will produce an empty regular expression,
which would match any link when computing link density.
(cherry picked from commit c7208f6ad2)
This also fixes a warning since 1.x passes the `null` directly to `preg_replace` instead of explicitly casting it to `string`.
Once we bump minimum PHP version, we will get newer PHP-CS-Fixer,
which will try to apply this cleanups.
(partially cherry picked from commit 648d8c605b)
Though avoid disabling `modernize_strpos` since it was only introduced in PHP-CS-Fixer 3.2.0:
2ca22a27c4
Also had to disable `visibility_required` for constants since those require PHP ≥ 7.1:
https://cs.symfony.com/doc/rules/class_notation/visibility_required.html
And remove type hint from `grabArticle` since implicitly nullable types were deprecated in PHP 8.4:
https://wiki.php.net/rfc/deprecate-implicitly-nullable-types
But we cannot use explicitly nullable types, which require PHP ≥ 7.1:
https://wiki.php.net/rfc/nullable_types
Also switch code blocks to Markdown syntax to work around `phpdoc_separation`, ApiGen uses Markdown these days anyway.
(partially cherry picked from commit 9ed89bde92)
This is deprecated since PHP 8.2:
Deprecated: mb_convert_encoding(): Handling HTML entities via mbstring is deprecated; use htmlspecialchars, htmlentities, or mb_encode_numericentity/mb_decode_numericentity instead
It was used because `DOMDocument`, which uses libxml2 internally, will parse the HTML as ISO-8859-1, unless the document contains an XML encoding declaration or HTML meta tag setting character set.
Since first such element wins, putting the `meta[charset]` up front will ensure the parser uses the correct encoding, even if the document contains incorrect meta tag (e.g. when the document is converted to UTF-8 without also updating the metadata by the software passing it to Readability).
https://stackoverflow.com/a/39148511/160386
(cherry picked from commit f14428e4c0)
Huge tags can lead to a failure of preg_replace, thus erasing the whole
fetched content.
Fixes https://github.com/wallabag/wallabag/issues/5847
Signed-off-by: Kevin Decherf <kevin@kdecherf.com>
> Method "Psr\Log\LoggerAwareInterface::setLogger()" might add "void" as a native return type declaration in the future. Do the same in implementation "Readability\Readability" now to avoid errors or add an explicit @return annotation to suppress this message.