Because of a PHP 8.2 deprecation, in f14428e4c0, we stopped converting non-ASCII characters to HTML entities. Instead, we started to explicitly insert a `meta[charset]` tag at the start of the document.
Later, we discovered that was breaking `html[lang]` so, in efbbc86df9, we made the insertion smarter. One of the improvements was that it would not insert the `meta[charset]` tag when it was already present.
That, however, broke websites that had a `title` tag before `meta[charset]`. On those, libxml2 would decode the `title` contents as ISO-8859-1.
We could improve the logic (e.g. check that there is no text content before `meta[charset]`) or insert the tag unconditionally, but it will probably be simplest to go back to converting the non-ASCII characters to entities, this time using a non-deprecated function.
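A minimal sketch of the entity-based approach, assuming `mb_encode_numericentity` is the non-deprecated function meant here (the PHP 8.2 deprecation notice names it as a replacement). The sample input and variable names are illustrative and not taken from the Readability source.

```php
<?php

// Illustrative UTF-8 input; not taken from the Readability source.
$html = '<title>Café</title>';

// Conversion map: encode every code point from U+0080 to U+10FFFF,
// with no offset, keeping all bits of the code point.
$convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];

// Non-ASCII characters become numeric entities, so the result is pure ASCII
// and reads the same whichever encoding the parser assumes.
$encoded = mb_encode_numericentity($html, $convmap, 'UTF-8');

echo $encoded, "\n"; // prints "<title>Caf&#233;</title>"
```

Because the output contains only ASCII bytes, the position of `meta[charset]` relative to `title` no longer affects how the text is decoded.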
`DOMDocument::loadHTML` will parse HTML documents as ISO-8859-1 if there is no `meta[charset]` tag. This means that UTF-8-encoded HTML fragments such as those coming from JSON-LD `articleBody` field would be parsed with incorrect encoding.
In f14428e4c0, we tried to resolve it by putting a `meta[charset]` tag at the start of the HTML fragment. Unfortunately, it turns out that this causes the parser to auto-insert an `html` element, losing the attributes of the original `html` tag.
Let’s try to insert the `meta[charset]` tag into the proper place in the HTML document.
We do not need to use the same trick with `JSLikeHTMLElement::__set`.
That expects smaller HTML fragments, not `html` documents, so creating `html` and `head` elements will not be a problem.
(cherry picked from commit efbbc86df9)
Had to strip type hints since we still target PHP 5.6.
`parse_url($this->url, \PHP_URL_HOST)` will return `null` for local filesystem path.
Casting it to `string` will produce an empty regular expression,
which would match any link when computing link density.
(cherry picked from commit c7208f6ad2)
This also fixes a warning since 1.x passes the `null` directly to `preg_replace` instead of explicitly casting it to `string`.
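The failure mode can be reproduced in isolation. This is only a sketch: the path is hypothetical, and the way the pattern is assembled is an assumption for illustration, not the code used when computing link density.

```php
<?php

// Hypothetical local path standing in for $this->url.
$url = '/tmp/article.html';

// A filesystem path has no host component, so this is null.
$host = parse_url($url, PHP_URL_HOST);

// Assumed pattern construction, for illustration only: casting null to
// string yields '', which leaves an empty regular expression.
$pattern = '/' . preg_quote((string) $host, '/') . '/';

// An empty pattern matches at the start of any subject string.
$matchesUnrelatedLink = preg_match($pattern, 'https://example.org/page') === 1;

var_dump($host, $pattern, $matchesUnrelatedLink);
```

Checking for `null` before building the expression avoids treating every link as matching.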
Once we bump the minimum PHP version, we will get a newer PHP-CS-Fixer,
which will try to apply these cleanups.
(partially cherry picked from commit 648d8c605b)
Though avoid disabling `modernize_strpos` since it was only introduced in PHP-CS-Fixer 3.2.0:
2ca22a27c4
Also had to disable `visibility_required` for constants since those require PHP ≥ 7.1:
https://cs.symfony.com/doc/rules/class_notation/visibility_required.html
And remove type hint from `grabArticle` since implicitly nullable types were deprecated in PHP 8.4:
https://wiki.php.net/rfc/deprecate-implicitly-nullable-types
But we cannot use explicitly nullable types, which require PHP ≥ 7.1:
https://wiki.php.net/rfc/nullable_types
Also switch code blocks to Markdown syntax to work around `phpdoc_separation`; ApiGen uses Markdown these days anyway.
(partially cherry picked from commit 9ed89bde92)
This is deprecated since PHP 8.2:
Deprecated: mb_convert_encoding(): Handling HTML entities via mbstring is deprecated; use htmlspecialchars, htmlentities, or mb_encode_numericentity/mb_decode_numericentity instead
It was used because `DOMDocument`, which uses libxml2 internally, will parse the HTML as ISO-8859-1, unless the document contains an XML encoding declaration or HTML meta tag setting character set.
Since the first such element wins, putting the `meta[charset]` up front will ensure the parser uses the correct encoding, even if the document contains an incorrect meta tag (e.g. when the software passing the document to Readability converts it to UTF-8 without also updating the metadata).
https://stackoverflow.com/a/39148511/160386
(cherry picked from commit f14428e4c0)
Huge tags can lead to a failure of preg_replace, thus erasing the whole
fetched content.
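One possible guard against this is sketched below; it is not necessarily what this commit does. `replaceOrKeep` is a hypothetical helper. The failure is triggered with invalid UTF-8 under the `u` modifier only because that fails deterministically; it stands in for the PCRE limit errors that huge tags can cause.

```php
<?php

// Hypothetical helper: apply a replacement but keep the original content
// when PCRE reports an error instead of silently ending up with null.
function replaceOrKeep($pattern, $replacement, $subject)
{
    $result = preg_replace($pattern, $replacement, $subject);

    // preg_replace() returns null on error (for example when the
    // backtrack limit is exceeded), not the unmodified subject.
    return null === $result ? $subject : $result;
}

// Normal case: the replacement is applied.
$cleaned = replaceOrKeep('/<br\s*\/?>/i', "\n", 'one<br/>two');

// Failure case: invalid UTF-8 with the `u` modifier makes preg_replace()
// return null, so the input comes back unchanged.
$broken = "caf\xE9 <b>text</b>";
$kept = replaceOrKeep('/<b>/u', '', $broken);
```

After a failed call, `preg_last_error()` reports which PCRE error occurred.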
Fixes https://github.com/wallabag/wallabag/issues/5847
Signed-off-by: Kevin Decherf <kevin@kdecherf.com>
> Method "Psr\Log\LoggerAwareInterface::setLogger()" might add "void" as a native return type declaration in the future. Do the same in implementation "Readability\Readability" now to avoid errors or add an explicit @return annotation to suppress this message.
HTML 4.01 Strict only allows block-level elements within noscript, form and
blockquote. The `enclose-block-text` option fixes the instances when those
elements contain inline elements or text by wrapping the children in paragraphs.
HTML 5 has a looser content model and allows noscript elements basically anywhere,
including paragraphs, making the noscript elements inherit the parent element’s
content model. This means that tidy will produce invalid HTML with nested paragraphs
for `p > noscript > text`, a structure that would be invalid on two counts
in the HTML 4 Strict profile but is completely valid in HTML 5.
Popular WordPress image lazy-loading code produces precisely that structure
so tidy “corrects” it to invalid code. In a proper HTML parser, the produced
code would force close the outer paragraph, making the noscript element
its sibling instead of a child. The only reason this does not break Graby’s code
for stripping the lazy-loading HTML is that libxml2 contains a bug
counteracting this:
https://gitlab.gnome.org/GNOME/libxml2/-/issues/205
Since all three elements allow flow content in HTML 5, it does not make much
sense to enable this option any more. The only issues that could occur
are producing HTML code not conforming to 4.01 Strict, which was never guaranteed,
as our example shows, and having blockquotes contain text nodes not wrapped
in paragraphs, which some ancient stylesheets might expect.
The latter is only a minor and easily fixable visual backwards incompatibility.
`electrolinux/php-html5lib` was quite old and incompatible with the upcoming Composer 2.0.
Switch to `masterminds/html5`, which gives the same result and is maintained.
Also:
- keep README in vendors
- use new Scrutinizer engine
- test with lower deps
- remove php-coveralls dev deps and download the phar during the CI build
A change released in tidy 5.6.0 breaks php-tidy when using
tidy_parse_string+tidy_clean_repair and wrap=0, incorrectly wrapping
every single word. Also, it seems that $tidy->value should not be used to
retrieve the repaired HTML, since it is undocumented and meant for internal
use.
We replace the call with tidy_repair_string which directly returns the
repaired string.
Relates to https://github.com/htacg/tidy-html5/issues/673
Relates to https://bugs.php.net/bug.php?id=75947
Tests pass.
Signed-off-by: Kevin Decherf <kevin@kdecherf.com>
This isn't the best solution but the previous one using `@` wasn't really better.
Appending a string to a fragment might generate a warning if the string contains a bad entity.
For example `+`.
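A sketch of one way to avoid both the warning and the `@` operator, assuming libxml's internal error handling is acceptable here; this is not necessarily what the commit does. The `&nbsp;` entity is an assumed example of an entity that is not defined in XML.

```php
<?php

$document = new DOMDocument();
$fragment = $document->createDocumentFragment();

// Collect libxml errors instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

// `&nbsp;` is an assumed example of an entity that XML does not define.
$appended = $fragment->appendXML('<p>one&nbsp;two</p>');

// Inspect what went wrong, then restore the previous error handling.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

var_dump($appended, count($errors));
```

When `appendXML` returns `false`, the caller can fall back to another way of inserting the content.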
Some contents have an `infocontent` node (or something similar) that holds real content.
Using only `info` as the regex is too aggressive and removes legitimate content.
Matching the whole word `info` (or `infos`) should be a better choice.
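The difference can be seen with two illustrative patterns. Both are assumptions for the sake of the example; the expressions actually used in Readability may differ.

```php
<?php

// Assumed patterns for illustration; the actual expressions may differ.
$loose = '/info/i';         // matches `info` anywhere inside a name
$strict = '/\binfos?\b/i';  // matches only the whole word `info` or `infos`

$names = ['infocontent', 'info', 'infos', 'article-info'];

$results = [];
foreach ($names as $name) {
    $results[$name] = [
        'loose' => preg_match($loose, $name) === 1,
        'strict' => preg_match($strict, $name) === 1,
    ];
}

print_r($results);
```

Only `infocontent` is treated differently: the loose pattern flags it, while the whole-word pattern leaves it alone.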
We append a new node with the same id when the original is not a `div` or `p` (e.g. when it is an `article`), which generates a DOM error "DOMElement::setAttribute(): ID blabla already defined".
Most of the time, they can be useful.
At worst, it'll be a link to something unrelated. But we won't lose a link inside the content.
Also, adding some extra space.