Do not use `mb_convert_encoding` with `HTML-ENTITIES` as the target encoding

This has been deprecated since PHP 8.2:

    Deprecated: mb_convert_encoding(): Handling HTML entities via mbstring is deprecated; use htmlspecialchars, htmlentities, or mb_encode_numericentity/mb_decode_numericentity instead
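
For cases where entity encoding is still wanted, the deprecation message names `mb_encode_numericentity` as one replacement. A minimal sketch of that route follows; the sample string and the `$convmap` variable are illustrative assumptions, and this commit itself takes a different approach:

```php
<?php
// Illustrative input; any UTF-8 string works.
$value = 'café';

// Conversion map: [first code point, last code point, offset, mask].
// This range covers every non-ASCII code point and leaves ASCII untouched.
$convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];

// Replace each non-ASCII character with a decimal numeric entity.
$encoded = mb_encode_numericentity($value, $convmap, 'UTF-8');

// The inverse call restores the original string.
$decoded = mb_decode_numericentity($encoded, $convmap, 'UTF-8');

echo $encoded, "\n"; // caf&#233;
```

Unlike the old `HTML-ENTITIES` target, this emits only numeric entities, never named ones such as `&eacute;`.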

The `mb_convert_encoding` call was used because `DOMDocument`, which uses libxml2 internally, parses HTML as ISO-8859-1 unless the document contains an XML encoding declaration or an HTML meta tag that sets the character set.
Since the first such element wins, putting the `meta[charset]` up front ensures the parser uses the correct encoding, even if the document contains an incorrect meta tag (e.g. when the software passing the document to Readability has converted it to UTF-8 without also updating the metadata).
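
The effect described above can be sketched in isolation. The helper name `firstParagraphText` and the sample markup are assumptions made for illustration and are not part of Readability:

```php
<?php
// Hypothetical helper: parse an HTML string and return the text of its first <p>.
function firstParagraphText(string $html): string
{
    $doc = new DOMDocument();

    // Collect libxml warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $paragraph = $doc->getElementsByTagName('p')->item(0);

    return $paragraph === null ? '' : $paragraph->textContent;
}

$text = 'Příliš žluťoučký kůň';
$fragment = '<p>' . $text . '</p>';

// No charset information: the UTF-8 bytes are typically misread as a
// single-byte encoding, so the returned text is expected to be garbled.
$withoutMeta = firstParagraphText($fragment);

// Charset declared up front: the parser decodes the input as UTF-8.
$withMeta = firstParagraphText('<meta charset="utf-8">' . $fragment);

var_dump($withMeta === $text);
```

The exact garbling in the first call depends on the libxml2 version, so the sketch only checks the result of the second call.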

https://stackoverflow.com/a/39148511/160386
pull/80/head
Jan Tojnar 3 years ago
parent 23f824a1ce
commit f14428e4c0
  src/JSLikeHTMLElement.php | 3
  src/Readability.php | 2

@@ -79,14 +79,13 @@ class JSLikeHTMLElement extends \DOMElement
 } else {
 // $value is probably ill-formed
 $f = new \DOMDocument();
-$value = mb_convert_encoding($value, 'HTML-ENTITIES', 'UTF-8');
 // Using <htmlfragment> will generate a warning, but so will bad HTML
 // (and by this point, bad HTML is what we've got).
 // We use it (and suppress the warning) because an HTML fragment will
 // be wrapped around <html><body> tags which we don't really want to keep.
 // Note: despite the warning, if loadHTML succeeds it will return true.
-$result = $f->loadHTML('<htmlfragment>' . $value . '</htmlfragment>');
+$result = $f->loadHTML('<meta charset="utf-8"><htmlfragment>' . $value . '</htmlfragment>');
 if ($result) {
 $import = $f->getElementsByTagName('htmlfragment')->item(0);

@@ -1426,7 +1426,7 @@ class Readability implements LoggerAwareInterface
 unset($tidy);
 }
-$this->html = mb_convert_encoding((string) $this->html, 'HTML-ENTITIES', 'UTF-8');
+$this->html = '<meta charset="utf-8">' . (string) $this->html;
 if ('html5lib' === $this->parser || 'html5' === $this->parser) {
 $this->dom = (new HTML5())->loadHTML($this->html);
