`DOMDocument::loadHTML` parses HTML documents as ISO-8859-1 when there is no `meta[charset]` tag. This means that UTF-8-encoded HTML fragments, such as those coming from the JSON-LD `articleBody` field, would be parsed with the wrong encoding.

In pull/103/head f14428e4c0, we tried to resolve this by putting a `meta[charset]` tag at the start of the HTML fragment. Unfortunately, it turns out that this causes the parser to auto-insert an `html` element, losing the attributes of the original `html` tag.

Let’s try to insert the `meta[charset]` tag into the proper place in the HTML document instead.

We do not need to use the same trick with `JSLikeHTMLElement::__set`. That expects smaller HTML fragments, not `html` documents, so creating `html` and `head` elements will not be a problem.

(cherry picked from commit efbbc86df9)

Had to strip type hints since we still target PHP 5.6.
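To illustrate what "the proper place" could mean, here is a minimal sketch and not the code from this commit. The helper name `insertCharsetMeta` and its regex-based approach are assumptions made for illustration: it puts the tag right after an existing `head` opening tag, falls back to right after the `html` opening tag, and only prepends it when the input looks like a bare fragment. It avoids type hints so that it also runs on PHP 5.6.

```php
<?php
// Hypothetical helper (not part of the library): returns $html with a
// <meta charset="utf-8"> tag inserted where it will not displace <html>.
function insertCharsetMeta($html)
{
    $meta = '<meta charset="utf-8">';

    // Preferred spot: directly after the opening <head ...> tag.
    // The \b keeps the pattern from matching <header>.
    $result = preg_replace('/<head\b[^>]*>/i', '$0' . $meta, $html, 1, $count);
    if ($count > 0) {
        return $result;
    }

    // No <head>: insert after the opening <html ...> tag, so the tag and
    // its attributes still come first in the input.
    $result = preg_replace('/<html\b[^>]*>/i', '$0' . $meta, $html, 1, $count);
    if ($count > 0) {
        return $result;
    }

    // Neither tag present: treat the input as a fragment and prepend.
    return $meta . $html;
}

// Usage: parse a UTF-8 document that has attributes on <html> but no charset.
$html = '<html lang="cs"><body><p>Příliš žluťoučký kůň</p></body></html>';

libxml_use_internal_errors(true); // collect parser warnings instead of emitting them
$dom = new DOMDocument();
$dom->loadHTML(insertCharsetMeta($html));

// Inspect whether the attribute and the text survived parsing.
echo $dom->documentElement->getAttribute('lang'), "\n";
echo $dom->getElementsByTagName('p')->item(0)->textContent, "\n";
```

A regex is a blunt tool here, since it can be fooled by a `<head>` string inside a comment or script. It is used only because the tag has to be in place before the parser runs.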
parent 5afefcff34
commit f1c6297e3c
2 changed files with 84 additions and 1 deletions