Because of PHP 8.2 deprecation, in f14428e4c0, we stopped converting non-ASCII characters to HTML entities. Instead, we started to explicitly insert `meta[charset]` tag at the start of the document.
Later, we discovered that was breaking `html[lang]` so, in efbbc86df9, we made the insertion smarter. One of the improvements was that it would not insert the `meta[charset]` tag when it was already present.
That, however, broke websites that had `title` tag before `meta[charset]`. On those, libxml2 would decode the `title` contents as ISO-8859-1.
We could improve the logic (e.g. check that there is not text content before `meta[charset]`) or insert the tag unconditionally but it will probably be simplest to just go back to converting the non-ASCII characters to entities, just using non-deprecated function variant.
`DOMDocument::loadHTML` will parse HTML documents as ISO-8859-1 if there is no `meta[charset]` tag. This means that UTF-8-encoded HTML fragments such as those coming from JSON-LD `articleBody` field would be parsed with incorrect encoding.
In f14428e4c0, we tried to resolve it by putting `meta[charset]` tag at the start of the HTML fragment. Unfortunately, it turns out that causes parser to auto-insert a `html` element, losing the attributes of the original `html` tag.
Let’s try to insert the `meta[charset]` tag into the proper place in the HTML document.
We do not need to use the same trick with `JSLikeHTMLElement::__set`.
That expects smaller HTML fragments, not `html` documents, so creating `html` and `head` elements will not be a problem.
(cherry picked from commit efbbc86df9)
Had to strip type hints since we still target PHP 5.6.
We append a new node when it isn't a `div` or `p` (like when it's an `article`) with the same id which generate a DOM error "DOMElement::setAttribute(): ID blabla already defined".
Most the time, they can be usefull.
At least, it'll be a link to something unrelated. But we won't lose a link inside the content.
Also, adding some extra space.