Fix hasSingleTagInsideElement method

It would return false for e.g. `<div> <p>foo</p> </div>`, because the whitespace around `<p>` produces extra text nodes.

mozilla/readability uses `children` for the tag lookup, which returns only elements.
PHP does not have a `children` property, so b580cf216d
mistakenly used `childNodes` instead, but that can return any node type.

Let’s filter the children ourselves.

Also add comments from mozilla/readability’s `_hasSingleTagInsideElement`.
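
The difference between `childNodes` and an element-only view can be sketched as follows. This is a minimal illustration, not the library code: `elementChildren()` is a hypothetical helper name, and the document is parsed with `loadXML` purely for the demo. Note that `array_filter` preserves the original keys, so the sketch reindexes with `array_values` before reading index 0.

```php
<?php
// Hypothetical helper (not part of the library): returns only the element
// children of $node, reindexed from 0.
function elementChildren(DOMNode $node): array
{
    return array_values(array_filter(
        iterator_to_array($node->childNodes),
        fn ($child) => $child instanceof DOMElement
    ));
}

$doc = new DOMDocument();
$doc->loadXML('<div> <p>foo</p> </div>');
$div = $doc->documentElement;

// childNodes sees the whitespace text nodes on both sides of <p>.
echo $div->childNodes->length, "\n"; // 3

// The filtered list contains only the <p> element.
$children = elementChildren($div);
echo count($children), "\n";         // 1
echo $children[0]->nodeName, "\n";   // p
```

Without `array_values`, the single element in this example would sit under key 1 (its position in `childNodes`), not key 0.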
pull/88/head
Jan Tojnar 2 years ago
parent 29122763db
commit 677f3f096e
src/Readability.php (13 lines changed)

@@ -1477,14 +1477,23 @@ class Readability implements LoggerAwareInterface
         );
     }
+    /**
+     * Checks if `$node` has only whitespace and a single element with `$tag` for the tag name.
+     * Returns false if `$node` contains non-empty text nodes
+     * or if it contains no element with given tag or more than 1 element.
+     */
     private function hasSingleTagInsideElement(\DOMElement $node, string $tag): bool
     {
-        if (1 !== $node->childNodes->length || $node->childNodes->item(0)->nodeName !== $tag) {
+        $childNodes = iterator_to_array($node->childNodes);
+        $children = array_filter($childNodes, fn ($childNode) => $childNode instanceof \DOMElement);
+        // There should be exactly 1 element child with given tag
+        if (1 !== \count($children) || $children[0]->nodeName !== $tag) {
             return false;
         }
         $a = array_filter(
-            iterator_to_array($node->childNodes),
+            $childNodes,
             fn ($childNode) => $childNode instanceof \DOMText && preg_match($this->regexps['hasContent'], $this->getInnerText($childNode))
         );
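
The behaviour described in the docblock can be sketched as a standalone function. This is an illustration under stated assumptions, not the method from the commit: `hasSingleTagInside()` and `parseRoot()` are hypothetical names, and a plain `trim()` check stands in for the library's `hasContent` regular expression and `getInnerText()` helper.

```php
<?php
// Standalone sketch; the trim() test is an assumed stand-in for the
// library's `hasContent` regular expression.
function hasSingleTagInside(DOMElement $node, string $tag): bool
{
    $elements = [];
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMElement) {
            $elements[] = $child;
        } elseif ($child instanceof DOMText && '' !== trim($child->textContent)) {
            // Non-whitespace text next to the element disqualifies the node.
            return false;
        }
    }

    // Exactly one element child, and it must carry the requested tag name.
    return 1 === count($elements) && $elements[0]->nodeName === $tag;
}

// Hypothetical demo helper: parses a fragment and returns its root element.
function parseRoot(string $xml): DOMElement
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);

    return $doc->documentElement;
}

var_dump(hasSingleTagInside(parseRoot('<div> <p>foo</p> </div>'), 'p'));     // bool(true)
var_dump(hasSingleTagInside(parseRoot('<div>bar <p>foo</p></div>'), 'p'));   // bool(false)
var_dump(hasSingleTagInside(parseRoot('<div><p>a</p><p>b</p></div>'), 'p')); // bool(false)
```

Collecting elements in a loop avoids depending on the keys that `array_filter` carries over from `childNodes`.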
