php-readability

Commit Graph

Author	SHA1	Message	Date
Jeremy Benoist	0b21d4ab2d	Clean	6 months ago
Jeremy Benoist	720f0d5503	Allow new psr/log & monolog	6 months ago
Jan Tojnar	40219d4595	Fix character decoding regression when `title` precedes `meta[charset]` Because of PHP 8.2 deprecation, in `f14428e4c0`, we stopped converting non-ASCII characters to HTML entities. Instead, we started to explicitly insert `meta[charset]` tag at the start of the document. Later, we discovered that was breaking `html[lang]` so, in `efbbc86df9`, we made the insertion smarter. One of the improvements was that it would not insert the `meta[charset]` tag when it was already present. That, however, broke websites that had `title` tag before `meta[charset]`. On those, libxml2 would decode the `title` contents as ISO-8859-1. We could improve the logic (e.g. check that there is not text content before `meta[charset]`) or insert the tag unconditionally but it will probably be simplest to just go back to converting the non-ASCII characters to entities, just using non-deprecated function variant.	10 months ago
Jan Tojnar	f1c6297e3c	Fix discarding `html[lang]` `DOMDocument::loadHTML` will parse HTML documents as ISO-8859-1 if there is no `meta[charset]` tag. This means that UTF-8-encoded HTML fragments such as those coming from JSON-LD `articleBody` field would be parsed with incorrect encoding. In `f14428e4c0`, we tried to resolve it by putting `meta[charset]` tag at the start of the HTML fragment. Unfortunately, it turns out that causes parser to auto-insert a `html` element, losing the attributes of the original `html` tag. Let’s try to insert the `meta[charset]` tag into the proper place in the HTML document. We do not need to use the same trick with `JSLikeHTMLElement::__set`. That expects smaller HTML fragments, not `html` documents, so creating `html` and `head` elements will not be a problem. (cherry picked from commit `efbbc86df9`) Had to strip type hints since we still target PHP 5.6.	1 year ago
Jan Tojnar	235baf965c	Do not set domainRegExp for local files `parse_url($this->url, \PHP_URL_HOST)` will return `null` for local filesystem path. Casting it to `string` will produce an empty regular expression, which would match any link when computing link density. (cherry picked from commit `c7208f6ad2`) This also fixes a warning since 1.x passes the `null` directly to `preg_replace` instead of explicitly casting it to `string`.	1 year ago
Jan Tojnar	8421ed5962	Update coding style for upcoming PHP-CS-Fixer changes Once we bump minimum PHP version, we will get newer PHP-CS-Fixer, which will try to apply these cleanups. (partially cherry picked from commit `648d8c605b`) Though avoid disabling `modernize_strpos` since it was only introduced in PHP-CS-Fixer 3.2.0: `2ca22a27c4` Also had to disable `visibility_required` for constants since those require PHP ≥ 7.1: https://cs.symfony.com/doc/rules/class_notation/visibility_required.html And remove type hint from `grabArticle` since implicitly nullable types were deprecated in PHP 8.4: https://wiki.php.net/rfc/deprecate-implicitly-nullable-types But we cannot use explicitly nullable types, which require PHP ≥ 7.1: https://wiki.php.net/rfc/nullable_types Also switch code blocks to Markdown syntax to work around `phpdoc_separation`, ApiGen uses Markdown these days anyway. (partially cherry picked from commit `9ed89bde92`)	1 year ago
Jan Tojnar	6f4404030b	Do not use `mb_convert_encoding` with `HTML-ENTITIES` as target encoding This is deprecated since PHP 8.2: Deprecated: mb_convert_encoding(): Handling HTML entities via mbstring is deprecated; use htmlspecialchars, htmlentities, or mb_encode_numericentity/mb_decode_numericentity instead It was used because `DOMDocument`, which uses libxml2 internally, will parse the HTML as ISO-8859-1, unless the document contains an XML encoding declaration or HTML meta tag setting character set. Since first such element wins, putting the `meta[charset]` up front will ensure the parser uses the correct encoding, even if the document contains incorrect meta tag (e.g. when the document is converted to UTF-8 without also updating the metadata by the software passing it to Readability). https://stackoverflow.com/a/39148511/160386 (cherry picked from commit `f14428e4c0`)	1 year ago
Kevin Decherf	651e8a6bb0	Strip script and style tags through ::clean() method instead of preg_replace Huge tags can lead to a failure of preg_replace, thus erasing the whole fetched content. Fixes https://github.com/wallabag/wallabag/issues/5847 Signed-off-by: Kevin Decherf <kevin@kdecherf.com>	4 years ago
Jérémy Benoist	fabf096ce6	Fix deprecated message > Method "Psr\Log\LoggerAwareInterface::setLogger()" might add "void" as a native return type declaration in the future. Do the same in implementation "Readability\Readability" now to avoid errors or add an explicit @return annotation to suppress this message.	4 years ago
Kevin Decherf	eb72a315c4	Clean empty figure tags without ending See 'Tag omission' https://developer.mozilla.org/en-US/docs/Web/HTML/Element/figure Signed-off-by: Kevin Decherf <kevin@kdecherf.com>	4 years ago
Jeremy Benoist	ea1368fac0	Body can be wiped without tidy Re-create it in that case. Also run CS-Fixer.	5 years ago
Jan Tojnar	7cea79c23a	readability: stop tidy from wrapping noscript text HTML 4.01 Strict only allows block-level elements within noscript, form and blockquote. The `enclose-block-text` option fixes the instances when those elements contain inline elements or text by wrapping the children in paragraphs. HTML 5 has looser content model and allows noscript elements basically anywhere, including paragraphs, making the noscript elements inherit the parent element’s content model. This means that tidy will produce invalid HTML nesting paragraphs for `p > noscript > text`, a structure that would be invalid on two counts in HTML 4 Strict profile but is completely valid in HTML 5. Popular WordPress image lazy-loading code produces precisely that structure so tidy “corrects” it to invalid code. In a proper HTML parser, the produced code would force close the outer paragraph, making the noscript element its sibling instead of a child. The only reason this does not break Graby’s code for stripping the lazy-loading HTML is that libxml2 contains a bug counteracting this: https://gitlab.gnome.org/GNOME/libxml2/-/issues/205 Since all three elements allow flow content in HTML 5, it does not make much sense to enable this option any more. The only possible issues that could occur are producing HTML code not conforming to 4.01 Strict, but that was never guaranteed, as our example shows, and having blockquotes contain text nodes not wrapped in paragraphs, which might be expected by some ancient stylesheets, but that is only a minor and easily fixable visual backwards incompatibility.	5 years ago
Jeremy Benoist	6a8ecf232f	Use a new deps for HTML5 parser `electrolinux/php-html5lib` was quite old and incompatible with the upcoming Composer 2.0. Jumping to `masterminds/html5` for the same result. Also the lib is maintained. Also: - keep README in vendors - use new Scrutinizer engine - test with lower deps - remove php-coveralls dev deps and download the phar during the CI build	6 years ago
Jeremy Benoist	b1acc9ed73	Fix PHPStan (again) Also cleanup	6 years ago
Jeremy Benoist	11d2946904	Add openload.co to media detection	7 years ago
nicofrand	ff78c63e6d	Skip empty (empty innerHTML) nodes when grabbing article	7 years ago
Jeremy Benoist	bb65caf864	Fix “A non well formed numeric value encountered”	7 years ago
Simounet	2e20f76195	\bout removed from negative content	7 years ago
Jeremy Benoist	74d9cc605a	Enable PHPStan	7 years ago
Jeremy Benoist	2dce2879bf	Update fixer rules Following graby, wallabag, etc.	7 years ago
Kevin Decherf	26c881d864	tidy: use tidy_repair_string instead of tidy_parse_string+tidy_clean_repair A change released in tidy 5.6.0 breaks php-tidy when using tidy_parse_string+tidy_clean_repair and wrap=0, incorrectly wrapping every single word. Also it seems that $tidy->value should not be used to retrieve the repaired html as far as it is undocumented and for internal use. We replace the call with tidy_repair_string which directly returns the repaired string. Relates to https://github.com/htacg/tidy-html5/issues/673 Relates to https://bugs.php.net/bug.php?id=75947 Tests pass. Signed-off-by: Kevin Decherf <kevin@kdecherf.com>	7 years ago
Simounet	422c74f29c	Giphy added to allowed medias	7 years ago
Simounet	63cd304dba	Media class added to positive candidates Fix Mediapart images.	8 years ago
Kevin Decherf	4c68cc9f09	Keep elements with 'footnote' as possible candidates Should fix https://github.com/wallabag/wallabag/issues/3100 Signed-off-by: Kevin Decherf <kevin@kdecherf.com>	9 years ago
Jeremy Benoist	613a63c062	CS	9 years ago
Jeremy Benoist	05089bbd03	Add missing HTML5 class	9 years ago
Jeremy Benoist	f2a43b476c	Avoid PHP Warning This isn't the best solution but the previous one using `@` wasn't really better. Appending a string into a fragment might generate some warning if the string contains bad entity. For example `+`.	9 years ago
Jeremy Benoist	8b1c3f147d	Don't be too hard on 'links' attribute	9 years ago
Jeremy Benoist	ff754b80bd	Avoid childnode becoming null to generate a warning	9 years ago
Jeremy Benoist	d97bece7c5	Don’t be too aggressive Some links got a “tooltip-link” and shouldn’t be removed by php-readability because they are useful to the content	10 years ago
Jeremy Benoist	3de4e918b4	Convert header & section to p And take the `pre` element into account in the score	10 years ago
Jeremy Benoist	5182d6cb11	“info” is too aggressive in unlikelyCandidates Some contents have an `infocontent` node (or something different) and they are real content. Using only `info` as regex is too aggressive and removes legitimate content. Matching the whole word `info` (or `infos`) should be a better choice	10 years ago
Jeremy Benoist	2ef400bf73	Enable php-cs-fixer	10 years ago
Jeremy Benoist	00f622e9b7	Revert BC changes - avoid method signature update - revert moving logic out of the constructor	10 years ago
Jeremy Benoist	c756ec067e	Fix tests `getInnerText` might receive a null DOMElement if the xpath or query returns no element.	10 years ago
Jeremy Benoist	8ab7d76cd5	Use Monolog instead of custom solution Remove that ugly `openlog` & `syslog`	10 years ago
Jeremy Benoist	149a333b40	Remove addPreFilter Pre filters are used in the __construct so adding more pre filters once the object is instantiated is useless.	10 years ago
Jeremy Benoist	209c404d7b	Fix instanceof DOMElement We previously checked `instanceof DOMElement`, which was wrong: because we are inside the `Readability` namespace, it resolves to the class `Readability\DOMElement`, which does not exist.	10 years ago
Jeremy Benoist	2951936e00	CS & PHPDoc	10 years ago
Jeremy Benoist	850ade16b6	Cleanup	10 years ago
Jeremy Benoist	dc590542f0	Avoid adding id that might already exist We append a new node when it isn't a `div` or `p` (like when it's an `article`) with the same id, which generates a DOM error "DOMElement::setAttribute(): ID blabla already defined".	10 years ago
Jeremy Benoist	111cb08034	Improve negative element - add unlikelyCandidates: head - add negative: recommend	11 years ago
Jeremy Benoist	f71c3a4196	Do not remove html tag attributes They might contain useful information (at least language)	11 years ago
Jeremy Benoist	b77876b30a	Do not remove nofollow links Most of the time, they can be useful. At least, it'll be a link to something unrelated. But we won't lose a link inside the content. Also, adding some extra space.	11 years ago
Jeremy Benoist	175196d6c2	Avoid error with   Fix #5	11 years ago
Jeremy Benoist	2b5af601d5	Do not format output to avoid breaking apps It would require jumping to 2.0.0 and I think it's too soon	11 years ago
Jeremy Benoist	d01eb2ac1e	Use class instead of id to avoid error It generates error like `ID XXX already defined`	11 years ago
Jeremy Benoist	c5a4a490e1	CS	11 years ago
Jeremy Benoist	c67189248e	Backport changes from wallabag `e9e4ff87f8`	11 years ago
Jeremy Benoist	91b80b70e2	Update HTML5 tags From https://github.com/htacg/tidy-html5/blob/master/src/tags.c#L296	11 years ago

55 Commits (774363e18df0c00f1b640f60f46536c867ef143d)