php-readability

Commit Graph

Author	SHA1	Message	Date
Jan Tojnar	8b89d70b1a	Fix character decoding regression when `title` precedes `meta[charset]` Because of PHP 8.2 deprecation, in `f14428e4c0`, we stopped converting non-ASCII characters to HTML entities. Instead, we started to explicitly insert a `meta[charset]` tag at the start of the document. Later, we discovered that was breaking `html[lang]` so, in `efbbc86df9`, we made the insertion smarter. One of the improvements was that it would not insert the `meta[charset]` tag when it was already present. That, however, broke websites that had a `title` tag before `meta[charset]`. On those, libxml2 would decode the `title` contents as ISO-8859-1. We could improve the logic (e.g. check that there is no text content before `meta[charset]`) or insert the tag unconditionally, but it will probably be simplest to just go back to converting the non-ASCII characters to entities, just using the non-deprecated function variant.	10 months ago
Jan Tojnar	3e9b15db46	tests: Check encoding was preserved in `testHtmlLang` The fix introduced in `efbbc86df9` alongside this test also manipulates `meta[charset]` but we were not checking whether it breaks encoding.	10 months ago
Jan Tojnar	1226daa8f8	Use JSLikeHTMLElement in type hints We use `DOMDocument::registerNodeClass()` to make DOM methods return `JSLikeHTMLElement` instead of `DOMElement`. Unfortunately, it is not possible for PHPStan to detect that so we need to cast it ourselves: https://github.com/phpstan/phpstan/discussions/10748 We may want to deprecate it in the future just to get rid of this mess. Also add PHPStan stubs for DOM classes so that we do not need to cast everything. It is fine to do that globally as we only ever use DOM with `JSLikeHTMLElement` registered. This patch also allows us to get rid of the assertions in tests.	1 year ago
Jan Tojnar	32267cb7b4	Add type annotations to properties To preserve BC, we are not using type hints for now.	1 year ago
Jan Tojnar	efbbc86df9	Fix discarding `html[lang]` `DOMDocument::loadHTML` will parse HTML documents as ISO-8859-1 if there is no `meta[charset]` tag. This means that UTF-8-encoded HTML fragments such as those coming from the JSON-LD `articleBody` field would be parsed with incorrect encoding. In `f14428e4c0`, we tried to resolve it by putting a `meta[charset]` tag at the start of the HTML fragment. Unfortunately, it turns out that causes the parser to auto-insert an `html` element, losing the attributes of the original `html` tag. Let’s try to insert the `meta[charset]` tag into the proper place in the HTML document. We do not need to use the same trick with `JSLikeHTMLElement::__set`. That expects smaller HTML fragments, not `html` documents, so creating `html` and `head` elements will not be a problem.	1 year ago
Jan Tojnar	541fab34a0	tests: Remove pointless debug assignment It is unused since `8ab7d76cd5`.	1 year ago
Jan Tojnar	90869d877e	tests: Use ::class for DOMDocument class name Also capitalize it properly.	1 year ago
Jan Tojnar	da755013aa	Remove extra set_error_handler callback argument It is unused and would cause an error on PHP ≥ 8.0: https://www.php.net/manual/en/function.set-error-handler.php#refsect1-function.set-error-handler-parameters Not sure if the handler is even necessary – it was introduced in `175196d6c2` but I did not manage to reproduce the original error (Entity 'nbsp' not defined). It was probably fixed by `f2a43b476c`.	1 year ago
Jan Tojnar	c7b10dcc45	Avoid E_STRICT constant It will be deprecated in PHP 8.4 and it is meaningless nowadays anyway: https://wiki.php.net/rfc/deprecations_php_8_4#remove_e_strict_error_level_and_deprecate_e_strict_constant The use of the constant was introduced in `175196d6c2`.	1 year ago
Jan Tojnar	648d8c605b	Update coding style for upcoming PHP-CS-Fixer changes Once we bump the minimum PHP version, we will get a newer PHP-CS-Fixer, which will try to apply these cleanups. Also manually tweak anonymous functions so that they are cleanly formatted once we switch to `fn` syntax.	2 years ago
Jan Tojnar	23f824a1ce	tests: Fix “THE ERROR HANDLER HAS CHANGED!”	3 years ago
Jeremy Benoist	c2a1639b34	Add Rector	4 years ago
Kevin Decherf	a44c4e5482	Add routine to remove invisible nodes Readability was previously removing (was trying to actually, see next section) invisible nodes using a pattern from `unlikelyCandidates`. This was quite hacky and was removed during a backport of logic from mozilla/readability. There is still a need to remove them so here we are. We still use a pattern but specifically against the style attribute. We also remove nodes with the attribute `hidden`. The clean feature of tidy actually replaces inline style attributes with CSS classes, thus preventing Readability from detecting invisible nodes, see https://github.com/htacg/tidy-html5/blob/5.6.0/src/clean.c#L1488 We therefore set clean configuration to false. Signed-off-by: Kevin Decherf <kevin@kdecherf.com>	4 years ago
Kevin Decherf	b580cf216d	Backport some logics from mozilla/readability This change backports several things from mozilla/readability: - Add child score to all ancestors instead of the first parent only - Check 5 top candidates and try to find alternative candidates within ancestors, this can help to find a better parent and grab more content - Reduce patterns from `unlikelyCandidates` to the one used by Mozilla as ours tend to remove useful nodes - Score headers (h2 to h6) by default in addition to div, p, td and section Signed-off-by: Kevin Decherf <kevin@kdecherf.com>	4 years ago
Jeremy Benoist	66215a6c80	Require PHP >= 7.2 - remove test on Composer v1 - remove deprecated function - move `loadHtml()` into `init()` instead of `__construct` Kinda prepare 2.0 version :)	4 years ago
peter279k	97c02e8ad4	Using assertSame to make assertion equal strict	4 years ago
Jeremy Benoist	d0af21814a	Ditch `assertContains` & `assertNotContains`	4 years ago
Jeremy Benoist	ea1368fac0	Body can be wiped without tidy Re-create it in that case. Also run CS-Fixer.	5 years ago
Jeremy Benoist	bb65caf864	Fix “A non well formed numeric value encountered”	7 years ago
Jeremy Benoist	74d9cc605a	Enable PHPStan	7 years ago
Jeremy Benoist	2dce2879bf	Update fixer rules Following graby, wallabag, etc.	7 years ago
Jeremy Benoist	9ab6d0d9e8	Updating to 7.2	7 years ago
Kevin Decherf	3a7350a8a7	tests: fix possible typo in testPostFilters() leading to failure Signed-off-by: Kevin Decherf <kevin@kdecherf.com>	9 years ago
Kevin Decherf	4c68cc9f09	Keep elements with 'footnote' as possible candidates Should fix https://github.com/wallabag/wallabag/issues/3100 Signed-off-by: Kevin Decherf <kevin@kdecherf.com>	9 years ago
Jeremy Benoist	05089bbd03	Add missing HTML5 class	9 years ago
Jeremy Benoist	85fb92a042	Fix tests	9 years ago
Jeremy Benoist	ff754b80bd	Avoid childnode becoming null to generate a warning	9 years ago
Jeremy Benoist	2ef400bf73	Enable php-cs-fixer	10 years ago
Jeremy Benoist	00f622e9b7	Revert BC changes - avoid method signature update - revert moving logic out of the constructor	10 years ago
Jeremy Benoist	8ab7d76cd5	Use Monolog instead of custom solution Remove that ugly `openlog` & `syslog`	10 years ago
Jeremy Benoist	149a333b40	Remove addPreFilter Pre filters are used in the __construct so adding more pre filters once the object is instantiated is useless.	10 years ago
Jeremy Benoist	209c404d7b	Fix instanceof DOMElement We previously checked `instanceof DOMElement`, which was wrong: since we are inside the `Readability` namespace, that resolves to the class `Readability\DOMElement`, which does not exist.	10 years ago
Jeremy Benoist	dc590542f0	Avoid adding an id that might already exist We append a new node when it isn't a `div` or `p` (like when it's an `article`) with the same id, which generates a DOM error "DOMElement::setAttribute(): ID blabla already defined".	10 years ago
Jeremy Benoist	7c30d76b6e	Ensure tests are running without Tidy	11 years ago
Jeremy Benoist	b77876b30a	Do not remove nofollow links Most of the time, they can be useful. At worst, it'll be a link to something unrelated. But we won't lose a link inside the content. Also, adding some extra space.	11 years ago
Jeremy Benoist	175196d6c2	Avoid error with `&nbsp;` Fix #5	11 years ago
Jeremy Benoist	908a49824f	Add test on title	11 years ago
Jeremy Benoist	1963319a55	Improve Travis & add Scrutinizer + CS + Update README	11 years ago
Jeremy	b81cf8d1c5	Adjust test & php compatible version	11 years ago
Jeremy	881e441bdf	Initial commit	11 years ago

42 Commits (0739bd5ed65d048589c0565e986296a87fa8c30c)