php-readability

Commit Graph

Author	SHA1	Message	Date
Jan Tojnar	efbbc86df9	Fix discarding `html[lang]` `DOMDocument::loadHTML` will parse HTML documents as ISO-8859-1 if there is no `meta[charset]` tag. This means that UTF-8-encoded HTML fragments such as those coming from JSON-LD `articleBody` field would be parsed with incorrect encoding. In `f14428e4c0`, we tried to resolve it by putting `meta[charset]` tag at the start of the HTML fragment. Unfortunately, it turns out that causes parser to auto-insert a `html` element, losing the attributes of the original `html` tag. Let’s try to insert the `meta[charset]` tag into the proper place in the HTML document. We do not need to use the same trick with `JSLikeHTMLElement::__set`. That expects smaller HTML fragments, not `html` documents, so creating `html` and `head` elements will not be a problem.	1 year ago
Jan Tojnar	541fab34a0	tests: Remove pointless debug assignment It is unused since `8ab7d76cd5`.	1 year ago
Jan Tojnar	90869d877e	tests: Use ::class for DOMDocument class name Also capitalize it properly.	1 year ago
Jérémy Benoist	4258559b8a	Merge pull request #100 from jtojnar/phpunit-bridge7 composer: Allow phpunit-bridge 7.0	1 year ago
Jan Tojnar	1ac761d708	composer: Allow phpunit-bridge 7.0	1 year ago
Jérémy Benoist	d3053fbce4	Merge pull request #99 from jtojnar/phpstan2 phpstan: Upgrade to version 2	1 year ago
Jan Tojnar	4c929754e9	phpstan: Upgrade to version 2 https://github.com/phpstan/phpstan/blob/2.1.x/UPGRADING.md Required also bumping Rector since it uses PHPStan internally.	1 year ago
Jan Tojnar	1d7cdf3a12	phpstan: Use standard config path This allows developers to create their own config file, e.g. for setting `editorUrl`: https://phpstan.org/user-guide/output-format#opening-file-in-an-editor	1 year ago
Jérémy Benoist	f825dcf55a	Merge pull request #90 from jtojnar/foreaches Iterate node lists with foreach	1 year ago
Jan Tojnar	9a9373de4b	Iterate node lists with `foreach` `DOMNodeList` implements `Traversable`. There are some `for` loops left but we cannot simply replace those: PHP follows the DOM specification, which requires that `NodeList` objects in the DOM are live. As a result, any operation that removes a node list member node from its parent (such as `removeChild`, `replaceChild` or `appendChild`) will cause the next node in the iterator to be skipped. We could work around that by converting those node lists to static arrays using `iterator_to_array` but not sure if it is worth it.	1 year ago
Jan Tojnar	d454c3a462	Remove dead iteration code This was forgotten in `b580cf216d`.	1 year ago
Jan Tojnar	8b1ef07401	Extract `for`-iterated items into variables This simplifies the code a bit and will make it slightly easier in case we decide to switch to `foreach` iteration.	1 year ago
Jan Tojnar	5885dbbe78	Remove pointless `stdClass` `DOMNode::$childNodes` always contained `DOMNodeList`.	1 year ago
Jérémy Benoist	6947999782	Merge pull request #92 from jtojnar/ci-fix ci: Fix & add PHP 8.4	1 year ago
Jan Tojnar	da755013aa	Remove extra set_error_handler callback argument It is unused and would cause an error on PHP ≥ 8.0: https://www.php.net/manual/en/function.set-error-handler.php#refsect1-function.set-error-handler-parameters Not sure if the handler is even necessary – it was introduced in `175196d6c2` but I did not manage to reproduce the original error (Entity 'nbsp' not defined). It was probably fixed by `f2a43b476c`.	1 year ago
Jan Tojnar	5b9551d1e3	ci: Add PHP 8.4 PHP 8.4 is in beta, with final version scheduled for November so it is time to start testing it.	1 year ago
Jan Tojnar	c7b10dcc45	Avoid E_STRICT constant It will be deprecated in PHP 8.4 and it is meaningless nowadays anyway: https://wiki.php.net/rfc/deprecations_php_8_4#remove_e_strict_error_level_and_deprecate_e_strict_constant The use of the constant was introduced in `175196d6c2`.	1 year ago
Jan Tojnar	80adfe870b	Fix coding style With php-cs-fixer 3.64.0, the `native_function_invocation` rule no longer passed.	1 year ago
Jérémy Benoist	cb6b6ac577	Merge pull request #88 from jtojnar/has-single-fix	2 years ago
Jan Tojnar	677f3f096e	Fix hasSingleTagInsideElement method It would fail for e.g. `<div> <p>foo</p> </div>`. mozilla/readability uses children for the tag lookup, which return only elements. PHP does not have children property so `b580cf216d` mistakenly used `childNodes` instead, but that can return any node type. Let’s filter the children ourselves. Also add comments from mozilla/readability’s `_hasSingleTagInsideElement`.	2 years ago
Jérémy Benoist	29122763db	Merge pull request #89 from jtojnar/php74 Require PHP 7.4	2 years ago
Jan Tojnar	89d3b74259	Rectorize to PHP 7.4 Switches to short anonymous function syntax.	2 years ago
Jan Tojnar	e792644fe8	Drop PHP < 7.4 support This will allow us to use flexible heredocs in test, as well as typed properties and other goodies. https://www.php.net/releases/7_3_0.php https://www.php.net/releases/7_4_0.php	2 years ago
Jan Tojnar	648d8c605b	Update coding style for upcoming PHP-CS-Fixer changes Once we bump minimum PHP version, we will get newer PHP-CS-Fixer, which will try to apply these cleanups. Also manually tweak anonymous functions so that they are cleanly formatted once we switch to `fn` syntax.	2 years ago
Jérémy Benoist	f28191a728	Merge pull request #86 from jtojnar/ci-bump ci: Update actions	2 years ago
Jan Tojnar	2103853a1b	ci: Bump coveralls to 2.7.0 - Fixes PHP 8 support https://github.com/php-coveralls/php-coveralls/releases/tag/v2.4.3	2 years ago
Jan Tojnar	7f4c6cfcbd	ci: Update actions Mostly just because of the nodejs bump: - https://github.com/actions/checkout/releases/tag/v4.0.0 - https://github.com/ramsey/composer-install/releases/tag/3.0.0	2 years ago
Jérémy Benoist	38870cdff1	Merge pull request #80 from jtojnar/stricter Fix some CI issues	3 years ago
Jan Tojnar	9bdd3b6b2e	ci: Add PHP 8.2 and 8.3	3 years ago
Jan Tojnar	f14428e4c0	Do not use `mb_convert_encoding` with `HTML-ENTITIES` as target encoding This is deprecated since PHP 8.2: Deprecated: mb_convert_encoding(): Handling HTML entities via mbstring is deprecated; use htmlspecialchars, htmlentities, or mb_encode_numericentity/mb_decode_numericentity instead It was used because `DOMDocument`, which uses libxml2 internally, will parse the HTML as ISO-8859-1, unless the document contains an XML encoding declaration or HTML meta tag setting character set. Since first such element wins, putting the `meta[charset]` up front will ensure the parser uses the correct encoding, even if the document contains incorrect meta tag (e.g. when the document is converted to UTF-8 without also updating the metadata by the software passing it to Readability). https://stackoverflow.com/a/39148511/160386	3 years ago
Jan Tojnar	23f824a1ce	tests: Fix “THE ERROR HANDLER HAS CHANGED!”	3 years ago
Jan Tojnar	2a57124528	composer: upgrade rector	3 years ago
Jan Tojnar	0975574bdb	Rector: Upgrade configuration	3 years ago
Jan Tojnar	9ed89bde92	Fix PHP-Cs-Fixer changes 1) src/Readability.php (braces, no_unneeded_control_parentheses, single_line_comment_spacing, global_namespace_import, no_unused_imports, phpdoc_align) 2) src/JSLikeHTMLElement.php (phpdoc_separation) Switch code blocks to Markdown syntax to work around `phpdoc_separation`, ApiGen uses Markdown these days anyway.	3 years ago
Jan Tojnar	2c6c6d5987	PHPStan: Use stable PHPUnit path phpunit-bridge will create a symlink.	3 years ago
Jan Tojnar	c5407ec07c	composer: Add scripts for development	3 years ago
Jérémy Benoist	7cd8476d38	Merge pull request #79 from j0k3r/fix/psr-log-2-3 Allow `psr/log` 2.0 & 3.0	4 years ago
Jeremy Benoist	82083c872b	Allow `psr/log` 2.0 & 3.0	4 years ago
Kevin Decherf	6689f19956	Strip script and style tags through ::clean() method instead of preg_replace Huge tags can lead to a failure of preg_replace, thus erasing the whole fetched content. Fixes https://github.com/wallabag/wallabag/issues/5847 Signed-off-by: Kevin Decherf <kevin@kdecherf.com>	4 years ago
Jérémy Benoist	0c0653dad6	Merge pull request #73 from Kdecherf/fix/impr Fix `isPhrasingContent` conditions, text node replacement	4 years ago
Kevin Decherf	2ab87d7445	Fix isPhrasingContent conditions, text node replacement It also disables reverting forced paragraph elements as it can break layouts or corrupt content. Signed-off-by: Kevin Decherf <kevin@kdecherf.com>	4 years ago
Jérémy Benoist	8af69ad68c	Merge pull request #71 from j0k3r/feature/enable-rector Add Rector	4 years ago
Jeremy Benoist	c2a1639b34	Add Rector	4 years ago
Jérémy Benoist	ccf1b336c5	Merge pull request #64 from Kdecherf/improvements	4 years ago
Kevin Decherf	a44c4e5482	Add routine to remove invisible nodes Readability was previously removing (was trying to actually, see next section) invisible nodes using a pattern from `unlikelyCandidates`. This was quite hacky and was removed during a backport of logics from mozilla/readability. There is still a need to remove them so here we are. We still use a pattern but specifically against the style attribute. We also remove nodes with the attribute `hidden`. The clean feature of tidy actually replaces inline style attributes with css classes thus preventing readability to detect invisible nodes, see https://github.com/htacg/tidy-html5/blob/5.6.0/src/clean.c#L1488 We therefore set clean configuration to false. Signed-off-by: Kevin Decherf <kevin@kdecherf.com>	4 years ago
Kevin Decherf	b580cf216d	Backport some logics from mozilla/readability This change backports several things from mozilla/readability: - Add child score to all ancestors instead of the first parent only - Check 5 top candidates and try to find alternative candidates within ancestors, this can help to find a better parent and grab more content - Reduce patterns from `unlikelyCandidates` to the one used by Mozilla as ours tend to remove useful nodes - Score headers (h2 to h6) by default in addition to div, p, td and section Signed-off-by: Kevin Decherf <kevin@kdecherf.com>	4 years ago
Jérémy Benoist	2e9349f076	Merge pull request #69 from j0k3r/feature/php-7.2 Require PHP >= 7.2	4 years ago
Jeremy Benoist	c4bba53dbe	Remove Scrutinizer	4 years ago
Jeremy Benoist	66215a6c80	Require PHP >= 7.2 - remove test on Composer v1 - remove deprecated function - move `loadHtml()` into `init()` instead of `__construct` Kinda prepare 2.0 version :)	4 years ago
Jérémy Benoist	b1a20a9575	Merge pull request #68 from open-source-contributions/master Using assertSame to make assertion equal strict	4 years ago

183 Commits (efbbc86df9716a3ab1ed8a351d9e8316f3a2aab0)
