php-readability

Commit Graph

Author	SHA1	Message	Date
Kevin Decherf	3ff7bcccc1	Merge `41ef59212f` into `3042990efc`	9 months ago
Jérémy Benoist	3042990efc	Merge pull request #106 from jtojnar/encode Fix character decoding regression when `title` precedes `meta[charset]`	10 months ago
Jan Tojnar	8b89d70b1a	Fix character decoding regression when `title` precedes `meta[charset]` Because of PHP 8.2 deprecation, in `f14428e4c0`, we stopped converting non-ASCII characters to HTML entities. Instead, we started to explicitly insert `meta[charset]` tag at the start of the document. Later, we discovered that was breaking `html[lang]` so, in `efbbc86df9`, we made the insertion smarter. One of the improvements was that it would not insert the `meta[charset]` tag when it was already present. That, however, broke websites that had `title` tag before `meta[charset]`. On those, libxml2 would decode the `title` contents as ISO-8859-1. We could improve the logic (e.g. check that there is no text content before `meta[charset]`) or insert the tag unconditionally but it will probably be simplest to just go back to converting the non-ASCII characters to entities, just using the non-deprecated function variant.	10 months ago
Jan Tojnar	3e9b15db46	tests: Check encoding was preserved in `testHtmlLang` The fix introduced in `efbbc86df9` alongside this test also manipulates `meta[charset]` but we were not checking if it does not break encoding.	10 months ago
Jérémy Benoist	7413a38ff0	Merge pull request #104 from jtojnar/html-shadowing Fix discarding `html[lang]`	1 year ago
Jérémy Benoist	a18cd0f2a9	Merge pull request #102 from jtojnar/local-no-domain Do not set domainRegExp for local files	1 year ago
Jan Tojnar	efbbc86df9	Fix discarding `html[lang]` `DOMDocument::loadHTML` will parse HTML documents as ISO-8859-1 if there is no `meta[charset]` tag. This means that UTF-8-encoded HTML fragments such as those coming from JSON-LD `articleBody` field would be parsed with incorrect encoding. In `f14428e4c0`, we tried to resolve it by putting `meta[charset]` tag at the start of the HTML fragment. Unfortunately, it turns out that causes the parser to auto-insert an `html` element, losing the attributes of the original `html` tag. Let’s try to insert the `meta[charset]` tag into the proper place in the HTML document. We do not need to use the same trick with `JSLikeHTMLElement::__set`. That expects smaller HTML fragments, not `html` documents, so creating `html` and `head` elements will not be a problem.	1 year ago
Jan Tojnar	541fab34a0	tests: Remove pointless debug assignment It is unused since `8ab7d76cd5`.	1 year ago
Jan Tojnar	90869d877e	tests: Use ::class for DOMDocument class name Also capitalize it properly.	1 year ago
Jan Tojnar	c7208f6ad2	Do not set domainRegExp for local files `parse_url($this->url, \PHP_URL_HOST)` will return `null` for local filesystem path. Casting it to `string` will produce an empty regular expression, which would match any link when computing link density.	1 year ago
Jérémy Benoist	4258559b8a	Merge pull request #100 from jtojnar/phpunit-bridge7 composer: Allow phpunit-bridge 7.0	1 year ago
Jan Tojnar	1ac761d708	composer: Allow phpunit-bridge 7.0	1 year ago
Jérémy Benoist	d3053fbce4	Merge pull request #99 from jtojnar/phpstan2 phpstan: Upgrade to version 2	1 year ago
Jan Tojnar	4c929754e9	phpstan: Upgrade to version 2 https://github.com/phpstan/phpstan/blob/2.1.x/UPGRADING.md Required also bumping Rector since it uses PHPStan internally.	1 year ago
Jan Tojnar	1d7cdf3a12	phpstan: Use standard config path This allows developers to create their own config file, e.g. for setting `editorUrl`: https://phpstan.org/user-guide/output-format#opening-file-in-an-editor	1 year ago
Jérémy Benoist	f825dcf55a	Merge pull request #90 from jtojnar/foreaches Iterate node lists with foreach	1 year ago
Jan Tojnar	9a9373de4b	Iterate node lists with `foreach` `DOMNodeList` implements `Traversable`. There are some `for` loops left but we cannot simply replace those: PHP follows the DOM specification, which requires that `NodeList` objects in the DOM are live. As a result, any operation that removes a node list member node from its parent (such as `removeChild`, `replaceChild` or `appendChild`) will cause the next node in the iterator to be skipped. We could work around that by converting those node lists to static arrays using `iterator_to_array` but not sure if it is worth it.	1 year ago
Jan Tojnar	d454c3a462	Remove dead iteration code This was forgotten in `b580cf216d`.	1 year ago
Jan Tojnar	8b1ef07401	Extract `for`-iterated items into variables This simplifies the code a bit and will make it slightly easier in case we decide to switch to `foreach` iteration.	1 year ago
Jan Tojnar	5885dbbe78	Remove pointless `stdClass` `DOMNode::$childNodes` always contained `DOMNodeList`.	1 year ago
Jérémy Benoist	6947999782	Merge pull request #92 from jtojnar/ci-fix ci: Fix & add PHP 8.4	1 year ago
Jan Tojnar	da755013aa	Remove extra set_error_handler callback argument It is unused and would cause an error on PHP ≥ 8.0: https://www.php.net/manual/en/function.set-error-handler.php#refsect1-function.set-error-handler-parameters Not sure if the handler is even necessary – it was introduced in `175196d6c2` but I did not manage to reproduce the original error (Entity 'nbsp' not defined). It was probably fixed by `f2a43b476c`.	1 year ago
Jan Tojnar	5b9551d1e3	ci: Add PHP 8.4 PHP 8.4 is in beta, with final version scheduled for November so it is time to start testing it.	1 year ago
Jan Tojnar	c7b10dcc45	Avoid E_STRICT constant It will be deprecated in PHP 8.4 and it is meaningless nowadays anyway: https://wiki.php.net/rfc/deprecations_php_8_4#remove_e_strict_error_level_and_deprecate_e_strict_constant The use of the constant was introduced in `175196d6c2`.	1 year ago
Jan Tojnar	80adfe870b	Fix coding style With php-cs-fixer 3.64.0, the `native_function_invocation` rule no longer passed.	1 year ago
Jérémy Benoist	cb6b6ac577	Merge pull request #88 from jtojnar/has-single-fix	2 years ago
Jan Tojnar	677f3f096e	Fix hasSingleTagInsideElement method It would fail for e.g. `<div> <p>foo</p> </div>`. mozilla/readability uses children for the tag lookup, which return only elements. PHP does not have children property so `b580cf216d` mistakenly used `childNodes` instead, but that can return any node type. Let’s filter the children ourselves. Also add comments from mozilla/readability’s `_hasSingleTagInsideElement`.	2 years ago
Jérémy Benoist	29122763db	Merge pull request #89 from jtojnar/php74 Require PHP 7.4	2 years ago
Jan Tojnar	89d3b74259	Rectorize to PHP 7.4 Switches to short anonymous function syntax.	2 years ago
Jan Tojnar	e792644fe8	Drop PHP < 7.4 support This will allow us to use flexible heredocs in tests, as well as typed properties and other goodies. https://www.php.net/releases/7_3_0.php https://www.php.net/releases/7_4_0.php	2 years ago
Jan Tojnar	648d8c605b	Update coding style for upcoming PHP-CS-Fixer changes Once we bump minimum PHP version, we will get newer PHP-CS-Fixer, which will try to apply these cleanups. Also manually tweak anonymous functions so that they are cleanly formatted once we switch to `fn` syntax.	2 years ago
Jérémy Benoist	f28191a728	Merge pull request #86 from jtojnar/ci-bump ci: Update actions	2 years ago
Jan Tojnar	2103853a1b	ci: Bump coveralls to 2.7.0 - Fixes PHP 8 support https://github.com/php-coveralls/php-coveralls/releases/tag/v2.4.3	2 years ago
Jan Tojnar	7f4c6cfcbd	ci: Update actions Mostly just because of nodejs bump: - https://github.com/actions/checkout/releases/tag/v4.0.0 - https://github.com/ramsey/composer-install/releases/tag/3.0.0	2 years ago
Jérémy Benoist	38870cdff1	Merge pull request #80 from jtojnar/stricter Fix some CI issues	3 years ago
Jan Tojnar	9bdd3b6b2e	ci: Add PHP 8.2 and 8.3	3 years ago
Jan Tojnar	f14428e4c0	Do not use `mb_convert_encoding` with `HTML-ENTITIES` as target encoding This is deprecated since PHP 8.2: Deprecated: mb_convert_encoding(): Handling HTML entities via mbstring is deprecated; use htmlspecialchars, htmlentities, or mb_encode_numericentity/mb_decode_numericentity instead It was used because `DOMDocument`, which uses libxml2 internally, will parse the HTML as ISO-8859-1, unless the document contains an XML encoding declaration or HTML meta tag setting character set. Since first such element wins, putting the `meta[charset]` up front will ensure the parser uses the correct encoding, even if the document contains incorrect meta tag (e.g. when the document is converted to UTF-8 without also updating the metadata by the software passing it to Readability). https://stackoverflow.com/a/39148511/160386	3 years ago
Jan Tojnar	23f824a1ce	tests: Fix “THE ERROR HANDLER HAS CHANGED!”	3 years ago
Jan Tojnar	2a57124528	composer: upgrade rector	3 years ago
Jan Tojnar	0975574bdb	Rector: Upgrade configuration	3 years ago
Jan Tojnar	9ed89bde92	Fix PHP-Cs-Fixer changes 1) src/Readability.php (braces, no_unneeded_control_parentheses, single_line_comment_spacing, global_namespace_import, no_unused_imports, phpdoc_align) 2) src/JSLikeHTMLElement.php (phpdoc_separation) Switch code blocks to Markdown syntax to work around `phpdoc_separation`, ApiGen uses Markdown these days anyway.	3 years ago
Jan Tojnar	2c6c6d5987	PHPStan: Use stable PHPUnit path phpunit-bridge will create a symlink.	3 years ago
Jan Tojnar	c5407ec07c	composer: Add scripts for development	3 years ago
Jérémy Benoist	7cd8476d38	Merge pull request #79 from j0k3r/fix/psr-log-2-3 Allow `psr/log` 2.0 & 3.0	4 years ago
Jeremy Benoist	82083c872b	Allow `psr/log` 2.0 & 3.0	4 years ago
Kevin Decherf	41ef59212f	Keep h1 and other headings Even though using h1 tags for sections inside an article is semantically wrong, a lot of websites are doing it anyway. So the idea here is to stop stripping headings, including h1 on Readability's side. Fixes wallabag/wallabag#5805 Signed-off-by: Kevin Decherf <kevin@kdecherf.com>	4 years ago
Kevin Decherf	6689f19956	Strip script and style tags through ::clean() method instead of preg_replace Huge tags can lead to a failure of preg_replace, thus erasing the whole fetched content. Fixes https://github.com/wallabag/wallabag/issues/5847 Signed-off-by: Kevin Decherf <kevin@kdecherf.com>	4 years ago
Jérémy Benoist	0c0653dad6	Merge pull request #73 from Kdecherf/fix/impr Fix `isPhrasingContent` conditions, text node replacement	4 years ago
Kevin Decherf	2ab87d7445	Fix isPhrasingContent conditions, text node replacement It also disables reverting forced paragraph elements as it can break layouts or corrupt content. Signed-off-by: Kevin Decherf <kevin@kdecherf.com>	4 years ago
Jérémy Benoist	8af69ad68c	Merge pull request #71 from j0k3r/feature/enable-rector Add Rector	4 years ago
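
Several commits above (`f14428e4c0`, `efbbc86df9`, `8b89d70b1a`) revolve around the same problem: `DOMDocument::loadHTML` parses input as ISO-8859-1 unless it sees a charset declaration first. The following is a minimal sketch of the approach `8b89d70b1a` describes, converting non-ASCII characters to entities with a non-deprecated function. It assumes that function is `mb_encode_numericentity` (the commit does not name it), and the helper name `encodeNonAscii` and the sample HTML are hypothetical, not taken from the repository. It requires the `mbstring` and `dom` extensions.

```php
<?php
// Hypothetical helper, not the project's code: replaces every non-ASCII
// character with a numeric character reference so the parser only ever
// sees ASCII bytes, whatever encoding it guesses.
function encodeNonAscii(string $html): string
{
    // Map format: [first code point, last code point, offset, mask].
    return mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');
}

// `title` precedes `meta[charset]`, the layout described in 8b89d70b1a.
$html = '<html><head><title>Café</title><meta charset="utf-8"></head>'
    . '<body><p>Žluťoučký</p></body></html>';

// Collect libxml2 parser warnings internally instead of emitting them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML(encodeNonAscii($html));

// DOM strings are exposed as UTF-8, so the entity is decoded back.
$title = $doc->getElementsByTagName('title')->item(0)->textContent;
```

Because the encoded input is pure ASCII, the result does not depend on where, or whether, a `meta[charset]` tag appears in the document.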


191 Commits (3ff7bcccc1dc73f1c3f9261c8ed75359ce8bfdc6)