php-readability

Author	SHA1	Message	Date
Kevin Decherf	6689f19956	Strip script and style tags through ::clean() method instead of preg_replace. Huge tags can lead to a failure of preg_replace, thus erasing the whole fetched content. Fixes https://github.com/wallabag/wallabag/issues/5847 Signed-off-by: Kevin Decherf <kevin@kdecherf.com>	4 years ago
Kevin Decherf	2ab87d7445	Fix isPhrasingContent conditions, text node replacement. It also disables reverting forced paragraph elements, as it can break layouts or corrupt content. Signed-off-by: Kevin Decherf <kevin@kdecherf.com>	4 years ago
Jeremy Benoist	c2a1639b34	Add Rector	4 years ago
Kevin Decherf	a44c4e5482	Add routine to remove invisible nodes. Readability was previously removing (or rather trying to remove, see next section) invisible nodes using a pattern from `unlikelyCandidates`. This was quite hacky and was removed during a backport of logic from mozilla/readability. There is still a need to remove them, so here we are. We still use a pattern, but specifically against the style attribute. We also remove nodes with the `hidden` attribute. The clean feature of tidy actually replaces inline style attributes with CSS classes, thus preventing readability from detecting invisible nodes, see https://github.com/htacg/tidy-html5/blob/5.6.0/src/clean.c#L1488 We therefore set the clean configuration to false. Signed-off-by: Kevin Decherf <kevin@kdecherf.com>	4 years ago
Kevin Decherf	b580cf216d	Backport some logic from mozilla/readability. This change backports several things from mozilla/readability: add child score to all ancestors instead of the first parent only; check 5 top candidates and try to find alternative candidates within ancestors, which can help to find a better parent and grab more content; reduce patterns from `unlikelyCandidates` to the ones used by Mozilla, as ours tend to remove useful nodes; score headers (h2 to h6) by default in addition to div, p, td and section. Signed-off-by: Kevin Decherf <kevin@kdecherf.com>	4 years ago
Jeremy Benoist	c4bba53dbe	Remove Scrutinizer	4 years ago
Jeremy Benoist	66215a6c80	Require PHP >= 7.2: remove test on Composer v1; remove deprecated function; move `loadHtml()` into `init()` instead of `__construct`. Kinda prepare 2.0 version :)	4 years ago
Jérémy Benoist	fabf096ce6	Fix deprecation message. The message was: Method "Psr\Log\LoggerAwareInterface::setLogger()" might add "void" as a native return type declaration in the future. Do the same in implementation "Readability\Readability" now to avoid errors or add an explicit @return annotation to suppress this message.	4 years ago
Kevin Decherf	eb72a315c4	Clean empty figure tags without ending. See 'Tag omission' at https://developer.mozilla.org/en-US/docs/Web/HTML/Element/figure Signed-off-by: Kevin Decherf <kevin@kdecherf.com>	4 years ago
Jeremy Benoist	ea1368fac0	Body can be wiped without tidy. Re-create it in that case. Also run CS-Fixer.	5 years ago
Jan Tojnar	7cea79c23a	readability: stop tidy from wrapping noscript text. HTML 4.01 Strict only allows block-level elements within noscript, form and blockquote. The `enclose-block-text` option fixes the instances when those elements contain inline elements or text by wrapping the children in paragraphs. HTML 5 has a looser content model and allows noscript elements basically anywhere, including paragraphs, making the noscript elements inherit the parent element’s content model. This means that tidy will produce invalid HTML, nesting paragraphs for `p > noscript > text`, a structure that would be invalid on two counts in the HTML 4 Strict profile but is completely valid in HTML 5. Popular WordPress image lazy-loading code produces precisely that structure, so tidy “corrects” it to invalid code. In a proper HTML parser, the produced code would force close the outer paragraph, making the noscript element its sibling instead of a child. The only reason this does not break Graby’s code for stripping the lazy-loading HTML is that libxml2 contains a bug counteracting this: https://gitlab.gnome.org/GNOME/libxml2/-/issues/205 Since all three elements allow flow content in HTML 5, it does not make much sense to enable this option any more. The only possible issues are producing HTML code that does not conform to 4.01 Strict, which was never guaranteed anyway, as our example shows, and having blockquotes contain text nodes not wrapped in paragraphs, which some ancient stylesheets might expect, but that is only a minor and easily fixable visual backwards incompatibility.	5 years ago
Jeremy Benoist	6a8ecf232f	Use a new dependency for the HTML5 parser. `electrolinux/php-html5lib` was quite old and incompatible with the upcoming Composer 2.0. Jumping to `masterminds/html5` for the same result. Also the lib is maintained. Also: keep README in vendors; use new Scrutinizer engine; test with lower deps; remove php-coveralls dev deps and download the phar during the CI build	6 years ago
Jeremy Benoist	b1acc9ed73	Fix PHPStan (again). Also cleanup	6 years ago
Jeremy Benoist	11d2946904	Add openload.co to media detection	7 years ago
nicofrand	ff78c63e6d	Skip empty (empty innerHTML) nodes when grabbing article	7 years ago
Jeremy Benoist	bb65caf864	Fix “A non well formed numeric value encountered”	7 years ago
Simounet	2e20f76195	\bout removed from negative content	7 years ago
Jeremy Benoist	74d9cc605a	Enable PHPStan	7 years ago
Jeremy Benoist	2dce2879bf	Update fixer rules. Following graby, wallabag, etc.	7 years ago
Kevin Decherf	26c881d864	tidy: use tidy_repair_string instead of tidy_parse_string+tidy_clean_repair. A change released in tidy 5.6.0 breaks php-tidy when using tidy_parse_string+tidy_clean_repair and wrap=0, incorrectly wrapping every single word. Also, it seems that $tidy->value should not be used to retrieve the repaired HTML, since it is undocumented and for internal use. We replace the call with tidy_repair_string, which directly returns the repaired string. Relates to https://github.com/htacg/tidy-html5/issues/673 Relates to https://bugs.php.net/bug.php?id=75947 Tests pass. Signed-off-by: Kevin Decherf <kevin@kdecherf.com>	7 years ago
Simounet	422c74f29c	Giphy added to allowed medias	7 years ago
Simounet	63cd304dba	Media class added to positive candidates. Fix Mediapart images.	8 years ago
Kevin Decherf	4c68cc9f09	Keep elements with 'footnote' as possible candidates. Should fix https://github.com/wallabag/wallabag/issues/3100 Signed-off-by: Kevin Decherf <kevin@kdecherf.com>	9 years ago
Jeremy Benoist	613a63c062	CS	9 years ago
Jeremy Benoist	05089bbd03	Add missing HTML5 class	9 years ago
Jeremy Benoist	f2a43b476c	Avoid PHP Warning. This isn't the best solution but the previous one using `@` wasn't really better. Appending a string into a fragment might generate a warning if the string contains a bad entity. For example `+`.	9 years ago
Jeremy Benoist	8b1c3f147d	Don't be too hard on 'links' attribute	9 years ago
Jeremy Benoist	ff754b80bd	Avoid childnode becoming null to generate a warning	9 years ago
Jeremy Benoist	d97bece7c5	Don’t be too aggressive. Some links have a “tooltip-link” and shouldn’t be removed by php-readability because they are useful to the content	10 years ago
Jeremy Benoist	3de4e918b4	Convert header & section to p. Also take the `pre` element into account in the score	10 years ago
Jeremy Benoist	5182d6cb11	“info” is too aggressive in unlikelyCandidates. Some pages have an `infocontent` node (or something similar) that holds real content. Using only `info` as the regex is too aggressive and removes legitimate content. Matching the whole word `info` (or `infos`) should be a better choice	10 years ago
Jeremy Benoist	2ef400bf73	Enable php-cs-fixer	10 years ago
Jeremy Benoist	00f622e9b7	Revert BC changes: avoid method signature update; revert moving logic out of the constructor	10 years ago
Jeremy Benoist	c756ec067e	Fix tests. `getInnerText` might receive a null DOMElement if the xpath or query returns no element.	10 years ago
Jeremy Benoist	8ab7d76cd5	Use Monolog instead of custom solution. Remove that ugly `openlog` & `syslog`	10 years ago
Jeremy Benoist	149a333b40	Remove addPreFilter. Pre filters are used in the __construct, so adding more pre filters once the object is instantiated is useless.	10 years ago
Jeremy Benoist	209c404d7b	Fix instanceof DOMElement. We previously checked `instanceof DOMElement`, which was wrong: since we are inside a namespaced class, it resolved to `Readability\DOMElement`, which does not exist.	10 years ago
Jeremy Benoist	2951936e00	CS & PHPDoc	10 years ago
Jeremy Benoist	850ade16b6	Cleanup	10 years ago
Jeremy Benoist	dc590542f0	Avoid adding an id that might already exist. We append a new node when it isn't a `div` or `p` (like when it's an `article`) with the same id, which generates a DOM error "DOMElement::setAttribute(): ID blabla already defined".	10 years ago
Jeremy Benoist	111cb08034	Improve negative element. Add `head` to unlikelyCandidates and `recommend` to negative	11 years ago
Jeremy Benoist	f71c3a4196	Do not remove html tag attributes. They might contain useful information (at least language)	11 years ago
Jeremy Benoist	b77876b30a	Do not remove nofollow links. Most of the time, they can be useful. At worst, it'll be a link to something unrelated. But we won't lose a link inside the content. Also, adding some extra space.	11 years ago
Jeremy Benoist	175196d6c2	Avoid error with   Fix #5	11 years ago
Jeremy Benoist	2b5af601d5	Do not format output to avoid breaking apps. It would require a jump to 2.0.0 and I think it's too soon	11 years ago
Jeremy Benoist	d01eb2ac1e	Use class instead of id to avoid error. Using an id generates errors like `ID XXX already defined`	11 years ago
Jeremy Benoist	c5a4a490e1	CS	11 years ago
Jeremy Benoist	c67189248e	Backport changes from wallabag `e9e4ff87f8`	11 years ago
Jeremy Benoist	91b80b70e2	Update HTML5 tags. From https://github.com/htacg/tidy-html5/blob/master/src/tags.c#L296	11 years ago
Jeremy Benoist	814c6e4730	Restore compatibility with PHP 5.3	11 years ago

54 Commits (c5407ec07c9ee5fbc062cf07434e34285096f6e6)