Readability

This library is an extraction of the Readability class from a fork of full-text-rss. It can be seen as an improved version of the original php-readability.

Differences

The default php-readability lib is really old and needs to be improved. I found a great fork of full-text-rss from @Dither which improves the Readability class.

  • I've extracted the class from that fork so it can be used out of the box
  • I've added some simple tests
  • I've changed the coding style, run php-cs-fixer and added a namespace

But the code is still really hard to read and understand.

Requirements

By default, this lib will use the Tidy extension if it's available. Tidy is only used to clean up the given HTML and avoid problems caused by bad HTML structure. Composer will suggest it during installation.

If you run into problems parsing content without Tidy installed, please install it and try again.
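
To check up front whether Tidy is present, something like the following could be used. This is only a sketch: `tidyIsAvailable()` is a hypothetical helper, not part of this lib. It relies solely on PHP's built-in `extension_loaded()`.

```php
<?php

// Hypothetical helper (not provided by this lib): reports whether the
// Tidy extension is loaded in the current PHP runtime.
function tidyIsAvailable(): bool
{
    return extension_loaded('tidy');
}

$useTidy = tidyIsAvailable();

echo $useTidy
    ? "Tidy is available\n"
    : "Tidy is missing; consider installing the tidy extension\n";
```

The resulting boolean could then inform whether you rely on the default behaviour or explicitly opt out of Tidy when constructing the object.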

Usage

use Readability\Readability;

$url = 'http://www.medialens.org/index.php/alerts/alert-archive/alerts-2013/729-thatcher.html';

// you can use whatever you want to retrieve the html content (Guzzle, Buzz, cURL ...)
$html = file_get_contents($url);

$readability = new Readability($html, $url);
// or without Tidy
// $readability = new Readability($html, $url, 'libxml', false);
$result = $readability->init();

if ($result) {
    // display the title of the page
    echo $readability->getTitle()->textContent;
    // display the *readability* content
    echo $readability->getContent()->textContent;
} else {
    echo 'Looks like we couldn\'t find the content. :(';
}
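
The example above prints plain text through textContent. If you would rather keep the markup, one possible approach is to serialize the node with PHP's standard DOM API. The sketch below assumes that getContent() returns a regular DOM node attached to a DOMDocument (the textContent access above suggests so, but this is not confirmed here). `nodeToHtml()` is a hypothetical helper, and a hand-built DOMDocument stands in for the Readability result so the snippet runs on its own.

```php
<?php

// Hypothetical helper (not part of this lib): serializes a DOM node,
// including its own tag, back to an HTML string.
function nodeToHtml(DOMNode $node): string
{
    $html = $node->ownerDocument->saveHTML($node);

    // saveHTML() returns false on failure; normalize that to an empty string.
    return $html === false ? '' : $html;
}

// Stand-in for the node that would come from $readability->getContent().
$doc = new DOMDocument();
$doc->loadHTML('<html><body><div><p>Hello <b>world</b></p></div></body></html>');
$content = $doc->getElementsByTagName('div')->item(0);

echo nodeToHtml($content), "\n";
```

With a real Readability instance, the call would presumably look like `nodeToHtml($readability->getContent())`, under the assumption stated above.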

To debug it or check what's going on, you can inject a logger. It must implement Psr\Log\LoggerInterface; Monolog is one example:

use Readability\Readability;
use Monolog\Logger;
use Monolog\Handler\StreamHandler;

$url = 'http://www.medialens.org/index.php/alerts/alert-archive/alerts-2013/729-thatcher.html';
$html = file_get_contents($url);

$logger = new Logger('readability');
$logger->pushHandler(new StreamHandler('path/to/your.log', Logger::DEBUG));

$readability = new Readability($html, $url);
// attach the logger before parsing so the run is logged
$readability->setLogger($logger);
$result = $readability->init();