Project author: REBELinBLUE

Project description:
An HTTP crawler with a fluent interface
Language: PHP
Repository: git://github.com/REBELinBLUE/fluent-crawler.git
Created: 2017-03-21T14:43:55Z
Project page: https://github.com/REBELinBLUE/fluent-crawler

License: MIT License


Fluent Web Crawler

A web scraping library for PHP with a nice fluent interface.

A fork of laravel/browser-kit-testing, repurposed for use with real HTTP requests.

Developed for a project I worked on at Sainsbury’s.

Requirements

PHP 7.1+ and Goutte 3.1+

Installation

The recommended way to install the library is through Composer.

Run the following command to add rebelinblue/fluent-web-crawler as a dependency in your composer.json file:

    composer require rebelinblue/fluent-web-crawler

Usage

Create an instance of the Crawler

    use REBELinBLUE\Crawler;

    $crawler = new Crawler();

Visit a URL

    $crawler->visit('http://www.example.com');

Interact with the page

    $crawler->type('username', 'admin')
            ->type('password', 'password')
            ->press('Login');

    // This can also be written as the following
    $crawler->submitForm('Login', [
        'username' => 'admin',
        'password' => 'password',
    ]);

Check the response is as expected

    if ($crawler->dontSeeText('Hello World')) {
        throw new \Exception('The page does not contain the expected text');
    }

For a full list of the available actions see api.md.

Customising the HTTP client settings

If you wish to customise the instance of Goutte which is used (or, more likely, the instance of Guzzle), you can
inject your own instance when constructing the class. For example, you may want to increase Guzzle’s timeout:

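    use Goutte\Client as GoutteClient;
    use GuzzleHttp\Client as GuzzleClient;
    use REBELinBLUE\Crawler;

    // Configure Guzzle with a 60-second timeout
    $guzzleClient = new GuzzleClient([
        'timeout' => 60,
    ]);

    // Hand the configured Guzzle client to Goutte
    $goutteClient = new GoutteClient();
    $goutteClient->setClient($guzzleClient);

    // Inject the customised Goutte client into the crawler
    $crawler = new Crawler($goutteClient);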
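
Other Guzzle request options can be passed in the same constructor array. The following is a minimal sketch, assuming Guzzle 6 (the version Goutte 3.x depends on) and a Composer autoloader at vendor/autoload.php in the project root. The option values are illustrative only and are not recommendations from the library; timeout, connect_timeout and verify are standard Guzzle request options.

```php
<?php

// Assumption: the script sits in the project root, next to Composer's vendor directory.
require __DIR__ . '/vendor/autoload.php';

use Goutte\Client as GoutteClient;
use GuzzleHttp\Client as GuzzleClient;
use REBELinBLUE\Crawler;

// Illustrative values; adjust to suit the site being crawled.
$guzzleClient = new GuzzleClient([
    'timeout'         => 60,   // maximum seconds to wait for the whole response
    'connect_timeout' => 10,   // maximum seconds to wait while connecting
    'verify'          => true, // verify TLS certificates (Guzzle's default)
]);

// Goutte wraps the Guzzle client, and the crawler wraps Goutte.
$goutteClient = new GoutteClient();
$goutteClient->setClient($guzzleClient);

$crawler = new Crawler($goutteClient);
$crawler->visit('http://www.example.com');
```

Because these are client-level defaults, they should apply to every request the crawler makes through this Goutte instance.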

Further Reading

Fluent Crawler is a wrapper around the following PHP libraries.