ddliu/spider

Latest stable version: v0.2.9

Composer install command:

composer require ddliu/spider

Package description

Lightweight spider for the web.

README

A flexible spider in PHP.

Concepts

A spider contains a chain of processors called pipes. You can pass as many tasks as you like to the spider; each task goes through these pipes in order and gets processed.
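To make the idea concrete, here is a minimal stand-alone sketch of a task queue flowing through a chain of pipes. It does not use ddliu/spider and is not how the library is implemented; `runPipeline` and the `length` key are hypothetical names chosen for illustration.

```php
<?php
// Hypothetical stand-in for a spider's task loop; NOT ddliu/spider's code.
function runPipeline(array $pipes, array $tasks)
{
    $results = array();
    // Tasks are processed in FIFO order; a pipe may append new tasks to the queue.
    while ($tasks) {
        $task = array_shift($tasks);
        foreach ($pipes as $pipe) {
            // Each pipe receives the task and the queue by reference.
            $pipe($task, $tasks);
        }
        $results[] = $task;
    }
    return $results;
}

$pipes = array(
    // Pipe 1: normalize the URL by trimming whitespace.
    function (array &$task, array &$queue) {
        $task['url'] = trim($task['url']);
    },
    // Pipe 2: record the URL length as a stand-in for real processing.
    function (array &$task, array &$queue) {
        $task['length'] = strlen($task['url']);
    },
);

$results = runPipeline($pipes, array(array('url' => ' http://example.com ')));
```

Each pipe sees the changes made by the pipes before it, which is why the order in which pipes are added matters.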

Installation

composer require ddliu/spider

Requirements

  • PHP 5.3+
  • curl (for RequestPipe)

Dependencies

See composer.json.

Usage

use ddliu\spider\Spider;
use ddliu\spider\Pipe\NormalizeUrlPipe;
use ddliu\spider\Pipe\RequestPipe;
use ddliu\spider\Pipe\DomCrawlerPipe;

(new Spider())
    ->pipe(new NormalizeUrlPipe())
    ->pipe(new RequestPipe())
    ->pipe(new DomCrawlerPipe())
    ->pipe(function($spider, $task) {
        // follow every link on the page by forking a sub task for it
        $task['$dom']->filter('a')->each(function($a) use ($task) {
            $href = $a->attr('href');
            $task->fork($href);
        });
    })
    // the entry task
    ->addTask('http://example.com')
    ->run()
    ->report();

Find more examples in the examples folder.

Spider

The Spider class.

Options

  • limit: maximum number of tasks to run

Methods

  • pipe($pipe): add a pipe
  • addTask($task): add a task
  • run(): run the spider
  • report(): write report to log

Task

A task contains a data array and some helper methods.

The Task class implements the ArrayAccess interface, so you can access its data like an array.
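For readers unfamiliar with ArrayAccess, the following stand-alone sketch shows how a class can expose an internal array through array syntax. `SketchTask` is a hypothetical class written for this illustration; it is not the library's Task and omits helpers such as fork() and ignore().

```php
<?php
// Hypothetical class for illustration; not the library's Task implementation.
class SketchTask implements ArrayAccess
{
    private $data;

    public function __construct(array $data = array())
    {
        $this->data = $data;
    }

    #[\ReturnTypeWillChange]
    public function offsetExists($offset)
    {
        return isset($this->data[$offset]);
    }

    #[\ReturnTypeWillChange]
    public function offsetGet($offset)
    {
        // Return null for missing keys instead of raising a notice.
        return isset($this->data[$offset]) ? $this->data[$offset] : null;
    }

    #[\ReturnTypeWillChange]
    public function offsetSet($offset, $value)
    {
        if ($offset === null) {
            $this->data[] = $value; // supports $task[] = $value
        } else {
            $this->data[$offset] = $value;
        }
    }

    #[\ReturnTypeWillChange]
    public function offsetUnset($offset)
    {
        unset($this->data[$offset]);
    }
}

$task = new SketchTask(array('url' => 'http://example.com'));
$task['content'] = '<html></html>'; // calls offsetSet()
```

The `#[\ReturnTypeWillChange]` lines are attributes on PHP 8 and plain comments on older versions, so the sketch avoids deprecation notices without requiring typed signatures.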

Methods

  • fork($task): add a sub task to the spider
  • ignore(): ignore the task

Pipes

Pipes define how each task is processed.

A pipe can be a function:

function($spider, $task) {}

Or a class that extends BasePipe:

use ddliu\spider\Pipe\BasePipe;

class MyPipe extends BasePipe {
    public function run($spider, $task) {
        // process the task...
    }
}
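As a slightly fuller sketch of a function pipe, the closure below reads $task['content'] and stores the page title back on the task. The `title` key is a hypothetical name chosen for this example, and the regular expression is a simplification; for real pages the DomCrawlerPipe described below is likely a better fit. So that the snippet runs on its own, an ArrayObject stands in for the task and null for the spider.

```php
<?php
// A function pipe with the ($spider, $task) signature described above.
// The 'title' key is a hypothetical name chosen for this sketch.
$titlePipe = function ($spider, $task) {
    $content = isset($task['content']) ? $task['content'] : '';
    // Naive title extraction; good enough for a demonstration only.
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $content, $m)) {
        $task['title'] = trim($m[1]);
    }
};

// Stand-alone check: ArrayObject stands in for a Task, null for the spider.
$fakeTask = new ArrayObject(array(
    'url' => 'http://example.com',
    'content' => '<html><head><title> Example Domain </title></head></html>',
));
$titlePipe(null, $fakeTask);
```

In a real spider the same closure would be passed to pipe() after a pipe that fills $task['content'], such as RequestPipe.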

Useful Pipes

NormalizeUrlPipe

Normalize $task['url'].

new NormalizeUrlPipe()

RequestPipe

Start an HTTP request with $task['url'] and save the result in $task['content'].

new RequestPipe(array(
    'useragent' => 'myspider',
    'timeout' => 10
));

FileCachePipe

Cache a pipe (e.g. RequestPipe).

$requestPipe = new RequestPipe();
$cacheForReqPipe = new FileCachePipe($requestPipe, [
    'input' => 'url',
    'output' => 'content',
    'root' => '/path/to/cache/root',
]);
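The general idea of file-based caching can be sketched without the library: derive a file name from the input, return the stored value if the file exists, and otherwise compute and store it. The helper `cachedCall`, the md5 file naming, and the `.cache` suffix are assumptions made for this illustration; the source does not describe how FileCachePipe lays out its cache.

```php
<?php
// Hypothetical helper: return a cached value for $key, or compute and store it.
function cachedCall($root, $key, $compute)
{
    $file = rtrim($root, '/') . '/' . md5($key) . '.cache';
    if (is_file($file)) {
        // Cache hit: skip the expensive computation.
        return unserialize(file_get_contents($file));
    }
    $value = $compute($key);
    if (!is_dir($root)) {
        mkdir($root, 0777, true);
    }
    file_put_contents($file, serialize($value));
    return $value;
}

// Use a throwaway directory so the sketch does not touch real data.
$root = sys_get_temp_dir() . '/spider-cache-sketch-' . uniqid();
$calls = 0;
$fetch = function ($url) use (&$calls) {
    $calls++; // counts how often the "request" really runs
    return "content of $url";
};

$first = cachedCall($root, 'http://example.com', $fetch);
$second = cachedCall($root, 'http://example.com', $fetch);
```

The second call is served from disk, so the fetch closure runs only once.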

RetryPipe

Retry on failure.

$requestPipe = new RequestPipe();
$retryForReqPipe = new RetryPipe($requestPipe, [
    'count' => 10,
]);
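The retry idea itself can be shown with a small stand-alone helper. This is a sketch under assumptions: `retry` is a hypothetical function, and it treats a thrown exception as failure, whereas the source does not say how RetryPipe detects a failed pipe.

```php
<?php
// Hypothetical helper: call $operation up to $count times until it succeeds.
function retry($operation, $count)
{
    $lastError = null;
    for ($attempt = 1; $attempt <= $count; $attempt++) {
        try {
            return $operation($attempt);
        } catch (Exception $e) {
            $lastError = $e; // remember the failure and try again
        }
    }
    // All attempts failed (or $count was not positive).
    throw new RuntimeException("Failed after $count attempts", 0, $lastError);
}

$calls = 0;
$result = retry(function ($attempt) use (&$calls) {
    $calls++;
    if ($attempt < 3) {
        throw new RuntimeException('temporary failure');
    }
    return 'ok';
}, 10);
```

Here the operation fails twice and succeeds on the third attempt, so the remaining attempts are never used.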

DomCrawlerPipe

Create a DomCrawler from $task['content']. Access it as $task['$dom'] in subsequent pipes.

ReportPipe

Report every 10 minutes.

new ReportPipe(array(
    'seconds' => 600
))

Logging

$spider->logger is an instance of Monolog\Logger. You can add logging handlers to it before starting the spider:

use Monolog\Logger;
use Monolog\Handler\StreamHandler;

$spider->logger->pushHandler(new StreamHandler('path/to/your.log', Logger::WARNING));

TODO/Ideas

  • Real world examples.
  • Running tasks concurrently. (With pthread?)

Alternative

Use the golang version for better performance!

Statistics

  • Total downloads: 36
  • Monthly downloads: 0
  • Daily downloads: 0
  • Favorites: 19
  • Clicks: 2
  • Dependents: 0
  • Suggesters: 0

GitHub info

  • Stars: 19
  • Watchers: 4
  • Forks: 4
  • Language: PHP

Other information

  • License: MIT
  • Last updated: 2014-11-06
