octopoda/octopus

Latest stable version: 0.11.1

Composer install command:

composer require octopoda/octopus

Package description

PHP Sitemap crawler

README

A small PHP tool that crawls the collection of URLs in a sitemap, using the ReactPHP library to load the URLs asynchronously. Both plain-text files and XML sitemaps are supported.
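To illustrate what supporting both input formats involves, here is a minimal sketch that extracts URLs from either kind of content. It is not Octopus's own loader: the function name extractSitemapUrls is hypothetical, and the sketch assumes that XML content starts with "<" and that a plain-text file lists one URL per line.

```php
<?php
// Hypothetical helper, not part of Octopus: returns the URLs found in
// sitemap content that is either XML or plain text (one URL per line).
function extractSitemapUrls(string $content): array
{
    $content = trim($content);
    if ($content === '') {
        return [];
    }

    // Assumption: XML content starts with a declaration or a root element.
    if ($content[0] === '<') {
        $document = new DOMDocument();
        $previous = libxml_use_internal_errors(true); // silence parser warnings
        $loaded = $document->loadXML($content);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);
        if (!$loaded) {
            return [];
        }

        $urls = [];
        // Every <url> entry (or <sitemap> entry in an index) stores its address in <loc>.
        foreach ($document->getElementsByTagName('loc') as $node) {
            $urls[] = trim($node->textContent);
        }
        return $urls;
    }

    // Plain text: split on any line break, trim, and drop empty lines.
    $lines = array_map('trim', preg_split('/\R/', $content));
    return array_values(array_filter($lines, function ($line) {
        return $line !== '';
    }));
}
```

Calling extractSitemapUrls(file_get_contents('sitemap.txt')) on a local file would return a flat array of URLs in either case.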

Usage from the Command Line Interface (CLI)

Crawl the URLs in a Sitemap with verbose logging (-vvv).

php application.php http://www.domain.ext/sitemap.xml -vvv

Use 15 concurrent connections instead of the default 5:

php application.php http://www.domain.ext/sitemap.xml --concurrency 15 -vvv

Use an HTTP GET request instead of the default HTTP HEAD. Note that HEAD requests transfer less data because the response carries no body:

php application.php http://www.domain.ext/sitemap.xml --requestType GET -vvv

Use a timeout of 3 seconds instead of the default 10 seconds:

php application.php http://www.domain.ext/sitemap.xml --timeout 3 -vvv

Use a specific User-Agent instead of the default Octopus/1.0, for example to simulate a search engine crawling the sitemap:

php application.php http://www.domain.ext/sitemap.xml --userAgent 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)' -vvv

Use the TablePresenter to display intermediate results instead of the default EchoPresenter:

php application.php http://www.domain.ext/sitemap.xml --presenter Octopus\\Presenter\\TablePresenter -vvv

Usage from your own application

You can easily integrate sitemap crawling into your own application. See the Config class for all available configuration options. If required, you can inject a PSR-3 logger for logging purposes.

use Octopus\Config;
use Octopus\Processor;

$config = new Config();
$config->concurrency = 2;
$config->targetFile = 'https://www.domain.ext/sitemap.xml';
$config->additionalResponseHeadersToCount = array(
    'CF-Cache-Status', // Useful to check the Cloudflare edge server cache status
);
$config->requestHeaders = array(
    'User-Agent' => 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)', // Simulate Google's web crawler
);
$processor = new Processor($config, $this->logger); // A PSR-3 logger can be injected if required
$processor->run();

$this->logger->info('Statistics: ' . print_r($processor->result->getStatusCodes(), true));
$this->logger->info('Applied concurrency: ' . $config->concurrency);
$this->logger->info('Total amount of processed data: ' . $processor->result->getTotalData());
$this->logger->info('Failed to load #URLs: ' . count($processor->result->getBrokenUrls()));
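The status code statistics logged above can be condensed further. The following is a rough sketch only: summarizeStatusCodes is a hypothetical helper, and it assumes that getStatusCodes() returns an array mapping each HTTP status code to the number of responses with that code, which the README does not confirm.

```php
<?php
// Hypothetical helper, not part of Octopus. Assumes input shaped like
// array(200 => 8, 404 => 1): HTTP status code => number of responses.
function summarizeStatusCodes(array $statusCodes): array
{
    $summary = ['total' => 0, 'success' => 0, 'redirect' => 0, 'error' => 0];

    foreach ($statusCodes as $code => $count) {
        $summary['total'] += $count;
        if ($code >= 200 && $code < 300) {
            $summary['success'] += $count;   // 2xx responses
        } elseif ($code >= 300 && $code < 400) {
            $summary['redirect'] += $count;  // 3xx responses
        } else {
            $summary['error'] += $count;     // everything else
        }
    }

    return $summary;
}
```

If the assumption about the array shape holds, the result of summarizeStatusCodes($processor->result->getStatusCodes()) could be passed to print_r() and logged like the other statistics.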

Limitations

Currently, Octopus is mainly an experimental / educational tool. Advanced use cases in HTTP response handling might not be supported.

Tests

To run the test suite, you first need to clone this repository and then install all dependencies using Composer:

$ composer install

Then, from the project root, run:

$ make test

Statistics

  • Total downloads: 4.84k
  • Monthly downloads: 0
  • Daily downloads: 0
  • Favorites: 11
  • Clicks: 1
  • Dependents: 0
  • Suggesters: 0

GitHub information

  • Stars: 11
  • Watchers: 2
  • Forks: 1
  • Language: PHP

Other information

  • License: MIT
  • Last updated: 2017-12-06
