johndetomal/browser-channel

Composer install command:

composer require johndetomal/browser-channel

Package description

PHP library for Browser-based scraping

README documentation


🎯 Purpose & Practical Use

This project was created primarily as a learning exercise to understand how scalable web content retrieval systems are designed.

It explores real-world concepts such as:

  • Multi-driver request handling (HTTP, cURL, headless browser)
  • Proxy-aware routing systems
  • Driver pooling and lifecycle management
  • Fallback execution strategies
  • Basic response classification and reliability handling

Although the project is not a production scraping product, it is structured so that it can serve as a foundation for more advanced scraping or automation systems.

💡 Why This Project

Many scraping solutions break down when websites render content dynamically or sit behind anti-bot protection.

This engine is designed to improve reliability by combining multiple strategies:

  • Lightweight requests for speed
  • Browser automation for complex pages
  • Intelligent fallback between methods

The goal is simple: maximize success rate while keeping performance efficient.

💼 Real-World Value

This project can serve as the foundation for:

  • Data collection systems
  • Monitoring tools
  • Automation pipelines
  • Custom scraping APIs

It focuses on reliable data acquisition, the most critical layer in any scraping workflow.

🚀 Browser Scraper Engine

A scalable, driver-based web scraping engine designed to reliably retrieve web page content using multiple strategies such as HTTP, cURL, and headless browser automation.

It handles dynamic websites, fallback on failure, proxy rotation, and caching within a structured, extensible architecture.

✨ Key Features

  • 🧠 Multi-driver system (Curl / HTTP / Browser automation)
  • 🔁 Automatic fallback between scraping strategies
  • 🌐 Proxy rotation support (improves success rate)
  • ⚡ File-based caching system
  • 🧩 Modular architecture (easy to extend)
  • 📊 Driver-level success/failure tracking
  • 🧪 Debug mode for monitoring
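
The driver-level success/failure tracking listed above could be sketched roughly as follows. This is a minimal illustration, not the package's implementation: the `DriverStats` class and its methods are hypothetical names introduced here.

```php
<?php
// Hypothetical tracker: class and method names are illustrative only
// and are not part of the package's API.
final class DriverStats
{
    /** @var array<string, array{success: int, failure: int}> */
    private array $counts = [];

    /** Record the outcome of one attempt for the given driver. */
    public function record(string $driver, bool $succeeded): void
    {
        $this->counts[$driver] ??= ['success' => 0, 'failure' => 0];
        $this->counts[$driver][$succeeded ? 'success' : 'failure']++;
    }

    /** Success ratio in [0, 1], or null when the driver has no attempts. */
    public function successRate(string $driver): ?float
    {
        $c = $this->counts[$driver] ?? null;
        if ($c === null) {
            return null;
        }
        return $c['success'] / ($c['success'] + $c['failure']);
    }
}

$stats = new DriverStats();
$stats->record('curl', true);
$stats->record('curl', false);
$stats->record('browser', true);
```

A ratio like this could feed a decision about which driver to try first, although the source does not say whether the engine uses its counters that way.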

🏗 Architecture Overview

The system uses a driver-based approach to fetch web pages.

If one method fails, the system automatically retries using alternative drivers.

This improves reliability across different website types and protection levels.

📦 Installation

1. Install PHP dependencies

composer require johndetomal/browser-channel

2. Install Node.js dependencies (for browser driver)

cd node
npm install

🚀 Running the Browser Engine

Start the Puppeteer service:

node server.js

Default endpoint: http://localhost:3000

⚙️ Basic Usage (Quick Start)

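require __DIR__ . '/vendor/autoload.php'; // Composer autoloader

use Browser\Services\Browser\BrowserService;
use Browser\Services\Browser\Enum\BrowserDriver;

// Use the cURL driver, which works without the Node.js service.
$scraper = new BrowserService([
    'settings' => [
        'driver' => BrowserDriver::Curl,
    ],
]);

// Fetch the page; the result is an array (see Response Format).
$response = $scraper->openPage("https://example.com");

echo $response['content'];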

📊 Response Format

[
    'content' => '<html>...</html>',
    'status_code' => 200,
    'retries' => 1,
    'process_start_time' => 458252,
    'process_end_time' => 458828252,
    'message' => 'success',
    'reason' => $reason,
    'driver' => $driverType,
]
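
A caller will usually want to check the result before using `content`. The helpers below are a minimal sketch of that, assuming the keys shown above. Both function names are hypothetical, and the sample stores `driver` as a plain string, which may not match the value the library actually returns.

```php
<?php
// Sketch only: these helpers are hypothetical and not part of the package.

/** True when the response carries a 2xx status and non-empty content. */
function isSuccessfulResponse(array $response): bool
{
    $status = (int) ($response['status_code'] ?? 0);
    return $status >= 200 && $status < 300 && ($response['content'] ?? '') !== '';
}

/** One-line summary for logs. Accepts a string or an enum case as the driver. */
function describeResponse(array $response): string
{
    $driver = $response['driver'] ?? 'unknown';
    if ($driver instanceof UnitEnum) {
        $driver = $driver->name;
    }
    return sprintf(
        'driver=%s status=%d retries=%d message=%s',
        (string) $driver,
        (int) ($response['status_code'] ?? 0),
        (int) ($response['retries'] ?? 0),
        (string) ($response['message'] ?? '')
    );
}

$response = [
    'content' => '<html>...</html>',
    'status_code' => 200,
    'retries' => 1,
    'message' => 'success',
    'driver' => 'curl', // assumed to be a plain string in this sample
];

echo describeResponse($response), PHP_EOL;
// prints "driver=curl status=200 retries=1 message=success"
```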

💾 Caching System

'settings' => [
    'cache' => true,
]
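
The README does not describe how the file-based cache stores pages. As an illustration of the general technique only, a file cache keyed by a hash of the URL might look like this. The `FileCache` class, the one-hour TTL, and the demo directory name are all assumptions made for the sketch.

```php
<?php
// Hypothetical file cache: the package's real cache layout is not documented here.
final class FileCache
{
    public function __construct(private string $dir, private int $ttlSeconds = 3600)
    {
        if (!is_dir($dir) && !mkdir($dir, 0775, true) && !is_dir($dir)) {
            throw new RuntimeException("Cannot create cache directory: $dir");
        }
    }

    /** One file per URL, named after the SHA-1 hash of the URL. */
    private function path(string $url): string
    {
        return $this->dir . DIRECTORY_SEPARATOR . sha1($url) . '.html';
    }

    /** Cached content, or null when the entry is missing or older than the TTL. */
    public function get(string $url): ?string
    {
        $path = $this->path($url);
        if (!is_file($path) || time() - filemtime($path) > $this->ttlSeconds) {
            return null;
        }
        $content = file_get_contents($path);
        return $content === false ? null : $content;
    }

    public function put(string $url, string $content): void
    {
        file_put_contents($this->path($url), $content, LOCK_EX);
    }
}

$cache = new FileCache(sys_get_temp_dir() . '/browser-channel-cache-demo');
$url = 'https://example.com';

$html = $cache->get($url);
if ($html === null) {
    $html = '<html>...</html>'; // stand-in for $scraper->openPage($url)['content']
    $cache->put($url, $html);
}
```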

🧪 Debug Mode

$this->isDebugMode = true;

Debug output includes:

  • Driver used
  • Proxy used
  • Request status
  • Success/failure tracking
  • Response message

🌐 Proxy Configuration

$scraper->proxies([
    ['ip' => '127.0.0.1', 'port' => '8080'],
    ['ip' => '127.0.0.2', 'port' => '8080'],
]);
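
How the engine chooses among the configured proxies is not documented here. One common approach is round-robin rotation, sketched below using the same `ip`/`port` array shape. `ProxyRotator` is a hypothetical class, not part of the package.

```php
<?php
// Hypothetical round-robin rotator; the engine's own rotation logic may differ.
final class ProxyRotator
{
    private int $index = 0;

    /** @param list<array{ip: string, port: string}> $proxies */
    public function __construct(private array $proxies)
    {
        if ($proxies === []) {
            throw new InvalidArgumentException('At least one proxy is required.');
        }
        $this->proxies = array_values($proxies);
    }

    /** Return the next proxy as "ip:port", wrapping around at the end of the list. */
    public function next(): string
    {
        $proxy = $this->proxies[$this->index];
        $this->index = ($this->index + 1) % count($this->proxies);
        return $proxy['ip'] . ':' . $proxy['port'];
    }
}

$rotator = new ProxyRotator([
    ['ip' => '127.0.0.1', 'port' => '8080'],
    ['ip' => '127.0.0.2', 'port' => '8080'],
]);

$first = $rotator->next();  // "127.0.0.1:8080"
$second = $rotator->next(); // "127.0.0.2:8080"
$third = $rotator->next();  // wraps around to "127.0.0.1:8080"
```

With PHP's cURL extension, a value like `$first` can be passed to `curl_setopt($ch, CURLOPT_PROXY, ...)`.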

🔁 Fallback System

If the primary method fails, the engine automatically falls back through the available drivers:

  • Primary configured driver
  • HTTP driver
  • cURL driver
  • Browser automation driver

This improves reliability across different website structures.
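
The fallback idea can be sketched as a loop over an ordered list of drivers that returns the first successful response. This is an illustration under stated assumptions, not the engine's code: `fetchWithFallback` is a hypothetical function, the drivers are stub closures, and "success" is assumed to mean a 2xx status code.

```php
<?php
/**
 * Try each driver in order and return the first 2xx response.
 * Hypothetical helper: the engine's real fallback logic is not shown in the README.
 *
 * @param array<string, callable(string): array> $drivers driver name => fetcher
 */
function fetchWithFallback(string $url, array $drivers): array
{
    $errors = [];
    foreach ($drivers as $name => $fetch) {
        try {
            $response = $fetch($url);
            $status = (int) ($response['status_code'] ?? 0);
            if ($status >= 200 && $status < 300) {
                // Record which driver produced the result.
                return $response + ['driver' => $name];
            }
            $errors[$name] = "HTTP $status";
        } catch (Throwable $e) {
            $errors[$name] = $e->getMessage();
        }
    }
    throw new RuntimeException('All drivers failed: ' . json_encode($errors));
}

// Stub drivers stand in for real HTTP, cURL, and browser fetchers.
$drivers = [
    'http' => fn (string $url): array => ['status_code' => 403, 'content' => ''],
    'curl' => function (string $url): array {
        throw new RuntimeException('timeout');
    },
    'browser' => fn (string $url): array => ['status_code' => 200, 'content' => '<html>ok</html>'],
];

$result = fetchWithFallback('https://example.com', $drivers);
echo $result['driver'], PHP_EOL; // prints "browser"
```

Ordering the array from cheapest to most expensive driver keeps the browser as a last resort, which matches the speed-first goal stated earlier.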

🧩 Driver Strategy

Each driver serves a different purpose:

  • Curl / HTTP → fast, lightweight requests
  • Browser (Puppeteer) → full rendering for JavaScript-heavy sites

📈 Scalability & Architecture

This project is designed for scalability.

It uses a modular architecture that allows extension without modifying core logic.

You can extend the system by adding:

  • New drivers (e.g. Playwright)
  • Custom proxy strategies
  • Advanced caching layers
  • Enhanced response handling
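
To make the extension point concrete, here is one way a pluggable driver contract could look. Every name below (`PageDriver`, `StaticFixtureDriver`, `DriverRegistry`) is hypothetical, and the package's real driver interface may be shaped differently.

```php
<?php
// Hypothetical contract: the package's real driver interface may differ.
interface PageDriver
{
    public function name(): string;

    /** @return array{content: string, status_code: int} */
    public function fetch(string $url): array;
}

/** Toy driver that serves canned pages, useful for tests and demos. */
final class StaticFixtureDriver implements PageDriver
{
    /** @param array<string, string> $pages url => html */
    public function __construct(private array $pages)
    {
    }

    public function name(): string
    {
        return 'fixture';
    }

    public function fetch(string $url): array
    {
        return isset($this->pages[$url])
            ? ['content' => $this->pages[$url], 'status_code' => 200]
            : ['content' => '', 'status_code' => 404];
    }
}

/** Looks drivers up by name, so new ones can be added without touching callers. */
final class DriverRegistry
{
    /** @var array<string, PageDriver> */
    private array $drivers = [];

    public function register(PageDriver $driver): void
    {
        $this->drivers[$driver->name()] = $driver;
    }

    public function get(string $name): PageDriver
    {
        return $this->drivers[$name]
            ?? throw new InvalidArgumentException("Unknown driver: $name");
    }
}

$registry = new DriverRegistry();
$registry->register(new StaticFixtureDriver(['https://example.com' => '<html>fixture</html>']));

$page = $registry->get('fixture')->fetch('https://example.com');
```

A Playwright-backed driver would implement the same two methods and be registered in the same way.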

📌 Use Cases

This engine acts as a data acquisition layer and can be used for:

  • Web page content collection
  • Data extraction pipelines (with custom parsers)
  • Website monitoring and change detection
  • Automation workflows
  • Research and large-scale data collection

⚠️ Notes

  • Puppeteer requires a working Chrome/Chromium environment
  • Some Linux servers may require additional dependencies
  • Curl/HTTP drivers work without Node.js

🤝 Contributions

This project is open to contributions and improvements.

Developers are welcome to:

  • Add new scraping drivers
  • Improve proxy rotation and scoring logic
  • Enhance caching mechanisms
  • Optimize performance and reliability
  • Suggest architectural improvements

All constructive feedback is appreciated.

⚠️ Limitations

This system is optimized for public and moderately protected websites.

Performance depends on:

  • Website protection level (anti-bot systems)
  • Proxy quality
  • Request patterns and concurrency

Some heavily protected websites may require additional strategies.

Statistics

  • Total downloads: 0
  • Monthly downloads: 0
  • Daily downloads: 0
  • Favorites: 0
  • Clicks: 7
  • Dependents: 0
  • Suggesters: 0

GitHub information

  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • Language: PHP

Other information

  • License: MIT
  • Last updated: 2026-04-24