README

🎯 Purpose & Practical Use

This project was created primarily as a learning exercise to understand how scalable web content retrieval systems are designed.

It explores real-world concepts such as:

Multi-driver request handling (HTTP, cURL, headless browser)
Proxy-aware routing systems
Driver pooling and lifecycle management
Fallback execution strategies
Basic response classification and reliability handling

While the project is not a production scraping product, it is structured so that it can serve as a base engine for more advanced scraping or automation systems.

💡 Why This Project

Many scraping approaches that rely on a single request method break down when a site renders its content with JavaScript or sits behind anti-bot protection.

This engine is designed to improve reliability by combining multiple strategies:

Lightweight requests for speed
Browser automation for complex pages
Intelligent fallback between methods

The goal is simple: maximize success rate while keeping performance efficient.

💼 Real-World Value

This project can serve as the foundation for:

Data collection systems
Monitoring tools
Automation pipelines
Custom scraping APIs

It focuses on reliable data acquisition, the layer every other step of a scraping workflow depends on.

🚀 Browser Scraper Engine

A scalable, driver-based web scraping engine designed to reliably retrieve web page content using multiple strategies such as HTTP, cURL, and headless browser automation.

Built to handle dynamic websites, fall back between drivers on failure, rotate proxies, and cache responses within a structured, extensible architecture.

✨ Key Features

🧠 Multi-driver system (Curl / HTTP / Browser automation)
🔁 Automatic fallback between scraping strategies
🌐 Proxy rotation support (improves success rate)
⚡ File-based caching system
🧩 Modular architecture (easy to extend)
📊 Driver-level success/failure tracking
🧪 Debug mode for monitoring

🏗 Architecture Overview

The system uses a driver-based approach to fetch web pages.

If one method fails, the system automatically retries using alternative drivers.

This improves reliability across different website types and protection levels.

📦 Installation

1. Install PHP dependencies

composer require johndetomal/browser-channel

2. Install Node.js dependencies (for browser driver)

cd node
npm install

🚀 Running the Browser Engine

Start the Puppeteer service:

node server.js

Default endpoint: http://localhost:3000

⚙️ Basic Usage (Quick Start)

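<?php

// Load Composer's autoloader (path assumes the default vendor/ directory).
require __DIR__ . '/vendor/autoload.php';

use Browser\Services\Browser\BrowserService;
use Browser\Services\Browser\Enum\BrowserDriver;

// Use the cURL driver as the primary fetch method.
$scraper = new BrowserService([
    'settings' => [
        'driver' => BrowserDriver::Curl,
    ]
]);

// Fetch the page; the result is an array (see Response Format).
$response = $scraper->openPage('https://example.com');

echo $response['content'];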

📊 Response Format

[
    'content' => '<html>...</html>',
    'status_code' => 200,
    'retries' => 1,
    'process_start_time' => 458252,
    'process_end_time' => 458828252,
    'message' => 'success',
    'reason' => $reason,
    'driver' => $driverType,
]
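
As a rough illustration of how a caller might act on this array, the sketch below sorts a response into a coarse category based on 'status_code' and 'content'. The function name classifyResponse and the category labels are assumptions made for this example; they are not part of the package.

```php
<?php

/**
 * Classify a response array by its 'status_code' and 'content' fields.
 * The category names are illustrative, not the package's own vocabulary.
 */
function classifyResponse(array $response): string
{
    $status = (int) ($response['status_code'] ?? 0);
    $body   = (string) ($response['content'] ?? '');

    // match(true) returns the value of the first arm whose condition holds.
    return match (true) {
        $status >= 200 && $status < 300 && $body !== '' => 'ok',
        $status >= 200 && $status < 300                 => 'empty',
        $status === 403 || $status === 429              => 'blocked',
        $status >= 400 && $status < 500                 => 'client_error',
        $status >= 500 && $status < 600                 => 'server_error',
        default                                         => 'unknown',
    };
}

echo classifyResponse(['status_code' => 200, 'content' => '<html>...</html>']), "\n";
echo classifyResponse(['status_code' => 429, 'content' => '']), "\n";
```

A category such as 'blocked' or 'empty' is the kind of signal that could justify retrying with another driver or proxy.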

💾 Caching System

Enable the file-based cache with the cache setting in the options passed to BrowserService:

'settings' => [
    'cache' => true,
]
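
To show what a file-based cache involves, here is a minimal sketch that stores page content in one file per URL and expires entries after a time-to-live. FilePageCache, its methods, the SHA-1 file naming, and the one-hour default TTL are all assumptions for illustration; the package's own cache may be organised differently.

```php
<?php

/**
 * Minimal file-based page cache keyed by a hash of the URL.
 * Hypothetical class; not the package's actual implementation.
 */
final class FilePageCache
{
    public function __construct(
        private string $directory,
        private int $ttlSeconds = 3600,
    ) {
        if (!is_dir($this->directory) && !mkdir($this->directory, 0775, true) && !is_dir($this->directory)) {
            throw new RuntimeException("Cannot create cache directory: {$this->directory}");
        }
    }

    /** Map a URL to a stable file path inside the cache directory. */
    private function pathFor(string $url): string
    {
        return $this->directory . DIRECTORY_SEPARATOR . sha1($url) . '.html';
    }

    /** Return cached content, or null if it is missing or older than the TTL. */
    public function get(string $url): ?string
    {
        $path = $this->pathFor($url);
        clearstatcache(true, $path);
        if (!is_file($path) || time() - filemtime($path) > $this->ttlSeconds) {
            return null;
        }
        $content = file_get_contents($path);
        return $content === false ? null : $content;
    }

    /** Store content for a URL, overwriting any previous entry. */
    public function put(string $url, string $content): void
    {
        file_put_contents($this->pathFor($url), $content, LOCK_EX);
    }
}

$cache = new FilePageCache(sys_get_temp_dir() . '/scraper-cache-demo');
$url   = 'https://example.com';

$html = $cache->get($url);
if ($html === null) {
    // Stand-in for a real fetch such as $scraper->openPage($url)['content'].
    $html = '<html>fetched</html>';
    $cache->put($url, $html);
}
echo $cache->get($url), "\n";
```

Hashing the URL keeps file names short and filesystem-safe, at the cost of not being human-readable.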

🧪 Debug Mode

$this->isDebugMode = true;

Debug output includes:

Driver used
Proxy used
Request status
Success/failure tracking
Response message
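
Success/failure tracking per driver can be as small as a pair of counters. The sketch below is one possible shape, assuming a hypothetical DriverStats class; it does not reflect how the package stores its counters internally.

```php
<?php

/**
 * Per-driver success/failure counters.
 * Hypothetical helper for illustration only.
 */
final class DriverStats
{
    /** @var array<string, array{success: int, failure: int}> */
    private array $counts = [];

    /** Record the outcome of one request made with the given driver. */
    public function record(string $driver, bool $succeeded): void
    {
        $this->counts[$driver] ??= ['success' => 0, 'failure' => 0];
        $this->counts[$driver][$succeeded ? 'success' : 'failure']++;
    }

    /** Fraction of successful requests, or null if the driver was never used. */
    public function successRate(string $driver): ?float
    {
        $c = $this->counts[$driver] ?? null;
        if ($c === null) {
            return null;
        }
        return $c['success'] / ($c['success'] + $c['failure']);
    }
}

$stats = new DriverStats();
$stats->record('curl', true);
$stats->record('curl', false);
$stats->record('curl', true);
$stats->record('browser', true);

printf("curl: %.2f\n", $stats->successRate('curl'));
```

Rates like these could later feed a decision about which driver to try first.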

🌐 Proxy Configuration

$scraper->proxies([
    ['ip' => '127.0.0.1', 'port' => '8080'],
    ['ip' => '127.0.0.2', 'port' => '8080'],
]);
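
For readers curious how rotation over such a list might work, the following sketch cycles through proxies in round-robin order. ProxyRotator and its next() method are hypothetical names introduced here; the package may select proxies differently (for example by scoring them).

```php
<?php

/**
 * Round-robin rotator over the same ['ip' => ..., 'port' => ...] shape
 * used above. Hypothetical helper, not part of the package API.
 */
final class ProxyRotator
{
    private int $index = 0;

    /** @param list<array{ip: string, port: string}> $proxies */
    public function __construct(private array $proxies)
    {
        if ($proxies === []) {
            throw new InvalidArgumentException('At least one proxy is required.');
        }
        // Re-index so positions are always 0..n-1.
        $this->proxies = array_values($proxies);
    }

    /** Return the next proxy as "ip:port", wrapping around at the end. */
    public function next(): string
    {
        $proxy = $this->proxies[$this->index];
        $this->index = ($this->index + 1) % count($this->proxies);
        return $proxy['ip'] . ':' . $proxy['port'];
    }
}

$rotator = new ProxyRotator([
    ['ip' => '127.0.0.1', 'port' => '8080'],
    ['ip' => '127.0.0.2', 'port' => '8080'],
]);

for ($i = 0; $i < 3; $i++) {
    echo $rotator->next(), "\n";
}
```

Round-robin spreads requests evenly but ignores proxy health; a scoring strategy would need the kind of success/failure data the engine already tracks per driver.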

🔁 Fallback System

The engine tries the primary configured driver first and, if it fails, automatically falls back to the others:

Primary configured driver
HTTP driver
cURL driver
Browser automation driver

This improves reliability across different website structures.
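
The fallback idea can be sketched in a few lines. This is a simplified model, not the engine's actual code: fetchWithFallback is a hypothetical function, the drivers are stub callables so the example runs offline, "success" is assumed to mean a 2xx status with a non-empty body, and 'retries' is assumed to count the failed attempts before the returned one.

```php
<?php

/**
 * Try each driver in order until one returns a usable response.
 * Hypothetical sketch; drivers are callables taking a URL and returning
 * ['content' => string, 'status_code' => int].
 *
 * @param array<string, callable> $drivers
 */
function fetchWithFallback(string $url, array $drivers): array
{
    $attempts  = 0;
    $lastError = 'no drivers configured';

    foreach ($drivers as $name => $driver) {
        $attempts++;
        try {
            $result = $driver($url);
            $status = $result['status_code'];
            // Assumed success rule: 2xx status and a non-empty body.
            if ($status >= 200 && $status < 300 && $result['content'] !== '') {
                return [
                    'content'     => $result['content'],
                    'status_code' => $status,
                    'retries'     => $attempts - 1,
                    'message'     => 'success',
                    'driver'      => $name,
                ];
            }
            $lastError = "{$name} returned status {$status}";
        } catch (Throwable $e) {
            $lastError = "{$name} threw: {$e->getMessage()}";
        }
    }

    // Every driver failed (or none was given).
    return [
        'content'     => '',
        'status_code' => 0,
        'retries'     => max(0, $attempts - 1),
        'message'     => 'failed',
        'reason'      => $lastError,
        'driver'      => null,
    ];
}

// Stub drivers standing in for the real ones.
$drivers = [
    'http'    => fn (string $url): array => ['content' => '', 'status_code' => 403],
    'curl'    => function (string $url): array { throw new RuntimeException('timeout'); },
    'browser' => fn (string $url): array => ['content' => '<html>ok</html>', 'status_code' => 200],
];

$response = fetchWithFallback('https://example.com', $drivers);
echo $response['driver'], ' after ', $response['retries'], " retries\n";
```

Ordering the cheap drivers first keeps the common case fast and reserves the browser for pages that need rendering.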

🧩 Driver Strategy

Each driver serves a different purpose:

cURL / HTTP → fast, lightweight requests
Browser (Puppeteer) → full rendering for JavaScript-heavy sites

📈 Scalability & Architecture

This project is designed for scalability.

It uses a modular architecture that allows extension without modifying core logic.

You can extend the system by adding:

New drivers (e.g. Playwright)
Custom proxy strategies
Advanced caching layers
Enhanced response handling

📌 Use Cases

This engine acts as a data acquisition layer and can be used for:

Web page content collection
Data extraction pipelines (with custom parsers)
Website monitoring and change detection
Automation workflows
Research and large-scale data collection

⚠️ Notes

Puppeteer requires a working Chrome/Chromium environment
Some Linux servers may require additional dependencies
The cURL and HTTP drivers work without Node.js

🤝 Contributions

This project is open to contributions and improvements.

Developers are welcome to:

Add new scraping drivers
Improve proxy rotation and scoring logic
Enhance caching mechanisms
Optimize performance and reliability
Suggest architectural improvements

All constructive feedback is appreciated.

⚠️ Limitations

This system is optimized for public and moderately protected websites.

Performance depends on:

Website protection level (anti-bot systems)
Proxy quality
Request patterns and concurrency

Some heavily protected websites may require additional strategies.
