README

ElephScraper is a lightweight, PHP-native web scraping toolkit built on top of Guzzle and Symfony DomCrawler. It provides a clean and powerful interface for extracting HTML, metadata, and structured data from any web page, or from an HTML string you already have.

Fast. Clean. Eleph-style scraping. 🐘⚡

Part of the ioodev scraper ecosystem alongside SnakyScraper (Python) and NodeScraper (Node.js) — three libraries with a similar API philosophy for three different language ecosystems.

Moving from riodevnet/elephscraper? See Migrating from v1.0 below — the namespace and package name changed in v1.1.0.

📋 Table of Contents

Features
Installation
Basic Usage
Error Handling
Request Options (Headers, Timeout, Proxy, etc.)
Full API Reference
Project Structure
Testing & Quality Tools
Migrating from v1.0 (riodevnet/elephscraper)
Contributing
Changelog
License

🚀 Features

✅ Extract metadata: title, description, keywords, author, charset, canonical, and more
✅ Full support for Open Graph, Twitter Card, CSRF token, and HTTP-equiv headers
✅ Extract headings, paragraphs, images, lists, and links, including link details such as rel values and nofollow flags
✅ Flexible filter() method with tag/class/ID-based selectors
✅ Can load from a URL or directly from an HTML string (fromHtml()) — no HTTP request needed, great for testing
✅ Never throws a fatal error — fetch/parse failures can always be checked via isValid() / getError(), or optionally thrown as an exception (throwOnError)
✅ Custom headers, timeout, proxy, cookies, and other Guzzle options via the $options parameter
✅ Safe return types: string, array, or associative array, and null (never a crash) when data isn't found
✅ Strict types & full type-hints (PHP 8.0+) for a safer development experience
✅ Built on top of Guzzle + Symfony DomCrawler + CssSelector
✅ PHPUnit test suite, PHPStan level 6, and PHP-CS-Fixer (PSR-12) already set up

📦 Installation

Install via Composer:

composer require ioodev/elephscraper

Requires PHP 8.0 or newer.

🛠️ Basic Usage

<?php

require_once __DIR__ . '/vendor/autoload.php';

use Ioodev\Elephscraper\ElephScraper;

$scraper = new ElephScraper('https://example.com');

if (!$scraper->isValid()) {
    // Request failed (timeout, DNS error, 404, etc.) — will not throw a fatal error.
    die('Failed to load page: ' . $scraper->getError()?->getMessage());
}

echo $scraper->title();        // "Welcome to Example.com"
echo $scraper->description();  // "Example site for testing"
print_r($scraper->h1());       // ["Main Title", "News"]
print_r($scraper->openGraph());

Load from an HTML string (no HTTP request)

Useful for unit tests, or when you already have HTML from another source (headless browser, file cache, webhook payload, etc.):

$html = '<html><head><title>Static Page</title></head><body>...</body></html>';

$scraper = ElephScraper::fromHtml($html);

echo $scraper->title(); // "Static Page"
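
Because fromHtml() needs no network access, it fits naturally into a PHPUnit test. The sketch below is one way that might look. The class name StaticPageTest and the fixture markup are made up for illustration, and the asserted values assume the return shapes documented in this README (title() as a string, h1() as a list of strings).

```php
<?php

declare(strict_types=1);

use Ioodev\Elephscraper\ElephScraper;
use PHPUnit\Framework\TestCase;

// Hypothetical test class; the name and fixture are illustrative only.
final class StaticPageTest extends TestCase
{
    public function testExtractsTitleAndHeading(): void
    {
        $html = '<html><head><title>Static Page</title></head>'
            . '<body><h1>Hello</h1></body></html>';

        // No HTTP request is made here.
        $scraper = ElephScraper::fromHtml($html);

        self::assertTrue($scraper->isValid());
        self::assertSame('Static Page', $scraper->title());
        self::assertSame(['Hello'], $scraper->h1());
    }
}
```

Run it through the project's usual PHPUnit entry point so that the Composer autoloader is already registered.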

⚠️ Error Handling

By default, the constructor never throws an exception. This is intentional, so that a single broken URL in the middle of a batch or loop doesn't halt the entire process. Always check isValid() before using the result:

$scraper = new ElephScraper($url);

if (!$scraper->isValid()) {
    echo 'Error: ' . $scraper->getError()?->getMessage();
    // continue to the next URL, etc.
}

If you prefer a "fail-fast" model with try/catch, set throwOnError:

use Ioodev\Elephscraper\Exceptions\ScraperException;

try {
    $scraper = new ElephScraper($url, ['throwOnError' => true]);
} catch (ScraperException $e) {
    echo 'Scraping failed: ' . $e->getMessage();
}

All extraction methods (title(), h1(), links(), etc.) are safe to call even if the document failed to load: they return null instead of crashing. This includes the edge cases that caused a fatal error in previous versions.
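
The default non-throwing behavior is aimed at batch jobs. As a rough sketch of that pattern, the loop below collects titles and failures separately. The URL list and the $titles / $failures arrays are placeholders, not part of the library.

```php
<?php

declare(strict_types=1);

require_once __DIR__ . '/vendor/autoload.php';

use Ioodev\Elephscraper\ElephScraper;

// Hypothetical URL list; replace with your own.
$urls = [
    'https://example.com',
    'https://example.org/missing-page',
];

$titles = [];
$failures = [];

foreach ($urls as $url) {
    $scraper = new ElephScraper($url);

    if (!$scraper->isValid()) {
        // Record the failure and move on; nothing is thrown by default.
        $failures[$url] = $scraper->getError()?->getMessage() ?? 'unknown error';
        continue;
    }

    // title() may still be null if the page has no <title>.
    $titles[$url] = $scraper->title();
}

printf("%d loaded, %d failed\n", count($titles), count($failures));
```

Every URL ends up in exactly one of the two arrays, so a failed request never interrupts the rest of the run.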

⚙️ Request Options (Headers, Timeout, Proxy, etc.)

The constructor's second parameter is an options array passed directly to Guzzle request(), merged with the defaults (timeout: 10, connect_timeout: 5, redirects followed, browser User-Agent):

$scraper = new ElephScraper('https://example.com', [
    'timeout' => 20,
    'headers' => [
        'User-Agent' => 'MyBot/1.0',
        'Accept-Language' => 'en-US,en;q=0.9',
    ],
    'proxy' => 'http://localhost:8125',
    'verify' => false, // disable SSL verification (use with caution in production)
]);

You can also inject your own Guzzle Client instance (for example, for testing with a mock handler):

$scraper = new ElephScraper('https://example.com', [
    'client' => $myMockedGuzzleClient,
]);
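
For the mock-handler case, one possible setup uses Guzzle's own MockHandler and HandlerStack classes. This is a sketch rather than the project's official test setup: the canned HTML is invented, and it assumes the 'client' option accepts any GuzzleHttp\Client instance as described above.

```php
<?php

declare(strict_types=1);

require_once __DIR__ . '/vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\Handler\MockHandler;
use GuzzleHttp\HandlerStack;
use GuzzleHttp\Psr7\Response;
use Ioodev\Elephscraper\ElephScraper;

// Queue a single canned response; no real network traffic occurs.
$mock = new MockHandler([
    new Response(
        200,
        ['Content-Type' => 'text/html'],
        '<html><head><title>Mocked</title></head><body></body></html>'
    ),
]);

// Wrap the mock in a handler stack and hand it to a regular Guzzle client.
$client = new Client(['handler' => HandlerStack::create($mock)]);

$scraper = new ElephScraper('https://example.com', ['client' => $client]);

var_dump($scraper->isValid());
var_dump($scraper->title());
```

Queuing a 404 or 500 response instead gives a way to exercise the isValid() / getError() path without touching the network.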

📚 Full API Reference

🔹 Page Metadata

$scraper->title();          // ?string
$scraper->description();    // ?string
$scraper->keywords();       // ?string[] — comma-split result, already trimmed
$scraper->keywordString();  // ?string  — raw "content" attribute
$scraper->charset();        // ?string
$scraper->canonical();      // ?string
$scraper->contentType();    // ?string  — from meta http-equiv="Content-Type"
$scraper->author();         // ?string
$scraper->csrfToken();      // ?string  — checks <meta name="csrf-token">, falls back to <input name="csrf-token">
$scraper->image();          // ?string  — shortcut for og:image
$scraper->viewport();       // ?string[] — comma-split result from meta viewport
$scraper->viewportString(); // ?string
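
keywords() and viewport() are described above as comma-split, trimmed versions of keywordString() and viewportString(). The helper below is not the library's code. splitCommaList is a hypothetical stand-in that shows the kind of transformation the description implies, and dropping empty fragments is an assumption on top of that.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: split a raw meta "content" value on commas and trim
 * each part. Returns null when there is no content at all.
 *
 * @return string[]|null
 */
function splitCommaList(?string $content): ?array
{
    if ($content === null || trim($content) === '') {
        return null;
    }

    $parts = array_map('trim', explode(',', $content));

    // Assumption: empty fragments (e.g. from a trailing comma) are dropped.
    return array_values(
        array_filter($parts, static fn (string $part): bool => $part !== '')
    );
}

print_r(splitCommaList('php, scraping ,  dom'));
print_r(splitCommaList('width=device-width, initial-scale=1'));
```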

🔹 Open Graph & Twitter Card

$scraper->openGraph();              // array<string,?string> — all common og: properties
$scraper->openGraph('og:title');    // ?string — a specific property

$scraper->twitterCard();                  // array<string,?string> — all common twitter: tags
$scraper->twitterCard('twitter:title');   // ?string — a specific property

🔹 Heading & Text

$scraper->h1(); // ?string[]
$scraper->h2(); // ?string[]
$scraper->h3(); // ?string[]
$scraper->h4(); // ?string[]
$scraper->h5(); // ?string[]
$scraper->h6(); // ?string[]
$scraper->p();  // ?string[] — all <p> elements, trimmed

🔹 List

$scraper->ul(); // ?string[] — all <li> text inside <ul>
$scraper->ol(); // ?string[] — all <li> text inside <ol>

🔹 Images

$scraper->images();       // ?string[] — all <img> src
$scraper->imageDetails(); // ?array<int, array{url:?string, alt_text:?string, title:?string}>

🔹 Links

$scraper->links();       // ?string[] — all <a> href
$scraper->linkDetails();
// ?array<int, array{
//     url: ?string,
//     protocol: string,        // "https", "mailto", "" if relative, etc.
//     text: string,
//     title: string,
//     target: string,
//     rel: string[],
//     is_nofollow: bool,
//     is_ugc: bool,
//     is_noopener: bool,
//     is_noreferrer: bool,
// }>
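
The boolean flags in linkDetails() make it straightforward to narrow a link list with plain array functions. The sketch below keeps only HTTPS links that are not marked nofollow. The sample markup is invented, and the filter relies only on the keys listed above.

```php
<?php

declare(strict_types=1);

require_once __DIR__ . '/vendor/autoload.php';

use Ioodev\Elephscraper\ElephScraper;

// Invented sample markup for illustration.
$html = '<html><body>'
    . '<a href="https://example.com/docs">Docs</a>'
    . '<a href="https://example.com/ad" rel="nofollow noopener">Sponsored</a>'
    . '<a href="/relative/path">Relative</a>'
    . '</body></html>';

$scraper = ElephScraper::fromHtml($html);

// linkDetails() returns null when nothing is found, so fall back to [].
$links = $scraper->linkDetails() ?? [];

$followable = array_filter(
    $links,
    static fn (array $link): bool =>
        !$link['is_nofollow'] && $link['protocol'] === 'https'
);

foreach ($followable as $link) {
    echo $link['url'], ' (', $link['text'], ')', PHP_EOL;
}
```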

🔍 Custom DOM Filter

filter() is the most flexible method — ideal for scraping custom HTML structures like product lists, article cards, data tables, etc.

$scraper->filter(
    element: 'div',
    attributes: ['id' => 'main'],
    multiple: false,
    extract: ['.title', '#desc', 'p'],
    returnHtml: false
);

Filter multiple elements at once:

$products = $scraper->filter(
    element: 'div',
    attributes: ['class' => 'product-card'],
    multiple: true,
    extract: ['.product-title', '.price'],
    returnHtml: false
);

// [
//     ['.product-title' => 'Wireless Mouse', '.price' => '$15.00'],
//     ['.product-title' => 'Mechanical Keyboard', '.price' => '$85.00'],
// ]

Get raw HTML from a single section:

$scraper->filter(
    element: 'section',
    attributes: ['class' => 'hero'],
    returnHtml: true
);

Selector rules for extract:

Tag name: h2, p, span, etc.

Class: .className (automatically matches even if the element has multiple classes)

ID: #idName

Result array keys always match the original selector string (e.g. $result['.title']). Values in attributes (class, id, or any other attribute) are escaped, so quote characters can no longer break the selector as they could in previous versions.

Returns null if the document fails to load, or if no matching elements are found.
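
Since filter() can return null, callers usually want a fallback before iterating. The following sketch combines fromHtml() with a multiple-element filter and reads results by their selector keys. The product markup is invented for the example.

```php
<?php

declare(strict_types=1);

require_once __DIR__ . '/vendor/autoload.php';

use Ioodev\Elephscraper\ElephScraper;

// Invented markup; nowdoc syntax keeps the "$" signs literal.
$html = <<<'HTML'
<html><body>
  <div class="product-card">
    <h2 class="product-title">Wireless Mouse</h2>
    <span class="price">$15.00</span>
  </div>
  <div class="product-card">
    <h2 class="product-title">Mechanical Keyboard</h2>
    <span class="price">$85.00</span>
  </div>
</body></html>
HTML;

$scraper = ElephScraper::fromHtml($html);

// Fall back to an empty array when nothing matches (filter() returns null).
$products = $scraper->filter(
    element: 'div',
    attributes: ['class' => 'product-card'],
    multiple: true,
    extract: ['.product-title', '.price'],
    returnHtml: false
) ?? [];

foreach ($products as $product) {
    // Keys are the selector strings passed in "extract".
    echo $product['.product-title'], ': ', $product['.price'], PHP_EOL;
}
```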

🔧 Low-Level Access

For cases not covered by the built-in methods, you can drop straight down to Symfony DomCrawler:

$scraper->isValid();    // bool — whether the document loaded successfully
$scraper->getError();   // ?Throwable — the last exception, if any
$scraper->getHtml();    // ?string — raw HTML
$scraper->getCrawler(); // ?Symfony\Component\DomCrawler\Crawler
$scraper->getUrl();     // ?string — source URL (or base URL from fromHtml())
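
As an illustration of dropping down to DomCrawler, the sketch below reads a table into a nested array, something the built-in methods above don't cover. It uses only standard Crawler calls (filter(), each(), text()). The table markup is made up.

```php
<?php

declare(strict_types=1);

require_once __DIR__ . '/vendor/autoload.php';

use Ioodev\Elephscraper\ElephScraper;
use Symfony\Component\DomCrawler\Crawler;

// Invented sample table.
$html = '<html><body><table>'
    . '<tr><td>PHP</td><td>8.0</td></tr>'
    . '<tr><td>Guzzle</td><td>7.x</td></tr>'
    . '</table></body></html>';

$scraper = ElephScraper::fromHtml($html);
$crawler = $scraper->getCrawler();

$rows = [];

if ($crawler !== null) {
    // One inner array of trimmed cell texts per <tr>.
    $rows = $crawler->filter('table tr')->each(
        static fn (Crawler $row): array => $row->filter('td')->each(
            static fn (Crawler $cell): string => trim($cell->text())
        )
    );
}

print_r($rows);
```

The null check matters because getCrawler() is documented as nullable when the document failed to load.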

🗂 Project Structure

elephscraper/
├── .github/
│   └── workflows/
│       └── ci.yml              # GitHub Actions: tests + static analysis on PHP 8.0–8.3
├── examples/
│   ├── basic-usage.php
│   ├── custom-options.php
│   └── from-html-and-filter.php
├── src/
│   ├── Exceptions/
│   │   ├── InvalidUrlException.php
│   │   └── ScraperException.php
│   ├── Support/
│   │   └── CssSelectorBuilder.php   # safe CSS selector builder (escaping, id/class normalization)
│   └── ElephScraper.php             # main class
├── tests/
│   └── Unit/
│       ├── CssSelectorBuilderTest.php
│       └── ElephScraperTest.php
├── .gitignore
├── .php-cs-fixer.php
├── CHANGELOG.md
├── composer.json
├── LICENSE
├── phpstan.neon
├── phpunit.xml
└── README.md

This separation is intentional, to make further development easier:

src/Exceptions/ — all exception classes, so library consumers can catch (ScraperException $e) specifically without catching generic PHP errors.
src/Support/ — internal helpers (currently CssSelectorBuilder) kept separate from the main class so they can be unit-tested independently and reused if more selector features are added later.
tests/Unit/ — mirrors the src/ namespace structure, one test file per class.
examples/ — runnable scripts (php examples/basic-usage.php) for quick onboarding without needing to read the whole README.

🧪 Testing & Quality Tools

composer install

composer test            # run PHPUnit
composer test:coverage   # PHPUnit + coverage report
composer analyse         # PHPStan level 6
composer lint            # check code formatting (PSR-12), without modifying files
composer lint:fix        # automatically fix code formatting

The test suite covers metadata extraction, heading/paragraph/list extraction, images & links (including the edge case of relative links without rel), filter() (single & multiple, including values containing quote characters), and behavior when the document fails to load (must return null, not crash).

🔁 Migrating from v1.0 (riodevnet/elephscraper)

Version 1.1.0 changes the namespace and package name following the username rename from riodevnet to ioodev. Migration steps:

composer remove riodevnet/elephscraper
composer require ioodev/elephscraper

Then find-and-replace throughout your project:

- use Riodevnet\Elephscraper\ElephScraper;
+ use Ioodev\Elephscraper\ElephScraper;

All method names remain exactly the same — there are no signature changes to any public method that existed in v1.0, so you only need to update the use statement. See CHANGELOG.md for the full list of changes, new features, and bug fixes.

🤝 Contributing

Found a bug? Want to add a feature? Open an issue or submit a pull request at github.com/ioodev/elephscraper!

Before opening a PR, please run:

composer test
composer analyse
composer lint

📝 Changelog

See CHANGELOG.md for the full list of changes in each version.

📄 License

🔗 Related Libraries

Guzzle
Symfony DomCrawler
Symfony CssSelector
SnakyScraper — Python version
NodeScraper — Node.js version

💡 Why ElephScraper?

ElephScraper is your trusty PHP elephant — strong, smart, and always ready to extract exactly the data you need. 🐘
