README

Introduction

Larascraper allows you to scrape any URL using Laravel. It uses Puppeteer under the hood. Unlike Spatie Crawler or Browsershot, this scraper focuses on simplicity. While Spatie Crawler can leave many Chromium instances open, filling your server's memory, Larascraper starts the scraping process using Node and makes sure the Chromium instance is closed before exiting.

Unlike Spatie Crawler, it supports proxy authentication and is generally faster.

Install

Install the package via Composer:

composer require edulazaro/larascraper

Then install the required Node dependencies:

npm install puppeteer puppeteer-extra puppeteer-extra-plugin-stealth

These packages are required for the internal Puppeteer script to run.

Please note that when you run the scraper from a scheduled task, a non-interactive terminal is probably used. Node is usually available there, but it may not be if you installed Node via NVM. In that case, see the Issues section at the end.

Basic Usage

Create a scraper class (manually or via the built-in command):

php artisan make:scraper BikeScraper

This generates a file like:

namespace App\Scrapers;

use EduLazaro\Larascraper\Scraper;

class BikeScraper extends Scraper
{
    protected function handle(): array
    {
        return [
            'title' => $this->crawler->filter('title')->text('')
        ];
    }
}
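
The empty string passed to text() appears to serve as a fallback for when no element matches the selector. That is how Symfony's DomCrawler behaves, and the sketch below assumes (the source does not confirm it) that $this->crawler is such an object. To stay runnable without Larascraper, it reproduces the same fallback idea with PHP's built-in DOM extension; firstText is a hypothetical helper, not part of the package:

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of Larascraper): return the text of the first
 * <$tag> element in $html, or $default when no such element exists.
 */
function firstText(string $html, string $tag, string $default = ''): string
{
    // Collect parser warnings internally; real-world HTML is rarely valid.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // Discard collected warnings and restore the previous libxml setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $node = $doc->getElementsByTagName($tag)->item(0);

    return $node === null ? $default : trim($node->textContent);
}

// A page with a title yields its text.
echo firstText('<html><head><title>Bikes</title></head><body></body></html>', 'title'), "\n"; // prints "Bikes"

// A page without a title yields the fallback instead of raising an error.
var_dump(firstText('<html><body><p>No title here</p></body></html>', 'title')); // string(0) ""
```

Returning a default keeps handle() from failing when a page lacks the expected element, which matters when scraping many pages with slightly different markup.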

You can now scrape a URL like this:

use App\Scrapers\BikeScraper;

$data = BikeScraper::scrape('https://whatever.com/bikes/4')
    ->proxy('ip:port', 'username', 'password') // Optional
    ->timeout(10000) // Optional timeout in ms
    ->headers(['Accept-Language' => 'en']) // Optional headers
    ->run();

dd($data);

You can pass parameters to the run method as long as the handle method declares them:

namespace App\Scrapers;

use EduLazaro\Larascraper\Scraper;

class BikeScraper extends Scraper
{
    protected function handle(string $name): array
    {
        return [
            'title' => $this->crawler->filter($name)->text('')
        ];
    }
}

And then you can do:

use App\Scrapers\BikeScraper;

BikeScraper::scrape('https://whatever.com/bikes/4')->run(name: 'title');

Proxy Support

Larascraper supports proxies with or without authentication:

->proxy('200.20.14.84:40200')

Or if using authentication:

->proxy('200.20.14.84:40200', 'username', 'password')

Timeout

To set a custom timeout (the default is 20000 ms):

->timeout(10000) // Timeout in milliseconds

Headers

To append custom headers:

->headers([
    'Accept-Language' => 'en',
    'X-Custom-Header' => 'Hello'
])

Retry logic

You can set the number of attempts and the number of seconds to wait between attempts:

->retry(3, 5)

This retries 3 times and waits 5 seconds between attempts. Please note that only responses with the status codes 408, 429, 500, 502, 503 and 504 are retried.
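
The retry rule described above can be sketched in plain PHP. This is not Larascraper's implementation, only an illustration of the behaviour under stated assumptions: runWithRetry is a hypothetical helper, the callable stands in for one scraping attempt and returns an HTTP status code, and the first argument of retry is treated as the total number of attempts (the source does not say whether the first try is counted).

```php
<?php

declare(strict_types=1);

// Status codes listed in this README as retryable.
const RETRYABLE_STATUS_CODES = [408, 429, 500, 502, 503, 504];

/**
 * Hypothetical helper (not part of Larascraper): call $attempt until it
 * returns a non-retryable status code or $maxAttempts is reached.
 *
 * @param callable(): int $attempt Performs one request, returns its status code.
 * @return array{status: int, attempts: int}
 */
function runWithRetry(callable $attempt, int $maxAttempts, int $waitSeconds): array
{
    $status = 0;
    $attempts = 0;

    while ($attempts < $maxAttempts) {
        $attempts++;
        $status = $attempt();

        // Success or a non-retryable error: stop immediately.
        if (!in_array($status, RETRYABLE_STATUS_CODES, true)) {
            break;
        }

        // Wait only if another attempt will follow.
        if ($attempts < $maxAttempts) {
            sleep($waitSeconds);
        }
    }

    return ['status' => $status, 'attempts' => $attempts];
}

// Simulated responses: two retryable failures, then success.
// The wait is 0 seconds here so the example finishes instantly.
$responses = [503, 429, 200];
$result = runWithRetry(function () use (&$responses): int {
    return array_shift($responses);
}, 3, 0);

echo $result['status'], ' after ', $result['attempts'], " attempts\n"; // prints "200 after 3 attempts"
```

A 404 or 403 in the same position would end the loop after the first attempt, since waiting does not usually fix those responses.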

Artisan Commands

You can generate a scraper class with:

php artisan make:scraper MyScraper

List all scrapers in the app/Scrapers directory:

php artisan list:scrapers

Testing a scraper

You can easily test a scraper with Tinker:

php artisan tinker

And then run:

$data = \App\Scrapers\TestScraper::scrape('https://whatever.com')->run();
dd($data);

Issues

This section contains common configuration issues.

Using Node via NVM

If you installed Node via NVM and you run the scraper from a scheduled task, Node may not be available. To make it available, edit your ~/.bash_profile with an editor such as Vi, Vim or Nano:

nano ~/.bash_profile

Then make sure this is included at the top:

export NVM_DIR="$HOME/.nvm"
[ -s "$NVM_DIR/nvm.sh" ] && \. "$NVM_DIR/nvm.sh"  # This loads nvm
[ -s "$NVM_DIR/bash_completion" ] && \. "$NVM_DIR/bash_completion"  # This loads nvm bash_completion

Save the file and run:

source ~/.bash_profile

Now Node will be available in non-interactive terminals and the scraping process should run successfully.

In general, using NVM in production environments is not recommended.
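
To confirm whether Node is visible to the PHP process itself (and not only to your login shell), a small check like the one below may help. It is a sketch, not part of Larascraper: findBinary is a hypothetical helper, and it assumes a Unix-like system where shell_exec is not disabled.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of Larascraper): resolve a command name on the
 * PATH seen by the current PHP process. Returns its path, or null if not found.
 */
function findBinary(string $name): ?string
{
    // "command -v" is the POSIX way to resolve a command name; stderr is discarded.
    $output = shell_exec('command -v ' . escapeshellarg($name) . ' 2>/dev/null');

    // shell_exec() yields null or false when nothing is printed or the call fails.
    if (!is_string($output) || trim($output) === '') {
        return null;
    }

    return trim($output);
}

$node = findBinary('node');

echo $node === null
    ? "node was not found on this process's PATH\n"
    : "node found at {$node}\n";
```

Running the same check from inside the scheduled task shows the PATH that task actually gets, which can differ from the one in your terminal.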
