Welcome to Scrapy Do’s documentation!

Scrapy Do is a daemon that provides a convenient way to run Scrapy spiders. It can run them once, immediately, or periodically at specified time intervals. It was inspired by scrapyd but written from scratch. It comes with a REST API, a command line client, and an interactive web interface.

Contents

Quick Start

  • Install scrapy-do using pip:

    $ pip install scrapy-do
    
  • Start the daemon in the foreground:

    $ scrapy-do -n scrapy-do
    
  • Open another terminal window, download Scrapy’s quotesbot example, and push the code to the server:

    $ git clone https://github.com/scrapy/quotesbot.git
    $ cd quotesbot
    $ scrapy-do-cl push-project
    +----------------+
    | quotesbot      |
    |----------------|
    | toscrape-css   |
    | toscrape-xpath |
    +----------------+
    
  • Schedule some jobs:

    $ scrapy-do-cl schedule-job --project quotesbot \
        --spider toscrape-css --when 'every 5 to 15 minutes'
    +--------------------------------------+
    | identifier                           |
    |--------------------------------------|
    | 0a3db618-d8e1-48dc-a557-4e8d705d599c |
    +--------------------------------------+
    
    $ scrapy-do-cl schedule-job --project quotesbot --spider toscrape-css
    +--------------------------------------+
    | identifier                           |
    |--------------------------------------|
    | b3a61347-92ef-4095-bb68-0702270a52b8 |
    +--------------------------------------+
    
  • See what’s going on:

    [Screenshot: the Active Jobs view of the web interface]

    The web interface is available at http://localhost:7654 by default.

Basic Concepts

Projects

Scrapy Do handles zipped Scrapy projects. The only expectation it has about the structure of the archive is that it contains a top-level directory whose name matches the name of the project; this directory, in turn, contains the Scrapy project itself. Doing things this way ends up being quite convenient if you use a version control system like git to manage the code of your spiders (which you probably should). Let’s consider quotesbot:

$ git clone https://github.com/scrapy/quotesbot.git
$ cd quotesbot

You can create a valid archive like this:

$ git archive master -o quotesbot.zip --prefix=quotesbot/

You can, of course, create the zip file any way you wish as long as it meets the criteria described above.
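
If you are not using git, a plain zip invocation run from the parent directory produces the same layout; this is just a sketch, and the exclusion pattern assumes a .git subdirectory that you may not have:

$ cd ..
$ zip -r quotesbot.zip quotesbot/ -x 'quotesbot/.git/*'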

Jobs

When you submit a job, it is classified as either SCHEDULED or PENDING, depending on the scheduling spec you provide. A PENDING job is picked up for execution as soon as a job slot is free, and its status then changes to RUNNING. SCHEDULED jobs spawn new PENDING jobs at the intervals specified in the scheduling spec. A RUNNING job ends up as SUCCESSFUL, FAILED, or CANCELED, depending on the return code of the spider process or your actions.
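
To watch this lifecycle in practice, you can submit an immediate job and then query it by the identifier that schedule-job returns (the placeholder below stands for that identifier):

$ scrapy-do-cl schedule-job --project quotesbot --spider toscrape-css
$ scrapy-do-cl list-jobs --job-id <identifier-from-the-previous-command>
$ scrapy-do-cl list-jobs --status COMPLETED

Depending on when you run list-jobs, the job will show up as PENDING, RUNNING, or one of the completed statuses.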

Scheduling Specs

Scrapy Do uses the excellent Schedule library to handle scheduled jobs. The user-supplied scheduling specs get translated to a series of calls to the schedule library. Therefore, whatever is valid for this library should be a valid scheduling spec. For example:

  • ‘every monday at 12:30’
  • ‘every 2 to 3 hours’
  • ‘every 6 minutes’
  • ‘every hour at 00:15’

are all valid. A scheduling spec must start with either ‘every’ or ‘now’. The former results in a SCHEDULED job, while the latter produces a PENDING job for immediate execution. The other valid keywords are listed below, followed by a short usage example:

  • second
  • seconds
  • minute
  • minutes
  • hour
  • hours
  • day
  • days
  • week
  • weeks
  • monday
  • tuesday
  • wednesday
  • thursday
  • friday
  • saturday
  • sunday
  • at - expects an hour-like parameter immediately afterwards (e.g., 12:12)
  • to - expects an integer immediately afterwards
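
Combining the keywords above, a few more hypothetical schedule-job invocations could look like this (the project and spider names are the quotesbot examples used throughout this page):

$ scrapy-do-cl schedule-job --project quotesbot \
    --spider toscrape-xpath --when 'every wednesday at 01:30'
$ scrapy-do-cl schedule-job --project quotesbot \
    --spider toscrape-css --when 'every 2 to 3 hours'
$ scrapy-do-cl schedule-job --project quotesbot --spider toscrape-css --when 'now'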

Installation

The easy way

The easiest way to install Scrapy Do is using pip. You can then create a directory where you want your project data stored and just start the daemon there.

$ pip install scrapy-do
$ mkdir /home/user/my-scrapy-do-data
$ cd /home/user/my-scrapy-do-data
$ scrapy-do scrapy-do

Yup, you need to type scrapy-do twice. That’s how Twisted works, don’t ask me. After doing that, you will see some content in this directory, including the log file and the pidfile of the Scrapy Do daemon.
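
If you want to check on or stop a daemon started this way, the sketch below may help; the pidfile name is an assumption based on Twisted’s default, so check what actually appears in the directory:

$ ls                        # the log file, the pidfile, and the project store live here
$ kill "$(cat twistd.pid)"  # assumes the default Twisted pidfile name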

A systemd service

Installing Scrapy Do as a systemd service is a far better idea than the easy way described above. It’s a bit of work that should really be done by a proper Debian/Ubuntu package, but we do not have one for the time being, so I will show you how to do it “by hand.”

  • Although not strictly necessary, it’s a good practice to run the daemon under a separate user account. I will create one called pydaemon because I run a couple more python daemons this way.

    $ sudo useradd -m -d /opt/pydaemon pydaemon
    
  • Make sure you have all of the following packages installed:

    $ sudo apt-get install python3 python3-dev python3-virtualenv
    $ sudo apt-get install build-essential
    
  • Switch your session to this new user account:

    $ sudo su - pydaemon
    
  • Create the virtual env and install Scrapy Do:

    $ mkdir virtualenv
    $ cd virtualenv/
    $ python3 /usr/lib/python3/dist-packages/virtualenv.py -p /usr/bin/python3 .
    $ . ./bin/activate
    $ pip install scrapy-do
    $ cd ..
    
  • Create a bin directory and a wrapper script that will set up the virtualenv on startup:

    $ mkdir bin
    $ cat > bin/scrapy-do << EOF
    > #!/bin/bash
    > . /opt/pydaemon/virtualenv/bin/activate
    > exec /opt/pydaemon/virtualenv/bin/scrapy-do "\${@}"
    > EOF
    $ chmod 755 bin/scrapy-do
    
  • Create a data directory and a configuration file:

    $ mkdir -p data/scrapy-do
    $ mkdir etc
    $ cat > etc/scrapy-do.conf << EOF
    > [scrapy-do]
    > project-store = /opt/pydaemon/data/scrapy-do
    > EOF
    
  • As root, create a systemd unit file with the following content:

    # cat > /etc/systemd/system/scrapy-do.service << EOF
    > [Unit]
    > Description=Scrapy Do Service
    >
    > [Service]
    > ExecStart=/opt/pydaemon/bin/scrapy-do --nodaemon --pidfile= \
    >           scrapy-do --config /opt/pydaemon/etc/scrapy-do.conf
    > User=pydaemon
    > Group=pydaemon
    > Restart=always
    >
    > [Install]
    > WantedBy=multi-user.target
    > EOF
    
  • You can then reload the systemd configuration and let it manage the Scrapy Do daemon:

    $ sudo systemctl daemon-reload
    $ sudo systemctl start scrapy-do
    $ sudo systemctl enable scrapy-do
    
  • Finally, you should now be able to see that the daemon is running:

    $ sudo systemctl status scrapy-do
    ● scrapy-do.service - Scrapy Do Service
       Loaded: loaded (/etc/systemd/system/scrapy-do.service; enabled; vendor preset: enabled)
       Active: active (running) since Sun 2017-12-10 22:42:55 UTC; 4min 23s ago
     Main PID: 27543 (scrapy-do)
    ...
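
  • Since the service runs with --nodaemon, its log output ends up in the systemd journal; you can follow it with journalctl:

    $ sudo journalctl -u scrapy-do -f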
    

I know it’s awfully complicated. I will do some packaging work when I have a spare moment.

Server Configuration

You can pass a configuration file to the Scrapy Do daemon in the following way:

$ scrapy-do scrapy-do --config /path/to/config/file.conf

The remaining part of this section describes the meaning of the configurable parameters.

[scrapy-do] section

  • project-store: A directory where all of the Scrapy Do daemon’s state is stored. Defaults to projects, meaning a subdirectory of the current working directory.
  • job-slots: The number of jobs that can run in parallel. Defaults to 3.
  • completed-cap: The number of completed jobs to keep. Jobs exceeding the cap are purged together with their log files, oldest first. Defaults to 50.

[web] section

  • interfaces: A whitespace-separated list of address-port pairs to listen on. Use the RFC 3986 notation to specify IPv6 addresses, e.g., [::1]:7654. Defaults to 127.0.0.1:7654.
  • https: The HTTPS switch. Defaults to off.
  • key: Path to your certificate key. Defaults to: scrapy-do.key.
  • cert: Path to your certificate. Defaults to: scrapy-do.crt.
  • chain: Path to a file containing additional certificates in the chain of trust. Useful when using Let’s Encrypt, because their signing certificate is trusted by browsers but not necessarily by the OS itself, which makes command line tools like wget or curl fail to verify the certificate. Defaults to an empty string.
  • auth: The authentication switch. Scrapy Do uses the digest authentication method and will not transmit your password over the network. Therefore, it’s safe to use even without TLS. Defaults to off.
  • auth-db: Path to your authentication database file. The file contains username-password pairs, one per line, with the user and password parts separated by a colon (:), e.g., myusername:mypassword. Please note that digest authentication requires the server to know the actual password, not a hash. Defaults to auth.db. A short illustration follows this list.
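
For illustration, an auth.db with two hypothetical accounts would contain nothing more than:

alice:wonderland
bob:builder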

Example configuration

[scrapy-do]
project-store = /var/scrapy-do
job-slots = 5
completed-cap = 250

[web]
interfaces = 10.8.0.1:9999 [2001:db8::fa]:7654

https = on
key = /etc/scrapy-do/scrapy-do.key
cert = /etc/scrapy-do/scrapy-do.crt
chain = /etc/scrapy-do/scrapy-do-chain.pem

auth = on
auth-db = /etc/scrapy-do/auth.db

Command Line Client

The command line client is a thin wrapper over the REST API. Its purpose is to make the command invocations more succinct and to format the responses. The command name is scrapy-do-cl. It is followed by a bunch of optional global parameters, the name of the command to be executed, and the command’s parameters:

scrapy-do-cl [global parameters] command [command parameters]

Global parameters and the configuration file

  • --url - the URL of the scrapy-do server, e.g., http://localhost:7654
  • --username - user name, in case the server is configured to perform authentication
  • --password - user password; if the password is not specified and was not configured in the configuration file, the user will be prompted to type it in the terminal.
  • --print-format - the format of the output; valid options are simple, grid, fancy_grid, presto, psql, pipe, orgtbl, jira, rst, mediawiki, html, latex; defaults to psql.
  • --verify-ssl - a boolean determining whether the SSL certificate checking should be enabled; defaults to True

The defaults for some of these parameters may be specified in the scrapy-do section of the ~/.scrapy-do.cfg file. The parameters configurable this way are: url, username, password, and print-format.
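
For example, a minimal ~/.scrapy-do.cfg could look like this; the URL and user name are, of course, just placeholders:

[scrapy-do]
url = https://scrapy-do.example.com:7654
username = alice
print-format = grid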

Commands and their parameters

status

Get information about the daemon and its environment.

Example:

$ scrapy-do-cl status
+-----------------+----------------------------+
| key             | value                      |
|-----------------+----------------------------|
| cpu-usage       | 0.0                        |
| memory-usage    | 42.9765625                 |
| jobs-canceled   | 0                          |
| timezone        | UTC; UTC                   |
| uptime          | 1m 12d 8h 44m 58s          |
| jobs-run        | 761                        |
| status          | ok                         |
| jobs-failed     | 0                          |
| hostname        | ip-172-31-35-215           |
| time            | 2018-01-27 08:28:55.625109 |
| jobs-successful | 761                        |
+-----------------+----------------------------+
push-project

Push a project archive to the server, replacing any existing project of the same name.

Parameters:

  • --project-path - path to the project that you intend to push; defaults to the current working directory

Example:

$ scrapy-do-cl push-project
+----------------+
| quotesbot      |
|----------------|
| toscrape-css   |
| toscrape-xpath |
+----------------+
list-projects

Get a list of the projects registered with the server.

Example:

$ scrapy-do-cl list-projects
+-----------+
| name      |
|-----------|
| quotesbot |
+-----------+
list-spiders

List spiders provided by the given project.

Parameters:

  • --project - name of the project

Example:

$ scrapy-do-cl list-spiders --project quotesbot
+----------------+
| name           |
|----------------|
| toscrape-css   |
| toscrape-xpath |
+----------------+
schedule-job

Schedule a job.

Parameters:

  • --project - name of the project
  • --spider - name of the spider
  • --when - a scheduling spec, see Scheduling Specs; defaults to now

Example:

$ scrapy-do-cl schedule-job --project quotesbot \
    --spider toscrape-css --when 'every 10 minutes'
+--------------------------------------+
| identifier                           |
|--------------------------------------|
| 2abf7ff5-f5fe-47d2-96cd-750f8701aa27 |
+--------------------------------------+
list-jobs

Get information about a job or jobs.

Parameters:

  • --status - status of the jobs to list, see Jobs; additionally, ACTIVE and COMPLETED are accepted to get lists of jobs with the related statuses; defaults to ACTIVE
  • --job-id - id of the job to list; supersedes --status

Query by status:

$ scrapy-do-cl list-jobs --status SCHEDULED
+--------------------------------------+-----------+--------------+-----------+-----------------------+---------+----------------------------+------------+
| identifier                           | project   | spider       | status    | schedule              | actor   | timestamp                  | duration   |
|--------------------------------------+-----------+--------------+-----------+-----------------------+---------+----------------------------+------------|
| 2abf7ff5-f5fe-47d2-96cd-750f8701aa27 | quotesbot | toscrape-css | SCHEDULED | every 10 minutes      | USER    | 2018-01-27 09:44:19.764036 |            |
| 0a3db618-d8e1-48dc-a557-4e8d705d599c | quotesbot | toscrape-css | SCHEDULED | every 5 to 15 minutes | USER    | 2018-01-27 08:29:24.749770 |            |
+--------------------------------------+-----------+--------------+-----------+-----------------------+---------+----------------------------+------------+

Query by id:

$ scrapy-do-cl list-jobs --job-id 2abf7ff5-f5fe-47d2-96cd-750f8701aa27
+--------------------------------------+-----------+--------------+-----------+------------------+---------+----------------------------+------------+
| identifier                           | project   | spider       | status    | schedule         | actor   | timestamp                  | duration   |
|--------------------------------------+-----------+--------------+-----------+------------------+---------+----------------------------+------------|
| 2abf7ff5-f5fe-47d2-96cd-750f8701aa27 | quotesbot | toscrape-css | SCHEDULED | every 10 minutes | USER    | 2018-01-27 09:44:19.764036 |            |
+--------------------------------------+-----------+--------------+-----------+------------------+---------+----------------------------+------------+
cancel-job

Cancel a job.

Parameters:

  • --job-id - id of the job to cancel

Example:

$ scrapy-do-cl cancel-job --job-id 2abf7ff5-f5fe-47d2-96cd-750f8701aa27
Canceled.
get-log

Retrieve the log file of a job that has either completed or is still running.

Parameters:

  • --job-id - id of the job
  • --log-type - out for standard output; err for standard error output

Example:

$ scrapy-do-cl get-log --job-id b37be5b0-24bc-4c3c-bfa8-3c8e305fd9a3 \
    --log-type err
remove-project

Remove a project.

Parameters:

  • --project - name of the project

Example:

$ scrapy-do-cl remove-project --project quotesbot
Removed.

Web GUI

Scrapy Do comes with a simple web user interface that provides functionality equivalent to that of the command line client and the REST API.

Dashboard

The dashboard shows the status of the running daemon and some of the job statistics.

Projects

The projects view lists all the projects and the spiders they provide. You can push new projects or schedule jobs for the existing spiders.

Active Jobs

The active jobs view lists all the jobs that are either scheduled, pending, or running.

Completed Jobs

The completed jobs view lists all the jobs that have been completed and their logs if available.

REST API

This section describes the REST API provided by Scrapy Do. The responses to all of the requests except for get-log are JSON dictionaries. Error responses look like this:

{
  "msg": "Error message",
  "status": "error"
}

Successful responses have the status field set to ok and a variety of query-dependent keys described below. The request examples use curl and jq.
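
If the daemon is configured with auth = on, the same requests additionally need digest credentials; for example (the user name is hypothetical, and curl will prompt for the password):

$ curl -s --digest -u alice "http://localhost:7654/status.json" | jq -r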

status.json

Get information about the daemon and its environment.

  • Method: GET

Example request:

$ curl -s "http://localhost:7654/status.json" | jq -r
{
  "status": "ok",
  "memory-usage": 39.89453125,
  "cpu-usage": 0,
  "time": "2017-12-11 15:20:42.415793",
  "timezone": "CET; CEST",
  "hostname": "host",
  "uptime": "1d 12m 24s",
  "jobs-run": 24,
  "jobs-successful": 24,
  "jobs-failed": 0,
  "jobs-canceled": 0
}

push-project.json

Push a project archive to the server, replacing any existing project of the same name.

  • Method: POST

  • Parameters:

    • archive - a binary buffer containing the project archive
    $ curl -s http://localhost:7654/push-project.json \
           -F archive=@quotesbot.zip | jq -r
    
    {
      "status": "ok",
      "name": "quotesbot",
      "spiders": [
        "toscrape-css",
        "toscrape-xpath"
      ]
    }
    

list-projects.json

Get a list of the projects registered with the server.

  • Method: GET

    $ curl -s http://localhost:7654/list-projects.json | jq -r
    
    {
      "status": "ok",
      "projects": [
        "quotesbot"
      ]
    }
    

list-spiders.json

List spiders provided by the given project.

  • Method: GET

  • Parameters:

    • project - name of the project
    $ curl -s "http://localhost:7654/list-spiders.json?project=quotesbot" | jq -r
    
    {
      "status": "ok",
      "project": "quotesbot",
      "spiders": [
        "toscrape-css",
        "toscrape-xpath"
      ]
    }
    

schedule-job.json

Schedule a job.

  • Method: POST

  • Parameters:

    • project - name of the project
    • spider - name of the spider
    • when - a scheduling spec, see Scheduling Specs.
    $ curl -s http://localhost:7654/schedule-job.json \
           -F project=quotesbot \
           -F spider=toscrape-css \
           -F "when=every 10 minutes" | jq -r
    
    {
      "status": "ok",
      "identifier": "5b30c8a2-42e5-4ad5-b143-4cb0420955a5"
    }
    

list-jobs.json

Get information about a job or jobs.

  • Method: GET
  • Parameters (one required):
    • status - status of the jobs to list, see Jobs; additionally, ACTIVE and COMPLETED are accepted to get lists of jobs with the related statuses.
    • id - id of the job to list

Query by status:

$ curl -s "http://localhost:7654/list-jobs.json?status=ACTIVE" | jq -r
{
  "status": "ok",
  "jobs": [
    {
      "identifier": "5b30c8a2-42e5-4ad5-b143-4cb0420955a5",
      "status": "SCHEDULED",
      "actor": "USER",
      "schedule": "every 10 minutes",
      "project": "quotesbot",
      "spider": "toscrape-css",
      "timestamp": "2017-12-11 15:34:13.008996",
      "duration": null
    },
    {
      "identifier": "451e6083-54cd-4628-bc5d-b80e6da30e72",
      "status": "SCHEDULED",
      "actor": "USER",
      "schedule": "every minute",
      "project": "quotesbot",
      "spider": "toscrape-css",
      "timestamp": "2017-12-09 20:53:31.219428",
      "duration": null
    }
  ]
}

Query by id:

$ curl -s "http://localhost:7654/list-jobs.json?id=317d71ea-ddea-444b-bb3f-f39d82855e19" | jq -r
{
  "status": "ok",
  "jobs": [
    {
      "identifier": "317d71ea-ddea-444b-bb3f-f39d82855e19",
      "status": "SUCCESSFUL",
      "actor": "SCHEDULER",
      "schedule": "now",
      "project": "quotesbot",
      "spider": "toscrape-css",
      "timestamp": "2017-12-11 15:40:39.621948",
      "duration": 2
    }
  ]
}

cancel-job.json

Cancel a job.

  • Method: POST

  • Parameters:

    • id - id of the job to cancel
    $ curl -s http://localhost:7654/cancel-job.json \
           -F id=451e6083-54cd-4628-bc5d-b80e6da30e72 | jq -r
    
    {
      "status": "ok"
    }
    

get-log

Retrieve the log file of a job that has either completed or is still running.

  • Method: GET

Get the log of the standard output:

$ curl -s http://localhost:7654/get-log/data/bf825a9e-b0c6-4c52-89f6-b5c8209e7977.out

Get the log of the standard error output:

$ curl -s http://localhost:7654/get-log/data/bf825a9e-b0c6-4c52-89f6-b5c8209e7977.err
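
To save a log to a file instead of printing it to the terminal, you can use curl’s -o option; the output file name is arbitrary:

$ curl -s -o job.log http://localhost:7654/get-log/data/bf825a9e-b0c6-4c52-89f6-b5c8209e7977.err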

remove-project.json

Remove a project.

  • Method: POST

  • Parameters:

    • name - name of the project
    $ curl -s http://localhost:7654/remove-project.json \
           -F name=quotesbot | jq -r
    
    {
      "status": "ok"
    }
    
