Welcome to Scrapy Do’s documentation!¶
Scrapy Do is a daemon that provides a convenient way to run Scrapy spiders. It can run them once, immediately, or periodically, at specified time intervals. It has been inspired by scrapyd but written from scratch. It comes with a REST API, a command line client, and an interactive web interface.
- Homepage: https://jany.st/scrapy-do.html
- The code: https://github.com/ljanyst/scrapy-do
Contents¶
Quick Start¶
Install scrapy-do using pip:
$ pip install scrapy-do
Start the daemon in the foreground:
$ scrapy-do -n scrapy-do
Open another terminal window, download Scrapy's quotesbot example, and push the code to the server:
$ git clone https://github.com/scrapy/quotesbot.git
$ cd quotesbot
$ scrapy-do-cl push-project
+----------------+
| quotesbot      |
|----------------|
| toscrape-css   |
| toscrape-xpath |
+----------------+
Schedule some jobs:
$ scrapy-do-cl schedule-job --project quotesbot \
    --spider toscrape-css --when 'every 5 to 15 minutes'
+--------------------------------------+
| identifier                           |
|--------------------------------------|
| 0a3db618-d8e1-48dc-a557-4e8d705d599c |
+--------------------------------------+

$ scrapy-do-cl schedule-job --project quotesbot --spider toscrape-css
+--------------------------------------+
| identifier                           |
|--------------------------------------|
| b3a61347-92ef-4095-bb68-0702270a52b8 |
+--------------------------------------+
See what’s going on:
The web interface is available at http://localhost:7654 by default.
Basic Concepts¶
Projects¶
Scrapy Do handles zipped scrapy projects. The only expectation it has about the structure of the archive is that it contains a directory whose name is the same as the name of the project. This directory, in turn, includes the Scrapy project itself. Doing things this way ends up being quite convenient if you use version control like git to manage the code of your spiders (which you probably should). Let’s consider the quotesbot:
$ git clone https://github.com/scrapy/quotesbot.git
$ cd quotesbot
You can create a valid archive like this:
$ git archive master -o quotesbot.zip --prefix=quotesbot/
You can, of course, create the zip file any way you wish as long as it meets the criteria described above.
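If you would rather not rely on git archive, the same layout is easy to produce programmatically. Below is a minimal sketch in Python (the directory and file names are placeholders matching the example above): it walks the project directory and stores every file under a quotesbot/ prefix inside the zip, which is all Scrapy Do expects.

import os
import zipfile

def make_project_archive(project_dir="quotesbot", output="quotesbot.zip"):
    # Store every file under "<project-name>/..." inside the archive,
    # which is the structure Scrapy Do expects.
    prefix = os.path.basename(os.path.normpath(project_dir))
    with zipfile.ZipFile(output, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(project_dir):
            for name in files:
                path = os.path.join(root, name)
                arcname = os.path.join(prefix, os.path.relpath(path, project_dir))
                archive.write(path, arcname)

if __name__ == "__main__":
    make_project_archive()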
Jobs¶
When you submit a job, it will end up being classified as either SCHEDULED or PENDING depending on the scheduling spec you provide. Any PENDING job will be picked up for execution as soon as there is a free job slot, and its status will be changed to RUNNING. SCHEDULED jobs spawn new PENDING jobs at the intervals specified in the scheduling spec. A RUNNING job may end up being SUCCESSFUL, FAILED, or CANCELED depending on the return code of the spider process or your actions.
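You can watch these transitions through the REST API described later in this document. The snippet below is only an illustrative sketch, assuming a daemon on localhost:7654, the quotesbot project already pushed, and the requests library installed: it schedules an immediate job and polls its status until the job reaches a terminal state.

import time
import requests

BASE = "http://localhost:7654"

# Submit a job for immediate execution; it starts out as PENDING.
response = requests.post(
    BASE + "/schedule-job.json",
    # Send multipart form data, mirroring curl's -F flags.
    files={"project": (None, "quotesbot"),
           "spider": (None, "toscrape-css"),
           "when": (None, "now")})
job_id = response.json()["identifier"]

# Poll until the job leaves the PENDING/RUNNING states.
while True:
    jobs = requests.get(BASE + "/list-jobs.json", params={"id": job_id})
    status = jobs.json()["jobs"][0]["status"]
    print(status)
    if status in ("SUCCESSFUL", "FAILED", "CANCELED"):
        break
    time.sleep(5)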
Scheduling Specs¶
Scrapy Do uses the excellent Schedule library to handle scheduled jobs. The user-supplied scheduling specs get translated to a series of calls to the schedule library. Therefore, whatever is valid for this library should be a valid scheduling spec. For example:
- ‘every monday at 12:30’
- ‘every 2 to 3 hours’
- ‘every 6 minutes’
- ‘every hour at 00:15’
are all valid. A scheduling spec must start with either 'every' or 'now'. The former will result in creating a SCHEDULED job, while the latter will produce a PENDING job for immediate execution. Other valid keywords are:
second
seconds
minute
minutes
hour
hours
day
days
week
weeks
monday
tuesday
wednesday
thursday
friday
saturday
sunday
at - expects an hour-like parameter immediately afterwards (i.e., 12:12)
to - expects an integer immediately afterwards
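For reference, the sketch below illustrates roughly how specs like the ones above correspond to calls to the schedule library; it is only an approximation for intuition, not Scrapy Do's actual translation code.

import time
import schedule

def trigger_spider():
    # In Scrapy Do, this is where a new PENDING job would be spawned.
    print("spider run triggered")

# 'every monday at 12:30'
schedule.every().monday.at("12:30").do(trigger_spider)

# 'every 2 to 3 hours'
schedule.every(2).to(3).hours.do(trigger_spider)

# 'every 6 minutes'
schedule.every(6).minutes.do(trigger_spider)

while True:
    schedule.run_pending()
    time.sleep(1)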
Installation¶
The easy way¶
The easiest way to install Scrapy Do is using pip. You can then create a directory where you want your project data stored and just start the daemon there.
$ pip install scrapy-do
$ mkdir /home/user/my-scrapy-do-data
$ cd /home/user/my-scrapy-do-data
$ scrapy-do scrapy-do
Yup, you need to type scrapy-do twice. That's how Twisted works, don't ask me. After doing that, you will see some content in this directory, including the log file and the pidfile of the Scrapy Do daemon.
A systemd service¶
Installing Scrapy Do as a systemd service is a far better idea than the easy way described above. It’s a bit of work that should really be done by a proper Debian/Ubuntu package, but we do not have one for the time being, so I will show you how to do it “by hand.”
Although not strictly necessary, it's a good practice to run the daemon under a separate user account. I will create one called pydaemon because I run a couple more Python daemons this way:
$ sudo useradd -m -d /opt/pydaemon pydaemon
Make sure you have all of the following packages installed:
$ sudo apt-get install python3 python3-dev python3-virtualenv
$ sudo apt-get install build-essential
Switch your session to this new user account:
$ sudo su - pydaemon
Create the virtual env and install Scrapy Do:
$ mkdir virtualenv
$ cd virtualenv/
$ python3 /usr/lib/python3/dist-packages/virtualenv.py -p /usr/bin/python3 .
$ . ./bin/activate
$ pip install scrapy-do
$ cd ..
Create a bin directory and a wrapper script that will set up the virtualenv on startup:
$ mkdir bin
$ cat > bin/scrapy-do << EOF
> #!/bin/bash
> . /opt/pydaemon/virtualenv/bin/activate
> exec /opt/pydaemon/virtualenv/bin/scrapy-do "\${@}"
> EOF
$ chmod 755 bin/scrapy-do
Create a data directory and a configuration file:
$ mkdir -p data/scrapy-do
$ mkdir etc
$ cat > etc/scrapy-do.conf << EOF
> [scrapy-do]
> project-store = /opt/pydaemon/data/scrapy-do
> EOF
As root, create the following file with the following content:
# cat > /etc/systemd/system/scrapy-do.service << EOF
> [Unit]
> Description=Scrapy Do Service
>
> [Service]
> ExecStart=/opt/pydaemon/bin/scrapy-do --nodaemon --pidfile= \
> scrapy-do --config /opt/pydaemon/etc/scrapy-do.conf
> User=pydaemon
> Group=pydaemon
> Restart=always
>
> [Install]
> WantedBy=multi-user.target
> EOF
You can then reload the systemd configuration and let it manage the Scrapy Do daemon:
$ sudo systemctl daemon-reload
$ sudo systemctl start scrapy-do
$ sudo systemctl enable scrapy-do
Finally, you should now be able to see that the daemon is running:
$ sudo systemctl status scrapy-do
● scrapy-do.service - Scrapy Do Service
   Loaded: loaded (/etc/systemd/system/scrapy-do.service; enabled; vendor preset: enabled)
   Active: active (running) since Sun 2017-12-10 22:42:55 UTC; 4min 23s ago
 Main PID: 27543 (scrapy-do)
...
I know it's awfully complicated. I will do some packaging work when I have a spare moment.
Server Configuration¶
You can pass a configuration file to the Scrapy Do daemon in the following way:
$ scrapy-do scrapy-do --config /path/to/config/file.conf
The remaining part of this section describes the meaning of the configurable parameters.
[scrapy-do] section¶
- project-store: A directory where all the state of the Scrapy Do daemon is stored. Defaults to projects, meaning that it will use a subdirectory of the current working directory.
- job-slots: The number of jobs that can run in parallel. Defaults to 3.
- completed-cap: The number of completed jobs to keep. All the jobs that exceed the cap and their log files will be purged. Older jobs are purged first. Defaults to 50.
[web] section¶
- interfaces: A whitespace-separated list of address-port pairs to listen on. Use the RFC 3986 notation to specify IPv6 addresses, e.g., [::1]:7654. Defaults to 127.0.0.1:7654.
- https: The HTTPS switch. Defaults to off.
- key: Path to your certificate key. Defaults to scrapy-do.key.
- cert: Path to your certificate. Defaults to scrapy-do.crt.
- chain: Path to a file containing additional certificates in the chain of trust. Useful when using Let's Encrypt, because their signing certificate is trusted by browsers but not by the OS itself, which leads to command-line tools like wget or curl failing to verify the certificate. Defaults to an empty string.
- auth: The authentication switch. Scrapy Do uses the digest authentication method and will not transmit your password over the network. Therefore, it's safe to use even without TLS. Defaults to off.
- auth-db: Path to your authentication database file. The file contains username-password pairs, one per line, with the user and password parts separated by a colon (:), e.g., myusername:mypassword. Please note that digest authentication requires the server to know the actual password, not a hash. Defaults to auth.db.
Example configuration¶
[scrapy-do]
project-store = /var/scrapy-do
job-slots = 5
completed-cap = 250

[web]
interfaces = 10.8.0.1:9999 [2001:db8::fa]:7654
https = on
key = /etc/scrapy-do/scrapy-do.key
cert = /etc/scrapy-do/scrapy-do.crt
chain = /etc/scrapy-do/scrapy-do-chain.pem
auth = on
auth-db = /etc/scrapy-do/auth.db
Command Line Client¶
The command line client is a thin wrapper over the REST API. Its purpose
is to make the command invocations more succinct and to format the responses.
The command name is scrapy-do-cl. It is followed by a bunch of optional global parameters, the name of the command to be executed, and the command's parameters:
scrapy-do-cl [global parameters] command [command parameters]
Global parameters and the configuration file¶
- --url - the URL of the scrapy-do server, e.g., http://localhost:7654
- --username - user name, in case the server is configured to perform authentication
- --password - user password; if the password is not specified and was not configured in the configuration file, the user will be prompted to type it in the terminal
- --print-format - the format of the output; valid options are simple, grid, fancy_grid, presto, psql, pipe, orgtbl, jira, rst, mediawiki, html, latex; defaults to psql
- --verify-ssl - a boolean determining whether SSL certificate checking should be enabled; defaults to True
The defaults for some of these parameters may be specified in the scrapy-do section of the ~/.scrapy-do.cfg file. The parameters configurable this way are: url, username, password, and print-format.
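For example, a ~/.scrapy-do.cfg along these lines (the values are placeholders) saves you from repeating the URL and credentials with every invocation:

[scrapy-do]
url = https://scrapy-do.example.com:7654
username = myusername
password = mypassword
print-format = fancy_grid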
Commands and their parameters¶
status¶
Get information about the daemon and its environment.
Example:
$ scrapy-do-cl status
+-----------------+----------------------------+
| key             | value                      |
|-----------------+----------------------------|
| cpu-usage       | 0.0                        |
| memory-usage    | 42.9765625                 |
| jobs-canceled   | 0                          |
| timezone        | UTC; UTC                   |
| uptime          | 1m 12d 8h 44m 58s          |
| jobs-run        | 761                        |
| status          | ok                         |
| jobs-failed     | 0                          |
| hostname        | ip-172-31-35-215           |
| time            | 2018-01-27 08:28:55.625109 |
| jobs-successful | 761                        |
+-----------------+----------------------------+
push-project¶
Push a project archive to the server replacing an existing one of the same name if it is already present.
Parameters:
- --project-path - path to the project that you intend to push; defaults to the current working directory
Example:
$ scrapy-do-cl push-project
+----------------+
| quotesbot      |
|----------------|
| toscrape-css   |
| toscrape-xpath |
+----------------+
list-projects¶
Get a list of the projects registered with the server.
Example:
$ scrapy-do-cl list-projects
+-----------+
| name      |
|-----------|
| quotesbot |
+-----------+
list-spiders¶
List spiders provided by the given project.
Parameters:
- --project - name of the project
Example:
$ scrapy-do-cl list-spiders --project quotesbot
+----------------+
| name           |
|----------------|
| toscrape-css   |
| toscrape-xpath |
+----------------+
schedule-job¶
Schedule a job.
Parameters:
- --project - name of the project
- --spider - name of the spider
- --when - a scheduling spec, see Scheduling Specs; defaults to now
Example:
$ scrapy-do-cl schedule-job --project quotesbot \
    --spider toscrape-css --when 'every 10 minutes'
+--------------------------------------+
| identifier                           |
|--------------------------------------|
| 2abf7ff5-f5fe-47d2-96cd-750f8701aa27 |
+--------------------------------------+
list-jobs¶
Get information about a job or jobs.
Parameters:
- --status - status of the jobs to list, see Jobs; additionally ACTIVE and COMPLETED are accepted to get lists of jobs with related statuses; defaults to ACTIVE
- --job-id - id of the job to list; supersedes --status
Query by status:
$ scrapy-do-cl list-jobs --status SCHEDULED
+--------------------------------------+-----------+--------------+-----------+-----------------------+---------+----------------------------+------------+
| identifier                           | project   | spider       | status    | schedule              | actor   | timestamp                  | duration   |
|--------------------------------------+-----------+--------------+-----------+-----------------------+---------+----------------------------+------------|
| 2abf7ff5-f5fe-47d2-96cd-750f8701aa27 | quotesbot | toscrape-css | SCHEDULED | every 10 minutes      | USER    | 2018-01-27 09:44:19.764036 |            |
| 0a3db618-d8e1-48dc-a557-4e8d705d599c | quotesbot | toscrape-css | SCHEDULED | every 5 to 15 minutes | USER    | 2018-01-27 08:29:24.749770 |            |
+--------------------------------------+-----------+--------------+-----------+-----------------------+---------+----------------------------+------------+
Query by id:
$ scrapy-do-cl list-jobs --job-id 2abf7ff5-f5fe-47d2-96cd-750f8701aa27
+--------------------------------------+-----------+--------------+-----------+------------------+---------+----------------------------+------------+
| identifier                           | project   | spider       | status    | schedule         | actor   | timestamp                  | duration   |
|--------------------------------------+-----------+--------------+-----------+------------------+---------+----------------------------+------------|
| 2abf7ff5-f5fe-47d2-96cd-750f8701aa27 | quotesbot | toscrape-css | SCHEDULED | every 10 minutes | USER    | 2018-01-27 09:44:19.764036 |            |
+--------------------------------------+-----------+--------------+-----------+------------------+---------+----------------------------+------------+
cancel-job¶
Cancel a job.
Parameters:
- --job-id - id of the job to cancel
Example:
$ scrapy-do-cl cancel-job --job-id 2abf7ff5-f5fe-47d2-96cd-750f8701aa27
Canceled.
get-log¶
Retrieve the log file of a job that has either completed or is still running.
Parameters:
- --job-id - id of the job
- --log-type - out for standard output; err for standard error output
Example:
$ scrapy-do-cl get-log --job-id b37be5b0-24bc-4c3c-bfa8-3c8e305fd9a3 \
    --log-type err
remove-project¶
Remove a project.
Parameters:
- --project - name of the project

Example:

$ scrapy-do-cl remove-project --project quotesbot
Removed.
Web GUI¶
Scrapy Do comes with a simple web user interface that provides functionality equivalent to that of the command line client or the REST API.
REST API¶
This section describes the REST API provided by Scrapy Do. The responses to all of the requests except for get-log are JSON dictionaries. Error responses look like this:
{ "msg": "Error message", "status": "error" }
Successful responses have the status part set to ok and a variety of query-dependent keys described below. The request examples use curl and jq.
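The same endpoints can naturally be called from any HTTP client. As a minimal sketch, assuming the requests library and a daemon on localhost:7654, the JSON endpoints can be wrapped like this, raising an error whenever the status field is not ok:

import requests

BASE = "http://localhost:7654"

def call(endpoint, method="GET", **fields):
    # Wrap the JSON endpoints; get-log is the only one that is not JSON.
    url = BASE + "/" + endpoint
    if method == "GET":
        response = requests.get(url, params=fields)
    else:
        # Send multipart form data, mirroring curl's -F flags.
        response = requests.post(url, files={k: (None, v) for k, v in fields.items()})
    data = response.json()
    if data["status"] != "ok":
        raise RuntimeError(data["msg"])
    return data

print(call("status.json")["uptime"])
print(call("list-projects.json")["projects"])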
status.json¶
Get information about the daemon and its environment.
- Method: GET
Example request:
$ curl -s "http://localhost:7654/status.json" | jq -r{ "status": "ok", "memory-usage": 39.89453125, "cpu-usage": 0, "time": "2017-12-11 15:20:42.415793", "timezone": "CET; CEST", "hostname": "host", "uptime": "1d 12m 24s", "jobs-run": 24, "jobs-successful": 24, "jobs-failed": 0, "jobs-canceled": 0 }
push-project.json¶
Push a project archive to the server replacing an existing one of the same name if it is already present.
- Method: POST
- Parameters:
  - archive - a binary buffer containing the project archive
$ curl -s http://localhost:7654/push-project.json \
    -F archive=@quotesbot.zip | jq -r
{ "status": "ok", "name": "quotesbot", "spiders": [ "toscrape-css", "toscrape-xpath" ] }
list-projects.json¶
Get a list of the projects registered with the server.
- Method: GET
$ curl -s http://localhost:7654/list-projects.json | jq -r
{ "status": "ok", "projects": [ "quotesbot" ] }
list-spiders.json¶
List spiders provided by the given project.
- Method: GET
- Parameters:
  - project - name of the project
$ curl -s "http://localhost:7654/list-spiders.json?project=quotesbot" | jq -r
{ "status": "ok", "project": "quotesbot", "spiders": [ "toscrape-css", "toscrape-xpath" ] }
schedule-job.json¶
Schedule a job.
- Method: POST
- Parameters:
  - project - name of the project
  - spider - name of the spider
  - when - a scheduling spec, see Scheduling Specs
$ curl -s http://localhost:7654/schedule-job.json \
    -F project=quotesbot \
    -F spider=toscrape-css \
    -F "when=every 10 minutes" | jq -r
{ "status": "ok", "identifier": "5b30c8a2-42e5-4ad5-b143-4cb0420955a5" }
list-jobs.json¶
Get information about a job or jobs.
- Method: GET
- Parameters (one required):
  - status - status of the jobs to list, see Jobs; additionally ACTIVE and COMPLETED are accepted to get lists of jobs with related statuses
  - id - id of the job to list
Query by status:
$ curl -s "http://localhost:7654/list-jobs.json?status=ACTIVE" | jq -r{ "status": "ok", "jobs": [ { "identifier": "5b30c8a2-42e5-4ad5-b143-4cb0420955a5", "status": "SCHEDULED", "actor": "USER", "schedule": "every 10 minutes", "project": "quotesbot", "spider": "toscrape-css", "timestamp": "2017-12-11 15:34:13.008996", "duration": null }, { "identifier": "451e6083-54cd-4628-bc5d-b80e6da30e72", "status": "SCHEDULED", "actor": "USER", "schedule": "every minute", "project": "quotesbot", "spider": "toscrape-css", "timestamp": "2017-12-09 20:53:31.219428", "duration": null } ] }
Query by id:
$ curl -s "http://localhost:7654/list-jobs.json?id=317d71ea-ddea-444b-bb3f-f39d82855e19" | jq -r{ "status": "ok", "jobs": [ { "identifier": "317d71ea-ddea-444b-bb3f-f39d82855e19", "status": "SUCCESSFUL", "actor": "SCHEDULER", "schedule": "now", "project": "quotesbot", "spider": "toscrape-css", "timestamp": "2017-12-11 15:40:39.621948", "duration": 2 } ] }
cancel-job.json¶
Cancel a job.
- Method: POST
- Parameters:
  - id - id of the job to cancel
$ curl -s http://localhost:7654/cancel-job.json \
    -F id=451e6083-54cd-4628-bc5d-b80e6da30e72 | jq -r
{ "status": "ok" }
get-log¶
Retrieve the log file of a job that has either completed or is still running.
- Method: GET
Get the log of the standard output:
$ curl -s http://localhost:7654/get-log/data/bf825a9e-b0c6-4c52-89f6-b5c8209e7977.out
Get the log of the standard error output:
$ curl -s http://localhost:7654/get-log/data/bf825a9e-b0c6-4c52-89f6-b5c8209e7977.err
remove-project.json¶
Remove a project.
- Method: POST
- Parameters:
  - name - name of the project
$ curl -s http://localhost:7654/remove-project.json \
    -F name=quotesbot | jq -r
{ "status": "ok" }