Basic Concepts¶
Projects¶
Scrapy Do handles zipped scrapy projects. The only expectation it has about the structure of the archive is that it contains a directory whose name is the same as the name of the project. This directory, in turn, includes the Scrapy project itself. Doing things this way ends up being quite convenient if you use version control like git to manage the code of your spiders (which you probably should). Let’s consider the quotesbot:
$ git clone https://github.com/scrapy/quotesbot.git $ cd quotesbot
You can create a valid archive like this:
$ git archive master -o quotesbot.zip --prefix=quotesbot/
You can, of course, create the zip file any way you wish as long as it meets the criteria described above.
Jobs¶
When you submit a job, it will end up being classified as either SCHEDULED
or
PENDING
depending on the scheduling spec you provide.
Any PENDING
job will be picked up for execution as soon as there is a free job
slot and its status will be changed to RUNNING
. SCHEDULED
jobs spawn
new PENDING
jobs at the intervals specified in the scheduling spec. A
RUNNING
job may end up being SUCCESSFUL
, FAILED
, or CANCELED
depending on the return code of the spider process or your actions.
Scheduling Specs¶
Scrapy Do uses the excellent Schedule
library to handle scheduled jobs. The user-supplied scheduling specs get
translated to a series of calls to the schedule
library. Therefore, whatever
is valid for this library should be a valid scheduling spec. For example:
- ‘every monday at 12:30’
- ‘every 2 to 3 hours’
- ‘every 6 minutes’
- ‘every hour at 00:15’
are all valid. A scheduling spec must start with either: ‘every’ or ‘now’. The
former will result in creating a SCHEDULED
job while the latter will produce
a PENDING
job for immediate execution. Other valid keywords are:
second
seconds
minute
minutes
hour
hours
day
days
week
weeks
monday
tuesday
wednesday
thursday
friday
saturday
sunday
at
- expects an hour-like parameter immediately afterwards (ie. 12:12)to
- expects an integer immediately afterwards