Crawler -- mm2_crawler
======================

Overview
--------

The crawler checks whether the content of every available mirror is up to
date. It only scans mirrors which are not disabled by any of the flags
host.user_active, host.admin_active, site.user_active or site.admin_active.

The crawler retrieves the list of active mirrors from the database and
spawns a thread for each host. Crawling a host consists of a loop over all
categories which have been configured for this host; the categories are
crawled one after another.

The crawler can crawl a category via RSYNC, HTTP or FTP. For each category
the protocol is selected from the protocols available for that category.
RSYNC has the highest priority, followed by HTTP. FTP is used only if no
other protocol is available.
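
The priority order described above can be sketched as follows. This is only
an illustration and not the crawler's actual code: the function name
``select_crawl_url`` and the idea of deriving the protocol from the URL scheme
are assumptions made for this example.

.. code-block:: python

    from urllib.parse import urlparse

    # Priority order taken from the text: RSYNC first, then HTTP, FTP last.
    PROTOCOL_PRIORITY = ("rsync", "http", "ftp")


    def select_crawl_url(urls):
        """Return the URL with the highest-priority scheme, or None.

        Hypothetical helper: 'urls' stands for the URLs configured for
        one category of one host.
        """
        by_scheme = {}
        for url in urls:
            scheme = urlparse(url).scheme.lower()
            # Keep the first URL seen for each scheme.
            by_scheme.setdefault(scheme, url)
        for scheme in PROTOCOL_PRIORITY:
            if scheme in by_scheme:
                return by_scheme[scheme]
        return None

With an FTP and an HTTP URL available the HTTP URL is returned; the FTP URL
is only returned if it is the sole candidate.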

After the crawl has finished (successfully or unsuccessfully) the duration
and the timestamp of the crawl are recorded in the host. After the crawl of
each category the files which are up to date in the category are stored
in the database. Crawl failures (timeouts, network problems) usually result
in the whole mirror being marked as not up to date. Multiple consecutive
crawl failures (default 4) will disable the host completely (host.user_active).
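
The handling of consecutive failures can be pictured with the following
minimal sketch. The ``Host`` class, its ``crawl_failures`` field and the
function ``record_crawl_result`` are hypothetical names chosen for this
example; resetting the counter on success is an assumption that follows from
the word "consecutive".

.. code-block:: python

    from dataclasses import dataclass

    MAX_CONSECUTIVE_FAILURES = 4  # default mentioned in the text


    @dataclass
    class Host:
        """Hypothetical stand-in for the host database object."""
        user_active: bool = True
        crawl_failures: int = 0  # assumed counter field


    def record_crawl_result(host, success, threshold=MAX_CONSECUTIVE_FAILURES):
        """Update the failure counter and disable the host if needed."""
        if success:
            # Assumption: a successful crawl resets the counter.
            host.crawl_failures = 0
            return
        host.crawl_failures += 1
        if host.crawl_failures >= threshold:
            host.user_active = False

Three failures followed by a success leave the host enabled; four failures
in a row clear ``user_active``.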

The crawler requires enormous amounts of memory: with 40 threads crawling
mirrors in parallel at least 32GB of memory are required. At the end of
each crawl thread the garbage collector is called explicitly in the hope
that some unused memory is freed again. As of April 2015, Fedora's
MirrorManager installation has around 250 active mirrors which are crawled.
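
A rough sketch of this threading pattern is shown below. It is not the
crawler's implementation: the names ``crawl_all`` and ``crawl_host`` are
hypothetical, and limiting the number of parallel threads with a semaphore
is an assumption made for this example.

.. code-block:: python

    import gc
    import threading


    def crawl_all(hosts, crawl_host, max_threads=40):
        """Crawl every host in its own thread, at most max_threads at once."""
        slots = threading.BoundedSemaphore(max_threads)
        lock = threading.Lock()
        results = {}

        def worker(host):
            try:
                outcome = crawl_host(host)
                with lock:
                    results[host] = outcome
            finally:
                # Explicit garbage collection at the end of each crawl thread.
                gc.collect()
                slots.release()

        threads = []
        for host in hosts:
            slots.acquire()  # blocks until one of the slots is free
            thread = threading.Thread(target=worker, args=(host,))
            thread.start()
            threads.append(thread)
        for thread in threads:
            thread.join()
        return results

``gc.collect()`` only frees objects that are no longer referenced, so it
cannot guarantee that memory is returned to the operating system.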

Crawling protocol
-----------------

As mentioned above, the crawler uses either RSYNC, HTTP or FTP to crawl
a mirror, and it spawns a thread for each host it has to crawl, which then
crawls the configured categories one after another. The protocol decision is
made for each category, so it is possible to crawl different categories with
different protocols.

If the category supports RSYNC the whole category is scanned using RSYNC with
a single network connection. If the crawler is able to find all files the
category is marked as up to date and the next category follows. If no RSYNC
URL is available the crawler uses HTTP or FTP. FTP requires one network
connection per directory. Using HTTP each file is checked separately;
depending on the configuration of the remote host this can mean one network
connection per file or, if the remote host supports HTTP keep-alive, one
connection for multiple files or the whole category. Whether keep-alive can
be used is detected automatically.

Using HTTP most of the files are only checked with an HTTP HEAD request, so
only the metadata is retrieved and not the actual data of the file. Only
files with the name 'repomd.xml' are downloaded completely and their SHA256
sum is compared to the one in the database.
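
The two kinds of HTTP checks can be illustrated with the Python standard
library. This is a sketch under assumptions: the function names
``head_metadata`` and ``sha256_matches`` are made up for this example, and
which metadata fields the crawler actually evaluates is not stated here, so
``Content-Length`` is only an example.

.. code-block:: python

    import hashlib
    import http.client


    def head_metadata(host, path, timeout=30):
        """Send an HTTP HEAD request and return (status, Content-Length).

        No file data is transferred, only the response headers.
        """
        conn = http.client.HTTPConnection(host, timeout=timeout)
        try:
            conn.request("HEAD", path)
            response = conn.getresponse()
            response.read()  # a HEAD response has no body; finish it cleanly
            return response.status, response.getheader("Content-Length")
        finally:
            conn.close()


    def sha256_matches(content, expected_hexdigest):
        """Compare the SHA256 sum of downloaded bytes with a stored value."""
        return hashlib.sha256(content).hexdigest() == expected_hexdigest.lower()

For a 'repomd.xml' file the complete content would be fetched with a GET
request and passed to ``sha256_matches`` together with the sum stored in the
database.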

Timeouts
--------

The crawler uses timeouts at different points during its runtime.
There is the per host timeout (default 120 minutes) and the RSYNC timeout
of 4 hours. The RSYNC timeout is passed to RSYNC as '--timeout=14400'.
According to RSYNC's man page this means:

   This option allows you to set a maximum I/O timeout in seconds.
   If no data is transferred for the specified time then rsync will exit.

In contrast to the other timeouts in the crawler this has no direct influence
on the crawl duration. It only serves to clean up RSYNC processes which have
stalled.
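
As an illustration, an RSYNC invocation with this timeout could be assembled
as shown below. Only '--timeout=14400' is taken from the text; the other
options (``--no-motd``, ``-r``) and the function names are assumptions for
this sketch and may differ from what the crawler really passes.

.. code-block:: python

    import subprocess

    RSYNC_TIMEOUT_SECONDS = 4 * 60 * 60  # 4 hours = 14400 seconds


    def build_rsync_command(url, timeout=RSYNC_TIMEOUT_SECONDS):
        """Return the argument list for a recursive RSYNC listing of url."""
        return ["rsync", "--no-motd", "-r", f"--timeout={timeout}", url]


    def list_category(url):
        """Run RSYNC without a destination, which lists the remote files."""
        return subprocess.run(
            build_rsync_command(url),
            capture_output=True,
            text=True,
            check=False,  # the caller inspects returncode itself
        )

Because '--timeout' is an I/O timeout, a listing that keeps transferring data
is never interrupted by it, however long it takes.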

The per host timeout is enforced after each network operation. After each
HTTP and FTP operation the timeout check function is called and if the
timeout is reached a timeout exception is raised. As a consequence all
categories of this host are marked as not being up to date and (by default)
4 consecutive timeout failures will automatically disable this host
(host.user_active).
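
A check of this kind could look like the following sketch. The class names
``HostTimer`` and ``CrawlTimeout`` are hypothetical, and the injectable
``clock`` parameter exists only to make the example testable.

.. code-block:: python

    import time


    class CrawlTimeout(Exception):
        """Raised when the per host timeout has been exceeded."""


    class HostTimer:
        """Track the per host time limit (default 120 minutes)."""

        def __init__(self, limit_minutes=120, clock=time.monotonic):
            self._clock = clock
            self._deadline = clock() + limit_minutes * 60

        def check(self):
            """Call after each HTTP or FTP operation."""
            if self._clock() > self._deadline:
                raise CrawlTimeout("per host timeout reached")

Since the check only runs between network operations, a single operation
that blocks for a long time can push the crawl past the limit before the
exception is raised.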

Additionally each FTP and HTTP connection has a default timeout, which is
specified when instantiating the corresponding transport class (ftplib,
httplib). According to the Python documentation this timeout works as follows:

    If the optional timeout parameter is given, blocking operations
    (like connection attempts) will timeout after that many seconds.
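
The following sketch shows where such a timeout is passed. In Python 3 the
httplib module is named ``http.client``; the value of 30 seconds and the host
name are placeholders for this example and not the crawler's defaults.

.. code-block:: python

    import ftplib
    import http.client

    CONNECTION_TIMEOUT = 30  # seconds; placeholder value

    # Neither constructor opens a network connection here: FTP only connects
    # when a host is given or connect() is called, and HTTPConnection connects
    # on the first request.
    ftp = ftplib.FTP(timeout=CONNECTION_TIMEOUT)
    http_conn = http.client.HTTPConnection(
        "mirror.example.org", timeout=CONNECTION_TIMEOUT
    )

The timeout applies to each blocking socket operation separately, so it does
not limit the total time spent on a connection.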

In combination with the different ways of crawling (FTP: whole directories,
HTTP: single files or multiple files using HTTP keep-alive) the behaviour
of the timeout values depends on the host configuration and the URLs
provided as possible crawl URLs.

Additional discussion about timeouts can be found here:

   https://github.com/fedora-infra/mirrormanager2/pull/53