BitTorrent DHT crawling cluster
Cluster project to crawl the BitTorrent DHT network and download torrent file metadata from remote BitTorrent clients.
Run a number of p2pspider instances (crawlers) to gather DHT infohashes, download the corresponding metadata and send it to a Redis-based Celery instance (collector), which verifies the torrent files and stores them to disk.
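What travels between the two sides is a plain Celery task carrying the announced infohash and the raw metadata. A minimal sketch of what the collector-side task could look like (the task name, argument layout, base64 transport and use of the bencodepy package are assumptions for illustration, not the actual tasks.py code):

import base64
import hashlib
import os

import bencodepy            # bencoding helper, an assumption for this sketch
from celery import Celery

app = Celery('tasks', broker='redis://localhost/0')

@app.task
def save_torrent(announced_infohash, metadata_b64):
    """Verify the received torrent against its announced infohash and store it."""
    # The payload is base64-encoded so it survives Celery's JSON serializer.
    metadata = base64.b64decode(metadata_b64)
    # The infohash is the SHA-1 of the bencoded info dictionary; recompute it
    # and reject the payload if it does not match what the crawler announced.
    info = bencodepy.decode(metadata)[b'info']
    infohash = hashlib.sha1(bencodepy.encode(info)).hexdigest()
    if infohash != announced_infohash.lower():
        raise ValueError('infohash mismatch for %s' % announced_infohash)
    os.makedirs('torrents', exist_ok=True)
    with open(os.path.join('torrents', infohash + '.torrent'), 'wb') as f:
        f.write(metadata)
    return infohash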
First, install the Redis daemon and start it. Then start the Celery worker which will connect to the Redis instance on localhost:
$ cd collector
$ python3 -m venv venv
$ source venv/bin/activate
(venv) $ pip3 install -r requirements.txt
(venv) $ ./tasks.py
Collected torrent files will be saved as torrents/[INFOHASH].torrent
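The saved files are ordinary bencoded torrent files, so they can be spot-checked with any bencode library; a small sketch, again assuming the bencodepy package:

import sys

import bencodepy

# Usage: python inspect_torrent.py torrents/[INFOHASH].torrent
with open(sys.argv[1], 'rb') as f:
    info = bencodepy.decode(f.read())[b'info']

print('name:  ', info[b'name'].decode('utf-8', 'replace'))
print('pieces:', len(info[b'pieces']) // 20)  # concatenated 20-byte SHA-1 hashes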
To start a crawler:
$ cd crawler
$ npm install
$ npm start
This will start the crawler with the default broker of redis://localhost/0. To use another one, e.g. a remote host, set the BROKER environment variable:
$ BROKER=redis://[HOSTNAME]/0 npm start
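On the wire this is just a Celery producer call: the crawler enqueues one task per fetched metadata blob and never executes any task itself. A sketch of the hand-off in Python for illustration (the actual crawler is Node.js; the task name and argument layout are the same assumptions as in the collector sketch above):

import base64
import hashlib
import os

import bencodepy
from celery import Celery

# Same convention: take the broker URL from BROKER, fall back to local Redis.
app = Celery(broker=os.environ.get('BROKER', 'redis://localhost/0'))

# Dummy bencoded torrent for illustration; a real crawler sends whatever it
# fetched from a remote client.
metadata = b'd4:infod6:lengthi0e4:name3:fooee'
info = bencodepy.decode(metadata)[b'info']
infohash = hashlib.sha1(bencodepy.encode(info)).hexdigest()

# Base64 keeps the binary payload compatible with Celery's JSON serializer.
app.send_task('tasks.save_torrent',
              args=[infohash, base64.b64encode(metadata).decode('ascii')])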
Because the programs within the containers run with different uids, you may need to make the torrents target directory writable beforehand:
$ chmod a+w torrents
If you intend to crawl for a while, you may run out of inodes because of the many small torrent files. It’s a good idea to (loop) mount a filesystem with a higher inode count than the default. The usage type news creates one inode for every 4k block, which should be plenty; a small snippet for checking inode usage follows the commands below.
$ dd if=/dev/zero of=torrents.img bs=1M count=1024
$ mkfs.ext4 -T news torrents.img
# mount torrents.img torrents
# chmod -R a+w torrents
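To keep an eye on how many inodes are left on the mounted image, something like this can be run on the collector host (the torrents path is an assumption):

import os

# Report used vs. total inodes for the filesystem mounted at torrents/.
st = os.statvfs('torrents')
used = st.f_files - st.f_ffree
print('inodes: %d used / %d total (%.1f%% free)'
      % (used, st.f_files, 100.0 * st.f_ffree / st.f_files))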
Building the images and starting the cluster locally (one crawler, one collector instance):
$ docker-compose build
$ docker-compose up
To start the optional Celery Flower web monitor, issue docker ps to find the ID of the running collector container and then issue:
$ docker exec [CONTAINER-ID] supervisorctl start celery-flower
Afterwards it should be reachable via localhost:5555.
See the AWS developer guide on ECS for details. These are the basic steps for setting it up:

1. In the IAM console: set up an EC2 instance role with the AmazonEC2ContainerServiceforEC2Role policy attached.
2. In the ECS console: create two container repositories, collector and crawler.
3. Configure the aws cli by running aws configure. Then run either the aws/push-containers.sh script or follow the push instructions manually (given on the EC2 container registry site).
4. Create a collector and a crawler task using the supplied JSON task definitions. Be sure to replace the [REGISTRY-URI] field with your own ECS repository URI, and replace the [COLLECTOR] field with the private IP/hostname of the EC2 instance where the Collector container runs.
5. Create a service for each task definition. Set the Number of tasks to 0 for now.
6. Launch an EC2 instance for the cluster, passing aws/setup-ec2-instance.sh as user data.
7. Set the collector service's Number of tasks to 1.
8. Update the crawler task definition so that the BROKER env variable points to the correct collector hostname/IP.
9. Set the crawler service's Number of tasks to some number in [1, *].
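The Number of tasks settings above can also be changed from a script instead of the console, which is handy when scaling the crawlers up and down; a minimal boto3 sketch, where the cluster and service names are assumptions:

import boto3

# Scale the crawler service; 'dht-crawl' and 'crawler' are assumed names,
# replace them with whatever the cluster and service were called above.
ecs = boto3.client('ecs')
ecs.update_service(cluster='dht-crawl', service='crawler', desiredCount=4)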
To see the collector status, connect to the collector host, issue docker ps to find the ID of the running collector container and then issue:
$ docker logs -f [CONTAINER-ID]
To see only the stats, use the following. They are printed every minute and show the cumulative values of the last 10 minutes.
$ docker logs -f [CONTAINER-ID] 2>&1 | grep print_stats
For a typical session, the stats may look like this:
Received task: tasks.print_stats[17e03d9a-23b9-4be5-86c4-ebc020f672f4]
tasks.print_stats: last 10.0 mins: 1800 torrents / 51.2 MB saved
tasks.print_stats: Rate: 180.0/m, 10799.5/h, 259.2k/d
tasks.print_stats: Storage consumption rate: 5.1 MB/m, 0.3 GB/h, 7.2 GB/d
Task tasks.print_stats[17e03d9a-23b9-4be5-86c4-ebc020f672f4] succeeded in 0.04759194100006425s: {'saved': 1800, 'size': 53672251, 'save_rate': 2.999862196330241, 'size_rate': 89449.64264824889, 'running': 600.027562, 'start': 1482410075.418662}
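The human-readable rates are derived from the per-second averages in the result dict; a quick sanity check (assuming save_rate and size_rate are per-second values over the running window and that the MB/GB figures are binary units):

# Reproduce the per-minute and per-day figures from the result dict above.
stats = {'saved': 1800, 'size': 53672251,
         'save_rate': 2.999862196330241, 'size_rate': 89449.64264824889,
         'running': 600.027562}

print('%.1f torrents/m' % (stats['save_rate'] * 60))         # ~180.0/m
print('%.1f MB/m' % (stats['size_rate'] * 60 / 2 ** 20))     # ~5.1 MB/m
print('%.1f GB/d' % (stats['size_rate'] * 86400 / 2 ** 30))  # ~7.2 GB/d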