Project author: debugtalk

Project description:
A web crawler based on requests-html, mainly targeted at URL validation testing.

Language: Python
Project address: git://github.com/debugtalk/WebCrawler.git
Created: 2017-03-24T13:34:26Z
Project community: https://github.com/debugtalk/WebCrawler

License: MIT License


WebCrawler

A simple web crawler, mainly targeted at link validation testing.

Features

  • runs in BFS or DFS mode
  • configurable number of concurrent workers in BFS mode
  • crawl seeds can be set to more than one URL
  • supports crawling with cookies
  • configurable hyperlink regexes, including match type and ignore type
  • groups visited URLs by HTTP status code
  • flexible configuration in YAML
  • sends test results by mail, via the SMTP protocol or the mailgun service
  • supports cancelling jobs
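The BFS/DFS modes, the max-depth limit, and grouping by status code can be illustrated with a minimal, self-contained sketch. This is not the project's implementation — it traverses an in-memory link graph instead of fetching real pages:

```python
from collections import deque

# Toy "site": each URL maps to (HTTP status, outgoing links).
# Stands in for real HTTP fetches purely for illustration.
SITE = {
    "/":    (200, ["/a", "/b"]),
    "/a":   (200, ["/a/1"]),
    "/b":   (404, []),
    "/a/1": (200, []),
}

def crawl(seed, mode="bfs", max_depth=5):
    """Visit links starting from seed, grouping visited URLs by status code."""
    frontier = deque([(seed, 0)])
    visited, by_status = set(), {}
    while frontier:
        # BFS pops from the front of the queue, DFS from the back.
        url, depth = frontier.popleft() if mode == "bfs" else frontier.pop()
        if url in visited or depth > max_depth:
            continue
        visited.add(url)
        status, links = SITE[url]
        by_status.setdefault(status, []).append(url)
        frontier.extend((link, depth + 1) for link in links)
    return by_status

print(crawl("/", mode="bfs", max_depth=2))
# → {200: ['/', '/a', '/a/1'], 404: ['/b']}
```

Setting `max_depth=0` would visit only the seed itself; switching `mode` to `"dfs"` changes the visit order but not the reachable set.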

Installation/Upgrade

  $ pip install -U git+https://github.com/debugtalk/WebCrawler.git#egg=requests-crawler --process-dependency-links

To verify that the installation or upgrade succeeded, run webcrawler -V and check that the correct version numbers are reported.

  $ webcrawler -V
  jenkins-mail-py version: 0.2.4
  WebCrawler version: 0.3.0

Usage

  $ webcrawler -h
  usage: webcrawler [-h] [-V] [--log-level LOG_LEVEL]
                    [--config-file CONFIG_FILE] [--seeds SEEDS]
                    [--include-hosts INCLUDE_HOSTS] [--cookies COOKIES]
                    [--crawl-mode CRAWL_MODE] [--max-depth MAX_DEPTH]
                    [--concurrency CONCURRENCY] [--save-results SAVE_RESULTS]
                    [--grey-user-agent GREY_USER_AGENT]
                    [--grey-traceid GREY_TRACEID]
                    [--grey-view-grey GREY_VIEW_GREY]
                    [--mailgun-api-id MAILGUN_API_ID]
                    [--mailgun-api-key MAILGUN_API_KEY]
                    [--mail-sender MAIL_SENDER]
                    [--mail-recepients [MAIL_RECEPIENTS [MAIL_RECEPIENTS ...]]]
                    [--mail-subject MAIL_SUBJECT] [--mail-content MAIL_CONTENT]
                    [--jenkins-job-name JENKINS_JOB_NAME]
                    [--jenkins-job-url JENKINS_JOB_URL]
                    [--jenkins-build-number JENKINS_BUILD_NUMBER]

  A web crawler for testing website links validation.

  optional arguments:
    -h, --help            show this help message and exit
    -V, --version         show version
    --log-level LOG_LEVEL
                          Specify logging level, default is INFO.
    --config-file CONFIG_FILE
                          Specify config file path.
    --seeds SEEDS         Specify crawl seed url(s), several urls can be
                          specified with pipe; if auth needed, seeds can be
                          specified like user1:pwd1@url1|user2:pwd2@url2
    --include-hosts INCLUDE_HOSTS
                          Specify extra hosts to be crawled.
    --cookies COOKIES     Specify cookies, several cookies can be joined by '|'.
                          e.g. 'lang:en,country:us|lang:zh,country:cn'
    --crawl-mode CRAWL_MODE
                          Specify crawl mode, BFS or DFS.
    --max-depth MAX_DEPTH
                          Specify max crawl depth.
    --concurrency CONCURRENCY
                          Specify concurrent workers number.
    --save-results SAVE_RESULTS
                          Specify if save results, default is NO.
    --grey-user-agent GREY_USER_AGENT
                          Specify grey environment header User-Agent.
    --grey-traceid GREY_TRACEID
                          Specify grey environment cookie traceid.
    --grey-view-grey GREY_VIEW_GREY
                          Specify grey environment cookie view_gray.
    --mailgun-api-id MAILGUN_API_ID
                          Specify mailgun api id.
    --mailgun-api-key MAILGUN_API_KEY
                          Specify mailgun api key.
    --mail-sender MAIL_SENDER
                          Specify email sender.
    --mail-recepients [MAIL_RECEPIENTS [MAIL_RECEPIENTS ...]]
                          Specify email recepients.
    --mail-subject MAIL_SUBJECT
                          Specify email subject.
    --mail-content MAIL_CONTENT
                          Specify email content.
    --jenkins-job-name JENKINS_JOB_NAME
                          Specify jenkins job name.
    --jenkins-job-url JENKINS_JOB_URL
                          Specify jenkins job url.
    --jenkins-build-number JENKINS_BUILD_NUMBER
                          Specify jenkins build number.
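The `--seeds` and `--cookies` value formats described in the help text above (pipe-separated values, optional `user:pwd@` auth prefixes, comma-joined `key:value` cookie pairs) can be sketched as small parsers. These helpers are illustrative only — they are not part of the project's API:

```python
def parse_seeds(seeds):
    """Split pipe-separated seeds; each may carry an optional user:pwd@ auth prefix."""
    parsed = []
    for seed in seeds.split("|"):
        if "@" in seed:
            auth, url = seed.split("@", 1)
            user, pwd = auth.split(":", 1)
            parsed.append({"url": url, "auth": (user, pwd)})
        else:
            parsed.append({"url": seed, "auth": None})
    return parsed

def parse_cookies(cookies):
    """Split pipe-separated cookie sets; each set is comma-joined key:value pairs."""
    return [
        dict(pair.split(":", 1) for pair in group.split(","))
        for group in cookies.split("|")
    ]

print(parse_seeds("user1:pwd1@url1|url2"))
print(parse_cookies("lang:en,country:us|lang:zh,country:cn"))
```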

Examples

Specify config file.

  $ webcrawler --seeds http://debugtalk.com --crawl-mode bfs --max-depth 5 --config-file path/to/config.yml
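The config file schema is not documented in this README. The sketch below is only a guess at what a YAML config mirroring the CLI flags might look like — every key name is an assumption, so check the project source for the actual schema:

```yaml
# Hypothetical config.yml — key names mirror the CLI flags
# and are assumptions, not the project's documented schema.
seeds: http://debugtalk.com
crawl_mode: bfs
max_depth: 5
concurrency: 20
cookies: 'lang:en,country:us'
```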

Crawl in BFS mode with 20 concurrent workers, and set maximum depth to 5.

  $ webcrawler --seeds http://debugtalk.com --crawl-mode bfs --max-depth 5 --concurrency 20

Crawl in DFS mode, and set maximum depth to 10.

  $ webcrawler --seeds http://debugtalk.com --crawl-mode dfs --max-depth 10

Crawl several websites in BFS mode with 20 concurrent workers, and set maximum depth to 10.

  $ webcrawler --seeds http://debugtalk.com,http://blog.debugtalk.com --crawl-mode bfs --max-depth 10 --concurrency 20

Crawl with different cookies.

  $ webcrawler --seeds http://debugtalk.com --crawl-mode BFS --max-depth 10 --concurrency 50 --cookies 'lang:en,country:us|lang:zh,country:cn'

Supported Python Versions

WebCrawler supports Python 2.7, 3.3, 3.4, 3.5, and 3.6.

License

Open source licensed under the MIT license (see LICENSE file for details).