Spider Man

高可用代理IP池高并发生成器一些实战经验

Highly Available Proxy IP Pool, Highly Concurrent Request Builder, Some Application

site	document	Last Modified time
some proxy site,etc.	Proxy pool	20-06-01
music.163.com	Netease	18-10-21
-	Press Test System	18-11-10
news.baidu.com	News	19-01-25
note.youdao.com	Youdao Note	20-01-04
jianshu.com/csdn.net	blog	20-01-04
elective.pku.edu.cn	Brush Class	19-10-11
zimuzu.tv	zimuzu	19-04-13
bilibili.com	Bilibili	20-06-06
exam.shaoq.com	shaoq	19-03-21
data.eastmoney.com	Eastmoney	19-03-29
hotel.ctrip.com	Ctrip Hotel Detail	19-10-11
douban.com	DouBan	19-05-07
66ip.cn	66ip	19-05-07

keyword

Big data store
High concurrency requests
Support WebSocket
method for font cheat
method for js compile
Some Application

Quick Start

docker is on the road.

$ git clone https://github.com/iofu728/spider.git
$ cd spider
$ pip install -r requirement.txt
# load proxy pool
$ python proxy/getproxy.py                             # to load proxy resources

To use proxy pool

''' using proxy requests '''
from proxy.getproxy import GetFreeProxy                # to use proxy
proxy_req = GetFreeProxy().proxy_req
proxy_req(url:str, types:int, data=None, test_func=None, header=None)
''' using basic requests '''
from util.util import basic_req
basic_req(url: str, types: int, proxies=None, data=None, header=None, need_cookie: bool = False)

Structure

.
├── LICENSE
├── README.md
├── bilibili
│   ├── analysis.py                // data analysis
│   ├── bilibili.py                // bilibili basic
│   └── bsocket.py                 // bilibili websocket
├── blog
│   └── titleviews.py              // Zhihu && CSDN && jianshu
├── brushclass
│   └── brushclass.py              // PKU elective
├── buildmd
│   └── buildmd.py                 // Youdao Note
├── eastmoney
│   └── eastmoney.py               // font analysis
├── exam
│   ├── shaoq.js                   // jsdom
│   └── shaoq.py                   // compile js shaoq
├── log
├── netease
│   ├── netease_music_base.py
│   ├── netease_music_db.py        // Netease Music
│   └── table.sql
├── news
│   └── news.py                    // Google && Baidu
├── press
│   └── press.py                   // Press text
├── proxy
│   ├── getproxy.py                // Proxy pool
│   └── table.sql
├── requirement.txt
├── utils
│   ├── db.py
│   └── utils.py
└── zimuzu
    └── zimuzu.py                  // zimuzi

Proxy pool

proxy pool is the heart of this project.

Highly Available Proxy IP Pool
- By obtaining data from Gatherproxy, Goubanjia, xici etc. Free Proxy WebSite
- Analysis of the Goubanjia port data
- Quickly verify IP availability
- Cooperate with Requests to automatically assign proxy Ip, with Retry mechanism, fail to write DB mechanism
- two models for proxy shell
  - model 1: load gather proxy list && update proxy list file(need over the GFW, your personality passwd in http://gatherproxy.com to proxy/data/passage one line by username, one line by passwd)
  - model 0: update proxy pool db && test available
- one common proxy api
  - from proxy.getproxy import GetFreeProxy
  - proxy_req = GetFreeProxy().proxy_req
  - proxy_req(url: str, types: int, data=None, test_func=None, header=None)
- also one common basic req api
  - from util import basic_req
  - basic_req(url: str, types: int, proxies=None, data=None, header=None)
- if you want spider by using proxy
  - because access proxy web need over the GFW, so maybe you can’t use model 1 to download proxy file.
  - download proxy txt from ‘http://gatherproxy.com‘
  - cp download_file proxy/data/gatherproxy
  - python proxy/getproxy.py —model==0

Netease

Netease Music song playlist crawl - netease/netease_music_db.py

problem: big data store
classify -> playlist id -> song_detail
V1 Write file, One run version, no proxy, no record progress mechanism
V1.5 Small amount of proxy IP
V2 Proxy IP pool, Record progress, Write to MySQL
- Optimize the write to DB Load data/ Replace INTO
Netease Music Spider for DB
Netease Music Spider

Press Test System

Press Test System - press/press.py

problem: high concurrency requests
By highly available proxy IP pool to pretend user.
Give some web service uneven pressure
To do: press uniform

News

google & baidu info crawl- news/news.py

get news from search engine by Proxy Engine
one model: careful analysis DOM
the other model: rough analysis Chinese words

Youdao Note

Youdao Note documents crawl - buildmd/buildmd.py

load data from youdaoyun
by series of rules to deal data to .md

blog

csdn && zhihu && jianshu view info crawl - blog/titleview.py

$ python blog/titleviews.py --model=1 >> log 2>&1 # model = 1: load gather model or python blog/titleviews.py --model=1 >> proxy.log 2>&1
$ python blog/titleviews.py --model=0 >> log 2>&1 # model = 0: update gather model

Brush Class

PKU Class brush - brushclass/brushclass.py

when your expected class have places, It will send you some email.

zimuzu

ZiMuZu download list crawl - zimuzu/zimuzu.py

when you want to download lots of show like Season 22, Season 21.
If click one by one, It is very boring, so zimuzu.py is all you need.
The thing you only need do is to wait for the program run.
And you copy the Thunder URL for one to download the movies.
Now The Winter will come, I think you need it to review <Game of Thrones>.

Bilibili

Get av data by http - bilibili/bilibili.py

homepage rank -> check tids -> to check data every 2min(during on rank + one day)
monitor every rank av -> star num & basic data

Get av data by websocket - bilibili/bsocket.py

base on WebSocket
byte analysis
heartbeat

Get comment data by http - bilibili/bilibili.py

load comment from /x/v2/reply
UnicodeEncodeError: ‘ascii’ codec can’t encode characters in position 7-10: ordinal not in range(128)
- read/write in utf-8
- with codecs.open(filename, ‘r/w’, encoding=’utf-8’)
bilibili some url return 404 like http://api.bilibili.com/x/relation/stat?jsonp=jsonp&callback=__jp11&vmid=

basic_req auto add host to headers, but this URL can’t request in ‘Host’

shaoq

Get text data by compiling javascript - exam/shaoq.py

Idea
1. get cookie
2. request image
3. requests after 5.5s
4. compile javascript code -> get css
5. analysic css

Requirement

pip3 install PyExecJS
yarn install add jsdom # npm install jsdom PS: not global

Can’t get true html
- Wait time must be 5.5s.
- So you can use threading or await asyncio.gather to request image
- Coroutines and Tasks
Error: Cannot find module ‘jsdom’

jsdom must install in local not in global
- Cannot find module ‘jsdom’
remove subtree & edit subtree & re.findall
```
subtree.extract()
subtree.string = new_string
parent_tree.find_all(re.compile('''))
```

Eastmoney

Get stock info by analysis font - eastmoney/eastmoney.py

font analysis
Idea
1. get data from HTML -> json
2. get font map -> transform num
3. or load font analysis font(contrast with base)
error: unpack requires a buffer of 20 bytes
- requests.text -> str,
- requests.content -> byte
- Struct.error: unpack requires a buffer of 16 bytes
How to analysis font
- use fonttools
- get TTFont().getBestCamp()
- contrast with base
configure file
- cfg = ConfigParser()
- cfg.read(assign_path, ‘utf-8’)
- 13.10read configure file

Ctrip Hotel Detail

Get Ctrip Hotel True Detail - ctrip/hotelDetail.py

int32
```
np.int32()
```
js charCodeAt() in py

python 中如何实现 js 里的 charCodeAt()方法？
```
ord(string[index])
```
python access file fold import
```
import sys
sys.path.append(os.getcwd())
```

generate char list

using ASCII

lower_char = [chr(i) for i in range(97,123)] # a-z
upper_char = [chr(i) for i in range(65,91)]  # A-Z

Can’t get cookie in document.cookie

Service use HttpOnly in Set-Cookie
- Why doesn’t document.cookie show all the cookie for the site?
- Secure and HttpOnly
The Secure attribute is meant to keep cookie communication limited to encrypted transmission, directing browsers to use cookies only via secure/encrypted connections. However, if a web server sets a cookie with a secure attribute from a non-secure connection, the cookie can still be intercepted when it is sent to the user by man-in-the-middle attacks. Therefore, for maximum security, cookies with the Secure attribute should only be set over a secure connection.

The HttpOnly attribute directs browsers not to expose cookies through channels other than HTTP (and HTTPS) requests. This means that the cookie cannot be accessed via client-side scripting languages (notably JavaScript), and therefore cannot be stolen easily via cross-site scripting (a pervasive attack technique).
ctrip cookie analysis

key	method	how	constant	login	finish
`magicid`	set	`https://hotels.ctrip.com/hotel/xxx.html`	1	0	1
`ASP.NET_SessionId`	set	`https://hotels.ctrip.com/hotel/xxx.html`	1	0	1
`clientid`	set	`https://hotels.ctrip.com/hotel/xxx.html`	1	0	1
`_abtest_userid`	set	`https://hotels.ctrip.com/hotel/xxx.html`	1	0	1
`hoteluuid`	js	`https://hotels.ctrip.com/hotel/xxx.html`	1	0
`fcerror`	js	`https://hotels.ctrip.com/hotel/xxx.html`	1	0
`_zQdjfing`	js	`https://hotels.ctrip.com/hotel/xxx.html`	1	0
`OID_ForOnlineHotel`	js	`https://webresource.c-ctrip.com/ResHotelOnline/R8/search/js.merge/showhotelinformation.js`	1	0
`_RSG`	req	`https://cdid.c-ctrip.com/chloro-device/v2/d`	1	0
`_RDG`	req	`https://cdid.c-ctrip.com/chloro-device/v2/d`	1	0
`_RGUID`	set	`https://cdid.c-ctrip.com/chloro-device/v2/d`	1	0
`_ga`	js	for google analysis	1	0
`_gid`	js	for google analysis	1	0
`MKT_Pagesource`	js	`https://webresource.c-ctrip.com/ResUnionOnline/R3/float/floating_normal.min.js`	1	0
`_HGUID`	js	`https://hotels.ctrip.com/hotel/xxx.html`	1	0
`HotelDomesticVisitedHotels1`	set	`https://hotels.ctrip.com/Domestic/tool/AjaxGetHotelAddtionalInfo.ashx`	1	0
`_RF1`	req	`https://cdid.c-ctrip.com/chloro-device/v2/d`	1	0
`appFloatCnt`	js	`https://webresource.c-ctrip.com/ResUnionOnline/R3/float/floating_normal.min.js?20190428`	1	0
`gad_city`	set	`https://crm.ws.ctrip.com/Customer-Market-Proxy/AdCallProxyV2.aspx`	1	0
`login_uid`	set	`https://accounts.ctrip.com/ssoproxy/ssoCrossSetCookie`	1	1
`login_type`	set	`https://accounts.ctrip.com/ssoproxy/ssoCrossSetCookie`	1	1
`cticket`	set	`https://accounts.ctrip.com/ssoproxy/ssoCrossSetCookie`	1	1
`AHeadUserInfo`	set	`https://accounts.ctrip.com/ssoproxy/ssoCrossSetCookie`	1	1
`ticket_ctrip`	set	`https://accounts.ctrip.com/ssoproxy/ssoCrossSetCookie`	1	1
`DUID`	set	`https://accounts.ctrip.com/ssoproxy/ssoCrossSetCookie`	1	1
`IsNonUser`	set	`https://accounts.ctrip.com/ssoproxy/ssoCrossSetCookie`	1	1
`UUID`	req	`https://passport.ctrip.com/gateway/api/soa2/12770/setGuestData`	1	1
`IsPersonalizedLogin`	js	`https://webresource.c-ctrip.com/ares2/basebiz/cusersdk/~0.0.8/default/login/1.0.0/loginsdk.min.js`	1	1
`_bfi`	js	`https://webresource.c-ctrip.com/code/ubt/_bfa.min.js?v=20193_28.js`	1	0
`_jzqco`	js	`https://webresource.c-ctrip.com/ResUnionOnline/R1/remarketing/js/mba_ctrip.js`	1	0
`__zpspc`	js	`https://webresource.c-ctrip.com/ResUnionOnline/R1/remarketing/js/s.js`	1	0
`_bfa`	js	`https://webresource.c-ctrip.com/code/ubt/_bfa.min.js?v=20193_28.js`	1	0
`_bfs`	js	`https://webresource.c-ctrip.com/code/ubt/_bfa.min.js?v=20193_28.js`	1	0
`utc`	js	`https://hotels.ctrip.com/hotel/xxx.html`	0	0	1
`htltmp`	js	`https://hotels.ctrip.com/hotel/xxx.html`	0	0	1
`htlstm`	js	`https://hotels.ctrip.com/hotel/xxx.html`	0	0	1
`arp_scroll_position`	js	`https://hotels.ctrip.com/hotel/xxx.html`	0	0	1

some fusion in ctrip

function a31(a233, a23, a94) {
  var a120 = {
    KWcVI: "mMa",
    hqRkQ: function a272(a309, a20) {
      return a309 + a20;
    },
    WILPP: function a69(a242, a488) {
      return a242(a488);
    },
    ydraP: function a293(a338, a255) {
      return a338 == a255;
    },
    ceIER: ";expires=",
    mDTlQ: function a221(a234, a225) {
      return a234 + a225;
    },
    dnvrD: function a268(a61, a351) {
      return a61 + a351;
    },
    DIGJw: function a368(a62, a223) {
      return a62 == a223;
    },
    pIWEz: function a260(a256, a284) {
      return a256 + a284;
    },
    jXvnT: ";path=/",
  };
  if (a120["KWcVI"] !== a120["KWcVI"]) {
    var a67 = new Date();
    a67[a845("0x1a", "4Vqw")](
      a120[a845("0x1b", "RswF")](a67["getDate"](), a94)
    );
    document[a845("0x1c", "WjvM")] =
      a120[a845("0x1d", "3082")](a233, "=") +
      a120[a845("0x1e", "TDHu")](escape, a23) +
      (a120["ydraP"](a94, null)
        ? ""
        : a120["hqRkQ"](a120["ceIER"], a67[a845("0x1f", "IErH")]())) +
      a845("0x20", "eHIq");
  } else {
    var a148 = a921(this, function() {
      var a291 = function() {
          return "dev";
        },
        a366 = function() {
          return "window";
        };
      var a198 = function() {
        var a168 = new RegExp("\\w+ *\\(\\) *{\\w+ *[' | '].+[' | '];? *}");
        return !a168["test"](a291["toString"]());
      };
      var a354 = function() {
        var a29 = new RegExp("(\\[x|u](\\w){2,4})+");
        return a29["test"](a366["toString"]());
      };
      var a243 = function(a2) {
        var a315 = ~-0x1 >> (0x1 + (0xff % 0x0));
        if (a2["indexOf"]("i" === a315)) {
          a310(a2);
        }
      };
      var a310 = function(a213) {
        var a200 = ~-0x4 >> (0x1 + (0xff % 0x0));
        if (a213["indexOf"]((!![] + "")[0x3]) !== a200) {
          a243(a213);
        }
      };
      if (!a198()) {
        if (!a354()) {
          a243("indÐµxOf");
        } else {
          a243("indexOf");
        }
      } else {
        a243("indÐµxOf");
      }
    });
    // a148();
    var a169 = new Date();
    a169["setDate"](a169["getDate"]() + a94);
    document["cookie"] = a120["mDTlQ"](
      a120["dnvrD"](
        a120["dnvrD"](a120["dnvrD"](a233, "="), escape(a23)),
        a120["DIGJw"](a94, null)
          ? ""
          : a120["pIWEz"](a120["ceIER"], a169["toGMTString"]())
      ),
      a120["jXvnT"]
    );
  }
}

equal to

document["cookie"] =
  a233 +
  "=" +
  escape(a23) +
  (a94 == null ? "" : ";expires=" + a169["toGMTString"]()) +
  ";path=/";

So, It is only a function to set cookie & expires.

And you can think a31 is a entry point to judge where code about compiler cookie.

Get current timezone offset

import datetime, tzlocal
local_tz = tzlocal.get_localzone()
timezone_offset = -int(local_tz.utcoffset(datetime.datetime.today()).total_seconds() / 60)

JSON.stringfy(e)
```
import json
json.dumps(e, separators=(',', ':'))
```
- JSON.stringify (Javascript) and json.dumps (Python) not equivalent on a list?
Element.getBoundingClientRect()

return Element position
- Element.getBoundingClientRect()
- EventTarget.addEventListener()

DouBan

RuntimeError: dictionary changed size during iteration (when user pickle)
- This situation maybe happen when your pickle params change in pickling.
- so copy of your params before pickle
```
comment_loader = comment.copy()
dump_bigger(comment_loader, '{}data.pkl'.format(data_dir))
```
How to avoid “RuntimeError: dictionary changed size during iteration” error?
pickling SimpleLazyObject fails just after accessing related object of wrapped model instance.
RecursionError: maximum recursion depth exceeded while pickling an object
- object depth more than MAXIMUM stack depth
```
import sys
sys.setrecursionlimit(10000)
```

66ip

Q: @liu wong 一段 js 代码在浏览器上执行的结果和在 python 上用 execjs 执行的结果不一样，有啥原因呢？ http://www.66ip.cn/

A: 一般 eval 差异主要是有编译环境，DOM，py 与 js 的字符规则，context 等有关
像 66ip 这个网站，主要是从 py 与 js 的字符规则不同 + DOM 入手的，当然它也有可能是无意的(毕竟爬虫工程师用的不只是 py)
首次访问 66ip 这个网站，会返回一个 521 的 response，header 里面塞了一个 HTTP-only 的 cookie，body 里面塞了一个 script

var x = "@...".replace(/@*$/, "").split("@"),
  y = "...",
  f = function(x, y) {
    return num;
  },
  z = f(
    y
      .match(/\w/g)
      .sort(function(x, y) {
        return f(x) - f(y);
      })
      .pop()
  );
while (z++)
  try {
    eval(
      y.replace(/\b\w+\b/g, function(y) {
        return x[f(y, z) - 1] || "_" + y;
      })
    );
    break;
  } catch (_) {}

可以看到 eval 的是 y 字符串用 x 数组做了一个字符替换之后的结果，所以按道理应该和编译环境没有关系，但把 eval 改成 aa 之后放在 py 和放在 node，chrome 中编译结果却不一样
这是因为在 p 正则\b 会被转义为\x80，这就会导致正则匹配不到，就更不可能替换了，导致我们拿到的 eval_script 实际上是一串乱码
这里用 r’{}’.format(eval_script) 来防止特殊符号被转义
剩下的就是对拿到的 eval_script 进行 dom 替换操作
总的来说是一个挺不错的 js 逆向入门练手项目, 代码量不大，逻辑清晰
具体代码参见iofu728/spider

OceanBall V2

check param list:

param	Ctrip	Incognito	Node	!!import
define	✔	x	x
__filename	x	x	x
module	x	x	✔	x
process	✔	x	✔
__dirname	✔	x	x
global	x	x	✔	x
INT_MAX	✔	x	x
require	✔	x	✔	✔
History	✔	x
Location	✔	x
Window	✔	x
Document	✔	x
window	✔	x
navigator	✔	x
history	✔	x

——To be continued——