Crawling translation data from iciba.com

ICIBA API investigation

Default request URL, with parameters, used by the ICIBA website:

http://www.iciba.com/index.php?callback=jQuery190037237954499612624_1521877397217&a=getWordMean&c=search&list=1,2,3,4,5,8,9,10,12,13,14,15,18,21,22,24,3003,3004,3005&word=test&_=1521877397218

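Meaning of the request parameters

  1. callback: the jQuery callback name; iciba wraps the result in a call to this function and returns it to the site

  2. a=getWordMean: selects the word-meaning lookup

  3. c=search: selects word search

  4. word: the word to look up

  5. list=1,2,3,4,5,8,9,10,12,13,14,15,18,21,22,24,3003,3004,3005

    1. 1: basic word information, path: JSON.baesInfo
    2. 2: word analysis information, path: JSON.sameAnalysis
    3. 3: Collins advanced English-Chinese learner's dictionary, path: JSON.collins
    4. 4: English-English dictionary, path: JSON.ee_mean
    5. 5: industry dictionary, path: JSON.trade_means
    6. 8: bilingual example sentences, path: JSON.sentence
    7. 9: (combining 1 and 9 adds web definitions, path JSON.netmean), path: JSON.baesInfo
    8. 10: authoritative example sentences, path: JSON.auth_sentence
    9. 12: definition variants (not displayed on the page), path: JSON.synonym
    10. 14: phrases and sentence patterns, path: JSON.phrase
    11. 15: word roots and affixes, path: JSON.stems_affixes
    12. 18: Baidu Baike entry, path: JSON.encyclopedia
    13. 21: CET-4 exam questions, path: JSON.cetFour
    14. 3003: parts of speech, Chinese-English examples and translations (not displayed on the page), path: JSON.bidec
    15. 3005: sentence patterns and usage, path: JSON.jushi
    16. If list is omitted or set to 0, all information is returned by default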
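The parameters above can be assembled programmatically. The following is a minimal sketch, not the site's official client: the helper names `build_url` and `unwrap_jsonp` are made up for illustration, and the JSONP handling simply assumes the response has the form `callbackName({...})` when `callback` is sent.

```python
import json
from urllib.parse import quote, urlencode

BASE_URL = "http://www.iciba.com/index.php"


def build_url(word, sections, callback=None):
    """Build a getWordMean URL that requests only the given list sections."""
    params = {
        "a": "getWordMean",
        "c": "search",
        "list": ",".join(str(s) for s in sections),
        "word": word,
    }
    if callback:
        params["callback"] = callback
    # safe="," keeps the list readable (1,8 instead of 1%2C8)
    return BASE_URL + "?" + urlencode(params, safe=",", quote_via=quote)


def unwrap_jsonp(text):
    """Parse a response body, stripping a callback(...) wrapper if one is present."""
    text = text.strip()
    start, end = text.find("("), text.rfind(")")
    if not text.startswith(("{", "[")) and start != -1 and end > start:
        text = text[start + 1:end]
    return json.loads(text)
```

Leaving `callback` out avoids the wrapper entirely, which is what the shortened URL later in this post does.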

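Data required by the product

  • Input word: JSON.baesInfo.word_name
  • Phonetic symbols and translation: JSON.baesInfo.symbols
  • Inflected forms: JSON.baesInfo.exchange
  • Bilingual example sentences: JSON.sentence
  • Sentence patterns and usage: JSON.jushi
  • Authoritative example sentences: JSON.auth_sentence

With these requirements the list parameter becomes 1,6,8,10,3005, which drops the unneeded sections and speeds up the request.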

http://www.iciba.com/index.php?&a=getWordMean&c=search&list=1,6,8,10,3005&word=test
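As a rough illustration of how the required paths map onto a parsed response, here is a small sketch. The helpers `dig` and `extract_fields` are hypothetical, and the sample payload is invented: it only mimics the key names listed above and is not a real iciba response.

```python
def dig(obj, *path, default=None):
    """Follow keys and indexes through nested dicts/lists; return default on any miss."""
    for key in path:
        try:
            obj = obj[key]
        except (KeyError, IndexError, TypeError):
            return default
    return obj


def extract_fields(payload):
    """Collect the six pieces of data the product needs from a parsed response."""
    base = dig(payload, "baesInfo", default={})
    return {
        "word": dig(base, "word_name"),
        "symbols": dig(base, "symbols", default=[]),
        "exchange": dig(base, "exchange", default={}),
        "sentence": dig(payload, "sentence", default=[]),
        "jushi": dig(payload, "jushi", default=[]),
        "auth_sentence": dig(payload, "auth_sentence", default=[]),
    }


# Invented payload for illustration only
sample = {"baesInfo": {"word_name": "test", "symbols": [{"parts": []}]}, "sentence": []}
fields = extract_fields(sample)
```

Returning empty containers for missing sections lets the storing code loop over them without extra null checks.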

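Database design

Table: dic_auth_sentence (authoritative example sentences)

Field     Data type      Comment                     Key
id        int(11)        authoritative sentence id   primary key
content   varchar(255)   content
link      varchar(255)   related link
source    varchar(50)    source
word_id   int(11)        word id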

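Table: dic_base_info (phonetic symbols)

Field         Data type      Comment                      Key
id            int(11)        word id                      primary key
word          varchar(128)   word
ph_en         varchar(255)   British phonetic symbol
ph_am         varchar(255)   American phonetic symbol
ph_en_mp3     varchar(255)   British pronunciation mp3
ph_am_mp3     varchar(255)   American pronunciation mp3
create_time   datetime       creation time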

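Table: dic_base_info_test (phonetic symbols)

Field         Data type      Comment                      Key
id            int(11)        word id                      primary key
word          varchar(128)   word
ph_en         varchar(255)   British phonetic symbol
ph_am         varchar(255)   American phonetic symbol
ph_en_mp3     varchar(255)   British pronunciation mp3
ph_am_mp3     varchar(255)   American pronunciation mp3
create_time   datetime       creation time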

Table: dic_exchange (inflected forms)

Field        Data type      Comment                         Key
word_id      int(11)        word id                         primary key
word_pl      varchar(128)   plural
word_past    varchar(128)   past tense
word_done    varchar(128)   past participle
word_ing     varchar(128)   present participle
word_third   varchar(128)   third person singular
word_er      varchar(128)   comparative
word_est     varchar(128)   superlative
word_prep    varchar(128)   pronoun
word_adv     varchar(128)   adverb
word_verb    varchar(128)   verb
word_noun    varchar(128)   noun
word_adj     varchar(128)   adjective
word_conn    varchar(128)   conjunction

Table: dic_jushi (sentence patterns and usage)

Field     Data type      Comment               Key
id        int(11)        sentence pattern id   primary key
word_id   int(11)        word id
english   varchar(255)   English example
chinese   varchar(255)   Chinese example

Table: dic_parts (translations)

Field     Data type      Comment             Key
id        int(11)        part of speech id   primary key
word_id   int(11)        word id
part      varchar(50)    part of speech
means     varchar(255)   meaning

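Table: dic_sentence (bilingual example sentences)

Field     Data type      Comment                    Key
id        int(11)        example sentence id        primary key
word_id   int(11)        word id
english   varchar(255)   example sentence, English
chinese   varchar(255)   example sentence, Chinese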

Table: dic_temp_word (words for the batch run)

Field    Data type                      Comment                               Key
id       int(11)                        word id                               primary key
word     varchar(255)                   word
is_set   tinyint(1) unsigned zerofill   0 = not yet run, 1 = already run
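To illustrate how the is_set flag in dic_temp_word can drive a batch run, here is a small sketch. It uses an in-memory SQLite database as a stand-in for MySQL so that it runs anywhere; the column types are simplified accordingly, and the functions `next_pending` and `mark_done` are hypothetical names, not part of the project.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dic_temp_word ("
    "id INTEGER PRIMARY KEY, word TEXT, is_set INTEGER DEFAULT 0)"
)
conn.executemany("INSERT INTO dic_temp_word (word) VALUES (?)", [("test",), ("apple",)])


def next_pending(conn):
    """Return (id, word) of the oldest word that has not been run, or None."""
    return conn.execute(
        "SELECT id, word FROM dic_temp_word WHERE is_set = 0 ORDER BY id LIMIT 1"
    ).fetchone()


def mark_done(conn, word_id):
    """Flag a word as processed so the next batch skips it."""
    conn.execute("UPDATE dic_temp_word SET is_set = 1 WHERE id = ?", (word_id,))
    conn.commit()


processed = []
while True:
    row = next_pending(conn)
    if row is None:
        break
    word_id, word = row
    processed.append(word)  # a real job would crawl and store the word here
    mark_done(conn, word_id)
```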

Java implementation

String url = "http://www.iciba.com/index.php?a=getWordMean&c=search&list=1,6,8,10,3005&word=";
// Call the translation API with a GET request
try {
    wordMean = java.net.URLEncoder.encode(wordMean, "UTF-8"); // URL-encode special characters
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}
final HttpResponse response = WS.url(url + wordMean).get(); // GET the URL
final String res = WS.getResponseAsString(response);
  • Fastjson: parse the returned JSON and locate the paths of the required data
JSONObject jsonObject = JSON.parseObject(res);
int errno = (int) JSONPath.eval(jsonObject, "$.errno");
if (errno == 404) {
    return FAIL;
}
JSONObject symbols = (JSONObject) JSONPath.eval(jsonObject, "$.baesInfo.symbols[0]"); // phonetic symbols and mp3
JSONObject exchange = (JSONObject) JSONPath.eval(jsonObject, "$.baesInfo.exchange"); // inflected forms
ArrayList<JSONObject> sentences = (ArrayList<JSONObject>) JSONPath.eval(jsonObject, "$.sentence[0:2]"); // bilingual example sentences
ArrayList<JSONObject> jushi = (ArrayList<JSONObject>) JSONPath.eval(jsonObject, "$.jushi[0:2]"); // sentence patterns and usage
ArrayList<JSONObject> authSentence = (ArrayList<JSONObject>) JSONPath.eval(jsonObject, "$.auth_sentence[0:2]"); // authoritative example sentences
String wordName = (String) JSONPath.eval(jsonObject, "$.baesInfo.word_name"); // word name
  • Iterate over the extracted data and store it in MySQL
int wordId;
if (wordMeanMapper.checkWord(wordName) != null) {
    ReturnT.FAIL.setMsg("The word already exists");
    return FAIL;
}
wordMeanMapper.insertWord(wordName);
wordId = (int) wordMeanMapper.checkWord(wordName).get("id");
try {
    if (symbols != null) { // no phonetic symbols means no definitions; the else branch falls back to the web translation
        // Phonetic symbols
        String ph_en = symbols.getString("ph_en"); // British phonetic symbol
        String ph_am = symbols.getString("ph_am"); // American phonetic symbol
        // String ph_other = symbols.getString("ph_other");
        String ph_en_mp3 = symbols.getString("ph_en_mp3"); // British pronunciation
        String ph_am_mp3 = symbols.getString("ph_am_mp3"); // American pronunciation
        String ph_tts_mp3 = symbols.getString("ph_tts_mp3"); // TTS pronunciation
        wordMeanMapper.insertSymbols(wordId, ph_en, ph_am, ph_en_mp3, ph_am_mp3, ph_tts_mp3);
        // Inflected forms: strip the brackets and quotes of the JSON array text
        String reg = "[\\[\\]\"]";
        String word_pl = exchange.getString("word_pl").replaceAll(reg, "");
        String word_third = exchange.getString("word_third").replaceAll(reg, "");
        String word_past = exchange.getString("word_past").replaceAll(reg, "");
        String word_done = exchange.getString("word_done").replaceAll(reg, "");
        String word_ing = exchange.getString("word_ing").replaceAll(reg, "");
        String word_er = exchange.getString("word_er").replaceAll(reg, "");
        String word_est = exchange.getString("word_est").replaceAll(reg, "");
        String word_prep = exchange.getString("word_prep").replaceAll(reg, "");
        String word_adv = exchange.getString("word_adv").replaceAll(reg, "");
        String word_verb = exchange.getString("word_verb").replaceAll(reg, "");
        String word_noun = exchange.getString("word_noun").replaceAll(reg, "");
        String word_adj = exchange.getString("word_adj").replaceAll(reg, "");
        String word_conn = exchange.getString("word_conn").replaceAll(reg, "");
        wordMeanMapper.insertExchange(wordId, word_pl, word_past, word_done, word_ing, word_third,
                word_er, word_est, word_prep, word_adv, word_verb, word_noun, word_adj, word_conn);
        // Parts of speech and their meanings
        JSONArray parts = symbols.getJSONArray("parts");
        for (Object partObj : parts) {
            JSONObject partJSONObject = (JSONObject) partObj;
            String means = partJSONObject.getString("means");
            means = means.replaceAll(reg, "");
            String part = partJSONObject.getString("part");
            wordMeanMapper.insertParts(part, means, wordId);
        }
    } else {
        String translateResult = (String) JSONPath.eval(jsonObject, "$.baesInfo.translate_result");
        // Use wordName, which matches the row inserted above (wordMean has been URL-encoded)
        wordMeanMapper.updateTranslateResult(wordName, translateResult);
    }
} catch (Exception e) {
    e.printStackTrace();
    return FAIL;
}
// Bilingual example sentences
for (JSONObject sentence : sentences) {
    String networkEN = sentence.getString("Network_en");
    String networkCN = sentence.getString("Network_cn");
    String mp3 = sentence.getString("tts_mp3");
    String mp3Size = sentence.getString("tts_size");
    wordMeanMapper.insertSentence(networkEN, networkCN, mp3, mp3Size, wordId);
}
// Sentence patterns and usage
for (JSONObject jushiObj : jushi) {
    String english = jushiObj.getString("english");
    String chinese = jushiObj.getString("chinese");
    String mp3 = jushiObj.getString("mp3");
    wordMeanMapper.insertJushi(english, chinese, mp3, wordId);
}
// Authoritative example sentences
for (JSONObject authSentenceObj : authSentence) {
    String content = authSentenceObj.getString("content");
    String link = authSentenceObj.getString("link");
    String source = authSentenceObj.getString("source");
    String mp3 = authSentenceObj.getString("tts_mp3");
    String mp3Size = authSentenceObj.getString("tts_size");
    wordMeanMapper.insertAuthSentence(content, link, source, mp3, mp3Size, wordId);
}
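The regular expression `[\[\]"]` in the code above removes the brackets and quotes from the text form of a JSON array, turning something like `["tested","testing"]` into `tested,testing`. The sketch below reproduces that step in Python and compares it with parsing the array first; both function names are made up for this example, and the input strings are illustrative rather than captured from iciba.

```python
import json
import re

STRIP = re.compile(r'[\[\]"]')


def flatten_by_regex(json_text):
    """Mirror the Java replaceAll: drop every [, ] and double quote."""
    return STRIP.sub("", json_text)


def flatten_by_parsing(json_text):
    """Parse the JSON first, then join list items with commas."""
    value = json.loads(json_text)
    if isinstance(value, list):
        return ",".join(str(item) for item in value)
    return str(value)


single = flatten_by_regex('["tests"]')
several = flatten_by_regex('["tested","testing"]')
```

The two approaches agree on plain word lists. They differ only if a value itself contains a bracket or a quote, which the regex would also strip.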

Mapper

@Insert("INSERT INTO fanyi_word(word) VALUES(#{word})") // insert the given word
int insertWord(String word);

@Update("UPDATE fanyi_word SET translate_result=#{translate_result} WHERE word=#{word}") // web translation result
int updateTranslateResult(@Param("word") String word,
                          @Param("translate_result") String translateResult);

@Insert("INSERT INTO fanyi_sentence(english,chinese,tts_mp3,tts_size,word_id) VALUES(#{english},#{chinese},#{tts_mp3},#{tts_size},#{word_id})") // bilingual example sentence
int insertSentence(@Param("english") String english,
                   @Param("chinese") String chinese,
                   @Param("tts_mp3") String tts_mp3,
                   @Param("tts_size") String tts_size,
                   @Param("word_id") int wordId);

@Insert("INSERT INTO fanyi_auth_sentence(content,link,source,tts_mp3,tts_size,word_id) VALUES(#{content},#{link},#{source},#{tts_mp3},#{tts_size},#{word_id})") // authoritative example sentence
int insertAuthSentence(@Param("content") String content,
                       @Param("link") String link,
                       @Param("source") String source,
                       @Param("tts_mp3") String tts_mp3,
                       @Param("tts_size") String tts_size,
                       @Param("word_id") int wordId);

@Insert("INSERT INTO fanyi_jushi(english,chinese,mp3,word_id) VALUES(#{english},#{chinese},#{mp3},#{word_id})") // sentence pattern
int insertJushi(@Param("english") String english,
                @Param("chinese") String chinese,
                @Param("mp3") String mp3,
                @Param("word_id") int wordId);

@Insert("INSERT INTO fanyi_symbols(word_id,ph_en,ph_am,ph_en_mp3,ph_am_mp3,ph_tts_mp3) VALUES(#{word_id},#{ph_en},#{ph_am},#{ph_en_mp3},#{ph_am_mp3},#{ph_tts_mp3})") // phonetic symbols and audio
int insertSymbols(@Param("word_id") int wordId,
                  @Param("ph_en") String ph_en,
                  @Param("ph_am") String ph_am,
                  @Param("ph_en_mp3") String ph_en_mp3,
                  @Param("ph_am_mp3") String ph_am_mp3,
                  @Param("ph_tts_mp3") String ph_tts_mp3);

@Insert("INSERT INTO fanyi_parts(part,means,word_id) VALUES(#{part},#{means},#{word_id})") // part of speech and translation
int insertParts(@Param("part") String part,
                @Param("means") String means,
                @Param("word_id") int wordId);

@Insert("INSERT INTO fanyi_exchange (`word_id`, `word_pl`, `word_past`, `word_done`, `word_ing`, `word_third`, `word_er`, `word_est`, `word_prep`, `word_adv`, `word_verb`, `word_noun`, `word_adj`, `word_conn`)" +
        "VALUES(#{word_id},#{word_pl},#{word_past},#{word_done},#{word_ing},#{word_third},#{word_er},#{word_est},#{word_prep},#{word_adv},#{word_verb},#{word_noun},#{word_adj},#{word_conn})") // inflected forms
int insertExchange(@Param("word_id") int wordId,
                   @Param("word_pl") String word_pl,
                   @Param("word_past") String word_past,
                   @Param("word_done") String word_done,
                   @Param("word_ing") String word_ing,
                   @Param("word_third") String word_third,
                   @Param("word_er") String word_er,
                   @Param("word_est") String word_est,
                   @Param("word_prep") String word_prep,
                   @Param("word_adv") String word_adv,
                   @Param("word_verb") String word_verb,
                   @Param("word_noun") String word_noun,
                   @Param("word_adj") String word_adj,
                   @Param("word_conn") String word_conn);

@Select("SELECT word,id FROM fanyi_word WHERE word = #{word}") // check whether the word already exists
Map checkWord(String word);

Python implementation

import sys
import io
import json
import jsonpath

from utils import get_redis_conn
from examples.iciba.crawler import Crawler
from utils import mysql_util

sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf8')  # force UTF-8 so that xxl-job output is not garbled

host = 'http://www.iciba.com/index.php?&a=getWordMean&c=search&list=1,6,8,10,3005&word='
waiting_set = 'iciba:seeds:to_crawl'
seeds_all = 'iciba:seeds:all'
info_set = 'iciba:info:wordmsg'

# Not safe for concurrent use
common_crawler = Crawler()


def init_db():
    redis_client = get_redis_conn(db=2)
    return redis_client


def get_info(word):
    """Fetch one word from iciba and store its data in MySQL."""
    url = host + word
    html = common_crawler.get(url)
    if not html:
        return
    py_obj = json.loads(html.encode('utf8'))
    try:
        word_name = jsonpath.jsonpath(py_obj, "$.baesInfo.word_name")  # word name
        exchanges = jsonpath.jsonpath(py_obj, "$.baesInfo.exchange")  # inflected forms
        symbols = jsonpath.jsonpath(py_obj, "$.baesInfo.symbols[0]")  # phonetic symbols and mp3
        parts = jsonpath.jsonpath(py_obj, "$.baesInfo.symbols[0].parts")  # parts of speech and meanings
        translate_result = jsonpath.jsonpath(py_obj, "$.baesInfo.translate_result")  # web translation
        is_errno = jsonpath.jsonpath(py_obj, "$.errno")  # whether the word exists

        auth_sentence = jsonpath.jsonpath(py_obj, "$.auth_sentence[:3]")  # authoritative example sentences
        jushi = jsonpath.jsonpath(py_obj, "$.jushi[:3]")  # sentence patterns and usage
        sentence = jsonpath.jsonpath(py_obj, "$.sentence[:3]")  # bilingual example sentences

        # jsonpath.jsonpath returns False when nothing matches, so check before indexing
        if translate_result or (is_errno and is_errno[0] == 404):  # only a web translation or a 404 came back
            print('Word not found or invalid')
            return
        word_id = mysql_util.insert_word(word_name[0])
        if not word_id:  # the word is already stored, skip it
            return
        for symbol in symbols or []:  # phonetic symbols and audio
            mysql_util.insert_symbols(word_id, symbol['ph_en'], symbol['ph_am'], symbol['ph_en_mp3'],
                                      symbol['ph_am_mp3'], symbol['ph_tts_mp3'])

        for part in (parts[0] if parts else []):  # parts of speech and meanings
            mysql_util.insert_parts(word_id, part['part'], part['means'])
        if auth_sentence:
            for i in auth_sentence:  # authoritative example sentences
                mysql_util.insert_auth_sentence(i['content'], i['link'], i['source'], i['tts_mp3'], i['tts_size'], word_id)
        if jushi:
            for i in jushi:  # sentence patterns
                mysql_util.insert_jushi(i['english'], i['chinese'], i['mp3'], word_id)
        if sentence:
            for i in sentence:  # bilingual example sentences
                mysql_util.insert_sentence(i['Network_en'], i['Network_cn'], i['tts_mp3'], i['tts_size'], word_id)

        for exchange in exchanges or []:  # inflected forms: store empty arrays as NULL
            for key, value in exchange.items():
                if not value:
                    exchange[key] = None

            mysql_util.insert_exchange(word_id, exchange['word_pl'], exchange['word_past'], exchange['word_done'],
                                       exchange['word_ing'], exchange['word_third'], exchange['word_er'],
                                       exchange['word_est'], exchange['word_prep'], exchange['word_adv'],
                                       exchange['word_verb'], exchange['word_noun'], exchange['word_adj'],
                                       exchange['word_conn'])

    except Exception as e:
        print(e)
        return

    return html


def start():
    redis_client = init_db()
    if not redis_client.scard(waiting_set):
        # nothing left in waiting_set
        print("All words in %s have been processed, waiting for the next batch" % waiting_set)
        return

    # fetch a seed from waiting_set
    word = redis_client.spop(waiting_set).decode()

    print("Fetching data for the word %s ..." % word)
    word_data = get_info(word)
    if word_data:  # get_info returns None when the word was skipped or failed
        redis_client.hset(info_set, word, word_data)
        # redis_client.sadd(info_set, user)
        print("The word %s has been saved" % word)


def xxl_job():  # entry point called by xxl-job: take words from waiting_set in Redis and store them in MySQL
    redis_conn = init_db()
    while True:
        if not redis_conn.scard(waiting_set):
            print("All words have been processed, waiting for the next batch")
            break
        start()


if __name__ == '__main__':
    init_seeds = ['test']  # sample data for running the script directly
    redis_conn = init_db()
    redis_conn.sadd(waiting_set, *init_seeds)
    redis_conn.sadd(seeds_all, *init_seeds)
    xxl_job()  # process words until waiting_set is empty instead of looping forever
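The control flow of `start()` and `xxl_job()` boils down to "pop a word, handle it, repeat until the set is empty". The sketch below shows that loop in isolation with a plain Python set standing in for the Redis set; `drain` and the lambda handler are hypothetical and only illustrate the pattern, including how failures can be recorded without stopping the batch.

```python
def drain(pending, handle):
    """Pop words from `pending` until it is empty; return (done, failed)."""
    done, failed = [], []
    while pending:
        word = pending.pop()
        try:
            result = handle(word)
        except Exception as exc:  # keep the batch going when one word breaks
            failed.append((word, str(exc)))
            continue
        if result is None:
            failed.append((word, "no data"))
        else:
            done.append(word)
    return done, failed


# Stand-in handler: pretends that only "test" can be found
pending = {"test", "zzzz"}
done, failed = drain(pending, lambda w: {"word": w} if w == "test" else None)
```

Keeping a list of failed words makes it possible to push them back into the waiting set for a later retry, which the original script does not do.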

MySQL_util.py

from config.settings import (MYSQL_DATABASE, MYSQL_HOST, MYSQL_PASSWORD, MYSQL_PROT, MYSQL_USER, MYSQL_CHARSET)
import pymysql.cursors


def get_conn_mysql():
    """Open a new MySQL connection using the project settings."""
    connection = pymysql.connect(host=MYSQL_HOST,
                                 user=MYSQL_USER,
                                 port=MYSQL_PROT,
                                 password=MYSQL_PASSWORD,
                                 database=MYSQL_DATABASE,
                                 charset=MYSQL_CHARSET)
    return connection


def insert_word(word):
    """Insert the word and return its new id, or None if it already exists."""
    connection = get_conn_mysql()
    try:
        cursor = connection.cursor()
        # id is left out so that MySQL assigns it (assumes an AUTO_INCREMENT primary key)
        insert_sql = "INSERT INTO `fanyi_word` (`word`) VALUES (%s)"
        check_sql = "SELECT word FROM `fanyi_word` WHERE word=%s LIMIT 1"
        result = cursor.execute(check_sql, word)  # check whether the table already contains the word
        if not result:
            cursor.execute(insert_sql, (word,))
            connection.commit()
            return cursor.lastrowid
        else:
            return
    finally:
        connection.close()


def insert_symbols(word_id, ph_en, ph_am, ph_en_mp3, ph_am_mp3, ph_tts_mp3):
    connection = get_conn_mysql()
    try:
        cursor = connection.cursor()
        insert_symbols = "INSERT INTO `fanyi_symbols` (`word_id`, `ph_en`, `ph_am`, `ph_en_mp3`, `ph_am_mp3`, `ph_tts_mp3`) VALUES (%s, %s, %s, %s, %s, %s)"
        cursor.execute(insert_symbols, (word_id, ph_en, ph_am, ph_en_mp3, ph_am_mp3, ph_tts_mp3))
        connection.commit()
    finally:
        connection.close()


def insert_parts(word_id, part, means):
    connection = get_conn_mysql()
    try:
        cursor = connection.cursor()
        insert_parts = "INSERT INTO `fanyi_parts` (`part`, `means`, `word_id`) VALUES (%s, %s, %s)"
        means_str = ','.join(means)  # store the list of meanings as one comma-separated string
        cursor.execute(insert_parts, (part, means_str, word_id))
        connection.commit()
    finally:
        connection.close()


def insert_sentence(english, chinese, tts_mp3, tts_size, word_id):
    connection = get_conn_mysql()
    try:
        cursor = connection.cursor()
        insert_sentence = "INSERT INTO `fanyi_sentence` (`english`, `chinese`, `tts_mp3`, `tts_size`, `word_id`) VALUES (%s, %s, %s, %s, %s)"
        cursor.execute(insert_sentence, (english, chinese, tts_mp3, tts_size, word_id))
        connection.commit()
    finally:
        connection.close()


def insert_jushi(english, chinese, mp3, word_id):
    connection = get_conn_mysql()
    try:
        cursor = connection.cursor()
        insert_jushi = "INSERT INTO `fanyi_jushi` (`english`, `chinese`, `mp3`, `word_id`) VALUES (%s, %s, %s, %s)"
        cursor.execute(insert_jushi, (english, chinese, mp3, word_id))
        connection.commit()
    finally:
        connection.close()


def insert_auth_sentence(content, link, source, tts_mp3, tts_size, word_id):
    connection = get_conn_mysql()
    try:
        cursor = connection.cursor()
        insert_auth_sentence = "INSERT INTO `fanyi_auth_sentence` (`content`, `link`, `source`, `tts_mp3`, `tts_size`, `word_id`) VALUES (%s, %s, %s, %s, %s, %s)"
        cursor.execute(insert_auth_sentence, (content, link, source, tts_mp3, tts_size, word_id))
        connection.commit()
    finally:
        connection.close()


def insert_exchange(word_id, word_pl, word_past, word_done, word_ing, word_third, word_er, word_est, word_prep,
                    word_adv, word_verb, word_noun, word_adj, word_conn):
    connection = get_conn_mysql()
    try:
        cursor = connection.cursor()
        insert_exchange = "INSERT INTO `fanyi_exchange` (`word_id`, `word_pl`, `word_past`, `word_done`, `word_ing`, `word_third`, `word_er`, `word_est`, `word_prep`, `word_adv`, `word_verb`, `word_noun`, `word_adj`, `word_conn`) VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)"
        cursor.execute(insert_exchange, (
            word_id, word_pl, word_past, word_done, word_ing, word_third, word_er, word_est, word_prep, word_adv,
            word_verb, word_noun, word_adj, word_conn))
        connection.commit()
    finally:
        connection.close()
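Every function above repeats the same connect, execute, commit, close sequence. One possible way to factor that out is a context manager, sketched below. This is only an illustration: `transaction` is a hypothetical helper, and the demo uses SQLite with a temporary file so it runs without a MySQL server (note that SQLite uses `?` placeholders where PyMySQL uses `%s`).

```python
import os
import sqlite3
import tempfile
from contextlib import contextmanager


@contextmanager
def transaction(connect):
    """Open a connection, commit on success, roll back on error, always close."""
    connection = connect()
    try:
        yield connection.cursor()
        connection.commit()
    except Exception:
        connection.rollback()
        raise
    finally:
        connection.close()


# Demo with a throwaway SQLite file standing in for the MySQL database
db_path = os.path.join(tempfile.mkdtemp(), "demo.db")


def connect():
    return sqlite3.connect(db_path)


with transaction(connect) as cur:
    cur.execute("CREATE TABLE fanyi_word (id INTEGER PRIMARY KEY, word TEXT)")
    cur.execute("INSERT INTO fanyi_word (word) VALUES (?)", ("test",))
```

With a helper of this kind, passing `get_conn_mysql` as `connect` would reduce each insert function to the SQL string and one `execute` call.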

Caching words in Redis with Spring Boot

package com.xxl.job.executor.mvc.controller;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.autoconfigure.EnableAutoConfiguration;
import org.springframework.data.redis.RedisConnectionFailureException;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.stereotype.Controller;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestMethod;
import org.springframework.web.bind.annotation.ResponseBody;

@Controller
@EnableAutoConfiguration
public class IndexController {

    @Autowired
    private StringRedisTemplate stringRedisTemplate;

    // Adds a word to both Redis sets so that the crawler picks it up on the next run
    @RequestMapping(value = "/redis/addWord/{value}", method = RequestMethod.GET)
    @ResponseBody
    public String redisTest(@PathVariable String value) {
        String waitingSet = "iciba:seeds:to_crawl";
        String seedsAll = "iciba:seeds:all";
        try {
            stringRedisTemplate.opsForSet().add(waitingSet, value);
            stringRedisTemplate.opsForSet().add(seedsAll, value);
            return String.format("%s was added successfully!", value);
        } catch (RedisConnectionFailureException e) {
            e.printStackTrace();
            return "Redis connection failed";
        }
    }
}
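The controller adds every submitted word to both sets unconditionally. A variation, sketched below in Python, queues a word only the first time it is seen by checking the return value of `sadd` (in redis-py, `sadd` returns the number of members actually added). The `InMemorySets` class is a made-up stand-in so the example runs without a Redis server, and `add_seed` is a hypothetical helper rather than part of the project.

```python
class InMemorySets:
    """Tiny stand-in for the Redis set commands used below."""

    def __init__(self):
        self._sets = {}

    def sadd(self, name, *values):
        target = self._sets.setdefault(name, set())
        added = len(set(values) - target)
        target.update(values)
        return added  # same contract as Redis SADD: count of new members

    def scard(self, name):
        return len(self._sets.get(name, ()))


WAITING_SET = "iciba:seeds:to_crawl"
SEEDS_ALL = "iciba:seeds:all"


def add_seed(client, word):
    """Queue `word` only if it has never been submitted; return True if queued."""
    if client.sadd(SEEDS_ALL, word) == 0:
        return False
    client.sadd(WAITING_SET, word)
    return True


client = InMemorySets()
first = add_seed(client, "test")
second = add_seed(client, "test")
```

With the original behaviour a repeated word is queued again and then skipped by `insert_word`; checking up front saves that extra request.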

Running the jobs from the xxl-job scheduling center

https://github.com/xuxueli/xxl-job

  • Java: choose the Bean run mode in xxl-job.
    For the Java version, set the JobHandler name in the scheduling center and add the matching annotation inside the executor:

    @JobHandler(value="demoJobHandler")
  • Python: use the GLUE(Python) mode

Note: if several Python versions are installed on Linux, the script command prefix can be changed in com.xxl.job.core.glue.GlueTypeEnum, for example: GLUE_PYTHON("GLUE(Python)", true, "python3.6", ".py")

Code for the GLUE(Python) mode:

#!/usr/bin/python3.6
# -*- coding: UTF-8 -*-
import time
import sys
print(sys.stdout)
sys.path.append('F:\\workspace\\haipproxy-0.1') # project path on Windows; change it on Linux
from examples.iciba.iciba_spider import xxl_job
xxl_job()
exit(0)

Crawling through anonymous proxies with haipproxy

https://github.com/SpiderClub/haipproxy

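This open-source project crawls a number of free proxy sites to collect proxy IP addresses, then filters and scores them. The scoring strategy looks at four dimensions for each IP: request success rate, response speed, time of the most recent validation, and whether the proxy is anonymous.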
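To make the four dimensions concrete, here is a toy scoring function. It is not haipproxy's actual algorithm: the function name, the weights, and the normalisation formulas are all assumptions chosen only to show how such dimensions could be combined into a single score.

```python
import time


def score_proxy(success_rate, response_seconds, last_verified_at, is_anonymous, now=None):
    """Combine the four dimensions into a score between 0 and 1 (weights are arbitrary)."""
    now = time.time() if now is None else now
    speed = 1.0 / (1.0 + max(response_seconds, 0.0))       # faster responses score higher
    age_hours = max(now - last_verified_at, 0.0) / 3600.0
    freshness = 1.0 / (1.0 + age_hours)                    # recent validation scores higher
    anonymity = 1.0 if is_anonymous else 0.0
    return 0.4 * success_rate + 0.3 * speed + 0.2 * freshness + 0.1 * anonymity


fast = score_proxy(0.9, 0.5, last_verified_at=0, is_anonymous=True, now=0)
slow = score_proxy(0.9, 5.0, last_verified_at=0, is_anonymous=True, now=0)
```

Sorting candidate proxies by such a score and picking from the top is one simple way to prefer reliable, fast, recently checked, anonymous proxies.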

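The detailed documentation is in the project's wiki on GitHub. I ran into quite a few pitfalls while setting up the Python environment and dependencies, and I ended up learning Python and Redis just to be able to use this project…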