Some Web Scraping Tips


The main tricks, both sketched right after this list, are:

  • 1. Clean up links that carry a # fragment
  • 2. Drop links that have no hostname

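Both rules boil down to a single urlparse check. Here is a minimal sketch of the idea (the helper name clean_link and the sample calls are mine, not from the original post):

import urlparse

def clean_link(href):
    # Return href without its # fragment, or None if it has no hostname.
    parts = urlparse.urlparse(href)
    if parts.hostname is None:    # e.g. "javascript:;"
        return None
    # rebuilding from hostname + path drops the fragment (and the query)
    return "http://" + parts.hostname + parts.path

print clean_link("javascript:;")                           # None
print clean_link("//star.taobao.com/#guid-1408957859445")  # http://star.taobao.com/
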
urllib + beautifulsoup + urlparse

The usual approach is urllib plus BeautifulSoup to pull the links off a page, for example:

import urllib
from BeautifulSoup import BeautifulSoup

url = "http://www.taobao.com/"
# urllib has no open(); urlopen() is the function that fetches the page
htmltext = urllib.urlopen(url).read()
soup = BeautifulSoup(htmltext)

# match only <a> tags that actually carry an href attribute
for tag in soup.findAll('a', href=True):
    print tag['href']

The links collected this way are fairly messy:

https://s.2.taobao.com/list/theme.htm?spm=2007.6815005.7.9.mlDc8r&id=cat
//star.taobao.com/#guid-1408957859445
//star.taobao.com/?slide=2
//star.taobao.com/#guid-1408957859445
//www.tmall.hk/wow/import/act/shengdanjie
javascript:;
http://survey.taobao.com/survey/Ag3fFWYv

Links like "javascript:;" or ones carrying a # fragment are useless; urlparse can filter them out.

import urlparse

for tag in soup.findAll('a', href=True):
    parts = urlparse.urlparse(tag['href'])
    # fall back to the page's own host when the link has none ("javascript:;");
    # url[7:] just skips the leading "http://"
    host = parts.hostname or url[7:]
    # .path never includes the # fragment
    path = parts.path
    new_url = "http://" + host + path
    print new_url

For links without a hostname, i.e. things like "javascript:;", the host falls back to "www.taobao.com". For links carrying a #, urlparse.urlparse(tag['href']).path drops the fragment as well. But having to dig the host back out of the original url, skipping the "http://" with url[7:], is really clumsy. (A tidier standard-library alternative is sketched below.)
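
A less clumsy way to recover the host, still using only the standard library, is urlparse.urljoin, which resolves a possibly-relative href against the page URL. A minimal sketch reusing the url and soup objects from above (this variant is my suggestion, not part of the original post):

import urlparse

for tag in soup.findAll('a', href=True):
    # urljoin handles scheme-relative links ("//star.taobao.com/...")
    # and relative paths without any manual url[7:] slicing
    absolute = urlparse.urljoin(url, tag['href'])
    parts = urlparse.urlparse(absolute)
    if parts.hostname is not None:    # skips "javascript:;"
        print "http://" + parts.hostname + parts.path

This is essentially what mechanize does for you, as the next section shows.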

mechanize

mechanize is more convenient: every link it finds already carries a base_url together with the (possibly relative) url.

import urlparse
import mechanize

url = "http://www.taobao.com/"
br = mechanize.Browser()
br.open(url)

for link in br.links():
    # each mechanize link carries base_url and url, so urljoin resolves it
    newurl = urlparse.urljoin(link.base_url, link.url)
    newurl_urlparse = urlparse.urlparse(newurl)
    # hostname + path drops the # fragment
    part1 = newurl_urlparse.hostname
    part2 = newurl_urlparse.path
    # if hostname is None, the link is something like "javascript:;"
    if part1 is not None:
        print "http://" + part1 + part2

The cleaned-up output:
http://star.taobao.com/
http://star.taobao.com/
http://www.tmall.hk/wow/import/act/shengdanjie
http://survey.taobao.com/survey/Ag3fFWYv
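
One more refinement worth noting (my addition, not from the original post): urlparse.urldefrag strips the # fragment while keeping the scheme and query intact, so https links are not forced down to http://.

import urlparse

# urldefrag returns (url_without_fragment, fragment)
clean, frag = urlparse.urldefrag("https://star.taobao.com/?slide=2#guid-1408957859445")
print clean  # https://star.taobao.com/?slide=2
print frag   # guid-1408957859445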