Some Web Scraping Tips


The main tricks, both sketched right after this list, are:

  • 1. Clean up links that carry a # fragment
  • 2. Drop links that have no hostname

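Both rules boil down to a single urlparse check. Here is a minimal sketch of the idea (the helper name clean_link and the sample calls are mine, not from the original post):

import urlparse

def clean_link(href):
    # Return href without its # fragment, or None if it has no hostname.
    parts = urlparse.urlparse(href)
    if parts.hostname is None:    # e.g. "javascript:;"
        return None
    # rebuilding from hostname + path drops the fragment (and the query)
    return "http://" + parts.hostname + parts.path

print clean_link("javascript:;")                           # None
print clean_link("//star.taobao.com/#guid-1408957859445")  # http://star.taobao.com/
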
urllib + beautifulsoup + urlparse

The usual approach is urllib plus BeautifulSoup to pull the links off a page, for example:

import urllib
from BeautifulSoup import BeautifulSoup

url = "http://www.taobao.com/"
# urllib has no open(); urlopen() is the function that fetches the page
htmltext = urllib.urlopen(url).read()
soup = BeautifulSoup(htmltext)

# match only <a> tags that actually carry an href attribute
for tag in soup.findAll('a', href=True):
    print tag['href']

The links collected this way are fairly messy:

https://s.2.taobao.com/list/theme.htm?spm=2007.6815005.7.9.mlDc8r&id=cat
//star.taobao.com/#guid-1408957859445
//star.taobao.com/?slide=2
//star.taobao.com/#guid-1408957859445
//www.tmall.hk/wow/import/act/shengdanjie
javascript:;
http://survey.taobao.com/survey/Ag3fFWYv

Links like "javascript:;" or ones carrying a # fragment are useless; urlparse can filter them out.

import urlparse

for tag in soup.findAll('a', href=True):
    parts = urlparse.urlparse(tag['href'])
    # fall back to the page's own host when the link has none ("javascript:;");
    # url[7:] just skips the leading "http://"
    host = parts.hostname or url[7:]
    # .path never includes the # fragment
    path = parts.path
    new_url = "http://" + host + path
    print new_url

For links without a hostname, i.e. things like "javascript:;", the host falls back to "www.taobao.com". For links carrying a #, urlparse.urlparse(tag['href']).path drops the fragment as well. But having to dig the host back out of the original url, skipping the "http://" with url[7:], is really clumsy. (A tidier standard-library alternative is sketched below.)
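
A less clumsy way to recover the host, still using only the standard library, is urlparse.urljoin, which resolves a possibly-relative href against the page URL. A minimal sketch reusing the url and soup objects from above (this variant is my suggestion, not part of the original post):

import urlparse

for tag in soup.findAll('a', href=True):
    # urljoin handles scheme-relative links ("//star.taobao.com/...")
    # and relative paths without any manual url[7:] slicing
    absolute = urlparse.urljoin(url, tag['href'])
    parts = urlparse.urlparse(absolute)
    if parts.hostname is not None:    # skips "javascript:;"
        print "http://" + parts.hostname + parts.path

This is essentially what mechanize does for you, as the next section shows.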

mechanize

mechanize is more convenient: every link it finds already carries a base_url together with the (possibly relative) url.

import urlparse
import mechanize

url = "http://www.taobao.com/"
br = mechanize.Browser()
br.open(url)

for link in br.links():
    # each mechanize link carries base_url and url, so urljoin resolves it
    newurl = urlparse.urljoin(link.base_url, link.url)
    newurl_urlparse = urlparse.urlparse(newurl)
    # hostname + path drops the # fragment
    part1 = newurl_urlparse.hostname
    part2 = newurl_urlparse.path
    # if hostname is None, the link is something like "javascript:;"
    if part1 is not None:
        print "http://" + part1 + part2

The cleaned-up output:
http://star.taobao.com/
http://star.taobao.com/
http://www.tmall.hk/wow/import/act/shengdanjie
http://survey.taobao.com/survey/Ag3fFWYv
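
One more refinement worth noting (my addition, not from the original post): urlparse.urldefrag strips the # fragment while keeping the scheme and query intact, so https links are not forced down to http://.

import urlparse

# urldefrag returns (url_without_fragment, fragment)
clean, frag = urlparse.urldefrag("https://star.taobao.com/?slide=2#guid-1408957859445")
print clean  # https://star.taobao.com/?slide=2
print frag   # guid-1408957859445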