Some Web Scraping Tips
The main tricks (a quick sketch of both checks follows this list):
- 1 remove links that contain a "#" fragment
- 2 remove links that have no hostname
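Here is a minimal sketch, using hrefs of the kinds shown further down, of what urlparse reports in each case; a missing hostname and a non-empty fragment are the two signals to filter on:

import urlparse

samples = [
    "http://survey.taobao.com/survey/Ag3fFWYv",   # normal link: keep
    "//star.taobao.com/#guid-1408957859445",      # has a "#" fragment
    "javascript:;",                               # no hostname
]
for href in samples:
    parts = urlparse.urlparse(href)
    # hostname is None for "javascript:;"; fragment holds what follows "#"
    print href, "->", parts.hostname, parts.fragment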
urllib + beautifulsoup + urlparse
You would typically use urllib and BeautifulSoup to find the links on a page, e.g.:

import urllib
from bs4 import BeautifulSoup

url = "http://www.taobao.com/"
# note: urllib has no open(); the function is urlopen()
htmltext = urllib.urlopen(url).read()
soup = BeautifulSoup(htmltext)
for tag in soup.findAll('a', href=True):
    print tag['href']
The links found this way are pretty messy:

https://s.2.taobao.com/list/theme.htm?spm=2007.6815005.7.9.mlDc8r&id=cat
//star.taobao.com/#guid-1408957859445
//star.taobao.com/?slide=2
//star.taobao.com/#guid-1408957859445
//www.tmall.hk/wow/import/act/shengdanjie
javascript:;
http://survey.taobao.com/survey/Ag3fFWYv
Links like "javascript:;" or links carrying a "#" are useless; urlparse can filter them out:

import urlparse

for tag in soup.findAll('a', href=True):
    raw = tag['href']
    # hostname is None for hrefs like "javascript:;", so fall back to
    # the page's own host (url[7:] skips the leading "http://")
    host = urlparse.urlparse(raw).hostname or url[7:]
    # .path never includes the "#fragment" part
    path = urlparse.urlparse(raw).path
    new_url = "http://" + host + path
    print new_url
For links without a hostname, i.e. things like "javascript:;", the host falls back to "www.taobao.com". For links carrying a "#", urlparse.urlparse(tag['href']).path drops the fragment as well. The downside is that the host has to be recovered from the original url, and skipping "http://" with url[7:] is really clumsy.
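A less clumsy route, sketched here as a suggestion rather than something from the original post, is urlparse.urljoin, which resolves a relative href against the page URL without any string slicing:

import urlparse

for tag in soup.findAll('a', href=True):
    # resolve protocol-relative hrefs ("//star.taobao.com/...") and
    # plain paths against the page URL instead of slicing off "http://"
    absolute = urlparse.urljoin(url, tag['href'])
    parts = urlparse.urlparse(absolute)
    # hostname is still None for things like "javascript:;"
    if parts.hostname is not None:
        # scheme normalized to http://, same as the snippet above
        print "http://" + parts.hostname + parts.path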
mechanize
mechanize is more convenient: every link it finds already carries the page's base_url along with the (possibly relative) url.

import urlparse
import mechanize

url = "http://www.taobao.com/"
br = mechanize.Browser()
br.open(url)
for link in br.links():
    # join the base_url with the possibly-relative link target
    newurl = urlparse.urljoin(link.base_url, link.url)
    newurl_urlparse = urlparse.urlparse(newurl)
    # .path drops the "#fragment" part
    part1 = newurl_urlparse.hostname
    part2 = newurl_urlparse.path
    # hostname is None for things like "javascript:;"
    if part1 is not None:
        print "http://" + part1 + part2
The first line of the output:

http://star.taobao.com/
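One caveat, added here as a side note: rebuilding URLs from hostname + path silently drops the query string (the "?spm=..." part in the Taobao links above). If the query matters, urlparse.urldefrag strips only the "#fragment" and keeps everything else:

import urlparse

# sample links adapted from the messy output above
links = [
    "http://star.taobao.com/#guid-1408957859445",
    "https://s.2.taobao.com/list/theme.htm?spm=2007.6815005.7.9.mlDc8r&id=cat",
]
for u in links:
    # urldefrag returns (url_without_fragment, fragment)
    clean, frag = urlparse.urldefrag(u)
    print clean
# prints:
# http://star.taobao.com/
# https://s.2.taobao.com/list/theme.htm?spm=2007.6815005.7.9.mlDc8r&id=cat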