Scraping the Douban Top 250 Movies

Following up on the earlier post 《简单聊聊scrapy》, this time we get into the concrete steps. The first thing, of course, is to create a project: scrapy startproject doubantop. Once the command has run you will see the project folder; a few of the more important files are described below. settings.py is the project's configuration file: it holds USER_AGENT, cookie settings, some middleware configuration, and you can also add your own entries, for example the MySQL connection settings (I use MySQL here; MongoDB works just as well if you prefer). pipelines.py is where the data parsed by the spider is post-processed, for example inserted into the database. middlewares.py is where you can define your own middlewares to handle the User-Agent, headers and so on. Finally, the spiders directory is where the actual spider code lives. The code, with comments, follows below.
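
As a concrete example, here is a minimal sketch of the custom part of settings.py. The MYSQL_* keys match what the DbHelper class reads later on; the actual values (host, credentials, the pipeline priority 300) are placeholders of mine that you would replace with your own:

# settings.py (excerpt) -- example values, replace with your own

# identify the crawler; any real browser UA string works here
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'

# register the pipeline so parsed items are handed to it
ITEM_PIPELINES = {
    'doubantop.pipelines.DoubantopPipeline': 300,
}

# custom MySQL settings read by DbHelper via get_project_settings()
MYSQL_HOST = '127.0.0.1'
MYSQL_PORT = 3306
MYSQL_USER = 'root'
MYSQL_PASSWORD = 'your_password'
MYSQL_DBNAME = 'douban'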

Spider

# -*- coding: utf-8 -*-

import scrapy
from bs4 import BeautifulSoup
from doubantop.items import DoubantopItem


class DoubanTopSpider(scrapy.Spider):
    name = 'doubantop'
    start_urls = ['https://movie.douban.com/top250?start=0']

    # default callback for responses to start_urls
    def parse(self, response):
        # BeautifulSoup is a powerful and convenient Python parsing library
        soup = BeautifulSoup(response.body, "lxml")

        print soup.prettify()
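
The callback above only prints the prettified page, which is handy for checking that requests go through. A fuller version would fill one DoubantopItem per movie and follow the pagination. The sketch below is my own, and the selectors (div.item, span.title, span.rating_num, span.next and so on) are assumptions about Douban's markup that may need adjusting:

    # a fuller parse(), still inside DoubanTopSpider -- selectors are assumptions
    def parse(self, response):
        soup = BeautifulSoup(response.body, "lxml")
        for movie in soup.find_all("div", class_="item"):
            item = DoubantopItem()
            item['rank'] = movie.find("em").get_text()
            item['title'] = movie.find("span", class_="title").get_text()
            item['picPath'] = movie.find("img")['src']
            item['info'] = movie.find("div", class_="bd").find("p").get_text().strip()
            item['score'] = movie.find("span", class_="rating_num").get_text()
            item['evaluateNum'] = movie.find("div", class_="star").find_all("span")[-1].get_text()
            inq = movie.find("span", class_="inq")
            item['inq'] = inq.get_text() if inq else ''
            yield item

        # keep following the "next page" link until all 250 entries are visited
        next_span = soup.find("span", class_="next")
        next_link = next_span.find("a") if next_span else None
        if next_link:
            yield scrapy.Request(response.urljoin(next_link['href']))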

DoubantopItem

import scrapy


class DoubantopItem(scrapy.Item):
    # define the fields for your item here, like the ones below
    picPath = scrapy.Field()
    rank = scrapy.Field()
    title = scrapy.Field()
    info = scrapy.Field()
    score = scrapy.Field()
    evaluateNum = scrapy.Field()
    inq = scrapy.Field()
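
A scrapy.Item behaves like a dict, which is why the spider and the pipeline below can read and write its fields with subscript syntax. A tiny illustration (the title string is just an example):

item = DoubantopItem()
item['title'] = u'肖申克的救赎'  # assign a field
print item['title']              # and read it back like a dict entry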

DbHelper

#!/usr/bin/python
# -*- coding: utf-8 -*-

import MySQLdb
from scrapy.utils.project import get_project_settings


# a small helper class that wraps the MySQL operations
class DbHelper():
    def __init__(self):
        # read the connection parameters we added to settings.py
        self.settings = get_project_settings()
        self.host = self.settings['MYSQL_HOST']
        self.port = self.settings['MYSQL_PORT']
        self.user = self.settings['MYSQL_USER']
        self.passwd = self.settings['MYSQL_PASSWORD']
        self.db = self.settings['MYSQL_DBNAME']

    def connection(self):
        conn = MySQLdb.connect(
            host=self.host,
            port=self.port,
            user=self.user,
            passwd=self.passwd,
            db=self.db,
            charset='utf8'  # needed so Chinese titles are stored correctly
        )
        return conn

    def excuteSql(self, sql, params):
        conn = self.connection()
        cur = conn.cursor()
        cur.execute(sql, params)
        conn.commit()
        cur.close()
        conn.close()
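
For the INSERT in the pipeline below to succeed, a matching movie table has to exist. Here is a one-off sketch that creates it through the helper above; the column types are my assumption, not taken from the project, so adjust them as you like:

# create_table.py -- illustrative one-off script; column types are assumptions
from doubantop.dbhelper import DbHelper

CREATE_MOVIE_TABLE = """
CREATE TABLE IF NOT EXISTS `movie` (
    `id`          INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    `title`       VARCHAR(255) NOT NULL,
    `picPath`     VARCHAR(512),
    `rank`        INT,
    `info`        TEXT,
    `score`       VARCHAR(16),
    `evaluateNum` VARCHAR(32),
    `inq`         VARCHAR(255),
    `createdTime` INT,
    `updatedTime` INT
) DEFAULT CHARSET=utf8
"""

DbHelper().excuteSql(CREATE_MOVIE_TABLE, ())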

DoubantopPipeline

from doubantop.dbhelper import DbHelper
import time


# the pipeline stores the scraped items in the database, using the DbHelper class above
class DoubantopPipeline(object):
    def process_item(self, item, spider):
        self.insert_item(item)
        return item

    def insert_item(self, item):
        sql = "INSERT INTO `movie` (`title`, `picPath`, `rank`, `info`, `score`, `evaluateNum`, `inq`, `createdTime`, `updatedTime`) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s)"
        params = (item['title'], item['picPath'], item['rank'], item['info'], item['score'], item['evaluateNum'], item['inq'], int(time.time()), int(time.time()))
        dbHelper = DbHelper()
        dbHelper.excuteSql(sql, params)
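
With the pipeline registered in ITEM_PIPELINES (see the settings.py excerpt near the top), the whole crawl is started from the project root with Scrapy's standard command:

scrapy crawl doubantop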

The above only covers what each class does, but once you understand them you can write a simple crawler yourself. Of course, the tricky part of crawling is the cat-and-mouse game with the site's engineers: they won't let you scrape their pages that easily. To avoid getting banned you will also want to look into a User-Agent pool, an IP proxy pool, simulated login and the like (a minimal User-Agent middleware sketch is appended after the project link below). The complete Douban Top 250 project is on GitHub; the address is attached below.
—— Douban Top 250 movie crawler project
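
As mentioned above, here is a minimal sketch of a random User-Agent downloader middleware for middlewares.py. The class name, the two UA strings and the priority value are placeholders of mine, not the project's actual code, and the middleware still has to be enabled through DOWNLOADER_MIDDLEWARES:

# middlewares.py (excerpt) -- illustrative sketch only
import random


class RandomUserAgentMiddleware(object):
    # a tiny pool; in practice load a long list of real browser UA strings
    user_agents = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36',
    ]

    def process_request(self, request, spider):
        # pick a different User-Agent for every outgoing request
        request.headers['User-Agent'] = random.choice(self.user_agents)


# settings.py -- enable the middleware (the number is just a relative priority)
DOWNLOADER_MIDDLEWARES = {
    'doubantop.middlewares.RandomUserAgentMiddleware': 400,
}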