官方澳门新永利下载:python爬虫爬房多多链家房源信息,使用Maven将Excel导入数据库

官方澳门新永利下载 1Paste_Image.png官方澳门新永利下载 2Paste_Image.png

屡次索要用到excel导入数据库的同校,其实可以毫无每趟费力编写POI-java代码,以往maven可以一本万利的机关导入,只须要配置好相关安装,方便了成千上万
示范使用mysql DB

有这样多个excel表格,有好六个你想关切的小区,只要在小区上面填好类似
的页面,脚本自动爬下当天的价位数据填回去excel中。sheet1是房多多(index =
0)sheet2是链家 (index = 1)

官方澳门新永利下载,excel内容格式如下

#!/usr/bin/env python# -*- coding: utf-8 -*- import urllib2import xlrdimport xlwtimport xlutilsimport datetimefrom xlutils.copy import copyimport sysreloadsys.setdefaultencodingarrs = ["http://esf.fangdd.com/suzhou/xiaoqu_64364.html", "http://esf.fangdd.com/suzhou/xiaoqu_65115.html", "http://esf.fangdd.com/suzhou/xiaoqu_65123.html"]currentRow = 0;arrr = []def testXlrd(filename, index): book = xlrd.open_workbook sh = book.sheet_by_index #print sh.nrows, sh.ncols rows = sh.row_values print rows return rows, sh.nrowsdef testXlwt(filename, arrss, index): book = xlrd.open_workbook sh = book.sheet_by_index lastdate = sh.cell(currentRow - 1,0).value wsh = copy wsh2 = wsh.get_sheet now = datetime.datetime.now() date = now.strftime if date == lastdate: print "今天已经执行过啦" return wsh2.write(currentRow, 0, date) i = 1 for arr in arrss: wsh2.write(currentRow, i, arrr[i - 1]) i = i + 1 print currentRow, i, arrss[i - 2] wsh.save print "写入表成功"def pachong(arrss, index): for arr in arrss: url = arr.encode if url != "": up = urllib2.urlopen cont = up.read() if index == 0: head = '' tail = '' elif index == 1: head = '' tail = '' elif index == 2: head = '<div ><h2 >' tail = '</a></h2>' elif index == 3: head = '<h1 >' tail = '</h1>' elif index == 4: head = '' tail = '' ph = cont.find pj = cont.find(tail, ph + 1) print cont[ph + len : pj] if index == 0: head = '' tail = '' elif index == 1: head = '' tail = '</strong>' elif index == 2: head = '' tail = '' elif index == 3: head = '<big>' tail = '</big>' elif index == 4: head = '' tail = '' ph = cont.find pj = cont.find(tail, ph + 1) result = cont[ph + len : pj].strip() if index == 1: result = result[8:] print result + "元/平米" print "---------------分割线--------------------" arrr.append(result.decodeif __name__=='__main__': #index=0为爬取房多多下面二手房小区房源 rows, currentRow = testXlrd('hehehe.xls', 0) pachong testXlwt('hehehe.xls', arrr, 0) #index=1为爬取链家下面二手房小区房源 currentRow = 0; arrr = [] rows, currentRow = testXlrd('hehehe.xls', 1) pachong testXlwt('hehehe.xls', arrr, 1) #index=2为爬取房天下下面新房小区房源 # currentRow = 0; # arrr = [] # rows, currentRow = testXlrd('hehehe.xls', 2) # pachong # testXlwt('hehehe.xls', arrr, 2) #index=3为爬取房天下下面新房小区房源 currentRow = 0; arrr = [] rows, currentRow = testXlrd('hehehe.xls', 3) pachong testXlwt('hehehe.xls', arrr, 3) #index=4为爬取链家下面新房小区房源 currentRow = 0; arrr = [] rows, currentRow = testXlrd('hehehe.xls', 4) pachong testXlwt('hehehe.xls', arrr, 4)

官方澳门新永利下载 3

Paste_Image.png

根据excel格式建构相应的表结构

SET SESSION FOREIGN_KEY_CHECKS=0;

/* Drop Tables */
DROP TABLE learn_maven_list;

/* Create Tables */

CREATE TABLE `learn_maven_list` (
    `id` VARCHAR(64) NOT NULL COLLATE 'utf8_bin',
    `license_no` VARCHAR(255) NULL DEFAULT NULL COLLATE 'utf8_bin',
    `appler` VARCHAR(255) NOT NULL COLLATE 'utf8_bin',
    `farm` VARCHAR(255) NULL DEFAULT NULL COLLATE 'utf8_bin',
    `address` VARCHAR(255) NULL DEFAULT NULL COLLATE 'utf8_bin',
    `contect` VARCHAR(255) NULL DEFAULT NULL COLLATE 'utf8_bin',
    `product` VARCHAR(255) NULL DEFAULT NULL COLLATE 'utf8_bin',
    `large` VARCHAR(255) NULL DEFAULT NULL COLLATE 'utf8_bin',
    `date` DATE NULL DEFAULT NULL,
    `pirorid` VARCHAR(255) NULL DEFAULT NULL COLLATE 'utf8_bin',
    PRIMARY KEY (`id`),
    INDEX `appler` (`appler`, `id`)
)
COLLATE='utf8_bin'
ENGINE=InnoDB;

注意表名应该和excel的sheet名同样,假若有四个sheet,依次创建表结构就能够

官方澳门新永利下载 4

Paste_Image.png

发表评论

电子邮件地址不会被公开。 必填项已用*标注