An understanding of anti-crawling measures and the deep web informs this project's architecture and design. A program does not win by sheer volume of code: implementing the most functionality with the least code has always been the ideal of program design.
Abstract: In the era of big data, network data must be acquired automatically by computer, with the relevant information stored locally. This project analyzes film and television data from the Douban website, using a Python web crawler to capture Douban's video information, because traditional search engines have limitations such as inaccurate results that cannot be extended. At present, most systems use a crawler as their first technical module, responding to user queries with specific statements. The main problem today is the exponential growth of network data, which makes information difficult to collect accurately. The limitations of data search have long been a difficulty and bottleneck, and the Internet contains many crawlers customized for specific topics. A topic-focused crawler can crawl relevant data effectively, filtering out useful information according to the patterns and characteristics of the crawled data together with a set of predefined rules.
This paper is based on Python and the Scrapy environment, with film and television websites as the crawling target. It studies the modules and techniques applied by today's specialized movie crawlers, implements a movie crawler based on those modules and frameworks, captures and processes the data, stores the information in MySQL, and finally consolidates, modifies, sorts, and exports it.
Regular expressions and the requests module are used to extract the required fields from the HTML source code, with emphasis on concise and fast processing to reach the final goal; throughout the research, the role of regular expressions within the web framework is demonstrated.
The paper then raises questions about applying big data and anti-crawling techniques, decomposes the data into blocks, and addresses deep crawling and its related problems. By studying the modules above, the research direction is determined, offering a panorama of big data and emphasizing the use of Python algorithms and custom crawlers to locate the desired data within it.
Keywords: Python; web crawler; targeted information; HTML; Scrapy; big data; crawling algorithm
Design and Realization of a Movie Information Query System Based on a Web Crawler
Abstract: In the era of big data, network data must be acquired automatically by computer. This project implements a web crawler in Python that stores the retrieved information locally, and it analyzes film and television data from the Douban website. A Python web crawler is used to capture Douban's video information because traditional search engines have limitations, such as inaccurate results that cannot be extended. At present, most systems use a crawler as their first technical module, responding to user queries with specific statements. The main problem is the exponential growth of web data, which makes information hard to collect accurately. The limitations of data search have long been a difficulty and bottleneck, and the Internet contains many crawlers customized for specific topics. A topic-focused crawler can crawl relevant data effectively, filtering out useful information according to the patterns and characteristics of the crawled data together with a set of predefined rules.
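The topic-focused filtering described above can be sketched as a simple rule check. The keyword list and the two-hit threshold below are assumptions for illustration only; the thesis does not specify the actual topic rules.

```python
import re

# Hypothetical topic rules (an assumption for illustration): a page is
# kept only when its text matches enough movie-related keywords.
TOPIC_KEYWORDS = ["movie", "film", "director", "rating", "cast"]

def is_on_topic(text, min_hits=2):
    """Return True when the page text matches at least `min_hits` keywords."""
    text = text.lower()
    hits = sum(1 for kw in TOPIC_KEYWORDS if kw in text)
    return hits >= min_hits

def filter_pages(pages):
    """Keep only the URLs whose page text passes the topic rules."""
    return [url for url, text in pages if is_on_topic(text)]
```

A focused crawler would apply such a check before following a page's outgoing links, so off-topic branches of the web are pruned early.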
This paper is based on the Python and Scrapy environment and takes a video website as the crawling target. It studies the modules and techniques applied by today's specialized movie crawlers, implements a movie crawler based on those modules and frameworks, captures and processes the data, stores the information in MySQL, and finally consolidates, modifies, sorts, and exports it.
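The capture-then-store step can be sketched as follows. To keep the example self-contained, the standard-library sqlite3 module stands in for MySQL (the real project would use a MySQL driver with the same SQL shape), and the `movies` table and its columns are assumptions, not the thesis's actual schema.

```python
import sqlite3

# Assumed schema for illustration; the real project stores into MySQL
# with its own table layout.
def store_movies(conn, movies):
    """Persist (title, rating) pairs captured by the crawler."""
    conn.execute("CREATE TABLE IF NOT EXISTS movies (title TEXT, rating REAL)")
    conn.executemany("INSERT INTO movies VALUES (?, ?)", movies)
    conn.commit()

def top_rated(conn, limit=3):
    """Sort-and-export step: return the highest-rated titles first."""
    cur = conn.execute(
        "SELECT title FROM movies ORDER BY rating DESC LIMIT ?", (limit,)
    )
    return [row[0] for row in cur]

conn = sqlite3.connect(":memory:")
store_movies(conn, [("A", 8.7), ("B", 9.2), ("C", 7.9)])
```

Letting the database perform the sorting (`ORDER BY rating DESC`) keeps the export step simple and matches the consolidate-sort-export workflow described above.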
Regular expressions and the requests module are used to extract the required fields from the HTML source code, with emphasis on concise and fast processing to reach the final goal; throughout the research, the role of regular expressions within the web framework is demonstrated.
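A minimal sketch of this regex extraction step follows. The HTML snippet is hard-coded so the example is self-contained; in the real crawler it would come from `requests.get(url).text`, and the tag layout and class names here are assumptions, not necessarily Douban's actual markup.

```python
import re

# Stand-in for the page source the crawler would fetch with requests,
# e.g. html = requests.get(url).text
html = """
<li><span class="title">The Shawshank Redemption</span>
    <span class="rating_num">9.7</span></li>
<li><span class="title">Farewell My Concubine</span>
    <span class="rating_num">9.6</span></li>
"""

# Non-greedy groups capture each title/rating pair; re.S lets the
# pattern span the line break between the two <span> tags.
pattern = re.compile(
    r'<span class="title">(.*?)</span>\s*'
    r'<span class="rating_num">(.*?)</span>',
    re.S,
)
movies = [(title, float(rating)) for title, rating in pattern.findall(html)]
```

The non-greedy `(.*?)` groups are what keep the pattern both concise and fast: each group stops at the first closing tag instead of scanning to the end of the document.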
Finally, this paper describes open problems in applying big data and anti-crawling techniques, decomposes the data into blocks, and discusses deep crawling and its related questions. By studying the modules above, the research direction is determined: the paper offers a panoramic view of big data and emphasizes the use of Python algorithms and custom crawlers to locate the desired data within it.