Web Crawling Lesson 1 Notes

There are many kinds of web crawlers, and the four most common types are full web crawlers, theme crawlers, incremental crawlers, and deep web crawlers.
Full web crawlers start from a small set of basic seed URLs and can expand from there to the entire website you are crawling (or, for search engines, the entire web). Crawlers of this type are very large: they take up a lot of storage and memory, and they need a fast, efficient machine to run smoothly.
Theme crawlers (also called focused crawlers) only crawl pages that match a theme you specify. If you only need a specific kind of information, use a theme crawler, since it skips pages that are unrelated to what you want.
Incremental crawlers are a good fit for websites that update constantly, such as YouTube, Netflix, or any movie site where new movies and TV shows keep being released. An incremental crawler crawls the entire website once; after that, each time the site updates, it fetches only the new or changed information instead of crawling the whole site again. This saves a lot of time, which makes incremental crawlers a good choice for something like a "currently trending movies" program.
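The incremental idea can be sketched in a few lines of Python. This is only a minimal illustration, assuming the site has a listing page from which the current URLs can be read; `crawl_incrementally`, `listing_urls`, `seen`, and `fetch` are hypothetical names, and `fetch` stands in for whatever function actually downloads a page.

```python
def crawl_incrementally(listing_urls, seen, fetch):
    """Fetch only the URLs that were not crawled on a previous run.

    listing_urls: URLs currently shown on the site's listing page.
    seen: set of URLs already crawled; updated in place.
    fetch: callable that downloads one URL and returns its content.
    """
    new_pages = {}
    for url in listing_urls:
        if url in seen:
            continue  # already crawled on an earlier run, skip it
        new_pages[url] = fetch(url)
        seen.add(url)
    return new_pages
```

On the first run `seen` is empty, so every page is fetched; on later runs only the newly added URLs are. A real crawler would also need to store `seen` between runs and decide how to detect pages whose content changed without the URL changing.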
Deep web crawlers target pages that cannot be reached by following ordinary links, for example content that only appears after submitting a search form or logging in. The deep web is often confused with the dark web, where a lot of illegal content lives, but the two are not the same thing. Even so, crawling content behind a login can violate a site's terms of service, and crawling anything illegal could incriminate you, so these notes don't go further into deep web crawlers.
Web crawlers use three main search strategies: depth-first, breadth-first, and focused crawling.
With a depth-first search, the crawler starts at a chosen URL and, whenever the page it is crawling links to other pages, follows one of those links straight away, then a link on that page, and so on, until it reaches a page with no more new links. It then backs up and follows the next unvisited link. In short, a depth-first crawler goes as deep as it can along one chain of links before trying another.
With a breadth-first search, the crawler first crawls every URL found on the current page, and only then moves on to the URLs found on those pages. The crawl proceeds level by level outward from the starting page.
With a focused crawling strategy, the crawler scores the pages it has downloaded and the links waiting in its queue, and crawls the highest-scoring ones first. This keeps the crawler focused on the most valuable pages, so efficiency stays high.
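To compare the three strategies, here is a small sketch that runs each one over a made-up link graph instead of real pages. The graph `LINKS`, the scores in `SCORES`, and the function names are all assumptions for illustration; a real crawler would download each page and extract its links instead of looking them up in a dictionary.

```python
import heapq
from collections import deque

# Hypothetical link graph: page -> links found on that page.
LINKS = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F"], "D": [], "E": [], "F": []}
# Hypothetical "value" of each page, used only by the focused strategy.
SCORES = {"A": 1, "B": 2, "C": 9, "D": 5, "E": 1, "F": 8}


def depth_first(start, links):
    """Follow one chain of links as deep as possible before backing up."""
    order, seen, stack = [], set(), [start]
    while stack:
        page = stack.pop()
        if page in seen:
            continue
        seen.add(page)
        order.append(page)
        # Reversed so the first link on the page is followed first.
        stack.extend(reversed(links.get(page, [])))
    return order


def breadth_first(start, links):
    """Crawl every link on the current level before going one level deeper."""
    order, seen, queue = [], {start}, deque([start])
    while queue:
        page = queue.popleft()
        order.append(page)
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order


def best_first(start, links, scores):
    """Always crawl the highest-scoring page waiting in the queue."""
    order, seen = [], {start}
    heap = [(-scores[start], start)]  # negate: heapq pops the smallest item
    while heap:
        _, page = heapq.heappop(heap)
        order.append(page)
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(heap, (-scores[nxt], nxt))
    return order


print(depth_first("A", LINKS))         # ['A', 'B', 'D', 'E', 'C', 'F']
print(breadth_first("A", LINKS))       # ['A', 'B', 'C', 'D', 'E', 'F']
print(best_first("A", LINKS, SCORES))  # ['A', 'C', 'F', 'B', 'D', 'E']
```

The only real difference between the three is the data structure holding the pages that are waiting to be crawled: a stack, a queue, or a priority queue.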
Some websites don't like crawlers because they send a large number of requests. Once the server's request limit is reached, it slows down and has less capacity left for the website's legitimate business, which is why some websites use "tactics" to keep crawlers out. The six most common are: checking request headers, limiting access frequency, anti-crawler fonts, loading data dynamically, data encryption, and verification codes that stop a robot (that is, a web crawler) from sending loads of requests to the server.
Request header checks: the site requires a User-Agent header in the request, and if there is none, the site's middleware blocks the normal response. Some sites also check the Referer header, so you need to send both a User-Agent and a Referer, which tell the site what client is making the request and which page it came from.
Access frequency limits: many sites record your IP address when you visit, whether you are crawling or just browsing. If a site sees too many requests coming from the same IP, it bans that IP, which shuts down any crawler running from that address.
Anti-crawler fonts: some sites display their text with a special custom font, so the characters a crawler reads from the HTML source come out as illegible garble even though the page looks normal in a browser.
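For the request-header and access-frequency checks described above, a crawler can send the expected headers and slow itself down. The following is a minimal sketch using only the standard library; the User-Agent string, the `build_request` helper, and the `Throttle` class are assumptions made up for this example, and a fixed delay is only one of several ways to stay under a site's limit.

```python
import time
import urllib.request


def build_request(url, referer):
    """Build a request carrying the headers many sites check for."""
    headers = {
        # Placeholder value; use a User-Agent that fits your crawler.
        "User-Agent": "Mozilla/5.0 (compatible; lesson-crawler/0.1)",
        "Referer": referer,
    }
    return urllib.request.Request(url, headers=headers)


class Throttle:
    """Make sure at least `min_interval` seconds pass between requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

In use, you would call `throttle.wait()` before each `urllib.request.urlopen(build_request(url, referer))`, so the requests from your IP are spaced out instead of arriving in a burst.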
Dynamically loaded data: the data the crawler wants is not in the page's initial HTML, because the site loads it afterwards with Ajax. To get past this barrier, you have to find the Ajax interface the page calls, analyze its structure and parameters, and have the crawler simulate the same Ajax requests to get the data.
Data encryption: the site encrypts certain parameters with JavaScript. One solution is to find the encrypted parameters in the site's JS files and work out how they are produced, but that means mastering the front-end technology, and it is very time-consuming and difficult. Another solution is to use Selenium to simulate a human using the website; Selenium drives a real browser, so the page's JavaScript is triggered as normal, which "tricks" the website into thinking a human is using it and not a crawler. The drawback is that crawling through Selenium is much slower than crawling without it, so developers only use this method when they are forced to.
Verification: the last tactic a website can use is making you prove you are human with a verification problem, code, or image. Some checks are easy, such as a simple math problem; others ask you to select certain images that match a given prompt, or send you a verification email or code. These can be handled with third-party applications or OCR (optical character recognition) technology.
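As a rough illustration of simulating Ajax for dynamically loaded data, the sketch below builds the URL a page's Ajax call might use and pulls the wanted fields out of a JSON response. Everything specific here is an assumption: the endpoint `https://example.com/api/movies`, the `page` and `size` parameters, and the `items`/`title` fields are invented, and on a real site you would read the actual interface from the browser's network panel.

```python
import json
from urllib.parse import urlencode

# Hypothetical Ajax endpoint; the real one comes from inspecting the site.
AJAX_ENDPOINT = "https://example.com/api/movies"


def build_ajax_url(page, page_size=20):
    """Build the URL the page's own JavaScript would request."""
    query = urlencode({"page": page, "size": page_size})
    return f"{AJAX_ENDPOINT}?{query}"


def extract_titles(response_text):
    """Pull the titles out of the JSON the Ajax interface returns."""
    payload = json.loads(response_text)
    return [item["title"] for item in payload.get("items", [])]


print(build_ajax_url(2))
sample = '{"items": [{"title": "Movie A"}, {"title": "Movie B"}]}'
print(extract_titles(sample))
```

Requesting the interface directly like this is usually much faster than rendering the page in a browser, because the crawler downloads only the data and not the whole page.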
Slider captchas, which ask you to slide a piece of an image until it lines up with the rest of the image, can be handled with the PIL (Pillow) library, which is an image processing library.
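One common way to approach the slider problem is to compare two versions of the captcha image pixel by pixel and find where they start to differ. The sketch below assumes you can obtain both the complete background image and the one with the gap, which is not true of every site; `find_gap_offset` and its `threshold` value are hypothetical. The function only uses the `size` attribute and `getpixel()` method that Pillow `Image` objects provide, and it expects RGB images.

```python
def find_gap_offset(full_image, gap_image, threshold=60):
    """Return the x position of the first column where the images differ.

    Both arguments are expected to be same-sized RGB images exposing
    `size` and `getpixel((x, y))`, as Pillow Image objects do.
    Returns None if no pixel differs by more than `threshold`.
    """
    width, height = full_image.size
    for x in range(width):  # scan left to right, column by column
        for y in range(height):
            r1, g1, b1 = full_image.getpixel((x, y))[:3]
            r2, g2, b2 = gap_image.getpixel((x, y))[:3]
            difference = abs(r1 - r2) + abs(g1 - g2) + abs(b1 - b2)
            if difference > threshold:
                return x  # left edge of the gap
    return None
```

With Pillow installed, the images could be loaded with `Image.open("full.png").convert("RGB")` and `Image.open("gap.png").convert("RGB")` after `from PIL import Image` (the file names are placeholders). The returned offset is how far the slider piece needs to move, measured in image pixels.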
