Analyzing a basic web scraper/crawler

import time
import requests
from bs4 import BeautifulSoup
from requests.exceptions import RequestException

# Browser-style headers so the target site treats our requests like a normal visitor's.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36"
}


def crawler(url):
    """Fetch a page and return its HTML text, or None if anything goes wrong."""
    try:
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException as e:
        print(f"Request to {url} failed: {e}")
        return None


def parse_directory(html):
    """Parse the story's directory page and yield one dictionary per chapter."""
    global story_title  # exposed so main() can reuse it, e.g. in an output filename
    soup = BeautifulSoup(html, "lxml")
    story_title = soup.select("#info h1")[0].string
    page_list = soup.select("#list dl dd a")[9:]  # skip the first nine links at the top of the list
    for page in page_list:
        chapter_link = page.get("href")
        chapter_title = page.string
        yield {"story title": story_title,
               "chapter title": chapter_title,
               "chapter_link": chapter_link}


def save(content, filename):
    """Append a piece of text to the given file."""
    with open(filename, "a+", encoding="utf-8") as file:
        file.write(content)


def parse_detail_content(htmls):
    """Download every chapter page and print its title and body content."""
    for item in htmls:
        detail_content_url = item.get("chapter_link")
        time.sleep(1)  # throttle requests so the site does not block our IP
        html = crawler(detail_content_url)
        if html is None:
            continue  # skip chapters that could not be downloaded
        soup = BeautifulSoup(html, "lxml")
        detail_content_title = soup.select(".box_con .bookname h1")
        contents = soup.select("#content p")  # the chapter's body paragraphs
        print(detail_content_title)
        print(contents)


def main():
    base_url = "http://www.b520.cc/0_7/"
    html = crawler(base_url)  # download the story's directory page
    if html is None:
        return  # nothing to do if the directory page could not be fetched
    content = list(parse_directory(html))
    # Optionally persist the chapter index before downloading the chapters:
    # for i in content:
    #     save(str(i) + "\n", "story_information{}.txt".format(story_title))
    parse_detail_content(content)


if __name__ == "__main__":
    main()

There are a few important steps to building a basic web crawler, with a separate function handling each responsibility. The first step is to import the prerequisites the project needs: the time module, the requests library, BeautifulSoup from the bs4 package, and RequestException from requests.exceptions, so that we can report any errors raised by the requests we send to the website the user specifies later on. The next step is to add browser-style headers so that the target website treats our requests as coming from a normal browser rather than from a crawler.
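
As a quick illustration of what the header accomplishes, here is a small, self-contained sketch. It uses https://httpbin.org/headers purely as a stand-in echo service (it is not part of the original project) to show what User-Agent string the server actually receives with and without the custom header:

import requests

url = "https://httpbin.org/headers"  # echo service used only for this demonstration
plain = requests.get(url, timeout=10)
disguised = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)

# Without the custom header the server sees something like "python-requests/2.x",
# a signature many sites use to detect and block automated clients.
print(plain.json()["headers"]["User-Agent"])
print(disguised.json()["headers"]["User-Agent"])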

Once these prerequisites are installed and imported into the project environment, we can start defining our functions, each with its own purpose. The first is the crawler() function, which sends a GET request to the URL it is given and checks the response's status code; each status code describes what happened to the request on its way to its destination, the website the user pointed the crawler at. If the status code indicates success (200), the function returns the HTML of the page; otherwise it returns None. If the request itself fails, the function catches the resulting RequestException, reports it, and returns None as well, which is why we imported the exceptions package from the requests module at the beginning of the project.
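
As a variation on the same idea, the sketch below (the name crawler_strict is introduced here and is not part of the original script) lets requests turn 4xx/5xx responses into exceptions and adds a timeout so a hung connection cannot stall the whole crawl:

def crawler_strict(url):
    """Variant of crawler(): return the page HTML, or None on any failure."""
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # raise a RequestException for 4xx/5xx responses
        return response.text
    except RequestException as e:
        print(f"Failed to fetch {url}: {e}")
        return None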

The next function to define parses the directory page the user specified. Since this crawler targets a site that hosts the contents of many stories, this function needs to extract the story's title along with each chapter's title and link from the page's HTML. We use CSS selectors to reach the specific elements we are interested in, because generic calls to BeautifulSoup's built-in .find() or .find_all() methods are less precise for this layout. We also write the function as a generator, so that it yields one clean dictionary per chapter instead of handing back the raw parsing results.
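
To see how the CSS selectors and the generator fit together, here is a self-contained sketch that runs the same kind of parsing against a tiny hand-written HTML snippet; the snippet mirrors the ids used by the selectors above but is invented purely for illustration:

from bs4 import BeautifulSoup

sample_html = """
<div id="info"><h1>Example Story</h1></div>
<div id="list"><dl>
  <dd><a href="/0_7/1.html">Chapter 1</a></dd>
  <dd><a href="/0_7/2.html">Chapter 2</a></dd>
</dl></div>
"""

def parse_sample(html):
    soup = BeautifulSoup(html, "lxml")
    title = soup.select("#info h1")[0].string
    for link in soup.select("#list dl dd a"):
        # Yielding one dictionary per chapter keeps the caller's loop simple.
        yield {"story title": title,
               "chapter title": link.string,
               "chapter_link": link.get("href")}

for chapter in parse_sample(sample_html):
    print(chapter)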

The next function to define is save(), which writes the results the crawler gathers into a file, so that we can look back at the pages and information it has already visited. We use the with keyword together with open() so the file is closed automatically instead of having to call .close() ourselves, and we open the file in "a+" mode, which creates the file if it does not exist and opens it for reading and appending, so every write goes to the end of the file. That is exactly what we need: the gathered content is then written out with .write().
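
The append behaviour is easy to see in isolation: calling the save() function defined above twice with the same filename keeps both lines, because "a+" always writes at the end of the file. The filename here is just an example:

save("first line\n", "example_output.txt")
save("second line\n", "example_output.txt")

with open("example_output.txt", encoding="utf-8") as file:
    print(file.read())  # both lines are present, in the order they were written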

After creating save(), we write a more detailed function that digs into each individual chapter. It takes the chapter links produced by the directory parser, fetches each chapter page, and extracts the chapter's title and body text. We also call time.sleep(1) between requests to keep the website from blocking our IP address: no human fires off hundreds of requests per second, so throttling the request rate with time.sleep() makes the crawler far less likely to be flagged.
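
A stripped-down sketch of the throttling idea, using placeholder chapter dictionaries shaped like the ones the directory parser yields (the links below are invented, not real chapter URLs):

import time

chapters = [
    {"chapter title": "Chapter 1", "chapter_link": "http://www.b520.cc/0_7/1.html"},
    {"chapter title": "Chapter 2", "chapter_link": "http://www.b520.cc/0_7/2.html"},
]  # placeholder data; the real list comes from parse_directory()

for chapter in chapters:
    time.sleep(1)  # pause one second between requests so we do not hammer the site
    print(f"Would fetch: {chapter['chapter_link']}")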

Once the detailed parsing function is in place, we need one function that ties everything together and produces results for the user. We call it main(): it defines the directory URL we want to crawl, downloads that page's HTML with crawler(), and passes the HTML on to the parsing functions so they can extract the information we need.
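
If we also wanted to keep the chapter index on disk, main() could re-enable the save() call that is left commented out in the script above; a sketch of that variant (the output filename is just an example) might look like this:

def main():
    base_url = "http://www.b520.cc/0_7/"
    html = crawler(base_url)
    if html is None:
        return  # nothing to parse if the directory page could not be fetched
    chapters = list(parse_directory(html))
    for chapter in chapters:
        # Keep a record of the chapter index so it can be inspected later.
        save(str(chapter) + "\n", "story_information{}.txt".format(story_title))
    parse_detail_content(chapters)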

The last step before the crawler is up and running is to add the standard name check, so that main() is only called when the file is run directly rather than imported from elsewhere. With that in place, our first functional web crawler is complete!
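
For example, if the script were saved as crawler_demo.py (a filename chosen just for this illustration), running python crawler_demo.py starts the crawl, while importing it from another file does not:

# in another, hypothetical file such as analysis.py
import crawler_demo   # defines the functions, but main() does NOT run on import

crawler_demo.main()   # the crawl only starts when called explicitly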
