# Web Crawling Lesson 2 Notes

A very important concept in web crawling is the efficiency of the web crawler. If your web crawler is slow, waiting for its results becomes tedious. Avoiding repeated data is also a big part of efficiency: there is no point in waiting a while for data to be returned, only to get back duplicates, which are useless. In practice, this means that when we generate or crawl urls for a website, we don't want to crawl the same url more than once. This is where url checks come in handy. The code below is an example for a movie website called "douban.com":

```python
url1 = "https://www.douban.com/var/page={page_num}"
url2 = "https://www.douban.com/view/page={page_num}"

list1 = []

if url1 in ["https://www.douban.com/var/page={page_num}"]:
    for i in range(10):
        formatted_url1 = url1.format(page_num=i)
        list1.append(formatted_url1)

    for i in range(10):
        formatted_url2 = url2.format(page_num=i)
        list1.append(formatted_url2)

elif url2 in ["https://www.douban.com/view/page={page_num}"]:
    for i in range(10):
        formatted_url1 = url1.format(page_num=i)
        list1.append(formatted_url1)

    for i in range(10):
        formatted_url2 = url2.format(page_num=i)
        list1.append(formatted_url2)

print(list1)
```


The code above is inefficient for three reasons. First, we never actually check the urls properly: each membership test compares a url against a one-element list containing that same url, so we know it will always be satisfied, and there is no point in doing that. Second, the for loops are redundant: both branches run the same two loops, so the condition makes no difference to the output. Third, the code cannot be used in many scenarios, because it only handles urls that are pre-set and "ordered" for us. In a real scenario, a website won't organize everything neatly for a web crawler, so the check has to work no matter what urls are passed to it, as long as they contain the keywords we expect. Here is an example that is much better:

```python
url_var = "https://www.douban.com/var/page={page_num}"
url_view = "https://www.douban.com/view/page={page_num}"

url_list_var = []
for i in range(10):
    url_list_var.append(url_var.format(page_num=i))

url_list_view = []
for i in range(10):
    url_list_view.append(url_view.format(page_num=i))

url_list = url_list_var + url_list_view

tail = []
for url in url_list:
    # Keep only urls that contain one of the expected keywords.
    if "var" in url:
        tail.append(url)
    elif "view" in url:
        tail.append(url)
    else:
        print("No url specified.")