In this tutorial, we will learn how to scrape websites using Python and the Selenium module. The script and technique shown here will help you scrape nearly any website.
What is Web Scraping?
Web scraping a web page involves fetching it and extracting data from it. Fetching is the downloading of a page (which a browser does when you view the page), so web crawling is the main component of web scraping: it fetches pages for later processing. Once a page is fetched, extraction can take place. The content of the page may be parsed, searched, reformatted, or copied into a spreadsheet. Web scrapers typically take something out of a page to make use of it for another purpose somewhere else. An example would be finding and copying names and phone numbers, or companies and their URLs, to a list (contact scraping).
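To make the fetch-then-extract idea concrete, here is a minimal sketch that uses only Python's standard library (no Selenium yet); the URL is just a placeholder. It fetches a page and then extracts every link it finds.

from urllib.request import urlopen
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    # Collect the href attribute of every 'a' tag on the page.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

# Fetch: download the raw HTML of the page
html = urlopen("https://example.com").read().decode("utf-8", errors="replace")

# Extract: parse the HTML and collect the links
parser = LinkExtractor()
parser.feed(html)
print(parser.links)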
Prerequisites
- Please ensure that you have selenium installed. If not, run “pip install selenium” to install the latest version.
- Firefox Browser
- You should also place geckodriver.exe in the same folder as your Python script. Selenium needs this driver to control Firefox.
- Python 3.x.x
Go to the official repository to download geckodriver if you don’t have it yet. Follow this link https://github.com/mozilla/geckodriver/releases
The folder structure is simple: geckodriver.exe should sit in the same folder as your Python script.
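If you would rather keep geckodriver somewhere other than the script folder, you can point Selenium at it explicitly. The sketch below assumes Selenium 3.x, which accepts an executable_path argument; the path itself is just an example:

from selenium import webdriver

# Selenium 3.x: point directly at the driver executable (example path)
browser = webdriver.Firefox(executable_path="./geckodriver.exe")

# Selenium 4.x equivalent (uncomment if you are on 4.x):
# from selenium.webdriver.firefox.service import Service
# browser = webdriver.Firefox(service=Service(executable_path="./geckodriver.exe"))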
Checking the Configuration
Download the full configuration from my github account.
Copy and run the following code:
from selenium import webdriver

browser = webdriver.Firefox()
url = "https://unsplash.com/search/photos/mountains/"
browser.get(url)
If you face any error, please comment below. I will be happy to help. 😁
If everything went well, a Firefox window will open and load the given URL.
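Once you have confirmed that the page loads, you can close the browser from the script as well. This is just a tidy-up step and not part of the original snippet:

# Close the Firefox window and end the WebDriver session
browser.quit()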
Planning Our Script
Before we start, I would like you to go to the website and inspect the source code. You will notice something interesting: all the download links have the attribute title = “Download photo”. We will use this to separate the download links from the other links on the page. This will be our flow for developing the script:
- Search for all ‘a’ tags.
- Filter the tags having title = “Download photo”.
- Save the links in a text file.
- Voila!! We are done.
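As a side note, steps 1 and 2 of the flow above can be combined into a single CSS selector query. The sketch below assumes Selenium 3.x method names (on Selenium 4.x you would use find_elements(By.CSS_SELECTOR, ...)):

# Select only the 'a' tags whose title attribute is "Download photo"
download_links = browser.find_elements_by_css_selector('a[title="Download photo"]')
for elem in download_links:
    print(elem.get_attribute('href'))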
Writing Our Script
Download the full script from my github account.
Code
from selenium import webdriver

def view_webpage(link_file):
    # Collect every 'a' tag on the current page and write the ones
    # marked "Download photo" to the given file.
    try:
        elem1 = browser.find_elements_by_tag_name('a')
    except:
        print('some error occurred')
    try:
        for elem in elem1:
            if elem.get_attribute('title') == 'Download photo':
                print(elem.get_attribute('href'), file=link_file)
    except:
        print("No data in Element")

browser = webdriver.Firefox()
search_term = "mountains/"
url = "https://unsplash.com/search/photos/" + search_term
browser.get(url)

complete = False
# we will open the file in append mode
link_file = open("links.txt", mode="a+")
while not complete:
    view_webpage(link_file)
    complete = True
# Closing the file to save in drive
link_file.close()
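A quick compatibility note: the script above uses the find_elements_by_* helpers from Selenium 3.x. Newer Selenium 4.x releases have removed those helpers, so if you are on 4.x the lookup would look like the sketch below (my own addition, not part of the original script):

from selenium.webdriver.common.by import By

# Selenium 4.x equivalent of browser.find_elements_by_tag_name('a')
elem1 = browser.find_elements(By.TAG_NAME, 'a')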
Output
Voila!! It worked. Here are the links you will get in links.txt.
Stay tuned for my upcoming blog post at pyblog.in to get an improved version of the script. The new script will let you download as many photos as you want and will support multi-threading.
If you get stuck anywhere, feel free to comment down below. I will be happy to help. 😁
This blog post is for educational purposes only.