This tutorial will teach you how to scrape data with Selenium in a Python environment.
What is Selenium?
Selenium is an open-source suite of tools used for automating tasks in web browsers. It’s like a remote control for your web browser, allowing you to write scripts that can perform actions and interact with websites just like a human user would. This makes it a valuable tool for several purposes:
- Automated testing: Selenium can automate the process of testing web applications. You can write scripts that mimic how users interact with the site, checking for functionality and identifying any bugs.
- Data scraping: Selenium can be used to extract data from websites. This can be useful for things like collecting product information or monitoring prices.
- Web scraping: Similar to data scraping, Selenium can be used to grab entire webpages or specific sections of a webpage and store them for later use.
Selenium works with many popular programming languages, including Python, Java, and C#. It’s free to use and has a large community of developers who provide support and resources. In this guide we will focus on using Selenium with Python only.
Install Selenium
First of all, make sure Python 3 is installed and up to date on your device.
Then, open a terminal (or CMD on Windows) and run:
```shell
pip install selenium
```
If you’re using Linux, you may need to replace `pip` with `pip3`.
For this guide, I recommend using an IDE you’re comfortable with, especially one that offers auto-fill and auto-correction features. These features can significantly improve your coding efficiency.
First steps
Let’s open a Python editor. Assuming Selenium is already installed, we’ll use it to open a web page and then extract its title.
Where Can I download the webdrivers?
Good news! Since version 4.6, connecting to browsers no longer requires manual WebDriver setup. This can save you a significant amount of time, especially if, like me, you previously spent over an hour trying to figure out why your drivers stopped working 😂.
Access elements on page
Accessing elements embedded in a web page is essential for many tasks.
To inspect the HTML, right-click anywhere on the webpage and select “Inspect” or “Inspect Element”.
There are three main ways to identify elements on a webpage: ID (unique identifier), Class Name (may not be unique), and Name (usually unique). When targeting a specific element, it’s best to use an ID as it ensures you’re selecting the exact element you intend to modify. Using names or class names, which can be shared by multiple elements, can lead to unintended consequences.
For example, this is what an HTML inspection looks like:
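The original screenshot isn’t reproduced here, so the snippet below is a hypothetical example of what you might see when inspecting a search box: an input element carrying `id`, `class`, and `name` attributes.

```html
<form role="search" method="get" action="/search">
  <input type="text" id="search-input" class="Search_Box" name="s" placeholder="Search..." />
  <button type="submit">Search</button>
</form>
```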
Imagine you want to perform a search query on this blog. After the search is submitted, we want the window to wait 5 seconds before closing. Here’s how we can achieve this: we’ll introduce a new variable called `search` that will handle the search functionality.
Please pay attention to the import statements that I’ve added.
To access an element, we use the `find_element` method, which works like this:
something.find_element(attribute, what we're looking for)
Kinds of attributes
The `By` class provides a set of attributes used to locate elements on a page:
```python
ID = "id"
NAME = "name"
XPATH = "xpath"
LINK_TEXT = "link text"
PARTIAL_LINK_TEXT = "partial link text"
TAG_NAME = "tag name"
CLASS_NAME = "class name"
CSS_SELECTOR = "css selector"
```
(This snippet is taken from the Selenium with Python documentation, thank you!)
Convention
The convention inside the brackets goes like this:
(By.NAME, "what I'm looking for")
Another example: if I’m looking for a class named “Search_Box”:
(By.CLASS_NAME, "Search_Box")
Remember to import the `By` class from the `selenium.webdriver.common.by` module first.
Performing Actions 🪖
Let’s say you’re trying to collect information from a website, but some of it’s not readily visible. To get to this hidden data, you might need to actually use the website! This could involve typing something in the search bar and then clicking a different button to navigate to a different page where the information you want is actually displayed. In the following example, I made a script that types a search query and then clicks on my website’s logo to navigate to the homepage:
Imagine you’re giving instructions to a friend on building a robot. You want the robot to find a specific toy, but sometimes things go wrong - maybe the toy is hidden, or the robot’s sensors are a bit slow.
The `try-except` block is like a safety net for your robot friend.
- Plan for the Unexpected (Error Handling): Things can be unpredictable on websites, just like in the real world. The `try` block is where your robot tries to find the toy (the link). If it takes too long (like 10 seconds), an error pops up because the toy can’t be found.
- Don’t Let it Crash (Script Stability): The `except` block catches this error, preventing your robot from freezing. Instead, it tells the robot to say “Toy not found!” and move on.

By using `try-except`, your robot friend becomes more dependable, just like your Selenium scripts!
Scraping
Equipped with the power of actions, we’re ready to dive into the exciting world of data scraping! In this chapter, we’ll use Python lists to collect and store all of the article titles of my blog in a handy CSV file.
Final words
Congratulations! In this guide, you’ve taken your first steps into the world of web scraping with Selenium and Python. I hope you enjoyed the journey and learned something valuable today.
While there are other powerful scraping tools available in 2024, like Beautiful Soup (which I plan to cover in the future!), I focused on Selenium because it’s a foundational framework for web automation. Understanding Selenium can give you a solid base for exploring other scraping tools.