Web Scraping With Selenium

This tutorial is going to teach you how to scraping data with Selenium under Python environment.

What is Selenium?

Selenium is an open-source suite of tools used for automating tasks in web browsers. It’s like a remote control for your web browser, allowing you to write scripts that can perform actions and interact with websites just like a human user would. This makes it a valuable tool for several purposes:

Automated testing: Selenium can automate the process of testing web applications. You can write scripts that mimic how users interact with the site, checking for functionality and identifying any bugs.
Data scraping: Selenium can be used to extract data from websites. This can be useful for things like collecting product information or monitoring prices.
Web scraping: Similar to data scraping, Selenium can be used to grab entire webpages or specific sections of a webpage and store them for later use.

Selenium works with many popular programming languages, including Python, Java, and C#. It’s free to use and has a large community of developers who provide support and resources. In this guide we will focus on using Selenium with Python only.

Install Selenium

First of all make sure Python 3 is installed and updated on your device.
Then, open terminal or CMD and write:

1

pip install selenium

If you’re using Linux you may repleace pip with pip3.

For this guide, I recommend using an IDE you’re comfortable with, especially one that offers auto-fill and auto-correction features. These features can significantly improve your coding efficiency.

First steps

Let’s open python editor, assuming Selenium is already installed, we’ll use it to open a web page and then extract its title.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14


from selenium import webdriver

##It is no longer necessary to install 3rd webdriver.
driver = webdriver.Firefox()

## You can call to any browser that installed on your device.
driver.get("https://Dolev.Blog")

## Let's scrap the title:
title = driver.title
print(title)

## If everything alright- the following command will close the tab.
driver.close()

Where Can I download the webdrivers?

Good news! Since version 4.6, connecting to browsers no longer requires manual WebDrivers setup. This can save you a significant amount of time, especially if you previously spent over an hour on understand why they are no longer working (like me)😂.

Access elements on page

Accessing elements embedded in a web page is essential for many tasks.
To inspect the HTML, right-click anywhere on the webpage and select “Inspect” or “Inspect Element”.
There are three main ways to identify elements on a webpage: ID (unique identifier), Class Name (may not be unique), and Name (usually unique). When targeting a specific element, it’s best to use an ID as it ensures you’re selecting the exact element you intend to modify. Using names or class names, which can be shared by multiple elements, can lead to unintended consequences.
For example, this is how HTML inspection look like:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39


<html>
<head>
<style>
/* Style the element with the id "myHeader" */
#myHeader {
  background-color: lightblue;
  color: black;
  padding: 40px;
  text-align: center;
}

/* Style all elements with the class name "city" */
.city {
  background-color: tomato;
  color: white;
  padding: 10px;
} 
</style>
</head>
<body>

<h2>Difference Between Class and ID</h2>
<p>A class name can be used by multiple HTML elements, while an id name must only be used by one HTML element within the page:</p>

<!-- An element with a unique id -->
<h1 id="myHeader">My Cities</h1>

<!-- Multiple elements with same class -->
<h2 class="city">London</h2>
<p>London is the capital of England.</p>

<h2 class="city">Paris</h2>
<p>Paris is the capital of France.</p>

<h2 class="city">Tokyo</h2>
<p>Tokyo is the capital of Japan.</p>

</body>
</html>

Imagine you want to perform a search query on this blog. After the search is submitted, we want the window to wait 5 seconds before closing. Here’s how we can achieve this: We’ll introduce a new variable called search that will handle the search functionality. please pay attention to the import commands that I’ve added

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14


from selenium import webdriver
from selenium.webdriver.common.by import By
import time

##It is no longer necessary to install 3rd webdriver.
driver = webdriver.Firefox()
## You can call to any browser that installed on your device.
driver.get("https://dolev.blog/search/")
title = driver.title
search = driver.find_element(By.ID, "searchInput").send_keys("abc")

time.sleep(5)

driver.quit()

in order to access element we use find_element method that working like this: something.find_element(attributes, what we're looking for)

kind of attributes

The attributes available for the By class are used to locate elements on a page. These are the attributes available for By class:
1
2
3
4
5
6
7
8
ID = "id"
NAME = "name"
XPATH = "xpath"
LINK_TEXT = "link text"
PARTIAL_LINK_TEXT = "partial link text"
TAG_NAME = "tag name"
CLASS_NAME = "class name"
CSS_SELECTOR = "css selector"

(This cite has been taken from Selenium with Python, thank you!)

Convention

the convention inside the brackets going this way: (By.NAME, "What i'm looking for") another example- if I’m looking for class name that calls “Search_Box”:
(By.CLASS_NAME, "Search_Box")

Remember to import the By class from the selenium.webdriver.common.by module first.

Performing Actions 🪖

Let’s say you’re trying to collect information from a website, but some of it’s not readily visible. To get to this hidden data, you might need to actually use the website! This could involve typing something in the search bar and then clicking a different button to navigate to a different page where the information you want is actually displayed. In the following example I made a script that typing a search query and then clicking on my web-site logo in order to navigate to the homepage:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22


from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.expected_conditions import presence_of_element_located
from selenium.webdriver.support.ui import WebDriverWait 
import time

##It is no longer necessary to install 3rd webdriver.
driver = webdriver.Edge()
## You can call to any browser that installed on your device.
driver.get("https://dolev.blog/search/")

search = driver.find_element(By.ID, "searchInput")
search.send_keys("ABC")

# Define the expected link text (more reliable than using a long string literal)
expected_link_text = "Dolev's Data-Driven Den"
try:
    wait = WebDriverWait(driver, 10)  # Wait for up to 10 seconds - things should be loading. 
    link = wait.until(presence_of_element_located((By.LINK_TEXT, expected_link_text))) #Wait and then search where's the link with this text.
    link.click() 
except:
    print("Link not found or clickable within 10 seconds.")

Imagine you’re giving instructions to a friend on building a robot. You want the robot to find a specific toy, but sometimes things go wrong - maybe the toy is hidden, or the robot’s sensors are a bit slow.
The try-except block is like a safety net for your robot friend.

Plan for the Unexpected (Error Handling): Things can be unpredictable on websites, just like in the real world. The try block is where your robot tries to find the toy (the link). If it takes too long (like 10 seconds), an error pops up because the toy can’t be found.
Don’t Let it Crash (Script Stability): The except block catches this error, preventing your robot from freezing. Instead, it tells the robot to say “Toy not found!” and move on.

By using try-except, your robot friend becomes more dependable, just like your Selenium scripts!

Scrapping

Equipped with the power of actions, we’re ready to dive into the exciting world of data scraping! In this chapter, we’ll use Python lists to collect and store all of the article titles of my blog in a handy CSV file.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39


from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.expected_conditions import presence_of_element_located
from selenium.webdriver.support.ui import WebDriverWait 
import csv

# Replace with your website URL and element locators
driver = webdriver.Edge()
driver.get("http://dolev.blog/")

# Find elements containing data
wait = WebDriverWait(driver, 10)  # Wait for up to 10 seconds
link = wait.until(presence_of_element_located((By.CLASS_NAME, "entry-header")))

elements = driver.find_elements(By.CLASS_NAME, "entry-header")


# Print elements to check if they are being located correctly
print("Number of elements found:", len(elements))
for element in elements:
    print(element.text)

# Extract data and store in a list of lists
data = []
for element in elements:
    row = [element.text]  # Example data extraction
    data.append(row)

driver.quit()  # Close the browser
print(data)
# Export data to CSV

with open('scraped_data.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(data)  # Adjust headers as needed
    for row in data:
        writer.writerow(row)

print("Data exported to scraped_data.csv")

Final words

Congratulations! In this guide, you’ve taken your first steps into the world of web scraping with Selenium and Python. I hope you enjoyed the journey and learned something valuable today.

While there are other powerful scraping tools available in 2024, like Beautiful Soup (which I plan to cover in the future!), I focused on Selenium because it’s a foundational framework for web automation. Understanding Selenium can give you a solid base for exploring other scraping tools.

What is Selenium?#

Install Selenium#

First steps#

Where Can I download the webdrivers?#

Access elements on page#

kind of attributes#

Convention#

Performing Actions 🪖#

Scrapping#

Final words#