This lesson is in the early stages of development (Alpha version)

Elements of Web Scraping with BeautifulSoup

Overview

Teaching: 25 min
Exercises: 15 min
Questions
  • How can I obtain data in a programmatic way from the web without an API?

Objectives
  • Understand how to navigate the HTML element tree with Beautiful Soup and extract relevant information.

Sometimes, the data we are looking for is not available from an API, but it is available on web pages that we can view with our browser. As an example task, in this episode we are going to use the Beautiful Soup Python package for web scraping to find all the relevant information about future Software Carpentry Workshop events.

Exploring HTML code in the browser

Navigate to The Carpentries. The page we see has been rendered by the browser from the HTML, CSS (Cascading Style Sheets) and JavaScript code that is included in, or linked from, the page.

In many browsers (for example, Chrome, Chromium, and Firefox), we can look at the HTML source code of the page we are viewing with the CTRL+U shortcut (alternatively, you can right click on the page and choose “View Source” from the context menu).

Notice how the nested tags in the source correspond to the structure of the rendered page.

Another way to explore the HTML code is to use the Developer Tools. In most browsers (Chrome, Chromium and Firefox), you can use the CTRL+Shift+I key combination to open the Developer Tools (alternatively, find the right option in your browser menu).

Developer Tools in Safari

In Safari on macOS, the Developer Tools are hidden by default. To enable them, open the Preferences window, go to the Advanced tab, and enable the “Show Develop menu in menu bar” option.

With the Developer Tools open, press CTRL+Shift+C (or click on the mouse pointer icon in the top left of the Developer Tools panel) to hover with the mouse over elements in the rendered page and view their properties. If you click on one of these elements, the relevant part of the HTML code will be shown to you.

By using these techniques, we can understand how to locate the elements that we want when using Beautiful Soup later on.

Relevant HTML tags for this lesson

There are a number of tags that may be interesting in general, but for what follows we specifically need to notice the <h2> (heading), <table>, <tr> (table row), <td> (table data), <a> (anchor, i.e. link) and <img> (image) tags.

Scraping the page with Beautiful Soup

From the BeautifulSoup documentation:

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

First of all, let’s verify that we have BeautifulSoup installed:

python -c "import bs4"

If there is no output, then we are all set. If instead you see something along the lines of

Traceback (most recent call last):
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'bs4'

Then you have to install the package. One way of doing that is via pip, with

pip install beautifulsoup4

Once we are sure that BeautifulSoup is available, we can import the necessary libraries in Python and use requests to GET the Carpentries website content:

import requests
from bs4 import BeautifulSoup

response = requests.get("http://www.carpentries.org")
response
<Response [200]>

So, the request was successful. The HTML of the web page is in the text attribute of the response. We can pass that directly to the BeautifulSoup constructor, obtaining a soup object that we can then navigate:

soup = BeautifulSoup(markup=response.text,
                     features="html.parser")
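Before navigating the real page, it can help to see how a soup object behaves on a small, self-contained snippet. The HTML fragment below is made up for illustration, but the find method and the attribute lookup work exactly the same way on the Carpentries page:

```python
from bs4 import BeautifulSoup

# A small, made-up HTML fragment to experiment with
html = """
<div class="row">
  <h2>Upcoming Workshops</h2>
  <a href="https://example.org/workshop">Example workshop</a>
</div>
"""

demo = BeautifulSoup(markup=html, features="html.parser")

print(demo.find("h2").text)    # Upcoming Workshops
print(demo.find("a")["href"])  # https://example.org/workshop
```

Note how the text of an element is available through its text attribute, while tag attributes such as href are accessed with dictionary-style indexing.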

Looking at the HTML code, we see that just above our table there is the text “Upcoming Carpentries Workshops” inside a <h2> tag (code reindented for clarity):

...
<div class="row">
  <div class="medium-12 columns">
    <h2>Upcoming Carpentries Workshops</h2>
    
    Click on an individual event to learn more about that event, including contact information and registration instructions.

    <table class="table table-striped" style="width: 100%;">
      <tr> <td>
          <img src="https://carpentries.org/assets/img/logos/lc.svg" title="lc workshop" alt="lc logo" width="24" height="24" class="flags"/>
        </td>

        <td>
          <img src="https://carpentries.org/assets/img/flags/24/us.png" title="US" alt="us"  class="flags"/>
          
          <img src="https://carpentries.org/assets/img/flags/24/w3.png" title="Online" alt="globe image" class="flags"/>
          
          <a href="https://annajiat.github.io/2021-01-22-uab-NNLM-online">University of Alabama at Birmingham (online)</a>
          
          <br/>
          <b>Instructors:</b> Annajiat Alim Rasel, Cody Hennesy, Camilla Bressan, Mary Ann Warner
          
        </td>
        <td>
          Jan 22 - Apr 23, 2021
        </td>
      </tr>
...

We can then look for the table by finding the HTML element that contains that text, using the string keyword argument:

(soup.find(string="Upcoming Carpentries Workshops"))
'Upcoming Carpentries Workshops'

By using the find method on a BeautifulSoup object, we search all of its descendants and obtain other BeautifulSoup objects that we can search in the same way as the original one. But how do we get the parent element? We can use the find_parents() method, which returns a list of BeautifulSoup objects representing the ancestors of the given element, starting from the immediate parent of the element itself and ending with the element at the root of the tree (soup in this case). The second parent in the list is the one that also contains the table we are interested in:
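To make the ordering of find_parents() concrete, here is a minimal sketch on a made-up fragment (not the Carpentries page), showing that the ancestors are listed from the innermost element outwards, ending at the soup itself:

```python
from bs4 import BeautifulSoup

# A tiny, made-up fragment: a text node nested inside <b>, <p> and <div>
snippet = BeautifulSoup("<div><p><b>Title</b></p></div>",
                        features="html.parser")

text_node = snippet.find(string="Title")
ancestors = text_node.find_parents()

# Ancestors are listed innermost first; the soup's own name is '[document]'
print([a.name for a in ancestors])  # ['b', 'p', 'div', '[document]']
```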

(soup
 .find(string = "Upcoming Carpentries Workshops")
 .find_parents()[1])
<div class="medium-12 columns">
<h2>Upcoming Carpentries Workshops</h2>
          
	  Click on an individual event to learn more about that event, including contact information and registration instructions.

<table class="table table-striped" style="width: 100%;">
<tr>
<td>
<img alt="lc logo" class="flags" height="24" src="https://carpentries.org/assets/img/logos/lc.svg" title="lc workshop" width="24">
</img></td>
<td>
<img alt="us" class="flags" src="https://carpentries.org/assets/img/flags/24

It seems we are on the right track. Now let’s focus on the table element:

(soup
 .find(string = "Upcoming Carpentries Workshops")
 .find_parents()[1]
 .find("table"))
<table class="table table-striped" style="width: 100%;">
<tr>
<td>
<img alt="lc logo" class="flags" height="24" src="https://carpentries.org/assets/img/logos/lc.svg" title="lc workshop" width="24">
</img></td>
<td>
<img alt="us" class="flags" src="https://carpentries.org/assets/img/flags/24/us.png" title="US">
<img alt="globe image" class="flags" src="https://carpentries.org/assets/img/flags/24/w3.png" title="Online">
<a href="https://annajiat.github.io/2021-01-22-uab-NNLM-online">University of

Now we can get a list of row elements with

rows = (soup
 .find(string = "Upcoming Carpentries Workshops")
 .find_parents()[1]
 .find("table")
 .find_all("tr"))

Let’s focus now on the first element:

rows[0]
<tr>
<td>
<img alt="lc logo" class="flags" height="24" src="https://carpentries.org/assets/img/logos/lc.svg" title="lc workshop" width="24">
</img></td>
<td>
<img alt="us" class="flags" src="https://carpentries.org/assets/img/flags/24/us.png" title="US">
<img alt="globe image" class="flags" src="https://carpentries.org/assets/img/flags/24/w3.png" title="Online">
<a href="https://annajiat.github.io/2021-01-22-uab-NNLM-online">University of Alabama at Birmingham (online)</a>
<br>
<b>Instructors:</b> Annajiat Alim Rasel, Cody Hennesy, Camilla Bressan, Mary Ann Warner
      
      
	</br></img></img></td>
<td>
		Jan 22 - Apr 23, 2021
	</td>
</tr>

We can now split the row into three table data elements:

td0, td1, td2 = rows[0].find_all("td")

If we want the link to the workshop page, we can look at the <a> tag in td1, and specifically at its href attribute:

link = td1.find("a")["href"]
link
'https://annajiat.github.io/2021-01-22-uab-NNLM-online'

We can get a list of instructor names from the text content of td1:

td1_text_split = td1.text.split("Instructors:")

# create a blank list to populate with the instructor names
instructors = []

for name in td1_text_split[1].split(","):
    instructors.append(name.strip())

print(instructors)
# names redacted 
['instructor 1', 'instructor 2', 'instructor 3', 'instructor 4'] 
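The same split-and-strip pattern works on any string of that shape, independently of where it came from. Here is the pattern on a made-up string standing in for td1.text:

```python
# Made-up text in the same shape as td1.text on the real page
td_text = "Somewhere University (online) Instructors: Ada Lovelace, Grace Hopper"

# Everything after the "Instructors:" label, split on commas and cleaned up
names = [name.strip() for name in td_text.split("Instructors:")[1].split(",")]
print(names)  # ['Ada Lovelace', 'Grace Hopper']
```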

A more direct way

Can we look directly for table elements in the soup? How would you do that? Would that work?

Solution

We can check how many table elements are in the soup with

len(soup.find_all("table"))

We gather that there is only one table in the soup, so it must be the right one! We can thus use soup.find("table") to reach the right element right away.

List the workshops

Create a list of all the workshops, reporting for each one:

  • link
  • location
  • date
  • names of instructors

Solution

rows = soup.find("table").find_all("tr")
def process_row(row):
    _,td1,td2 = row.find_all("td")
    link = td1.find("a")["href"]

    td1_location_people = td1.text.split("Instructors:")
    location = td1_location_people[0].strip()
    # the text may also list helpers after the instructors
    people = td1_location_people[1].split("Helpers:")
    instructors_string = people[0]
    # we ignore the helpers, as they are not always present
    # helpers_string = people[1]
    instructors = []
    for n in instructors_string.split(","):
        instructors.append(n.strip())
    date = td2.text.strip()

    return dict(
       link = link,
       location = location,
       instructors = instructors,
       date = date
    ) 

workshops = []
for row in rows:
    workshops.append(process_row(row))
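The solution above assumes that every row has three <td> cells and an "Instructors:" label; a header row or an unusually formatted entry would make process_row raise an exception. One defensive pattern, sketched here on plain strings rather than actual table rows, is to return None for anything that does not fit the expected shape and skip it:

```python
def parse_row_text(row_text):
    """Parse a made-up 'Location Instructors: a, b' string.

    Returns None when the text does not have the expected shape.
    """
    if "Instructors:" not in row_text:
        return None
    location, people = row_text.split("Instructors:", 1)
    instructors = [name.strip() for name in people.split(",")]
    return dict(location=location.strip(), instructors=instructors)

# One well-formed entry and one malformed entry
raw_rows = ["Somewhere University Instructors: Ada Lovelace, Grace Hopper",
            "a row without the expected label"]

workshops = []
for raw in raw_rows:
    parsed = parse_row_text(raw)
    if parsed is not None:  # skip rows that do not fit the expected shape
        workshops.append(parsed)

print(len(workshops))  # 1
```

The same idea applies to the real process_row: wrap the fragile parsing steps in a check (or a try/except) so that one odd row does not abort the whole scrape.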

Additional material

Beautiful Soup is a rich library that has a lot of powerful features that we are unable to discuss here.

A close look at the official documentation is worth the time for anyone seriously interested in web scraping.

Scraping Energy market data

Look at EPEX SPOT’s data on the energy market. How would you extract the price of the energy as a function of time? Can you look at other countries, and on different dates?

Solution

import requests
import pandas
from bs4 import BeautifulSoup

# From the url displayed in the browser in the address bar 
response = requests.get("https://www.epexspot.com/en/market-data",
                        params=dict(market_area="GB",
                                    trading_date="2021-03-19",
                                    delivery_date="2021-03-20",
                                    underlying_year="",
                                    modality="Auction",
                                    sub_modality="DayAhead",
                                    product="60",
                                    data_mode="table",
                                    period=""))

soup = BeautifulSoup(response.text,"html.parser")

The epex table is in two parts:

  1. the first part just shows baseload and peakload prices plus some whitespace (in total it is 5 rows long)
  2. the rest of the data is the hourly prices and has more columns.

HTML allows variable-width tables like this, but pandas doesn’t. Let’s strip off those first 5 rows, then wrap what remains in <table> and </table> tags to make it a valid table again:

rows = ("<table>"
        + "".join(str(row) for row in soup.find("table").find_all("tr")[5:])
        + "</table>")

Then we can convert the string to a pandas dataframe:

df = pandas.read_html(rows)[0]

The timestamps aren’t stored in the table but in a separate div, so let’s recreate them:

df['time'] = range(0,24)
df = df.set_index('time')
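The index manipulation can be tried on its own, without scraping anything. The sketch below uses a few made-up prices in place of the scraped EPEX table:

```python
import pandas

# Made-up hourly prices standing in for the scraped EPEX table
df = pandas.DataFrame({"price": [41.2, 39.8, 40.5]})

df["time"] = range(0, 3)   # recreate the missing timestamps
df = df.set_index("time")

# Look up the price for hour 1 by index label
print(df.loc[1, "price"])  # 39.8
```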

We now have the EPEX data inside a pandas dataframe, ready for processing, graphing, etc. We can try changing the market_area parameter (for example, to “DE-LU”) to look at another country. Are you able to change the parameters for another date? Do you get different results?

JavaScript code, the DOM and Selenium

The JavaScript code running on the page can actively change the structure of the HTML document. For some web pages, this is a crucial part of the rendering process: in some of those cases the JavaScript code must be run to download the data you are looking for from another URL, and populate the web page with that data and any additional element of the page design.

In those cases, using requests and BeautifulSoup might not be enough (as requests gets the HTML without running the JavaScript code on the page), but you can use the Selenium WebDriver to load the page in a fully-fledged browser and automate the interaction with it.

Key Points

  • A BeautifulSoup object can be navigated in many ways:

  • Use find to look for the first element that matches the given criteria in a subtree

  • Use find_all to obtain a list of elements that match the given criteria in a subtree

  • Use find_parents to get the list of ancestors of the given element