This lesson is in the early stages of development (Alpha version)

Introduction to the Web and Online APIs

HTTP

Overview

Teaching: 15 min
Exercises: 5 min
Questions
  • What are protocols and ports?

  • What are HTTP and HTTPS?

  • What are requests and responses? How can we look at them?

Objectives
  • Understand the meaning of the terms protocol and port.

  • Understand what HTTP and HTTPS are, and how they relate to the Web and other aspects of the modern Internet.

  • Be able to use curl to make requests and view responses.

Since it was first introduced to the world in 1991, the World Wide Web has gone from the toy of computer scientists and particle physicists to a dominant part of everyday life for billions of people. At its core, the initial World Wide Web concept brought together three key ideas:

  1. The use of HTML (Hypertext Markup Language) documents which could contain hyperlinks to other documents (or different parts of the same document). These could reference documents located on any web server in the world.
  2. That every file on the world wide web would have a unique URL (Uniform Resource Locator).
  3. The Hypertext Transfer Protocol (HTTP) that is used to transfer data from the web server to the requesting client.

The Web has gradually absorbed many services that were previously provided by separate online systems, or that were not available on the Internet at all.

Since the mid-2000s, the Web has increasingly been used to go beyond this traditional model of serving HTML to browsers. The same HTTP protocol which once served static HTML pages and images is now used to send dynamic content generated on the fly for consumption by other computer programs.

These Application Programming Interfaces (APIs) provide incredible amounts of structured data, as well as the ability to control things that may previously have required specialist proprietary software or even hardware. In particular, the data available via web APIs is particularly useful for data scientists; many data are now only made available via these APIs, and even in cases where data are made available in other formats, using an API is frequently more convenient.

To make effective use of web APIs, we need to understand a little more about how the Web works than a typical Web user might. This lesson will focus on clients—computers and software applications that make requests to other computers or applications, and receive information in response. Computers and applications that respond to such requests are referred to as servers.

Protocols and ports

You may (or may not) have wondered how it is that different web browsers, written independently by different companies and running on different operating systems, are able to talk to the same web servers using the same addresses, and get the same web pages back. This is because all web browsers implement the HyperText Transfer Protocol, or HTTP.

A protocol is nothing more than a system of rules that allows communication between computers (or other devices). Much like a (human) language, it defines rules and syntax that, when followed by all parties, allow information to be transmitted from one device to another. Other examples of protocols you may be familiar with include the Secure Shell (SSH), the File Transfer Protocol (FTP), and the Simple Mail Transfer Protocol (SMTP). Wikipedia has a long list of protocols that are (or once were) in common usage. HTTPS is a protocol closely related to HTTP; it follows many of the same conventions as HTTP, particularly in the way client and server code is written, but includes additional encryption to ensure that untrusted third parties can’t read or modify data in transit.

Given the large number of protocols in existence, computers need a way to identify which protocol a particular network connection is using, in particular on devices that have many different servers running. This is done by another set of protocols, which the above protocols build on top of: the Transmission Control Protocol (TCP) and the User Datagram Protocol (UDP). The difference between these isn’t important today; the important fact is that both protocols define port numbers (or ports) that are used to identify which server should handle a particular connection.

A server application must register a particular port to listen for connections on, and then all connections with that port number will be directed to that application. Ports are numbered 1–65,535, with ports up to 1,023 being “system ports” that on Unix-like systems require root access to listen to. Many protocols have standard ports that are used by convention—for example, HTTP uses port 80 by default, and HTTPS port 443. However, there is nothing stopping any protocol being used on any port.

You may have noticed that web addresses sometimes include a colon and a number after the server name; this indicates to the browser which port to connect on, in cases where you don’t want to connect to the default port (80 or 443). For example, Jupyter notebooks are frequently served at http://localhost:8888; this indicates that your browser should make an HTTP connection to your own local machine, on port 8888. Since only one application can listen to a port at a time, sometimes Jupyter finds it can’t listen on port 8888, and so will reserve port 8889 or 8890 instead.

URLs

A URL (also sometimes known as a URI, or Uniform Resource Identifier) consists of two or three parts: the protocol followed by ://, the server name or IP address (optionally followed by a colon and a port number), and optionally the path to the resource we wish to access. For example, the URL http://carpentries.org means we want to access the default location on the server carpentries.org using the HTTP protocol. The URL https://carpentries.org/contact/ means we want to access the contact location on the carpentries.org server using the secure HTTPS protocol.

Requests and responses

The two main objects in HTTP are the request and the response. Each HTTP connection is initiated by sending a request, and is replied to with a response. Both the request and response have a header, that defines metadata about what is requested and what is included in the response, and both can also have a body, containing data. To look at these in more detail, we can use the curl command. Specifically, to see the request headers, we can use curl -v followed by the URL we wish to request.

$ curl -v http://carpentries.org
*   Trying 13.32.168.28...
* TCP_NODELAY set
* Connected to carpentries.org (13.32.168.28) port 80 (#0)
> GET / HTTP/1.1
> Host: carpentries.org
> User-Agent: curl/7.64.1
> Accept: */*
>
< HTTP/1.1 301 Moved Permanently
< Server: CloudFront
< Date: Sat, 13 Mar 2021 01:10:22 GMT
< Content-Type: text/html
< Content-Length: 183
< Connection: keep-alive
< Location: https://carpentries.org/
< X-Cache: Redirect from cloudfront
< Via: 1.1 f25763791d7f1173b560742bb9507145.cloudfront.net (CloudFront)
< X-Amz-Cf-Pop: LHR62-C5
< X-Amz-Cf-Id: JJLCGx6qUOpaid_ArD0kph8QddidHgWnKoi72yNn0Jazmla8H5mUGg==
<
<html>
<head><title>301 Moved Permanently</title></head>
<body bgcolor="white">
<center><h1>301 Moved Permanently</h1></center>
<hr><center>CloudFront</center>
</body>
</html>
* Connection #0 to host carpentries.org left intact
* Closing connection 0

Lines starting > here are request headers, and lines starting < are response headers. Following this is the body (the section from <html> to </html>), which in this case is a short web page.

In this case, after identifying what type of request this is (a GET request), the location to look for (/), and the HTTP version, we include three headers: the first states the domain name we are looking to contact (in case one server is serving multiple domain names, as is quite common), the second identifies what software we’re using to connect (as some servers will adjust the content depending on, for example, which browser you connect with), and the third tells the server what we’re looking for—in this case we will accept whatever the server has to offer.

The server then responds with a status code, followed by a lot of metadata. In this case, the status code 301 indicates that the site is no longer at the location we tried, so the metadata includes where to look instead. This is followed by a short web page explaining the same thing. Most browsers will see the 301 and automatically redirect to the correct location so you never see this error message.
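Incidentally, curl can be asked to follow such redirects itself by adding the -L (or --location) flag, for example:

$ curl -L http://carpentries.org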

Let’s see what happens when we follow the redirect. Web pages can be quite long, so for now let’s ignore the body and look only at the headers.

$ curl -v https://carpentries.org > /dev/null

In this case, because we’re connecting via HTTPS, curl gives a lot more debugging information about the secure connection. After this we see similar request headers (although this time we’re using HTTP/2), and then the response headers start with HTTP/2 200. The status code 200 indicates that this was a successful request, and the body provides what we asked for.

HTTP status codes are three digits long, and almost always begin with 2, 3, 4, or 5. Status codes beginning 2xx indicate that the request was successfully received, understood, and accepted; 3xx indicates a redirect of some kind; 4xx indicates an error caused by the client (for example the famous 404 Not found where the client has requested a resource that does not exist on the server), and 5xx indicates an error on the server side.

It’s rarely necessary to inspect the request, so if you’re only interested in the headers, it’s more convenient to use curl -I, which shows just the response headers.

$ curl -I https://carpentries.org
HTTP/2 200
content-type: text/html
content-length: 55036
date: Sat, 13 Mar 2021 01:32:50 GMT
last-modified: Sat, 13 Mar 2021 01:26:59 GMT
etag: "f16c8eaddc88e035134aa23e0f8a94ba"
server: AmazonS3
x-cache: Hit from cloudfront
via: 1.1 a25f829e86f504a329e71fa3f4d21485.cloudfront.net (CloudFront)
x-amz-cf-pop: LHR62-C5
x-amz-cf-id: WGyZEdVLxTFbdQ3eKX2rdnPWO0214DDcQi8TA5UpObYt2CgHjCUz7g==
age: 87

Noteworthy here is the first header content-type: text/html; this indicates that the response body is an HTML document (also known as a web page). HTML, the HyperText Markup Language, is the language that all web pages are written in; while we won’t write any today, we will look a little more at how to read it (and get your code to read it) in a later episode.
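For contrast, we can ask for a made-up page that (most likely) doesn’t exist and check the status line of the response:

$ curl -I https://carpentries.org/this-page-does-not-exist/

The first line of the output should report a 404 status rather than 200.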

HyperText?

Both HTTP and HTML refer to HyperText. This was a popular buzzword in the 1990s, and refers to the Web’s ability to include not only text, but also cross-references in the form of links (hypertext links, or hyperlinks) to other documents stored elsewhere, which the user can immediately access.

While this seems entirely obvious and second-nature today, it was revolutionary when it was first introduced, hence the name appearing prominently in technologies that supported it.

Another website

Pick a web page you’ve visited recently and take a look at its response headers with curl -I. How do they differ from the https://carpentries.org/ headers we looked at above? What parts are similar?

Key Points

  • A protocol is a standard for communicating data across a network. A port is a number to identify which program should process a network connection.

  • HTTP is the protocol originally designed for requesting and receiving Web pages, but now also used as the basis for a variety of APIs. HTTPS is the encrypted version of HTTP.

  • Every page on the world wide web is identified with a URL or Uniform Resource Locator.

  • A request is how you tell a server what you want to see. A response will either give you what you asked for, or tell you why the server can’t do that. Both requests and responses have a header, and optionally a body.

  • We can make requests and receive responses, as well as see their headers, using curl.


What do APIs look like?

Overview

Teaching: 30 min
Exercises: 10 min
Questions
  • How can requests be made of web APIs?

  • How can responses from web APIs arrive?

  • How can requests to web APIs be authenticated?

Objectives
  • Be able to make requests to web APIs using curl using endpoints, query parameters, and JSON data.

  • Be able to identify responses in plain text and JSON.

  • Be able to authenticate to web APIs with passwords and authentication tokens.

We’ve done a lot of talking about the technologies that will let us interact with APIs so far. Let’s now start putting this into practice and query an API.

$ curl http://numbersapi.com/42
42 is the number of laws of cricket.

Numbers API provides facts about numbers. By putting the number of interest into the address, we tell Numbers API which number to give a fact about. By adding other keywords to the address, we can refine the domain that we’re asking for information in; for example, for specifically mathematical trivia, we can add /math.

$ curl http://numbersapi.com/42/math
42 is a perfect score on the USA Math Olympiad (USAMO) and International Mathematical Olympiad (IMO).

Numbers API is not an especially sophisticated API. In particular, it only offers a single endpoint (specifically, /), and each response to a query is a single string, provided as plain text.

We can think of an API as being similar to a package or library in a programming language, but one that is usable from almost any programming language. In these terms, an endpoint is equivalent to a function; Numbers API provides a single function, /, which gives information about numbers. The response is the return value of the function, and in this case is a single string. This maps well onto HTTP, as the response body of a request is a string of either characters or of bytes. (Byte strings don’t translate well between languages, so are usually avoided, except for specific portable formats such as images.)
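To make the analogy concrete, you could imagine the Numbers API’s single endpoint as if it were a Python function (this is purely an illustration of the idea, not code provided by the API):

def numbers_api(number, category="trivia"):
    """Hypothetical stand-in for the Numbers API's single endpoint:
    takes a number (and an optional category) and returns one string of trivia."""
    ...

The address we request plays the role of the arguments, and the response body plays the role of the return value.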

However, many useful functions need to return something other than character strings. For example, you might want to return a list, or an array, or a set of related data. Let’s look at another example of a web API and see how this can be handled. Newton is a web API for advanced mathematics. One thing it can do is factorization:

$ curl https://newton.vercel.app/api/v2/factor/x^2-1
{"operation":"factor","expression":"x^2-1","result":"(x - 1) (x + 1)"}

Two things have changed. Firstly, now instead of /, we are specifying that we want to use the factor endpoint provided by the v2 version of the API. This is a very common way of structuring APIs: firstly a version, and then one or more levels of endpoints to specify what function you would like the API to perform.

Secondly, rather than a plain text response, we get a data structure. This is still encoded as plain text (because HTTP can’t natively transmit much else), but we can’t use the text directly; instead, we need to parse it first. The syntax used here is the most common format for modern web APIs, and is called JSON (pronounced like the name “Jason”; short for JavaScript Object Notation). (You may also encounter older or more old-fashioned APIs that instead use XML, the eXtensible Markup Language.) We can see that this response includes three names, or keys ("operation", "expression", and "result"), and three associated values ("factor", "x^2-1", and "(x - 1) (x + 1)", respectively).

factor is not the only thing that Newton can do. Let’s try a different endpoint, for integration.

$ curl https://newton.vercel.app/api/v2/integrate/x^2-1
{"operation":"integrate","expression":"x^2-1","result":"1/3 x^3 - x"}

In this case Newton correctly tells us that the "result" of this integration is "1/3 x^3 - x".

The endpoints an API offers, and what format it will give its responses in, will generally be listed in the API’s documentation. Newton’s documentation for example can be found on GitHub.

More math

Read through Newton’s documentation. Try one or more of the other endpoints that we haven’t tried. Check that the results match what you would expect.

Try using a different input function than x^2-1. Again, check that the answers give what you expect.

Errors (or not)

Try using the simplify endpoint for Newton to simplify the expression 0^(-1) (i.e. 1 divided by 0).

Use curl -i to see both the headers and the response. Do these match what you expect?

Solution

The response code for this request is 200 (OK), but the "result" indicates that an error occurred.

This is not uncommon; not all APIs use the HTTP status code to indicate an error condition. Some will even give you an HTML web page describing an error when you would normally expect a non-HTML response. It’s good to check this behaviour for each API that you use, so that you can guard against it in your software.

Authentication and identification

Many web APIs restrict access to registered users or applications. This may be because they are used to control things that are specific to a particular user account, because different people have different privilege levels and so different endpoints available, or simply because the API provider wants to collect statistics on how the API is being used.

Various ways exist for developers to authenticate to an API, including:

  • HTTP basic authentication, where a username and password are sent with the request;

  • API keys or access tokens, issued by the service to identify a particular user or application;

  • more elaborate schemes such as OAuth, where a token is obtained from a separate authorisation step.

For everything other than HTTP authentication, there are also a variety of ways to present the credential to the server, such as:

  • as a query parameter in the URL;

  • in an HTTP header of the request;

  • in the body of the request.

One important fact about HTTP is that it is stateless: each request is treated entirely separately, with no memory from one request to the next. This means that you must present your authentication credentials with every request you make to the API. (This is in contrast to other protocols like SSH or FTP, where you authenticate once at the start of a session, and then subsequent messages can be sent back and forth without the need for re-authentication.)

For example, NASA offers an API that exposes much of the data that they make public. They require an API key to identify you, but don’t require any authentication beyond this.

Let’s try working with the NASA API now. To do this, first we need to generate our API key by providing our details at the API home page. Once that is done, NASA provide the API key instantly, and send a copy to the email address you provide. They helpfully also provide an example of an API query to try, querying the Astronomy Picture of the Day (APOD). This shows us that NASA expects the API key to be encoded as a query parameter.

$ curl -i https://api.nasa.gov/planetary/apod?api_key=ejgThfasPCRf4kTd39ar55Aqhxv8cwKBdVOyZ9Rr
HTTP/1.1 200 OK
Date: Mon, 15 Mar 2021 00:08:34 GMT
Content-Type: application/json
Content-Length: 1135
Connection: keep-alive
Vary: Accept-Encoding
X-RateLimit-Limit: 2000
X-RateLimit-Remaining: 1998
Access-Control-Allow-Origin: *
Age: 0
Via: http/1.1 api-umbrella (ApacheTrafficServer [cMsSf ])
X-Cache: MISS
Strict-Transport-Security: max-age=31536000; preload

{"copyright":"Mia St\u00e5lnacke","date":"2021-03-14","explanation":"It appeared, momentarily, like a 50-km tall banded flag.  In mid-March of 2015, an energetic Coronal Mass Ejection directed toward a clear magnetic channel to Earth led to one of the more intense geomagnetic storms of recent years. A visual result was wide spread auroras being seen over many countries near Earth's magnetic poles.  Captured over Kiruna, Sweden, the image features an unusually straight auroral curtain with the green color emitted low in the Earth's atmosphere, and red many kilometers higher up. It is unclear where the rare purple aurora originates, but it might involve an unusual blue aurora at an even lower altitude than the green, seen superposed with a much higher red.  Now past Solar Minimum, colorful nights of auroras over Earth are likely to increase.   Follow APOD: Through the Free NASA App","hdurl":"https://apod.nasa.gov/apod/image/2103/AuroraFlag_Stalnacke_6677.jpg","media_type":"image","service_version":"v1","title":"A Flag Shaped Aurora over Sweden","url":"https://apod.nasa.gov/apod/image/2103/AuroraFlag_Stalnacke_960.jpg"}

We can see that this API gives us JSON output including links to two versions of the picture of the day, followed by metadata about the picture including its title, description, and copyright. The headers also give us some information about our API usage: our rate limit is 2000 requests per day, and we have 1998 of these remaining (probably because the malware scanner on my email server tested the link first to make sure it wasn’t malicious).

With all of these ways to provide identification and authentication information, we don’t have time to cover each possibility exhaustively. For the vast majority of APIs, there will be good developer documentation that shows, with examples, how to use the token or other identifier that they provide to connect to their service.

More complicated queries

Thus far we have queried APIs where any parameters are included as part of the effective “filename” on the server. For example, in http://numbersapi.com/42, the 42 is a parameter to the API, but at first glance it could equally well be an endpoint.

Many APIs make this distinction more clear, by accepting arguments in a query string. This is a sequence of name=value pairs, separated from each other by &s, and separated from the endpoint by a ?.

Using quotes with Curl

When we put an & into a web address for curl, we need to put the address inside quotes. If we don’t, then our shell will interpret the & as an instruction to run the preceding command in the background, instead of passing it as part of the parameter to curl. This will effectively truncate the address to everything up to the first &.

We have already seen one example of this—we used it to provide our API key to NASA’s APOD endpoint. The APOD endpoint also accepts other parameters, for example, to select the date or dates for which the picture is returned.

$ curl -i "https://api.nasa.gov/planetary/apod?date=2005-04-01&api_key=ejgThfasPCRf4kTd39ar55Aqhxv8cwKBdVOyZ9Rr"
HTTP/1.1 200 OK
Date: Mon, 15 Mar 2021 00:31:45 GMT
Content-Type: application/json
Content-Length: 965
Connection: keep-alive
X-RateLimit-Limit: 2000
X-RateLimit-Remaining: 1996
Access-Control-Allow-Origin: *
Age: 0
Via: http/1.1 api-umbrella (ApacheTrafficServer [cMsSf ])
X-Cache: MISS
Strict-Transport-Security: max-age=31536000; preload

{"copyright":"Ellen Roper","date":"2005-04-01","explanation":"Can you help discover water on Mars?  Finding water on different regions on Mars has implications for understanding its complex geologic history, the possible existence of past life and the sustenance of potential future astronauts.  Many space missions have taken photographs of the surface of the red planet, and some of them might show a subtle clue pointing to water on Mars that has been missed.  By close inspection of images, following curiosity, applying scientific principles, applying knowledge about features on the Martian surface, and applying principles of planetary geology, such clues might be brought to light.  In the meantime, happy April Fool's Day from the folks at APOD!","hdurl":"https://apod.nasa.gov/apod/image/0504/WaterOnMars2_gcc_big.jpg","media_type":"image","service_version":"v1","title":"Water On Mars","url":"https://apod.nasa.gov/apod/image/0504/WaterOnMars2_gcc.jpg"}

One benefit of being able to construct queries in this way is that the query is more self-descriptive—for unfamiliar APIs, keyword arguments are significantly easier to read than positional ones.

One other way to provide parameters, in particular when they are more complex data structures than can be easily represented in a small string, is to use JSON in the body of the request. Since constructing JSON by hand is tedious, we will defer such APIs to the next section.

NASA aerial imagery

Look through NASA’s API documentation. Use the Earth API to retrieve an aerial image of your current location.

Try first using curl without any flags. What message do you get from curl? Why might this be?

Now try inspecting the response headers using curl -I, and look at the Content-Type. Does this match your suspicion as to the reason for curl’s message?

Finally, follow curl’s advice to save the output to a file. Open the resulting file and see if it matches what you expected.

Key Points

  • Interact with web APIs by sending requests to an endpoint representing a function of interest. Parameters can be encoded into the request, or attached as e.g. JSON.

  • Responses are typically plain text or JSON, but could be anything.

  • Most APIs require some form of authentication. This can be by username and password, or via a token.

  • Which choices a given API makes for each of these will be described in the API’s documentation.


dicts

Overview

Teaching: 12 min
Exercises: 8 min
Questions
  • What is a Python dict?

  • How do I use a dict?

Objectives
  • Understand what a dict is.

  • Be able to create, modify, and use dicts in Python.

In the previous episode we saw that some APIs will return data formatted as JSON, including names (or keys) and values associated with them.

Since we would ultimately like to work with data from these APIs in Python, it would be nice if Python had a data structure that behaved similarly. In the Software Carpentry introduction to Python, we learned about lists, which are ordered collections of things, indexed by their position in the ordering. What we would like here is similarly a collection, but rather than having ordering and indexing by position, instead we would like elements to have an arbitrary index of our choice.

In fact, Python has such a collection built into it; it is called a dict (short for dictionary). Let’s construct one now, to hold data from the Mayo Clinic about caffeine levels in various beverages.

caffeine_mg_per_serving = {'coffee': 96, 'tea': 47, 'cola': 24, 'energy drink': 29}

We see here that the dict is created within curly braces {}, and contains keys and corresponding values separated by a :, with successive pairs being separated by a , like in a list.

Again, similarly to a list, we can access elements of the dict with square brackets []. For example, to get the number of mg of caffeine per serving of coffee, we could use the following:

print("Coffee has", caffeine_mg_per_serving['coffee'], "mg of caffeine per serving")
Coffee has 96 mg of caffeine per serving

We can also replace elements in the same way that we can for a list. For instance, you may have spotted that the value for 'cola' is incorrect. Let’s fix that now.

caffeine_mg_per_serving['cola'] = 22
print(caffeine_mg_per_serving)
{'coffee': 96, 'tea': 47, 'cola': 22, 'energy drink': 29}

One thing that we can’t do for lists is create new elements by indexing with []. But dicts let us do that, as well:

caffeine_mg_per_serving['green tea'] = 28
print(caffeine_mg_per_serving)
{'coffee': 96, 'tea': 47, 'cola': 22, 'energy drink': 29, 'green tea': 28}

Ordering

Python dicts historically were not ordered: you would not be guaranteed to get back results in the same order that you put them in. In more recent versions of Python (3.7 and later), dicts preserve the order in which entries were added, so 'green tea', having been added most recently, appears at the end.

Missing values

dicts will throw an error, though, if we try to access values for keys that we have not added previously.

print(caffeine_mg_per_serving['guarana'])
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-2-dff37d2ef7d1> in <module>
----> 1 caffeine_mg_per_serving['guarana']

KeyError: 'guarana'

To write more robust code, we might like to check whether a particular key is present before trying to access it. In a list, this is simple, as we can check whether a particular index is less than the length of the list. With a dict, we can use the in keyword to check whether a particular key is in the dict:

'coffee' in caffeine_mg_per_serving
True

Alternatively, if we want to get an element of the dict and use a default value if the key isn’t found, we can use the .get() method:

print(caffeine_mg_per_serving.get("coffee", 0))
print(caffeine_mg_per_serving.get("hot chocolate", 0))
96
0

(If you don’t specify the default value, then Python uses None for keys that are not found.)
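For example, asking for a key that isn’t present, without supplying a default, gives back None:

print(caffeine_mg_per_serving.get("guarana"))
None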

Looping

Now, a particularly useful thing to do with a list is to loop over it. What happens when we loop over a dict?

for item in caffeine_mg_per_serving:
    print(item)
coffee
tea
cola
energy drink
green tea

Looping (or otherwise iterating) over a dict in fact loops over its keys. This matches with what the in keyword does—it would be strange for the two to look at different aspects of the dict. But sometimes we may want to use the values as well as the keys in a loop. We could index back into the dict via the key, but that is repetitive. We can instead use the .items() method of the dict:

for drink, quantity in caffeine_mg_per_serving.items():
    print(drink.capitalize(), "contains", quantity, "mg of caffeine per serving")
Coffee contains 96 mg of caffeine per serving
Tea contains 47 mg of caffeine per serving
Cola contains 22 mg of caffeine per serving
Energy drink contains 29 mg of caffeine per serving
Green tea contains 28 mg of caffeine per serving

What’s in a key?

In this episode, we have used strings as keys, as this is what we’re most likely to see when working with JSON. This is not a Python restriction, however. We can use any “hashable” type as a dict key; this includes strings, numbers, and tuples, among other immutable types. Most notably, this excludes lists and dicts (which are mutable).
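As a quick illustration (using a made-up dict, not one from this lesson), a tuple works as a key but a list does not:

coordinates = {}
coordinates[(51.62, -3.94)] = "Swansea"   # tuples are hashable, so this works
coordinates[[51.62, -3.94]] = "Swansea"   # lists are not: TypeError: unhashable type: 'list'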

dicts of functions

What will the following code do?

import numpy as np

operations = {
    'min': np.min,
    'max': np.max
}

def process(array, operation):
    return operations[operation](array)

print(process([1, 4, 7, 2, -3], 'min'))

When might this kind of behaviour be useful?

Try adjusting the example so that 'mean' and 'std' also work as you might expect.

Solution

This looks up the named function in the operations dictionary and applies it to the array, so the example prints -3. This could be useful when you want to allow the user to decide what functionality is desired at run-time, perhaps in a configuration file. Perhaps a choice of inversion algorithms or fitting functions could be offered.

To add other functions, the operations dict could be adjusted as:

operations = {
    'min': np.min,
    'max': np.max,
    'mean': np.mean,
    'std': np.std
}

Nested dicts

It is worth noting that the values in a dict can be of any type (this is not true for the keys). One notable case is that values can themselves be dicts:

nutrition_values = {'energy': {'units': 'kCal/100g',
                               'values': {'white bread': 273,
                                          'almonds': 512}},
                    'caffeine': {'units': 'mg per serving',
                                 'values': caffeine_mg_per_serving}}

It is then possible to access data using multiple square bracket expressions:

print("Caffeine content of coffee:", nutrition_values['caffeine']['values']['coffee'])
print("Units:", nutrition_values['caffeine']['units'])
Caffeine content of coffee: 96 
Units: mg per serving

Key Points

  • A dict is a collection of key-value pairs.

  • Create a dict with the syntax {key1: value1, key2: value2, ...}.

  • Get and set elements of a dict with square brackets: my_dict[key1] = new_value1.


Requests

Overview

Teaching: 40 min
Exercises: 20 min
Questions
  • How can I send HTTP requests to a web server from Python?

  • How to interact with web services that require authentication?

  • What are the data formats that are used in HTTP messages?

Objectives
  • Use the Python requests library for GET and POST requests

  • Understand how to deal with common authentication mechanisms.

  • Understand what else the requests library can do for you.

So far, we have been interacting with web APIs by using curl to send HTTP requests and then inspecting the responses at the command line. This is very useful for running quick checks that we are able to access the API, and debugging if we’re not. However, to integrate web APIs into our software and analyses, we’d like to be able to make requests of web APIs from within Python, and work with the results.

In principle we could make subprocess calls to curl, and capture and parse the results, but this would be very cumbersome. Fortunately, other people thought the same thing, and have made libraries available to help with this. Basic functionality around making and processing requests is built into the Python standard library, but far more popular is to use a package called requests, which is available from PyPI.

First off, let’s check that we have requests installed.

$ python -c "import requests"

If you do not see any message, then requests is already installed. If on the other hand you see a message like

Traceback (most recent call last):
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'requests'

then install requests from pip:

$ pip install requests

Recap: Requests, Responses and JSON

As a reminder, communication with web APIs is done through the HTTP protocol, and happens through messages, which are of two kinds: requests and responses.

A request is composed of a start line, a number of headers and an optional body.

Practically, a request needs to specify one of the HTTP verbs and a URL in the start line and an optional payload (the body).

A response is composed of a status line, a number of headers and an optional body.

The data to be transferred with the body of a request needs to be represented in some way. “Unstructured” text representations are used, e.g., to transmit CSV data. A popular text-based (ASCII) format to transmit data is the JavaScript Object Notation (JSON) format. The Python standard library includes a module to deal with JSON, for serialisation (i.e. representing Python objects as JSON strings):

import json
data = dict(a=1, b=dict(c=(2,3,4)))
representation = json.dumps(data)
representation
'{"a": 1, "b": {"c": [2, 3, 4]}}'

And for parsing (i.e. recovering python objects from their JSON string representation):

data_reparsed = json.loads(representation)
data_reparsed
{'a': 1, 'b': {'c': [2, 3, 4]}}

You can see that for dicts containing strings, integers, and lists, at least, the JSON representation looks very similar to the Python representation. The two are not always directly interchangeable, however.
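For example, JSON has null rather than None, no tuples (as we saw, they come back as lists), and only strings as keys, so a round trip through JSON can change an object slightly:

json.dumps({"value": None, 1: "one"})
'{"value": null, "1": "one"}'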

The Python requests library can parse JSON and serialise the objects, so that you don’t have to deal with this aspect on your own.

Another ASCII format that is used with APIs is the eXtensible Markup Language (XML), which is much more complex to deal with than JSON. Facilities to deal with the XML format are in the xml.etree.ElementTree library.

Another markup language widely used in HTTP message bodies is the HyperText Markup Language, HTML.

HTTP verbs

Up until now we have exclusively used GET requests, to retrieve information from a server. In fact, the HTTP protocol has a number of such verbs, each associated with an operation falling in one of four categories: Create, Read, Update, or Delete (sometimes called the CRUD categories). The most common verbs are:

  • GET, to read data from a resource;

  • POST, to create a new resource;

  • PUT and PATCH, to update (fully or partially) an existing resource;

  • DELETE, to remove a resource.

In this lesson we will focus on GET and POST requests only.

A GET request example

Let’s take the first example we looked at earlier, now with the Python requests library:

import requests
response = requests.get("http://www.carpentries.org")

requests gives us access to both the headers and the body of the response. Looking at the headers first, we can see what type of data is in the body. As this is the URL of a website, we expect the response to contain a web page:

response.headers["Content-Type"]
text/html

Our expectations are confirmed. We can also check the Content-Length header to see how much data we expect to find in the body:

response.headers["Content-Length"]
55036

And, as expected, the length of the body of the response is the same:

len(response.text)
55036

We can look at the content of the body:

response.text

This shows us the same HTML source code as we obtained from curl earlier.
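Before relying on the body, it is also worth checking that the request succeeded; the status code we saw with curl is available as an attribute of the response:

response.status_code
200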

GET with parameters

As we have seen when talking about curl, some endpoints accept parameters in GET requests. Using Python’s requests library, the call to NASA’s APOD endpoint that we previously made as

$ curl -i "https://api.nasa.gov/planetary/apod?date=2005-04-01&api_key=<your-api-key>"

can be expressed in a more human-friendly format:

response = requests.get(url="https://api.nasa.gov/planetary/apod",
                        params={"date":"2005-04-01",
                                "api_key":"<your-api-key>"})

using a dictionary to contain all the arguments.
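If you want to check exactly what address was requested once the parameters were encoded, the response records it:

print(response.url)   # prints the full URL, including the query string built from params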

Get a list of GitHub repositories

The CDT-AIMLAC GitHub organisation (cdt-aimlac) has a number of repositories. Using the official API documentation of GitHub, can you list their names, ordered by last updated time in ascending order? (Look at the examples in the documentation!)

Solution

The url to use is https://api.github.com/orgs/cdt-aimlac/repos. In addition to that, we need to use the parameters sort with value updated and direction with value asc.

response = requests.get(url="https://api.github.com/orgs/cdt-aimlac/repos",
                        params={'sort':'updated',
                                'direction':'asc'})
response
<Response [200]> 

Once we verify that there are no errors, we can extract the data, which is available via the json() method:

for repo in response.json():
    print(repo["name"], ':', repo["updated_at"])
testing_exercise : 2020-04-28T13:56:42Z
docker-introduction-2021 : 2021-01-26T19:20:19Z
grid : 2021-03-10T11:59:09Z
training-cloud-vm : 2021-03-23T13:43:03Z
pl_curves : 2021-03-24T14:28:25Z
ccintro-2021 : 2021-09-21T13:57:35Z
git-novice : 2021-11-24T10:21:58Z
aber-pubs : 2021-11-24T14:19:27Z

Authentication and POST

As mentioned above, thus far we have only used GET requests. GET requests are intended to be used for retrieving data, without modifying any state—effectively, “look, but don’t touch”. To modify state, other HTTP verbs should be used instead. Most commonly used for this purpose in web APIs are POST requests.

As such, we’ll switch to using the GitHub API to look at how POST requests can be used.

This will require a GitHub Personal Access Token. If you don’t already have one, then the instructions in the Setup walk through how to obtain one.

Take care with access tokens!

This access token identifies your individual user account, rather than just the application you’re developing, so anyone with this token can impersonate you and manage your account. Be very sure not to commit this (or any other personal access token) to a public repository, or to any repository that might be made public in the future, as it will very rapidly be discovered and used against you.

A common mistake of this kind is committing tokens for a cloud service to a public repository. This has allowed unscrupulous individuals to take over cloud computing accounts and spend hundreds of thousands of pounds on activities such as mining cryptocurrency.

To make POST requests, we can use the function requests.post.

For this example, we are going to post a comment on an issue on GitHub. Issues on GitHub are a simple way to keep track of bugs, and a great way to manage focused discussions on the code.

In order to do so, we need to authenticate. We will now create an object of the HTTPBasicAuth class provided by requests, and pass it to requests.post.

First of all, let’s load the GitHub access token:

with open("github-access-token.txt", "r") as file:
  ghtoken = file.read().strip()

Let’s then create the HTTPBasicAuth object:

from requests.auth import HTTPBasicAuth
auth = HTTPBasicAuth("your-github-username", ghtoken)

We will now create the body of the comment, as a JSON string:

import json
body = json.dumps({"body": "Another test comment"})

Finally, we will post the comment on GitHub and make sure we get a success code:

response = requests.post(url="https://api.github.com/repos/mmesiti/web-novice-test-repo/issues/1/comments",
              data=body,
              auth=auth)
response
<Response [201]>

The code 201 is the typical success response for a POST request, signaling that the creation of a resource has been successful. We can go to the issue page and check that our new comment is there.

Curl and POST

curl can also be used for POST requests, which can be useful for shell-based workflows. One needs to use the --data option.

What have I asked you?

The request that generated a given response object can be retrieved as response.request. Can you see the headers of that request? And what about the body of the message? What is the type of the request object?

Solution

To print the headers:

print(response.request.headers)
{'User-Agent': 'python-requests/2.22.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}

The body of the request is accessible just with

response.request.body
'{"body": "A test comment"}'

And the type is PreparedRequest:

type(response.request)
requests.models.PreparedRequest

For better control, one could in principle create a Request object beforehand, call the prepare method on it to obtain a PreparedRequest, and then send it through a Session object.
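A minimal sketch of that workflow, reusing the body and auth objects from above, might look like this:

from requests import Request, Session

session = Session()
request = Request("POST",
                  "https://api.github.com/repos/mmesiti/web-novice-test-repo/issues/1/comments",
                  data=body,
                  auth=auth)
prepared = session.prepare_request(request)   # encodes the body and applies the authentication
response = session.send(prepared)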

Forgot the key

What error code do we get if we just forget to add the auth? How do the headers of the request change?

Solution

r = requests.post(url="https://api.github.com/repos/mmesiti/web-novice-test-repo/issues/1/comments", data=body)
r
<Response [401]>

The request headers are:

('User-Agent', 'python-requests/2.25.1')
('Accept-Encoding', 'gzip, deflate')
('Accept', '*/*')
('Connection', 'keep-alive')
('Content-Length', '26')

Most notably, the “Authorization” header is missing.

Authentication is a vast topic. The requests library implements a number of authentication mechanisms that you can use. To handle authentication for multiple requests, one could also use a Session object from the requests library (see Advanced Usage).
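For example, a Session can hold the authentication object once, so that every request made through it carries the credentials (a sketch, reusing ghtoken from above):

session = requests.Session()
session.auth = HTTPBasicAuth("your-github-username", ghtoken)
# every request made through this session now sends the Authorization header
response = session.get("https://api.github.com/user")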

Another GET example - the Met Office API

As an additional example of using requests to connect to an API rather than a plain web site, we’ll use the Met Office DataPoint API. The Met Office don’t especially want us to modify their forecasts (as much as we might like to modify the weather), so we will limit ourselves to GET requests.

To do this, you will need an API key. If you don’t already have an API for the Met Office DataPoint, then follow the instructions on the Setup page now.

The first step when working with API keys is to load the key into memory. This can be done either by reading the key from a separate file, or by specifying it directly in the code; reading it from a file keeps the key out of code that we might later share.

with open("metoffice-api-key.txt", "r") as file:
  api_key = file.read().strip()

Looking at the Met Office API reference, we can build a url to access the current forecasts for Swansea:

base_metoffice_url = "http://datapoint.metoffice.gov.uk/public/data/"
resource = "val/wxfcs/all/json/310149"
url = base_metoffice_url + resource
url

As shown in the API reference, this time we need to pass two parameters in the request: the time resolution we want (res), and the API key (key), so that the Met Office server can identify us.

As we saw in the previous episode, with curl from the command line, we would have to use the following command

$ curl "http://datapoint.metoffice.gov.uk/public/data/val/wxfcs/all/json/310149?res=3hourly&key=$(cat metoffice-api-key.txt)" | less

building the parameter string explicitly. This is also the syntax that is used in a browser address bar:

"protocol://host/resource/path?parname1=value1&parname2=value2..."

However, using the requests library allows us to use a nicer syntax:

response = requests.get(url, params={"res":"3hourly", "key":api_key})
response
<Response [200]>

As we saw previously, the code 200 means “success”. To make sure the response contains what we expect, let’s quickly print its headers (which have the structure of a dictionary):

for key, value in response.headers.items():
    print((key, value))
('Server', 'WaveServer 1.0')
('ETag', '1615686466177')
('Content-Type', 'application/json')
('WebServer', '-PROD-01')
('Access-Control-Allow-Origin', '*')
('Content-Encoding', 'gzip')
('Content-Length', '1051')
('Cache-Control', 'public, no-transform, must-revalidate, max-age=611')
('Expires', 'Sun, 14 Mar 2021 12:38:16 GMT')
('Date', 'Sun, 14 Mar 2021 12:28:05 GMT')
('Connection', 'keep-alive')
('Vary', 'Accept-Encoding')

As expected, the Content-Type is application/json. We can now look at the body of the response:

response.text[:100]
'{"SiteRep":{"Wx":{"Param":[{"name":"F","units":"C","$":"Feels Like Temperature"},{"name":"G","units"

As mentioned, the requests library can parse this JSON representation and return a more convenient Python object, using which we can access the inner data:

data = response.json()
data["SiteRep"]["Wx"]

Another location

As described in the API reference the Met Office has a list of locations available at /public/data/val/wxfcs/all/json/sitelist. Choose a site near you. What is the expected temperature tomorrow at 11 AM?

Hint: Once you have the right data, you can use

data["SiteRep"]["DV"]["Location"]["Period"][1]["Rep"][3]["T"]

to get to the quantity of interest.

Solution

We query the MetOffice API using

sitelist_url = base_metoffice_url + 'val/wxfcs/all/json/sitelist'
site_response = requests.get(sitelist_url, 
                             params = dict(key = api_key)) 
sitelist = site_response.json()["Locations"]["Location"] # sic, unfortunately

We can look for a location, e.g. Cardiff:

for site in sitelist:
    if site["name"] == "Cardiff":
        print(site['id'])

Cardiff has two locations; one of them has the ID 350758, so we can use it:

resource = "val/wxfcs/all/json/350758"
url = base_metoffice_url + resource
response = requests.get(url, params={"res":"3hourly", "key":api_key})
data = response.json()

Now we must explore the data to find the information we need. It turns out that it is in

data["SiteRep"]["DV"]["Location"]["Period"][1]["Rep"][3]["T"]

The meanings of the keys in each dictionary can be found in

data["SiteRep"]["Wx"]

Key Points

  • GET requests are used to read data from a particular resource.

  • POST requests are used to write data to a particular resource.

  • GET and POST methods may require some form of authentication (POST usually does).

  • The Python requests library offers various ways to deal with authentication.

  • curl can be used instead for shell-based workflows and debugging purposes.


Elements of Web Scraping with BeautifulSoup

Overview

Teaching: 25 min
Exercises: 15 min
Questions
  • How can I obtain data in a programmatic way from the web without an API?

Objectives
  • Have an idea about how to navigate the HTML element tree with Beautiful Soup and extract relevant information.

Sometimes, the data we are looking for is not available from an API, but it is available on web pages that we can view with our browser. As an example task, in this episode we are going to use the Beautiful Soup Python package for web scraping to find all the relevant information about future Software Carpentry Workshop events.

Exploring HTML code in the browser

Navigate to The Carpentries. The page we see has been rendered by the browser from the HTML, CSS (Cascading Style Sheets) and JavaScript code that is available or linked in the page in some way.

In many browsers (for example, Chrome, Chromium, and Firefox), we can look at the HTML source code of the page we are viewing with the CTRL+u shortcut (alternatively, you can right click on the page and choose “View Source” from the context menu).

Things to notice:

  • the HTML code is a nested, tree-like structure of elements, each delimited by tags in angle brackets (for example <head> and </head>);

  • most elements have an opening tag and a matching closing tag, and can contain other elements;

  • tags can carry attributes, written as name="value" pairs inside the opening tag (for example class or href).

Another way to explore the HTML code is to use the Developer Tools. In most browsers (Chrome, Chromium, and Firefox), you can use the CTRL+Shift+I key combination to open the Developer Tools (alternatively, find the right option in your browser menu).

Developer Tools in Safari

In Safari on macOS, the Developer Tools are hidden by default. To enable them, open the Preferences window, go to the Advanced tab, and enable the “Show Develop menu in menu bar” option.

With the Developer Tools open, pressing CTRL+Shift+C (or clicking on the mouse pointer icon in the top left of the panel) lets you hover over the elements in the rendered page and view their properties. If you click on one of them, the relevant part of the HTML code will be shown to you.

By using these techniques, we can understand how to locate the elements that we want when using Beautiful Soup later on.

Relevant HTML tags for this lesson

There are a number of tags that may be interesting in general, but specifically for what follows, we need to notice:

  • <table>, <tr>, and <td>, which delimit a table, its rows, and its data cells;

  • <a>, which defines a hyperlink, with the target address held in its href attribute;

  • <h2> and <div>, which mark a second-level heading and a generic container used to group elements;

  • <img>, <b>, and <br>, which embed an image, mark bold text, and insert a line break.

Scraping the page with Beautiful Soup

From the BeautifulSoup documentation:

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

First of all, let’s verify that we have BeautifulSoup installed:

$ python -c "import bs4"

If there is no output, then we are all set. If instead you see something along the lines of

Traceback (most recent call last):
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'bs4'

Then you have to install the package. One way of doing that is via pip, with

$ pip install beautifulsoup4

Once we are sure that BeautifulSoup is available, we can import the necessary libraries in Python and use requests to GET the Carpentries website content:

import requests
from bs4 import BeautifulSoup

response = requests.get("http://www.carpentries.org")
response
<Response [200]>

So, the request was successful. The HTML of the web page is in the text member of the response. We can pass that directly to the BeautifulSoup constructor, obtaining a soup object that we still need to navigate:

soup = BeautifulSoup(markup=response.text,
                     features="html.parser")

Looking at the HTML code, we see that just above our table there is the text “Upcoming Carpentries Workshops” inside a <h2> tag (code reindented for clarity)

...
<div class="row">
  <div class="medium-12 columns">
    <h2>Upcoming Carpentries Workshops</h2>
    
    Click on an individual event to learn more about that event, including contact information and registration instructions.

    <table class="table table-striped" style="width: 100%;">
      <tr> <td>
          <img src="https://carpentries.org/assets/img/logos/lc.svg" title="lc workshop" alt="lc logo" width="24" height="24" class="flags"/>
        </td>

        <td>
          <img src="https://carpentries.org/assets/img/flags/24/us.png" title="US" alt="us"  class="flags"/>
          
          <img src="https://carpentries.org/assets/img/flags/24/w3.png" title="Online" alt="globe image" class="flags"/>
          
          <a href="https://annajiat.github.io/2021-01-22-uab-NNLM-online">University of Alabama at Birmingham (online)</a>
          
          <br/>
          <b>Instructors:</b> Annajiat Alim Rasel, Cody Hennesy, Camilla Bressan, Mary Ann Warner
          
        </td>
        <td>
          Jan 22 - Apr 23, 2021
        </td>
      </tr>
...

We can then look for the table by finding the HTML element that contains that text, using the string keyword argument:

(soup.find(string="Upcoming Carpentries Workshops"))
'Upcoming Carpentries Workshops'

By using the find method on a BeautifulSoup object, we look at all of its descendants and obtain other BeautifulSoup objects that we can search in the same way as the original one. But how do we get the parent element? We can use the find_parents() method, which returns a list of BeautifulSoup objects representing the ancestors of the given element, starting from the immediate parent of the element itself and ending with the element at the root of the tree (soup in this case). The second parent in the list is the one that also contains the table we are interested in:

(soup
 .find(string = "Upcoming Carpentries Workshops")
 .find_parents()[1])
<div class="medium-12 columns">
<h2>Upcoming Carpentries Workshops</h2>
          
	  Click on an individual event to learn more about that event, including contact information and registration instructions.

<table class="table table-striped" style="width: 100%;">
<tr>
<td>
<img alt="lc logo" class="flags" height="24" src="https://carpentries.org/assets/img/logos/lc.svg" title="lc workshop" width="24">
</img></td>
<td>
<img alt="us" class="flags" src="https://carpentries.org/assets/img/flags/24

It seems we are on the right track. Now let’s focus on the table element:

(soup
 .find(string = "Upcoming Carpentries Workshops")
 .find_parents()[1]
 .find("table"))
<table class="table table-striped" style="width: 100%;">
<tr>
<td>
<img alt="lc logo" class="flags" height="24" src="https://carpentries.org/assets/img/logos/lc.svg" title="lc workshop" width="24">
</img></td>
<td>
<img alt="us" class="flags" src="https://carpentries.org/assets/img/flags/24/us.png" title="US">
<img alt="globe image" class="flags" src="https://carpentries.org/assets/img/flags/24/w3.png" title="Online">
<a href="https://annajiat.github.io/2021-01-22-uab-NNLM-online">University of

Now we can get a list of row elements with

rows = (soup
 .find(string = "Upcoming Carpentries Workshops")
 .find_parents()[1]
 .find("table")
 .find_all("tr"))

Let’s focus now on the first element:

rows[0]
<tr>
<td>
<img alt="lc logo" class="flags" height="24" src="https://carpentries.org/assets/img/logos/lc.svg" title="lc workshop" width="24">
</img></td>
<td>
<img alt="us" class="flags" src="https://carpentries.org/assets/img/flags/24/us.png" title="US">
<img alt="globe image" class="flags" src="https://carpentries.org/assets/img/flags/24/w3.png" title="Online">
<a href="https://annajiat.github.io/2021-01-22-uab-NNLM-online">University of Alabama at Birmingham (online)</a>
<br>
<b>Instructors:</b> Annajiat Alim Rasel, Cody Hennesy, Camilla Bressan, Mary Ann Warner
      
      
	</br></img></img></td>
<td>
		Jan 22 - Apr 23, 2021
	</td>
</tr>

We can now split the row into three table data elements:

td0, td1, td2 = rows[0].find_all("td")

If we want the link to the workshop page, we can look at the <a> tag in td1, and specifically at its href attribute:

link = td1.find("a")["href"]
link
'https://annajiat.github.io/2021-01-22-uab-NNLM-online'

We can get a list of instructor names from the text content of td1:

td1_text_split = td1.text.split("Instructors:")

# create a blank list to populate with the instructor names
instructors = []

for name in td1_text_split[1].split(","):
    instructors.append(name.strip())

print(instructors)
# names redacted 
['instructor 1', 'instructor 2', 'instructor 3', 'instructor 4'] 

A more direct way

Can we look directly for table elements in the soup? How would you do that? Would that work?

Solution

We can check how many table elements are in the soup with

len(soup.find_all("table"))

We gather that there is only one table in the soup, so that should be the right one! We can thus use soup.find("table") to reach the right element right away.

List the workshops

Create a list of all the workshops, reporting for each one:

  • link
  • location
  • date
  • names of instructors

Solution

rows = soup.find("table").find_all("tr")
def process_row(row):
    _,td1,td2 = row.find_all("td")
    link = td1.find("a")["href"]

    td1_location_people = td1.text.split("Instructors:")
    location = td1_location_people[0].strip()
    # What about helpers?
    people = td1_location_people[1].split("Helpers:")
    instructors_string = people[0]
    # we ignore helpers, might not be present
    # helpers_string = people[1] 
    instructors = []
    for n in instructors_string.split(","):
        instructors.append(n.strip())
    date = td2.text.strip()

    return dict(
       link = link,
       location = location,
       instructors = instructors,
       date = date
    ) 

workshops = []
for row in rows:
    workshops.append(process_row(row))

Additional material

Beautiful Soup is a rich library that has a lot of powerful features that we are unable to discuss here.

A close look at the official documentation is worth the time for anyone seriously interested in web scraping.

Scraping Energy market data

Look at EPEX SPOT’s data on the energy market. How would you extract the price of the energy as a function of time? Can you look at other countries, and on different dates?

Solution

import requests
import pandas
from bs4 import BeautifulSoup

# From the url displayed in the browser in the address bar 
response = requests.get("https://www.epexspot.com/en/market-data",
                        params=dict(market_area="GB",
                                    trading_date="2021-03-19",
                                    delivery_date="2021-03-20",
                                    underlying_year="",
                                    modality="Auction",
                                    sub_modality="DayAhead",
                                    product="60",
                                    data_mode="table",
                                    period=""))

soup = BeautifulSoup(response.text,"html.parser")

The EPEX table is in two parts:

  1. the first part just shows baseload and peakload prices plus some whitespace (in total it’s 5 lines long)
  2. the rest of the data is the hourly prices and has more columns.

HTML allows variable-width tables like this, but pandas doesn’t like them. Let’s strip off those first 5 rows to make it a valid table again, and append <table> and </table> to the beginning and end:

rows = "<table>" + "".join(str(row) for row in soup.find("table").find_all("tr")[5:]) + "</table>"

Then we can convert the string to a pandas dataframe:

df = pandas.read_html(rows)[0]

The timestamps aren’t stored in the table but in a separate div, so let’s recreate them:

df['time'] = range(0,24)
df = df.set_index('time')

We now have the EPEX data inside a pandas dataframe, ready for processing, graphing, etc. We can try changing the parameters (for example, using “DE-LU”) to look at another nation. Are you able to change the parameters for another date? Do you get different results?

Javascript code, the DOM and Selenium

The JavaScript code running on the page can actively change the structure of the HTML document. For some web pages, this is a crucial part of the rendering process: in some of those cases the JavaScript code must be run to download the data you are looking for from another URL, and populate the web page with that data and any additional element of the page design.

In those cases, using requests and BeautifulSoup might not be enough (as requests gets the HTML without running the JavaScript code on the page), but you can use the Selenium WebDriver to load the page in a fully-fledged browser and automate the interaction with it.
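A minimal sketch of that approach (assuming the selenium package and a suitable browser driver are installed) might look like:

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Firefox()             # opens a real browser that runs the page's JavaScript
driver.get("https://www.epexspot.com/en/market-data")
soup = BeautifulSoup(driver.page_source, "html.parser")   # the HTML after the scripts have run
driver.quit()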

Key Points

  • A BeautifulSoup object can be navigated in many ways:

  • Use find to look for the first element that matches the given criteria in a subtree

  • Use find_all to obtain a list of elements that matches the given criteria in a subtree

  • Use find_parents to get the list of ancestors of the given element