Asynchronous behavior: Coroutines, threads, processes, event loops, asyncio, async & await

Asynchronous behavior in Python touches on many topics -- and techniques -- that are described in this section. First things first, let's get the main concept clear:

Asynchronous behavior: Happens when task A starts and task B can start right away.
Synchronous behavior: Happens when task A starts and task B can't start until task A finishes.

First let's recall the Python generator presented at the end of listing A-19. The generate_infinite_order_numbers generator allows discretionary access to an infinite sequence of numbers, by simply calling the built-in next() function to get a new value. The key behavior of this and all Python generators is that they pause after producing each value, allowing other tasks to execute, and resume their work when they're called again at a user's discretion.

Just like a generator is a special kind of iterator, there's another concept called a coroutine which is a special kind of generator. While a generator is used to produce data efficiently, a coroutine -- in addition to producing data -- can also accept data input to influence its data production. Listing A-24 illustrates a coroutine that creates a bot to inspect a site's robots.txt file and then crawl a series of paths if it's allowed by the robots.txt file.

Listing A-24. Python coroutine example

import urllib.request
import urllib.robotparser

def crawler(site):
    try:
        print(f"Attempting to read robots.txt from site {site}")
        robots_file = urllib.robotparser.RobotFileParser()
        robots_file.set_url(f"{site}/robots.txt")
        robots_file.read()
    except Exception as e:
        print(f"Unable to crawl robots.txt from site {site}")
    print(f"Ready to start crawling {site}!")
    while True:
        try:
            path = (yield)
            if path is None:
                path = "/"
            if robots_file.can_fetch("*",path):
                print(f"{site}{path} CAN be crawled, attempting to crawl...")
                page = urllib.request.urlopen(site)
                page_size = len(page.read())
                print(f"{site}{path} size is {page_size}")
            else:
                print(f"{site}{path} CANNOT be crawled")
        except Exception as e:
            print(f"Crawler error on {site}{path}: {e}")

>>> crawler("https://duckduckgo.com")
<generator object crawler at 0x7fecabb602e0>

# Coroutine/generator must use reference to be called
>>> bot = crawler("https://duckduckgo.com")

# Call with next() to advance toward yield statement
>>> next(bot)
Attempting to read robots.txt from site https://duckduckgo.com
Ready to start crawling https://duckduckgo.com!

>>> bot.send("/")
https://duckduckgo.com/ CAN be crawled, attempting to crawl...
https://duckduckgo.com/ size is 5722

>>> bot.send("/lite")
https://duckduckgo.com/lite CANNOT be crawled

# Calling next() on the coroutine/generator sends None, which is treated as the root path
>>> next(bot)
https://duckduckgo.com/ CAN be crawled, attempting to crawl...
https://duckduckgo.com/ size is 5722

# Finish coroutine/generator
>>> bot.close()

# Coroutine/generator is now closed
>>> bot.send("/")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration

Listing A-24 leverages Python's built-in urllib package to inspect a site's robots.txt and crawl site pages. The crawler() coroutine/generator accepts a site argument and immediately attempts to read the robots.txt file of said site. If you're unfamiliar with robots.txt, it's a standard file used by most websites, indicating a site owner's preferences for web crawlers (e.g. allowing/disallowing certain locations, allowing/disallowing certain crawlers by name, etc).

Once the attempt to read a site's robots.txt is complete, an endless loop is started with while True:. The first instruction in the endless loop, path = (yield), is the key to defining coroutines. The yield keyword should be familiar from the Python generator section, but notice the yield keyword in this example is an expression (i.e. it's on the right side of the = ). Although yield still represents an execution pause just like in a generator, because it's now part of an expression, the pause lasts until an input value is passed to the generator, thus making it a coroutine.

Once data input is passed to the coroutine/generator through path = (yield), the input is assigned to the path reference, indicating a path to crawl on the site provided at the outset. Next, leveraging robots_file created with urllib.robotparser, an evaluation is made to see if the provided path is allowed to be read. If the path is allowed, urllib.request is used to fetch a page from the site and the page's total size is printed; if the path isn't allowed, a message is also printed.

The invocation sequence in listing A-24 shows that calling crawler("https://duckduckgo.com") directly doesn't run any of the function's logic; it only outputs a generator object, because you're dealing with a generator and all generators must be assigned to a reference to be used -- see the Python generator section for details. Next, the call bot = crawler("https://duckduckgo.com") assigns the generator to the bot reference. At this point nothing has been evaluated yet, so a call to next() must be made on the reference to start moving through the coroutine/generator. Notice the first call to next(bot) outputs Attempting to read robots.txt from site https://duckduckgo.com followed by Ready to start crawling https://duckduckgo.com! which means the first section of the coroutine/generator was evaluated.

When the coroutine/generator reaches the path = (yield) statement -- like all other generators with a yield statement in them -- the execution pauses. In this case, the pause lasts until a value is passed into the coroutine/generator with send(). You can see the call bot.send("/") attempts to read the home page / on the provided site and outputs the page's size. Shortly after, you can see the call bot.send("/lite") attempts to read the site's /lite path, which is not allowed by the robots.txt file.

Notice it's possible to continue to use next() on the coroutine/generator, however, because in this case you're dealing with a coroutine, it means next(bot) sends an empty None value to path = (yield), which is then treated as the home page / path due to the conditional if path is None:.

Finally, notice the execution of bot.close() which is a method to explicitly finish a coroutine/generator. If you attempt to do another iteration (i.e. next() or send()) on the coroutine/generator once close() is called, you get the exception StopIteration, which is the same exception raised when any iterator/generator has run its course.
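To see the same life cycle without any network calls, here's a minimal sketch using a made-up running_total coroutine (the name and its logic are hypothetical, they aren't part of listing A-24). It goes through the same steps: priming with next(), passing values with send(), calling next() to pass None and finishing with close().

```python
def running_total():
    """Coroutine-style generator: receives numbers via send() and tracks a total."""
    total = 0
    while True:
        value = yield total  # pause here; resumes when next() or send() is called
        if value is None:    # next() delivers None, so treat it as "add nothing"
            value = 0
        total += value

acc = running_total()
first = next(acc)          # prime the coroutine: run up to the first yield
after_ten = acc.send(10)   # resume with value = 10
after_five = acc.send(5)   # resume with value = 5
unchanged = next(acc)      # equivalent to acc.send(None)
acc.close()                # explicitly finish the coroutine

try:
    acc.send(1)
    closed = False
except StopIteration:
    closed = True

print(first, after_ten, after_five, unchanged, closed)  # 0 10 15 15 True
```

Unlike the bare (yield) in listing A-24, this sketch uses yield total, so each send() call also hands a value back to the caller.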

As you can see in listing A-24, coroutines are a powerful Python concept. In this case, you're able to crawl a site's robots.txt file, keep it on hand, input paths in an on-demand fashion to verify and then crawl them, all the while being able to perform other tasks in the interim. So the more complex the computation made at the outset of a coroutine (e.g. crawl a site, query a database) the more value it delivers, since you don't need to constantly perform the same logic over and over again like in a regular method.

With a basic understanding of what constitutes a Python coroutine, let's explore an important issue that can affect coroutines, as well as any other regular Python function.

Although all coroutines allow a function to pause its execution and return to it later while other work is done, this still doesn't mean this other work can start immediately if there's something in the coroutine taking a long time. If you run the coroutine in listing A-24 with a site like https://overstock.com/ and attempt to crawl the home page (e.g. bot.send("/") ), you'll notice the coroutine won't let you run anything for some time. Besides indicating the robots.txt for https://overstock.com/ is not accurate and they overzealously block bots, it also indicates there's synchronous (a.k.a. blocking) logic in the coroutine in listing A-24, potentially leading to unresponsive behavior if such logic is used in an application.

The problematic logic in the coroutine in listing A-24 is urllib.request.urlopen(site). Since this logic is performed over a network and you have no control over a remote site, you don't know if a given site will take 10 milliseconds, 10 seconds, 1 minute or if it will respond at all.

There are various solutions to handle this blocking behavior:

1. Add an explicit timeout to the urllib.request.urlopen statement, so no call blocks longer than the timeout.
2. Use concurrent threads or concurrent processes.
3. Use an async/await coroutine to run the logic in an event loop.

Option 1 consists of modifying the line page = urllib.request.urlopen(site) in listing A-24 to page = urllib.request.urlopen(site, timeout=5) to indicate the call should take a maximum of 5 seconds. If no response is received after 5 seconds, the call is abandoned with the error The read operation timed out and control is returned to the caller. This is a superficial solution because it doesn't address the main blocking problem, it just attempts to guess how long is long enough. Expecting a response from any web site in 5 seconds is reasonable, but as you just saw, it's entirely possible for a site to not respond at all or to take longer if it's under heavy load or has intermittent network issues. So it raises the question, is waiting 5 seconds too much or too little? What about 15 seconds? Or 1 minute? While setting a timeout value can save you some headaches due to runaway logic, ideally you shouldn't rely on this solution alone because it tries to restrict workloads you have no control over.
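To try out the timeout behavior without depending on a real site, the following sketch starts a deliberately slow web server on the local machine and crawls it twice with different timeout values. The SlowHandler class, the page_size function and the one second delay are all made up for illustration, and the sketch assumes your environment lets you open a local port.

```python
import http.server
import socket
import threading
import time
import urllib.error
import urllib.request

class SlowHandler(http.server.BaseHTTPRequestHandler):
    """Hypothetical handler that answers only after a one second delay."""
    def do_GET(self):
        time.sleep(1)
        try:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"late reply")
        except OSError:
            pass  # the client already gave up and closed the connection

    def log_message(self, *args):
        pass  # keep the demo output quiet

# Port 0 asks the operating system for any free local port
server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), SlowHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
slow_url = f"http://127.0.0.1:{server.server_port}/"

def page_size(url, timeout):
    """Return the page size in bytes, or None if the call fails or times out."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as page:
            return len(page.read())
    except (urllib.error.URLError, socket.timeout) as e:
        print(f"Can't crawl {url}: {e}")
        return None

print(page_size(slow_url, timeout=0.2))  # gives up before the reply arrives
print(page_size(slow_url, timeout=5))    # waits long enough for the reply
```

Notice both calls still block the caller, the timeout only puts an upper limit on how long the blocking lasts.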

The other options using concurrent threads, concurrent processes and async/await are discussed in the next sections.

Note Because synchronous behavior can apply to both coroutines and ordinary functions, I'm going to forget about coroutines for a moment and use a plain function that also crawls sites to illustrate synchronous/asynchronous behavior. Toward the end, I'll reintroduce coroutines using the newer async/await syntax.

Asynchronous behavior with threads and processes

Listing A-25 illustrates a plain function that crawls a site's home page, inspired by the same logic used in listing A-24.

Listing A-25. Python function with blocking network calls

import urllib.request

SITES = ["https://google.com/","https://duckduckgo.com/",
	 "https://amazon.com/","https://overstock.com/",
	 "https://www.54356456456456.com",
	 "https://nytimes.com","https://ft.com/",
	 "https://wired.com","https://arstechnica.com/",
	 "https://abfdgdfsegfdgfdfsd.com",
	 "https://twitter.com","https://facebook.com/"]

def crawler(site):
    if site is None:
        print("Must provide a site to crawl")
    else:
        try:
            page = urllib.request.urlopen(site)
            page_size = len(page.read())
            print(f"Home page {site} size is {page_size}")
        except Exception as e:
            print(f"Can't crawl home page on {site}: {e}")

def multibot(sites):
    for site in sites:
        crawler(site)

>>> multibot(SITES)
Home page https://google.com/ size is 13828
Home page https://duckduckgo.com/ size is 5722
Can't crawl home page on https://amazon.com/: HTTP Error 503: Service Unavailable
<<BLOCKS UNTIL TIMEOUT IS REACHED OR overstock.com RESPONDS >>
...
...
Can't crawl home page on https://overstock.com/: The read operation timed out
Can't crawl home page on https://www.54356456456456.com: <urlopen error [Errno -2] Name or service not known>
Home page https://nytimes.com size is 1514405
Home page https://ft.com/ size is 190189
Home page https://wired.com size is 768291
Home page https://arstechnica.com/ size is 90577
Can't crawl home page on https://abfdgdfsegfdgfdfsd.com: <urlopen error [Errno -2] Name or service not known>
Home page https://twitter.com size is 64042
Home page https://facebook.com/ size is 197103

Listing A-25 begins by importing Python's built-in urllib.request to make network calls and crawl site pages, followed by a SITES list to crawl. The crawler() function takes a site argument. A check is first made to confirm the site is not None; if a site value is present, its home page is fetched and its size is output.

Next, listing A-25 declares the multibot function that takes a list of sites as its only argument, loops over the list and calls the crawler() function with each site. Finally, you can see the statement multibot(SITES) invokes the function with the list of SITES and outputs its print statements. Notice crawling isn't successful for every site: there's an HTTP Error 503: Service Unavailable message, a The read operation timed out message and a couple of <urlopen error [Errno -2] Name or service not known> messages for the made up sites.

More importantly, notice the crawler function blocks when it reaches the https://overstock.com/ site, just like it did with the coroutine in listing A-24. You can go ahead and add an explicit timeout value to urllib.request.urlopen like you did earlier, but that still doesn't address the root problem: coroutines and plain functions can both be affected by synchronous/blocking logic.
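You can reproduce this blocking effect without a network connection. The following sketch replaces real sites with made-up names and response times (FAKE_SITES and fake_crawler are hypothetical stand-ins) and uses time.sleep() to simulate the wait. Because the calls run one after another, the total time is roughly the sum of all the waits.

```python
import time

# Hypothetical response times in seconds, standing in for real sites
FAKE_SITES = {"fast.example": 0.05, "slow.example": 0.3, "medium.example": 0.1}

def fake_crawler(site):
    """Pretend to fetch a page by blocking for the site's response time."""
    time.sleep(FAKE_SITES[site])
    return site

def multibot(sites):
    finished = []
    for site in sites:
        # Nothing else in this program runs while this call is blocked
        finished.append(fake_crawler(site))
    return finished

start = time.perf_counter()
order = multibot(FAKE_SITES)
elapsed = time.perf_counter() - start

print(order)
print(f"Total time: {elapsed:.2f}s")
```

Here the medium site has to wait for the slow site to finish, even though it could have answered much sooner.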

An option to solve this synchronous problem is to use either threads or processes, so each site's crawl activities don't interfere with one another. In very simple terms, threads represent a way to partition a process into various tasks so each task can run on its own without blocking other tasks. On the other hand, with the advent of multicore processors, each core represents a processor that can be leveraged on its own, in which case Python is capable of running different processes so that each computation (e.g. crawl activity) runs as its own process without being blocked by other processes.

Caution Multithreaded programming and multicore programming are deep and complicated topics in any programming language. What you'll see next is just a series of "tip of the iceberg" examples on these Python topics; I won't go into a lot of detail about their complications or shortcomings.

Listing A-26 illustrates a refactored version of the crawler in listing A-25 using Python's threading module ^[5].

Listing A-26. Python function using threads to limit blocking calls

import threading
import urllib.request

SITES = ["https://google.com/","https://duckduckgo.com/",
	 "https://amazon.com/","https://overstock.com/",
	 "https://www.54356456456456.com",
	 "https://nytimes.com","https://ft.com/",
	 "https://wired.com","https://arstechnica.com/",
	 "https://abfdgdfsegfdgfdfsd.com",
	 "https://twitter.com","https://facebook.com/"]

def crawler(site):
    if site is None:
        print("Must provide a site to crawl")
    else:
        try:
            page = urllib.request.urlopen(site)
            page_size = len(page.read())
            print(f"Home page {site} size is {page_size}")
        except Exception as e:
            print(f"Can't crawl home page on {site}: {e}")

def multibot(sites):
    threads = []

    for site in sites:
        # Invoke crawler with thread
        t = threading.Thread(target=crawler, args=(site,), name=site)
        # Start thread
        t.start()
        print(f"Starting thread for site {site}")
        # Add thread to list of threads to track status
        threads.append(t)

    # Loop over threads until empty
    while threads:
        # Iterate over a copy, since finished threads are removed from the list
        for thread in threads[:]:
            if thread.is_alive():
                print(f"Thread {thread.name} still running...")
            else:
                thread.join()
                print(f"Thread {thread.name} finished.")
                # Remove from thread list, since it's now finished
                threads.remove(thread)

>>> multibot(SITES)
Starting thread for site https://google.com/
Starting thread for site https://duckduckgo.com/
Starting thread for site https://amazon.com/
...
...
Home page https://google.com/ size is 12990
Home page https://duckduckgo.com/ size is 5722
...
...
Thread https://duckduckgo.com/ finished.
Thread https://google.com/ finished.
...
...
Thread https://overstock.com/ still running...
Thread https://overstock.com/ still running...
Thread https://overstock.com/ still running...
Thread https://overstock.com/ still running...
Thread https://overstock.com/ still running...

Listing A-26 is essentially the same as listing A-25 with the exception of the multibot function. The first thing the multibot function does differently is declare a threads list to keep track of the status of each thread. Next, a loop is made over the sites reference to invoke a thread on each site. On each iteration, a thread is created with the threading.Thread(target=crawler, args=(site,), name=site) syntax, where target is the name of the method to execute on the thread, args the arguments for said method and name serves as a friendly name given to the thread. Once each thread is assigned to the t reference, the t.start() call triggers the work for the thread and the same reference is added to the threads list to keep track of all the created threads.

At this point you have multiple crawlers each working on their own thread. The next loop while threads: indicates it should run until the threads list is empty. Because the first time around you'll have a thread list the size of SITES, it will enter right away. Next, a loop is made over each of the tracked threads. If a thread is still running -- a check made with thread.is_alive() -- a message is printed. Otherwise, the thread has finished and a call is made to thread.join() to wait for it to fully terminate, plus the thread itself is also removed from the threads list so it isn't found on subsequent iterations looking for running threads.

Finally, you can see in listing A-26 the execution of multibot(SITES) outputs the print messages for the various threads running the crawler() function. The good news about the output in listing A-26 is that the work as a whole is no longer held back by a single unresponsive site; in this case 11 of 12 sites complete their work in their own time without interfering with other sites. The not so good news is the thread for site https://overstock.com/ keeps running until it finishes its work (e.g. timing out or getting a response).

More important to remember though, even with the site https://overstock.com/ taking a long time, the rest of an application is unaffected by this kind of runaway task since it's running on its own thread, which can be left to run until completion or have some other logic applied to it (e.g. report an error and/or stop waiting for the thread after 'x' time).
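One way to apply this kind of time-based logic is to pass a timeout to join(). Python's threading module doesn't offer a call to forcibly stop a thread, so the following sketch only limits how long the caller waits. The slow_task function and its durations are made up, with time.sleep() standing in for a slow site.

```python
import threading
import time

def slow_task(seconds):
    """Stand-in for a crawl that takes a long time to respond."""
    time.sleep(seconds)

# daemon=True lets the program exit even if the thread is still running
worker = threading.Thread(target=slow_task, args=(1.0,), name="slow-site", daemon=True)
worker.start()

worker.join(timeout=0.1)         # wait at most 0.1 seconds for the thread
timed_out = worker.is_alive()    # still alive means the wait ran out
if timed_out:
    print(f"Thread {worker.name} is taking too long, moving on")

worker.join()                    # optional: wait for it to finish after all
print(f"Thread {worker.name} finished: {not worker.is_alive()}")
```

The join(timeout=...) call returns either when the thread finishes or when the timeout expires, so is_alive() is needed afterward to tell the two cases apart.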

Listing A-27 illustrates a refactored version of the crawler in listing A-26 using processes -- instead of threads -- for which it uses Python's multiprocessing module ^[6].

Listing A-27. Python function using processes to limit blocking calls

import multiprocessing
import urllib.request

SITES = ["https://google.com/","https://duckduckgo.com/",
	 "https://amazon.com/","https://overstock.com/",
	 "https://www.54356456456456.com",
	 "https://nytimes.com","https://ft.com/",
	 "https://wired.com","https://arstechnica.com/",
	 "https://abfdgdfsegfdgfdfsd.com",
	 "https://twitter.com","https://facebook.com/"]

def crawler(site):
    if site is None:
        print("Must provide a site to crawl")
    else:
        try:
            page = urllib.request.urlopen(site)
            page_size = len(page.read())
            print(f"Home page {site} size is {page_size}")
        except Exception as e:
            print(f"Can't crawl home page on {site}: {e}")

def multibot(sites):
    processes = []

    for site in sites:
        # Invoke crawler with process
        p = multiprocessing.Process(target=crawler, args=(site,), name=site)
        # Start process
        p.start()
        print(f"Starting process for site {site}")
        # Add process to list of processes to track status
        processes.append(p)

    # Loop over processes until empty
    while processes:
        # Iterate over a copy, since finished processes are removed from the list
        for process in processes[:]:
            if process.is_alive():
                print(f"Process {process.name} still running...")
            else:
                process.join()
                print(f"Process {process.name} finished.")
                # Remove from processes list, since it's now finished
                processes.remove(process)

>>> multibot(SITES)
Starting process for site https://google.com/
Starting process for site https://duckduckgo.com/
...
...
Home page https://google.com/ size is 12990
Home page https://duckduckgo.com/ size is 5722
...
...
Process https://duckduckgo.com/ finished.
Process https://google.com/ finished.
...
...
Process https://overstock.com/ still running...
Process https://overstock.com/ still running...
Process https://overstock.com/ still running...
Process https://overstock.com/ still running...
Process https://overstock.com/ still running...

The changes in listing A-27 vs. listing A-26 are minimal, in part because the multiprocessing module has syntax and functionality similar to the threading module. So, import threading changes to import multiprocessing; threads = [] to processes = []; and threading.Thread(target=crawler, args=(site,), name=site) changes to multiprocessing.Process(target=crawler, args=(site,), name=site). The remainder of the syntax and execution workflow remains the same, with the process for the site https://overstock.com/ running until the very end and the other 11 of 12 sites finishing before.

So when should you choose threads over processes? Or vice versa? It depends. Threads are more lightweight than processes, so threads are faster to create and consume fewer resources than full-fledged processes that are spun up at the operating system level. On the other hand, processes have greater isolation so they're easier to implement (i.e. it's harder to mess things up), whereas with threads running in a single process, it's easier to inadvertently introduce a bug where one thread interferes with another. That said, there are times when threads are the better fit because you want to communicate activity between tasks.
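As a small illustration of threads sharing the same memory, the following sketch has four threads update a single counter. The counter, counter_lock and add_many names are made up for this example. The threading.Lock ensures only one thread at a time performs the read-and-update step, which is what protects against one thread interfering with another.

```python
import threading

counter = 0
counter_lock = threading.Lock()

def add_many(times):
    global counter
    for _ in range(times):
        # Without the lock, two threads could read the same old value
        # and one of the increments would be lost
        with counter_lock:
            counter += 1

threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000
```

With separate processes this kind of shared counter wouldn't work as written, because each process gets its own copy of the counter reference.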

In addition to the approaches presented in listing A-26 and listing A-27, there's yet another approach in Python to implement both threads and processes using the concurrent.futures module ^[7]. Listing A-28 illustrates this iteration with concurrent.futures for both threads and processes.

Listing A-28. Python function using thread/process pools to limit blocking calls

import concurrent.futures
import urllib.request

SITES = ["https://google.com/","https://duckduckgo.com/",
	 "https://amazon.com/","https://overstock.com/",
	 "https://www.54356456456456.com",
	 "https://nytimes.com","https://ft.com/",
	 "https://wired.com","https://arstechnica.com/",
	 "https://abfdgdfsegfdgfdfsd.com",
	 "https://twitter.com","https://facebook.com/"]

def crawler(site):
    if site is None:
        print("Must provide a site to crawl")
    else:
        try:
            page = urllib.request.urlopen(site)
            page_size = len(page.read())
            print(f"Home page {site} size is {page_size}")
        except Exception as e:
            print(f"Can't crawl home page on {site}: {e}")

def multibot(sites):
    #with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    with concurrent.futures.ProcessPoolExecutor(max_workers=5) as executor:
        executor.map(crawler, sites)


>>> multibot(SITES)

Listing A-28 greatly simplifies the implementation of threads and processes in the multibot method, with everything else remaining the same. The concurrent.futures module offers the concept of a pool, which means you create a pool of either threads or processes that can be re-used as soon as they finish their work. This technique is simpler and more efficient than the ones presented earlier, because you don't need to explicitly spin-up new threads or processes.

In listing A-28 you can see the thread implementation with concurrent.futures.ThreadPoolExecutor is commented out, but the process implementation concurrent.futures.ProcessPoolExecutor uses identical syntax. The with concurrent.futures.ProcessPoolExecutor(max_workers=5) as executor line opens a pool and makes it available through the executor reference, where the max_workers=5 parameter indicates the creation of up to 5 workers (i.e. threads or processes) in the pool. Here it's important to mention that if you set max_workers too low, you run the risk of creating a backlog of work since a small number of threads/processes can all be kept busy/blocked, whereas a high number of max_workers can mean you create unused threads/processes that consume extra resources (e.g. memory).

With access to a thread/process pool via executor, the executor.map(crawler, sites) line triggers the execution of the actual threads/processes. In this case, the pool's .map function works just like Python's core map() function, where the first argument is the function on which to create the threads/processes -- in this case crawler -- and the second argument is a list of arguments with which to run the thread/process creation function, in this case a list of sites.
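If you also want the values returned by each call, the following sketch shows two ways to collect them from a thread pool. It uses a made-up fake_crawler function that sleeps instead of making network calls. The executor.map() call hands back results in the same order as the inputs, while executor.submit() combined with concurrent.futures.as_completed() hands them back as each call finishes.

```python
import concurrent.futures
import time

def fake_crawler(delay):
    """Stand-in for a crawl: block for `delay` seconds, then return it."""
    time.sleep(delay)
    return delay

delays = [0.3, 0.1, 0.2]

with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    # map() hands back results in the order of the inputs
    mapped = list(executor.map(fake_crawler, delays))

with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    futures = [executor.submit(fake_crawler, d) for d in delays]
    # as_completed() hands back futures as each one finishes
    completed = [f.result() for f in concurrent.futures.as_completed(futures)]

print(mapped)
print(completed)
```

The crawler() function in listing A-28 prints its output and returns nothing, which is why that listing doesn't need to read the result of executor.map().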

The final execution step in listing A-28 is identical to the one in listing A-27 and listing A-26. With the exception of outputting when a thread starts and finishes, the outcome of the execution in listing A-28 is also identical to these past two listings, with the thread/process for the site https://overstock.com/ blocking longer than any of the other sites.

Asynchronous behavior with asyncio

With the release of Python 3.4 -- circa 2014 -- Python introduced the asyncio library^[8], which stands for Asynchronous I/O. asyncio is an ambitious undertaking that covers a wide range of features around asynchronous Python, such as coroutines, network operations, subprocesses, queues and event loops, among other things. Although asyncio has become part of Python's standard library, it has undergone a long evolution, to the point that even the Python 3.8 release^[9] and Python 3.9 release^[10] include new features and deprecated syntax.

So given asyncio's large ambitions and scope creep, what I'll describe next is just the reimplementation of the previous asynchronous examples using asyncio. Be aware there's a lot more to asyncio than the following examples, not to mention, you're likely to encounter other sources that use older asyncio syntax that achieve the same results presented in some of the following examples.

Python event loops

Event loops have been a popular computer science concept for decades, particularly in user interface focused programming languages and high-performance designs. For example, languages like JavaScript and Tcl/Tk rely on an event loop at the center of their design, similarly, high-performance web servers like Nginx and LiteSpeed also use an event loop to be able to efficiently handle more requests than other web servers.

Event loops are closely tied to asynchronous/non-blocking behavior, whereby events(tasks) are dispatched to a loop where they can await their conclusion, all the while the loop is able to receive more events(tasks) without waiting synchronously/blocking for other events(tasks) to complete. For example, all activity done in a JavaScript engine (e.g. clicks, requests) is funneled through its event loop, similar to how requests made to web servers like Nginx are also managed through its event loop, all with the purpose of delivering a more performant and non-blocking behavior.

It's worth pointing out that event loops aren't new to Python, event loops have been at the center of Python libraries like Twisted^[11] and Tornado^[12] long before asyncio. What's new is asyncio incorporates an event loop into Python's standard library, making it easier to interact with a standardized event loop vs. learning to work with different event loops (i.e. APIs).

However, what's still required to work with Python's asyncio event loop are explicit instructions, so there's no way around learning new syntax and APIs. This is in contrast to event loops like those in JavaScript, where the event loop just is and events(tasks) are implicitly handled by the event loop without needing special instructions.

Coroutines revisited with asyncio

Back in listing A-24 you learned how Python coroutines allow an operation to pause and then resume its work at a caller's discretion, while also being able to accept data input in the process. It turns out, this coroutine behavior (i.e. pause, resume, accept input) is a natural fit to event loops, since coroutines are capable of pausing/resuming activity just like event loops expect to receive operations and return results after an indeterminate amount of time.
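To make this fit more concrete, here's a toy sketch of a loop that runs plain generators in turns. This is not how asyncio is implemented; the task and run_loop names are made up purely to illustrate the idea. Each task pauses with yield and the loop resumes whichever task is next in the queue.

```python
from collections import deque

def task(name, steps, log):
    for i in range(steps):
        log.append(f"{name} step {i}")
        yield  # pause and hand control back to the loop

def run_loop(tasks):
    """Toy loop: resume each task in turn until all of them finish."""
    queue = deque(tasks)
    while queue:
        current = queue.popleft()
        try:
            next(current)          # resume the task until its next pause
        except StopIteration:
            continue               # the task finished, don't requeue it
        queue.append(current)      # still has work, back of the queue

log = []
run_loop([task("A", 2, log), task("B", 3, log)])
print(log)
# ['A step 0', 'B step 0', 'A step 1', 'B step 1', 'B step 2']
```

Notice neither task has to finish before the other gets a turn, which is the essence of what an event loop provides.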

One particularity of asyncio's event loop is it expects to work with coroutines, but not just any Python coroutine (e.g. like the one in listing A-24), rather coroutines that use asyncio's syntax.

Using the same crawling theme as the prior examples, listing A-29 illustrates a Python coroutine that crawls a site using asyncio's syntax relying on asyncio's event loop, as well as an equivalent regular function that also crawls a site.

Listing A-29. Python coroutine using event loop with asyncio syntax vs. regular function

import asyncio
import urllib.request

def plaincrawler(site):
    if site is None:
        print("Must provide a site to crawl")
        return
    try:
        page = urllib.request.urlopen(site)
        page_size = len(page.read())
        print(f"Home page {site} size is {page_size}")
    except Exception as e:
        print(f"Can't crawl home page on {site}: {e}")


async def crawler(site):
    if site is None:
        print("Must provide a site to crawl")
        return
    try:
        page = urllib.request.urlopen(site)
        page_size = len(page.read())
        print(f"Home page {site} size is {page_size}")
    except Exception as e:
        print(f"Can't crawl home page on {site}: {e}")

>>> plaincrawler("https://google.com")
Home page https://google.com size is 13828

>>> crawler("https://google.com")
<coroutine object crawler at 0x7f253846be40>

>>> asyncio.run(crawler("https://google.com"))
Home page https://google.com size is 13828

>>> asyncio.run(plaincrawler("https://google.com"))
Home page https://google.com size is 13828
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.8/asyncio/runners.py", line 37, in run
    raise ValueError("a coroutine was expected, got {!r}".format(main))
ValueError: a coroutine was expected, got None

Listing A-29 begins by importing asyncio to gain access to asyncio's event loop, as well as importing urllib.request to be able to perform basic crawl operations. Next, you can see the plaincrawler and crawler functions both accept a site argument and perform the same logic: they validate the site argument is not None; crawl the provided site with urllib.request.urlopen; determine the site's page size and print said size.

Notice the only difference between the plaincrawler and crawler functions is the latter uses the async keyword. The async prefix on a function is what's used to denote an asyncio coroutine. Next, let's move on to the execution of the plaincrawler and crawler functions. You can see invoking plaincrawler("https://google.com") crawls and outputs google's home page size, this execution is pretty basic, since it's a synchronous call -- like any other plain function -- plus it's a single site crawl.

Calling crawler("https://google.com") is interesting because it outputs <coroutine object crawler at 0x7f253846be40>. The reason for this output is that calling an asyncio coroutine directly only creates a coroutine object, without running any of its logic. Drawing parallels, recall from listing A-24 that attempting to directly invoke a coroutine/generator outputs <generator object crawler at 0x7fecabb602e0>. An asyncio coroutine must be executed through the event loop, which in listing A-29 is done with the asyncio.run() function. Notice asyncio.run(crawler("https://google.com")) triggers the coroutine and also outputs google's home page size. This execution is more interesting because it's performed through Python's asyncio event loop. For the moment it might appear inconsequential because it only crawls a single site, but the next sections explore how to crawl multiple sites.

Finally, the last call asyncio.run(plaincrawler("https://google.com")) in listing A-29 shows that attempting to run a non-coroutine on the event loop triggers an error: a coroutine was expected, got None. The None in the message is the return value of plaincrawler, which Python evaluates as a plain function call before asyncio.run() ever receives anything. As mentioned a few paragraphs ago, this safeguard is so anything that's run through the event loop is explicitly marked/designed for it and regular functions aren't inadvertently put through it.
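The difference between a plain function and an asyncio coroutine can also be checked without any network access. The following is a minimal sketch, where plain_size and async_size are hypothetical stand-ins for plaincrawler and crawler, and the standard library's inspect module is used to tell both kinds of function apart:

```python
import asyncio
import inspect

def plain_size(text):
    # Regular function: the body runs as soon as it's called
    return len(text)

async def async_size(text):
    # Coroutine function: calling it only builds a coroutine object
    return len(text)

# Tell both kinds of function apart before handing them to the event loop
print(inspect.iscoroutinefunction(plain_size))   # False
print(inspect.iscoroutinefunction(async_size))   # True

pending = async_size("hello")
print(inspect.iscoroutine(pending))              # True, nothing has run yet
print(asyncio.run(pending))                      # 5

try:
    # plain_size() runs first and returns an int, which asyncio.run() rejects
    asyncio.run(plain_size("hello"))
except (ValueError, TypeError) as error:
    # ValueError on Python 3.8; newer releases may raise TypeError instead
    print(f"Rejected: {error}")
```

Both exception types are caught because the exact error raised for a non-coroutine argument is not guaranteed to be the same across Python releases.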

With a basic understanding of asyncio coroutines and the asyncio event loop, let's take a closer look at Python asyncio event loops.

A closer look at Python event loops: Ways to interact with an event loop, tasks/futures and multiple loops

All asyncio event loops must be managed. It might not have been obvious from the example in listing A-29, but the statement asyncio.run(crawler("https://google.com")) creates an event loop to run the crawler() coroutine and closes the event loop once it's done.

Although it's perfectly valid to use asyncio.run() statements as many times as needed, most Python applications use asyncio.run() once -- to kick off the main entry point of an application. Using asyncio.run() multiple times can also be inefficient, since each call creates and closes its own event loop. All of which takes us to the next example.
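To make the create/close cycle of asyncio.run() visible, here is a small sketch. It assumes a hypothetical helper coroutine named current_loop that returns the loop it's running on, obtained with asyncio.get_running_loop():

```python
import asyncio

async def current_loop():
    # Return the event loop that is running this coroutine
    return asyncio.get_running_loop()

first_loop = asyncio.run(current_loop())
second_loop = asyncio.run(current_loop())

print(first_loop is second_loop)   # False: each asyncio.run() call builds its own loop
print(first_loop.is_closed())      # True: asyncio.run() closed it on exit

async def main():
    # Single entry point: every coroutine awaited here shares one loop
    loop_a = await current_loop()
    loop_b = await current_loop()
    return loop_a is loop_b

print(asyncio.run(main()))         # True
```

The main() coroutine mirrors the common pattern of calling asyncio.run() once and doing all remaining work inside that single entry point.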

Listing A-30 illustrates how to reuse the same asyncio event loop to execute multiple coroutines, as well as the precautions you need to take when closing and working with multiple loops.

Listing A-30. Python coroutines added to event loop with asyncio, plus multiple & closed loop behavior

import asyncio
import urllib.request

async def crawler(site):
    if site is None:
        print("Must provide a site to crawl")
        return
    try:
        page = urllib.request.urlopen(site)
        page_size = len(page.read())
        print(f"Home page {site} size is {page_size}")
    except Exception as e:
        print(f"Can't crawl home page on {site}: {e}")

# Get event loop
>>> loop = asyncio.get_event_loop()

# Run coroutine directly on loop
>>> loop.run_until_complete(crawler("https://duckduckgo.com"))
Home page https://duckduckgo.com size is 5722

>>> loop.run_until_complete(crawler("https://google.com"))
Home page https://google.com size is 12929

# Add tasks to loop
>>> wired_crawl = loop.create_task(crawler("https://wired.com"))
>>> arstechnica_crawl = loop.create_task(crawler("https://arstechnica.com"))

# Check task status
>>> wired_crawl
<Task pending name='Task-3' coro=<crawler() running at <stdin>:1>>
>>> arstechnica_crawl
<Task pending name='Task-4' coro=<crawler() running at <stdin>:1>>

>>> loop.run_until_complete(crawler("https://twitter.com"))
Home page https://wired.com size is 789307
Home page https://arstechnica.com size is 92684
Home page https://twitter.com size is 68764

# Re-check task status
>>> wired_crawl
<Task finished name='Task-3' coro=<crawler() done, defined at <stdin>:1> result=None>
>>> arstechnica_crawl
<Task finished name='Task-4' coro=<crawler() done, defined at <stdin>:1> result=None>


# Close the loop 
>>> loop.close()


# Re-run coroutine directly on loop
>>> loop.run_until_complete(crawler("https://duckduckgo.com"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.8/asyncio/base_events.py", line 591, in run_until_complete
    self._check_closed()
  File "/usr/lib/python3.8/asyncio/base_events.py", line 508, in _check_closed
    raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed

# Try getting event loop again
>>> loop = asyncio.get_event_loop()

# Loop is still closed
>>> loop.run_until_complete(crawler("https://duckduckgo.com"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.8/asyncio/base_events.py", line 591, in run_until_complete
    self._check_closed()
  File "/usr/lib/python3.8/asyncio/base_events.py", line 508, in _check_closed
    raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed

# Create new loop
>>> new_loop = asyncio.new_event_loop()
>>> new_loop.run_until_complete(crawler("https://duckduckgo.com"))
Home page https://duckduckgo.com size is 5722


# Set new loop to global loop
>>> asyncio.set_event_loop(new_loop)

# Get event loop again to run coroutine
>>> loop = asyncio.get_event_loop()
>>> loop.run_until_complete(crawler("https://duckduckgo.com"))
Home page https://duckduckgo.com size is 5722

Listing A-30 imports and defines the same crawler() function as listing A-29, so the focus in listing A-30 is on the execution of said function. The first execution statement loop = asyncio.get_event_loop() gets the event loop for the current operating system (OS) thread. Because there isn't an event loop on the current OS thread yet, the asyncio.get_event_loop() call creates one, which is assigned to the loop reference. This implicit creation is the behavior of the Python 3.8 release shown in the tracebacks; newer Python releases deprecate it, in which case asyncio.new_event_loop() is the explicit alternative.

Next, with the loop reference to the event loop, two calls are made to run_until_complete() that accept coroutine arguments: loop.run_until_complete(crawler("https://duckduckgo.com")) & loop.run_until_complete(crawler("https://google.com")). Notice each of these calls executes the coroutine immediately just like asyncio.run() in listing A-29.

Another alternative to running things on the event loop is to explicitly create asyncio tasks. Tasks are a special asyncio object type designed to schedule coroutines to run on the event loop and be able to track their state. In addition, tasks are a special kind of a more generic asyncio object type called futures that are designed to represent the outcome of anything that's put into an event loop.

You can then see in listing A-30 two calls made to loop.create_task() that schedule coroutines on the event loop and assign the tasks to the wired_crawl and arstechnica_crawl references. Next, notice the output for both these task object references is in the form <Task pending name='Task-X' coro=<crawler() running at <stdin>:1>>, indicating both tasks are still pending: scheduling a task doesn't run it, because the event loop itself isn't running yet. Next, you can see the familiar call loop.run_until_complete(crawler("https://twitter.com")) that runs a coroutine on the loop. However, notice the output of this call is for three coroutines. This means that when a call is made to the event loop with run_until_complete(), the loop runs until the coroutine argument completes and, while it's running, it also executes the other tasks already scheduled on it. Confirming the execution of tasks in the event loop, you can see the contents of the wired_crawl and arstechnica_crawl references now output task objects in the form <Task finished name='Task-3' coro=<crawler() done, defined at <stdin>:1> result=None>, indicating both tasks are finished and returned a result of None.
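The pending-then-finished life cycle of a task can be reproduced offline. The sketch below is an illustration under assumptions: job is a hypothetical coroutine that, like crawler(), contains no await, and the loop is created explicitly with asyncio.new_event_loop() rather than obtained with asyncio.get_event_loop():

```python
import asyncio

async def job(name, log):
    # No await inside: like crawler(), the body runs in one uninterrupted step
    log.append(name)
    return name.upper()

def run_demo():
    log = []
    loop = asyncio.new_event_loop()   # explicit loop, independent of any global one
    try:
        first = loop.create_task(job("first", log))
        second = loop.create_task(job("second", log))
        states_before = (first.done(), second.done())
        # Running the loop for "third" also lets the two scheduled tasks run
        loop.run_until_complete(job("third", log))
        states_after = (first.done(), second.done())
        return log, states_before, states_after, first.result()
    finally:
        loop.close()

log, before, after, result = run_demo()
print(log)      # ['first', 'second', 'third']
print(before)   # (False, False)
print(after)    # (True, True)
print(result)   # FIRST
```

The done() and result() methods offer a programmatic alternative to reading the task's printed representation.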

Finally, the second half of listing A-30 illustrates the behavior of closing an event loop and having multiple event loops. The loop.close() statement closes a loop to any further activity, so if you try to execute a coroutine after it, you'll get the error RuntimeError: Event loop is closed as shown in listing A-30. You can also see that after an event loop is closed, using the same initial loop = asyncio.get_event_loop() statement doesn't work either, since this only creates a new event loop when there isn't one in the OS thread -- in this case there's one, it's just closed.

In order to create a new event loop you can use the asyncio.new_event_loop() function. In listing A-30, you can see the reference new_loop pointing to the new event loop is capable of executing new_loop.run_until_complete(crawler("https://duckduckgo.com")). At this point, you have a new event loop, but the main one in the OS thread is still closed. The statement asyncio.set_event_loop(new_loop) sets the new_loop as the main one in the OS thread, after which a call to asyncio.get_event_loop() works as expected, with the obtained reference being able to execute loop.run_until_complete(crawler("https://duckduckgo.com")).

As you can see from listing A-30, it's possible to interact with Python's asyncio event loop in a more granular way than with asyncio.run(), as well as schedule the execution of coroutines via tasks to track their status, in addition to working with multiple asyncio event loops. The only word of caution I would give you about this example is you should always think twice about creating a new loop with asyncio.new_event_loop() and closing one with loop.close(). In most cases one event loop, the main event loop, should be sufficient, although there are cases where multiple event loops are helpful. If in doubt, it's better to use asyncio.run(), which creates/closes the event loop for you, because running multiple event loops in the same application can lead to problems that are difficult to debug.
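If you do need to manage a loop by hand, wrapping its use in try/finally keeps the closing step from being forgotten. This is a minimal sketch rather than a recommended recipe; ping, closed_loop_error and replace_loop are hypothetical names used only for illustration:

```python
import asyncio

async def ping():
    return "pong"

def closed_loop_error():
    # Reproduce the "Event loop is closed" error without touching the network
    loop = asyncio.new_event_loop()
    loop.close()
    coroutine = ping()
    try:
        loop.run_until_complete(coroutine)
    except RuntimeError as error:
        return str(error)
    finally:
        coroutine.close()   # avoid a "never awaited" warning for the unused coroutine

def replace_loop():
    # Create a fresh loop, make it the current one for this thread, then clean up
    replacement = asyncio.new_event_loop()
    asyncio.set_event_loop(replacement)
    try:
        return replacement.run_until_complete(ping())
    finally:
        replacement.close()
        asyncio.set_event_loop(None)

print(closed_loop_error())   # Event loop is closed
print(replace_loop())        # pong
```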

Deeper into Python event loops: Using `await`, the perils of blocking the event loop and using an executor to run blocking calls.

The prior sections served as a good foundation to understand how asyncio event loops work and how they operate with coroutines prefixed with async. However, there are still a couple of important topics to explore. Because asyncio event loops have the safeguard to only accept functions prefixed with async, it can lead to a false sense of security that nothing can go wrong in an event loop. In fact, the same thing that went wrong in some of the first examples in this asynchronous appendix -- that they block the entire workflow -- can also happen in an event loop.

Listing A-31 shows an asyncio based bot designed to crawl multiple sites, similar to the ones presented in listing A-25, listing A-26, listing A-27 & listing A-28.

Listing A-31. Python coroutines with `await` and blocking behavior

import asyncio
import urllib.request

SITES = ["https://google.com/","https://duckduckgo.com/",
	 "https://amazon.com/","https://overstock.com/",
	 "https://www.54356456456456.com",
	 "https://nytimes.com","https://ft.com/",
	 "https://wired.com","https://arstechnica.com/",
	 "https://abfdgdfsegfdgfdfsd.com",
	 "https://twitter.com","https://facebook.com/"]

async def crawler(site):
    if site is None:
        print("Must provide a site to crawl")
        return
    try:
        page = urllib.request.urlopen(site)
        page_size = len(page.read())
        print(f"Home page {site} size is {page_size}")
    except Exception as e:
        print(f"Can't crawl home page on {site}: {e}")

async def multibot(sites):
    tasks = [asyncio.create_task(crawler(site)) for site in sites]
    for coroutine in asyncio.as_completed(tasks):
        await coroutine


>>> asyncio.run(multibot(SITES))
Home page https://google.com/ size is 13828
Home page https://duckduckgo.com/ size is 5722
Can't crawl home page on https://amazon.com/: HTTP Error 503: Service Unavailable
<<BLOCKS UNTIL TIMEOUT IS REACHED OR overstock.com RESPONDS >>
...
...
Can't crawl home page on https://overstock.com/: The read operation timed out
Can't crawl home page on https://www.54356456456456.com: 
Home page https://nytimes.com size is 1514405
Home page https://ft.com/ size is 190189
Home page https://wired.com size is 768291
Home page https://arstechnica.com/ size is 90577
Can't crawl home page on https://abfdgdfsegfdgfdfsd.com: 
Home page https://twitter.com size is 64042
Home page https://facebook.com/ size is 197103

Listing A-31 simply triggers a call with asyncio.run() to the multibot coroutine that takes a single argument, which in this case is the SITES list defined at the top. Notice the multibot function is prefixed with async to make it a viable coroutine that's accepted by the event loop. The multibot coroutine loops over each value in the sites list, using asyncio.create_task() -- the module-level counterpart of the loop.create_task() call presented in listing A-30 -- to create a list of tasks that wrap crawler(site) and schedule them to run on the event loop. Next, an iteration is made over asyncio.as_completed(tasks), which hands back awaitables in the order the tasks finish, and each iteration applies await to the awaitable it receives. The await keyword is one of the most important aspects of this example.

The await keyword is asyncio coroutine syntax that works like Python's yield keyword in classical coroutines & generators. If you remember from prior examples, the yield keyword works as a "wait to be called again & don't mind me anymore, move along to other things". In the case of listing A-31, the await coroutine statement works as in "run the coroutine, but move along to other things so it doesn't hold up the event loop", which allows multiple coroutines to run in the event loop. An important thing to note about await is it can only appear inside a function prefixed with async and can only be applied to awaitable objects: asyncio tasks like it's done in listing A-31, asyncio coroutines (i.e. those prefixed with async, e.g. await crawler(site)) or asyncio futures, since they're deemed to operate with non-blocking behavior.
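The interplay between create_task(), as_completed() and await is easier to observe when every coroutine truly gives up control. The sketch below assumes a hypothetical fake_crawl coroutine in which asyncio.sleep() stands in for a non-blocking network call; the site names and delays are made up to make the ordering visible:

```python
import asyncio

async def fake_crawl(site, delay):
    # asyncio.sleep() stands in for a non-blocking network call
    await asyncio.sleep(delay)
    return site

async def crawl_in_completion_order():
    # Hypothetical sites and delays, chosen only to make the ordering visible
    delays = {"slow.example": 0.3, "fast.example": 0.1, "medium.example": 0.2}
    tasks = [asyncio.create_task(fake_crawl(site, delay))
             for site, delay in delays.items()]
    finished = []
    for next_done in asyncio.as_completed(tasks):
        # Each await suspends this coroutine until the next task finishes
        finished.append(await next_done)
    return finished

print(asyncio.run(crawl_in_completion_order()))
# ['fast.example', 'medium.example', 'slow.example']
```

The results come back in completion order rather than in the order the tasks were created, which is what as_completed() is for.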

Next, if you run asyncio.run(multibot(SITES)) you'll notice the execution of the event loop blocks! It's that pesky https://overstock.com/ site in the list that takes a long time to respond that's blocking the workflow like previous examples. So what's going on ? Aren't the async and await keywords supposed to be a safeguard against this behavior ? Everything is working as expected. Unfortunately, there are no safeguards against introducing blocking logic in asyncio coroutines.

The root problem in listing A-31 is the urllib.request.urlopen(site) line in async def crawler(site). When the workflow reaches this line with a site like https://overstock.com/, it triggers a network call to scrape the site, but it waits...and waits...and waits...until the site responds. So the root problem is in the Python urllib library that's synchronous in nature and not designed to give up control so the event loop can continue its work.

So what's the solution to this event loop blocking behavior ? One alternative is to use a separate thread or process to delegate this synchronous/blocking logic so the event loop can continue uninterrupted, while another alternative is to use an asyncio compatible network library (i.e. that uses async/await internally) so the event loop works without interruption. Listing A-32 illustrates an example with a separate thread/process to delegate synchronous/blocking logic outside of the event loop, while the next section concludes by describing an example that uses an asyncio compatible network library.
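The cost of a synchronous call inside a coroutine can be measured without depending on a slow site. In the following sketch, time.sleep() plays the role of the blocking urlopen() call and asyncio.sleep() plays the role of a call that gives up control; blocking_wait, cooperative_wait and run_three are hypothetical names:

```python
import asyncio
import time

async def blocking_wait(delay):
    time.sleep(delay)           # synchronous: the event loop is stuck here

async def cooperative_wait(delay):
    await asyncio.sleep(delay)  # hands control back to the event loop

async def run_three(worker, delay=0.2):
    # Schedule three copies of the worker and time how long they take together
    start = time.perf_counter()
    tasks = [asyncio.create_task(worker(delay)) for _ in range(3)]
    for finished in asyncio.as_completed(tasks):
        await finished
    return time.perf_counter() - start

blocking_elapsed = asyncio.run(run_three(blocking_wait))
cooperative_elapsed = asyncio.run(run_three(cooperative_wait))
print(f"blocking:    {blocking_elapsed:.2f}s")     # roughly 3 x 0.2s, one after another
print(f"cooperative: {cooperative_elapsed:.2f}s")  # roughly 0.2s, all three overlap
```

Exact timings vary by machine, but the blocking version should take about three times as long because each task holds the event loop for its full delay.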

Listing A-32. Python coroutines with `await` and `run_in_executor` to avoid blocking behavior

import asyncio
import concurrent.futures
import functools
import urllib.request

SITES = ["https://google.com/","https://duckduckgo.com/",
	 "https://amazon.com/","https://overstock.com/",
	 "https://www.54356456456456.com",
	 "https://nytimes.com","https://ft.com/",
	 "https://wired.com","https://arstechnica.com/",
	 "https://abfdgdfsegfdgfdfsd.com",
	 "https://twitter.com","https://facebook.com/"]

def synchronous_scrape(site):
    try:
        print(f"Starting scrape for site {site}")
        page = urllib.request.urlopen(site)
        print(f"Finished scrape for site {site}")
        page_size = len(page.read())
        return page_size
    except Exception as e:
        print(f"Can't crawl home page on {site}: {e}")

async def crawler(site):
    if site is None:
        print("Must provide a site to crawl")
        return
    loop = asyncio.get_event_loop()
    #with concurrent.futures.ProcessPoolExecutor(max_workers=5) as executor:
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        result = await loop.run_in_executor(executor, functools.partial(synchronous_scrape, site))
        if result:
            print(f"Home page {site} size is {result}")
        else:
            print(f"Home page {site} size can't be determined")

async def multibot(sites):
    tasks = [asyncio.create_task(crawler(site)) for site in sites]
    for coroutine in asyncio.as_completed(tasks):
        await coroutine
    

>>> asyncio.run(multibot(SITES))
Starting scrape for site https://google.com/
Starting scrape for site https://duckduckgo.com/
...
...
Can't crawl home page on https://abfdgdfsegfdgfdfsd.com: 
Home page https://www.54356456456456.com size can't be determined
Starting scrape for site https://facebook.com/
Home page https://google.com/ size is 12882
Home page https://abfdgdfsegfdgfdfsd.com size can't be determined
...
...
Home page https://arstechnica.com/ size is 93050
<<BLOCKS UNTIL TIMEOUT IS REACHED OR overstock.com RESPONDS >>
...
...

Listing A-32 makes use of two new imports: concurrent.futures and functools. It's worth pointing out the use of concurrent.futures in this example is identical to the one used in listing A-28 that uses thread/process pools to limit blocking calls. The functools import provides functools.partial, which bundles a function together with its arguments into a single callable that can be handed to an asyncio method. Strictly speaking, run_in_executor() also accepts positional arguments after the function, so functools.partial only becomes mandatory when keyword arguments are involved.

Since the issue in listing A-31 is the urllib.request.urlopen(site) line in async def crawler(site), in listing A-32 this logic is spun out into its own function called synchronous_scrape(site). You can see synchronous_scrape(site) performs similar logic, attempting to scrape the site's page and returning its size, as well as printing an error message -- and implicitly returning None -- in case a site can't be crawled.

The other change in listing A-32 occurs where the urllib.request.urlopen(site) call was made in the async def crawler(site) function. The first change is getting hold of the asyncio event loop with loop = asyncio.get_event_loop(). Once this is done, you can opt to create a thread pool or process pool to delegate the blocking function (i.e. synchronous_scrape), with either with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor or with concurrent.futures.ProcessPoolExecutor(max_workers=5) as executor, respectively. This last logic is identical to the one used in listing A-28, so consult that example for additional details on picking one or the other.

Next, in the context of the thread/process pool, a call is made to the event loop's run_in_executor() method. In loop.run_in_executor(executor, functools.partial(synchronous_scrape, site)), the first argument executor represents the thread/process pool to delegate a task to, while the second argument uses functools.partial to define the function to delegate as a task -- synchronous_scrape -- followed by the arguments to pass to it. More importantly, notice that loop.run_in_executor is prefixed by await. This means that while the scraping work is being performed by the thread/process pool, the crawler(site) coroutine is suspended and the event loop is free to run other coroutines. Once the thread/process for a given site is finished, the coroutine resumes and prints the final output.

When you run the example in listing A-32, you'll notice that all the sites in the SITES list get scraped, with the https://overstock.com/ site dropping toward the end of the workflow because it takes the longest, at which point you can decide -- just like it's done with regular runaway threads or processes -- to kill or take other remedial actions.
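The two ways of passing arguments to run_in_executor() can be compared side by side. The sketch below is an assumption-laden variation, not the code of listing A-32: slow_size is a hypothetical blocking function where time.sleep() stands in for urlopen(), a single thread pool is shared by all calls, and asyncio.gather() is used to collect the results in submission order:

```python
import asyncio
import concurrent.futures
import functools
import time

def slow_size(text, pause=0.2):
    # Stand-in for a blocking call such as urllib.request.urlopen()
    time.sleep(pause)
    return len(text)

async def sizes(texts):
    loop = asyncio.get_running_loop()
    # One pool shared by every call (an assumption of this sketch)
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        # Positional arguments can follow the function directly
        positional = [loop.run_in_executor(executor, slow_size, text)
                      for text in texts]
        # Keyword arguments have to be bundled with functools.partial
        keyword = loop.run_in_executor(
            executor, functools.partial(slow_size, "partial", pause=0.1))
        # await suspends this coroutine while the worker threads do the blocking work
        return await asyncio.gather(*positional, keyword)

print(asyncio.run(sizes(["django", "asyncio"])))   # [6, 7, 7]
```

Sharing one pool caps the total number of worker threads at max_workers, whereas creating a pool inside each coroutine caps only the threads per coroutine.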

As you can see in listing A-32, moving tasks that have the potential to run synchronously or block into separate threads/processes is a viable solution to avoid blocking Python's event loop.

Python event loop nirvana: Using all `asyncio` compatible calls and libraries.

The final stop in this Python asynchronous exploration is the ideal asyncio solution. The biggest problem you'll face by far when trying to implement this kind of solution, is the lack of Python libraries that use asyncio compatible calls, or said another way, the excess of Python libraries that continue to use plain blocking/synchronous logic.

Recapping from the past two examples, the crux of the matter lies in the urllib.request.urlopen(site) call. How could you make this built-in Python library work natively with asyncio ? The short answer is you would need to re-write it, and re-writing a core library like urllib would be no small task. A quicker solution is to search for a Python library that performs the same functionality and is designed to work natively with asyncio's event loop. It turns out there is such a library and it's called aiohttp^[13].

Listing A-33 illustrates another bot designed to make use of the aiohttp library to achieve the ideal solution of all asynchronous calls using Python's event loop.

Listing A-33. Python coroutines with `await` and `aiohttp` network calls with asynchronous behavior

import aiohttp
import asyncio

SITES = ["https://google.com/","https://duckduckgo.com/",
	 "https://amazon.com/","https://overstock.com/",
	 "https://www.54356456456456.com",
	 "https://nytimes.com","https://ft.com/",
	 "https://wired.com","https://arstechnica.com/",
	 "https://abfdgdfsegfdgfdfsd.com",
	 "https://twitter.com","https://facebook.com/"]

async def crawler(site):
    if site is None:
        print("Must provide a site to crawl")
        return
    try:
        async with aiohttp.ClientSession() as session:
            async with session.get(site) as page:
                page_content = await page.read()
                page_size = len(page_content)
                print(f"Home page {site} size is {page_size}")
    except Exception as e:
        print(f"Can't crawl home page on {site}: {e}")

async def multibot(sites):
    tasks = [asyncio.create_task(crawler(site)) for site in sites]
    for coroutine in asyncio.as_completed(tasks):
        await coroutine
    

>>> asyncio.run(multibot(SITES))
Can't crawl home page on https://www.54356456456456.com: Cannot connect to host www.54356456456456.com:443 ssl:default [Name or service not known]
Can't crawl home page on https://abfdgdfsegfdgfdfsd.com: Cannot connect to host abfdgdfsegfdgfdfsd.com:443 ssl:default [Name or service not known]
Home page https://google.com/ size is 12900
Home page https://duckduckgo.com/ size is 5722
Home page https://twitter.com size is 68901
Home page https://arstechnica.com/ size is 93050
Home page https://facebook.com/ size is 210935
Home page https://wired.com size is 783104
Home page https://nytimes.com size is 904106
Home page https://ft.com/ size is 189917
Home page https://amazon.com/ size is 493348

Listing A-33 more closely resembles the example in listing A-31, since it doesn't create a new function to execute logic like listing A-32. Notice the try/except block logic in listing A-33 is changed in favor of multiple aiohttp statements vs. the original urllib.request.urlopen(site) call. By design of the aiohttp library, it's first necessary to get a client session to initiate the scraping process -- note the async prefix -- followed by a get operation on a page through the session -- also note the async prefix -- after which it's possible to read the contents of the page -- notice the await page.read() syntax -- to finally print the page_size like it's done in previous examples.

Finally, notice the execution of asyncio.run(multibot(SITES)) in listing A-33 drops the https://overstock.com/ site toward the end of the workflow because it takes the longest to get processed. Another particularity of the output in listing A-33 is that sites that can't be resolved (i.e. that don't exist) are output first even though they're not put first into the event loop. The reason for this output order is the asynchronous nature of the aiohttp library: since attempting to get a non-existent site fails fastest, those are the first tasks to be marked as done in the event loop. The remaining output order represents the quickest to slowest crawl times by the aiohttp library.
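Once every call in a coroutine is non-blocking, a slow site can also be bounded with a timeout instead of being waited on indefinitely. The following sketch shows one possible remedial action using asyncio.wait_for(); it doesn't use aiohttp, and fake_crawl with its asyncio.sleep() delay is a hypothetical stand-in for a real network call:

```python
import asyncio

async def fake_crawl(site, delay):
    # asyncio.sleep() stands in for a non-blocking network call
    await asyncio.sleep(delay)
    return f"{site} crawled"

async def crawl_with_timeout(site, delay, timeout):
    try:
        # wait_for() cancels the inner coroutine if it exceeds the timeout
        return await asyncio.wait_for(fake_crawl(site, delay), timeout=timeout)
    except asyncio.TimeoutError:
        return f"{site} timed out"

async def main():
    return await asyncio.gather(
        crawl_with_timeout("fast.example", 0.05, timeout=0.5),
        crawl_with_timeout("slow.example", 5, timeout=0.2),
    )

print(asyncio.run(main()))
# ['fast.example crawled', 'slow.example timed out']
```

This kind of cancellation works because the coroutine gives up control at its await points; a blocking call delegated to a thread, as in listing A-32, keeps running in its thread even if the coroutine waiting on it times out.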

What does asynchronous Python mean for Django ?

As you've seen in this Python asynchronous appendix, there are a lot of topics and techniques you need to grasp in order to properly work with Python asynchronous code. Given the main topic of this book is Django, it raises the question, what does all this mean for Django development ?

Even though asynchronous Django is a reality and the main selling point for Django 3, you don't have to go all in or necessarily care for it. Some of the reasons for being cautious are exemplified in the last sections of this appendix, through the care and work that's needed to truly achieve asynchronous behavior in an event loop across a pair of simple functions. If you extrapolate this to even a small Django application, it can take an extraordinary amount of work to reach a high level of asynchronous compliance vs. sticking with the tried and true approach of using classical synchronous Python in Django applications.

Is it worth it to design a Django asynchronous application ? It depends on an application's purpose, does it make heavy use of read/write operations ? Does it require the utmost performance ? Only you know the answer to these and other questions. Now, will a Django asynchronous application outperform a classic non-asynchronous Django application ? Most likely, yes; if it's well designed, most definitely; if it's badly designed, it may perform worse -- see the blocking event loop examples earlier in this appendix.

It's also necessary to set expectations and be realistic about what can be achieved with Django asynchronously. Django now supports running on ASGI servers^[14] vs. classical WSGI servers, in addition to supporting Django asynchronous views^[15], but other than this, the rest is a lot of work in progress. In addition, how many library dependencies do you estimate your Django project can have ? dozens ? hundreds ? thousands ? Will you be able to find asynchronous substitutes for all of these libraries ? Like you found aiohttp to substitute urllib.request ? Some Python libraries might take years to become asynchronous compliant or never be created at all -- similar to how a lot of Python 2 to Python 3 library migrations panned out.

So approach Django asynchronous initiatives with your eyes wide open: they will take extra work, which may or may not be warranted depending on your project's requirements.