Python, like most programming languages, has certain behaviors that can confuse anyone who is new to the language. This appendix contains an overview of the Python features that are most important to understand for anyone who wants to create Django applications and who is already familiar with another programming language (e.g. Ruby, PHP).

In this appendix you'll learn about: Python strings, Unicode and bytes; Python methods and how to use them with default, optional, *args and **kwargs arguments; Python classes and subclasses; Python loops, iterators and generators; Python list comprehensions, generator expressions, maps and filters; how to use the Python lambda keyword for anonymous methods; and asynchronous Python constructs.

Strings and unicode

Working with text is so common in web applications that you may eventually be caught off guard by some of the ways Python interprets it. First off, Python 3 uses Unicode[1], which means Python 3 is capable of interpreting practically every character from most world languages (e.g. English, Spanish, Japanese, Arabic, Hebrew). This is in stark contrast to Python 2, which also supports Unicode but defaults to ASCII -- which is limited to representing 128 characters -- and hence can require a lot more workarounds to work with different kinds of characters -- see the older version of this page for additional details on Python 2.

Listing A-1 illustrates a series of literal string statements run in Python 3 to showcase this Python text behavior.

Listing A-1. Python 3 literal strings

Python 3.7.4 (default, Aug 14 2019, 21:56:36) 
[GCC 7.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getdefaultencoding()
'utf-8'
>>> 'café & pâtisserie'
'café & pâtisserie'
>>> "コーヒー と パティスリー"
'コーヒー と パティスリー'
>>> '''café y confitería'''
'café y confitería'

As you can see in listing A-1, Python's default encoding is UTF-8, where UTF stands for 'Unicode Transformation Format' and the '8' indicates that values are stored in 8-bit units (bytes). Because all Python 3.x versions use UTF-8 by default, working with text is much simpler. Notice in listing A-1 there's no need to worry about how special characters are interpreted (e.g. non-ASCII characters like é and â just work).

Note Unicode and UTF-8 are different things, although the terms are often used interchangeably. Unicode defines how to represent characters as code points (e.g. c=U+0063, o=U+006F, &=U+0026), ensuring computer characters -- including symbols, emojis and other visualizations -- have unique code points. How these Unicode code points are represented/stored on a computer is left to an encoding system, of which UTF-8 is one of many, with other encoding systems being: ASCII, ISO-8859-1, Windows-1250, UTF-16 and UTF-32, among others.

UTF-8 is a variable-width encoding scheme that uses from one to four 8-bit bytes per code point, which in turn allows UTF-8 to support the entire set of Unicode code points. One-byte sequences map the first 128 Unicode code points, which coincide with the ASCII character set[2]; two-byte sequences map the next 1,920 Unicode code points, which include most special characters in western languages (e.g. letters with accents or symbols with special meaning); three-byte sequences map the next 63,488 code point values, which include most common symbols used in Chinese, Japanese and Korean (CJK) languages; four-byte sequences map the remaining Unicode code points, which correspond to less common CJK language symbols, historic script symbols, mathematical symbols, and emojis. It's worth pointing out that the four-byte bit pattern has room for 2,031,616 values, even though the Unicode code space is capped at 1,114,112 code points (U+0000 through U+10FFFF), so four-byte sequences are only used for the 1,048,576 code points from U+10000 through U+10FFFF.
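The variable-width behavior can be checked directly by encoding one character from each range and counting the resulting bytes. The following is a minimal sketch; the sample characters are chosen here purely for illustration.

```python
# Count how many bytes UTF-8 needs for characters from different Unicode ranges.
# The sample characters are illustrative: ASCII, accented Latin, Japanese, emoji.
samples = ["c", "é", "コ", "😀"]

utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, byte_count in utf8_lengths.items():
    # ord() gives the code point; :04X formats it as uppercase hexadecimal
    print(f"U+{ord(ch):04X} {ch!r} -> {byte_count} byte(s)")
```

Each character is a single Python string element regardless of how many bytes its UTF-8 encoding occupies.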

Turning our attention back to listing A-1, notice there are various ways to declare text as Python string literals. It's possible to use single quotes ', double quotes ", as well as triple single/double quotes ''' or """ to enclose text, with the triple single/double quotes options used to define text spanning multiple lines.
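As a short sketch of these quoting styles, the snippet below declares the same text with each delimiter and a multi-line value with triple quotes. The variable names are hypothetical.

```python
# The same text declared with each quoting style yields equal strings.
single = 'café'
double = "café"
triple = '''café'''
all_equal = single == double == triple

# Triple quotes keep the line break that appears inside the literal.
multi_line = """first line
second line"""
line_count = len(multi_line.splitlines())

print(all_equal, line_count)
```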

Next, let's move on to explore the use of Python's escape character and strings. In Python, the backslash \ character is Python's escape character, which is used to escape the special meaning of a character and declare it as a literal value.

For example, to use an apostrophe in a string delimited by single quotes, you need to escape the apostrophe so Python doesn't confuse where the string ends (e.g. 'This is Python\'s "syntax"'). A more particular case for using Python's backslash is with those special characters that use a backslash themselves. Listing A-2 illustrates various strings that use characters composed of a backslash so you can see this behavior.

Listing A-2. Python backslash escape character and raw strings

>>> print("In Python this is a tab \t and a line feed is \n")
In Python this is a tab          and a line feed is

>>> print("In Python this is a tab \\t and a line feed is \\n")
In Python this is a tab \t and a line feed is \n
>>> print(r"In Python this is a tab \t and a line feed is \n")
In Python this is a tab \t and a line feed is \n

In the first example in listing A-2 you can see the \t sequence is converted to a tab space and the \n sequence to a line feed (i.e. new line). This is how a tab and a line feed are written in Python source code: a tab as a backslash followed by the letter t, and a line feed as a backslash followed by the letter n. As you can see in the second example in listing A-2, in order for Python to output the literal value \t or \n you need to add another backslash -- which is after all Python's escape character.

The third example in listing A-2 is the same string as the previous ones, but it's preceded by r to make it a Python raw string. Notice that even though the special characters \t and \n are not escaped, the output is like the second example with escaped characters.

This is what's special about Python raw strings. By preceding a string with r, you tell Python to interpret backslashes literally, so there's no need to add another backslash like the second example in listing A-2.
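One way to confirm what each form actually stores is to compare string lengths. This is a small sketch with hypothetical variable names.

```python
# Compare what each spelling of "backslash t" actually stores.
escaped = "\t"    # one character: a tab
doubled = "\\t"   # two characters: a backslash followed by t
raw = r"\t"       # two characters: the r prefix keeps the backslash literal

print(len(escaped), len(doubled), len(raw))
print(doubled == raw)
```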

Python raw strings can be particularly helpful when manipulating strings with a lot of backslashes. And one particular case of strings that rely a lot on backslashes are regular expressions. Regular expressions are a facility in almost all programming languages to find, match or compare strings to patterns, which makes them useful in a wide array of situations.

The crux of using Python and regular expressions together is that they both give special meaning to backslashes, a problem that even the Python documentation calls The Backslash Plague[3]. Listing A-3 illustrates this concept of the backslash plague and raw strings in the context of Python regular expressions.

Listing A-3. Python backslash plague and raw strings with regular expressions

>>> import re
# Attempt to match literal '\n', (equal statement: re.match("\\n","\\n") )
>>> re.match("\\n",r"\n")  
# Attempt to match literal '\n', (equal statement: re.match("\\\\n","\\n") )
>>> re.match("\\\\n",r"\n")
<re.Match object; span=(0, 2), match='\\n'>
# Attempt to match literal '\n', (equal statement: re.match(r"\\n","\\n") )
>>> re.match(r"\\n",r"\n")
<re.Match object; span=(0, 2), match='\\n'>

In listing A-3, we're trying to find a regular expression to match a literal \n -- in Python syntax this would be r"\n" or "\\n". Since regular expressions also use \ as their escape character, the first logical attempt at a matching regular expression is "\\n", but notice this first attempt in listing A-3 fails.

Because the regular expression is itself written as a Python string, each of the two backslashes the regular expression needs to match one literal backslash must in turn be escaped for Python, bringing the total to four backslashes! As you can see in listing A-3, the regular expression that matches a literal \n is the second attempt "\\\\n".

As you can see in this example, dealing with backslashes in Python and in the context of regular expression can lead to very confusing syntax. To simplify this, the recommended approach to define regular expressions in Python is to use raw strings so backslashes are interpreted literally. In the last example in listing A-3, you can see the regular expression r"\\n" matches a literal \n and is equivalent to the more confusing regular expression "\\\\n".
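The same benefit applies to everyday patterns, not just literal backslashes. The sketch below compares a pattern written with and without a raw string; the sample text and the pattern are made up for illustration.

```python
import re

# The same regular expression written as a regular string and as a raw string.
plain_pattern = "\\d+\\.\\d+"
raw_pattern = r"\d+\.\d+"
same_pattern = plain_pattern == raw_pattern

# Hypothetical sample text: find the first "number.number" sequence.
match = re.search(raw_pattern, "Django 4.2 released")
version = match.group(0) if match else None

print(same_pattern, version)
```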

So far you've learned how Python 3 uses UTF-8 to support all Unicode characters through string literals, as well as how string literals prefixed with r represent raw strings that make it easier to work with strings that use backslashes to escape characters. Another string representation you can encounter in Python 3 is one prefixed by the letter b, which is called a bytes literal.

To the naked eye a Python bytes literal can look like regular text, but it's in fact a sequence of bytes that represents a Python bytes object. The purpose of a Python bytes object is to manage binary data, which is how computers natively work with data (e.g. when it's transmitted over a network or stored in files). But unlike a Python string literal or raw string, a Python bytes literal has two important differences: accessing an element of a bytes literal returns an integer (the value of that byte) instead of a character, and a bytes literal can only contain ASCII characters directly.

Listing A-4 illustrates some of the core behaviors for a bytes literal in Python 3.

Listing A-4. Python bytes literal behavior

>>> string_literal = 'coffee & pastries'
>>> bytes_literal = b'coffee & pastries'
>>> string_literal[0]
'c'
>>> string_literal[1]
'o'
>>> string_literal[7]
'&'
>>> bytes_literal[0] 
99
>>> bytes_literal[1]
111
>>> bytes_literal[7]
38
>>> chr(99)
'c'
>>> chr(111)
'o'
>>> chr(38) 
'&'
>>> hex(99)
'0x63'
>>> hex(111)
'0x6f'
>>> hex(38)
'0x26'
>>> decoded_bytes = bytes_literal.decode()
>>> decoded_bytes[0]
'c'
>>> decoded_bytes[1]
'o'
>>> decoded_bytes[7]
'&'
>>> decoded_bytes == string_literal
True

Listing A-4 begins by declaring two references with the text 'coffee & pastries', one as a string literal and the other as a bytes literal -- note the lack of special characters (i.e. non-ASCII) is on purpose and is addressed in the next listing. Although both text statements in listing A-4 are visually identical, let's explore the first behavioral difference when attempting to access parts of a string literal and a bytes literal.

When accessing index 0 in string_literal the output is c, whereas accessing indexes 1 and 7 in string_literal outputs o and &, respectively. Next, notice the output when accessing the same positions in bytes_literal: the output for indexes 0, 1 and 7 in bytes_literal is 99, 111 and 38, respectively.

Recall that a bytes literal represents a sequence of bytes. In this case, index 0 in bytes_literal outputs 99 because 99 is the ASCII decimal code point for the letter c. Similarly, 111 is the ASCII decimal code point for the letter o and 38 is the ASCII decimal code point for the & symbol. These equivalencies are confirmed in listing A-4 with Python's built-in chr function, which returns the character (string) for a given Unicode code point (integer) (e.g. chr(99) returns c, chr(111) returns o and chr(38) returns &) -- recall from earlier that ASCII code points map directly to the first 128 Unicode code points.
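The relationship between characters and byte values also works in the other direction. The following sketch uses Python's built-in ord function and the bytes constructor, and notes that slicing a bytes object behaves differently from indexing it.

```python
data = b"coffee & pastries"

# Indexing a bytes object returns an integer; slicing returns another bytes object.
first_value = data[0]      # integer value of the first byte
first_slice = data[0:1]    # one-byte bytes object

# ord() is the inverse of chr(): character -> code point.
code_point = ord("c")

# A bytes object can be rebuilt from a list of integer values.
rebuilt = bytes([99, 111, 38])
leading_values = list(data[:3])

print(first_value, first_slice, code_point, rebuilt, leading_values)
```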

If you happen to review a Unicode code point table, you'll notice the Unicode standard uses hexadecimal numbering for its code points. Hexadecimal numbering is simply another way of expressing the same code point using base 16, which means hexadecimal digits run through 0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F, whereas decimal digits run through 0,1,2,3,4,5,6,7,8,9. In other words, counting up from zero, the sixteenth value is F in hexadecimal and 15 in decimal; the seventeenth value is 10 in hexadecimal and 16 in decimal; the eighteenth value is 11 in hexadecimal and 17 in decimal, and so on.

The next section in listing A-4 makes use of Python's built-in hex function, which returns the hexadecimal representation (string) for a given decimal (integer) value. In this case, you can see the decimal code point 99 is equivalent to the 0x63 hexadecimal code point or c character; the decimal code point 111 is equivalent to the 0x6f hexadecimal code point or o character; and the decimal code point 38 is equivalent to the 0x26 hexadecimal code point or & character. Python uses the 0x prefix to indicate a hexadecimal representation, whereas Unicode makes use of the U+ prefix -- padded with 0's to at least four digits -- to define code points (e.g. c=U+0063, o=U+006F, &=U+0026).
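To tie the decimal, hexadecimal and U+ notations together, here is a small sketch. The unicode_notation helper is hypothetical and written only for this example.

```python
def unicode_notation(ch):
    """Return the U+XXXX notation for a single character (hypothetical helper)."""
    # :04X formats the code point as uppercase hexadecimal, zero-padded to 4 digits
    return f"U+{ord(ch):04X}"

notations = [unicode_notation(ch) for ch in "co&"]

# int() with base 16 converts a hexadecimal string back to a decimal integer.
from_hex = int("63", 16)
from_prefixed_hex = int("0x6f", 16)

# Hexadecimal integer literals are ordinary integers.
literal_matches = 0x26 == 38

print(notations, from_hex, from_prefixed_hex, literal_matches)
```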

Finally, the last snippets in listing A-4 illustrate how it's possible to quickly convert the contents of a bytes literal to a string literal. Since a bytes literal is a Python bytes object, it automatically comes equipped with several utility methods[4] to facilitate working with this kind of binary data. In listing A-4 you can see the Python bytes object decode() method is used on the bytes_literal reference to create the new decoded_bytes reference. The decode() method decodes the bytes/binary data in a Python bytes object into a string, assuming the bytes are UTF-8 encoded by default. The behavior of the decode() method is confirmed by outputting indexes 0, 1 and 7 of the newly created decoded_bytes reference and observing a character (string) in each case, as well as by the final equivalency test that confirms the new decoded_bytes reference is equal to the original string_literal reference.
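The decode() method has a counterpart on strings, encode(), which goes in the opposite direction. The following is a minimal round-trip sketch with hypothetical variable names.

```python
text = "coffee & pastries"

# str.encode() turns a string into bytes; UTF-8 is the default encoding.
encoded = text.encode()

# bytes.decode() reverses the operation; UTF-8 is also the default here.
decoded = encoded.decode()

# The built-in bytes() class with an explicit encoding gives the same result.
same_as_constructor = bytes(text, "utf-8") == encoded

print(type(encoded).__name__, type(decoded).__name__, decoded == text)
```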

Listing A-5 illustrates the behaviors for a bytes literal with Unicode.

Listing A-5. Python bytes literal behavior with Unicode

>>> b'café & pâtisserie'
  File "<stdin>", line 1
SyntaxError: bytes can only contain ASCII literal characters.
>>> bytes('café & pâtisserie','utf-8')
b'caf\xc3\xa9 & p\xc3\xa2tisserie'
>>> bytes('café & pâtisserie','latin-1')
b'caf\xe9 & p\xe2tisserie'
>>> bytes('café & pâtisserie','ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 3: ordinal not in range(128)
>>> bytes_literal = bytes('café & pâtisserie','utf-8')
>>> bytes_literal.decode()
'café & pâtisserie'

The first line in listing A-5 attempts to create the bytes literal b'café & pâtisserie', for which Python throws the exception SyntaxError: bytes can only contain ASCII literal characters. As outlined earlier, a bytes literal can only contain ASCII characters, so in this case the non-ASCII characters é and â cause Python to generate an error. A bytes literal can still represent non-ASCII data, but such bytes must be written as escape sequences made up of ASCII characters (e.g. \xc3\xa9).

The next lines in listing A-5 make use of Python's built-in bytes() class to create a Python bytes object, which in turn is displayed as a bytes literal. The bytes() class in this case receives two arguments: the first is the data to convert into a bytes literal and the second is the encoding for said data. The bytes('café & pâtisserie','utf-8') statement creates a bytes literal from the 'café & pâtisserie' string with UTF-8 encoding, which in turn produces the b'caf\xc3\xa9 & p\xc3\xa2tisserie' bytes literal. Notice the non-ASCII characters é and â are converted to the ASCII-compliant escape sequences \xc3\xa9 and \xc3\xa2, respectively; each pair is the two-byte UTF-8 encoding of the corresponding character (é is U+00E9 and â is U+00E2).

For illustrative purposes, the next two bytes() class statements in listing A-5 create a bytes literal using two other encodings. The bytes('café & pâtisserie','latin-1') statement creates a bytes literal using the latin-1 encoding -- also known as ISO-8859-1 -- producing the slightly different b'caf\xe9 & p\xe2tisserie' bytes literal. The difference arises because ISO-8859-1 represents the non-ASCII characters é and â as the single bytes \xe9 and \xe2, respectively. The bytes('café & pâtisserie','ascii') statement throws an error because the é and â characters don't exist in ASCII, so attempting to encode them as ASCII throws the error 'ascii' codec can't encode character.

Because UTF-8 is generally the most common option to encode data into a bytes literal, the statement bytes('café & pâtisserie','utf-8') is re-executed in listing A-5 and its result assigned to the bytes_literal reference. Finally, the same Python bytes object decode() method used in listing A-4 is used to decode bytes_literal into the original Unicode string.
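When the target encoding can't represent every character, raising an error is only one of the available behaviors. The sketch below uses the errors argument of the str encode() method, which accepts standard handler names such as replace, ignore and backslashreplace; whether any of them is appropriate depends on the application.

```python
text = "café & pâtisserie"

# Substitute a question mark for each character ASCII can't represent.
replaced = text.encode("ascii", errors="replace")

# Silently drop characters ASCII can't represent.
ignored = text.encode("ascii", errors="ignore")

# Write unsupported characters as visible backslash escape sequences.
escaped = text.encode("ascii", errors="backslashreplace")

print(replaced)
print(ignored)
print(escaped)
```

All three handlers lose or alter the original text, so encoding as UTF-8 remains the safer default when the receiving side supports it.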

Note It's perfectly valid to combine a bytes literal and a raw string, so it's possible to declare or encounter the syntax br'', which denotes a raw bytes literal in which backslashes are interpreted literally.
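A quick way to see how the br prefix treats backslashes is to compare a regular bytes literal with a raw one. This is a small sketch with hypothetical variable names.

```python
regular = b"\n"      # one byte: a line feed (value 10)
raw_bytes = br"\n"   # two bytes: a backslash (92) and the letter n (110)

regular_values = list(regular)
raw_values = list(raw_bytes)

# The raw form is equal to a regular bytes literal with a doubled backslash.
equivalent = raw_bytes == b"\\n"

print(regular_values, raw_values, equivalent)
```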

You'll rarely need to work with Python bytes literals as shown in listings A-4 and A-5. However, it's important to understand these fundamentals, because you're very likely to encounter chunks of text preceded with b'' when a Python application interacts with data transmitted over a network or stored in files.

Listing A-6 illustrates how Python reads data transmitted over a network from a web page.

Listing A-6. Read a Unicode Python data stream from a web page

>>> import urllib.request
>>> utf_8_stream = urllib.request.urlopen("https://www.w3.org/2001/06/utf-8-test/UTF-8-demo.html")
>>> page_content = utf_8_stream.read()
>>> page_content
b'<!DOCTYPE html PUBLIC "-//W3C//DTD......
.....'

>>> page_content.decode()
'<!DOCTYPE html PUBLIC "-//W3C//DTD......
.....'

Listing A-6 begins by importing Python's urllib.request module to request a web page. Next, a request is made to get the contents of the web page at https://www.w3.org/2001/06/utf-8-test/UTF-8-demo.html. Once the web page url is open, a call is made to the read() method to read the contents of the data stream.

Notice the contents of the data stream after the call to read() is a bytes literal (i.e. it's prefixed with b). If you inspect the entire result, you'll notice its non-ASCII characters appear as \x escape sequences for their UTF-8 bytes, as in all bytes literals. Next, you can see the Python bytes object decode() method is used on the page_content bytes literal reference to output the data stream as a regular Unicode string.

The example in listing A-6 is straightforward because the contents of the web page are encoded as UTF-8, however, this isn't necessarily always the case, as illustrated in listing A-7.

Listing A-7. Read an ISO-8859-1 Python data stream from a web page

>>> import urllib.request

>>> iso_8859_1_stream = urllib.request.urlopen("https://www.w3.org/Style/Examples/010/iso-8859-1-correct.html")
>>> page_content = iso_8859_1_stream.read()
>>> page_content
b'<!DOCTYPE html PUBLIC \'-//W3C//DTD XHTML 1.0 Strict//EN\'\n  \'http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\
'>\n\n<html lang="en">\n<head>\n<title>Test: iso-8859-1 with correct @charset</title>\n
<link rel="stylesheet" href="iso-8859-1-correct.css" />\n</head>\n<body>\n<p class="gr\xe9\xe9n">
The linked style sheet is encoded in iso-8859-1 and\nhas a (correct) @charset "iso-8859-1". The HTTP server (on purpose)
\nomits the charset parameter. If this text is green, the style sheet is\ncorrectly read as ISO-8859-1.</p>\n</body>\n</html>\n\n
<!-- Keep this comment at the end of the file\nLocal variables:\nmode: xml\nsgml-declaration:"~/SGML/xml.dcl"\nsgml-default-doctype-name:"html"\nEnd:\n-->\n'
>>> page_content.decode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 274: invalid continuation byte
>>> page_content.decode('iso-8859-1')
'<!DOCTYPE html PUBLIC \'-//W3C//DTD XHTML 1.0 Strict//EN\'\n  \'http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\
'>\n\n<html lang="en">\n<head>\n<title>Test: iso-8859-1 with correct @charset</title>\n
<link rel="stylesheet" href="iso-8859-1-correct.css" />\n</head>\n<body>\n<p class="gréén">
The linked style sheet is encoded in iso-8859-1 and\nhas a (correct) @charset "iso-8859-1". The HTTP server (on purpose)
\nomits the charset parameter. If this text is green, the style sheet is\ncorrectly read as ISO-8859-1.</p>\n</body>\n</html>\n\n
<!-- Keep this comment at the end of the file\nLocal variables:\nmode: xml\nsgml-declaration:"~/SGML/xml.dcl"\nsgml-default-doctype-name:"html"\nEnd:\n-->\n'
>>> 

Listing A-7 begins just like listing A-6 by importing Python's urllib.request module to request a web page. However, a request is made to get the contents of the web page at https://www.w3.org/Style/Examples/010/iso-8859-1-correct.html which uses an ISO-8859-1 or latin-1 encoding. Once the web page url is open, a call is made to the read() method to read the contents of the data stream.

Once again, the contents of the data stream after the call to read() is a bytes literal (i.e. it's prefixed with b). If you inspect the entire result, you'll notice it contains two non-ASCII characters encoded as ISO-8859-1 literals: \xe9\xe9. Next, you can see that calling the Python bytes object decode() method on the page_content bytes literal reference throws the error 'utf-8' codec can't decode byte 0xe9 in position 274: invalid continuation byte.

By default, the Python bytes object decode() method assumes the underlying Python bytes data stream is encoded as UTF-8, which is not the case in listing A-7. Because the byte sequence \xe9\xe9 is not valid UTF-8, Python throws an error. In order to transform the contents of an ISO-8859-1 data stream with the decode() method, it's necessary to explicitly pass the encoding type as an argument to the decode() method. You can see in listing A-7 that executing page_content.decode('iso-8859-1') results in outputting the data stream as a regular Unicode string and the \xe9\xe9 ISO-8859-1 bytes are correctly converted to éé.
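The same failure can be reproduced without a network request. The sketch below uses a short bytes value modeled on the fragment from listing A-7 (an assumption made for illustration, not the full page) and also shows the errors='replace' option of decode(), which substitutes the Unicode replacement character instead of raising an error.

```python
# Short bytes value modeled on the ISO-8859-1 fragment in listing A-7.
raw = b'<p class="gr\xe9\xe9n">'

failure = None
try:
    raw.decode()  # UTF-8 is assumed when no encoding is given
except UnicodeDecodeError as exc:
    failure = exc  # keeps details such as the offending byte position

# Passing the correct encoding recovers the original characters.
latin = raw.decode("iso-8859-1")

# errors="replace" avoids the exception but loses the original characters.
lossy = raw.decode("utf-8", errors="replace")

print(failure.start if failure else None)
print(latin)
print(lossy)
```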

As you can tell from listings A-6 and A-7, even though Python provides all the necessary mechanisms to work with UTF-8 and other types of encodings, it's still important to be aware of the encoding type used by the data you're attempting to process in Python.
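One possible way to handle data whose encoding isn't known in advance is to try a declared encoding first and then fall back to others. The decode_body function below is a hypothetical helper written for this example, not part of Python or Django. With urllib, a declared charset can usually be obtained from the response with response.headers.get_content_charset(), which returns None when the server omits it.

```python
def decode_body(raw, declared_charset=None, fallbacks=("utf-8", "iso-8859-1")):
    """Decode bytes using the declared charset, then each fallback in order.

    Hypothetical helper: returns a (text, encoding_used) tuple.
    """
    candidates = ([declared_charset] if declared_charset else []) + list(fallbacks)
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            # UnicodeDecodeError: bytes are not valid in this encoding
            # LookupError: the encoding name is unknown to Python
            continue
    raise ValueError("none of the candidate encodings could decode the data")


utf8_result = decode_body("café".encode("utf-8"))
latin_result = decode_body(b"gr\xe9\xe9n")
declared_result = decode_body(b"gr\xe9\xe9n", declared_charset="iso-8859-1")
unknown_result = decode_body(b"abc", declared_charset="no-such-codec")

print(utf8_result, latin_result, declared_result, unknown_result)
```

Because ISO-8859-1 assigns a character to every possible byte value, it never fails to decode, which is why it's placed last: it will always produce a string, even when the data was actually produced with a different encoding.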

Finally, listing A-8 illustrates another common task involving text in Python 3, which is converting a Python data structure to a JSON data structure.

Listing A-8. Python Unicode converted to JSON

>>> import json
>>> translations = { "french":"café & pâtisserie",
...                  "japanese":"コーヒー と パティスリー",
...                  "spanish":"café y confitería"}
>>> json.dumps(translations)
'{"french": "caf\\u00e9 & p\\u00e2tisserie",
  "japanese": "\\u30b3\\u30fc\\u30d2\\u30fc \\u3068 \\u30d1\\u30c6\\u30a3\\u30b9\\u30ea\\u30fc",
  "spanish": "caf\\u00e9 y confiter\\u00eda"}'
>>> bytes('caf\u00e9 & p\u00e2tisserie','utf-8')
b'caf\xc3\xa9 & p\xc3\xa2tisserie'
>>> bytes('caf\u00e9 & p\u00e2tisserie','latin-1')
b'caf\xe9 & p\xe2tisserie'
>>> json.dumps(translations,ensure_ascii=False)
'{"french": "café & pâtisserie",
  "japanese": "コーヒー と パティスリー",
  "spanish": "café y confitería"}'

Listing A-8 begins by importing Python's json module to work with JSON data structures. Next, a Python dictionary is created with several literal strings containing special characters (i.e. non-ASCII characters). Immediately after, the json.dumps() function is used to convert the Python data structure into a JSON string. By default, json.dumps() escapes all non-ASCII characters so its output is ASCII compliant, similar to how a Python bytes literal displays non-ASCII bytes as escape sequences.

In the case of listing A-8, notice the é character is transformed into the ASCII compliant \\u00e9 character, as well as how all the other special characters are also transformed into a similar \\uxxxx format.

Note json.dumps() produces a string, so when that string is displayed the literal \uxxxx format appears escaped as \\uxxxx.

The reason the representation for the é character and all other characters is different from the previous examples is that the json module transforms non-ASCII characters into JSON \uxxxx escape sequences, which use the same syntax as Python's own \uxxxx string escapes. This is confirmed in listing A-8 with the creation of two bytes literals that make use of the Python source code representations \u00e9 and \u00e2, which also get translated into the equivalent encoded bytes literals presented in listing A-5 (e.g. the Python source code \u00e9, which corresponds to é, gets encoded to \xc3\xa9 in the UTF-8 bytes literal representation and to \xe9 in the latin-1 -- also known as ISO-8859-1 -- bytes literal representation).

Finally, in case you want to override the json.dumps() default behavior and not have it convert all data to ASCII compliant Python source code characters, you can use the method argument ensure_ascii=False. As you can see in listing A-8, by using the ensure_ascii=False as the second argument to the json.dumps() method, json.dumps() outputs a JSON data structure with Unicode characters as is.
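Both forms of output carry the same data, which can be checked by loading them back with json.loads(). The following is a minimal sketch that reuses part of the dictionary from listing A-8; the variable names are hypothetical.

```python
import json

translations = {"french": "café & pâtisserie",
                "japanese": "コーヒー と パティスリー"}

ascii_json = json.dumps(translations)                       # \uXXXX escapes
unicode_json = json.dumps(translations, ensure_ascii=False)  # characters as is

# The default output contains only ASCII characters.
only_ascii = all(ord(ch) < 128 for ch in ascii_json)

# Both representations load back into the same Python dictionary.
round_trip_ok = json.loads(ascii_json) == translations == json.loads(unicode_json)

# To send the non-escaped form over a network it still has to be encoded to bytes.
wire_bytes = unicode_json.encode("utf-8")
from_wire = json.loads(wire_bytes.decode("utf-8"))

print(only_ascii, round_trip_ok, from_wire == translations)
```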

  1. https://en.wikipedia.org/wiki/Unicode    

  2. http://www.columbia.edu/kermit/ascii.html    

  3. https://docs.python.org/3/howto/regex.html#the-backslash-plague    

  4. https://docs.python.org/3/library/stdtypes.html#bytes-and-bytearray-operations