Python, like most programming languages, has certain behaviors that can confuse newcomers to the language. This appendix gives an overview of the Python features that matter most if you want to create Django applications and are already familiar with another programming language (e.g. Ruby, PHP).

In this appendix you'll learn about: Python strings, unicode and other annoying text behaviors; Python methods and how to use them with default, optional, *args and **kwargs arguments; Python classes and subclasses; Python loops, iterators and generators; Python list comprehensions, generator expressions, maps and filters; as well as how to use the Python lambda keyword for anonymous methods.

Strings, unicode and other annoying text behaviors

Working with text is so common in web applications that you'll eventually be caught out by some of the less straightforward ways Python interprets it. First off, beware that there are considerable differences in how Python 3 and Python 2 work with strings.

Python 3 provides an improvement over Python 2, in the sense there are just two instead of three ways to interpret strings. But still, it's important to know what's going on behind the scenes in both versions so you don't get caught off-guard working with text. Listing A-1 illustrates a series of string statements run in Python 2 to showcase this Python version's text behavior.

Listing A-1. Python 2 literal unicode and strings

Python 2.7.3 (default, Apr 10 2013, 06:20:15) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> 'café & pâtisserie'
'caf\xc3\xa9 & p\xc3\xa2tisserie'
>>> print('\xc3\xa9')
é
>>> print('\xc3\xa2')
â

The first action in listing A-1 shows the default Python encoding is ascii, which is the default for all Python 2.x versions. In theory, this means Python is limited to representing 128 characters, which are the basic letters and characters used by all computers -- see any ASCII table for details[1]. This is just in theory though, because you won't get an error when attempting to input a non-ASCII character in Python.

If you create a string statement with non-ASCII characters like 'café & pâtisserie', you can see in listing A-1 the é character is output as \xc3\xa9 and the â character is output as \xc3\xa2. These outputs, which appear to be gibberish, are actually the UTF-8 byte sequences for the é and â characters, respectively. So take note that even though the default Python 2 encoding is ASCII, a plain Python 2 string holds non-ASCII characters as the encoded bytes it receives, which are UTF-8 bytes in listing A-1.

Next in listing A-1 you can see that calling print() on either of these byte sequences outputs the expected é or â characters. Behind the scenes, Python 2 offers the convenience of inputting non-ASCII characters in an ASCII encoding environment by keeping plain strings as their literal UTF-8 byte representations. To confirm this behavior, you can use the decode() method, as illustrated in listing A-2.

Listing A-2. Python 2 decode unicode and u'' prefixed strings

Python 2.7.3 (default, Apr 10 2013, 06:20:15) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> 'café & pâtisserie'.decode('utf-8')
# Outputs: u'caf\xe9 & p\xe2tisserie'
>>> print(u'\xe9')
# Outputs: é
>>> print(u'\xe2')
# Outputs: â

In listing A-2 you can see the statement 'café & pâtisserie'.decode('utf-8') outputs u'caf\xe9 & p\xe2tisserie'. So now the same string decoded from UTF-8 converts the é character or \xc3\xa9 sequence to \xe9 and the â character or \xc3\xa2 sequence to \xe2. More importantly, notice the output string in listing A-2 is now preceded by a u to indicate a Unicode string.

Therefore the é character can really be represented by both \xc3\xa9 and \xe9: \xc3\xa9 is the UTF-8 encoded byte sequence, while \xe9 is the Unicode code point of the character (U+00E9). The same case applies for the â character or any other non-ASCII character. The way Python 2 distinguishes between the two representations is by prefixing the string with a u. In listing A-2 you can see calling print(u'\xe9') -- note the preceding u -- outputs the expected é and calling print(u'\xe2') outputs the expected â.
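
The two representations can also be compared side by side. The following is a minimal sketch that assumes Python 3 (not the Python 2 interpreter used in listing A-2), where every string is already Unicode and encode() produces the UTF-8 bytes. The variable names are illustrative only.

```python
# Python 3 sketch: one character seen as a code point and as UTF-8 bytes.
text = '\u00e9'  # the é character

code_point = ord(text)             # integer value of the Unicode code point
utf8_bytes = text.encode('utf-8')  # the two-byte UTF-8 sequence

print(hex(code_point))  # 0xe9
print(utf8_bytes)       # b'\xc3\xa9'

# Decoding the UTF-8 bytes recovers the original one-character string.
round_trip = utf8_bytes.decode('utf-8')
print(round_trip == text)  # True
```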

This Python 2 convenience of allowing non-ASCII characters in an ASCII encoding environment works so long as you don't try to forcibly convert a non-ASCII string that's already loaded into Python into ASCII, a scenario that's presented in listing A-3.

Listing A-3. Python 2 UnicodeEncodeError: 'ascii' codec can't encode character

Python 2.7.3 (default, Apr 10 2013, 06:20:15) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> 'café & pâtisserie'.decode('utf-8').encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3: ordinal not in range(128)

In listing A-3 you can see the call 'café & pâtisserie'.decode('utf-8').encode('ascii') raises a UnicodeEncodeError. Here you're not getting any convenience behavior -- like when you input non-ASCII characters -- because you're trying to encode a Unicode character (i.e. \xe9 or \xe2) into ASCII, so Python rightfully tells you it doesn't know how to treat characters that are outside of ASCII's 128 character range.
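
The same failure can be reproduced and inspected in code. This is a minimal sketch assuming Python 3, where the string is already Unicode so no decode() call is needed first. The variable names are illustrative; the attributes read from the exception (start, object, reason) are standard on UnicodeEncodeError.

```python
# Python 3 sketch: encoding non-ASCII text to ASCII raises UnicodeEncodeError.
text = 'caf\xe9 & p\xe2tisserie'  # café & pâtisserie

failed_position = None
failed_character = None
try:
    text.encode('ascii')
except UnicodeEncodeError as error:
    failed_position = error.start                 # index of the first bad character
    failed_character = error.object[error.start]  # the character itself
    print(error.reason)                           # ordinal not in range(128)

print(failed_position)   # 3
print(failed_character)  # é
```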

You can of course force ASCII output on non-ASCII characters, but you'll need to pass an additional argument to the encode() method, as illustrated in listing A-4.

Listing A-4. Python 2 encode arguments to process Unicode to ASCII

Python 2.7.3 (default, Apr 10 2013, 06:20:15) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> 'café & pâtisserie'.decode('utf-8').encode('ascii','replace')
# Outputs: 'caf? & p?tisserie'
>>> 'café & pâtisserie'.decode('utf-8').encode('ascii','ignore')
# Outputs: 'caf & ptisserie'
>>> 'café & pâtisserie'.decode('utf-8').encode('ascii','xmlcharrefreplace')
# Outputs: 'caf&#233; & p&#226;tisserie'
>>> 'café & pâtisserie'.decode('utf-8').encode('ascii','backslashreplace')
# Outputs: 'caf\\xe9 & p\\xe2tisserie'

As you can see in listing A-4, you can pass a second argument to the encode() method to handle non-ASCII characters: the replace argument so the output uses ? for non-ASCII characters; the ignore argument to simply drop any non-ASCII characters; the xmlcharrefreplace argument to output the XML character reference of the non-ASCII characters; or the backslashreplace argument to output a backslash-escaped reference to each non-ASCII character.
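
For comparison, here is a minimal sketch of the same four error handlers, assuming Python 3. The string is already Unicode, so decode() is not needed, and the results are bytes objects (note the b prefix) rather than plain strings. The results dictionary is just an illustrative way to collect the outputs.

```python
# Python 3 sketch: the encode() error handlers from listing A-4.
text = 'caf\xe9 & p\xe2tisserie'  # café & pâtisserie

handlers = ('replace', 'ignore', 'xmlcharrefreplace', 'backslashreplace')
results = {handler: text.encode('ascii', handler) for handler in handlers}

for handler, encoded in results.items():
    print(handler, encoded)
# replace b'caf? & p?tisserie'
# ignore b'caf & ptisserie'
# xmlcharrefreplace b'caf&#233; & p&#226;tisserie'
# backslashreplace b'caf\\xe9 & p\\xe2tisserie'
```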

Finally, listing A-5 illustrates how you can create Unicode strings in Python 2 by prefixing them with the letter u.

Listing A-5. Python 2 Unicode strings prefixed with u''

Python 2.7.3 (default, Apr 10 2013, 06:20:15)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> u'café & pâtisserie'
u'caf\xe9 & p\xe2tisserie'
>>> print(u'caf\xe9 & p\xe2tisserie')
café & pâtisserie

In listing A-5 you can see the u'café & pâtisserie' statement. By prefixing the string with u you're telling Python it's a Unicode string, so the output for the characters é and â is \xe9 and \xe2, respectively. And by calling print() on the output for this type of string preceded by u, the output contains the expected é and â letters.

Now let's explore how Python 3 works with unicode and strings in listing A-6.

Listing A-6. Python 3 unicode and string

Python 3.5.2 (default, Nov 17 2016, 17:05:23) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getdefaultencoding()
'utf-8'
>>> 'café & pâtisserie'
'café & pâtisserie'

As you can see in listing A-6, the default encoding is UTF-8, which is the default for all Python 3.x versions. In addition, every Python 3 string is a Unicode string, which makes working with text much simpler: there's no need to worry about how special characters are handled, since text is always handled as Unicode. Because of this, the leading u on strings is redundant in Python 3; it was removed in Python 3.0 and is accepted again since Python 3.3 purely to ease porting Python 2 code.
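
One consequence worth seeing in code is that Python 3 keeps text (str) and encoded data (bytes) as separate types and refuses to mix them implicitly. The following is a minimal sketch of that separation; the variable names are illustrative.

```python
# Python 3 sketch: str holds Unicode characters, bytes holds encoded data.
text = 'caf\xe9'             # café, a str of 4 characters
data = text.encode('utf-8')  # the same text as 5 UTF-8 bytes

print(type(text).__name__, len(text))  # str 4
print(type(data).__name__, len(data))  # bytes 5

# Mixing the two types is an error instead of a silent conversion.
mixing_allowed = True
try:
    text + data
except TypeError:
    mixing_allowed = False

print(mixing_allowed)  # False
```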

Next, let's explore Python's escape character and strings. The backslash \ is Python's escape character: it is used to escape the special meaning of a character and declare it as a literal value.

For example, to use an apostrophe in a string delimited by single quotes, you would need to escape the apostrophe so Python doesn't confuse where the string ends (e.g. 'This is Python\'s "syntax"'). A more particular case of using Python's backslash is on those special characters that use a backslash themselves. Listing A-7 illustrates various strings that use characters composed of a backslash so you can see this behavior.

Listing A-7. Python backslash escape character and raw strings

>>> print("In Python this is a tab \t and a line feed is \n")
In Python this is a tab          and a line feed is 
>>> print("In Python this is a tab \\t and a line feed is \\n")
In Python this is a tab \t and a line feed is \n
>>> print(r"In Python this is a tab \t and a line feed is \n")
In Python this is a tab \t and a line feed is \n

In the first example in listing A-7 you can see the \t character is converted to a tab space and the \n character to a line feed (i.e. new line). This is the actual character composition of a tab -- as a backslash followed by the letter t -- and a line feed -- as a backslash followed by the n. As you can see in the second example in listing A-7, in order for Python to output the literal value \t or \n you need to add another backslash -- which is after all Python's escape character.

The third example in listing A-7 is the same string as the previous ones, but it's preceded by r to make it a Python raw string. Notice that even though the special characters \t and \n are not escaped, the output is like the second example with escaped characters.

This is what's special about Python raw strings. By preceding a string with r, you tell Python to interpret backslashes literally, so there's no need to add another backslash like the second example in listing A-7.
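
The equivalence between an escaped string and a raw string can be checked directly. This is a minimal sketch with illustrative variable names; it behaves the same in Python 2 and Python 3.

```python
# Sketch: a raw string and a doubled backslash produce the same characters.
tab = "\t"       # one character: an actual tab
escaped = "\\t"  # two characters: a backslash followed by t
raw = r"\t"      # two characters: a backslash followed by t

print(len(tab))        # 1
print(len(escaped))    # 2
print(escaped == raw)  # True
print(tab == raw)      # False
```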

Python raw strings can be particularly helpful when manipulating strings with a lot of backslashes. And one particular case of strings that rely a lot on backslashes are regular expressions. Regular expressions are a facility in almost all programming languages to find, match or compare strings to patterns, which makes them useful in a wide array of situations.

The crux of using Python and regular expressions together is that they both give special meaning to backslashes, a problem that even the Python documentation calls The Backslash Plague[2]. Listing A-8 illustrates this concept of the backslash plague and raw strings in the context of Python regular expressions.

Listing A-8. Python backslash plague and raw strings with regular expressions

>>> import re
# Attempt to match literal '\n', (equal statement: re.match("\\n","\\n") )
>>> re.match("\\n",r"\n")
# Attempt to match literal '\n', (equal statement: re.match("\\\\n","\\n") )
>>> re.match("\\\\n",r"\n")
<_sre.SRE_Match object at 0x7fedfb2c7988>
# Attempt to match literal '\n', (equal statement: re.match(r"\\n","\\n") )
>>> re.match(r"\\n",r"\n")
<_sre.SRE_Match object at 0x7fedfb27c238>

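In listing A-8, we're trying to find a regular expression to match a literal \n -- in Python syntax this would be r"\n" or "\\n". Since regular expressions also use \ as their escape character, the first logical attempt at a matching regular expression is "\\n", but notice this first attempt in listing A-8 fails: it produces no output because re.match() returns None. Python first reduces "\\n" to the two characters \n, which the regular expression engine then interprets as a line feed rather than as a literal backslash followed by n.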

Because you're defining a regular expression inside a Python string, each of the two backslashes the regular expression needs must itself be escaped for Python, bringing the total to four backslashes! As you can see in listing A-8, the regular expression that matches a literal \n is the second attempt "\\\\n".

As you can see in this example, dealing with backslashes in Python and in the context of regular expressions can lead to very confusing syntax. To simplify this, the recommended approach to define regular expressions in Python is to use raw strings so backslashes are interpreted literally. In the last example in listing A-8, you can see the regular expression r"\\n" matches a literal \n and is equivalent to the more confusing regular expression "\\\\n".
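
The three attempts can also be checked programmatically. The sketch below uses illustrative variable names and additionally calls re.escape(), a standard library helper not used in listing A-8, which builds the escaped pattern for you when the text to match is known ahead of time.

```python
import re

# The text to search: two characters, a backslash followed by n.
subject = r"\n"

# Pattern "\\n" reaches the regex engine as \n, which means a line feed.
first_attempt = re.match("\\n", subject)
# Pattern "\\\\n" reaches the regex engine as \\n: a literal backslash, then n.
second_attempt = re.match("\\\\n", subject)
# The raw string spells the same pattern with half the backslashes.
raw_attempt = re.match(r"\\n", subject)

print(first_attempt)                      # None
print(second_attempt.group() == subject)  # True
print(raw_attempt.group() == subject)     # True
print("\\\\n" == r"\\n")                  # True

# re.escape() generates a pattern that matches the given text literally.
escaped_pattern = re.escape(subject)
print(escaped_pattern == r"\\n")          # True
```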

Note Python's escape character and raw string behavior is the same in both Python 2 and Python 3.

  1. https://www.cs.cmu.edu/~pattis/15-1XX/common/handouts/ascii.html

  2. https://docs.python.org/3/howto/regex.html#the-backslash-plague