Python (2016)
CHAPTER 8: Python Strings
Aside from numbers, strings are among the simplest data types, not only in Python but in most other programming languages. A string is created by putting single or double quotation marks around text. Here is an example of a string using single quotes:
>>> 'Hello world!'
'Hello world!'
Using double quotes instead does not change the string, as the following example shows:
>>> "Hello world!"
'Hello world!'
You can also perform a “mathematical” operation on strings even though they are not numbers. For example, the plus sign joins two strings together:
>>> "Hello, " + "world!"
'Hello, world!'
String literals that are written next to each other, rather than held in variables, are also concatenated automatically:
>>> "Hello" "world" "no" "spaces"
'Helloworldnospaces'
Note that this last example puts spaces in between the literal strings, but this was only done for clarity. Concatenation works without them, so the following gives the same result: "Hello""world""no""spaces".
Now, imagine that you have to type a very long string that repeats itself. You can repeat a string with the asterisk (*):
>>> print("hello" * 3)
hellohellohello
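The three ways of building strings shown so far can be checked side by side. This is only a minimal sketch; the variable names (greeting, joined, repeated) are illustrative and do not come from the text above.

```python
# Joining with +, adjacent literals, and repetition with *.
greeting = "Hello, " + "world!"          # + joins exactly what it is given
joined = "Hello" "world" "no" "spaces"   # adjacent literals are merged into one
repeated = "hello" * 3                   # * repeats the string

print(greeting)   # Hello, world!
print(joined)     # Helloworldnospaces
print(repeated)   # hellohellohello
```

Note that + does not insert a space for you; any space has to be part of one of the strings.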
Escape Characters
There are a few characters which cannot easily be typed directly into a string. These are the “escape characters”, each of which is written inside a string as a sequence of two or more characters. In Python, escape characters are denoted by a backslash (\) at the start. To start another line in the string, for example, a line feed can be added:
>>> "Hello world!\n"
'Hello world!\n'
This is not really impressive, if you look at it this way. However, let’s try using “print()” on this line:
>>> print("Hello world!\n")
Hello world!
You might not notice anything new, but there is an extra line right under the string, ready to receive a new string.
For your reference, here is a table of the other escape characters. Don’t worry about memorizing them, though, since the most important one is \n.
\" - Double quote (")
\' - Single quote (')
\\ - Backslash
\a - ASCII Bell
\b - ASCII Backspace
\f - ASCII Form-feed
\n - ASCII Linefeed
\r - ASCII Carriage Return
\t - ASCII Horizontal Tab
\v - ASCII Vertical Tab
\ooo - A character with the octal value ooo
\xhh - A character with the hex value hh
\N{name} - A character with the name “name” in Unicode database
\uxxxx - A character with the 16-bit hex value xxxx
\Uxxxxxxxx - A character with the 32-bit hex value xxxxxxxx
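A few entries from the table can be tried out directly. The sketch below is illustrative only; the sample values (the letter A, the accented e, and the code point 1F600) are chosen here as assumptions and are not taken from the text.

```python
# Each escape sequence below produces exactly one character.
tab = "a\tb"
print(len(tab))           # 3: 'a', a tab, and 'b'
print("\x41" == "A")      # True: hex 41 is the letter A
print("\101" == "A")      # True: octal 101 is also the letter A
print("\u00e9" == "\N{LATIN SMALL LETTER E WITH ACUTE}")   # True
print(len("\U0001F600"))  # 1: one character written as a 32-bit escape

# Escaped quotes let both kinds of quotation mark appear in one string.
quote = "She said \"hi\" and it\'s fine"
print(quote)              # She said "hi" and it's fine
```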
The astute reader may notice that the backslash itself can cause a problem inside a string. For example, here is a line of code where the programmer needs to print a directory name for Windows.
>>> print("C:\newfolder")
C:
ewfolder
Here the “\n” escape character kicks in, creating a new line and picking up with the “ewfolder” part. To correct this, use the backslash escape character, “\\”. Hence, we have:
>>> print("C:\\newfolder")
C:\newfolder
This might grow tiresome, however, when you have very long directory strings, which calls for a simpler way than using two backslashes every time. In this case, we can use the prefix “r” (the uppercase “R” works as well). Once you place this prefix before the opening quotation mark, Python interprets the string as a raw string and leaves the backslashes in it exactly as typed, instead of treating them as the start of an escape character. Here is an example:
>>> print(r"C:\newfolder")
C:\newfolder
You may also assign strings to variables. Here is how it’s done:
>>> spam = r"C:\newfolder"
>>> print(spam)
C:\newfolder
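The difference between the three spellings of the path can be made visible by comparing the strings and their lengths. This is a small sketch; the variable names escaped, raw, and broken are made up for illustration.

```python
escaped = "C:\\newfolder"   # backslash written as an escape character
raw = r"C:\newfolder"       # raw string: the backslash is kept as typed
broken = "C:\newfolder"     # here \n is read as a line feed

print(escaped == raw)       # True: both hold the same 12 characters
print(len(raw))             # 12
print(len(broken))          # 11: "\n" collapsed into a single character
print("\n" in broken)       # True
```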
Newlines
Now, say that you want to print text on multiple lines. You can do it this way:
>>> print("Hi!\nHello!\nHeya!")
Hi!
Hello!
Heya!
Such a string can turn into a very long (and confusing) line, but there is a trick that lets the text span multiple lines: start and end the string with three quotation marks (""").
>>> print("""
... Hi!
... Hello!
... Heya!
... """)

Hi!
Hello!
Heya!
As you see, this can make things a lot easier and a lot less confusing. However, you will also notice that a linefeed appears at the very start. This can be avoided by adding a backslash right after the opening quotation marks:
>>> print("""\
... Hi!
... Hello!
... Heya!""")
Hi!
Hello!
Heya!
As shown, this fixes the issue and removes the errant linefeed.
Speaking of errant linefeeds, trying these exercises in your Python interpreter will show you that print() automatically adds a linefeed at the end of its output. If you wish to suppress this, pass an empty string as the end argument:
>>> print("Welcome to Python!", end="")
Welcome to Python!
You can also split a long string across several lines of code without introducing any linefeeds into it. This can be done by placing the quoted pieces inside parentheses:
>>> spam = ("Hello "
... "world!")
>>> print(spam)
Hello world!
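The techniques in this section can be compared in one short script. This is a sketch under the assumption that it is run as a file and not typed at the prompt; the names triple and escaped are illustrative.

```python
# A triple-quoted string with a leading backslash equals the \n version.
triple = """\
Hi!
Hello!
Heya!"""
escaped = "Hi!\nHello!\nHeya!"
print(triple == escaped)   # True

# Adjacent literals inside parentheses form one string with no line feed.
spam = ("Hello "
        "world!")
print(spam)                # Hello world!

# end="" replaces the line feed that print() normally appends.
print("Welcome ", end="")
print("to Python!")        # both calls share one output line
```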
Formatting
Much like in the C language, strings in Python can be given special formatting, which makes it easier to produce well-formatted output. You can format a string using the percent sign (%) or using curly brackets ({}) together with the format() method. Here is an example using the percent sign:
>>> print("The number five (%d)." % 5)
The number five (5).
The example code used the special format character %d, which was replaced by a decimal integer. The percent sign (%) right after the string is followed by the value that replaces the format character. Below is another demonstration of formatting strings:
>>> name = "Python"
>>> date = 2016
>>> print("Copyright (c) %s %d" % (name, date))
Copyright (c) Python 2016
Notice the use of the comma and the parentheses. When there is more than one format argument, they must be wrapped in parentheses and separated by commas; otherwise an error will appear.
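Here is a table of the most commonly used format characters in Python. Most of them work with both the percent sign and the curly-bracket style; “b” and “n” are only available with the curly-bracket style.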
s. This is a string format, which is the default for formatting.
b. This is the binary format.
c. This converts the integer into a Unicode character, before being formatted.
d. This is the decimal format.
o. This is the octal format.
x. This is the hexadecimal format. Use the lowercase “x” for the digits a-f, and the uppercase “X” for A-F.
n. This is the number format. This is almost the same as “d”, but this instead uses the current locale setting in order to insert the needed number separators.
e. This is the exponent notation. This will print the scientific notation of the number. There is a default precision level (6). You can also use the uppercase version, which will print an “E” in the notation.
f. This is the fixed-point format. This will display a fixed-point number, with a default precision of 6. You can also use the uppercase version, which will convert “inf” to “INF” and “nan” to “NAN”.
g. This is the general format. You can also use the uppercase version, which will automatically switch to “E” once the numbers become too large.
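Several of the format characters listed above can be tried in a few lines. The sketch below is illustrative; the variable year and the sample numbers are assumptions chosen here, and the curly-bracket lines use the str.format() method.

```python
name = "Python"
year = 2016

# The same output produced with both formatting styles.
percent_style = "Copyright (c) %s %d" % (name, year)
brace_style = "Copyright (c) {} {}".format(name, year)
print(percent_style == brace_style)   # True

print("%x" % 255)          # ff
print("%X" % 255)          # FF
print("%o" % 8)            # 10
print("%e" % 1234.5)       # 1.234500e+03
print("%f" % 3.14159)      # 3.141590
print("{:b}".format(5))    # 101 (binary needs the curly-bracket style)
print("{:c}".format(65))   # A
```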
Indexing
Strings in Python support indexing, which allows the programmer to retrieve just a part of the string. The following is a demonstration so that you can easily grasp this concept:
>>> "Hi there!"[1]
'i'
>>> spam = "Hi there!"
>>> spam[1]
'i'
As can be seen in the example, there is a number in the square brackets ([]). This is the index number, and it extracts the character at that position in the string. Remember that in Python, indexing starts from 0, so the maximum possible index of a string is one less than the number of characters in it. Punctuation marks and spaces also count as characters. If you choose a number beyond the end of the string, Python raises an IndexError with the message “string index out of range”.
Now, consider the following piece of code:
>>> spam = "Hi there!"
>>> spam[len(spam) - 1]
'!'
This piece of code extracts the last character in the string, no matter how long the string is. The formula is the string length minus 1. The function “len()” is built in and returns the length of the string. Here, len(spam) returns 9, and 9-1=8, which makes 8 the index number that corresponds to “!”.
The astute reader may see a disconnect here—why did it pull up the “!” instead of “e”, if we are counting one from the end? This is because the string length count does not start from zero, although the index number does. Thus, the string has 9 characters in total (string length), but “!” is indexed as 8 since “H” comes in at 0.
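The relationship between length and index can be verified with a short script. This is a minimal sketch; the names last and message are introduced here for illustration only.

```python
spam = "Hi there!"

print(len(spam))             # 9 characters in total
print(spam[0])               # H sits at index 0
last = spam[len(spam) - 1]   # index 8 is the final character
print(last)                  # !

# Index 9 is one past the end, so Python raises IndexError.
try:
    spam[len(spam)]
except IndexError as error:
    message = str(error)
    print(message)
```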
Another important thing to consider is the immutability of strings, meaning that their contents cannot be changed in place. Immutable types have values that are fixed. If you wish to change the value, you will need to reassign the whole variable. Consider the following example:
>>> spam = "Hi"
>>> spam = spam + " there!"
>>> spam
'Hi there!'
This piece of code demonstrates how the variable “spam” is assigned a different value. How is this related to indexing? The same rule applies: a character reached through an index can be read, but it cannot be assigned a new value.
To reassign a string variable while replacing part of its contents, you have to slice the string and rebuild it. Slicing is taught in the next section, but here is an example of what it looks like. The character at index 2 (the space) is replaced with “x”:
>>> spam = "Hi there!"
>>> spam = spam[:2] + "x" + spam[3:]
>>> spam
'Hixthere!'
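Immutability can be observed directly: assigning to an index fails, while building a new string from slices works. The sketch below uses the illustrative names assigned and replaced, which are not part of the text above.

```python
spam = "Hi there!"

# Assigning to an index raises TypeError because strings are immutable.
try:
    spam[0] = "J"
    assigned = True
except TypeError:
    assigned = False
print(assigned)   # False

# A new string is built from slices instead; the original is untouched.
replaced = spam[:2] + "x" + spam[3:]
print(replaced)   # Hixthere!
print(spam)       # Hi there!
```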
Slicing
In Python, slicing will be one of the more important concepts that you will be learning. This is a feature that will allow you to extract a “substring” from the main string. This substring is in essence a string within the string—so the words “Hi” and “there” are both substrings of the string “Hi there!”. But substrings do not have any boundaries, so you can extract a single character (letter, punctuation mark, or space) out of a very long string of text if you so wish.
In slicing, the most important character is the colon (:). Here is an example of the basic application of the colon:
>>> spam = "Hi there!"
>>> spam[0:1]
'H'
In this example, you will see how Python builds upon the indexing feature discussed previously. The line “spam[0:1]” uses the slicing feature on the string held in the spam variable. Python reads it as “get the substring starting at the character with index 0 and stopping just before the character with index 1”. In essence, the first number is where the slice begins, and the second number is where it ends; the character at the second index is not included.
This form of slicing can be helpful in many situations. However, what if you wish to get just the first few characters of the string, or everything after them? The len() function can be useful, but there is an easier method. By omitting one of the numbers in the slice, Python will slice from the beginning or to the end of the string (depending on which side of the colon was omitted).
Here is an example:
>>> eggs = "Hi there!"
>>> eggs[:6]
'Hi the'
>>> eggs[6:]
're!'
As demonstrated, omitting the first number starts the slice at the very beginning of the string and stops just before the specified index. Omitting the number after the colon starts the slice at the specified index and continues to the end of the string. Another way to look at it: when joined together, eggs[:6] and eggs[6:] give back eggs.
You will also notice that in slicing, the “index out of range” error does not occur, even when the index numbers you specify are out of range. Instead, Python returns whatever part of the slice lies inside the string: the whole string if the slice covers it, or an empty string if the slice lies entirely outside it.
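The slicing rules described in this section can be checked with a few lines. This is a small sketch; the names first and rest and the out-of-range numbers 100 and 50 are chosen here for illustration.

```python
eggs = "Hi there!"

first = eggs[:6]             # from the start up to, not including, index 6
rest = eggs[6:]              # from index 6 to the end
print(first)                 # Hi the
print(rest)                  # re!
print(first + rest == eggs)  # True: the two halves rebuild the string

print(eggs[0:1])             # H: the end index is not included

# Out-of-range bounds are clipped to the string instead of raising an error.
print(eggs[3:100])           # there!
print(repr(eggs[50:]))       # '' (an empty string)
```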
Encoding
Now that we know what a string means and how it works (and is worked), it is time to delve further into its nature. In fact, we have only seen one part of the string’s nature. The true nature of a string may be different things, without the string having to change—it all depends on the encoding.
There are two prominent encoding schemes used for strings: Unicode and ASCII. Of the two, ASCII is the simpler one. It is a small scheme covering the unaccented Latin letters, digits, and common punctuation and symbols. Unicode, on the other hand, is a much larger scheme with room for over a million characters. Its aim is to provide one scheme that contains all of the alphabets, scripts, and characters in the world. In Python 3.X, strings are Unicode by default. This means that one can put almost anything into a string, and it can be correctly printed out by the interpreter. This is perfect for countries that do not use English as their standard language. ASCII will do them little good, since it does not allow many characters: it defines only 128, numbered 0 to 127.
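The difference between the two schemes shows up when a string is converted to bytes with the built-in str.encode() method. The sketch below is illustrative only; the sample word and the names text, utf8_bytes, and ascii_ok are assumptions made here, and UTF-8 is used as one common way of storing Unicode text.

```python
print(ord("A"))                  # 65: the code of the letter A, shared by ASCII and Unicode

text = "caf\u00e9"               # the word "cafe" with an accented final e
utf8_bytes = text.encode("utf-8")
print(len(text))                 # 4 characters
print(len(utf8_bytes))           # 5 bytes: the accented letter takes two bytes in UTF-8

# ASCII has no code for the accented letter, so encoding fails.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)                  # False

# Plain English text fits into ASCII without trouble.
print("Hi there!".encode("ascii"))   # b'Hi there!'
```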