
Chapter 4. Data Structures and Algorithms

Data structures are the building blocks used to solve programming problems. They provide organization for the data, and algorithms provide the logic to carve out the solution. Python provides many efficient built-in data structures that can be used effectively. There are other good data-structure implementations in the standard library as well as in third-party libraries. Often, the more pressing question is when to use what, or which data structure is a good fit for the problem at hand. To resolve this, we will cover the following topics:

· Python data structures

· Python library data structures

· Third-party data structures

· Algorithms on scale

Python built-in data structures

Key 1: Understanding Python's built-in data structures.

Before going into how to use the different data structures, we should look at the object attributes that matter to the built-in ones. For default sorting to work, an object should have the __lt__ or __gt__ method defined (sorting uses the < operator, which falls back to the reflected __gt__ of the other operand). Otherwise, we can pass a key function to the sorting method that extracts a comparison key from each element, as shown in the following code:

def less_than(self, other):
    return self.data <= other.data

class MyDs(object):
    def __init__(self, data):
        self.data = data

    def __str__(self):
        return str(self.data)

    __repr__ = __str__

if __name__ == '__main__':
    ml = [MyDs(i) for i in range(10, 1, -1)]
    try:
        ml.sort()
    except TypeError:
        print("unable to sort by default")

    for att in '__lt__', '__le__', '__gt__', '__ge__':
        setattr(MyDs, att, less_than)
        ml = [MyDs(i) for i in list(range(5, 1, -1)) + list(range(1, 5))]
        try:
            ml.sort()
            print(ml)
        except TypeError:
            print("cannot sort")
        delattr(MyDs, att)

    ml = [MyDs(i) for i in range(10, 1, -1)]
    print("sorted", sorted(ml, key=lambda x: x.data))
    ml.sort(key=lambda x: x.data)
    print("sort", ml)

The output of the preceding code is as follows:

unable to sort by default
[1, 2, 2, 3, 3, 4, 4, 5]
cannot sort
[5, 4, 4, 3, 3, 2, 2, 1]
cannot sort
sorted [2, 3, 4, 5, 6, 7, 8, 9, 10]
sort [2, 3, 4, 5, 6, 7, 8, 9, 10]
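Rather than wiring up all four comparison methods by hand, as the loop above does, the standard library's functools.total_ordering decorator can derive the rest from __eq__ plus one ordering method. A minimal sketch:

from functools import total_ordering

@total_ordering
class MyDs(object):
    def __init__(self, data):
        self.data = data

    def __eq__(self, other):
        return self.data == other.data

    def __lt__(self, other):
        # total_ordering fills in __le__, __gt__, and __ge__ from these two
        return self.data < other.data

ml = [MyDs(i) for i in range(10, 1, -1)]
ml.sort()  # works: list.sort only needs __lt__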

Whether two objects are equal in value is defined by the output of the __eq__ method. Collections compare equal when they have the same length and all corresponding items compare equal, as shown in the following code:

def equals(self, other):
    return self.data == other.data

class MyDs(object):
    def __init__(self, data):
        self.data = data

    def __str__(self):
        return str(self.data)

    __repr__ = __str__

if __name__ == '__main__':
    m1 = MyDs(1)
    m2 = MyDs(2)
    m3 = MyDs(1)
    print(m1 == m2)
    print(m1 == m3)
    setattr(MyDs, '__eq__', equals)
    print(m1 == m2)
    print(m1 == m3)
    delattr(MyDs, '__eq__')

    print("collection")
    l1 = [1, "arun", MyDs(3)]
    l2 = [1, "arun", MyDs(3)]
    print(l1 == l2)
    setattr(MyDs, '__eq__', equals)
    print(l1 == l2)
    l2.append(45)
    print(l1 == l2)
    delattr(MyDs, '__eq__')

    print("immutable collection")
    t1 = (1, "arun", MyDs(3), [1, 2])
    t2 = (1, "arun", MyDs(3), [1, 2])
    print(t1 == t2)
    setattr(MyDs, '__eq__', equals)
    print(t1 == t2)
    t1[3].append(7)
    print(t1 == t2)

The output of the preceding code is as follows:

False
False
False
True
collection
False
True
False
immutable collection
False
True
False

A hash function maps a larger set of values to a smaller set of hash values. Hence, two different objects can have the same hash, but objects with different hashes must be different. For hashing to be meaningful, objects that compare equal must have the same hash. Therefore, when we define __eq__ in a class, we must define a hash function as well. By default, for instances of user-defined classes, the hash is derived from the ID of the object, as shown in the following code:

class MyDs(object):
    def __init__(self, data):
        self.data = data

    def __str__(self):
        return "%s:%s" % (id(self) % 100000, self.data)

    def __eq__(self, other):
        print("collision")
        return self.data == other.data

    def __hash__(self):
        return hash(self.data)

    __repr__ = __str__

if __name__ == '__main__':
    dd = {MyDs(i): i for i in (1, 2, 1)}
    print(dd)

    print("all collisions")
    setattr(MyDs, '__hash__', lambda x: 1)
    dd = {MyDs(i): i for i in (1, 2, 1)}
    print(dd)

    print("all collisions,all values same")
    setattr(MyDs, '__eq__', lambda x, y: True)
    dd = {MyDs(i): i for i in (1, 2, 1)}
    print(dd)

The output of the preceding code is as follows:

collision
{92304:1: 1, 92360:2: 2}
all collisions
collision
collision
{51448:1: 1, 51560:2: 2}
all collisions,all values same
{92304:1: 1}

It can be seen that the mutable built-in objects, such as lists, do not have a hash defined, since their value can change while they are held in a hash-based container. Although it is not advised, we can still define __hash__ in our own mutable classes.
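A quick interactive check makes the difference visible (the exact hash value varies across platforms and Python versions):

>>> hash((1, 2, 3))   # tuples are immutable, hence hashable
2528502973977326415
>>> hash([1, 2, 3])   # lists are mutable, hence unhashable
Traceback (most recent call last):
  ...
TypeError: unhashable type: 'list'

With equality and hashing covered, let's look at the built-in collection types: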

· Tuples: These are immutable lists. Slice operations are O(n), retrieval by index is O(1), and they have small memory requirements. They are normally used to group objects of different types in a single structure, much like C language structures, where a fixed position holds a particular type of information, shown as follows:

>>> sys.getsizeof(())
48
>>> sys.getsizeof(tuple(range(100)))
848

Named tuples, available from the collections module, can be used to access values with attribute notation, as follows:

>>> from collections import namedtuple
>>> student = namedtuple('student', 'name,marks')
>>> s1 = student('arun', 133)
>>> s1.name
'arun'
>>> s1.marks
133
>>> type(s1)
<class '__main__.student'>

· Lists: These are mutable data structures similar to tuples, and are good for collecting objects of similar types. Analyzing their time complexity: insert, delete, slice, and copy operations are O(n); retrieval by index and len are O(1); and sort is O(n log n). Lists are implemented as dynamic arrays: when an append exceeds the current capacity, the array is reallocated with extra headroom, which keeps appends amortized O(1). Insert and delete at the front of the list take more time, as all references to the other elements must be moved one by one:

>>> sys.getsizeof([])
64
>>> sys.getsizeof(list(range(100)))
1008

· Dictionaries: These are mutable mappings. A key can be any hashable object. Getting the value for a key, setting a value for a key, and deleting a key are all O(1) on average, and copying is O(n):

>>> d = dict()
>>> getsizeof(d)
288
>>> getsizeof({i: None for i in range(100)})
6240

· Sets: These can be thought of as groups of items where a hash is used to retrieve items. Sets have methods for union and intersection, which is far more convenient than performing the same checks on lists. Let's take an example of groups of animals, as follows:

>>> air = ("sparrow", "crow")
>>> land = ("sparrow", "lizard", "frog")
>>> water = ("frog", "fish")
>>> # find animals able to live on both land and water
...
>>> [animal for animal in water if animal in land]
['frog']
>>>
>>> air = set(air)
>>> land = set(land)
>>> water = set(water)
>>> land | water  # animals living on either land or water
{'frog', 'fish', 'sparrow', 'lizard'}
>>> land & water  # animals living on both land and water
{'frog'}
>>> land ^ water  # animals living on only one of land and water
{'fish', 'sparrow', 'lizard'}

Their implementation and time complexity are very similar to dictionaries, shown as follows:

>>> s = set()
>>> sys.getsizeof(s)
224
>>> s = set(range(100))
>>> sys.getsizeof(s)
8416

Python library data structures

Key 2: Using Python's standard library data structures.

· collections.deque: The collections module has a deque implementation. A deque is useful for scenarios where items are inserted and deleted at both ends of the structure, since inserts at the front are as efficient as at the back. Copy is O(n), insert at either end is O(1), and delete by value is O(n). The following graph compares the insert-at-position-0 operation between a list and a deque:

>>> d = deque()
>>> getsizeof(d)
632
>>> d = deque(range(100))
>>> getsizeof(d)
1160

(Figure: time taken by insert at position 0, list versus deque.)
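Since the original figure cannot be reproduced here, the following small timeit sketch (timings are machine-dependent) measures the same comparison the graph plotted:

from collections import deque
from timeit import timeit

def front_insert_list(n):
    l = []
    for i in range(n):
        l.insert(0, i)   # O(n): shifts every existing element right

def front_insert_deque(n):
    d = deque()
    for i in range(n):
        d.appendleft(i)  # O(1): no shifting needed

for n in (1000, 10000, 100000):
    tl = timeit(lambda: front_insert_list(n), number=1)
    td = timeit(lambda: front_insert_deque(n), number=1)
    print("n=%d list=%.4fs deque=%.4fs" % (n, tl, td))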

· PriorityQueue: The standard library queue module has thread-safe, multi-producer, multi-consumer queue implementations, including PriorityQueue. For simpler, single-threaded cases, we can build a lighter priority queue of our own using the heapq module, as follows:

from heapq import heappush, heappop
from itertools import count

class PriorityQueue(object):
    def __init__(self):
        self.queue = []
        self.counter = count()

    def __len__(self):
        return len(self.queue)

    def pop(self):
        item = heappop(self.queue)
        print(item)
        return item[2], item[0]

    def push(self, item, priority):
        # the counter breaks ties between equal priorities, so the items
        # themselves are never compared and insertion order is preserved
        cnt = next(self.counter)
        heappush(self.queue, (priority, cnt, item))
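A quick usage sketch of this class (the task names are made up for illustration):

pq = PriorityQueue()
pq.push("send email", priority=3)
pq.push("write report", priority=1)
pq.push("fix bug", priority=1)  # same priority: popped after "write report"
item, priority = pq.pop()       # -> 'write report', 1
item, priority = pq.pop()       # -> 'fix bug', 1
item, priority = pq.pop()       # -> 'send email', 3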

Other than this, the queue module has thread-safe Queue, LifoQueue, and PriorityQueue implementations, and lists themselves can be used as stacks or queues. The collections module also has OrderedDict, which remembers the order in which elements were inserted, as the sketch below shows.
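A minimal OrderedDict sketch:

from collections import OrderedDict

od = OrderedDict()
od['b'] = 1
od['a'] = 2
od['c'] = 3
print(list(od.keys()))  # ['b', 'a', 'c']: insertion order, not sorted order
od.move_to_end('b')     # push an existing key to the end
print(list(od.keys()))  # ['a', 'c', 'b']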

Third-party data structures

Key 3: Using third-party data structures.

Python has a good set of data structures in the core language and standard library. But sometimes, an application has very specific requirements, and we can turn to third-party data-structure packages. Most of these modules are Python wrappers over C or C++ implementations:

· The blist module provides a drop-in replacement for list, plus sortedlist and sortedset types. It is discussed in greater detail in later chapters.

· The bintrees module provides binary trees, AVL trees, and Red-Black trees.

· The banyan module provides Red-Black trees, splay trees, and sorted lists.

· The sortedcontainers module provides SortedList, SortedDict, and SortedSet.

So, one can get almost every data structure for Python easily. More attention should be paid to why one data structure is a better fit than another for a given use case.

Arrays/List

For numeric computation, NumPy arrays should be considered. They are fast, memory-efficient, and provide many vector and matrix operations, as the sketch below shows.
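A minimal sketch of the vectorized arithmetic that makes NumPy attractive (assuming numpy is installed):

import numpy as np

a = np.arange(1000, dtype=np.float64)
b = np.arange(1000, dtype=np.float64)

c = a * b + 2.5   # one vectorized expression instead of a Python loop
print(c[:3])      # [ 2.5  3.5  6.5] (formatting varies by NumPy version)
print(a.nbytes)   # 8000: compact, contiguous storage (8 bytes per float)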

Binary tree

Trees have better worst-case behavior than dictionaries: insertion, removal, and look-ups are O(log(n)), and because the keys stay ordered, min/max and range queries are cheap. Several implementations are available.

One module is bintrees, which has C implementations of Red-Black trees, AVL trees, and plain binary trees. For example, with a Red-Black tree it is easy to find the min and max items and key ranges, as shown in the following example:

import bintrees

tr = bintrees.FastRBTree()
tr.insert("a", 40)
tr.insert("b", 5)
tr.insert("a", 9)   # re-inserting an existing key updates its value
print(list(tr.keys()), list(tr.items()))
print(tr.min_item())
print(tr.pop_max())
print(tr.pop_max())
tr = bintrees.FastRBTree([(i, i + 1) for i in range(10)])
print(tr[5:9])

The output of the preceding code is as follows:

['a', 'b'] [('a', 9), ('b', 5)]
('a', 9)
('b', 5)
('a', 9)
FastRBTree({5: 6, 6: 7, 7: 8, 8: 9})

Sorted containers

This is a pure-Python module providing the SortedList, SortedSet, and SortedDict data structures, which keep their items/keys sorted. The sortedcontainers module claims speed comparable to C extension modules, shown as follows:

import sortedcontainers as sc

l = sc.SortedList()
l.update([0, 4, 2, 1, 4, 2])
print(l)
print(l.index(2), l.index(4))
l.add(6)
print(l[-1])
l = sc.SortedList(range(10))
print(l)
print(list(l.irange(2, 6)))

seta = sc.SortedSet(range(1, 4))
setb = sc.SortedSet(range(3, 7))
print(seta - setb)
print(seta | setb)
print(seta & setb)
print([i for i in seta])

The output of the preceding code is as follows:

SortedList([0, 1, 2, 2, 4, 4], load=1000)
2 4
6
SortedList([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], load=1000)
[2, 3, 4, 5, 6]
SortedSet([1, 2], key=None, load=1000)
SortedSet([1, 2, 3, 4, 5, 6], key=None, load=1000)
SortedSet([3], key=None, load=1000)
[1, 2, 3]
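SortedDict, not covered by the example above, keeps its keys sorted as well; a short sketch:

import sortedcontainers as sc

sd = sc.SortedDict({'b': 2, 'd': 4, 'a': 1})
sd['c'] = 3
print(list(sd.keys()))  # ['a', 'b', 'c', 'd']: always in sorted order
print(sd.peekitem(0))   # ('a', 1): smallest key
print(sd.peekitem(-1))  # ('d', 4): largest key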

Trie

This is an ordered tree data structure where the position in the tree defines the key. The keys are normally strings. Compared to dictionaries, it has faster worst-case data retrieval, O(m), where m is the length of the key, and no hash function is needed. When only strings are stored as keys, it can take a lot less space than a dictionary.
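To make the structure concrete, here is a minimal dictionary-based trie sketch (illustrative only; the packages discussed next are far more compact and faster):

class Trie(object):
    def __init__(self):
        self.root = {}

    def insert(self, key, value):
        node = self.root
        for ch in key:                    # one tree level per character
            node = node.setdefault(ch, {})
        node['$'] = value                 # '$' marks the end of a key

    def get(self, key):
        node = self.root
        for ch in key:                    # O(m) walk, m = len(key)
            if ch not in node:
                raise KeyError(key)
            node = node[ch]
        if '$' not in node:
            raise KeyError(key)
        return node['$']

tr = Trie()
tr.insert("10.5.5.1", "gerrit.chiku.com")
print(tr.get("10.5.5.1"))                 # gerrit.chiku.com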


In Python, the marisa-trie package provides this functionality as a static data structure; it is a Cython wrapper over a C++ library. We can associate values with the keys as well. It also provides memory-mapped I/O, which is useful for decreasing memory usage at the cost of speed. The datrie package provides read-write tries. The following is some basic usage of these libraries:

>>> import string
>>> import marisa_trie as mtr
>>> import datrie as dtr
>>>
>>> # simple read-only keys
... tr = mtr.Trie([u'192.168.124.1', u'192.168.124.2', u'10.10.1.1', u'10.10.1.2'])
>>> # get all keys
... print(tr.keys())
['10.10.1.1', '10.10.1.2', '192.168.124.1', '192.168.124.2']
>>> # check if a key exists
... print(tr.has_keys_with_prefix('192'))
True
>>> # get the id of a key
... print(tr.get('192.168.124.1'))
2
>>> # get all items
... print(tr.items())
[('10.10.1.1', 0), ('10.10.1.2', 1), ('192.168.124.1', 2), ('192.168.124.2', 3)]
>>>
>>> # storing data along with keys
... btr = mtr.BytesTrie([('192.168.124.1', b'redmine.meeku.com'),
...                      ('192.168.124.2', b'jenkins.meeku.com'),
...                      ('10.5.5.1', b'gerrit.chiku.com'),
...                      ('10.5.5.2', b'gitlab.chiku.com'),
...                      ])
>>> print(list(btr.items()))
[('10.5.5.1', b'gerrit.chiku.com'), ('10.5.5.2', b'gitlab.chiku.com'), ('192.168.124.1', b'redmine.meeku.com'), ('192.168.124.2', b'jenkins.meeku.com')]
>>> print(btr.get('10.5.5.1'))
[b'gerrit.chiku.com']
>>>
>>> with open("/tmp/marisa", "w") as f:
...     btr.write(f)
...
>>> # using memory-mapped I/O to decrease memory usage
... dbtr = mtr.BytesTrie().mmap("/tmp/marisa")
>>> print(dbtr.get("192.168.124.1"))
[b'redmine.meeku.com']
>>>
>>> trie = dtr.Trie('0123456789.')  # define the allowed character range
>>> trie['192.168.124.1'] = 'redmine.meeku.com'
>>> trie['192.168.124.2'] = 'jenkins.meeku.com'
>>> trie['10.5.5.1'] = 'gerrit.chiku.com'
>>> trie['10.5.5.2'] = 'gitlab.chiku.com'
>>> print(trie.prefixes('192.168.245'))
[]
>>> print(trie.values())
['gerrit.chiku.com', 'gitlab.chiku.com', 'redmine.meeku.com', 'jenkins.meeku.com']
>>> print(trie.suffixes())
['10.5.5.1', '10.5.5.2', '192.168.124.1', '192.168.124.2']
>>>
>>> trie.save("/tmp/1.datrie")
>>> ntr = dtr.Trie.load('/tmp/1.datrie')
>>> print(ntr.values())
['gerrit.chiku.com', 'gitlab.chiku.com', 'redmine.meeku.com', 'jenkins.meeku.com']
>>> print(ntr.suffixes())
['10.5.5.1', '10.5.5.2', '192.168.124.1', '192.168.124.2']

Algorithms on scale

Key 4: Thinking out of the box about algorithms.

An algorithm is how we solve a problem. The most common barrier to solving a problem is not being able to define it properly. Normally, we apply algorithms only at a small scale, such as in a small piece of functionality or in sorting within a function. We do not, however, think about algorithms when the scale increases; mostly, the stress is on how fast the system is. Let's take the simple requirement of sorting a file and sending it to a user. If the file is, let's say, 10 to 20 KB or so, it is best to simply use the Python sorted function to sort the entries. In the following code, the file has columns for ID, name, due, and due-date, and we want to sort it based on dues, as follows:

10000000022,shy-flower-ac5,-473,16/03/25
10000000096,red-water-e85,-424,16/02/12
10000000097,dry-star-c85,-417,16/07/19
10000000070,damp-night-c76,-364,16/03/12
10000000032,muddy-shadow-aad,-362,16/08/05

def dosort(filename, result):
    with open(filename) as ifile:
        with open(result, "w") as ofile:
            for line in sorted(
                    map(lambda x: x.strip(), ifile.readlines()),
                    key=lambda x: int(x.split(',')[2])):
                ofile.write(line)
                ofile.write('\n')

This works great, but as the file grows in size, so does the memory requirement; we cannot load all of the contents into memory at the same time. Hence, we can use an external merge sort to divide the file into small parts, sort them, and then merge the sorted results together. In the following code, we use heapq.merge to merge the iterators:

import tempfile
import heapq

def slowread(f, nbytes):
    # yield (due, line) tuples, reading at most nbytes worth of lines at a time
    while True:
        ilines = f.readlines(nbytes)
        if not ilines:
            break
        for line in ilines:
            yield int(line.split(',')[2]), line

def dosort(filename, result):
    partition = 5000
    with open(filename, "r") as ifile:
        with open(result, "w") as ofile:
            tempfiles = []
            while True:
                ilines = ifile.readlines(partition)
                if len(ilines) == 0:
                    break
                tfile = tempfile.TemporaryFile(mode="w+")
                tfile.writelines(
                    sorted(ilines, key=lambda x: int(x.split(',')[2])))
                tfile.seek(0)
                tempfiles.append(tfile)
            lentempfiles = len(tempfiles)
            read_generators = [slowread(tfile, partition // (lentempfiles + 1))
                               for tfile in tempfiles]
            res = []
            for line in heapq.merge(*read_generators):
                res.append(line[1])
                if len(res) > 100:
                    ofile.writelines(res)
                    res.clear()
            if res:
                ofile.writelines(res)
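heapq.merge is easy to see in isolation; it lazily merges already-sorted iterables into one sorted stream:

import heapq

evens = iter([0, 2, 4, 8])
odds = iter([1, 3, 5, 7])
print(list(heapq.merge(evens, odds)))  # [0, 1, 2, 3, 4, 5, 7, 8]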

Once we exhaust the memory of a single computer, or the files are distributed over multiple computers on a network, the file-based algorithm will not work. We will need to sort incoming streams from upstream servers and send a single sorted stream downstream. If we look at the following code carefully, we will see that the underlying mechanism has not changed. We are still using heapq.merge to merge elements, but now we get the elements from the network instead. The client code is simple: it sends sorted lines one by one whenever it receives the next command from the downstream server:

import socket
import sys

from sort2 import dosort2

HOST = '127.0.0.1'
PORT = 9002
NCLIENTS = 2

class Client(object):
    def __init__(self, HOST, PORT, filename):
        self.skt = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self.skt.connect((HOST, PORT))
        self.filename = filename
        self.skt.setblocking(True)

    def run(self):
        for line in dosort2(self.filename):
            print("for", line)
            data = self.skt.recv(1024)
            print("data cmd", data)
            if data == b'next\r\n':
                data = None
                self.skt.send(line[1].encode())
            else:
                print("got from server", data)
        print("closing socket")
        self.skt.close()

c = Client(HOST, PORT, sys.argv[1])
c.run()
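The dosort2 generator imported from sort2 is not shown in this listing. A plausible sketch, assuming it is the external merge sort above reworked to yield (due, line) tuples instead of writing a result file (the names and partition size are hypothetical):

# sort2.py -- hypothetical sketch; the real dosort2 is not shown in the text
import tempfile
import heapq

def slowread(f, nbytes):
    # same helper as in the merge-sort example above
    while True:
        ilines = f.readlines(nbytes)
        if not ilines:
            break
        for line in ilines:
            yield int(line.split(',')[2]), line

def dosort2(filename, partition=5000):
    tempfiles = []
    with open(filename) as ifile:
        while True:
            ilines = ifile.readlines(partition)
            if not ilines:
                break
            tfile = tempfile.TemporaryFile(mode="w+")
            tfile.writelines(sorted(ilines, key=lambda x: int(x.split(',')[2])))
            tfile.seek(0)
            tempfiles.append(tfile)
    readers = [slowread(t, partition // (len(tempfiles) + 1)) for t in tempfiles]
    for due, line in heapq.merge(*readers):    # yield instead of writing a file
        yield due, line

This shape matches the client, which consumes (due, line) pairs and sends line[1] over the socket.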

In the server code, the ClientConn class abstracts away the network operations and presents an iterator interface to heapq.merge: its get_next method pulls the next line from a client. The code could be enhanced greatly with buffering, but this simple abstraction already solves a big problem:

import socket
import heapq
from collections import deque

HOST = '127.0.0.1'
PORT = 9002
NCLIENTS = 2

class Empty(Exception):
    pass

class ClientConn(object):
    def __init__(self, conn, addr):
        self.conn = conn
        self.addr = addr
        self.buffer = deque()
        self.finished = False
        self.get_next()

    def __str__(self):
        return '%s' % (str(self.addr))

    def get_next(self):
        print("getting next", self.addr)
        self.conn.send(b"next\r\n")
        try:
            ndata = self.conn.recv(1024)
        except Exception as e:
            print(e)
            self.finished = True
            ndata = None
        if ndata:
            ndata = ndata.decode()
            print("got from client", ndata)
            self.push((int(ndata.split(',')[2]), ndata))
        else:
            self.finished = True

    def pop(self):
        if self.finished:
            raise Empty()
        else:
            elem = self.buffer.popleft()
            self.get_next()
            return elem

    def push(self, value):
        self.buffer.append(value)

    def __iter__(self):
        return self

    def __next__(self):
        try:
            return self.pop()
        except Empty:
            print("iter empty")
            raise StopIteration

class Server(object):
    def __init__(self, HOST, PORT, NCLIENTS):
        self.nclients = NCLIENTS
        self.skt = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self.skt.setblocking(True)
        self.skt.bind((HOST, PORT))
        self.skt.listen(1)

    def run(self):
        self.conns = []  # list of all connected clients
        while len(self.conns) < self.nclients:  # accept clients till we have all
            conn, addr = self.skt.accept()
            cli = ClientConn(conn, addr)
            self.conns.append(cli)
            print('Connected by', cli)

        with open("result", "w") as ofile:
            for line in heapq.merge(*self.conns):
                print("output", line)
                ofile.write(line[1])

s = Server(HOST, PORT, NCLIENTS)
s.run()

Summary

In this chapter, we learned about the data structures available in the Python standard library and in some third-party libraries, which are extremely useful for everyday programming. Knowing how data structures behave is very important in choosing the right tool for the job. The choice of algorithm is highly application-specific, and we should always try to find a solution that is simple to read.

In the next chapter, we will cover design patterns, which provide great help in writing elegant solutions to problems.