Choosing the right data structure and algorithm matters for almost every programming task. Even in a high-level language such as Python, the choice is still crucial for time and space complexity. In this post, I’ll summarize Python’s built-in data structures and algorithms as a recap and reference. There are far more details than one post can cover, so please refer to Python’s official documentation.

Most of the built-in data structures in Python are sequential, which is also the most basic and important kind of structure.

The first logo of Python, designed by Guido's brother Just van Rossum

list

List might be the most frequently used sequential data structure in Python. It is fast, versatile and easy to use.

  • mutable
  • $O(1)$ for accessing random data element
  • $O(1)$ for tail operation (append, pop)
  • $O(n)$ for head and middle insert and delete
  • $O(n)$ for membership testing (in operation)
  • more space usage than tuple
  • auto-resizing
  • suitable to be used as Stack (LIFO) structure

Normally, we use list to store homogeneous data, but this is not mandatory; it’s just good programming style.
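
The Stack (LIFO) use mentioned in the bullets above needs nothing beyond append and pop, both $O(1)$ at the tail. A minimal sketch:

```python
# A plain list as a LIFO stack: push with append, pop from the tail.
stack = []
stack.append('a')   # push
stack.append('b')
stack.append('c')

top = stack.pop()   # pop returns the most recently pushed element
print(top)          # 'c'
print(stack)        # ['a', 'b']
```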

One common pitfall is to change a list while iterating it:

>>> a = [1,2,3,4,5]
>>> for i in a:  # quick fix: iterate over a copy, a[:]
...   print(i)
...   a.remove(i)
... 
1
3
5
>>> a
[2, 4]

Under the hood, a list is a contiguous block of memory, which is how it achieves $O(1)$ random element access. For the same reason, inserting or deleting at the head or in the middle has to shift memory, making those $O(n)$ operations.

tuple and namedtuple

Tuple is not just an immutable version of list. Tuple is best suited for record data, such as rows retrieved from a database, and such data is normally heterogeneous.

  • immutable
  • $O(1)$ for accessing random data element
  • $O(n)$ for membership testing
  • less space usage than list
  • faster than list for constant sequential data

Namedtuple needs to be imported from the collections module; each element can be accessed as an attribute, similar to a column name of a database table.

>>> from collections import namedtuple
>>> rec = namedtuple('datarow',('name','age','salary'))
>>> xinlin = rec('xinlin', 99, 100.123)     
>>> xinlin.name, xinlin.age, xinlin.salary
('xinlin', 99, 100.123)

bytearray

We can treat bytearray as a special type of list which is designed only to hold raw byte data.

  • like list, mutable, only hold raw byte data, 0 – 255
  • memory efficient design
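
A quick sketch of bytearray’s list-like mutability, and of its 0–255 value constraint:

```python
# bytearray is mutable and holds integers in the range 0..255.
ba = bytearray(b'abc')
ba[0] = 65              # in-place mutation (65 is ASCII 'A')
ba.append(33)           # grows like a list (33 is ASCII '!')
print(ba)               # bytearray(b'Abc!')

try:
    ba[0] = 256         # values outside 0..255 are rejected
except ValueError as e:
    print(e)
```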

bytes

  • like tuple, immutable, only hold raw byte data, 0 – 255
  • memory efficient design
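
And the tuple-like counterpart: bytes supports the same reading operations but rejects mutation.

```python
# bytes behaves like an immutable bytearray.
b = bytes([72, 105])
print(b)            # b'Hi'

try:
    b[0] = 74       # immutable: item assignment raises TypeError
except TypeError as e:
    print(e)
```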

array

Array is less frequently used in Python due to the powerful and handy list. Compared to list, array is specially designed to store only homogeneous numeric data.

  • like list, mutable
  • store homogeneous numeric data, specify data type while initiating
  • memory efficient design
  • easy to interact with C API
  • if store byte data, choose bytes or bytearray instead!
>>> import array
>>> ai = array.array('i',[1,2,3,4,5])  # i: int
>>> ad = array.array('d',[1,2,3,4,5])  # d: double-precision float
>>> ai
array('i', [1, 2, 3, 4, 5])
>>> ad
array('d', [1.0, 2.0, 3.0, 4.0, 5.0])

Another reason array is less often used is NumPy’s ndarray, which is so powerful and prevalent. The main benefit of array may be that it is a built-in module, requiring no third-party dependency.

str

In Python, we don’t have a char type, only str (string). A char can be represented by a str object of length 1.

  • $O(1)$ for accessing any single char
  • $O(n)$ for membership testing
  • rich and handy string operations
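
A small sketch of those properties: $O(1)$ indexing, $O(n)$ membership testing, and a couple of the many handy string methods.

```python
s = 'hello world'
print(s[0])                  # O(1) indexing: 'h'
print('world' in s)          # O(n) membership testing: True
print(s.upper())             # 'HELLO WORLD'
print('-'.join(s.split()))   # 'hello-world'
```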

deque

When we frequently need to insert or delete elements at the head of a sequential structure, deque might be the data structure to employ. Deque (pronounced “deck”) is the Double-Ended Queue, specially designed and optimized for operations at both ends.

  • mutable
  • $O(1)$ for both end operations
  • $O(n)$ for middle insert or delete
  • $O(n)$ for membership testing
  • $O(n)$ for randomly accessing by index
  • supports both auto-resizing and fixed-size (maxlen) modes
  • thread-safe append and pop on both ends
  • could be used as Stack (LIFO) and Queue (FIFO) structure

Under the hood, deque is a doubly-linked list (of blocks), which is how it achieves $O(1)$ operations at both ends. But don’t forget that accessing a random element in a deque is an $O(n)$ operation, for the same reason.

Unlike list, we can optionally initialize a deque object with a parameter called maxlen. With this parameter, the deque object acts like a rolling window.

>>> from collections import deque
>>> dq = deque(maxlen=3)
>>> dq.append(1)
>>> dq.append(2)
>>> dq.append(3)
>>> dq.append(4)
>>> dq
deque([2, 3, 4], maxlen=3)
>>> dq.appendleft(1)
>>> dq
deque([1, 2, 3], maxlen=3)

If a deque is initialized without maxlen, it behaves like a standard double-ended, auto-resizing queue.
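
The Queue (FIFO) use from the bullets above is a minimal combination of append and popleft, both $O(1)$:

```python
from collections import deque

# FIFO queue: enqueue at the right, dequeue from the left.
queue = deque()
queue.append('job1')
queue.append('job2')
queue.append('job3')

first = queue.popleft()
print(first)         # 'job1'
print(list(queue))   # ['job2', 'job3']
```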

set

Set is an unordered collection of unique hashable elements.

  • $O(1)$ for adding or removing element
  • $O(1)$ for membership testing
  • not support random element accessing
  • lots of mathematical set operations (such as intersection, union and difference)

One common use case of set is to remove duplications:

>>> a = [1,2,3,2,1,2,3,4,5,4,3,2,1]
>>> a = list(set(a))
>>> a
[1, 2, 3, 4, 5]

Hashable objects are those whose hash values never change during their lifetime. Immutable objects are all hashable; mutable ones are not. User-defined objects are hashable by default, as long as the __eq__ and __hash__ methods are not overridden.

Hashability is just the other side of element uniqueness: the hash is what lets a set detect duplicates in $O(1)$.

frozenset

The immutable version of set. It is hashable, so it can be an element of another set or a key in a dict object.
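
A minimal sketch of both roles, which a plain set cannot fill:

```python
# frozenset is hashable, so it can go where set cannot:
fs = frozenset({1, 2})

nested = {fs, frozenset({3})}      # an element of another set
lookup = {fs: 'pair 1-2'}          # a dict key
print(lookup[frozenset({2, 1})])   # 'pair 1-2' (element order is irrelevant)
```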

dict

Dict is a hash-mapping structure of key-value pairs. It’s incredibly powerful and prevalent inside Python.

  • mutable
  • $O(1)$ for adding or removing KV pairs
  • $O(1)$ for accessing by key (hashable)
  • $O(1)$ for membership testing
  • preserving inserting order (since 3.7, same as built-in OrderedDict from collections module)

An example of the feature of preserving inserting order:

>>> ops = { 
...     'step1': 'open the door of refrigerator',
...     'step2': 'push the elephant inside',
...     'step3': 'close the door',
... }
>>> for k,v in ops.items():  # iterating by inserting order
...     print(k, ':', v)
... 
step1 : open the door of refrigerator
step2 : push the elephant inside
step3 : close the door

Two caveats when leveraging dict’s insertion-order preservation:

  1. updating a key does not change its order
  2. deleting a key and re-inserting it moves it to the end
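
Both caveats in a minimal sketch:

```python
d = {'a': 1, 'b': 2, 'c': 3}

d['a'] = 100                 # caveat 1: updating a key keeps its position
print(list(d))               # ['a', 'b', 'c']

del d['a']
d['a'] = 100                 # caveat 2: delete + re-insert moves it to the end
print(list(d))               # ['b', 'c', 'a']
```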

Counter

The Counter class, found in the collections module, is a specialized subclass of dict designed for quickly and efficiently counting hashable objects. We could say this is a built-in counting algorithm in Python.

>>> from collections import Counter
>>> a = 1,2,3,4,3,2,2,3,4,5,6,5,4,3,2
>>> count = Counter(a)
>>> count
Counter({2: 4, 3: 4, 4: 3, 5: 2, 1: 1, 6: 1})
>>> count.most_common()
[(2, 4), (3, 4), (4, 3), (5, 2), (1, 1), (6, 1)]
>>> count.most_common(3)
[(2, 4), (3, 4), (4, 3)]

sort (Timsort)

The built-in sort algorithm in Python is very fast (C speed) and stable.

Stable means the relative order of equal elements is preserved after sorting. It’s a critical feature in many scenarios.

We sometimes call Python’s built-in sort algorithm Timsort, in honor of its author Tim Peters. It’s a hybrid algorithm combining insertion sort and merge sort. Merge sort has a nice time complexity of $O(n\log{n})$. Insertion sort is super fast when the sequence is partially sorted; its best case is $O(n)$. The basic idea of Timsort is to take advantage of already-sorted subsequences, called runs, and merge runs instead of single elements. The time complexity of Timsort is still $O(n\log{n})$, but with a smaller hidden coefficient in most real cases.

There are two ways to invoke Timsort: sorted and list.sort.
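
A small sketch of both interfaces, and of stability: records with equal keys keep their original relative order.

```python
# Stability: ('amy', 1) stays before ('dan', 1), ('bob', 2) before ('cat', 2).
rows = [('bob', 2), ('amy', 1), ('cat', 2), ('dan', 1)]

by_num = sorted(rows, key=lambda r: r[1])   # returns a new sorted list
print(by_num)   # [('amy', 1), ('dan', 1), ('bob', 2), ('cat', 2)]

rows.sort(key=lambda r: r[1])               # same algorithm, in place
print(rows == by_num)                       # True
```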

bisect (Binary Search)

As the name bisect implies, this module provides the famous efficient binary search algorithm. Python’s built-in in operation has $O(n)$ time complexity on any sequential structure and $O(1)$ on hashing structures (set and dict). In between, bisect provides an $O(\log{n})$ search algorithm on a sorted sequence.

Basic usage of the bisect module:

>>> from bisect import bisect_left, bisect
>>> a = [1,2,3,3,4,5,6,7,8]  
>>> bisect(a, 3)  # bisect right
4
>>> bisect_left(a, 3)
2
>>> bisect(a, 7)     
8
>>> bisect_left(a, 7)
7

Keep in mind:

  • bisect returns the index i just to the right of the matching element(s) e: if it returns i, every element in a[i:] is larger than e.
  • bisect_left returns the index i just to the left of the matching element(s) e: if it returns i, every element in a[:i] is smaller than e.
  • Time complexity is $O(\log{n})$ for both bisect and bisect_left.
  • They only apply to sorted sequences. No error is raised if applied to an unsorted sequence; programmers need to ensure this themselves.
  • A key parameter is provided (since Python 3.10), like the sorted interface.

Module bisect also offers us two insert methods:

>>> from bisect import insort_left, insort
>>> a = ['a','b','c1','c2','d']
>>> insort(a, 'c0', key=lambda x:x[0])  # insort right
>>> a
['a', 'b', 'c1', 'c2', 'c0', 'd']
>>> insort_left(a, 'c3', key=lambda x:x[0])
>>> a
['a', 'b', 'c3', 'c1', 'c2', 'c0', 'd']

Pay attention:

  • Both insort and insort_left are $O(n)$ operations, just like inserting into the middle of a list.
  • Inserting with insort or insort_left keeps the order stable, even for equal elements: insort inserts after any equal elements, insort_left before them, so nothing lands in between elements of the same value.
  • They only apply to mutable, sorted sequences.

heapq (Priority Queue)

Python’s heapq module is an implementation of the priority queue algorithm. The heap is stored in a plain list as a min-heap, which means the top element (heap[0]) is always the smallest one.

  • heapq.heapify(x): $O(n)$ in-place linear operation; x must be a list; the default __lt__ comparison is used.
  • heapq.heappush(heap, newitem): $O(\log{n})$
  • heapq.heappop(heap): $O(\log{n})$; raises IndexError if the heap is empty. If you only need to read the top element, don’t pop — read heap[0].
  • heapq.heappushpop(heap, newitem) is faster than a heappush followed by a heappop, but the returned element might be the one just pushed.
  • heapq.heapreplace(heap, newitem) is faster than a heappop followed by a heappush. It pops first, so it cannot be applied to an empty heap, and the returned element is guaranteed to come from the heap as it was before the push.

Since the heapq module is implemented on top of a list, how can push and pop reach $O(\log{n})$ time complexity?

On push, there is no middle insertion: the new item is appended to the end of the list and then sifted up. On pop, the last element is swapped into index 0 and then sifted down.
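
A minimal sketch of the core heapq interfaces:

```python
import heapq

data = [5, 1, 4, 2, 3]
heapq.heapify(data)          # O(n), in place; data[0] is now the smallest
print(data[0])               # 1

heapq.heappush(data, 0)      # O(log n): append to the tail, then sift up
print(heapq.heappop(data))   # 0; O(log n): swap tail to index 0, sift down
print(heapq.heappop(data))   # 1
```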

dataclass

Python’s dataclass decorator gives us a convenient way to create a class holding various data, like a struct in C. A bunch of boilerplate methods are added automatically as well, such as __init__, __repr__ and __eq__.

from dataclasses import dataclass

@dataclass
class Point:
    x: int
    y: int = 0

p1 = Point(1, 2)
p2 = Point(1, 2)
p3 = Point(3)

print(p1)         # Output: Point(x=1, y=2)
print(p1 == p2)   # Output: True
print(p1 == p3)   # Output: False
print(p1.x)       # Output: 1

functools

  • functools.lru_cache

Python’s functools.lru_cache is a decorator that provides a Least Recently Used (LRU) cache for function calls. It’s used for memoization: it stores the results of expensive function calls and returns the cached result when the same inputs occur again, avoiding redundant computation. When the cache is full, the least recently used entry is evicted to make room for a new one. It is well suited to functions that do lots of computation.
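
The classic demonstration is a recursive Fibonacci function, which goes from exponential to linear time with memoization:

```python
from functools import lru_cache

@lru_cache(maxsize=None)     # unbounded cache: memoize every distinct argument
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(30))                       # 832040, computed without redundant recursion
print(fib.cache_info().hits > 0)     # True: repeated subproblems hit the cache
```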

itertools

A few handy iteration algorithms in the itertools module.

  • itertools.chain

Iterate over multiple iterables as one continuous chain:

>>> import itertools
>>>
>>> a = 1,2
>>> b = [3]
>>> c = (i for i in range(4,6))
>>> for i in itertools.chain(a,b,c):
...   print(i)
...
1
2
3
4
5
  • itertools.combinations

Generate all combinations by specifying items and combination length:

>>> for c in itertools.combinations('abcd',3):
...   print(c)
...
('a', 'b', 'c')
('a', 'b', 'd')
('a', 'c', 'd')
('b', 'c', 'd')
  • itertools.permutations

Generate all permutations by specifying items and permutation length:

>>> for p in itertools.permutations('abc',2):
...   print(p)
...
('a', 'b')
('a', 'c')
('b', 'a')
('b', 'c')
('c', 'a')
('c', 'b')

Solve Combination and Permutation by Backtracking Algorithm

  • itertools.product

Generate Cartesian Product:

>>> for p in itertools.product('ab','12'):
...   print(p)
...
('a', '1')
('a', '2')
('b', '1')
('b', '2')
  • itertools.groupby

Generate groups based on a user-defined criterion (don’t confuse it with the GROUP BY clause in SQL — groupby only groups consecutive elements):

>>> a = 'a','ab','b','abc','c','bcd','acd'
>>>
>>> # groupby length
>>> result = {}
>>> for k,g in itertools.groupby(a,len):
...   result.setdefault(k,[]).append(tuple(g))
...
>>> result
{1: [('a',), ('b',), ('c',)], 2: [('ab',)], 3: [('abc',), ('bcd', 'acd')]}
>>>
>>> # groupby first letter
>>> result = {}
>>> for k,g in itertools.groupby(a,lambda x:x[0]):
...   result.setdefault(k,[]).append(tuple(g))
...
>>> result
{'a': [('a', 'ab'), ('abc',), ('acd',)], 'b': [('b',), ('bcd',)], 'c': [('c',)]}

Speedup Tricks

  • Comprehension is Faster Than Loops

Comprehensions construct a list, set or dict. The concise syntax gives the CPython interpreter a chance to implement the loop at the C level: the overhead of repeatedly calling append or add for each element disappears, and CPython can pre-allocate enough memory for the whole resulting object.

$ python -m timeit -p 'a = [i for i in range(10000)]'
1000 loops, best of 5: 346 usec per loop
$ python -m timeit -p -s 'a=[]' 'for i in range(10000):  a.append(i)'
100 loops, best of 5: 733 usec per loop

$ python -m timeit -p 'a = {i:i for i in range(10000)}'
500 loops, best of 5: 654 usec per loop
$ python -m timeit -p -s 'a={}' 'for i in range(10000):  a[i]=i'
500 loops, best of 5: 777 usec per loop
  • Avoid Function Calling while Initializing Objects

Unless you have to:

$ python -m timeit -p 'a=[]'
10000000 loops, best of 5: 28.7 nsec per loop
$ python -m timeit -p 'a=list()'
5000000 loops, best of 5: 83.3 nsec per loop

$ python -m timeit -p 'a=()'
20000000 loops, best of 5: 15 nsec per loop
$ python -m timeit -p 'a=tuple()'
5000000 loops, best of 5: 45 nsec per loop

$ python -m timeit -p 'a={}'
10000000 loops, best of 5: 29.4 nsec per loop
$ python -m timeit -p 'a=dict()'
5000000 loops, best of 5: 77.4 nsec per loop
  • Using Tuple as Constant Data Structure

Initializing a tuple is cheaper than an equivalent list:

$ python -m timeit -p 'a = 1,2,3,4'
20000000 loops, best of 5: 14.8 nsec per loop
$ python -m timeit -p 'a = [1,2,3,4]'
5000000 loops, best of 5: 54.9 nsec per loop

OK… Happy coding with Python!