It’s very important to design or choose the proper data structure and algorithm for various programming tasks. Even when we are using high-level programming language such as Python, it’s still crucial in terms of time and space complexity. In this post, I’ll try to summary the built-in data structures and algorithms in Python as a recap and reference. There are so many details which couldn’t be covered in this post, please refer to Python’s official documents.

Most of the built-in data structures in Python are sequencial, which is also the most basic and important structure while programming. Memory is sequential.

The first logo of Python, designed by Guido's brother Just van Rossum

The first logo of Python, designed by Guido’s brother Just van Rossum

list

List might be the most frequently used sequencial data structure in Python. It is fast, versatile and easy to use.

  • mutable
  • $O(1)$ for random accessing
  • $O(1)$ for tail operation (append, pop)
  • $O(n)$ for inserting and deleting at head or middle
  • $O(n)$ for membership testing (in operation)
  • more space usage than tuple
  • auto-resizing
  • suitable to be used as Stack (LIFO) structure

Normally, we use list to store homogeneous data, but this is not mandatory. But it’s a good programming style. One common pitfall about list is to change it while iterating:

>>> a = [1,2,3,4,5]
>>> for i in a:  # quick fix, use a[:] or a.copy() (shallow copy)
...   print(i)
...   a.remove(i)
... 
1
3
5
>>> a
[2, 4]

Under list structure is a piece of continuous memory so that it could achieve $O(1)$ for random element accessing. Due to the same reason, insert or delete at the head or in the middle somewhere needs to move memory and they are $O(n)$ operations.

tuple and namedtuple

Tuple is not just an immutable version of list. Tuple is best to for record data, such as rows retrieved from database table. And these data are normally heterogeneous.

  • immutable
  • $O(1)$ for random accessing
  • $O(n)$ for membership testing
  • less space usage than list
  • faster than list for constant sequential data

Namedtuple needs to be imported from collections module, and each element in namedtuple could be accessed by a property name.

>>> from collections import namedtuple
>>> rec = namedtuple('datarow',('name','age','salary'))
>>> xinlin = rec('xinlin', 99, 100.123)     
>>> xinlin.name, xinlin.age, xinlin.salary
('xinlin', 99, 100.123)

Nuances between standard tuple and namedtuple:

  • standard tuple is a hard-coded C structure, namedtuple is an imported class
  • create a standard tuple is faster than that of a namedtuple
  • namedtuple could access element by name, which is more readable and maintainable but a little slower than accessing by index (a getattr call to lookup attribute)
  • memory footprint for both are nearly the same
  • namedtuple only store field names in definition, which makes namedtuple much more memory-efficient compared to dict object

bytearray

We can treat bytearray as a special type of list which is designed only to hold raw byte data. (List objects hold references of other objects.)

  • list-like, mutable, only hold raw byte data, 0 – 255
  • memory efficient design

bytes

  • tuple-like, immutable, only hold raw byte data, 0 – 255
  • memory efficient design

array

Array is less frequently used in Python due to the powerful and handy list. Compared to list, array is specially designed to store only homogeneous numeric data.

  • list-like, mutable
  • store homogeneous numeric data, specify data type while initialization
  • memory efficient design
  • easy to interact with C API
  • if storing byte data, consider bytes or bytearray instead!
>>> import array
>>> ai = array.array('i',[1,2,3,4,5])  # i: int
>>> ad = array.array('d',[1,2,3,4,5])  # d: double-precision float
>>> ai
array('i', [1, 2, 3, 4, 5])
>>> ad
array('d', [1.0, 2.0, 3.0, 4.0, 5.0])

Another reason why array is less often used is due to NumPy’s ndarray, which is so powerful and prevalent! The main benefits of array might be the fact that it’s a built-in module.

str

In Python, we don’t have char type, only str (string). Char could be represented by str object with length 1. Str is actually another sequential data structure used to store and manipulate unicode characters.

  • tuple-like, immutable, store unicode
  • $O(1)$ for random accessing
  • $O(n)$ for membership testing
  • rich and handy string operations

Now, we start introducing sequential structure with non-continuous memory!

deque

When we need frequently insert or delete elements at the head of a sequential structure, deque might be the data structure we should employ. Deque (pronounced “deck”) is the Double-Ended Queue, which is implemented doubly-linked blocks. It is specially designed and optimized for both end operations.

  • mutable
  • $O(1)$ for both end operations
  • $O(n)$ for random accessing
  • $O(n)$ for inserting and deleting in the middle
  • $O(n)$ for membership testing
  • auto-resizable or fixed-size
  • could be used as Stack (LIFO) and Queue (FIFO) structure

Under deque structure is linked-list so that it could achieve $O(1)$ for both end operations. But don’t forget accessing random element in deque is a $O(n)$ operation due to the same reason. Unlike list, we could optionally initiate a deque object with a parameter called maxlen. With this parameter, the deque object would act like a rolling window.

>>> from collections import deque
>>> dq = deque(maxlen=3)
>>> dq.append(1)
>>> dq.append(2)
>>> dq.append(3)
>>> dq.append(4)
>>> dq
deque([2, 3, 4], maxlen=3)
>>> dq.appendleft(1)
>>> dq
deque([1, 2, 3], maxlen=3)

If initiating deque object without maxlen, it just behaves like a standard double-ended auto-resizable queue.

queue

In multi-threading programs, queue structure is often used to store shared data. Python has a bulit-in queue module in which there are a few very basic and powerful queue implementations could be used in multi-threading scenarios. They are SimpleQueue (FIFO), Queue (FIFO), LifoQueue and PriorityQueue (min-heap).

multiprocessing.Queue

In multiprocessing module, there is a Queue class which is nearly a clone of queue.Queue but is designed to be used in multiprocessing scenarios.

set

Set is a group of unordered unique hashable elements.

  • $O(1)$ for adding or removing element
  • $O(1)$ for membership testing!

One common use case of set is to remove duplications:

>>> a = [1,2,3,2,1,2,3,4,5,4,3,2,1]
>>> a = list(set(a))
>>> a
[1, 2, 3, 4, 5]

Hashable objects are those whose hash values would not be changed during their life time. Immutables are all hashable. Mutables are not. User-defined objects by default are hashable if __eq__ and __hash__ methods are not overrided. Hashable is a feature of unique immutable element.

Set objects support standard mathematical set operations:

>>> a = set('abcde')
>>> b = set('bcdef')
>>>
>>> a | b  # union, in a or b
{'a', 'c', 'f', 'd', 'b', 'e'}
>>> a & b  # intersection, in both
{'b', 'c', 'd', 'e'}
>>> a - b  # in a, but not in b
{'a'}
>>> a ^ b  # in a or b, but not in both
{'f', 'a'}

frozenset

Immutable version of set, hashable, could be an element in a set or as a key in dict object.

dict

Dict is a hash-mapping structure for unordered key-value pairs. It’s incredibly powerful and prevalent inside Python.

  • mutable
  • $O(1)$ for adding or removing KV pairs
  • $O(1)$ for accessing by key (hashable)
  • $O(1)$ for membership testing
  • preserving inserting order (since 3.7, same as built-in OrderedDict from collections module)

An example of the feature of preserving inserting order:

>>> ops = { 
...     'step1': 'open the door of refrigerator',
...     'step2': 'push the elephant inside',
...     'step3': 'close the door',
... }
>>> for k,v in ops.items():  # iterating by inserting order
...     print(k, ':', v)
... 
step1 : open the door of refrigerator
step2 : push the elephant inside
step3 : close the door

Two caveats for leveraging the feature of preserving inserting order of dict object:

  1. updating a key does not change its order
  2. deleting a key and re-inserting it moves it to the end

Counter

The Counter class, found in the collections module, is a specialized subclass of dict designed for quickly and efficiently counting hashable objects. We could say this is a built-in counting algorithm in Python.

>>> from collections import Counter
>>> a = 1,2,3,4,3,2,2,3,4,5,6,5,4,3,2
>>> count = Counter(a)
>>> count
Counter({2: 4, 3: 4, 4: 3, 5: 2, 1: 1, 6: 1})
>>> count.most_common()
[(2, 4), (3, 4), (4, 3), (5, 2), (1, 1), (6, 1)]
>>> count.most_common(3)
[(2, 4), (3, 4), (4, 3)]

sort (Timsort)

The built-in sort algorithm in Python is very fast (C speed) and stable.

Stable means the order of multiple same objects would be preserved after sorting. It’s a critical feature in many scenarios.

We sometimes call the built-in sort algorithm in Python Timsort to honor the author Tim Peter. It’s a hybrid sorting algorithm combined from inserting sort and merge sort algorithms. Merge sort has a nice time complexity as $O(n\log{n})$. Inserting sort is super fast if the sequence is partially sorted. The best time complexity of inserting sort is $O(n)$. The basic idea of Timsort is to take advantage of partially sorted sequences, called runs, merge runs instead of single element. The time complexity of Timsort is still $O(n\log{n})$ but with a smaller hidden coefficient in most real cases.

Two ways to call Timsort algorithm, sorted and list.sort.

reversed

Built-in function, simple reverse the order of sequence. It’s very useful while looping.

As the name bisect implies, this is the very famous and efficient binary search algorithm. Python’s built-in in search operation has $O(n)$ time complexity on any sequential structure, $O(1)$ on hashing structure (set and dict). In between, bisect provides $O(\log{n})$ time complexity search algorithm on sorted sequence.

Basic useage of bisect module:

>>> from bisect import bisect_left, bisect
>>> a = [1,2,3,3,4,5,6,7,8]  
# bisect gives you the index to insert
>>> bisect(a, 3)  # bisect right
4
>>> bisect_left(a, 3)
2
>>> bisect(a, 7)     
8
>>> bisect_left(a, 7)
7
>>> bisect(a, 10)  # this index is out of range
9
>>> bisect(a, 0) 
0
>>> bisect_left(a, 0)
0

Keep in mind:

  • bisect returns the index (i) where is just the right position of the searching element (e). If returns i, each element in a[i:] is larger than e. The returned i might be out of range.
  • bisect_left returns the index (i) where is just the left position of the searching element (e). If returns i, each element in a[:i] is smaller than e.
  • time complexity is $O(\log{n})$ for both bisect and bisect_left.
  • could only be applied to sorted sequence, no error raised if applied to non-sorted sequence.
  • key parameter is provided since Python 3.10, which is super useful.

Module bisect also offers us two insert methods:

>>> from bisect import insort_left, insort
>>> a = ['a','b','c1','c2','d']
>>> insort(a, 'c0', key=lambda x:x[0])  # insort right
>>> a
['a', 'b', 'c1', 'c2', 'c0', 'd']
>>> insort_left(a, 'c3', key=lambda x:x[0])
>>> a
['a', 'b', 'c3', 'c1', 'c2', 'c0', 'd']

Pay attention:

  • Both insort and insort_left are $O(n)$ operations, just like insert something in the middle of a list.
  • After inserting by insort or insort_left, the order keeps, even for those elements with same value. No element would be inserted in between of elements with same values.
  • Could Only be applied to mutable and sorted sequences.

One useful scenario of binary search: You have a sorted sequence, and you need to put all elements into a few ordered bins. Each bin represents a range, and each range might not have the same width, but you could get all edge points. Instead of taking out each element and comparing with all edge points, you could employing binary search algorith to search each edge point in the sorted sequence to get split indices. Once you get all split indices, you’ve successfully separate the sorted sequence into each bin.

heapq (Priority Queue)

Python’s heapq module is an implementation of priority queue algorithm. It is stored in a list and min-heap implmentation, which means the top element ([0]) is the smallest one. It is the base for queue.PriorityQueue.

  • heapq.heapify(x), $O(n)$ in-place linear operation, x must be a list, default comparison __lt__ is used.
  • heapq.push(heap, newitem), $O(\log{n})$
  • heapq.pop(heap), $O(\log{n})$, if heap is empty, raise IndexError. If you only need to read the top element, no need to pop, read heap[0].
  • heapq.heappushpop(heap, newitem) is faster than heapq.push and heapq.pop. But the returned element might be the one just pushed.
  • heapq.heapreplace(heap, newitem) is faster than heapq.pop and heapq.push. Pop first. So, it could not be applied on an empty heap. And the returned element is guaranteed from the original heap before push.

Since the implementation of heap module is based on list, how could push and pop operation reach $O(\log{n})$ time complexity?

When push, it’s not inserting in the middle, the append method is used, and then pop up (sift up). When pop, the last element is swap to the position of index 0, and then heapify down (sift down).

dataclass

Python’s dataclass decorator gives us a convenient way create a class holding various data, like a struct in C. A bunch of boilerplate functions could be automatically added when decorated, such as the default three: __init__, __repr__ and __eq___.

from dataclasses import dataclass

# order=True gives you __lt__, __le__, __gt__ and __ge__.
# frozen=True gives you read-only instance, no assigning operation.
@dataclass(order=True, frozen=True)
class Point:
    x: int  # This is called a field.
    y: int = 0

p1 = Point(1, 2)
p2 = Point(1, 2)
p3 = Point(3)

print(p1)         # Point(x=1, y=2)
print(p1 == p2)   # True
print(p1 == p3)   # False
print(p1.x)       # 1
print(p1 > p3)    # False
# p2.x = 5, raise

When the fields are mutable objects such as list and dict, it’s a little tricky to define:

from dataclasses import field

@dataclass
class xyz:
    # items: list[int] = []  # DO NOT DO THIS !!!
    items: list[int] = field(default_factory=list)
    settings: dict[str,str] = field(default_factory=dict)

copy.deepcopy

The copy method in mutable object performs shallow copy, which means only the references of elements inside these mutable object are copied. If you want to a deep copy, import built-in module copy and call copy.deepcopy.

import copy

a = [[1,2],[3,4]]
b = a.copy()  # shallow copy
c = copy.deepcopy(a)  # deep copy

functools

  • functools.lru_cache

Python’s functools.lru_cache is a decorator that provides a Least Recently Used (LRU) cache for function calls. It’s used for memoization, which means it stores the results of expensive function calls and returns the cached result when the same inputs occur again, avoiding redundant computations. When the memory is used up, the least recently used item would be removed to new memory item. It is suitable for those functions do lots of computation.

zip

Built-in function, very handy while loopings over more than one sequences at the same time.

>>> # transpose a matrix
>>> a = [[1,2,3],[4,5,6],[7,8,9]]
>>> list(zip(*a))
[(1, 4, 7), (2, 5, 8), (3, 6, 9)]

itertools

A few handy iteration algorithms in itertools module.

  • itertools.chain

Iterating more than one iterables in a chain way:

>>> import itertools
>>>
>>> a = 1,2
>>> b = [3]
>>> c = (i for i in range(4,6))
>>> for i in itertools.chain(a,b,c):
...   print(i)
...
1
2
3
4
5
  • itertools.combinations

Generate all combinations by specifying items and combination length:

>>> for c in itertools.combinations('abcd',3):
...   print(c)
...
('a', 'b', 'c')
('a', 'b', 'd')
('a', 'c', 'd')
('b', 'c', 'd')
  • itertools.permutations

Generate all permutations by specifying items and permutation length:

>>> for p in itertools.permutations('abc',2):
...   print(p)
...
('a', 'b')
('a', 'c')
('b', 'a')
('b', 'c')
('c', 'a')
('c', 'b')

Solve Combination and Permutation by Backtracking Algorithm

  • itertools.product

Generate Cartesian Product:

>>> for p in itertools.product('ab','12'):
...   print(p)
...
('a', '1')
('a', '2')
('b', '1')
('b', '2')
  • itertools.groupby

Create groups based on user-defined criterior (don’t be confused with group by clause in SQL):

>>> a = 'a','ab','b','abc','c','bcd','acd'
>>>
>>> # groupby length
>>> result = {}
>>> for k,g in itertools.groupby(a,len):
...     result.setdefault(k,[]).append(tuple(g))
...
>>> result
{1: [('a',), ('b',), ('c',)], 2: [('ab',)], 3: [('abc',), ('bcd', 'acd')]}
>>>
>>> # groupby first letter
>>> result = {}
>>> for k,g in itertools.groupby(a,lambda x:x[0]):
...     result.setdefault(k,[]).append(tuple(g))
...
>>> result
{'a': [('a', 'ab'), ('abc',), ('acd',)], 'b': [('b',), ('bcd',)], 'c': [('c',)]}

Speedup Tricks

  • Comprehension is Faster Than Loops

Comprehension could be used to construct list, set or dict object. They are called list comprehension, set comprehension and dict comprehension respectively. This concise syntax provides CPython interpreter a chance to implement them in C level. The overhead of repeatedly calling append or add methods for each element is removed, and CPython could pre-allocate enough memory to the whole returned object.

$ python -m timeit -p 'a = [i for i in range(10000)]'
1000 loops, best of 5: 346 usec per loop
$ python -m timeit -p -s 'a=[]' 'for i in range(10000):  a.append(i)'
100 loops, best of 5: 733 usec per loop

$ python -m timeit -p 'a = {i:i for i in range(10000)}'
500 loops, best of 5: 654 usec per loop
$ python -m timeit -p -s 'a={}' 'for i in range(10000):  a[i]=i'
500 loops, best of 5: 777 usec per loop
  • Avoid Function Calling while Initializing Common Objects

Unless you have to:

$ python -m timeit -p 'a=[]'
10000000 loops, best of 5: 28.7 nsec per loop
$ python -m timeit -p 'a=list()'
5000000 loops, best of 5: 83.3 nsec per loop

$ python -m timeit -p 'a=()'
20000000 loops, best of 5: 15 nsec per loop
$ python -m timeit -p 'a=tuple()'
5000000 loops, best of 5: 45 nsec per loop

$ python -m timeit -p 'a={}'
10000000 loops, best of 5: 29.4 nsec per loop
$ python -m timeit -p 'a=dict()'
5000000 loops, best of 5: 77.4 nsec per loop
  • Using Tuple to be Constant Data Structure

Initializing a tuple is cheaper than an equal list:

$ python -m timeit -p 'a = 1,2,3,4'
20000000 loops, best of 5: 14.8 nsec per loop
$ python -m timeit -p 'a = [1,2,3,4]'
5000000 loops, best of 5: 54.9 nsec per loop

OK… Happy coding with Python!