Lazy sorting in python

A common need when working with data, is to find the top 10 in a larger list. An easy solution is to use the built-in sort function to sort the entire list, and then just take the first 10 items. While this works, it feels wasteful. Other than knowing what is the tail, you’re not interested in the tail of the list at all.

January 2017, Bora M. Alper posted his take on this problem: `lazysort`. Using python generators, he sorted the list using dropsort as needed, returning just the part of the list that was requested. While Bora’s idea is quite neat, sadly his benchmark showed that it couldn’t actually compete with using python’s built-in sort function. This is not very surprising, cpython’s `sorted` function uses the state-of-the-art TimSort algorithm, and it is implemented in optimized C.

But his approach gave me inspiration. While I feel dropsort is sort of a gimmick, Bora might actually be on to something here. During my time as a game developer, I built an early-out version of quicksort, which returned as soon as the head of the list was sorted. We should be able to build a something similar using python generators.

A generator-based quicksort could look like this:

```def partition(unsorted):
sublist_length = len(unsorted)
if sublist_length <= 1:
return unsorted, ()

pivot = unsorted[0]
lhs = []
rhs = []

for idx in range(1, sublist_length):
finger = unsorted[idx]
if finger < pivot:
lhs.append(finger)
else:
rhs.append(finger)

lhs.append(pivot)
return lhs, pivot, rhs

def quicksort(unsorted):
if len(unsorted) < 2:
yield from unsorted
else:
lhs, rhs = partition(unsorted)
yield from quicksort(lhs)
del lhs  # release memory used by lhs
yield pivot
yield from quicksort(rhs)
```

For the uninitiated: quicksort works by splitting lists in two parts: one list with all items smaller than some pivot value, one with all items larger than that pivot value, and then repeating that procedure on both lists. Splitting the lists is the task of the partition function.

When comparing my implementation to lazy sort and the built-in `sorted` function, you can clearly see that while Bora’s `lazysort` can’t compete, his idea is sound. (This benchmark was based on retrieving the top 10 from a list of integers).

The chart is dominated by `lazysort`, obscuring the fact that partial quicksort is indeed faster than the built-in `sorted`.

Update /u/imps-p0155 mentioned in his comment that using `heapq.heapify` would provide even better performance. He’s absolutely right:

Further Analysis

Now that we know we can beat `sorted`, let’s see why this is. A lazy sorting algorithm only has benefit when it can decide not to do some work. In quicksort’s case, not having to sort the right partition several times in a row is a huge benefit. But where’s the cut-off when sorting items?

If you just need the first few percent, lazy `quicksort` can help out, but it has nothing on using `heapq`, which breaks even at ~45%, and has a worst case (100% iteration) premium of only 200% over `sorted`. Conclusion: while partial quicksort is an interesting gimmick, it has no added value over `heapq`.

Possible performance improvements

I’m doubtful optimization of `quicksort` will make it compete with `heapq`. Partial `quicksort`s best case (only retrieving one item) still requires performing `log(n)` partitions; O(n log(log(n)) performance, and it’s worst case (retrieving all items) is still O(n log(n)). The `heapq` based solution has a generalized performance of O(n + m log(n)), which is always better than `quicksort`.

Possible feature improvements

Currently support for `sorted`s `key` and `reverse` parameters is lacking, but those are trivial to add.

Lazy Quicksort could fake an list-like interface. The current implementation only sorts the head, but the principle applies to any part of the sorted list. This is something you can’t achieve with heapq, which can only operate at the end of the lists. This could speed up computation of median values, and percentiles, although with my test data, it will probably only outperform `heapq` for the 45-55 percentile, which is a really narrow niche.

Code

All code used for blog-post is on github

Avatars for non-users in Django

In Internet discussions, showing user icons really helps to show who’s saying what. Humans need more time to read a users name than they need to recognize a picture. As the old saying goes: a picture is worth a thousand words - that’s quite an impressive compression ratio.

Many ...

Using filesystem transport with Celery

Celery is an asynchronous job queue. It is used to build distributed applications — application which run on (potentially) multiple hosts, but it is also useful if your application runs on a single host.

Transports

Celery distributes and schedules work by passing messages around through an AMQP broker such as RabbitMQ ...

Java vs Python

I’ve been asked to compare Python to Java quite a few times now, far more often than other combinations of languages. While most of these questions came from trolls, some of them were sincere. In one instance, business decisions were made based on my recommendation.

If anything ...