CLM_BinaryTrees
TL;DR: This library implements three different binary search trees in C:
- The Classic Binary Search Tree (
bs_tree
functions) - The Red Black Tree (
rb_tree
functions) - The Splay Tree (
sp_tree
functions)
All operations are implemented using top-down, single-pass, iterative
algorithms to avoid using parent pointers, recursion or any kind of
explicit or implicit limit on the size of the trees (other than the available
RAM).
The library uses void pointers rather than macros (at the expense of
a little bit of time and memory) to be easier to understand, modify
and maintain.
Content
Introduction
A Binary Search Tree1 is a versatile linked data structure designed to
store and retrieve data quickly.
They can be used to implement, among others, the Dictionary2, the Set3
and the Priority Queue4 abstract data types.
The basic primitives of a Binary Search Tree are:
Insert(elem)
- Store a new element in the treeRemove(elem)
- Remove an element from the treeSearch(elem)
- Find an element in the tree
These primitives can also be implemented using a Hash Table and, indeed, it
will be more efficient to use a Hash Table in many cases, but at the cost of
providing a good hash function every time you use the data structure (which
is hard) instead of providing just a comparing function (easier in general).
In addition, with a Binary Search Trees you can implement other primitives:
Find_Min()
- Find the smallest element stored in the treeFind_Max()
- Find the biggest element stored in the treeRemove_Min()
- Remove the smallest element stored in the treeRemove_Max()
- Remove the biggest element stored in the treeFind_Next(elem)
- Find the smallest element that is bigger than elemFind_Prev(elem)
- Find the biggest element that is smaller than elem
That cannot be implemented efficiently using a Hash Table and that add a
lot of functionality to this data structure.
Binary Search Trees
There are many variants of the Binary Search Tree data structure.
All of them share the following common assumptions:
- The data is stored in nodes and each node is linked to 2 other nodes
(traditionally called left and right) forming a rooted binary tree. - The elements stored in the tree form a totally ordered set. That means
that given any two of them (A and B) then either A < B, A = B or A > B. - The left subtree of a node contains elements that are smaller than
the element stored in the node. Likewise, the right subtree contains
elements that are bigger than the element stored in the node. - Althought it is not strictly required, it is very common to assume that
that all elements are different, which means that inserting an element
twice will overwrite the information asociated to that element.
As long as we are free to define the contents of the elements stored in the
tree and to provide a custom comparing function, the last 2 points (the
symmetric order property and the uniqueness property respectively) do
not supose any practical limitation:
- We can always define an arbitrary (but consistent) total ordering for
the data stored in a computer. In general it is not even difficult. - We can include a counter field in the data stored in the tree and
increment or decrement it acordingly to implement a multiset.
The efficiency of all operations made in a Binary Search Tree usually depend
on the distance between the element that we are manipulating and the root
node of the tree. On well behaved Binary Search Trees, this distance is
proportional to the (base 2) logarithm of the number of stored elements.
It turns out, however, that the assumptions stated above do not impose a
rigid configuration, so two different Binary Search Trees can contain the
same elements and greatly differ in shape:
2 4
/ \ / \
1 3 2 6
\ / \ / \
4 1 3 5 7
\
5
\
6
\
7
Although both trees contain the same elements and satisfy all the assumptions
stated above, the left tree (rooted at 2) is quite degenerated while the one
on the right (rooted at 4) is perfectly balanced.
If we define the height of a tree as the maximum distance between any of its
nodes and the root node then, a Binary Search Tree is balanced if its
height is proportional to the logarithm of the number of elements stored in
it and all operations will be fast.
In contrast, when a tree is degenerated the height of the tree is
proportional to the number of stored elements and all opeartions will be
painfully slow.
For a tree with 1 million of elements, a linear performance will require
about 1 million of steps whereas a logarithmic performance can accomplish
the same with just 20 steps.
Although a Binary Search Tree will turn out naturally balanced if we
introduce the elements uniformly at random, the fact is that for most
applications of Binary Search Trees we cannot guarantee the required degree
of randomness and it is worth to perform some extra operations to ensure that
the tree remains fairly balanced.
The most common self-balancing Binary Search Tree variants are:
- AVL trees5 are the oldest balanced search trees and the most rigidly
balanced ones. However, there is not any top-down single-pass deletion algorithm
for them which makes them less popular nowadays. - Treaps6 are efficient and relatively easy to implement but require the
generation and storage of pseudo-random numbers in each node. - B-trees7 is a huge family of variants that include the 2-3 trees8
and the 2-3-49 trees as a particular cases. Their main drawback is that they
either use different sized nodes or waste memory using always the biggest node
type. - Red Black trees10 simulate the behavior of 2-3-4 trees by
using binary nodes with and additional field called the color. They are less
rigidly balanced than AVL trees (which makes possible to use top-down
single-pass insertion and deletion algorithms) but tend to perform like
them, making Red Black trees the most popular self-balancing variant
nowadays. - Finally, Weight Balanced trees11 are less known but more versatile than
Red Black trees, however, they require to store the size of each subtree in
every node, which imposes an artificial upper limit in the size of the tree.
With the popularization of the 64 bit computers this is no longer a problem and
because of the memory alignment of modern compilers they use the same memory
as the theoretically more efficient Red Black trees.
All these variants require to store extra information in the nodes of the
tree and to perform some rebalancing operations every time an element is
inserted or removed.
There are other solutions to avoid the problems associated to degeneracy without
storing additional information in the nodes of the tree. Two of the most
interesting ones are:
- Scapegoat trees12 are Binary Search Trees that perform an expensive
rebalancing operation when they detect degeneracy in a subtree.
The amortized cost of such operation is constant and it will only be
applied after a insertion or a deletion. Their only drawback is that
rebalance operation needs to know the size of the rebalanced subtree and
I would rather preffer algorithms that do not explicitly keep track of
the size or the height of the tree. - Splay trees13 are Binary Search Trees that perform a rebalancing
operation called Splaying each time an element is accesed. That means
that all operations (insert, remove, search, findmin, etc…) will be
more expensive than they are in the other Binary Search Tree variants.
In exchange, however, the _Splaying operation will move the accessed
element to the root of the tree, which means that accesing again to this
element will be cheaper. Furthermore, any node that has been accesed
recently or that is close in value to the last accesed element will be
cheaper to find and manipulate. The self-adaptability property of Splay
trees works well with many not so random access sequences that we often
find in real world projects and the compressing nature of the Splaying
operation will ensure that Splaying tree operation run in amortized
logarithmic time which makes them interesting for many aplications.
Finally, Splay tree functions are simpler, cleaner and easier to code.
Sumarizing:
- If you are sure that your insertion and deletion sequences will be random
you should use the simple Binary Search Tree functions because they are
fast and require less memory. - If the operations that you are performing have any kind of temporal
locality (elements accesed recently will be accesed again soon) or
spatial locality (elements similar to elements accesed recently will tend
to be accesed) you should use a Splay tree. - In any other cases you must sacrifice a little bit of extra memory to
gain the security of using a bomb-proof Red Black tree.
This Library
Since the C programming language does not have a standar library of abstract
data types, it is very convenient to use this kind of swiss-army-knife data
structures that are able to implement a wide range of the most commonly used
containers.
It is difficult, however, to find good generic implementations of this data
structure that do not rely on parent pointers (which waste memory) or
recursion (which wastes time and imposes an artificial limit on the amount
of data stored in the tree). As a result I decided to write my own library.
It started as a personal project and, as such, some of the design decisions
made here will not fit your needs. Let me list the objectives that I pursued
while coding this library so you can judge better:
- To write a standalone generic library in C that implements one or more
variants of the Binary Search Tree data structure. - To avoid any explicit or implicit limit on the size of the tree by using
iterative algorithms that do not require this kind of data as input. - To avoid the use of parent pointers by using single-pass top-down
algorithms. - To avoid using hacks or hardware/software dependend tricks that may stop
working, make the library difficult to use or slow down the code in a future. - To keep the code as clear, simple and well documented as possible so I can
use it for many years as the underlying data structure of my future projects. - To be memory and time efficient as long as it does not conflict with the
previously stated objectives. - To provide testing and visualizing tools in the library to help the final
users to understand what is happening inside this data structure and to debug
any modification made in the code.
So, in particular:
- I would rather use void pointers than macros to attain genericness while
keeping the library simple. - The final user of the library will not be required to deal with fingers,
iterators, traversers or callback functions (other than the comparing
function) because increasing the traversing time from O(n) to O(n log n) is
usually not a big deal and the resulting code is simpler and more robust that
way. - Binary Search Trees and Splay trees will have a memory overhead of 3 pointers
(left, right and data) per stored item while Red Black trees will have a
memory overhead of 3 pointers and a char (left, right, data and color).
This is good enough for most projects and quite competitive if you take into
account the amount of functionality provided. - The elements stored in the tree need to be created and destroyed outside
the tree. This allows the user to store the same element in multiple data
structures without wasting memory. This also avoids the mandatory use of
copy and destroy functions. - I can not provide an efficient select function without storing the number
of elements of each subtree. So you either modify the library to store that
information or use the find_min
and find_next
functions to retrieve
the k-th element in O(k log n) time. - I will provide the common Set Functions (Union, Intersection, Difference and
Symmetric Difference) for all tree variants but their efficiency will vary. - I will code all three variants in this library using the same syntax so you
only need to change the prefix of the functions to try another variant:
- Classic Binary Search Tree functions use the
bs_tree
prefix. - Red Black tree functions use the
rb_tree
prefix. - Splay tree functions use the
sp_tree
prefix.
Since you can mix and match Classic Binary Tree functions and Splay tree
funtions I used the same bs_tree
and bs_node
structs in both
cases (although I typedef them as sp_node
and sp_tree
to maintain
the change-the-prefix-to-change-the-implementation property stated above).
In contrast, althought most of the Red Black tree functions are, indeed, equal
to their Classic Binary Tree counterparts, you cannot mix them because Red
Black functions require that the nodes contain an additional color field
and because they expect the tree to be already balanced.
References:
1) – https://en.wikipedia.org/wiki/Binary_search_tree
1) – https://en.wikipedia.org/wiki/Associative_array
1) – https://en.wikipedia.org/wiki/Set_(abstract_data_type)
1) – https://en.wikipedia.org/wiki/Priority_queue
1) – https://en.wikipedia.org/wiki/AVL_tree
1) – https://en.wikipedia.org/wiki/Treap
1) – https://en.wikipedia.org/wiki/B-tree
1) – https://en.wikipedia.org/wiki/2–3_tree
1) – https://en.wikipedia.org/wiki/2–3–4_tree
1) – https://en.wikipedia.org/wiki/Red–black_tree
1) – https://en.wikipedia.org/wiki/Weight-balanced_tree
1) – https://en.wikipedia.org/wiki/Scapegoat_tree
1) – https://en.wikipedia.org/wiki/Splay_tree