How to Select a Random Node from a Tree by@suraj-regmi

January 1st 2020 534 reads

Data Scientist, The World Bank -- the views, content here represent my own and not of my employers.

Today, I am remembering my childhood moments with my father. My father is a dreamer and he always encouraged me to dream bigger. His encouragement for me to dream and his support for me in pursuing my dreams have shaped greatly what I am today.

I was born and raised in the days when gadgets and the internet were not common. At that time, my father loved to scribble when he was free, thanks to the unavailability of smartphones and the internet in my home. Sometime, I would find him writing a poem and other times, I would find him designing an interior architecture of a house (for fun). He loved cards. He used to see if the day is fortunate by playing the cards. I think cards speak the truth. The day he could not crack the cards fortune puzzle, he used to spend the whole day trying to make it right.

I remember my father playing with “Regmi ancestry tree”, sometime in his big notebook and sometime in chart papers. He was fond of drawing the tree with a marker pen and giving a rectangular boundary to each person (node). His edges would be linear block edges with the outgoing arrows. Remembering my father’s love with the tree and thanks to my love for the computer science tree, I tried to build the tree using Python.

*Regmi ancestry tree created using Python in 2017*

Today, I want to take a random “Regmi” from there (just for fun and not any genetic study) and was thinking of a programmatic way of doing it. It turns out it is a popular problem of computer science — getting a random node from a tree with equal probability. Just as how I wrote a blog about getting a random line from a file in one pass previously with code and proof of correctness, I do the same for getting a random node from a tree.

For the selection of node to be completely random process, each node should have an equal probability of being selected. As the total number of nodes is denoted by *n*, the probability of any node being selected should be *1 / n*.

Here, the *O(n)* time and *O(n)* space solution will be presented with the proof of correctness at the end.

- Traverse through the whole tree and save the total number of nodes for all the subtrees in the tree. The total number of nodes for all the subtrees can be saved in any format and that takes the space
*O(n)*. - Traverse through the tree from the root and traverse on the following rule. Suppose the left subtree and right subtree of the node that we currently are on have
*b*and*c*elements respectively.

If*b = c = 0*, return the node.

Otherwise, get a random integer (*i*) in the range*[1, b + c + 1]*.

If*i ≤ a*, go left.

If*i = a+1*, select the element.

If*i > a+1*, go right.

Continue the process until a node is returned. The node returned is the random node.

Let’s suppose we have the tree as shown in the figure below. For simplicity, we have made all the nodes’ value unique.

*A binary tree*

The function *count* counts the number of nodes in all the subtrees and saves in the attribute *size* of the *Node* instance. The function *get_random_node* gets the random node of the tree. We have returned here the value of the random node.

```
import random
# one pass in the whole tree to count the size of all the subtrees
def count(root):
if not root: return 0
return count(root.left) + 1 + count(root.right)
# function to get the random node
def get_random_node(root):
a = root.size
b = root.left.size if root.left else 0
c = root.right.size if root.right else 0
if b == c == 0:
return root.val
rand_num = random.randint(1, a)
if rand_num <= b: return get_random_node(root.left)
elif rand_num == b + 1: return root.val
else: return get_random_node(root.right)
```

Let’s simulate the random events and see the percentage frequency of each node's value.

As expected, the percentage frequency is nearly the same for all the node values.

We do the proof of correctness here by induction.

The total number of nodes in the tree is *n*.

For any node in the tree, let *a*,* b* and *c* be the number of nodes in the subtree (containing the node), left subtree and right subtree respectively.

Differentiation is made between the selection and arrival process as the selection is selecting (and returning) the node whereas arrival is arriving at the node while traversing. Arrival at some node does not guarantee the selection of the node but arrival at some node is mandatory for the selection of the node.

First, let’s prove the randomness of the arrival process. Based on the randomness of the arrival process, we prove the randomness of the selection process.

Let’s suppose the arrival process is random up to some node. So, the probability of we coming at that node is *a / n*.

Now, the probability of arriving at the left subtree of that node is the probability of arriving at the node multiplied by the probability of choosing the left subtree.

So, the probability of arriving at the left subtree:

```
= a / n * b / a
= b / n
```

Similarly, the probability of arriving at the right subtree:

```
= a / n * c / a
= c / n
```

So, if the arrival process at some node is random, the arriving process is random on its children too. At the root node, the arrival probability according to our formula is *n / n = 1*. As we always start from the root node, the process satisfied the calculated probability.

Hence, it is proved that the arrival process is random.

Now, let’s prove that the selection process is random.

As the arrival process is random, the probability of arrival at some node is:*no of nodes in the subtree / total number of nodes in the tree (n)*

Now, the probability of the node being selected is:

```
= probability of arrival at that node * probability of selecting that node
= no of nodes in the subtree / n * 1 / no of nodes in the subtree
= 1 / n
```

As the probability of selection of any node is *1 / n*, the selection process is random.

*Proved**.*

I remember a quote from Nelson Mandela, “It always seems impossible until it is done” and think about the algorithmic/mathematical puzzles. There are a number of mathematical/algorithmic puzzles that have taken decades to get solved. Some are still to be solved and many smart minds are burning the midnight oil to solve them.

For example, you can find “Millennium Problems” posted by Clay Institute which have one million USD cash prize for the solutions. Once a problem is solved, the solution is studied by many other researchers, scientists, and engineers. The solution is refined, optimized and studied more by other researchers and scientists. The solution is used as far as practicable by engineers in as many applications.

What do the problem-solvers see in the problems, solutions, proofs, optimizations, and implementations whether they are researchers, scientists or engineers? — Beauty. And it starts with some small problem, small algorithm, small proof. If you found this or any other algorithm beautiful, comment here. I would love to know which algorithm’s beauty are you most fascinated by.