Segment Trees !

Nothing influences people more than a recommendation from a trusted friend” – Mark Zuckerberg

This post is a part of IIT-M CSE “Concept-Sunday” series .

src : sunkist.com/tips/how-to-slice-oranges/; all-free-download.com/free-vector/half-circle-shape.html

Our topic for today is segment tree – which is a famous tree data structure for range queries and range update problems, typically on an array. We will also look at the concept of lazy propagation associated with segment trees.

Given an array, consider the problem of finding sum of values in a given index range. For example if we have have array of size N, we are supposed to find sum of values between indices i and j ( 0 <= i <= j <= N – 1 , considering array indexing starts from zero). We call such a query for consecutive set of indices as a range query.

Now this problem can simply be solved by iterating from index i to index j and calculating the running sum; in this case update is simply a matter of indexing the array to update required position. However range query here takes O(N) worst case time and O(1) time is required for updation.

Next, another way to solve this problem is by maintaining an extra array which stores sum of values from index 0 to current index (cumulative sum). So the j ‘th index in this new array will have sum of values in original array from [0 – j]. Now in this case range query(sum) from index i to index j can be solved by taking difference of extraArray[j] and extraArray[i] i.e in O(1) time; but update here requires updating the extra array which will require O(N) time.

We can choose either of the approach based on our requirements; if range queries are more than update requests then we go by the second approach, else if update requests are more then we go by the first approach. But the question is can we better ? Solving both query and update in O(1) looks like much of an expectation and against the concept of trade-off. Then, in between O(1) and O(N) is O(log N) which seems reasonable. And the first data structure that comes to our mind after seeing O(log N) is a tree – which always comes to our rescue. Naturally we will turn back to tree data structure which will help us solve both range query and update problem in O(log N) time.

Segment Tree – Definition :

Such a tree that helps us solve range queries and update requests in O(log N) time is a segment tree. Each node in the segment tree stores information about a segment of data(array) or an interval of the array. So that traversal of this tree will give us required information about any segment in logarithmic complexity; also updates in such a tree can be achieved in logarithmic time.

It is also known as a static data structure as structure of tree cannot change once data structure is initialised. This also implies that we cannot add or delete elements from the array, only operation available to us in terms of modification to array is updating an existing element.

 

orig

Original Array

 

segment_min

Segment tree for finding minimum over a range of array indices.

segment_sum

Segment tree for finding sum over a range of array indices.

For the problem of finding sum over a range, segment tree can be described as :

  1. Each leaf node stores a value from original array.
  2. Each internal node stores sum of all leaf nodes in the sub-tree rooted at this internal node.
  3. Effectively each node stores sum of the values stored in its left and right child.
  4. Value of the leaf nodes belonging to a sub-tree are consecutively placed in the original array.

Such a tree is a full binary tree and N leaf nodes and (N – 1) internal nodes, where N is the total number of elements in the original array.

In the figure above root node stores sum of entire array[0 : 5] ; its left child stores sum of first half [0 : 2] and right child stores sum of the second half [3 : 4]. We can further divide [0 : 2] into [0 : 1] and [2 : 2]; [3 : 4] can be divided as [3 : 3] and [4 : 4]. Thus each leaf node is responsible for a single value and a level above leaf is responsible for 2 values and so on.

Implementation :

This division approach also tells us how to construct such a segment tree. We compute recursively, the value of each node starting from the root node by partitioning the array into two equal halves each time. Values returned from each half are summed to form value for the current node. The basis condition is hit whenever partition has a single value; this single value forms a leaf of the segment tree.

Since this is a full binary tree and static too, we can use an array representation to store such a segment tree which has its own advantages like easy to understand – level order storage of keys(values). When using array representation, the two child nodes of a node array[i] can be accessed using array[2 * i] and array[2 * i + 1] (nodes are stored from index 1 and not 0); parent of array[i] is array[i / 2].

Segment Tree Creation:

  • A Segment Tree can be built using recursion (bottom-up approach ).
  • Start with the leaves and go up to the root and update the corresponding changes in the nodes that are in the path from leaves to root (leaves represent a single element).
  • In each step, the data of two children nodes are used to form an internal parent node.
  • Each internal node will represent a union of its children’s intervals.
  • Merging may be different for different questions. For example in case of finding minimum each internal node will contain the minimum of its child values, whereas in case of finding sum for given range each internal node will contain the sum of the values of its children.
  • So, recursion will end up at the root node which will represent the whole array.

The two operations supported by segment tree are :

Range Query(startIndex, endIndex) : Here we sum the value of  all the nodes that only cover leaf(values) within given range.

Update(index, newVal) : Here we update value of all the nodes that have given index in its range(Note that each node in segment tree is responsible for a range of values).

Lazy Update in Segment trees :

Another concept that is closely associated with segment tree is lazy update which is a valuable addition to range update. If there are updates on a range of array indexes, for example if value x is to be added to all values at indexes from i to j in array, then the update function  has to be called for every index from i to j.

The key idea is to postpone some update operations until they are actually needed. Because we recursively perform update operations till we reach the leaf node, we can stop at a node when all the nodes in the sub-tree rooted at this node needs to be updated. This case arises when range of the node is completely within the desired update range.

In lazy update we create a lazy array lazy[] which has same size as that of the segment tree size, and its values are initialised to 0 which indicates no pending updates. If the value of lazy[x] > 0, it indicates that there is a pending update on x th index of segment tree and that update value is equal to lazy[x].

If the range of updation is [i, j] such that 0 < < < N – 1.

  • If current segment tree node has any pending update, that is if lazy[]  has some  values > 0,  then first add that pending update to current node.
  • If current node’s range lies completely in update query range
    • Update the current node.
    • Postpone updates to children by setting  lazy value for children nodes in the lazy   array.
  • If current node’s range overlaps with update range, follow the same approach as that of simple update.
    • Recur for left and right children.
    • Update current node using results of left and right calls.

 

EOBP !

Advertisements

Tries !

Try to be a rainbow in someone’s cloud” – Maya Angelou

This post is a part of IIT-M CSE “Concept-Sunday” series .


src : https://stackoverflow.com/questions/5434813/longest-prefix-matches-for-urls; indianagrace.org

Well we are back to data-structures and we will be discussing about one of the most simple and useful data structure Trie.  Primarily Trie is used for searching strings and is a efficient information retrieval data structure; however despite of  its usefulness it has failed to find its place in standard library(C++ at least).

Trie – Definition :

Wikipedia says, a trie, also called a digital tree (radix tree or prefix tree) is a kind of search tree — an ordered tree data structure that is used to store a dynamic set or associative array where the keys are usually strings. Simply said trie is a data structure that stores strings by decomposing them into characters. The characters form a path through a tree until the leaf node of the tree which represents a string uniquely.

Trie – Why:

As a direct application, consider a basic spell checker; for each word in text, spell checker goes through its dictionary to find for existence of corresponding word and flags an error if the word is not found. Tries are very efficient for such existence queries on strings and were designed for this very purpose.

Now if the only motive of trie  is to efficiently store strings why cant we simply go for standard binary trees or balanced binary trees. To answer this question lets look at the complexity of search in both the scenarios. If maximum length of the string is W and N is the total number of strings then a height balanced tree would take O (W log(N)) to search for string. Whereas a trie achieves the same in at a complexity of O ( W ) but having said that this speedup comes with a cost – cost in terms of storage(each node in trie consumes considerable space as compared to normal tree node).

From an application like spell checker we obviously expect speed, so the natural choice is trie as we choose time over time-space trade off.

Trie – Search :

From implementation perspective trie consists of nodes and edges. Each node has array(we can use linked list representation instead of array representation to save on space but that will add to search time) of 26 pointers which point to the child nodes so each node can have a maximum of 26 children. These 26 pointers correspond to 26 letters of the alphabet and can be directly indexed for each child character. Therefore a search through a trie is a matter of moving along the tree(starting from root) based on the sequence of characters in the search string. In the above trie to search for “IITM” we start from root move down to find first ‘I’ then another level down to find another ‘I’ and the search completes at forth level.

Similarly during insertion we move down the root based on the characters of string to be inserted and reach a leaf node/non-leaf node where extra nodes needs to be added to compensate for missing characters. For example to add “IITB” we trace down to reach ‘T’ and then add a child node for ‘B’; now ‘T’ has two child nodes for ‘M’ and ‘B’ .

Trie – Applications :

As stated earlier a Trie can be used to search in a dictionary; it can also be used to find number of strings matching a given prefix. Based on these concepts we can build real world applications like spell checker; it can also be used in auto-complete feature commonly found in messaging and mailing applications. Tries have also been studied extensively in ip address lookup algorithms(at router), with its variations like binary trie, path compressed trie, multi-bit trie, etc.

google

gmail-2.png

 

spellcheck

NOTE: Trie can be used in above applications.

EOBP!

Probability Distributions !

In the end we only regret chances we didn’t take” – Unknown

This post is a part of IIT-M CSE “Concept-Sunday” series .

src : http://www.clipartpanda.com/cards-clipart; https://www.analyticsvidhya.com/probability

From past couple of weeks we have been focusing on data-structures and specifically trees. This week we are going to move away from our tradition and discuss about some common probability distributions which is perhaps bread and butter of any data scientist. It also forms the basis for machine learning and deep learning which is gaining lot of traction these days. This post will serve as a short refresher for binomial, uniform, poisson and normal distribution. The post assumes that you are well versed with basics of probability.

Probability Distribution:

Probability distribution and frequency distribution are fundamental to statistics and two faces of the same coin. Its worthwhile to describe the former in comparison with the latter. While frequency distribution dictates how many times an event has occurred; probability distribution says how many times an event should have occurred. So the chance of occurrence of a particular event in a given setup is described by the probability distribution.

There are different distributions with each having its own use, we will specifically focus on binomial, uniform, poisson and normal distribution.

Binomial Distribution (p):

The binomial distribution describes the distribution of outcomes from a series of trials where each trial meets certain conditions.

  • The experiment consists of n repeated trials.
  • The outcome of each trial may be classified as either a success or a failure(hence the name Binomial).
  • Each trial is independent of other trials.
  • The outcomes are mutually exclusive, meaning there can never be a mix of outcomes. Only one outcome can occur at a time. For example when we flip a coin , the we can either obtain a head or a tail(assuming the coin never stands :P)
  • The probability of success in a trial remains constant in all the trials. For example while tossing a coin if the probability of obtaining a head is 0.5, then for the next trial this probability remains the same.

The number of success x in n trials of a binomial experiment is called the binomial random variable. We need to find the probability of obtaining x successes in n trials given the probability of success in one trial.

The binomial probability distribution is given as,

bino_pdf

where ,

  • p  = the probability of a success.
  • x  = the number of success, may range from 0, 1, 2 …. n
  • (1 – p)  = probability of failure(q)

Here is an example of a binomial distribution of obtaining x heads in 20 trials given that the probability of obtaining a head is p; x can take values [0,20]. We can see that with p=0.5 the chance of getting 10 heads has highest probability where as with p=0.75 there are high chances of getting 15 heads.

Uniform Distribution (a, b) :

A uniform distribution, also called a rectangular distribution, is a probability distribution that has constant probability.

The discrete uniform distribution is a symmetric probability distribution whereby a finite number of values are equally likely to be observed; every one of n values has equal probability 1/n. Another way of saying “discrete uniform distribution” would be “a known, finite number of outcomes equally likely to happen”.

Rolling a single die is one example of a discrete uniform distribution; a die roll has six possible outcomes: 1,2,3,4,5, or 6. There is a 1/6 probability for each number being rolled.

Let X represent a random variable taking on the possible values of {0-9}, and each possible value has equal probability. This is again a discrete uniform distribution and the probability for each of the 10 possible value is P(X = x ) =  1/ n = 1/10 = 0.1

If we use a uniform random number generator(rand function – matlab) to generate 20000 numbers between [1, 10], we can see that the frequency of occurrence of each number 1 to 10 is almost the same which is approximately equal to 20000/10 = 2000.

uniform-dist

On similar lines continuous uniform distribution is defined and its pdf is given by,

uniform_pdf

Poisson distribution (λ):

Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time and/or space if these events occur with a known average rate and independently of the time since the last event. In short it is used to model the number of events occurring within a given time interval.

Suppose we track the number of emails we receive each day and see that on an average we receive 10 mails per day. Also if either of these emails do not affect the arrival of other emails i.e they are independent of each other, then we can arrive at a reasonable assumption that number of emails received per day obeys Poisson distribution. Other examples that may follow a Poisson: the number of phone calls received by a call center per hour or the number of decay events per second from a radioactive source.

The formula for the Poisson probability mass function is

poisson_pdf

λ is the shape parameter which indicates the average number of events in the given time interval.

For example if mean number of calls to a fire station on a weekday is 8, what is the probability that on a given weekday there would be 11 calls?  applying x = 11 and = 8 in the above formula, we get probability = 0.072.

poisson_5_10_15

Poisson distribution for λ = 5, 10 and 15 – curves are right skewed

Normal Distribution (µ, σ):

The Normal Distribution is the most important probability distribution in statistics and probability theory. It is also known as Gaussian Distribution(named in honor of the great German mathematician Carl Friedrich Gauss) and Bell Curve (as the shape of the normal curve is bell shaped).

The basis for the use of normal distribution to approximate any distribution in an application is the central limit theorem which states that the sum of a large number of identically distributed random variables, each with finite mean and variance, is normally distributed.

Normal probability density function :

gaussian_pdf

Parameter µ equals the mean or average value of the population, and σ (standard deviation) tells us how much the numerical data values of the population are spread out. As σ increases, the more spread out the population data values are. From a quality control point of view, a primary objective of a manufacturing process is to keep the σs of the parts dimensions as small as possible.

The plot below shows normal distribution for different values of mean and variance.

normal-dist

EOBP.

Definitions are directly taken from http://www.wikipedia.org/ .

Red Black Trees !

A journey of thousand miles begins with a single step” – Unknown

This post is a part of IIT-M CSE “Concept-Sunday” series .


src : http://wallpaperswide.com/lone_red_tree-wallpapers.html ; http://www.blacktreepublishing.co.uk

Well, color makes a difference – no I am not at all being racist; you ask a data structure guy and he will surely convince you that color does matter. From back to white to red – some special data structures internally use colors to organise(or distinguish) the data. Red Black trees belong to this special class, but having said that I still wonder why did they name it red-black trees and not white-black trees. Some say inventors made use of red and black pens extensively whereas others claim that they liked the red color prints coming out their printers.

We don’t know what is right and what is wrong; the truth is last week we talked about AVL trees and this week we are are back with a similar setup! Today we will discuss about Red Black trees which have been gained popularity for right reasons 🙂

Red Black trees are yet another balanced Binary search tree, with different balancing strategy that fares well in some areas as compared to AVL trees. From now on wards we will refer to Red Black trees as RB trees wherever necessary. RB tree does better at insertion and deletion while AVL tree is  better at searching. So if your application needs faster insertion and deletion RB trees fit in well whereas if your application is query intensive you are better off with our good old AVL trees. AVL trees have somewhat strict balancing strategy that makes insertion and deletion a costly affair but as a result height of the tree is better balanced, which makes search/query faster. However RB trees with their weaker balancing strategy complete insertion and deletion in one single pass(log n) through the tree.

Back with AVL trees we used the balance factor to check balance; with RB trees, we use the colour of the nodes and certain properties to check and maintain the balance of the tree.

Red Black Tree is characterised by following properties :

  1. Every node is either red or black
  2. Root and leaves (nil’s ) are black
  3. There can never be  2 consecutive red nodes  (alternatively mentioned as , a red node always has a black parent. )
  4. All simple paths from a node ‘x’ to a descendent leaf have the same number of black nodes in them.

Insertion in a Red Black Tree

Since the structure of RB tree is same as that of the binary search tree, searching for the location to insert is same as BST / AVL. The goal however here, is to maintain the Red Black Tree properties  .

If the tree is empty simply create a single black node and mark it as root node.

And if the tree is non-empty

  • Search for the location, and insert the node
  • colour the node- Red
  • Now restore red-black tree properties (if required )

Searching and inserting the node follows the same procedure as BST; there is no rocket science in colouring node red either 😉 So the only thing that’s important here is algorithm involved in restoring the red-black tree properties.  The only property that may be violated is property 2 and property 3. It may happen that the newly added node is the only node making a red node as root node(we have already talked about this case earlier) or it may happen that they added node is child of a red node violating property 3.

The idea behind colouring the newly inserted node with Red , is that only property 3(property 2 violation is trivial) can be violated which can then be solved by either

  • Recolouring , or by
  • Rotation (as in AVL trees )

We move the violation in the RB Tree upwards and solve it using the above mentioned options. Next we dissect the restoration of red-black tree into several case and consider each case in detail.

Suppose, 

  • x is the newly added node 
  • p refers to the parent node
  • u refers to the uncle node
  • g refers to the grandparent node.

Case 1 : If the color of x’s parent is black :

In this case we don’t have to worry, because the tree still satisfies the properties 🙂 

Case 2: If the color of x’s parent is red : (Note that we have now violated rule 3).

   Case 2.A :If the color of x’s uncle is red :

   Note : the grandparent will be black in this case

   We perform following steps here:

  •  Recolour x’s parent and uncle with Black.
  •  Recolour the grandparent to Red.  (This will violate Rule 3 for the grandparent )
  •  Now consider the grandparent to be the newly inserted node , and repeat Case 2.
case1

If both parent and uncle are red- color parent and uncle as black and grand parent as  red.

Case 2.B: If the color of x’s uncle is black

(NOTE : In this case  the grandparent  has 2 children, one of them is red who is the parent of the newly inserted node , and the other is uncle whose color is black. We now know that the grandparent node has paths to leaves of the tree with unequal black nodes in them . Hence in order to preserve the properties of RB Tree, we need some operations)

We consider four sub cases here and rotation cum recolouring  diagram for each case will ensure that the RB Tree properties are satisfied.

1. LL Case : p is to the left of g, and x is to the left of p

ll_case

LL Case : Perform right rotation followed by color exchange between g and p

2. LR Case : p  is to the left of g , and x is to the right of p

lr_case
LR Case : Perform left rotation, the situation now becomes same as the LL case

3. RR Case : p is to the right of g, and x is to the right of p

rr_case.PNG

RR Case : Perform left rotation followed by color exchange between g and p

4. RL Case : p is to the right of g, and x is to the left of p

rl_case

RL Case : Perform right rotation, the situation now becomes same as the RR case

Feel free to check out these pages for animations of Red Black Tree operations

Content credit : Ganesh K .