ECE358 Exercise - Finding Maximal Matches in Genome Sequences
Bonus: Finding Maximal Matches in O(n) Time (Done)
Bonus: Edit Distance Aligner in O(n) Space (Done)
Description:
To find maximal matches on the two strings A and B of size >= k, we first hash the string A by associating each consecutive sub-strings of length k with the position of the sub-string (denoted by beginning and end points).
Using this hash, we use consecutive sub-strings of length k in B as key to determine if matching sub-strings exist and where these sub-strings are in A. The position of each of the matching sub-string in A and the position of the sub-string in B are used together to create match objects. The end point of B is used as a key to store the list of match objects in a second hash. (ie the hash returns the list of all matches that have B ending in position n, where n is the key passed in.)
These matches are extended using the following rules:
If there are matches in B that ends in n, look at the list of matches in B that ends in n-1.
If there is no match, just move on.
After this is done with all consecutive sub-strings of length k in B, the remaining entries in the second hash are the Maximal Matches.
Implementation:
The hash table we used were provided by C++’s unordered map. A match class was defined with the fields a1, a2, b1, b2. Where a1, a2 are the beginning and end points of sub-string in A, and b1, b2 are the beginning and end points of sub-string in B.
Complexity:
O(m+n-k) Time: We assume that k is big enough so there is on average of one match entry per key in the first hash table. This assumption also implies that there will be only one entry per key in the second hash table. (There is only one match to start with. Any combining will result in both a deletion and insertion of an entry. So the total is never more than one). Thus checking and extending the matches is constant time (only need to access the hash entries). This constant work need to be done a number of time directly proportional to the size of (n or m) - k. Where m and n are the number of characters in the string A and B.
O(m+n-k) Space: The first table need to store m-k entries and the second table need to store n-k entries.
Description:
Using the list of maximal matches, we first sort them by their lengths. (defined by a2 - a1 + 1 of the match object). By greedily taking the longest consistent matches first, we create a list of anchors. The list of anchors are stored in a self balancing binary search tree using a1 as the value of the node (AVL implementation provided by professor Francois Pitt as lecture material in CSC190 during 2011 winter semester)
Implementation:
The list of maximal matches were sorted by their length using C++’s sort function for lists. A number of sub-functions were made to check the consistency of any two interval by checking if they overlap and checking if they crisscross. The list of maximal matches were then examined one by one, and compared with the existing anchors using the following algorithm.
After all the anchors have been chosen, convert the avl tree to a list.
Complexity:
Description and Implementation:
Using the O(n) space edit distance algorithm discussed in lecture. Using the recursion relationships:
m[i%2][j] = min( if match (m[(i-1)%2][j-1]) else m[(i-1)%2][j-1]+1, m[(i-1)%2][j]+1, m[i%2][j-1]+1 ) for forward calculations
and m[i%2][j] = min( if match (m[(i+1)%2][j+1]) else m[(i+1)%2][j+1]+1, m[(i+1)%2][j]+1, m[i%2][j+1]+1 ) for backward calculations
The alignment algorithm is called recursively, with the base cases 1) either string being empty or 2) one character aligned with multiple characters.
If neither of the base cases applies:
Complexity:
Description and Implementation:
After reading in the strings from the .mfa files, call the function from part A to return a list of maximal matches. Calling the part B function on this list of maximal matches, we obtain a list of anchors. This list of anchors are then sorted by their starting position of one of the strings. The anchors are then looped through, using the function from part C on the sub-strings between the anchors to find the edit distance and the edited string. The edit distances are then added together and the edited strings were combined with the anchors.
Complexity: