Problem E
C'mon K-mers!
A useful operation on strings, specially when analyzing certain types of data, like genomic data, is to find all the substrings of length $k$ in a string. These substrings are called the string’s $k$-mers and we are typically interested in finding out how many times each $k$-mer appears within the string. For example, consider the following string:
aaabaab
This string has three unique $2$-mers:
aaabaab aa aa ab ba aa ab
More specifically, the $2$-mers are aa (appearing three times), ab (appearing two times), and ba (appearing once). Notice how counting the $k$-mers considers overlapping occurrences of each substring. For example, the first two occurrences of aa above overlap with each other.
We could do a similar analysis for the string’s $3$-mers:
aaabaab aaa aab aba baa aab
In this case, there are four unique $3$-mers: aaa (appearing once), aab (appearing twice), aba (appearing once), and baa (appearing once).
In this problem, you will write a program that, given a potentially large string and some $k$-mers, will determine how many times each $k$-mer appears in the string. Unlike other problems on this exam, your solution must not only be correct but also efficient (we provide specific bounds below).
Input
The input contains a string and the $k$-mers we want to count in that string. This string and the $k$-mers will contain only lowercase letters (a through z). Because the string can be very long, it may be broken up into multiple lines. In particular, if the string has $40$ or less characters, it will appear in a single line. If the string has $41$ or more characters, it will be broken up into multiple lines, with every line (except possibly the last one) having $40$ characters.
The first line of input contains four positive integers: $L$, $N$, $k$, $Q$ (each separated by a single space). This first line is followed by $L$ lines containing the input string, which will have $N$ characters in total. As noted above, if $N\leqslant 40$, the input string will appear in a single line (and $L$ will be 1). If $N>40$, the input string will be broken up into multiple lines.
The $L$ lines specifying the string are followed by $Q$ lines containing the $k$-mers we want to count in the string. Note that an input file will use the same value of $k$ for all the $k$-mers, with that value specified in the first line. So, if the first line specifies $k=3$, then the $Q$ lines will all contain $3$-mers. Note that the $k$-mers are not guaranteed to be unique; the same $k$-mer may appear more than once in these $Q$ lines.
You may assume the following bounds:
\[ \begin{array}{l l l} 1 < & L& \leqslant 3,750\\ 1 < & N& \leqslant 150,000\\ 1 < & k& \leqslant 100\\ 1 < & Q& \leqslant 150,000\\ \end{array} \]Output
For each $k$-mer specified in the input, print the $k$-mer and the number of times it appears in the string (separated by a single space). The $k$-mers must appear in the same order they appear in the input.
Complexity
Your solution must be able to produce the expected output by making at most one pass through the string (not counting the initial loading of the string from the input). More specifically, your solution should run in $O(k\cdot N + k\cdot Q)$. Solutions that run in $O(k\cdot N \cdot Q)$ will result in a Time Limit Exceeded when run on our secret test cases. Think carefully about what data structure(s) and algorithm(s) to use in your solution.
Sample Input 1 | Sample Output 1 |
---|---|
1 7 2 3 aaabaab ab ba aa |
ab 2 ba 1 aa 3 |
Sample Input 2 | Sample Output 2 |
---|---|
1 7 3 5 aaabaab aaa aab aaa baa xyz |
aaa 1 aab 2 aaa 1 baa 1 xyz 0 |
Sample Input 3 | Sample Output 3 |
---|---|
5 170 3 4 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaa aab aba baa aaa |
aab 1 aba 1 baa 1 aaa 165 |