Hide

# Problem EC'mon K-mers!

A useful operation on strings, specially when analyzing certain types of data, like genomic data, is to find all the substrings of length $k$ in a string. These substrings are called the string’s $k$-mers and we are typically interested in finding out how many times each $k$-mer appears within the string. For example, consider the following string:

aaabaab


This string has three unique $2$-mers:

aaabaab

aa
aa
ab
ba
aa
ab


More specifically, the $2$-mers are aa (appearing three times), ab (appearing two times), and ba (appearing once). Notice how counting the $k$-mers considers overlapping occurrences of each substring. For example, the first two occurrences of aa above overlap with each other.

We could do a similar analysis for the string’s $3$-mers:

aaabaab

aaa
aab
aba
baa
aab


In this case, there are four unique $3$-mers: aaa (appearing once), aab (appearing twice), aba (appearing once), and baa (appearing once).

In this problem, you will write a program that, given a potentially large string and some $k$-mers, will determine how many times each $k$-mer appears in the string. Unlike other problems on this exam, your solution must not only be correct but also efficient (we provide specific bounds below).

## Input

The input contains a string and the $k$-mers we want to count in that string. This string and the $k$-mers will contain only lowercase letters (a through z). Because the string can be very long, it may be broken up into multiple lines. In particular, if the string has $40$ or less characters, it will appear in a single line. If the string has $41$ or more characters, it will be broken up into multiple lines, with every line (except possibly the last one) having $40$ characters.

The first line of input contains four positive integers: $L$, $N$, $k$, $Q$ (each separated by a single space). This first line is followed by $L$ lines containing the input string, which will have $N$ characters in total. As noted above, if $N\leqslant 40$, the input string will appear in a single line (and $L$ will be 1). If $N>40$, the input string will be broken up into multiple lines.

The $L$ lines specifying the string are followed by $Q$ lines containing the $k$-mers we want to count in the string. Note that an input file will use the same value of $k$ for all the $k$-mers, with that value specified in the first line. So, if the first line specifies $k=3$, then the $Q$ lines will all contain $3$-mers. Note that the $k$-mers are not guaranteed to be unique; the same $k$-mer may appear more than once in these $Q$ lines.

You may assume the following bounds:

$\begin{array}{l l l} 1 < & L& \leqslant 3,750\\ 1 < & N& \leqslant 150,000\\ 1 < & k& \leqslant 100\\ 1 < & Q& \leqslant 150,000\\ \end{array}$

## Output

For each $k$-mer specified in the input, print the $k$-mer and the number of times it appears in the string (separated by a single space). The $k$-mers must appear in the same order they appear in the input.

## Complexity

Your solution must be able to produce the expected output by making at most one pass through the string (not counting the initial loading of the string from the input). More specifically, your solution should run in $O(k\cdot N + k\cdot Q)$. Solutions that run in $O(k\cdot N \cdot Q)$ will result in a Time Limit Exceeded when run on our secret test cases. Think carefully about what data structure(s) and algorithm(s) to use in your solution.

Sample Input 1 Sample Output 1
1 7 2 3
aaabaab
ab
ba
aa

ab 2
ba 1
aa 3

Sample Input 2 Sample Output 2
1 7 3 5
aaabaab
aaa
aab
aaa
baa
xyz

aaa 1
aab 2
aaa 1
baa 1
xyz 0

Sample Input 3 Sample Output 3
5 170 3 4
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaa
aab
aba
baa
aaa

aab 1
aba 1
baa 1
aaa 165

Hide