Problem B
Where Are My Genes?

Genes are the regions of a DNA sequence that code for protein molecules. Proteins are the workhorses of life, responsible for all structure and activity in a cell. However, almost 99% of the DNA in a genome does not code for proteins and is often referred to as “junk DNA”. An important problem in computational biology is to help identify the 1% of a genome that may code for a protein.

As you may know, DNA is composed of four bases: cytosine (C), guanine (G), adenine (A), and thymine (T). One way to characterize genes is by comparing the ratio of their C-G content to the background C-G content of the entire genome. A higher C-G ratio may indicate a gene-rich section of the genome that is of more interest to study.

The C-G ratio is typically expressed as a percentage value using this formula:

\[ \frac{C+G}{C+G+A+T} \cdot 100 \]

where $C$, $G$, $A$, and $T$ are the number of times each base appears in a sequence. For example, the sequence GATTACA contains one C base, one G base, three A bases, and two T bases. Its C-G ratio is therefore $\frac{2}{7}\cdot 100$, or $28.5714286\dots$.
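The formula can be sketched as a small Python function. The name `cg_ratio` is a hypothetical helper chosen for illustration, and the sketch assumes a non-empty string made up only of the characters C, G, A, and T.

```python
def cg_ratio(sequence: str) -> float:
    """Return the C-G ratio of a DNA sequence as a percentage.

    Assumes `sequence` is non-empty and contains only C, G, A, and T,
    so its length equals C + G + A + T.
    """
    cg_count = sequence.count("C") + sequence.count("G")
    return cg_count / len(sequence) * 100


# GATTACA has one C and one G out of seven bases: 2/7 * 100, about 28.5714.
print(cg_ratio("GATTACA"))
```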

In computational biology, gene sequences are typically stored in the FASTA format, a file format used to exchange information between genetic sequence databases. A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (‘>’) symbol in the first column.

For example, a DNA sequence in FASTA format would look like this:

>gi|568815587: Homo sapiens chromosome 11, GRCh38 Primary Assembly
ACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCATCTGACTCCTGA
GGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGC

The word following the > symbol is the identifier of the sequence, and the rest of the line is the description. All lines of DNA sequence contain only the characters C, G, A, and T, representing the four chemical bases of the genetic code. Sequence lines are at most $80$ characters long. No blank lines are allowed in the middle of the FASTA input.
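For readers curious about the real format, the layout described above can be read with a few lines of Python. This is only a sketch under the rules stated here (a '>' line starts a record, its first word is the identifier, and the following lines are sequence data); `parse_fasta` is a hypothetical helper, not part of any library, and it is not needed for the simplified input used in this problem.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Map each sequence identifier to its concatenated sequence data."""
    records: dict[str, list[str]] = {}
    identifier = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate a trailing blank line at the end of the input
        if line.startswith(">"):
            # The first word after '>' is the identifier; the rest is the description.
            words = line[1:].split(maxsplit=1)
            identifier = words[0] if words else ""
            records[identifier] = []
        elif identifier is not None:
            records[identifier].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# Hypothetical two-line record, made up for illustration.
example = """>seq1 hypothetical example record
GATT
ACA
"""
print(parse_fasta(example))  # {'seq1': 'GATTACA'}
```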

In this problem, you will take a gene sequence specified in a FASTA-like format and compute its C-G ratio. For simplicity, we will not use the exact FASTA format, but something very similar to it. More specifically, we will drop the description line and replace it with an integer giving the number of sequence lines to analyze. So, the FASTA example shown above would actually look like this:

2
ACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCATCTGACTCCTGA
GGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGC

Input

The input starts with a line containing a single integer $N$ ($1\leqslant N \leqslant 100$), followed by $N$ lines. All these lines will have the same length, between $1$ and $80$ characters, and will contain only the characters C, G, A, and T.

Output

The output is a single floating-point number: the C-G ratio according to the formula shown above.

Your output does not have to match our output character by character; it is enough for the value you print to be accurate to within an absolute or relative error of $10^{-3}$. For example, in the third sample output, if your program prints 28.57143, this is also an acceptable answer because the difference from the expected answer is less than $10^{-3}$. To accomplish this, we recommend that you do not format or round the floating-point number: simply print it with a few extra decimal places to ensure your answer is as close as possible to the accepted answer. In most programming languages, this can be done by printing the floating-point number without specifying any particular formatting.
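One possible approach, sketched in Python, is to concatenate the $N$ sequence lines and apply the formula once to the whole string. The function name `solve` is a hypothetical choice for illustration, and the sketch assumes the input follows the format described in the Input section.

```python
def solve(text: str) -> float:
    """Compute the C-G ratio (as a percentage) for the whole input text."""
    tokens = text.split()
    n = int(tokens[0])                 # number of sequence lines
    bases = "".join(tokens[1:n + 1])   # all N lines joined into one sequence
    cg_count = bases.count("C") + bases.count("G")
    return 100.0 * cg_count / len(bases)


# First sample: 6 of the 24 bases are C or G.
print(solve("4\nGGGATA\nCCCATA\nAAAAAA\nTTTTTT\n"))  # 25.0
```

When submitting, the same function could be fed the full standard input, for example with `print(solve(sys.stdin.read()))` after `import sys`. Printing the float without explicit formatting follows the recommendation above.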

Sample Input 1

4
GGGATA
CCCATA
AAAAAA
TTTTTT

Sample Output 1

25.0

Sample Input 2

2
ATTTA
ATATT

Sample Output 2

0.0

Sample Input 3

1
GATTACA

Sample Output 3

28.5714286

Sample Input 4

2
ACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCATCTGACTCCTGA
GGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGC

Sample Output 4

53.57142857142857
