During my childhood I was passionate about biology, especially genetics. I liked solving inheritance problems and understanding the mechanisms of molecular biology. Later I started to learn programming and, with time, became a Data Scientist, partially forgetting my childhood passion. During my Master’s I rediscovered it, realizing that I could combine my love of biology with my experience in Data Science by pursuing bioinformatics.

The National Human Genome Research Institute defines the field as follows:

“Bioinformatics, as related to genetics and genomics, is a scientific sub-discipline that involves using computer technology to collect, analyze and disseminate biological data and information, such as DNA and amino acid sequences or annotations about those sequences.” (National Human Genome Research Institute)

Since the 1970s, when Walter Fiers’ laboratory sequenced the first gene (1972) and then the complete genome of the bacteriophage MS2 (1976), the amount of biological data, especially DNA, RNA and protein sequences, has grown exponentially. More technologies, methods and algorithms were developed to analyze this data. After the COVID-19 pandemic of 2020, biotechnology became a critical, mainstream field.
Before the 2000s, the majority of biologists and geneticists did their research in vivo or in vitro, because they lacked the information technology and the knowledge needed to apply computer science algorithms in their work. Starting with the “Atlas of Protein Sequence and Structure” created by Margaret Dayhoff’s team, and continuing with the invention of BLAST in 1990, bioinformatics developed into a separate sub-field that applies computer science algorithms to the analysis of biological data.

During my exploration of this field, I observed that many packages implement only the basics and leave the implementation of algorithms to the user, which is often tedious, while others are full tools or web applications that are either hard to run or not flexible enough. That’s why I decided to create a Python package, drosopyla, that implements these algorithms and lets the user call them from Python and focus on experimentation. This series of articles will show different applications of drosopyla to use cases from bioinformatics, together with explanations of the biological details behind them. To use the package, first install it by running the following command:

    !pip install drosopyla

The code of life.

The Hershey-Chase experiment of 1952 demonstrated that DNA (deoxyribonucleic acid) is the carrier of hereditary information. The following year, one of the most influential papers in the history of biology was published: “Molecular Structure of Nucleic Acids: A Structure for Deoxyribose Nucleic Acid”. Its two authors, James Watson and Francis Crick, proposed the structure of DNA and inferred some of its properties from it.
This sub-chapter explores the structure of DNA and the properties and products on which this library is based.

DNA is a double-stranded macromolecule in which nucleotide bases are attached to a sugar-phosphate backbone. The backbone is made of alternating phosphate and sugar groups. In DNA the sugar is 2-deoxyribose, a pentose (five-carbon carbohydrate). Figure 1 shows the chemical structure of the double strand of DNA. Each phosphate links the 5’ carbon of one 2-deoxyribose to the 3’ carbon of the next, so every strand has a 5’-end (ending with a phosphate group) and a 3’-end (ending with a hydroxyl group). This definition will be important later for telling the strands apart. The strands are anti-parallel, meaning that the direction of nucleotides in one strand is opposite to their direction in the other.

Bonded to both backbones are the nucleotide bases: Adenine (A), Cytosine (C), Guanine (G) and Thymine (T). Adenine and Guanine are purines, heterocyclic compounds made of a six-membered ring fused to a five-membered ring, while Cytosine and Thymine are pyrimidines, six-membered heterocyclic compounds. The comparative structure of purine, pyrimidine and the nucleotide bases is presented in Figure 2.

The structure of the nucleotide bases allows them to form hydrogen bonds: a special type of dipole-dipole attraction arising between a hydrogen atom covalently bonded to a very electronegative atom such as N, O or F and another very electronegative atom. Because of their molecular structure, Cytosine forms three hydrogen bonds with Guanine, and Adenine forms two with Thymine. This provides two important properties:

- Counting the frequency of Cytosine and Guanine in a DNA sequence gives an indication of its stability, because this pair is held together by three hydrogen bonds.
- Because hydrogen bonds are weak, they can easily be broken to separate the strands for DNA replication or RNA synthesis.

The hydrogen bonds formed between the nucleotide bases are shown in Figure 3.

DNA keeps hereditary information by encoding the templates for protein synthesis. However, DNA is kept in the nucleus, while most proteins are synthesized on ribosomes. The information in the DNA must therefore leave the nucleus and be interpreted at the ribosomes.

To achieve this, the information in DNA is copied into another type of nucleic acid: ribonucleic acid (RNA). This macromolecule is synthesized from DNA by an enzyme called RNA polymerase, which builds the RNA step by step in the 5’ → 3’ direction. From a biochemical point of view, this process, called transcription, consists of repeated reactions that attach nucleoside triphosphates (NTPs) to the growing RNA chain according to the DNA template. Synthesis takes place inside a transcription bubble: as RNA polymerase moves along the DNA, it separates the strands over a short segment (~14 bp), forming a temporary DNA-RNA hybrid. This process is shown in Figure 4.

The RNA molecule differs from DNA in several ways. First, it is usually single-stranded. Second, its backbone contains ribose instead of deoxyribose. A short region of an RNA molecule is shown in Figure 5.

However, the most important difference from DNA lies in the nucleotide bases. Adenine, Cytosine and Guanine remain in RNA, while Thymine is replaced by Uracil, a nucleotide base specific to RNA molecules.
The difference between Thymine and Uracil is a single methyl group and is presented in Figure 6.

So, from an information point of view, the synthesis of RNA can be represented simply as:

    5’ - ACTTGCTA - 3’ → 5’ - ACUUGCUA - 3’

Types of RNA.

Over the course of evolution, biological systems developed several types of RNA. Some of them are listed below:

- Messenger RNA (mRNA): encodes the protein sequence and serves as the template for its synthesis.
- Transfer RNA (tRNA): delivers amino acids during protein synthesis.
- Ribosomal RNA (rRNA): the RNA component of ribosomes, which carry out protein synthesis.

There are more types of RNA, but they are not important for this article.

The types of RNA listed above are all involved in the synthesis of proteins. First, after leaving the nucleus, the mRNA attaches to ribosomes, where protein synthesis takes place. Ribosomes are organelles made of proteins and rRNA, which either float freely in the cell’s cytoplasm or are attached to the endoplasmic reticulum. During this process the mRNA passes through the ribosome, providing the template on which the peptide chains are built. The process of protein synthesis is shown in Figure 7.

As shown in Figure 7, during peptide synthesis the mRNA passes through the ribosome, a tRNA carrying an amino acid binds to it, and that amino acid is added to the chain of already linked amino acids. tRNAs are reusable molecules: after adding its amino acid to the chain, a tRNA picks up another amino acid and then takes part in the synthesis of another peptide.
The bonding of an amino acid to its tRNA is carried out by aminoacyl-tRNA synthetases (aaRSs), where “aa” stands for the amino acid being attached. Another important segment of the tRNA molecule is the anti-codon region, which is responsible for binding to the mRNA according to the rule of complementarity. In this way tRNA molecules are the translators from the nucleic acid language to the protein language. Figure 8 shows the structure of a tRNA carrying Phenylalanine.

Proteins

We have talked about how proteins are created, but we haven’t yet explored what they are. Proteins are biological macromolecules that can be described as polymers of amino acids. Amino acids are organic compounds that have a carboxyl (-COOH) group and an amino (-NH2) group. If both groups are attached to the same carbon atom, the compound is called an alpha-amino acid; if they are on neighboring carbons, a beta-amino acid, and so on. Figure 9 shows the basic structure of amino acids.

Because they carry both carboxyl and amino groups, amino acids can form chains by linking together through peptide bonds. These amino acid polymers are called peptides. Figure 10 shows the reaction that creates a peptide bond.

Compared to nucleic acids, which are built from 4 monomers (in the majority of cases), the vast majority of proteins are formed from 22 proteinogenic amino acids. Of these, 20 are found in every organism, while Selenocysteine (U) and Pyrrolysine (O) are incorporated only under special conditions. Beyond that, there are many non-proteinogenic amino acids obtained by modifying proteinogenic ones; one example is Hydroxyproline (Hyp), a component of collagen, which is synthesized from Proline (P). Throughout this article, unless stated otherwise, only the 20 main amino acids are considered when talking about amino acids and peptides.

All amino acids are encoded by RNA codons, 3-mers of nucleotide bases.
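Because a codon is a 3-mer over four bases, the number of possible codons is 4 × 4 × 4 = 64. The short sketch below enumerates them in plain Python; it is only an illustration and does not use drosopyla, and the names RNA_BASES and codons are chosen for this example.

```python
from itertools import product

# The four RNA bases.
RNA_BASES = "ACGU"

# Every codon is an ordered triple of bases, so there are 4 ** 3 of them.
codons = ["".join(triple) for triple in product(RNA_BASES, repeat=3)]

print(len(codons))  # 64
print(codons[:4])   # ['AAA', 'AAC', 'AAG', 'AAU']
```

With 64 codons available for roughly 20 amino acids plus the stop signals, several codons necessarily map to the same amino acid.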
The codons that encode the 20 proteinogenic amino acids are presented in Figure 11.

As shown in Figure 11, the same amino acid can be encoded by multiple codons; Leucine (L), for example, is encoded by 6 different codons. This property is called the redundancy of the genetic code. Without it, every Single Nucleotide Polymorphism (SNP) in a coding region would change the encoded protein and could affect the organism critically.

Some codons do not encode any amino acid; these are the so-called stop codons. In the majority of cases they trigger the end of peptide synthesis. The exceptions are Selenocysteine, which is encoded by UGA in the presence of specific factors, and Pyrrolysine, which is encoded by UAG, also in the presence of specific factors that are out of the scope of this article. Additionally, Methionine is very often the first amino acid in a peptide, which is why its codon (AUG) is called the start codon.

Proteins usually have 4 levels of structure: primary, secondary, tertiary and quaternary. The primary structure is the sequence of amino acids. Secondary structures are highly regular local 3-D structures, alpha-helices and beta-sheets, stabilized by hydrogen bonds between backbone atoms. Tertiary structure refers to the three-dimensional structure formed by a single protein molecule (a single polypeptide chain). Finally, quaternary structure is the three-dimensional structure formed by the aggregation of two or more individual polypeptide chains (subunits) that operate as a single functional unit (multimer).

The Python Implementation.

After this long biology lecture we are ready to move on to how the package models biological sequences. The main sequence implementations are located in the sequences module. All sequences follow the same interface and share the same logic for iterable data structure access.
To show these functionalities, the DNA sequence is used as an example; however, they are shared between all biological sequences.

Common functionalities:

Creation of the biological sequence data types:

Before any data structure from the package can be used, it has to be imported and created. The creation of a DNA sequence is shown below.

    # Importing the DNA data type.
    from drosopyla.sequences import DNASequence

    # Creating the dna object.
    dna = DNASequence("ACGTCGATGCTAATGCAG")

    # Printing out the dna object.
    print(dna)

The output of the listing is shown below:

    <DNA: ACGTCGATGCTAATGCAG>

All data types in the library also show the biological sequence when printed.

Getting the string version of a biological sequence.

All biological sequences can be converted into strings by calling the string() function, as shown below:

    # Getting the string version of the sequence.
    dna_string = dna.string()
    print(dna_string)

Output:

    ACGTCGATGCTAATGCAG

Indexing the biological sequence data types.

Being sequences, all biological data types can be indexed and sliced to access their values:

    # Getting the second element of the dna.
    print(dna[1])

    # Getting everything except the last 10 nucleotides of the sequence.
    print(dna[:-10])

    # Getting every second nucleotide of the sequence.
    print(dna[::2])
Output:

    <DNA: C>
    <DNA: ACGTCGAT>
    <DNA: AGCAGTAGA>

IMPORTANT: Sequences in the package cannot be changed after creation, because they were designed for analysis purposes only.

Getting the number of monomers in the sequence.

As with ordinary Python sequences, the number of monomers in a sequence can be obtained with the len() function:

    # Getting the length of the sequence.
    len(dna)

Output:

    18

Monomer frequency.

One of the simplest ways of analyzing biological sequences is to count the frequency of monomers or sub-sequences. To count the occurrences of a sub-sequence, use the count_elements() function, which takes a string as input.

    # Getting the number of Adenine bases in the DNA.
    print(dna.count_elements("A"))

    # Getting the number of times the substring "CG" is present in the DNA.
    print(dna.count_elements("CG"))

Output:

    5
    2

Getting the location of specific sub-sequences in the biological sequence.

All starting indexes of a specific sub-sequence can be obtained with the get_locations() function by providing the sub-sequence. For example, to get all locations of the “CG” pairs in the sequence:

    # Getting all locations of "CG" in the DNA sequence.
    print(dna.get_locations("CG"))

Output:

    [1, 4]
Getting all k-mers of the biological sequence.

In bioinformatics, k-mers are sub-strings of length k contained within a biological sequence. They are useful for finding patterns in sequences that may be biologically important. Figure 12 shows all the k-mers obtained from a sample DNA sequence.

To obtain the list of k-mers of a biological sequence, call the get_kmers() function and provide a value for k, as shown below:

    # Getting the 6-mers of the DNA sequence.
    kmers = dna.get_kmers(6)
    print(kmers)

Output:

    ['ACGTCG', 'CGTCGA', 'GTCGAT', 'TCGATG', 'CGATGC', 'GATGCT', 'ATGCTA', 'TGCTAA', 'GCTAAT', 'CTAATG', 'TAATGC', 'AATGCA', 'ATGCAG']

Keep in mind that the k-mers are returned as a list of strings and that the list is not de-duplicated.

One last common function shared between sequences is iterate(), which iterates through a sequence with a defined window size. Note that the iteration step equals the window size and that the yielded objects are of the same class as the sequence being iterated. Its only argument is L, the size of the window. The listings below show the code snippet and the output of running the iterate function:

    # Iterating through the sequence with a window size = 6
    for window in dna.iterate(6):
        print(window)

Output:

    <DNA: ACGTCG>
    <DNA: ATGCTA>
    <DNA: ATGCAG>

Specific DNA functionalities.

The DNASequence class inherits all the functions listed above and also implements functions that model the properties of DNA arising from its double-stranded nature.
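Two of the common functions described above, k-mer extraction and windowed iteration, are easy to sketch in plain Python. The code below is only an illustration of the idea on ordinary strings, not the drosopyla implementation: the helper names get_kmers and iterate_windows are made up for this example, and dropping a trailing window shorter than the window size is an assumption about the intended behaviour.

```python
def get_kmers(sequence, k):
    """Return every overlapping substring of length k, in order of appearance."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def iterate_windows(sequence, size):
    """Yield consecutive non-overlapping windows (the step equals the window size).

    A trailing fragment shorter than `size` is dropped (assumption).
    """
    for start in range(0, len(sequence) - size + 1, size):
        yield sequence[start:start + size]


seq = "ACGTCGATGCTAATGCAG"
print(len(get_kmers(seq, 6)))         # 13
print(get_kmers(seq, 6)[:3])          # ['ACGTCG', 'CGTCGA', 'GTCGAT']
print(list(iterate_windows(seq, 6)))  # ['ACGTCG', 'ATGCTA', 'ATGCAG']
```

A sequence of length n has n - k + 1 overlapping k-mers but only n // k full non-overlapping windows, which is why the two listings above return 13 and 3 items for the same 18-base sequence.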
The first DNA-specific functionality is getting the complement of the sequence, which follows from the complementarity of the DNA strands. The complement can be obtained with the get_complement function, as listed below:

    # Getting the complement of the DNA sequence.
    dna_complement = dna.get_complement()
    print(dna_complement)

Output:

    <DNA: TGCAGCTACGATTACGTC>

As you can see, this function changes A to T, C to G and vice versa. However, it does not return the second strand in its conventional orientation: the original sequence is written 5’ → 3’, so the plain complement reads 3’ → 5’. To get the opposite strand written 5’ → 3’, use the get_reverse_complement function.

    # Getting the reverse complement of the DNA sequence.
    reverse_complement = dna.get_reverse_complement()
    print(reverse_complement)

Output:

    <DNA: CTGCATTAGCATCGACGT>

Take a moment to look carefully at the output: the generated sequence is the complement of the original DNA sequence, reversed so that it reads in the 5’ → 3’ direction.

As explained before, the DNA molecule is the template for the synthesis of RNA molecules. The DNASequence data structure implements a function for generating the RNA sequence from the DNA object. It is called get_mrna and returns an RNASequence object, which is discussed in the next sub-chapter. The function is applied as follows:

    # Getting the RNA sequence from the DNA template.
    rna = dna.get_mrna()
    print(rna)

Output:

    <RNA: ACGUCGAUGCUAAUGCAG>

The transcription from DNA to RNA is the first step of the so-called Central Dogma of molecular biology.
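At the level of sequence information, the three DNA operations shown above come down to simple string manipulations. The sketch below illustrates them on plain Python strings; it is not the drosopyla implementation, and the names COMPLEMENT, complement, reverse_complement and transcribe_to_rna are hypothetical helpers written for this example.

```python
# Watson-Crick pairing: A <-> T and C <-> G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(dna):
    """Swap every base for its pairing partner, keeping the order."""
    return dna.translate(COMPLEMENT)


def reverse_complement(dna):
    """Complement the strand, then reverse it so it reads 5' -> 3'."""
    return complement(dna)[::-1]


def transcribe_to_rna(dna):
    """Rewrite the coding strand as RNA by replacing T with U."""
    return dna.replace("T", "U")


dna = "ACGTCGATGCTAATGCAG"
print(complement(dna))          # TGCAGCTACGATTACGTC
print(reverse_complement(dna))  # CTGCATTAGCATCGACGT
print(transcribe_to_rna(dna))   # ACGUCGAUGCUAAUGCAG
```

The three results match the sequences printed by the package listings above for the same input.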
The Central Dogma was first stated by Francis Crick in 1957 and then published in 1958:

“The Central Dogma. This states that once "information" has passed into protein it cannot get out again. In more detail, the transfer of information from nucleic acid to nucleic acid, or from nucleic acid to protein may be possible, but transfer from protein to protein, or from protein to nucleic acid is impossible. Information here means the precise determination of sequence, either of bases in the nucleic acid or of amino acid residues in the protein.”

A simpler reformulation is usually stated as “DNA makes RNA, and RNA makes protein” and is illustrated in Figure 13. On the left, the figure shows the biological terms for converting one molecule into another, while the right side shows the package functions that implement these biological processes.

RNA functionalities.

The next stage of the Central Dogma is the RNA molecule. In drosopyla, RNA molecules are represented by the RNASequence data structure. RNASequence objects can be created in two ways: by using the constructor, or by calling the get_mrna() function on a DNASequence object, as shown before.

    # Importing the RNASequence from the package.
    from drosopyla.sequences import RNASequence

    # Creating the RNA object.
    rna = RNASequence("ACGUCGAUGCUAAUGCAG")
    print(rna)
Output:

    <RNA: ACGUCGAUGCUAAUGCAG>

As a bridge between DNA and proteins, the RNA can be translated into a protein sequence (in the package this operation is named transcribe). To get the protein (or peptide) sequence from the RNA, use the transcribe function as shown in the code listing below:

    # Creating the protein object from the RNA sequence.
    protein = rna.transcribe()
    print(protein)

Output:

    <Protein: TSMLMQ>

If the number of nucleotides in the RNA object is not divisible by 3, the function raises an error, so before calling it make sure the RNA sequence has a length divisible by 3, or trim the sequence to the required length. The example above uses the standard genetic code, in which stop codons are represented by the “|” symbol. As mentioned before, in some cases the code can be changed to include additional amino acids: Selenocysteine (U) is encoded by UGA and Pyrrolysine (O) is encoded by UAG under specific conditions. To include these amino acids, a modified code has to be provided, as shown below.

    # Importing the standard code.
    from drosopyla.utils import RNA2AMINO_ACIDS

    # Creating the RNA sequence object.
    rna = RNASequence("UGAUCGAUGCUAAUGUAG")

    # Translating the RNA using the standard genetic code.
    protein_1 = rna.transcribe()
    print(protein_1)

    # Updating the RNA-to-amino-acid code with the two extra amino acids.
    # Note: new_rna2aa is the same dictionary object as RNA2AMINO_ACIDS,
    # so the update changes the package-level code in place.
    new_rna2aa = RNA2AMINO_ACIDS
    new_rna2aa.update({"UGA": "U", "UAG": "O"})

    # Translating the RNA using the updated genetic code.
    protein_2 = rna.transcribe()
    print(protein_2)
Output:

    <Protein: |SMLM|>
    <Protein: USMLMO>

As you can see, two different peptide sequences can be obtained from the same RNA sequence by using different genetic codes.

Finally, from the RNA sequence it is possible to get the DNA sequence it was transcribed from. This runs in the opposite direction to the simplified “DNA makes RNA” formulation and will be discussed in more detail in the next article; the following listing shows how to apply this function (using the last created RNA object).

    # Getting the DNA source of the RNA.
    dna_source = rna.back_translation()
    print(dna_source)

Output:

    <DNA: TGATCGATGCTAATGTAG>

Protein functionalities.

Last in the chain of biological molecules are the proteins. They are encoded by the DNA and translated from RNA. Amino acids in this package are represented by their single-letter codes, and peptides as sequences of letters. Protein sequences can be created either with the constructor, as shown below, or with RNASequence’s transcribe function, as shown earlier:

    # Importing the Protein from the package.
    from drosopyla.sequences import Protein

    # Creating the protein sequence.
    protein = Protein("TSMLMQ")
    print(protein)

Output:

    <Protein: TSMLMQ>

At the time of writing, the Protein class implements one more function, compute_mass, which calculates the molar mass of the peptide.
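To give a feeling for what such a calculation involves, the sketch below sums rounded residue masses for the example peptide in plain Python. It is an illustration only, not the drosopyla implementation: RESIDUE_MASS and peptide_residue_mass are hypothetical names, and the table holds commonly used integer residue masses (in daltons) just for the five amino acids that occur in TSMLMQ.

```python
# Rounded residue masses in daltons (amino acid minus one water molecule),
# listed only for the residues that appear in the example peptide.
RESIDUE_MASS = {"T": 101, "S": 87, "M": 131, "L": 113, "Q": 128}


def peptide_residue_mass(peptide):
    """Sum the residue masses of every amino acid in the peptide."""
    return sum(RESIDUE_MASS[amino_acid] for amino_acid in peptide)


print(peptide_residue_mass("TSMLMQ"))  # 691
```

An amino acid missing from the table raises a KeyError here, which is the simplest way to signal that its mass has to be supplied first.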
The function takes into account only the masses of amino acid residues, that is, amino acids as they occur inside the chain (called the ionised form in this article). To compute the mass of the free, non-ionised peptide molecule, add the mass of one water molecule (18) to the result: each peptide bond releases a water molecule, whose atoms are already excluded from the residue masses, while the two ends of the chain still carry an H and an OH. This is shown in the listing below:

    # Computing the mass of the ionised and non-ionised peptide.
    ionised_mass = protein.compute_mass()
    non_ionised_mass = protein.compute_mass() + 18
    print(f"The molar mass of ionised peptides = {ionised_mass}")
    print(f"The molar mass of non-ionised peptides = {non_ionised_mass}")

Output:

    The molar mass of ionised peptides = 691
    The molar mass of non-ionised peptides = 709

The masses above cover polypeptides formed only from the standard 20 amino acids. To include the other amino acids, follow a procedure similar to the one shown for the genetic code:
    # Importing the mapper with the molar masses of amino acids from the package.
    from drosopyla.utils import AMINO_ACID_MASS

    protein = Protein("USMLMO")

    # Adding the masses of Selenocysteine and Pyrrolysine.
    # The mapping is updated in place, mirroring the genetic-code example,
    # on the assumption that compute_mass() reads AMINO_ACID_MASS.
    aa_masses = AMINO_ACID_MASS
    aa_masses.update({"U": 168, "O": 255})

    # Calculating the masses of the ionised and non-ionised peptide.
    ionised_mass = protein.compute_mass()
    non_ionised_mass = protein.compute_mass() + 18
    print(f"The molar mass of ionised peptides = {ionised_mass}")
    print(f"The molar mass of non-ionised peptides = {non_ionised_mass}")

Output:

    The molar mass of ionised peptides = 885
    The molar mass of non-ionised peptides = 903

Note that 168 and 255 are approximately the molar masses of free Selenocysteine and Pyrrolysine; their residue masses would each be about 18 lower.

Final example.

Before finishing the article, I want to show how to use these data structures to get useful insights from a sequence. CG content (more commonly written GC content) is one of the simplest metrics for analyzing sequence data. It is the number of C and G bases divided by the length of the DNA or RNA sequence:

    CG content = (count(C) + count(G)) / length of the sequence

This metric is useful for estimating the stability of a DNA molecule, because C and G are joined by three hydrogen bonds. It is also useful for finding regions with many protein-coding genes and for locating the coding regions of genes in eukaryotes. Below is the implementation and use of such a function:
    # Defining the CG content function.
    def CG_content(sequence):
        return (sequence.count_elements("C") + sequence.count_elements("G")) / len(sequence)

    # Creating the DNA Sequence.
    dna = DNASequence("ACGTCGATGCTAATGCAG")

    # Calculating and printing the CG content.
    cg_content = CG_content(dna)
    print(f"CG content = {cg_content}")

Output:

    CG content = 0.5

The function above can also be used to compute the CG content of an RNA sequence.

Conclusion.

In this article I tried to give a short explanation of bioinformatics as a field and a short, informative introduction to the biomolecules most studied by bioinformaticians. I also provided more biological context than usual, because I think bioinformatics is easier to understand when the underlying biology is taken into account. Finally, the article presented the basic biological sequence data types and how to use them with the drosopyla package. It is the first in a series of articles that will show how to use multiple modules of the package to perform bioinformatics analysis in Python, together with the biochemical context.

PS: This article came out a little too long because I adore biology. I hope it didn’t bother you; the following articles will be a little shorter.