Bioinformatics I: August 2016

Coursera Week 3:
Multiple sequence alignments let you trace the evolution of a species and work out which sequences are useful, because useful sequences tend to be preserved (conserved) across species.
To do a multiple sequence alignment you need to create a scoring scheme. You compare columns and assign each one a numerical value that ranks how well its residues match, that is, how likely the column is to be homologous.
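As a rough illustration of scoring columns, here is a minimal Python sketch using a sum-of-pairs score. Sum-of-pairs is one common scheme and is not necessarily the one used in the lecture, and the match, mismatch and gap values are made-up numbers chosen only for the example.

```python
from itertools import combinations

# Illustrative values only (assumptions, not taken from the course).
MATCH, MISMATCH, GAP = 1, 0, -1

def pair_score(a, b):
    """Score one pair of characters taken from the same column."""
    if a == "-" and b == "-":
        return 0              # two gaps carry no information
    if a == "-" or b == "-":
        return GAP
    return MATCH if a == b else MISMATCH

def column_score(column):
    """Sum the scores of every pair of characters in a column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))

def alignment_score(rows):
    """Add up the column scores of an alignment given as equal-length strings."""
    return sum(column_score(col) for col in zip(*rows))

rows = ["ACGT", "ACGA", "A-GT"]
print([column_score(col) for col in zip(*rows)])  # [3, -1, 3, 1]
print(alignment_score(rows))                      # 6
```

Fully conserved columns get the highest value, and columns with gaps or mismatches rank lower.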
Algorithms used for MSA (multiple sequence alignment):
Dynamic programming (better)
Multidimensional dynamic programming (worse: the running time grows exponentially with the number of sequences)
Programs that do MSA are Clustal (Progressive MSA) or DIALIGN (Local MSA)

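Progressive MSA aligns the most closely related sequences first and then progressively adds more distantly related ones. The order of residues within each sequence is not disturbed during alignment; only gaps are inserted.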
Clustal
Insert gaps where needed in order to better align the sequences
If you place a gap within a sequence you must also apply a gap penalty (subtract points from the alignment score for each gap insertion)
Then create a guide tree based on how related the sequences are
These guide trees are phylogenetic trees
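To make the gap penalty concrete, here is a small Python sketch that scores two sequences that have already been aligned. The function name and the match, mismatch and gap penalty values are assumptions chosen for the example; they are not Clustal's actual parameters.

```python
def score_pairwise(seq_a, seq_b, match=1, mismatch=0, gap_penalty=2):
    """Score two already-aligned sequences of equal length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            score -= gap_penalty   # pay the penalty for every gap position
        elif a == b:
            score += match
        else:
            score += mismatch
    return score

# Same two sequences, one gap each time, placed in different spots.
gap_at_end = score_pairwise("ACGTTACG", "ACTTACG-")
gap_inside = score_pairwise("ACGTTACG", "AC-TTACG")
print(gap_at_end, gap_inside)  # 1 5
```

Both alignments pay for one gap, but the gap placed where the sequences actually differ gives a much better score than the gap placed at the end.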
Clustal suffers from taking the quickest solution rather than the best one. It makes temporary fixes (inserting gaps in the quickest place rather than the most strategic one) instead of long-term fixes. These temporary fixes propagate and lead to a poorer overall alignment. To compensate for these errors, iterative methods go through the alignment, identify the subgroups within the larger alignment that were aligned with quick fixes, replace the quick fixes with long-term ones, and then reinsert the subgroups into the full alignment to be realigned.
Once an iterative program has repaired the temporary fixes, it realigns the sequences, builds a phylogenetic tree from the new alignment, determines the MSA and scores it. It then compares that score with the score of the original Clustal alignment. If the new score is better, it goes back and realigns everything again; if it is not better, it is done. It runs in a loop, realigning each time the alignment improves, until it reaches a point where the alignment cannot get any better.
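The "keep realigning while the score improves" loop can be sketched in Python as below. This only shows the control flow: the scoring function and the refinement step are toy stand-ins invented for the example (a real iterative aligner would re-align subgroups and rebuild the tree), and all function names are hypothetical.

```python
def toy_score(rows):
    """Toy score: +1 per fully conserved column, -1 per gap character."""
    conserved = sum(1 for col in zip(*rows)
                    if len(set(col)) == 1 and col[0] != "-")
    gaps = sum(row.count("-") for row in rows)
    return conserved - gaps

def drop_all_gap_columns(rows):
    """Stand-in refinement step: remove columns that contain only gaps."""
    kept = [col for col in zip(*rows) if set(col) != {"-"}]
    return ["".join(chars) for chars in zip(*kept)] if kept else rows

def refine_until_stable(rows, refine_step, score_fn, max_rounds=100):
    """Repeat the refinement step for as long as the score keeps improving."""
    best, best_score = rows, score_fn(rows)
    for _ in range(max_rounds):
        candidate = refine_step(best)
        candidate_score = score_fn(candidate)
        if candidate_score <= best_score:
            break                  # no improvement, so the loop is done
        best, best_score = candidate, candidate_score
    return best, best_score

start = ["AC--GT", "AC--GT", "A---GT"]
final, final_score = refine_until_stable(start, drop_all_gap_columns, toy_score)
print(toy_score(start), final, final_score)
```

The first round improves the score, the second round changes nothing, and so the loop stops, which is the same stopping rule the notes describe.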
Dynamic substitution matrices are used to compare sequences once they are aligned. Lower-numbered BLOSUM matrices are used for lower-scoring alignments. (These matrices are also used in BLAST.)

Then there is local MSA (DIALIGN), which compares segments of DNA that sit in different places within the global (total) sequences. These gap-free matching segments are known as diagonals.

This is basically what DIALIGN does: it finds the diagonals and then weighs the worth of each one, based on length.
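To show what a diagonal looks like in practice, here is a simplified Python sketch that finds gap-free runs of exact matches between two sequences and uses the run length as the weight. This is only an illustration of the idea: the function name and the minimum length are assumptions, and the real DIALIGN weighting is more involved than a plain length.

```python
def find_diagonals(seq_a, seq_b, min_len=3):
    """Return (start_a, start_b, length) for each maximal run of exact matches."""
    diagonals = []
    # Each offset (position in seq_a minus position in seq_b) is one diagonal.
    for offset in range(-(len(seq_b) - 1), len(seq_a)):
        i = max(offset, 0)         # position in seq_a
        j = i - offset             # position in seq_b
        run = 0
        while i < len(seq_a) and j < len(seq_b):
            if seq_a[i] == seq_b[j]:
                run += 1
            else:
                if run >= min_len:
                    diagonals.append((i - run, j - run, run))
                run = 0
            i += 1
            j += 1
        if run >= min_len:         # a run that reaches the end of a sequence
            diagonals.append((i - run, j - run, run))
    return diagonals

seq_a = "AAAGATTACA"
seq_b = "CCGATTACACC"
diagonals = find_diagonals(seq_a, seq_b)
best = max(diagonals, key=lambda d: d[2])   # weight each diagonal by its length
print(best)  # (3, 2, 7)
```

Here the shared segment GATTACA starts at different positions in the two sequences, which is exactly the "same region in different places" situation described above.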

To compare sequences which are related use Clustal
To compare sequences which are unrelated and have conserved regions that are consistent use DIALIGN
To compare sequences which are unrelated and have conserved regions that are not consistent use MAFFT
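Protein is easier to align than DNA. With DNA, a position scores 1 if it matches and 0 if it doesn't. Proteins are more forgiving: the genetic code is redundant (several codons give the same amino acid) and substitution matrices give partial credit to similar amino acids, so it's easier to find a match.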
*Too many gaps, insertions or columns that don't match means that something is wrong with the alignment.
Using MSA programs is a skill. It is very easy to get poor MSA results using Clustal (the most popular website). With any of the MSA programs it is very important to remember that it is not foolproof and that you cannot blindly trust the results it produces.
When to use global alignments:
When the sequences can be aligned along their entire length. If the sequences are of different lengths, you can insert gaps to compensate for the difference and then align.
When to use local alignments:
When the sequences can only be aligned in certain regions.
Sequences downloaded from NCBI for input into MSA programs have incredibly lengthy names. You will need to learn Perl, Python or Ruby in order to rename them and keep things simple.
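As an example of the renaming, here is a minimal Python sketch that shortens FASTA header lines to their first word. It assumes the first word of each header is the identifier you want to keep; the headers and sequences in the example are made up and are not real NCBI records.

```python
def shorten_fasta_headers(fasta_text):
    """Keep only the first word of each header line; leave sequence lines alone."""
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            words = line[1:].split()
            out.append(">" + (words[0] if words else ""))
        else:
            out.append(line)
    return "\n".join(out)

# Made-up records for illustration only.
example = """>XX_000001.1 Made-up organism strain A, complete genome
ACGTACGT
>XX_000002.1 Made-up organism strain B, complete genome
ACGTTCGT"""

print(shorten_fasta_headers(example))
```

The short names are much easier to read in the alignment output and in any tree built from it.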
MEGA:
The MSA program MEGA is very good at taking DNA, translating it into protein, aligning it, and then translating the alignment back into DNA. This is a very useful tool, but it must be used carefully because only a small percentage of DNA sequences code for protein. You must also make sure that the sequence you input is the full coding sequence: if you start at a point other than the true start, the sequence is read incorrectly, because it is read in codons and will code for the wrong protein.
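The reading-frame problem is easy to see with a small Python sketch. This is not how MEGA does it internally; it is just a plain translation with the standard genetic code, and the function name and the short example sequence are made up for illustration.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna, frame=0):
    """Translate DNA codon by codon, starting at position `frame`."""
    protein = []
    for i in range(frame, len(dna) - 2, 3):
        codon = dna[i:i + 3].upper()
        protein.append(CODON_TABLE.get(codon, "X"))  # X for unknown codons
    return "".join(protein)

dna = "ATGGCCAAGTAA"            # made-up example sequence
print(translate(dna, frame=0))  # MAK*
print(translate(dna, frame=1))  # WPS
```

Starting one base late gives a completely different protein, which is why the input has to begin at the true start of the coding sequence.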
DIALIGN
When you have sequences that are unalignable but have short conserved regions, it is best to use DIALIGN. DIALIGN is also very good at the DNA to protein conversion.
MAFFT
MAFFT is the best tool for general MSA problems that MEGA or Clustal struggle with. The drawbacks of the Clustal process are fixed in MAFFT. MAFFT automatically accounts for the quality of the alignment by looking at the number of insertions and lowering the MSA score accordingly. It also works at incredible speed considering the amount of work it is doing.
Programming terms:
Regex (regular expressions): a way of writing a pattern so that the system matches what you input against what it has stored. For example, looking up obvi and having obviously come up. It makes your searches easier and more efficient. You will have to know Ruby, Perl or Python to use them. Crimson is a good way to work with regex.
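Here is the obvi example written as a small Python sketch using the standard `re` module. The word list and the header line are made up for illustration.

```python
import re

# Match any word that starts with "obvi".
pattern = re.compile(r"^obvi\w*")
words = ["obvious", "obviously", "observe", "oblivious"]
matches = [word for word in words if pattern.match(word)]
print(matches)  # ['obvious', 'obviously']

# The same idea applied to a made-up sequence header:
# keep the first word after ">" and drop the rest of the line.
header = ">XX_000001.1 Made-up organism strain A, complete genome"
short_header = re.sub(r"^>(\S+).*", r">\1", header)
print(short_header)  # >XX_000001.1
```

The second pattern shows why regex is handy for sequence work: one short pattern can clean up every header in a file.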
***MSA is integral to solving the issue the Lyme disease paper brought up. The genomes of all of the strains of Lyme disease must be sequenced, and once they are sequenced you need to run them through an MSA program to see which parts of the sequence are conserved throughout the species and which are not. This can help determine which sequences are essential to the functioning of Lyme disease and which are associated with the differentiation and adaptation of the strains. Because MSA uses scoring matrices, it is also vital to master those.
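Once the strains are aligned, picking out the conserved parts can be sketched in a few lines of Python. The aligned sequences below are made up and do not come from any real genome; the sketch simply treats a column as conserved when every sequence has the same residue and none has a gap.

```python
def conserved_columns(rows):
    """Return the indexes of columns where every sequence has the same residue."""
    return [
        i for i, col in enumerate(zip(*rows))
        if len(set(col)) == 1 and col[0] != "-"
    ]

# Made-up aligned sequences standing in for different strains.
strains = ["ATGCCGTA", "ATGACGTA", "ATG-CGTT"]
print(conserved_columns(strains))  # [0, 1, 2, 4, 5, 6]
```

Columns 3 and 7 differ between the made-up strains, so those are the positions that would point to differences between strains rather than shared function.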

Bioinformatics I

Sunday, August 7, 2016

Bioinformatics Week 5