Bioinformatics I: 2016

Monday, December 19, 2016

Setting up Biopython

http://www.codesdope.com/python-introduction
Open IDLE
When running a new file and trying to add it to the old code (if you have a new code that you are messing up, but you don't want to screw up your old code with all this clutter of failures, you create a new file and then

IDLE color-codes your Python code as you type
Keywords (orange):
'False', 'None', 'True', 'and', 'as', 'assert', 'break', 'class', 'continue', 'def', 'del', 'elif', 'else', 'except', 'finally', 'for', 'from', 'global', 'if', 'import', 'in', 'is', 'lambda', 'nonlocal', 'not', 'or', 'pass', 'raise', 'return', 'try', 'while', 'with', 'yield'

Output (blue):
Whatever the shell prints back, for example the result of a print call

This will be the output:

A command only executes when you type it at the >>> prompt in the IDLE shell and press Enter. Code typed in a new file window has no prompt, so nothing happens until you run the file.

To run code from a new file: write it in a new IDLE file window, click Run at the top, then click Run Module. IDLE will ask you to save the file; save it. The code then runs in the main IDLE shell window, the one that runs all your instructions.

To make a comment, type a # and then the comment. It appears in red and does not affect your code; it is just a note for yourself.

You can define variables and then manipulate them with commands making them add, subtract, divide or multiply.
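A small illustration of comments, variables and the four arithmetic operations described above. The variable names and numbers are made up for the example.

```python
# A comment: Python ignores everything after the # sign.
a = 12
b = 5

total = a + b        # addition
difference = a - b   # subtraction
product = a * b      # multiplication
quotient = a / b     # division (always gives a float in Python 3)

print(total, difference, product, quotient)
```

Saving this in a new IDLE file and choosing Run Module prints the four results in the shell window.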

There are certain words that you cannot use as variable names. These are the keywords, which appear in orange, and each one has its own function in Python.

You can ask Python to identify the type of a value (I don't understand the use of this yet). If you write type followed by a value in parentheses, it reports str (string) for text in quotes, int (integer) for a whole number, and float for a number with a decimal point.
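A short sketch of what type reports. One possible use is checking whether a value is text or a number before doing arithmetic with it. The example values here are made up.

```python
print(type("hello"))   # <class 'str'>
print(type(42))        # <class 'int'>
print(type(3.14))      # <class 'float'>

# A number stored as text cannot be added to until it is converted.
count_text = "257"
print(type(count_text))    # <class 'str'>
count = int(count_text)    # convert the text to a whole number
print(count + 43)          # 300
```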

Issues: because the website I am using to learn to code is made for Python 2 and not Python 3, some of the commands, syntax and general rules for coding have changed. So as I try to complete the exercises the website gives me, I need to modify the code, and it takes a lot of trial and error before I get it to do what I want. What I am currently trying to figure out is how to fix these two functions. I've been working on getting them to work for the past hour, reading articles about the changes from 2 to 3, and I can't figure it out.
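The post does not say which two functions were the problem, so the following is only a guess at the usual suspects: three of the most common changes a Python 2 tutorial runs into under Python 3. The variable names are made up for the example.

```python
# Python 2:  print "Hello"         -> Python 3 needs parentheses
print("Hello")

# Python 2:  7 / 2 gave 3          -> Python 3 gives 3.5; use // for whole-number division
true_division = 7 / 2
floor_division = 7 // 2
print(true_division, floor_division)

# Python 2:  raw_input("Name: ")   -> Python 3 renamed it to input(),
# which always returns a string, so numbers must be converted by hand:
#     age = int(input("Age: "))
```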

https://docs.python.org/3.0/whatsnew/3.0.html

What has changed in python from 2 to 3.

Help for beginner functions (colors) http://www.annedawson.net/Python3_Intro.htm
http://www.annedawson.net/Python3Programs.txt (example programs)

Sunday, December 11, 2016

Progress report 3

https://docs.google.com/presentation/d/1dQfjkiDvQALoH287uLhs4TDPdfTRDmXyto4qvGHUdmk/edit#slide=id.g1a0be35812_0_23

Sunday, December 4, 2016

Including Protein Structure and Docking sites in Website (RaptorX)

RaptorX is a free downloadable program which predicts the 3D tertiary structure of a protein as well as possible binding sites. It also lets you compare 2 or more protein structures through protein structure alignment. You send in your sequence (a job) and they send you back "its secondary and tertiary structures as well as contact map, solvent accessibility, disordered regions and binding sites." To download the software you must be affiliated with an organization. Since I am not, once I have designed my website I will need to follow their instructions: "please send us your name, organization and your email address through contact us and we will do some internal setup so that you can download the software."
Official Website:
http://raptorx.uchicago.edu/
User guide:
https://www.oregon.gov/OMD/OEM/docs/plan_train/RAPTOR/RAPTOR_User_Guide.pdf
Protein structure, modeling and applications:
https://www.ncbi.nlm.nih.gov/books/NBK6824/

ProDy:
https://pypi.python.org/pypi/ProDy/1.8.2

Thursday, December 1, 2016

Matlab Bioinformatics

The MatLab bioinformatics toolbox includes modules that let you read information from FASTA, SAM, CEL and CDF files. You can also access information from GenBank and the NCBI Gene Expression Omnibus. The program can open the files and then present the data found within them as visuals (sequence browsers, spatial heatmaps, and clustergrams). It also has "statistical techniques for detecting peaks, imputing values for missing data, and selecting features".

After reviewing the overview of the bioinformatics toolbox in MatLab, I have come to the conclusion that it would be no help to me right now. The primary purpose of the program is to let scientists input their data, code how they'd like the program to organize it, and then analyze it. This helps scientists reach conclusions faster than if they looked at completely unorganized data, and it helps them detect trends. I am not trying to analyze data at this point. I am trying to collect all the protein functions and put them in one place. Later on, once I have designed the website, I could use Matlab to analyze all the data I've collected and come to a conclusion myself. Ideally, scientists would take the data I've compiled, organize it themselves and come to their own conclusions. I am just the mediator: I let them access all the data in one place and then they analyze it.

What I take from this: I need to include the protein sequences in a FASTA file so that if researchers do want to put my data into MatLab and analyze the trends, they can.
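A minimal sketch of what producing a FASTA record could look like in plain Python. The function name to_fasta, the 60-character line width and the record name are assumptions made for illustration, and the sequence is a repeated placeholder, not a real protein.

```python
def to_fasta(name, sequence, width=60):
    """Return one FASTA record: a '>' header line, then the sequence in lines of `width`."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"

# Placeholder sequence, NOT a real protein.
placeholder_sequence = "MKT" * 25
record = to_fasta("example_protein", placeholder_sequence)
print(record)
```

Writing the returned text to a file whose name ends in .fasta gives something that tools which read FASTA should be able to open.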

Process for strain name retrieval

1. Go onto ScienceDirect
2. Click on the button for Bronx Science proxy
3. Search for each Borrelia subspecies name, in this order:

Borrelia burgdorferi

Borrelia garinii

Borrelia afzelii

Borrelia bavariensis

Borrelia valaisiana

Borrelia lusitaniae

Borrelia finlandensis

Borrelia bissettii

Borrelia spielmanii

Borrelia carolinensis

Borrelia kurtenbachii

Borrelia andersonii

Borrelia americana

Borrelia turdi

Borrelia yangtze

Borrelia japonica

Borrelia chilensis

Borrelia parkeri

Borrelia duttonii

Borrelia hermsii

Borrelia turicatae

Borrelia recurrentis

Borrelia crocidurae

Borrelia mayonii

Borrelia persica

4. Click on every link that has both keywords in it (Borrelia+subspecies name)

5. Go to the page of the link and see if it is the full article (it should be since I have full access to the database)

6. Scroll down and look for the chart or pictures listing every strain used in the study. It will list both the strain and the subspecies (genospecies) the strain is a part of.
7. To identify whether the chart holds strains, look for key words like strain name, species and isolate, or just isolate.
8. Copy any data in the column under the key words. If it is split up by species, make sure you record each strain under its specific species.

9. Go onto my master list of strains and find the subspecies that the strain you have found is listed under. Scan to see whether you have already collected that strain name.

10. If you have not yet collected that strain, copy it down under the proper subspecies.

11. Do this for every paper that includes the subspecies in its title when you search for it.

12. While going through the papers, if the chart includes a new subspecies that you don't have on your master list, add it to the master list along with any strains that fall under it.
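The bookkeeping part of this process (check whether a strain is already on the master list, add it if not, and add a new subspecies when one turns up) can be sketched in Python. This is only an illustration: the function name add_strains and the dictionary layout are assumptions, and the few strain names are taken from the master list in these notes.

```python
# Master list: subspecies name -> set of strain names collected so far.
master_list = {
    "Borrelia burgdorferi": {"B31", "N40", "297"},
    "Borrelia garinii": {"PBi", "20047"},
}

def add_strains(master, species, strains):
    """Add strains under a species, creating the species if it is new.

    Returns the sorted list of strain names that were not already collected.
    """
    known = master.setdefault(species, set())
    new = {name.strip() for name in strains} - known
    known |= new
    return sorted(new)

# B31 is already collected, so only JD1 is reported as new.
print(add_strains(master_list, "Borrelia burgdorferi", ["B31", "JD1"]))
# A subspecies not yet on the list is created automatically.
print(add_strains(master_list, "Borrelia afzelii", ["PKo", "ACA-1"]))
```

Using a set per subspecies means a strain that appears in several papers is stored only once.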

Tuesday, November 8, 2016

Progress report 2

https://docs.google.com/presentation/d/1URbdIgvTztwPtwtl2RHybWTiGNr_l4IEVECbvlAYRZQ/edit

Sunday, October 30, 2016

Literature+Prior Arts Survey

Prior Arts Survey
http://opm.phar.umich.edu/species.php?species=Borrelia%20burgdorferi
This database contains the structure and information on OspC protein of Borrelia burgdorferi

http://borreliabase.org/
This database contains multiple DNA sequences of the strains.

http://www.lyme-disease-research-database.com/
Database of technical papers on lyme disease.

http://biopython.org/
I will be using biopython to design my website

https://www.ncbi.nlm.nih.gov/pmc/
I get my information from PubMed papers

http://www.sciencedirect.com/
I get my information from technical papers on science direct along with PubMed

To figure out which existing patents might matter, I first needed to identify the parts of my project that could be patentable. First, a database of Lyme disease information. Only 3 such databases exist, all listed above, and none of them covers the territory that my database will. Second, the sequences or strains that I am collecting. According to a recent Supreme Court decision, sequences that exist in nature cannot be patented, but sequences that are edited and not found in nature can be. This will not be an issue for me because I am only using the natural genomes of Borrelia. As for patenting strains, you can only patent a strain if it does not exist in nature and you genetically engineered it. I will not be dealing with any genetically engineered strains, nor am I creating any Lyme disease strains, so a patent search to avoid infringing on someone else's patent would be pointless. I will be coding in a language that is free for public use, and if I decide to use Matlab, we have paid for a year's subscription, so that won't be an issue. I looked through about 25 patents under the searches "bioinformatics lyme disease" and "database lyme disease" and nothing that would overlap with my research came up. Most patents under those searches covered an antibody that detects Lyme disease proteins in a western blot, or a new type of antibiotic. I am not creating anything, only compiling existing data that is already available to the public, so I do not see any part of my project that would infringe upon a patent.

Literature Survey
https://www.ncbi.nlm.nih.gov/pubmed/27588694
FlgE is a protein involved in the Borrelia hooking onto cells.

https://www.ncbi.nlm.nih.gov/pubmed/26480895
BbHtrA is a protease within Borrelia burgdorferi.

https://www.ncbi.nlm.nih.gov/pubmed/26438793
OspC is an outer surface protein on Bb which prevents it from being eaten by phagocytes.

https://www.ncbi.nlm.nih.gov/pubmed/26953324
OspA and OspB are both outer surface proteins on Bb and help it evade the immune system

https://www.ncbi.nlm.nih.gov/pubmed/27502325
The lp28-1 plasmid is responsible for the variation in the VlsE lipoprotein that helps Bb evade the immune system.

https://www.ncbi.nlm.nih.gov/pubmed/27161310
A variant of the TP0435 lipoprotein, which was thought to be found only in Treponema pallidum, is also found in Borrelia burgdorferi, and it allows the bacteria to adhere to a host cell.

https://www.ncbi.nlm.nih.gov/pubmed/26808924
BBK32 is another lipoprotein which blocks the recruitment of molecules within the complement system

https://www.ncbi.nlm.nih.gov/pubmed/26434356
BGA66 and BGA71 are both outer surface proteins which inhibit the complement system, the alternative pathway and the classical pathway.

https://www.ncbi.nlm.nih.gov/pubmed/26247174
Lmp1 aids in the adhesion of Bb to the host cell and allows the persistence of infection.

https://www.ncbi.nlm.nih.gov/pubmed/26181365
BAPKO_0422 is a protein that binds to the human factor H and inhibits the complement system.

https://www.ncbi.nlm.nih.gov/pubmed/24191298
CspA binds to human factor H and inhibits the complement system.

https://www.ncbi.nlm.nih.gov/pubmed/27725820
CspZ, ErpA, ErpC, ErpP, and p43 are all surface proteins that allow Bb to evade the immune system. A variant of CD59 binds to human factor H to inhibit the complement system.

https://www.ncbi.nlm.nih.gov/pubmed/24702793
CspZ binds to CFH and CFHL-1 to inhibit complement system. CspZ is one of the 5 proteins that Bb creates that binds to CFH and CFHL-1.

https://www.ncbi.nlm.nih.gov/pubmed/25582082
ErpA, ErpC, ErpP all bind to CFH and CFHL-1 however ErpC may only bind to CFHR and not CFH.

https://www.ncbi.nlm.nih.gov/pubmed/20022381
CRASP-1/Bba68 works with OspE to bind to factor A and inhibit complement system.

https://www.ncbi.nlm.nih.gov/pubmed/14629271
p39, p41 in IgM IB, and p83/100, p39, Osp17 in IgG IB; in late LB: p39, p41 in IgM IB, and p83/100, Osp17, p21 and p43 in IgG IB are all proteins that can be detected at different stages during a lyme disease infection.

https://www.ncbi.nlm.nih.gov/pubmed/11599789
Osp17 and OspC actually induce a humoral immune response

https://www.ncbi.nlm.nih.gov/pubmed/19451251
CspA, a surface lipoprotein, binds to FH/FHL-1 (human factor H and factor H-like protein 1) and allows Borrelia to evade the immune system in an unknown manner. FH normally regulates the complement system, protecting the body's own cells from complement attack.

Borrelia burgdorferi evades the immune system by preventing certain pathways of the immune response from working. It inhibits the complement system by creating proteins that bind to human factor H (CFH) and CFHL-1. This prevents the complement system from recruiting other proteins to fight the infection. The proteins that bind to human factor H are a variant of CD59, a variant of TP0435, BBK32, BGA66, BAPKO_0422, CspA and BGA71. Other proteins aid in the adhesion of the bacteria to the host cell; these include Lmp1 and FlgE. The detection of Lyme disease differs depending on how far the infection has progressed. In the earliest stages the detectable proteins are p39 and p41 in IgM IB, and p83/100, p39 and Osp17 in IgG IB. In late LB they are p39 and p41 in IgM IB, and p83/100, Osp17, p21 and p43 in IgG IB. It is crucial that we understand which proteins to detect at different stages so that we can have the most accurate test for whether or not you have Lyme disease. The current test is only 50% accurate. Knowing which proteins are present at which stages, and which proteins do what, will help researchers find better ways to treat and diagnose Lyme disease.

Sunday, October 16, 2016

Problem+Collection of strains

257/300 strains found

Borrelia Burgdorferi: There are 5 subspecies and 300 strains in the world.

LB subspecies

I. Borrelia burgdorferi sensu stricto
ESP1
SON328
IP2
SON2110
HB19
IP1
B31
ZS7
20006
VEERY
MEN115
CA19
19535
MIL
Cat flea
21305
NY186
DK7
297
26816
SON188
IP3
Z136
35B808
NE56
27985
L5
64b
JD1
CA-11.2A
CA328
CA382
CA8
N40
72a
156a
W191-23
118a
297
29805
Bol26
94a
297vi
CT20004
CT27985
ECM-NY86
JD1
TB
VS219
WI91-23
PBre
1131/96
Bre13
PMeh
PIG
1408/94
A44S
VS293
VS130
VS215
CA-5
Charlie Tick
Geho
IRS
M14
NE38
NE50
BE1 (P1G)
VS2
VS44
VS73
VS82
VS106
VS108
VS115
VS134
VS146
VS161

II. Borrelia garinii
N34
20047
HFOX
PBR
FAR03
PBr
Far04
BgVir
NMJW
NMJW1
20047
K48
Ip90
HT59
NP81
NT25
HP1
HT19
NT24
NT31
HT2
SZ
IBS 3
IBS 7
IBS 6
587/94
VSBP
VSDA
PBi
IBS 8
Prab
PStg
114/95
IBS 9
A19S
A76S
A91C
A94C
T25
TN
A87SA
A87SB
A77C
VSBM
387
935T
AR-1
BITS
FAR01
FAR02
FIS01
G25
HP3
Ip89
M50
M63
NBS16
NE2
NE83
NT29
P/Br
PD89
VS3
VS102
VS156
VS244

III. Borrelia afzelii
DK3
BR53
ECM1
J1
B023
VS461
DK8
PKo
ACA-1
HLJO1
P/Gau
B fox
Tom3107
K78
IBS 5
IBS 4
634/93
163/98
1895/97
1436/97
PSp
PGo
P/GU
A67T
A48T
A17S
A20S
A26S
M7
934U
A100S
A39S
A42S
A45aS
A51T
A58T
A76S
ACA1
F1
IP3 (Iper3)
Iper
M55
NE36
NE39
Pwud I
SMS1
UM01
VS25R-Or
VS42R-R

IV. Borrelia bavariensis
PBi

V. Borrelia valaisiana
VS116
Tom4006
M19
M52
M53
AG1
AR-2
F10.8.94
Frank
M57
NE168
NE218
NE223

VI. Borrelia lusitaniae
BR41
IR345
POTIB1
POTIB2
POTIB#
VII. Borrelia finlandensis
SV1
VII. Borrelia bissettii
DN127, cI9-2/p7
gom93-274
gom93-284
gom93-278
gom93-287
gom93-268
gom93-297
gom93-275
gom93-283
gom93-286
gom93-310
gom93-305
gom93-296
gom93-299
gom93-501
gom93-543
gom93-544
CA128
CA370
CA371
VIII. Borrelia spielmanii
A14S
IX. Borrelia carolinensis
X. Borrelia kurtenbachii
NS07-121
25015
IL96-255
IL97-236u
XI. Borrelia andersonii
21123
XII. Borrelia americana
XIII. genomospecies 2
XIV. Borrelia turdi
XV. Borrelia yangtze
XVI. Borrelia japonica
IKA2
COW611c
Fi340
FiAE2
FiEE2
HO14

XVII. Borrelia tanukii
XVIII. Borrelia chilensis
VA1
XIX. Borrelia parkeri
HR1
RF subspecies

I. Borrelia duttonii
Ly

II. Borrelia hermsii
MTW
YBT
CC1
DAH
HS1

III. Borrelia turicatae
91E135
IV. Borrelia recurrentis
A1
V. Borrelia crocidurae
Achema
VI. Borrelia miyamotoi (Closely related, Idk if it is RF)
LB-2001
VII. Borrelia persica Borrelia mayonii

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3214628/figure/F2/

http://www.nature.com/nature/journal/v390/n6660/full/390580a0.html
Subspecies (don't know if I trust this)
http://www.bacterio.net/borrelia.html

Genes for B. Persica
http://www.sciencedirect.com/science/article/pii/S1877959X15001193
Genes for B. bissettii
http://www.sciencedirect.com/science/article/pii/S1877959X10000701

Bb31

Where I got strains from
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC154584/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4299872/pdf/nihms583086.pdf

http://www.sciencedirect.com/science/article/pii/S0378109797000244
Extract strains from:
http://www.sciencedirect.com/science/article/pii/S0378109700003906
http://www.sciencedirect.com/science/article/pii/S0378109701001537

Proteins
http://www.sciencedirect.com/science/article/pii/S1877959X12001240

Species
http://www.sciencedirect.com/science/article/pii/S1877959X11000379?np=y

Why lyme disease is an epidemic

The current screening for Lyme disease misses half the cases. There is very little funding for Lyme disease research, and because of this very little is known about the disease. We need a site that organizes the current data on protein function and lets researchers easily see where the gaps exist in protein identification and function. Lyme disease can be transmitted from a mother to a fetus. You can get lyme disease by being bitten by a mosquito, a mite, a fly,

Johnson, Lorraine. "President Obama and Congress: A Call To Legalize Lyme Disease." Change.org. Lyme Disease.org, 9 Oct. 2014. Web. 29 Sept. 2016.

http://www.lymedisease.org/lymepolicywonk-two-tiered-lab-testing-for-lyme-disease-no-better-than-a-coin-toss-time-for-change-2/
https://wwwnc.cdc.gov/eid/article/16/7/pdfs/09-1452.pdf

After researching which coding language would be best for bioinformatics I have found that python will be the most useful, if I cannot figure python out I can try Perl.
http://biopython.org/wiki/Biopython

Thursday, September 29, 2016

Initial Project Planning

Network Diagram for the next 3 months

Updated Gantt Chart

Over the next three months I will be getting all the preliminary steps of creating a bioinformatics website out of the way. I will learn how to code in Javascript and CSS and then begin gathering the data that the website will organize. Finally, I will start laying the foundations of the website.

Initial Project Planning:
Identify the Problem:
Lyme disease is constantly evolving and new strains are emerging. This makes it increasingly hard to treat because the strains are evolving to evade antibiotics. We need to sequence all existing strains and find which sections are increasing virulence and allowing the strains to evade the immune system and which sections are conserved.
Prior Art Survey:
There is a bioinformatics site which has compiled the sequences of 35 genomes of 8 LB species and 7 RF species. It is a manually curated site which includes comparisons of phylogeny, synteny and sequence alignments of orthologous sequences and intergenic spacers. (http://borreliabase.org/)
Literature Review:
Can be found in previous posts.
Solution Space Exploration:
A site already exists with MSA, phylogeny and synteny. However, it does not have a complete genome for most strains, and it does not have the functions of each gene.
High Level Design:
I need to get in contact with the people who wrote the paper I read over the summer about Lyme disease before I start my project.

Tools:
If I plan to work off of the existing database for lyme disease I will need just a computer and possibly access to the NCBI database, in order to read the articles on lyme disease strains. I could access these through the library.
If I plan to work with strains that have not been sequenced yet, then I will need access to a lab or some sort of collaboration with a lab to sequence the rest of the genome.
I will need a mentor who knows about the lyme disease genome

Budget:
Most likely $0

http://www.novusbio.com/diseases/lyme-disease
http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-15-233
Bioinformatics Slide:
https://docs.google.com/presentation/d/17VzdZdIEADFSacXulYF5ocCSfWaYhgboPMJYiH3iaGA/edit?usp=sharing

Thursday, September 15, 2016

Bioinformatics Week 9

Progress on project compiled:
https://docs.google.com/presentation/d/17VzdZdIEADFSacXulYF5ocCSfWaYhgboPMJYiH3iaGA/edit?usp=sharing

Saturday, September 3, 2016

Bioinformatics Week 8

Next generation sequencing: Generates a large number of sequences.
The price of sequencing has dropped significantly since the start of sequencing.
454 Pyrosequencing

DNA is broken into smaller sequences.
A single piece of DNA is attached to a bead.
The single pieces of DNA on the beads are then amplified.
Sequencing reactions occur and the sequencing data is created.

Illumina Sequencing

DNA is broken into smaller sequences
Clusters of the DNA are created by adding adapters (which stick together non-congruous DNA)
Clusters are amplified in PCR
Then you sequence each individual cluster by adding in primers and nucleotides which are tagged with different colored fluorescent labels.

The current sequencing technologies fall short in some respects. First, they all break the DNA into smaller segments and then sequence them, so you are sequencing small parts and it is harder to see the bigger picture. You have to piece the small sequences back together to figure out what the whole piece of DNA was. Repetitive regions make this harder, because you don't know where on the DNA the repeated sequences were located. Putting the sequences back together is called assembly. There are different programs which assemble the sequences (ABySS, SOAPdenovo, Velvet). When you are given an assembly you are also given an assembly score, which comes with the number of contigs (contiguous stretches built from overlapping sequences). The fewer contigs the better.
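To make the idea of piecing reads back together concrete, here is a toy sketch that joins reads wherever the end of one matches the start of the next. It is not how ABySS, SOAPdenovo or Velvet work internally (those are de Bruijn graph assemblers); the function names, the minimum overlap of 3 and the reads themselves are all made up for illustration.

```python
def overlap(left, right, min_length=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_length - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0

def merge(left, right, min_length=3):
    """Join two reads into one contig if they overlap; otherwise return None."""
    size = overlap(left, right, min_length)
    return left + right[size:] if size else None

# Three made-up reads, already in the right order, each overlapping the next.
reads = ["ATGGCGTAC", "GTACCTTGA", "TTGACCGTA"]
contig = reads[0]
for read in reads[1:]:
    contig = merge(contig, read)
print(contig)
```

With a repetitive region, the same overlap could appear in several places, and this simple approach would have no way of knowing which join is the right one.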

RNA-seq: Sequences mRNA. Can identify if alternative splicing events have occurred (removing introns, not removing introns, etc)

DNA is broken into smaller sequences
Clusters of the DNA are created by adding adapters (which stick together non-congruous DNA)
Sequence it

Metagenomics: Studying the sequences of microbial organisms in their natural environments by taking samples of the environment (soil, water, etc.) and sequencing them using next generation sequencing. Before, you would have to remove the organism from its natural environment and culture it in an artificial lab environment, which could alter your results.

Friday, September 2, 2016

Bioinformatics Week 7

Selection analysis: Tries to identify whether natural selection is occurring or not as well as if the selection is to get rid of a certain sequence or promote a sequence. Then how this selection affects the species.
3 types of selection

Negative: Removes a detrimental mutation.
Positive: Beneficial mutation promoted.
Balancing/Diversifying selection: Favors the maintenance of multiple variations of a sequence in a diverse environment.

When measuring selection there are two standard techniques

Tajima's D: Based on a few standard principles. Genetic regions near a region that is being selected for get dragged into the selection process (genetic hitchhiking). The length of the affected region depends on the rate of recombination. Theta equals Pi when no selection is occurring. When theta is greater than Pi, positive selection is occurring. When theta is less than Pi, balancing selection is occurring. Caveat: the test is good but can be fooled by other factors (e.g., a bottleneck mistaken for positive selection). Therefore, after you perform Tajima's D you must perform other tests to rule out errors in your data.
dN/dS: This calculates the ratio of nonsynonymous mutations (which change the protein) to synonymous mutations (which don't affect the protein). Very good for figuring out whether specific sites are being selected for, and then which codons are being selected for. No selection dN/dS=1, negative selection dN/dS<1, positive selection dN/dS>1.
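The dN/dS thresholds above can be written as a small function. This is a sketch only: the function name and the tolerance used to decide what counts as "equal to 1" are assumptions, and the input rates are made-up numbers rather than results from any paper.

```python
def interpret_dn_ds(dn, ds, tolerance=0.05):
    """Classify selection from the dN/dS ratio using the thresholds in the notes.

    `tolerance` (an assumed value) decides how close to 1 still counts as neutral.
    """
    if ds == 0:
        raise ValueError("dS is zero, so the ratio is undefined")
    ratio = dn / ds
    if abs(ratio - 1) <= tolerance:
        return ratio, "no selection"
    if ratio < 1:
        return ratio, "negative selection"
    return ratio, "positive selection"

print(interpret_dn_ds(0.02, 0.10))
print(interpret_dn_ds(0.30, 0.10))
```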

There are two popular ways of quantifying selection (measuring variation)

Theta: Based on the number of variable (changed between species) sites in a sample
Pi: Based on the average number of differences between sequences. More sensitive to the frequency of a variation.
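A sketch of how the two measures could be computed from a small set of aligned sequences. It uses the standard Watterson form of theta (number of variable sites divided by 1 + 1/2 + ... + 1/(n-1)) and the average number of pairwise differences for Pi. The function names and the four short sequences are made up for illustration, and the full Tajima's D statistic also divides the difference by a variance term that is not shown here.

```python
from itertools import combinations

def segregating_sites(seqs):
    """Count alignment columns in which not all sequences agree."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)

def watterson_theta(seqs):
    """Theta: number of variable sites scaled by the sample size."""
    a1 = sum(1 / i for i in range(1, len(seqs)))
    return segregating_sites(seqs) / a1

def nucleotide_diversity(seqs):
    """Pi: average number of differences over all pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    differences = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs)
    return differences / len(pairs)

aligned = ["ATGCATGC", "ATGCATGA", "ATGTATGC", "ATGCATGC"]
theta = watterson_theta(aligned)
pi = nucleotide_diversity(aligned)
print(theta, pi, pi - theta)
```

Here both variable sites are carried by a single sequence each, so Pi comes out lower than theta, which illustrates why Pi is described as more sensitive to the frequency of a variation.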

Bioinformatics Week 6

Coursera Week 5: Phylogenetics
To study Phylogenetics you create visual comparisons between DNA sequences or proteins in the form of a tree.

Mutation leads to speciation

There are rooted trees and unrooted trees. Rooted trees are the ones above, they expand from one point in one direction whereas unrooted trees can go off in many directions.

Homoplasy is when 2 divergent species share a similar characteristic. There are different types of homoplasy.

In order to conduct experiments involving phylogenetics you must have good sampling. Some of your samples need to be homologous, independent and variants of the original specimen which the tree is being based on. Lastly, you need sequence alignments and statistical support for their arrangement on the tree.

There are two tree building methods:
Distance methods

UPGMA
Neighbour Joining: Using blosum or PAM matrix to compare, then create a system to rate and scale the distance between the species based on their matrix score.
Good things: They are computationally fast, and there is a singular best tree found in the end.
Bad things: Sometimes there isn't a single best tree
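Distance methods start from a table of pairwise distances between the sequences. The sketch below builds such a table using the simplest possible distance, the fraction of aligned positions that differ, rather than a BLOSUM or PAM score. The sequences and names are made up for illustration.

```python
def p_distance(a, b):
    """Fraction of aligned positions where two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

sequences = {"A": "ATGCATGC", "B": "ATGCATGA", "C": "TTGTACGC"}
names = sorted(sequences)

# Distance table: (name, name) -> distance.
matrix = {(i, j): p_distance(sequences[i], sequences[j]) for i in names for j in names}

# The closest pair is the first thing a method like UPGMA would join.
closest = min(((i, j) for i in names for j in names if i < j), key=lambda pair: matrix[pair])
print(closest, matrix[closest])
```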

Character based (discrete) methods

Maximum parsimony
Maximum likelihood: Evaluates the likelihood of every possible series of mutations within a phylogenetic tree that could have brought a species to where it currently is. It then uses statistical analysis to find the tree with the highest likelihood and assumes that is the correct tree. There are 4 bases, so in an unbiased model each position has a 0.25 probability of being any particular base. You raise 0.25 to the power of the number of nucleotides in the sequence you are analyzing, and that is the likelihood of one particular sequence. Say you have a sequence 20 base pairs long: a specific version of it, such as one with the G substitution you are studying, has a 0.25^20 (about 9 x 10^-13) chance of occurring. After that, the method calculates the chance of this change occurring over time (the process portion). Advantages: Produces clear results, you can statistically analyze the results you receive, and it also gives you the other likely options. Disadvantages: It is computationally intensive and hard to apply to large datasets.

There is something called bootstrapping, where you rebuild your phylogenetic tree many times from resampled versions of your alignment and then count how often certain species are grouped together.

Here we see that A and B have been grouped together 100% of the time (this number is arbitrary) and C and D 75% of the time. So it is very likely that A and B, and then C and D, each diverged from a more recent common ancestor. A support of 70-90% means the relationship is very probable. Anything less means it is a less probable relationship.
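The counting step behind those percentages can be sketched as follows. Each replicate tree is represented simply as the set of groupings it contains. The four replicates are invented so that they reproduce the 100% and 75% figures in the example, and the function name is an assumption.

```python
def clade_support(replicate_trees, clade):
    """Percentage of replicate trees in which the given group of species appears."""
    clade = frozenset(clade)
    hits = sum(1 for tree in replicate_trees if clade in tree)
    return 100 * hits / len(replicate_trees)

# Each tree is the set of groupings (clades) found in one bootstrap replicate.
replicates = [
    {frozenset("AB"), frozenset("CD")},
    {frozenset("AB"), frozenset("CD")},
    {frozenset("AB"), frozenset("CD")},
    {frozenset("AB"), frozenset("ABC")},   # here C groups with A and B instead of D
]

print(clade_support(replicates, "AB"))
print(clade_support(replicates, "CD"))
```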

Sunday, August 7, 2016

Bioinformatics Week 5

Coursera Week 3:
Multiple sequence alignments allow you to see the evolution of a species, as well as figuring out which sequences are useful through which sequences are preserved.
To do multiple sequence alignments you need to create a scoring guideline. You must compare columns and then assign a numerical value to rank the homologous columns.
Algorithms that code for MSA (multiple sequence alignment):
Dynamic (better)
Multidimensional dynamic (worse)
Programs that do MSA are Clustal (Progressive MSA) or DIALIGN (Local MSA)

Progressive MSA progressively aligns more distantly related sequences. The sequence is not disturbed during alignment.
Clustal
Insert gaps if you need to in order to better align the sequences
If you place a gap within a sequence you must also add in a deficit for the gap. (subtract points from alignment score for gap insertion.)
Then create a guide tree based on how related the sequences are
These guide trees are phylogenetic trees
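The idea of scoring columns and subtracting points for gaps can be sketched with a simple sum-of-pairs score. The scoring values are assumptions (1 for a match, 0 for a mismatch as in the DNA scoring mentioned in these notes, and -1 per gap), and this is not Clustal's actual scoring scheme. The three aligned rows are made up.

```python
from itertools import combinations

def column_score(column, match=1, mismatch=0, gap=-1):
    """Score one alignment column by comparing every pair of characters in it."""
    score = 0
    for a, b in combinations(column, 2):
        if a == "-" and b == "-":
            continue              # two gaps facing each other are ignored
        if a == "-" or b == "-":
            score += gap          # deficit for a gap
        elif a == b:
            score += match
        else:
            score += mismatch
    return score

def sum_of_pairs(rows, **scoring):
    """Total score of an alignment: the sum of its column scores."""
    return sum(column_score(column, **scoring) for column in zip(*rows))

msa = ["ATG-CA", "ATGTCA", "AT--CA"]
print(sum_of_pairs(msa))
```

Moving a gap to a better position raises this score, which is the kind of comparison an alignment program makes when deciding between alternatives.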
Clustal suffers from choosing the quickest solution rather than the best one. It looks for temporary fixes (inserting gaps in the quickest place rather than the most strategic) rather than long-term fixes. These temporary fixes eventually propagate and lead to a poorer total alignment. To compensate for the errors made by Clustal, there are iterative methods that go through the alignment, identify the subgroups within the larger alignment that were aligned with quick fixes, replace the temporary fixes with long-term fixes, and then reinsert the subgroups into the total alignment to be realigned.
Once an iterative program has replaced the temporary fixes, it realigns the sequences, builds a phylogenetic tree from the new alignment, determines the MSA, scores it, and compares that score with the score of the original Clustal alignment. If the score is better, it goes back and realigns everything again; if it is not better, it is done. It runs in a loop, realigning each time the score improves, until the alignment cannot get any better.
Dynamic substitution matrices are used in order to compare sequences once they are aligned. It uses lower value Blosum matrices for lower scored alignments. (These matrices are used in BLAST).

Then there is Local MSA (DIALIGN)
Which compares sequences of DNA within a global sequence (total sequence) which are in different places known as diagonals.

This is basically what DIALIGN does. It then weighs the worth of each diagonal, based on length.

To compare sequences which are related use Clustal
To compare sequences which are unrelated and have conserved regions that are consistent use DIALIGN
To compare sequences which are unrelated and have conserved regions that are non consistent use MAFFT
Protein is easier to align than DNA. DNA gets the score of 1 if it matches, 0 if it doesn't. But proteins have amino acids which are redundant so it's easier to find a match.
*Too many gaps or insertions, or columns that don't match, means that something is wrong with the alignment.
Using MSA programs is a skill. It is very easy to get poor MSA results using Clustal (the most popular website). While using any of the MSA programs it is important to remember that they are not foolproof and you cannot blindly trust the results produced.
When you use global alignments:
When the sequences can be aligned through the entire sequence. If the sequences are of different lengths then you can insert gaps in order to compensate for the different sizes and then align.
When you use local alignments:
When the sequence can only be aligned at certain areas of the sequence.
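The difference between the two can be seen in a small dynamic programming sketch that scores two sequences either end to end (global, in the style of Needleman-Wunsch) or only over their best matching stretch (local, in the style of Smith-Waterman). The scoring values (match 1, mismatch -1, gap -1), the function name and the sequences are assumptions made for illustration.

```python
def pairwise_score(a, b, local=False, match=1, mismatch=-1, gap=-1):
    """Best alignment score of a and b: global by default, local if local=True."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps must be paid for.
        for i in range(1, rows):
            table[i][0] = i * gap
        for j in range(1, cols):
            table[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = table[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            value = max(diagonal, table[i - 1][j] + gap, table[i][j - 1] + gap)
            if local:
                value = max(value, 0)      # a local alignment may restart anywhere
                best = max(best, value)
            table[i][j] = value
    return best if local else table[-1][-1]

first, second = "TTTACGTTTT", "GGGACGTGGG"
print(pairwise_score(first, second))               # global: penalized by the unrelated ends
print(pairwise_score(first, second, local=True))   # local: scores only the shared ACGT
```

The two made-up sequences share only the stretch ACGT, so the local score is higher than the global one, which matches the advice above about sequences that align only in certain areas.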
When you use NCBI downloads of sequences in order to input them into MSA programs it is important to note that their names are incredibly lengthy. You must learn Perl, Python or Ruby in order to rename the files and make things very simple.
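A minimal sketch of that renaming in Python: it shortens every FASTA header line to its first word and leaves the sequence lines alone. The function names are assumptions, and the header in the example is a made-up placeholder rather than a real NCBI record.

```python
def shorten_header(header):
    """Keep only the first word of a FASTA header line (usually the accession)."""
    return header.split()[0]

def rename_fasta(text):
    """Return FASTA text with every header line shortened; sequence lines are unchanged."""
    lines = []
    for line in text.splitlines():
        lines.append(shorten_header(line) if line.startswith(">") else line)
    return "\n".join(lines) + "\n"

# Made-up placeholder record.
example = (
    ">EXAMPLE_0001.1 hypothetical long description, strain XYZ, complete sequence\n"
    "ATGGCGTACCTTGA\n"
)
print(rename_fasta(example))
```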
MEGA:
The MSA program MEGA is very good at taking DNA, translating it into protein, aligning it and then translating it back into DNA. This is a very useful tool; however, it must be used carefully because only a small percentage of DNA sequences code for protein. You must also make sure that the sequence you input is the full sequence. If you start at a different point than the true starting point, the sequence is read incorrectly, because it is read in codons (groups of three bases) and it will code for the wrong protein.
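The effect of starting at the wrong point can be shown with a small translation sketch. It builds the standard genetic code table from the usual compact 64-letter string and translates the same DNA from two different starting positions. The function name and the short example gene are made up, and this is not how MEGA itself is implemented.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(dna, frame=0):
    """Translate DNA codon by codon, starting `frame` bases in. '*' marks a stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(frame, len(dna) - 2, 3):
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))   # X for unrecognized codons
    return "".join(protein)

gene = "ATGGCCAAGTAA"
print(translate(gene, frame=0))   # MAK*  (read from the true start)
print(translate(gene, frame=1))   # WPS   (shifted by one base: a different protein)
```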
DIALIGN:
When you have sequences that are unalignable as a whole but have short conserved regions, it is best to use DIALIGN. DIALIGN is also very good at the DNA to protein conversion.
MAFFT:
MAFFT is the best tool to use for general MSA problems that MEGA or Clustal struggle with, and it fixes the setbacks of the Clustal process. MAFFT automatically accounts for the quality of the MSA by looking at the number of insertions and lowering the MSA score accordingly. It also works at incredible speeds considering the amount of work it is doing.
Programming words:
Regex (regular expressions): a way of writing a pattern that the computer then matches against text. For example, looking up obvi and having obviously come up. It makes searching easier and more efficient. You will have to know Ruby, Perl or Python to use it. Crimson is a good way to work with regex.
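Here is the obvi example written with Python's built-in `re` module. The word list is made up, and the pattern `obvi\w*` (the letters obvi followed by any number of word characters) is just one way to express the idea.

```python
import re

# "obvi" followed by zero or more letters, digits or underscores
pattern = re.compile(r"obvi\w*")

words = ["obvious", "obviously", "observe", "oblivious"]
hits = [word for word in words if pattern.fullmatch(word)]
print(hits)
```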
***MSA is integral to solving the issue the Lyme disease paper brought up. The genomes of all of the strains of Lyme disease must be sequenced, and after they are sequenced you need to run them through an MSA program in order to see which parts of the sequence are conserved throughout the species and which are not. This can help determine which sequences are essential to the functioning of Lyme disease and which sequences are associated with the differentiation and adaptation of the strains. Because MSA uses scoring matrices, it is also vital to master those.
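Once an MSA exists, finding the conserved positions is a simple column-by-column check. A minimal sketch, assuming the alignment is a list of equal-length strings with "-" for gaps; the three short sequences are invented, not real Borrelia data.

```python
def conserved_columns(alignment):
    """Return the indices of columns where every sequence has the same base and no gap."""
    return [
        i
        for i, column in enumerate(zip(*alignment))
        if "-" not in column and len(set(column)) == 1
    ]


# Hypothetical aligned sequences from three strains
alignment = [
    "ATGCA-GT",
    "ATGTAAGT",
    "ATGCACGT",
]
print(conserved_columns(alignment))
```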

Sunday, July 31, 2016

Summer Research Week 5 (08/01 - 08/05)

Great progress! This week there will be two major focuses:

Bioinformatics

Bioinformatics Methods I, Coursera: Go through the materials of week 3.

Python & NLTK

In order to apply Natural Language Processing (NLP) to biomedical fields, you will have to learn a programming language (Python) and a platform (the Natural Language Toolkit, NLTK).
Go to https://www.python.org/downloads/ to download and install the most recent version (3.5.2) of Python.
Launch the "Terminal" in the Applications > Utilities folder. In the terminal, run the following commands: (Let me know if you have any problems. I have tried a few times, and finally made it work.)
- Install pip: run sudo easy_install pip
- Install NLTK: run sudo pip install -U nltk
- Install Numpy: run sudo pip install -U numpy
- Run Python: run python
- Test installation: type import nltk
Now, you can go to http://www.nltk.org/, and follow the "Some simple things you can do with NLTK" section.
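The installation test above can also be run as a short script instead of typing import by hand. This is just a convenience sketch using the standard library's `importlib`; the helper name `installed` is made up here.

```python
import importlib.util


def installed(package_names):
    """Map each package name to True if Python can find it, False otherwise."""
    return {name: importlib.util.find_spec(name) is not None for name in package_names}


status = installed(["nltk", "numpy"])
for name, ok in status.items():
    print(name, "is installed" if ok else "is NOT installed")
```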

Thursday, July 28, 2016

Bioinformatics Week 4

Lyme Disease:
Determining which variations in genomes are stochastic (random) and which are selected by natural selection is difficult and takes an in-depth understanding of the ecological and clinical conditions the strain has undergone. The natural selection that strains undergo is mostly due to host adaptation. There are two hypotheses for how such different strains can arise within the same region.

Multiple Niche Polymorphism (MNP): The different strains are able to survive and thrive in the same region because they occupy different niches: different hosts, different tissue types, and different organisms carrying the disease. Host adaptation.
Negative Frequency-dependent Selection (NFD): Evading a host's immune system requires many varying strains because each has to adapt to its specific host. Immune-escape mechanisms.

In order to test which hypothesis is correct, you can see whether the housekeeping genes are conserved or not. A high amount of non-synonymous variation in the housekeeping genes would imply that NFD is the correct hypothesis, and a high amount of synonymous variation would suggest MNP is correct. There is also a chance that both hypotheses are correct: immune-escape mechanisms and host adaptation could be working together.
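To make synonymous versus non-synonymous concrete, here is a rough sketch that compares two aligned coding sequences codon by codon. It only counts codons that differ and is not a real dN/dS estimate; the sequences are invented, and both are assumed to be gap-free and in frame.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ("*" = stop)
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def classify_codon_changes(ref, alt):
    """Count differing codons that keep (synonymous) or change (non-synonymous) the amino acid."""
    counts = {"synonymous": 0, "non-synonymous": 0}
    for i in range(0, min(len(ref), len(alt)) - 2, 3):
        codon_ref, codon_alt = ref[i:i + 3], alt[i:i + 3]
        if codon_ref == codon_alt:
            continue
        same_amino_acid = CODON_TABLE[codon_ref] == CODON_TABLE[codon_alt]
        counts["synonymous" if same_amino_acid else "non-synonymous"] += 1
    return counts


reference = "ATGGCCAAAGGT"  # hypothetical gene fragment: M A K G
variant = "ATGGCTAGAGGT"    # GCC->GCT keeps A, AAA->AGA changes K to R
print(classify_codon_changes(reference, variant))
```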
Mapping the genome and tracking adaptations is also helpful for predicting what future threats Lyme disease could cause. Once you predict the variation in the strains, you could hopefully predict what changes have occurred in ecology, climate, and migration.
Based on the current genomes that we have mapped, we can recognize that some swapping of DNA has occurred. We can estimate when recombination began occurring between strains based on how much recombination has occurred. If the amount is low, recombination began recently; if it is high, it started further back in history.
Sequencing genomes would also allow researchers to track the migration of a strain based on the crossing over that occurred between species, using the relationship described above. By comparing the date of migration with ecological events that occurred in the area the strain migrated from, you can predict what will cause a strain to migrate.
As the strains of Lyme disease become more diverse and more widespread, recombination will occur at a far faster rate, creating an even more diverse selection of strains. The more diverse the strains, the more threatening Lyme disease is to the public.
What bioinformatics can do for this field

Creating a linkage map between all the SNPs of the Borrelia genome using population genomics
Figuring out which genetic changes are associated with which strain and what that change does and what causes it using phylogenomics
Testing the ecological hypotheses mentioned in this paper (MNP, NFD) by estimating population size, migration rate, strain frequency and the crossing of genomes by using genome-level phylogeography.

Progress to keep in mind:

Bigger picture:
Understanding a highly adaptive bacterial genus like Borrelia will help researchers understand how bacterial genomes evolve.

In order to tackle the problems listed above, I need to master population genomics, phylogenomics and phylogeography.

Key facts to understand about Borrelia:

Plasmids differ between strains, but the bacteria have a set genome outside of the plasmids.
The PF54 gene is responsible for evading the immune system.
OspC, dbpA, vls, b08, and a07 are all antigen genes under diversifying selection.

Statistics: MUST RESEARCH LATER ON

Bayesian

Markovian

Hidden Markov Models (HMMs) are statistical models for sequential data.
They are used for programming artificial intelligence, modeling biological sequences, and pattern recognition (can be used in basic programming?)
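To get a feel for how an HMM works on a sequence, here is a toy two-state model with the Viterbi algorithm, which finds the most likely hidden state for each base. The states ("AT-rich", "GC-rich") and all the probabilities are made-up numbers for illustration, not values from any real model.

```python
import math

# Toy model: all states and probabilities below are assumptions for illustration
STATES = ("AT-rich", "GC-rich")
START = {"AT-rich": 0.5, "GC-rich": 0.5}
TRANS = {
    "AT-rich": {"AT-rich": 0.9, "GC-rich": 0.1},
    "GC-rich": {"AT-rich": 0.1, "GC-rich": 0.9},
}
EMIT = {
    "AT-rich": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
    "GC-rich": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},
}


def viterbi(sequence):
    """Most probable hidden-state path for the observed bases (computed in log space)."""
    if not sequence:
        return []
    scores = {s: math.log(START[s]) + math.log(EMIT[s][sequence[0]]) for s in STATES}
    backpointers = []
    for base in sequence[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Best previous state to arrive from, given we are now in state s
            prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = scores[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][base])
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


for base, state in zip("AAAAGGGG", viterbi("AAAAGGGG")):
    print(base, state)
```

The observed bases are the visible data, and the region type is the hidden part the model guesses.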

Monday, July 25, 2016

Summer Research Week 4 (07/25 - 07/29)

You will get one more week to study the Lyme disease review paper (Evolutionary Genomics of Borrelia burgdorferi sensu lato- Findings, Hypotheses, and the Rise of Hybrids) in detail. You might want to use some references to explore and expand your understanding to the next level. Don't forget to add any useful resources to the "Project Resource" page.

Friday, July 22, 2016

Bioinformatics Week 3

Lyme Disease

Lyme disease is constantly changing. There are many strains of the pathogen, and these strains are constantly exchanging DNA, so their genomes keep adapting and changing, which makes the disease hard to treat. The diversity among the strains is caused by the different species that host the disease: each strain adapts to its host and its immune system. Overarching questions: What are the mechanisms for adaptation? Which edits in the genome cause which adaptations? How to answer these questions: constant genome sequencing of the strains.

Lyme disease has the most complex prokaryotic genome known. In databases there are a total of 30 strains of Lyme. The issue to tackle is the growing number of diverging Lyme genomes.

Researchers hoping to find the reason for the increased virulence of certain strains try to compare genomes. However, the issue with genotyping the pathogens is that the majority of genetic differences between strains do not cause any phenotypic variation and do not affect the traits researchers are interested in.

You can compare the genomes based on the preferred host of each strain. This method of comparing the differing genomes of pathogens that prey on different species is an underdeveloped field in bioinformatics. Current databases which contain genome comparisons focus more on multiple gene sequences than on the sequence for the antigen.

You can identify the function of a gene based on its genetic ancestors and then the ecological factors that would have caused it to change (phylogenomics). Then using this technique you can also predict the mutations that are to come.
Bacteria have a special mode of recombination for which dedicated analysis techniques have not yet been developed. They show two unique patterns of linkage disequilibrium. First, the genealogy of a bacterial genome can be discerned using one coalescent tree because of a genome-wide clonal frame. Second, the linkage disequilibrium between two adjacent SNPs becomes weaker as the amount of gene conversion (a sequence of DNA is replaced by a homologous sequence, making the two identical) increases. The more recombination occurs within a genome, the bigger the sample size you need in order to detect disease-causing agents. Therefore, in order to detect which sequence gives the bacteria a higher virulence, you need a large sample size and an analysis of their SNPs. Once again, the need for this discovery leads to the need to sequence every genome within this bacterial family.
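Linkage disequilibrium between two SNPs can be put into numbers with the usual D and r² statistics: D is the observed frequency of a pair of alleles minus the frequency expected if the two SNPs were independent. A small sketch, assuming each haplotype is written as a two-letter string (allele at SNP 1, allele at SNP 2); the haplotypes are invented.

```python
def linkage_disequilibrium(haplotypes, allele_1, allele_2):
    """Return (D, r_squared) for allele_1 at the first SNP and allele_2 at the second."""
    n = len(haplotypes)
    p_a = sum(h[0] == allele_1 for h in haplotypes) / n
    p_b = sum(h[1] == allele_2 for h in haplotypes) / n
    p_ab = sum(h[0] == allele_1 and h[1] == allele_2 for h in haplotypes) / n
    d = p_ab - p_a * p_b  # observed minus expected under independence
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r_squared = d * d / denom if denom > 0 else 0.0
    return d, r_squared


# Hypothetical samples: alleles always travel together vs. fully shuffled
linked = ["AG", "AG", "CT", "CT"]
shuffled = ["AG", "AT", "CG", "CT"]
print(linkage_disequilibrium(linked, "A", "G"))
print(linkage_disequilibrium(shuffled, "A", "G"))
```

Recombination and gene conversion push a sample from the first pattern toward the second.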

Genome sequencing discoveries:

Loss of OspB may be a host adaptation
a70 allows strain to evade immune system
PF54 gene is the most variable region of DNA, a reason for vast speciation

Use phylogenomic analysis to see which sequences are involved in adapting to host and ecological differences. There is also phylogenetic footprinting, which predicts which regions of a sequence are gene regulatory based on which regions are the most conserved. This depends on the assumption that functional regulatory sequences, such as cis-regulatory elements, undergo mutations at a slower rate than the non-functional sequence around them. In order to get more accurate phylogenetic footprinting results, it is necessary to search through as many genomes as possible. It is important to have a diverse data set when analyzing changes and continuities in genomes.

There is also population genomics which focuses on what ecological and evolutionary forces caused the change in genome.

Pan genomics

Pan genomics is the compilation of all genes within a species. It is the complete gene set, and it can be used to track the gains or losses within a genome using mathematical calculation. Pan genomics is studied using BLAST; however, that means the values depend on the E-value cutoffs as well as the length coverage of the BLAST hits. There isn't much horizontal gene transfer within the genome of B. burgdorferi s.l., so the majority of changes within its genome are due to duplications and losses of homologous genes and sequence variations within orthologous genes. Even though horizontal gene transfer between distantly related members of this bacterial family is rare, it is necessary to sequence the genomes of numerous other members so that we can have a more complete idea of what genes control what and how they adapt to clinical and ecological circumstances. We can analyze the plasmids within closely related strains and then use that analysis to understand which genes aid in the organism's adaptations to its environment. Using ortholog-ortholog comparison you can figure out the possible functions of genes that somehow confer resistance, such as PF54. You can basically map the gene's history by doing these comparisons.
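The bookkeeping behind a pan-genome comes down to set operations: the pan-genome is the union of every strain's genes, the core genome is what all strains share, and the rest is accessory. A minimal sketch with placeholder strain and gene names (in practice the gene families would come from BLAST comparisons, which this does not do).

```python
def pan_genome(strain_genes):
    """Split gene content into (pan, core, accessory) sets across strains."""
    gene_sets = [set(genes) for genes in strain_genes.values()]
    pan = set().union(*gene_sets)                               # found in at least one strain
    core = set.intersection(*gene_sets) if gene_sets else set() # found in every strain
    accessory = pan - core                                      # found in some, not all
    return pan, core, accessory


# Hypothetical gene content per strain (placeholder names)
strains = {
    "strain_1": ["geneA", "geneB", "geneC"],
    "strain_2": ["geneA", "geneB", "geneD"],
    "strain_3": ["geneA", "geneC", "geneD", "geneE"],
}
pan, core, accessory = pan_genome(strains)
print("pan:", sorted(pan))
print("core:", sorted(core))
print("accessory:", sorted(accessory))
```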

Bioinformatics framework

Manual curation and information extraction have become impossible due to the increasing amount of literature.

BioTM includes:

IR: information retrieval. IR processes your input and finds the links and literature associated with your query.

IE: information extraction. IE takes the information you're looking for out of the paper? It uses NLP (natural language processing) and DM (data mining). Or maybe it just highlights the information related to what you need.

HG: hypothesis generation. ?

Using NER (named entity recognition) to sort literature is very unreliable because different wording is used to describe the same entity. Rule-based systems attempt to bridge this gap with algorithms that apply rules such as "a word ending in -ase is an enzyme".
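A rule like the -ase one can be tried out with a single regular expression. This is a deliberately naive sketch with a made-up sentence, and it also shows why such rules are unreliable on their own: an ordinary word like database matches too.

```python
import re

# Naive rule: any word ending in "ase" is tagged as a possible enzyme
ENZYME_RULE = re.compile(r"\b\w+ase\b", re.IGNORECASE)


def tag_enzymes(text):
    """Return every word in the text that matches the '-ase' rule."""
    return ENZYME_RULE.findall(text)


sentence = "DNA polymerase and a protein kinase were compared with the database entry."
print(tag_enzymes(sentence))
```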

Here are DM techniques used to normalize the nomenclature and enhance TM outcomes. They are very time consuming to create:

Hidden Markov Models (HMM)
Naive Bayes
Conditional Random Fields (CRFs)
Support Vector Machines (SVMs)

The big thing now is Relationship Extraction, RE.

A big issue with BioTM is that its developers fail to work in tandem with the biomedical community that could greatly benefit from their tools. Also, the tools that are available often require expert programming skills, which limits the pool of users. The TM industry is more focused on creating novel techniques than on integrating its programs into the research world. Another issue is the privatization of papers, which prevents TM systems from retrieving data from complete papers and leaves them only the abstracts.

HOW TO FIX THE PROBLEMS PRESENTED: @Note: The Ikea of Bioinformatics

Biologists and bioinformatics people need to work together to create widely usable programs through @Note.
@Note is basically a baseline bioinformatics program which researchers can customize to fit their own needs. @Note demands basic skills like part-of-speech (POS) tagging and lexicon-based semantic tagging. However, it can perform many highly complex processes, which gives researchers access to very useful tools without the need for an experienced skill set.
The face page seems highly interactive and user friendly

@Note has 3 user possibilities

Biologists: Can use this program to retrieve, annotate, and curate biomedical texts.
Text miners: Can use this program to analyze texts.
Application developers: Can use this program to easily design applications for bioinformatics.

How it works:

You search for something, and it initially searches the PubMed database and pulls up any articles that contain the query. Then, if you don't have a subscription to PubMed or other subscription databases, it does a web crawl and pulls up any publicly available articles containing your query, as well as any articles that are abstract-only on PubMed but public on another site. It will then automatically cite the paper for the researcher. Although the program has this built-in search method, other parts of the program need to be built from scratch.

The program was coded using AIBench, a Java-based framework. It has plug-ins for well-known tools like WEKA, YALE and GATE.