Bioinformatics I: Bioinformatics Week 3

Lyme Disease

The Lyme disease pathogen is constantly changing. There are many strains of the pathogen, and these strains constantly exchange DNA, so their genomes keep adapting and changing, which makes the disease hard to treat. The diversity among strains comes from the different species that host the pathogen: each strain adapts to its host and its host's immune system. Overarching questions: What are the mechanisms of adaptation? Which changes in the genome cause which adaptations? How to answer these questions: continually sequence the genomes of the strains.

The Lyme disease pathogen has the most complex prokaryotic genome known. Databases hold a total of 30 strains of Lyme. The issue to tackle is the growing number of divergent Lyme genomes.

Researchers hoping to find the reason for the increased virulence of certain strains try to compare genomes. The problem with genotyping the pathogens, however, is that most genetic differences between strains cause no phenotypic variation and do not affect the traits researchers are interested in.

You can compare genomes based on each strain's preferred host. Comparing the genomes of pathogens that prey on different species is an underdeveloped field in bioinformatics. Current databases that contain genome comparisons focus more on multiple gene sequences than on the sequence for the antigen.

You can identify the function of a gene based on its genetic ancestors and the ecological factors that would have caused it to change (phylogenomics). Using this technique you can also predict the mutations that are to come.

Bacteria have a special mode of recombination for which special analysis techniques have not yet been developed. They show two unique patterns of linkage disequilibrium. First, the genealogy of a bacterial genome can be described by a single coalescent tree because of a genome-wide clonal frame. Second, linkage disequilibrium between adjacent SNPs weakens as the amount of gene conversion increases (gene conversion: a stretch of DNA is replaced by a copy of a homologous sequence, so the two become identical). The more recombination occurs within a genome, the bigger the sample size you need to detect disease-causing variants. Therefore, to find which sequence gives the bacteria higher virulence, you need a large sample and an analysis of its SNPs. Once again, this calls for sequencing every genome within this bacterial family.
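
To make linkage disequilibrium a bit more concrete, here is a minimal Python sketch. It assumes haploid genomes (as in bacteria) with each SNP coded as 0 or 1 per strain. The function name and the toy allele lists are made up for illustration and do not come from the paper.

```python
def linkage_disequilibrium(snp_a, snp_b):
    """Return (D, r_squared) for two biallelic SNPs across haploid samples.

    snp_a and snp_b are equal-length lists of 0/1 alleles, one entry per strain.
    """
    if len(snp_a) != len(snp_b) or not snp_a:
        raise ValueError("SNP vectors must be non-empty and the same length")
    n = len(snp_a)
    p_a = sum(snp_a) / n                      # frequency of allele 1 at SNP A
    p_b = sum(snp_b) / n                      # frequency of allele 1 at SNP B
    # frequency of strains carrying allele 1 at both sites
    p_ab = sum(1 for a, b in zip(snp_a, snp_b) if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                      # D: deviation from independence
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r_squared = d * d / denom if denom > 0 else 0.0
    return d, r_squared


# Toy data (invented): six strains, two SNP pairs
linked_a = [1, 1, 0, 0, 1, 0]
linked_b = [1, 1, 0, 0, 1, 0]    # always inherited together with linked_a
shuffled_b = [1, 0, 1, 0, 0, 1]  # association mostly broken up

print(linkage_disequilibrium(linked_a, linked_b))
print(linkage_disequilibrium(linked_a, shuffled_b))
```

The first pair gives an r squared of 1 (complete linkage), while the second gives a much lower value, which is the kind of drop you would expect between nearby SNPs when gene conversion is frequent.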

Genome sequencing discoveries:

Loss of OspB may be a host adaptation
a70 allows the strain to evade the immune system
The PF54 gene is the most variable region of DNA, the reason for vast speciation

Use phylogenomic analysis to see which sequences are involved in adapting to host and ecological differences. There is also phylogenetic footprinting, which predicts which regions of a sequence are gene regulatory by finding the most conserved regions of the sequence. This depends on the assumption that functional regulatory sequences (such as cis-regulatory elements) are under selection and so accumulate mutations more slowly than the surrounding non-functional non-coding sequence. For more accurate phylogenetic footprinting results, you need to search through as many genomes as possible. It is important to have a diverse data set when analyzing changes and continuities in genomes.
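
The footprinting idea can be sketched in a few lines of Python. This is only a toy, assuming the sequences are already aligned to the same length. It scores each alignment column by how many sequences agree, then reports windows that are fully conserved. The function names, the window size, and the sequences are all invented for illustration; real footprinting tools use proper alignments and statistical models.

```python
from collections import Counter


def column_conservation(alignment):
    """Fraction of sequences sharing the most common character in each column."""
    if not alignment:
        raise ValueError("alignment is empty")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*alignment):
        most_common_count = Counter(column).most_common(1)[0][1]
        scores.append(most_common_count / len(alignment))
    return scores


def conserved_windows(scores, window=4, threshold=1.0):
    """Start positions of windows whose mean conservation reaches the threshold."""
    hits = []
    for start in range(len(scores) - window + 1):
        mean = sum(scores[start:start + window]) / window
        if mean >= threshold:
            hits.append(start)
    return hits


# Toy alignment (invented): three sequences sharing the motif TGACA
toy_alignment = [
    "ACGTTGACAAGT",
    "TCGATGACATGA",
    "GCCTTGACACGG",
]
scores = column_conservation(toy_alignment)
candidates = conserved_windows(scores, window=4, threshold=1.0)
print(candidates)  # window start positions (0-based) that are fully conserved
```

With only three sequences, many columns look conserved by chance, which is why the notes above stress searching through as many genomes as possible.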

There is also population genomics, which focuses on the ecological and evolutionary forces that caused the changes in a genome.

Pan genomics

Pan genomics studies the pan-genome: the full set of genes found across all strains of a species. It can be used to track gene gains and losses using mathematical calculations. Pan genomics is studied using BLAST, which means the values depend on the E-value cutoffs as well as the length coverage of the BLAST hits. There isn't much horizontal gene transfer within the genome of B. burgdorferi s.l., so most changes within its genome are due to duplications and losses of homologous genes and sequence variation within orthologous genes. Even though horizontal gene transfer between distantly related members of this bacterial family is rare, it is necessary to sequence the genomes of numerous other members so that we can have a more complete idea of which genes control what and how they adapt to clinical and ecological circumstances. We can analyze the plasmids within closely related strains and then use that analysis to understand which genes aid in the organism's adaptation to its environment. Using ortholog-ortholog comparison you can figure out the possible functions of genes that somehow confer resistance, such as PF54. You can basically map a gene's history by doing these comparisons.
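
Here is a minimal Python sketch of the bookkeeping behind a pan-genome. It assumes the hard part is already done, meaning each strain's genes have been grouped into gene families (for example from BLAST hits above some cutoff). The strain names and family identifiers are placeholders, not real B. burgdorferi data.

```python
def pan_genome_summary(strain_genes):
    """Split gene families into pan, core, and accessory sets.

    strain_genes maps a strain name to the set of gene families it carries.
    """
    if not strain_genes:
        raise ValueError("need at least one strain")
    gene_sets = [set(genes) for genes in strain_genes.values()]
    pan = set().union(*gene_sets)          # every family seen in any strain
    core = set.intersection(*gene_sets)    # families present in all strains
    accessory = pan - core                 # families present in only some strains
    return pan, core, accessory


def genes_missing_from(strain_genes, strain):
    """Families present in other strains but absent from this one (candidate losses)."""
    pan, _, _ = pan_genome_summary(strain_genes)
    return pan - set(strain_genes[strain])


# Placeholder data (invented)
strains = {
    "strain_1": {"fam1", "fam2", "fam3"},
    "strain_2": {"fam1", "fam2", "fam4"},
    "strain_3": {"fam1", "fam3", "fam4", "fam5"},
}
pan, core, accessory = pan_genome_summary(strains)
print(len(pan), sorted(core), sorted(accessory))
print(sorted(genes_missing_from(strains, "strain_1")))
```

A family missing from one strain could be a loss in that strain or a gain in the others. Telling those apart needs the phylogeny, which is where the ortholog comparisons come in.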

Bioinformatics framework

Manual curation and information extraction have become impractical due to the growing volume of literature.

BioTM includes

IR: information retrieval. IR processes your query and finds links and literature associated with it.

IE: information extraction. IE pulls the information you're looking for out of the paper? It uses natural language processing (NLP) and data mining (DM). Or maybe it just highlights the information related to what you need.

HG:hypothesis generation, ?

Using named entity recognition (NER) to sort literature is very unreliable because different authors use different language for the same entity. Rule-based systems try to bridge this gap with algorithms that encode rules such as: a name ending in "-ase" denotes an enzyme.
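
A rule like the "-ase" one can be sketched with a regular expression. This is a toy in Python, not how any particular BioTM tool does it. The pattern, the stoplist, and the example sentence are all my own assumptions.

```python
import re

# Hypothetical rule: a word ending in "ase" is tagged as an enzyme.
ENZYME_PATTERN = re.compile(r"\b[A-Za-z][A-Za-z-]*ase\b")

# Ordinary English words that also end in "ase" and would be false positives.
STOPLIST = {
    "base", "case", "phase", "disease", "increase",
    "decrease", "database", "release", "phrase", "purchase",
}


def tag_enzymes(text):
    """Return words that match the suffix rule and are not in the stoplist."""
    return [
        match.group(0)
        for match in ENZYME_PATTERN.finditer(text)
        if match.group(0).lower() not in STOPLIST
    ]


sentence = "In this case the kinase and DNA polymerase levels increase with disease."
print(tag_enzymes(sentence))
```

The stoplist shows why rules like this get messy: "disease" and "increase" match the suffix too. The rule also misses enzymes that don't end in "-ase", such as trypsin, so it covers only part of the naming problem.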

Here are DM techniques used to normalize the nomenclature and improve TM outcomes. They are very time-consuming to create:

Hidden Markov Models (HMM)
Naive Bayes
Conditional Random Fields (CRFs)
Support Vector Machines (SVMs)

The big thing now is Relationship Extraction, RE.

A big issue with BioTM is that its developers fail to work in tandem with the biomedical community that could greatly benefit from their tools. Also, the tools that are available often require expert programming skills, which limits the pool of users. The TM industry is more focused on creating novel techniques than on integrating its programs into the research world. Another issue is the paywalling of papers, which prevents TM systems from retrieving data from complete papers rather than just the abstracts.

HOW TO FIX THE PROBLEMS PRESENTED: @Note: The Ikea of Bioinformatics

Biologists and bioinformatics people need to work together to create widely usable programs through @Note.
@Note is basically the baseline bioinformatics program which researchers can customize to fit their own needs. @Note requires basic skills like part-of-speech (POS) tagging and lexicon-based semantic tagging. However, it can perform many highly complex processes, which gives researchers access to very useful tools without needing an experienced skill set.
The front page seems highly interactive and user-friendly.

@Note has three types of users:

Biologists: Can use this program to retrieve, annotate, and curate biomedical texts.
Text miners: Can use this program to analyze texts.
Application developers: Can use this program to easily design applications for bioinformatics.

How it works:

You search for something. Initially it searches the PubMed database and pulls up any articles that contain the query. Then, if you don't have a subscription to PubMed or other subscription databases, it does a web crawl and pulls up any publicly available articles containing your query, plus any articles that are abstract-only on PubMed but public on another site. Then it automatically cites the paper for the researcher. Although this program has this built-in search method, other parts of the program need to be built from scratch.

The program was built with AIBench, a Java application framework. It plugs in well-known tools like WEKA, YALE, and GATE.

Friday, July 22, 2016