Coursera Week 1
We have accumulated an enormous amount of data over many years of research, but
we cannot access it all because the databases that hold it are not aligned. We
can create databases to organize things like post-transcriptional structures of
proteins, intron splice variants that lead to disease, metabolic pathways and
types of metabolites, or simply a compilation of the literature that exists on
a certain topic.
You can store data in a number of ways:
- Flat file
- Spreadsheet
The problem with these is redundancy, which takes up storage space, and storage
space is expensive.
You can fix redundancy with a relational database, where you separate the data
out by subject into tables and then draw connections between the tables.
Ex) A record holds a person's name, job, number, age, and department. Separate
it into one table with name, job, and age, and another table with number and
department; if you want to contact the person, the link between the first table
and the second lets you access the second table through the first. This avoids
redundancy. A relational database is organized into tables and columns, and you
query it with SQL (originally called SEQUEL). Databases usually have more than
one way to access an entry: either an identifier or an accession code. An
identifier is a string of letters and digits that is understandable to humans;
however, identifiers change based on what the curators believe is best, so they
are not stable. An accession code, by contrast, does not change, but it is a
random string of numbers or letters that is not indicative of its entry the way
an identifier is.
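The relational-database idea above can be sketched with Python's built-in sqlite3 module. The schema, names, and values here are invented to mirror the person/contact example:

```python
import sqlite3

# In-memory database with two tables linked by a person id (illustrative schema).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, job TEXT, age INTEGER)")
cur.execute("CREATE TABLE contact (person_id INTEGER REFERENCES person(id), number TEXT, department TEXT)")
cur.execute("INSERT INTO person VALUES (1, 'Ada', 'Analyst', 36)")
cur.execute("INSERT INTO contact VALUES (1, '555-0100', 'Research')")

# Reach the second table through the first via a join -- no data is duplicated.
row = cur.execute(
    "SELECT person.name, contact.number FROM person "
    "JOIN contact ON contact.person_id = person.id "
    "WHERE person.name = 'Ada'"
).fetchone()
print(row)  # ('Ada', '555-0100')
```

Each person's contact details are stored once; any number of queries can reach them through the `person_id` link, which is exactly how the redundancy is avoided.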
"So, the key point here is that to specify a sequence exactly in GenBank we use
either its GI number or the Accession.Version. Conversely, to retrieve the most
up-to-date sequence we can use the accession number without the version. And in
that case, the most up-to-date sequence will be retrieved automatically."
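The Accession.Version convention quoted above can be handled with a small helper. The accession string below is a made-up example, not a real GenBank entry:

```python
def split_accession(acc_ver):
    """Split 'ACCESSION.VERSION' into its parts; version is None if absent."""
    accession, sep, version = acc_ver.partition(".")
    return accession, int(version) if sep else None

# A versioned identifier pins one exact sequence; the bare accession
# stands for whatever the latest version currently is.
print(split_accession("U12345.2"))  # ('U12345', 2)
print(split_accession("U12345"))   # ('U12345', None)
```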
For nucleotide sequences, GenBank and WGS are the two main databases.
To search for sequences you can either use keywords or you can search based on
the sequence itself.
In gene duplication the tree branches into two chains, an alpha lineage and a
beta lineage. Genes in the same lineage but in different species (alpha
compared with alpha, or beta compared with beta) are orthologs: they diverged
when the species split. An alpha gene compared with a beta gene is a pair of
paralogs: they diverged at the duplication event. All of these sequences
together are homologs.
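The alpha/beta distinction above can be captured in a toy classifier. The species and copy labels are invented for illustration:

```python
def relation(gene_a, gene_b):
    """Classify two homologous genes, each given as (species, copy).

    Same copy (alpha vs alpha) in different species -> ortholog: they
    diverged by speciation. Different copies (alpha vs beta) -> paralog:
    they diverged at the duplication. Either way, both are homologs.
    """
    (sp_a, copy_a), (sp_b, copy_b) = gene_a, gene_b
    if copy_a != copy_b:
        return "paralog"
    if sp_a != sp_b:
        return "ortholog"
    return "same gene"

print(relation(("human", "alpha"), ("mouse", "alpha")))  # ortholog
print(relation(("human", "alpha"), ("human", "beta")))   # paralog
```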
You can search across databases using Entrez/Gquery; it is the equivalent of
Google in the database world.
*Possibly interested in making a database of which diseases affect which
metabolic pathways and what issues those changes lead to.
Text Mining
Text mining is the process of retrieving evidence from previously published
papers using information retrieval (IR), machine learning (ML), natural
language processing (NLP), or statistical and computational linguistics (CL).
Text mining is especially useful in human genomics. It is becoming very
difficult for researchers to find the time to read the exorbitant amount of
literature on the topic they are researching, so they need a program that will
do it for them.
Setbacks for Text Mining:
- Copyright issues: only a small percentage of papers are available for
free.
- Publication bias: positive findings are published rather than negative
or null findings. So in terms of data only half of the picture is
visible; the data that didn't support one researcher's hypothesis could
support another researcher's work, but they will never have access to it.
To text mine, named entities (NEs) must be identified and relationships between
NEs must be established. The named entities must also be assigned identifiers:
every entity must have exactly one identifier, and no identifier may point to
more than one entity. Once you identify the NEs, you must check whether the NEs
in a paper actually have a relationship; just because they appear in the same
paper does not mean they are related.
Ex) Given a sentence, first the keywords (NEs) are identified and color-coded
by type: green is a species, orange is a protein/gene, and blue is a
relationship phrase that establishes the relationship between the other NEs.
The NEs are then normalized (different names for the same entity are mapped to
the same identifier), and finally the relationship between them is established.
NER (named-entity recognition) is hard to do on the research literature because
many genes have multiple names, and this is true for many things in science:
most entities have at least one synonym. Therefore, when assigning identifiers
you must take all the differing names into account.
The HUGO Gene Nomenclature Committee is attempting to standardize the naming of
genes so that every gene has one specific name; however, not all genes have
been assigned a name yet, and all the literature from before the HUGO committee
still uses many names for single genes. So when creating a database that
encompasses papers from years and years, it is hard to assign just one
identifier to each entity.
One example of attempting to normalize naming, needed because gene and protein
names often overlap, is the convention that protein names begin with an
upper-case letter whereas gene names begin with a lower-case letter. Some genes
also have ordinary-word names, like not and that, which makes it hard for
search engines to distinguish the gene from the word.
Table of linguistics terms:

Term           | Meaning
Anaphor        | A word or phrase that refers back to an earlier word or phrase
Polysemy       | The coexistence of many possible meanings for a word or phrase
Homonymy       | Each of two or more words having the same spelling and pronunciation but different meanings and origins
Semantics      | Relating to meaning in language or logic
Syntax         | The arrangement of words and phrases to create well-formed sentences in a language
Part of speech | One of the traditional categories of words intended to reflect their functions in a grammatical context
There is also an issue with abbreviations within research papers: many papers
abbreviate terms, and a given abbreviation can stand for upwards of 60 other
terms. This makes it very hard for an algorithm to direct you to the proper
entry.
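A toy illustration of the abbreviation problem, resolved by overlap with context words. The abbreviation, its expansions, and the clue words are all invented for illustration:

```python
# Hypothetical expansions for one ambiguous abbreviation (illustrative only).
EXPANSIONS = {
    "PS": ["phosphatidylserine", "protein S", "pulmonary stenosis"],
}

CLUES = {
    "phosphatidylserine": {"membrane", "lipid"},
    "protein S": {"coagulation", "clotting"},
    "pulmonary stenosis": {"heart", "valve"},
}

def disambiguate(abbrev, context):
    """Pick the expansion whose clue words overlap the surrounding text most."""
    words = set(context.lower().split())
    return max(EXPANSIONS[abbrev],
               key=lambda exp: len(words & CLUES.get(exp, set())))

print(disambiguate("PS", "PS deficiency impairs clotting"))  # protein S
```

Real systems use far richer context models, but the core idea is the same: the surrounding words vote for one sense of the abbreviation.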
Normalisation is "to compare an NE against a dictionary of synonyms and
identifiers, and assign the matching identifier." Rule-based approaches and
string-similarity metrics attempt to solve the problem of having more than one
identifier per name, or more than one name per identifier.
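Normalisation in its simplest form is the dictionary lookup just quoted. The gene symbols below are real names, but the identifiers are invented placeholders:

```python
# Synonym dictionary mapping every surface form to one identifier
# (identifiers are invented for illustration).
SYNONYMS = {
    "tp53": "GENE:0001",
    "p53": "GENE:0001",
    "tumor protein 53": "GENE:0001",
    "brca1": "GENE:0002",
}

def normalise(mention):
    """Map a named-entity mention to its identifier, if known."""
    return SYNONYMS.get(mention.lower())

print(normalise("TP53"))  # GENE:0001 -- different names, same identifier
print(normalise("p53"))   # GENE:0001
```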
Three approaches have been proposed to deal with this issue:
- Rule-based: assign scores to candidate identifiers, building a pool of
words associated with each identifier.
- ML: train a program to distinguish incorrect from correct
identifiers.
- Hybrid: use author association with genes to distinguish different
genes; usually a researcher will work with the same gene throughout their
career.
Identifying a relationship between two NEs is hard because in English a
relationship can be phrased in many different ways, making it hard to program
an algorithm that must look for all of them. Suggested ways to tackle this
problem include ML, rule-based methods, RelEx, and RLIMS-P.
The other issue with papers is that authors constantly hedge relationships,
ex) "survivin may interact with ATF5 to induce apoptosis". There are also
statements that negate a relationship, like "survivin does not interact with
BcL".
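A minimal pattern-based relation extractor, including the hedging and negation checks just mentioned, might look like this. The single pattern is deliberately simplistic; real systems use many patterns or parsers:

```python
import re

# Naive pattern: "<A> [may] [does not] interact(s) with <B>".
PATTERN = re.compile(
    r"(\w+)\s+(may\s+)?(does\s+not\s+)?interacts?\s+with\s+(\w+)", re.I)

def extract(sentence):
    """Return (entity_a, entity_b, status) or None if no relation is found."""
    m = PATTERN.search(sentence)
    if not m:
        return None
    a, hedge, neg, b = m.groups()
    status = "negated" if neg else "speculative" if hedge else "asserted"
    return (a, b, status)

print(extract("survivin may interact with ATF5"))      # ('survivin', 'ATF5', 'speculative')
print(extract("survivin does not interact with BcL"))  # ('survivin', 'BcL', 'negated')
```

Keeping the status label alongside the pair is what lets a database distinguish a confirmed interaction from a hedged or negated one.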
Swanson's ABC model, or Swanson linking, is making a discovery by reading and
connecting separate bodies of literature. Swanson linking is one of the things
that could be greatly improved by TM programs that work well. TM will become a
major player in scientific research, broadening the horizons of what
researchers begin to look into.
Coursera Week 2
When BLASTing proteins we must
- Have scoring systems that take into account evolutionary change over
time
- Favor identical or similar amino acids
- Not favor different or dissimilar amino acids
- Take into account the abundance of certain amino acids
The two best approaches to aligning proteins are BLAST and FASTA.
In BLAST you establish a seed, a very short sequence that you cross-match
against other sequences, together with a threshold that an alignment must
exceed in order to be considered a match; you can adjust this threshold as you
please. Once you have found sequences containing a word close enough to your
input seed (exceeding the threshold), you expand the comparison past the seed
sequence and keep expanding until the score of the sequence you are comparing
is no longer above the threshold, and then you stop expanding. The sequence at
its extended length is then reported as a match for the BLAST search.
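The seed-and-extend idea can be sketched in a few lines. Scoring is simplified to match/mismatch and extension runs rightward only; real BLAST uses substitution matrices, extends in both directions, and tunes the drop-off:

```python
def seed_and_extend(query, target, word_size=3, match=1, mismatch=-1):
    """Find exact word-size seeds, then extend right, keeping the best score."""
    hits = []
    for i in range(len(query) - word_size + 1):
        seed = query[i:i + word_size]
        j = target.find(seed)
        if j == -1:
            continue  # seed not present in target
        score = word_size * match
        best_score, best_len = score, word_size
        qi, ti = i + word_size, j + word_size
        while qi < len(query) and ti < len(target):
            score += match if query[qi] == target[ti] else mismatch
            qi += 1
            ti += 1
            if score > best_score:
                best_score, best_len = score, qi - i
            if best_score - score >= 2:  # drop-off: stop when score falls well below the best
                break
        hits.append((i, j, best_len, best_score))
    return hits

# (query offset, target offset, alignment length, score) per seed hit.
print(seed_and_extend("GATTACA", "TTGATTAGA"))  # first hit: (0, 2, 5, 5)
```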
BLAST has multiple programs (query vs database = comparison level):
blastn = DNA vs DNA = DNA
tblastx = DNA vs DNA = protein
blastx = DNA vs protein = protein
tblastn = protein vs DNA = protein
blastp = protein vs protein = protein
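The program table above can be captured as a lookup, where the comparison level is the space in which the alignment is actually scored:

```python
# (query type, database type, comparison level) for each BLAST program.
BLAST_PROGRAMS = {
    "blastn":  ("DNA",     "DNA",     "DNA"),
    "tblastx": ("DNA",     "DNA",     "protein"),
    "blastx":  ("DNA",     "protein", "protein"),
    "tblastn": ("protein", "DNA",     "protein"),
    "blastp":  ("protein", "protein", "protein"),
}

query, db, level = BLAST_PROGRAMS["blastx"]
print(f"blastx: {query} query vs {db} database, compared as {level}")
```

The "t" programs translate DNA in all six reading frames before comparing, which is why their comparison level is protein even when both inputs are DNA.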
How can you tell if your results are truly similar to your query?
- Significant expect values (closer to 0)
- Reciprocal best hit
- Similar sizes
- Common motifs
- Reasonable multiple sequence alignment
- Similar 3D structures
Use bit scores to see how good your results are. BLAST reports two scores, the
raw score and the normalized bit score; higher is better.
E-values: use these whenever you can. The E-value depends on the size of the
database as well as the length of the query sequence. A good E-value is around
10^-20.
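The dependence of the E-value on database size and query length is explicit in the Karlin-Altschul formula E = m * n * 2^(-S'), where S' is the bit score, m the query length, and n the database length. The numbers below are made up to show the trend:

```python
def expect_value(bit_score, query_len, db_len):
    """Karlin-Altschul expectation: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)

# Same bit score, bigger database -> larger (worse) E-value.
print(expect_value(60, 300, 1e6))  # ~2.6e-10
print(expect_value(60, 300, 1e9))  # ~2.6e-07
```

This is why the same hit can look highly significant against a small database and unremarkable against a huge one.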
"Substitution matrices describe the likelihood that a residue (whether
nucleotide or amino acid) will change over evolutionary time." PAM and BLOSUM
are the two most prominent substitution matrices. In PAM, the higher the number
(ex) PAM250 vs PAM120), the better that matrix is for divergent sequences; the
opposite is true for BLOSUM.
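Scoring an ungapped alignment with a substitution matrix is just a sum of per-residue scores. The mini-matrix below holds only a few amino-acid pairs with loosely BLOSUM62-like values, purely for illustration:

```python
# A tiny illustrative substitution matrix (values loosely BLOSUM62-like).
MATRIX = {
    ("A", "A"): 4, ("A", "S"): 1, ("A", "W"): -3,
    ("S", "S"): 4, ("S", "W"): -3, ("W", "W"): 11,
}

def score(seq1, seq2):
    """Sum substitution scores over an ungapped alignment of equal length."""
    total = 0
    for a, b in zip(seq1, seq2):
        # The matrix is symmetric, so look up the pair in either order.
        total += MATRIX.get((a, b), MATRIX.get((b, a), 0))
    return total

print(score("AWS", "AWA"))  # 4 + 11 + 1 = 16
```

Note how the conservative S/A substitution still scores positively while a radical change like A/W scores negatively, which is exactly the "favor similar amino acids" requirement listed earlier.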