Friday, July 15, 2016

Bioinformatics Week 2

Coursera Week 1
We have an enormous amount of data that has accumulated over many years of research, but there is now so much of it that we cannot access it all, because the databases that hold it are not aligned. We can create databases to organize things like post-translational modifications of proteins, intron splice variants that lead to disease, metabolic pathways and types of metabolites, or simply a compilation of the literature that exists for a certain topic.
You can store data in a number of ways, for example as a flat file or a spreadsheet. The problem with these is redundancy: the same information gets stored repeatedly, which takes up storage space, and storage is expensive.
You can fix redundancy with relational databases, where you separate the data out by subject and then draw connections between the tables. For example, take a record with a person's name, job, phone number, age, and department. Separate it into one table with the person's name, job, and age, and another table with the phone number and department. If you want to contact the person, there is a link between the first table and the second, so you can reach the second table through the first. This avoids redundancy. A relational database is made of tables with columns, and you query it with SQL (often pronounced "sequel"). Databases usually give you more than one way to access an entry, typically either an identifier or an accession code. An identifier is a string of letters and digits that is understandable to humans, but identifiers change over time as the curators decide what is best. An accession code, on the other hand, does not change, but it is a random string of numbers or letters that is not descriptive of its entry the way an identifier is.
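As a rough sketch of the idea (the table and column names here are made up for illustration, not from the course), the person/department example above could be split into two linked tables with SQLite:

import sqlite3

# In-memory database purely for illustration; all names are hypothetical.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One table for the person, another for department contact details,
# linked by a department id so the contact info is stored only once.
cur.execute("""CREATE TABLE department (
                   dept_id INTEGER PRIMARY KEY,
                   name    TEXT,
                   phone   TEXT)""")
cur.execute("""CREATE TABLE person (
                   person_id INTEGER PRIMARY KEY,
                   name      TEXT,
                   job       TEXT,
                   age       INTEGER,
                   dept_id   INTEGER REFERENCES department(dept_id))""")

cur.execute("INSERT INTO department VALUES (1, 'Genomics', '555-0100')")
cur.execute("INSERT INTO person VALUES (1, 'Alice', 'Analyst', 30, 1)")

# To contact the person, follow the link from the first table to the second.
cur.execute("""SELECT person.name, department.phone
               FROM person JOIN department
               ON person.dept_id = department.dept_id""")
print(cur.fetchall())  # [('Alice', '555-0100')]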
"So, the key point here is that to specify a sequence exactly in GenBank we use either its GI number or theAccession.Version. Conversely, to retrieve the most up to date sequence we can use the accession number without the version. And in that case, the most up-to-date sequence will be retrieved automatically." For nucleotide sequence databases there is Genbank and WGS which are the two main databases.

To search for sequences you can either use keywords or search based on the sequence itself.

After a gene duplication there are two copies of the gene, an alpha sequence and a beta sequence. Copies of the same form in different species (alpha compared with alpha, or beta compared with beta), which diverged through speciation, are orthologs. The alpha and beta copies compared with each other, which diverged through the duplication itself, are paralogs. All of them together, orthologs and paralogs alike, are homologs.

You can search across databases using Entrez/GQuery, which is essentially the Google of the database world: a single search box that queries all of the NCBI databases at once.

*Possibly interested in making a database of which diseases affect which metabolic pathways and what issues that change leads to.

Text Mining

Text mining is the process of retrieving evidence from previously published papers, drawing on information retrieval (IR), machine learning (ML), natural language processing (NLP), and statistical and computational linguistics (CL). Text mining is especially useful in human genomics. It is becoming very difficult for researchers to find the time to read the exorbitant amount of literature on the topic they are researching, so they need a program that will do it for them.

Setbacks for text mining:
  • Copyright issues: only a small percentage of papers are available for free.
  • Publication bias: only positive findings tend to be published, rather than negative or null findings. So in terms of data, only half of the picture is visible; the data that didn't support one researcher's hypothesis could support another researcher's work, but they will never have access to it.
To text mine, named entities (NEs) must first be recognized, and then relationships between NEs must be established. These named entities must also be assigned identifiers: everything should have exactly one identifier, and identifiers must not overlap. Once you have identified the NEs, you must check whether the NEs in a paper actually have a relationship; just because they appear in the same paper does not mean they are related.

Ex) Take a sentence: first the keywords (NEs) are identified and color coded by type, green for a species, orange for a protein/gene, and blue for a relationship phrase that links the other NEs. The NEs are then normalized (mapped to standard identifiers) and the relationship between them is established.
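A toy sketch of the same idea, using dictionary lookup for NE tagging and a trigger phrase for the relationship (the synonym dictionary, identifiers, and trigger words are made up for illustration, not a real text-mining pipeline):

import re

# Hypothetical dictionaries mapping surface forms to identifiers.
gene_dict = {"survivin": "GENE:0001", "ATF5": "GENE:0002"}
species_dict = {"human": "TAXON:9606"}
relation_triggers = {"interacts with", "binds", "activates"}

sentence = "In human cells, survivin interacts with ATF5."

# Tag NEs by dictionary lookup and normalise them to identifiers.
entities = []
for word, ident in {**gene_dict, **species_dict}.items():
    if re.search(r"\b" + re.escape(word) + r"\b", sentence, re.IGNORECASE):
        entities.append((word, ident))

# Only assert a relationship if a trigger phrase links two gene NEs.
has_trigger = any(t in sentence for t in relation_triggers)
genes = [e for e in entities if e[1].startswith("GENE")]
if has_trigger and len(genes) >= 2:
    print("relation:", genes[0], "<->", genes[1])
print("entities:", entities)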

NER (named entity recognition) is harder to do in the research literature because many genes have multiple names, and this is true for many things in science; most things have at least one synonym. Therefore when assigning identifiers you must take all the differing names into account.
The HUGO Gene Nomenclature Committee is attempting to standardize gene naming so that every gene has a specific name, but not all genes have been assigned a name yet, and the literature from before the committee still uses many names for a single gene. So when creating a database that spans papers from many years, it is hard to assign just one identifier to each thing.
One example of an attempt at normalization: to differentiate between genes and proteins, whose names often overlap, the convention is that protein names begin with an upper-case letter whereas gene names begin with a lower-case letter. Some genes also have ordinary-word names like "not" and "that", which makes it hard for search engines to distinguish the gene from the word.
Table of linguistics terms:
  • Anaphor: a word or phrase that refers back to an earlier word or phrase.
  • Polysemy: the coexistence of many possible meanings for a word or phrase.
  • Homonymy: each of two or more words having the same spelling and pronunciation but different meanings and origins.
  • Semantics: relating to meaning in language or logic.
  • Syntax: the arrangement of words and phrases to create well-formed sentences in a language.
  • Part of speech: one of the traditional categories of words intended to reflect their functions in a grammatical context.
There is also an issue with abbreviations within research papers: many papers abbreviate terms, and a given abbreviation may stand for upwards of 60 other terms. This makes it very hard for an algorithm to direct you to the proper entry.
Normalisation is "to compare an NE against a dictionary of synonyms and identifiers, and assign the matching identifier." Rule-based approaches and string similarity metrics attempt to solve the problem of having more than one identifier per word, or more than one word per identifier (a minimal sketch of dictionary matching follows the list below).
Three approaches have been proposed to deal with this issue:
  • Rule based: assign scores to candidate identifiers, e.g. by building a pool of words associated with each identifier.
  • ML: train a program to distinguish incorrect from correct identifiers.
  • Hybrid: use author association with genes to disambiguate; a researcher will usually work with the same gene throughout their career.
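A minimal sketch of the dictionary-plus-string-similarity idea mentioned above, using Python's standard difflib for fuzzy matching (the synonym dictionary is a tiny illustrative example, not a real gene lexicon):

import difflib

# Illustrative synonym dictionary: surface forms -> gene identifier.
synonyms = {
    "tumor protein p53": "HGNC:11998",
    "TP53": "HGNC:11998",
    "p53": "HGNC:11998",
    "BRCA2": "HGNC:1101",
}

def normalise(mention, cutoff=0.8):
    """Map a mention to an identifier via exact or fuzzy string match."""
    if mention in synonyms:
        return synonyms[mention]
    close = difflib.get_close_matches(mention, list(synonyms), n=1, cutoff=cutoff)
    return synonyms[close[0]] if close else None

print(normalise("TP53"))                # exact hit -> HGNC:11998
print(normalise("tumour protein p53"))  # fuzzy hit despite the spelling variant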
Identifying a relationship between two NEs is hard because, in English, relationships can be phrased in many different ways, making it hard to program an algorithm that must look for all of them. There have been suggestions for how to address this, including ML, rule-based methods, RelEx, and RLIMS-P.
Another issue with papers is that authors often only suggest relationships, e.g. "survivin may interact with ATF5 to induce apoptosis." There are also statements that negate a relationship, like "survivin does not interact with Bcl."
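A very small sketch of how such hedged or negated statements might be flagged with cue words (the cue lists are made-up examples, far simpler than what real systems use):

# Flag hedged or negated relationship statements by simple cue words.
HEDGE_CUES = {"may", "might", "could", "suggests", "possibly"}

def classify(sentence):
    text = sentence.lower()
    tokens = text.split()
    if "does not" in text or "fails to" in text or "no evidence" in text:
        return "negated"
    if any(t in HEDGE_CUES for t in tokens):
        return "hedged"
    return "asserted"

print(classify("Survivin may interact with ATF5 to induce apoptosis."))  # hedged
print(classify("Survivin does not interact with Bcl."))                  # negated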
Swanson's ABC model, or Swanson linking, is when you make a discovery by reading and connecting separate pieces of literature. Swanson linking is one of the things that could be greatly improved by working TM programs. TM will become a major player in scientific research, broadening the horizons of what researchers begin to look into.


Coursera Week 2
When BLASTing proteins, the scoring system must:
  • take into account evolutionary change over time
  • favor identical or similar amino acids
  • penalize different or dissimilar amino acids
  • take into account the abundance of certain amino acids
The two best approaches to aligning proteins are BLAST and FASTA.
In BLAST you establish a seed and a threshold that an alignment must exceed in order to be considered a match; you can adjust this threshold as you please. The seed is a very short sequence (a word) that you cross-match against other sequences. Once you have found sequences containing a seed close enough to your query seed (it exceeds the threshold), you extend the comparison past the seed and keep extending until the alignment no longer scores well enough, and then you stop extending. The extended sequence is what is reported as a match in a BLAST search.
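A very simplified sketch of this seed-and-extend idea (real BLAST uses word hits against an indexed database, substitution matrices, and X-drop extension in both directions; here the scoring is just +1/-1 and extension only goes rightward, purely for illustration):

def seed_and_extend(query, subject, word_size=3, drop=2):
    """Toy seed-and-extend: find exact word hits, then grow each hit
    to the right while the running score stays near its best value."""
    best = None
    for i in range(len(query) - word_size + 1):
        word = query[i:i + word_size]
        j = subject.find(word)
        if j == -1:
            continue  # this seed has no hit in the subject
        score = best_score = length = word_size
        while (i + length < len(query)) and (j + length < len(subject)):
            score += 1 if query[i + length] == subject[j + length] else -1
            length += 1
            if score > best_score:
                best_score = score
            elif best_score - score >= drop:
                break  # score has dropped too far below its peak: stop extending
        if best is None or best_score > best[0]:
            best = (best_score, query[i:i + length], subject[j:j + length])
    return best

print(seed_and_extend("ACGTTGCA", "TTACGTTGGA"))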
BLAST has multiple programs (query vs database, and the level at which the comparison is made):
  • blastn: DNA vs DNA, compared as DNA
  • blastp: protein vs protein, compared as protein
  • blastx: DNA (translated) vs protein, compared as protein
  • tblastn: protein vs DNA (translated), compared as protein
  • tblastx: DNA (translated) vs DNA (translated), compared as protein
How can you tell if your results are truly similar to your query?

  • Significant expect values (closer to 0)
  • Reciprocal best hit
  • Similar sizes
  • Common motifs
  • Reasonable multiple sequence alignment
  • Similar 3D structures
Use bit scores to see how good your results are. BLAST reports two scores, a raw score and a normalized bit score; the higher the bit score, the better the match.
E-values: use these whenever you can. The E-value depends on the size of the database as well as the length of the query sequence; it estimates how many matches this good you would expect by chance. A good E-value is around 10^-20 or smaller.
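As a rough illustration of that dependence, the bit score relates to the E-value approximately as E = m * n * 2^(-bit score), where m is the query length and n is the database size (the numbers below are made up):

# E-value from a bit score: the same bit score becomes less significant
# as the query (m) or the database (n) grows.
def e_value(bit_score, query_length, database_length):
    return query_length * database_length * 2 ** (-bit_score)

print(e_value(bit_score=60, query_length=300, database_length=10**9))   # ~2.6e-07
print(e_value(bit_score=100, query_length=300, database_length=10**9))  # ~2.4e-19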
"Substitution matrices describe the likelihood that a residue (whether nucleotide or amino acid) will change over evolutionary time" PAM and BLOSUM are the two most prominent subtitution matrices. In PAM the higher the number ex)PAM250 or PAM120 the better that matrices is for divergent sequences, the opposite is true for BLOSUM
