Coursera Week 1
We have accumulated an enormous amount of data over many years of research, but
we cannot access it all because the databases that hold it are not aligned. We
can create databases to organize things like post-transcriptional structures of
proteins, intron splice variants that lead to disease, metabolic pathways and
types of metabolites, or simply a compilation of the literature that exists on
a certain topic.
You can store data in a number of ways:
- Flat file
- Spreadsheet
The problem with these is redundancy, which takes up storage space, and storage
space is expensive.
You can fix redundancy with a relational database, where you separate the data
out by subject into tables and then draw connections between the tables.
Ex) A record holds a person's name, job, number, age, and department. Separate
it into one table with name, job, and age, and another table with number and
department; if you want to contact the person, the link between the first table
and the second lets you access the second table through the first. This avoids
redundancy. A relational database is organized into tables and columns, and you
query it with SQL (originally called SEQUEL). Databases usually have more than
one way to access an entry: either an identifier or an accession code. An
identifier is a string of letters and digits that is understandable to humans;
however, identifiers change based on what the curators believe is best, so they
are not stable. An accession code, by contrast, does not change, but it is a
random string of numbers or letters that is not indicative of its entry the way
an identifier is.
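The relational-database idea above can be sketched with Python's built-in sqlite3 module. The schema, names, and values here are invented to mirror the person/contact example:

```python
import sqlite3

# In-memory database with two tables linked by a person id (illustrative schema).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, job TEXT, age INTEGER)")
cur.execute("CREATE TABLE contact (person_id INTEGER REFERENCES person(id), number TEXT, department TEXT)")
cur.execute("INSERT INTO person VALUES (1, 'Ada', 'Analyst', 36)")
cur.execute("INSERT INTO contact VALUES (1, '555-0100', 'Research')")

# Reach the second table through the first via a join -- no data is duplicated.
row = cur.execute(
    "SELECT person.name, contact.number FROM person "
    "JOIN contact ON contact.person_id = person.id "
    "WHERE person.name = 'Ada'"
).fetchone()
print(row)  # ('Ada', '555-0100')
```

Each person's contact details are stored once; any number of queries can reach them through the `person_id` link, which is exactly how the redundancy is avoided.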
"So, the key point here is that to specify a sequence exactly in GenBank we use
either its GI number or the Accession.Version. Conversely, to retrieve the most
up-to-date sequence we can use the accession number without the version. And in
that case, the most up-to-date sequence will be retrieved automatically."
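The Accession.Version convention quoted above can be handled with a small helper. The accession string below is a made-up example, not a real GenBank entry:

```python
def split_accession(acc_ver):
    """Split 'ACCESSION.VERSION' into its parts; version is None if absent."""
    accession, sep, version = acc_ver.partition(".")
    return accession, int(version) if sep else None

# A versioned identifier pins one exact sequence; the bare accession
# stands for whatever the latest version currently is.
print(split_accession("U12345.2"))  # ('U12345', 2)
print(split_accession("U12345"))   # ('U12345', None)
```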
For nucleotide sequences, GenBank and WGS are the two main databases.
To search for sequences you can either use keywords or you can search based on
the sequence itself.
In gene duplication the tree branches into two chains, an alpha lineage and a
beta lineage. Genes in the same lineage but in different species (alpha
compared with alpha, or beta compared with beta) are orthologs: they diverged
when the species split. An alpha gene compared with a beta gene is a pair of
paralogs: they diverged at the duplication event. All of these sequences
together are homologs.
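The alpha/beta distinction above can be captured in a toy classifier. The species and copy labels are invented for illustration:

```python
def relation(gene_a, gene_b):
    """Classify two homologous genes, each given as (species, copy).

    Same copy (alpha vs alpha) in different species -> ortholog: they
    diverged by speciation. Different copies (alpha vs beta) -> paralog:
    they diverged at the duplication. Either way, both are homologs.
    """
    (sp_a, copy_a), (sp_b, copy_b) = gene_a, gene_b
    if copy_a != copy_b:
        return "paralog"
    if sp_a != sp_b:
        return "ortholog"
    return "same gene"

print(relation(("human", "alpha"), ("mouse", "alpha")))  # ortholog
print(relation(("human", "alpha"), ("human", "beta")))   # paralog
```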
You can search across databases using Entrez/Gquery; it is the equivalent of
Google in the database world.
*Possibly interested in making a database of which diseases affect which
metabolic pathways and what issues those changes lead to.
Text Mining
Text mining is the process of retrieving evidence from previously published
papers using information retrieval (IR), machine learning (ML), natural
language processing (NLP), or statistical and computational linguistics (CL).
Text mining is especially useful in human genomics. It is becoming very
difficult for researchers to find the time to read the exorbitant amount of
literature on the topic they are researching, so they need a program that will
do it for them.
Setbacks for Text Mining:
- Copyright issues: only a small percentage of papers are available for
free.
- Publication bias: positive findings are published rather than negative
or null findings. So in terms of data only half of the picture is
visible; the data that didn't support one researcher's hypothesis could
support another researcher's work, but they will never have access to it.
To text mine, named entities (NEs) must be identified and relationships between
NEs must be established. The named entities must also be assigned identifiers:
every entity must have exactly one identifier, and no identifier may point to
more than one entity. Once you identify the NEs, you must check whether the NEs
in a paper actually have a relationship; just because they appear in the same
paper does not mean they are related.
Ex) Given a sentence, first the keywords (NEs) are identified and color-coded
by type: green is a species, orange is a protein/gene, and blue is a
relationship phrase that establishes the relationship between the other NEs.
The NEs are then normalized (different names for the same entity are mapped to
the same identifier), and finally the relationship between them is established.
NER (named-entity recognition) is hard to do on the research literature because
many genes have multiple names, and this is true for many things in science:
most entities have at least one synonym. Therefore, when assigning identifiers
you must take all the differing names into account.
The HUGO Gene Nomenclature Committee is attempting to standardize the naming of
genes so that every gene has one specific name; however, not all genes have
been assigned a name yet, and all the literature from before the HUGO committee
still uses many names for single genes. So when creating a database that
encompasses papers from years and years, it is hard to assign just one
identifier to each entity.
One example of attempting to normalize naming, needed because gene and protein
names often overlap, is the convention that protein names begin with an
upper-case letter whereas gene names begin with a lower-case letter. Some genes
also have ordinary-word names, like not and that, which makes it hard for
search engines to distinguish the gene from the word.
Table of linguistics terms:

Term           | Meaning
Anaphor        | A word or phrase that refers back to an earlier word or phrase
Polysemy       | The coexistence of many possible meanings for a word or phrase
Homonymy       | Each of two or more words having the same spelling and pronunciation but different meanings and origins
Semantics      | Relating to meaning in language or logic
Syntax         | The arrangement of words and phrases to create well-formed sentences in a language
Part of speech | One of the traditional categories of words intended to reflect their functions in a grammatical context
There is also an issue with abbreviations within research papers: many papers
abbreviate terms, and a given abbreviation can stand for upwards of 60 other
terms. This makes it very hard for an algorithm to direct you to the proper
entry.
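A toy illustration of the abbreviation problem, resolved by overlap with context words. The abbreviation, its expansions, and the clue words are all invented for illustration:

```python
# Hypothetical expansions for one ambiguous abbreviation (illustrative only).
EXPANSIONS = {
    "PS": ["phosphatidylserine", "protein S", "pulmonary stenosis"],
}

CLUES = {
    "phosphatidylserine": {"membrane", "lipid"},
    "protein S": {"coagulation", "clotting"},
    "pulmonary stenosis": {"heart", "valve"},
}

def disambiguate(abbrev, context):
    """Pick the expansion whose clue words overlap the surrounding text most."""
    words = set(context.lower().split())
    return max(EXPANSIONS[abbrev],
               key=lambda exp: len(words & CLUES.get(exp, set())))

print(disambiguate("PS", "PS deficiency impairs clotting"))  # protein S
```

Real systems use far richer context models, but the core idea is the same: the surrounding words vote for one sense of the abbreviation.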
Normalisation is "to compare an NE against a dictionary of synonyms and
identifiers, and assign the matching identifier." Rule-based approaches and
string-similarity metrics attempt to solve the problem of having more than one
identifier per name, or more than one name per identifier.
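Normalisation in its simplest form is the dictionary lookup just quoted. The gene symbols below are real names, but the identifiers are invented placeholders:

```python
# Synonym dictionary mapping every surface form to one identifier
# (identifiers are invented for illustration).
SYNONYMS = {
    "tp53": "GENE:0001",
    "p53": "GENE:0001",
    "tumor protein 53": "GENE:0001",
    "brca1": "GENE:0002",
}

def normalise(mention):
    """Map a named-entity mention to its identifier, if known."""
    return SYNONYMS.get(mention.lower())

print(normalise("TP53"))  # GENE:0001 -- different names, same identifier
print(normalise("p53"))   # GENE:0001
```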
Three approaches have been proposed to deal with this issue:
- Rule-based: assign scores to candidate identifiers, building a pool of
words associated with each identifier.
- ML: train a program to distinguish incorrect from correct
identifiers.
- Hybrid: use author association with genes to distinguish different
genes; usually a researcher will work with the same gene throughout their
career.
Identifying a relationship between two NEs is hard because in English a
relationship can be phrased in many different ways, making it hard to program
an algorithm that must look for all of them. Suggested ways to tackle this
problem include ML, rule-based methods, RelEx, and RLIMS-P.
The other issue with papers is that authors constantly hedge relationships,
ex) "survivin may interact with ATF5 to induce apoptosis". There are also
statements that negate a relationship, like "survivin does not interact with
BcL".
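A minimal pattern-based relation extractor, including the hedging and negation checks just mentioned, might look like this. The single pattern is deliberately simplistic; real systems use many patterns or parsers:

```python
import re

# Naive pattern: "<A> [may] [does not] interact(s) with <B>".
PATTERN = re.compile(
    r"(\w+)\s+(may\s+)?(does\s+not\s+)?interacts?\s+with\s+(\w+)", re.I)

def extract(sentence):
    """Return (entity_a, entity_b, status) or None if no relation is found."""
    m = PATTERN.search(sentence)
    if not m:
        return None
    a, hedge, neg, b = m.groups()
    status = "negated" if neg else "speculative" if hedge else "asserted"
    return (a, b, status)

print(extract("survivin may interact with ATF5"))      # ('survivin', 'ATF5', 'speculative')
print(extract("survivin does not interact with BcL"))  # ('survivin', 'BcL', 'negated')
```

Keeping the status label alongside the pair is what lets a database distinguish a confirmed interaction from a hedged or negated one.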
Swanson's ABC model, or Swanson linking, is making a discovery by reading and
connecting separate bodies of literature. Swanson linking is one of the things
that could be greatly improved by TM programs that work well. TM will become a
major player in scientific research, broadening the horizons of what
researchers begin to look into.
Coursera Week 2
When BLASTing proteins we must
- Have scoring systems that take into account evolutionary change over
time
- Favor identical or similar amino acids
- Not favor different or dissimilar amino acids
- Take into account the abundance of certain amino acids
The two best approaches to aligning proteins are BLAST and FASTA.
In BLAST you establish a seed, a very short sequence that you cross-match
against other sequences, together with a threshold that an alignment must
exceed in order to be considered a match; you can adjust this threshold as you
please. Once you have found sequences containing a word close enough to your
input seed (exceeding the threshold), you expand the comparison past the seed
sequence and keep expanding until the score of the sequence you are comparing
is no longer above the threshold, and then you stop expanding. The sequence at
its extended length is then reported as a match for the BLAST search.
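The seed-and-extend idea can be sketched in a few lines. Scoring is simplified to match/mismatch and extension runs rightward only; real BLAST uses substitution matrices, extends in both directions, and tunes the drop-off:

```python
def seed_and_extend(query, target, word_size=3, match=1, mismatch=-1):
    """Find exact word-size seeds, then extend right, keeping the best score."""
    hits = []
    for i in range(len(query) - word_size + 1):
        seed = query[i:i + word_size]
        j = target.find(seed)
        if j == -1:
            continue  # seed not present in target
        score = word_size * match
        best_score, best_len = score, word_size
        qi, ti = i + word_size, j + word_size
        while qi < len(query) and ti < len(target):
            score += match if query[qi] == target[ti] else mismatch
            qi += 1
            ti += 1
            if score > best_score:
                best_score, best_len = score, qi - i
            if best_score - score >= 2:  # drop-off: stop when score falls well below the best
                break
        hits.append((i, j, best_len, best_score))
    return hits

# (query offset, target offset, alignment length, score) per seed hit.
print(seed_and_extend("GATTACA", "TTGATTAGA"))  # first hit: (0, 2, 5, 5)
```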
BLAST has multiple programs (query vs database = comparison level):
blastn = DNA vs DNA = DNA
tblastx = DNA vs DNA = protein
blastx = DNA vs protein = protein
tblastn = protein vs DNA = protein
blastp = protein vs protein = protein
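The program table above can be captured as a lookup, where the comparison level is the space in which the alignment is actually scored:

```python
# (query type, database type, comparison level) for each BLAST program.
BLAST_PROGRAMS = {
    "blastn":  ("DNA",     "DNA",     "DNA"),
    "tblastx": ("DNA",     "DNA",     "protein"),
    "blastx":  ("DNA",     "protein", "protein"),
    "tblastn": ("protein", "DNA",     "protein"),
    "blastp":  ("protein", "protein", "protein"),
}

query, db, level = BLAST_PROGRAMS["blastx"]
print(f"blastx: {query} query vs {db} database, compared as {level}")
```

The "t" programs translate DNA in all six reading frames before comparing, which is why their comparison level is protein even when both inputs are DNA.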
How can you tell if your results are truly similar to your query?
- Significant expect values (closer to 0)
- Reciprocal best hit
- Similar sizes
- Common motifs
- Reasonable multiple sequence alignment
- Similar 3D structures
Use bit scores to see how good your results are. BLAST reports two scores, the
raw score and the normalized bit score; higher is better.
E-values: use these whenever you can. The E-value depends on the size of the
database as well as the length of the query sequence. A good E-value is around
10^-20.
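The dependence of the E-value on database size and query length is explicit in the Karlin-Altschul formula E = m * n * 2^(-S'), where S' is the bit score, m the query length, and n the database length. The numbers below are made up to show the trend:

```python
def expect_value(bit_score, query_len, db_len):
    """Karlin-Altschul expectation: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)

# Same bit score, bigger database -> larger (worse) E-value.
print(expect_value(60, 300, 1e6))  # ~2.6e-10
print(expect_value(60, 300, 1e9))  # ~2.6e-07
```

This is why the same hit can look highly significant against a small database and unremarkable against a huge one.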
"Substitution matrices describe the likelihood that a residue (whether
nucleotide or amino acid) will change over evolutionary time." PAM and BLOSUM
are the two most prominent substitution matrices. In PAM, the higher the number
(ex) PAM250 vs PAM120), the better that matrix is for divergent sequences; the
opposite is true for BLOSUM.
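Scoring an ungapped alignment with a substitution matrix is just a sum of per-residue scores. The mini-matrix below holds only a few amino-acid pairs with loosely BLOSUM62-like values, purely for illustration:

```python
# A tiny illustrative substitution matrix (values loosely BLOSUM62-like).
MATRIX = {
    ("A", "A"): 4, ("A", "S"): 1, ("A", "W"): -3,
    ("S", "S"): 4, ("S", "W"): -3, ("W", "W"): 11,
}

def score(seq1, seq2):
    """Sum substitution scores over an ungapped alignment of equal length."""
    total = 0
    for a, b in zip(seq1, seq2):
        # The matrix is symmetric, so look up the pair in either order.
        total += MATRIX.get((a, b), MATRIX.get((b, a), 0))
    return total

print(score("AWS", "AWA"))  # 4 + 11 + 1 = 16
```

Note how the conservative S/A substitution still scores positively while a radical change like A/W scores negatively, which is exactly the "favor similar amino acids" requirement listed earlier.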