Shared Flashcard Set

Details

Applied Bioinformatics: Final
University of Guelph BIOL*3300
88
Biology
Undergraduate 3
04/09/2017

Cards

Term
Algorithmic method
Definition
A method of making a tree, and evaluating a phylogeny. Uses a distance matrix, which may have overestimations of distance. A series of steps in an algorithm are followed for dealing with data, producing a tree. There are two methods of using the data matrix: neighbor joining and UPGMA. Produces a single answer, which cannot be evaluated relative to others. Computationally and conceptually simple. Straightforward assumptions, which are stringent and may be violated.
Term
Amelotin and enamelin
Definition
Proteins involved in tooth enamel formation. Encoded by genes that are potentially targeted by the ancestral miRNA.
Term
Ancestral allele
Definition
The allele which is present in the common ancestor.
Term
Artificial neural network (ANN)
Definition
A mathematical model that is based on the neural networks of the brain. Functions to draw connections between variables. Composed of an interconnected group of nodes, where each node represents an artificial neuron. Each input value has an associated weight, and the inputs are fed into a summing function that sums the weighted inputs and maps the result to an output. A feed-forward pass through the network assigns a value to each output node, and the node with the highest value is recorded. Models are used in diagnosis of tumour cell stages, and prediction of survival rates after certain treatments for cancer.
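
The weighted sum and "highest output node wins" steps can be sketched as follows. This is a minimal single-layer illustration, not a full ANN: the weight layout, the function names, and the example numbers are assumptions made here, and real networks add hidden layers and activation functions.

```python
def feed_forward(inputs, weights):
    """Return one summed value per output node.

    weights[k] holds one weight per input for output node k
    (a hypothetical layout chosen for this sketch).
    """
    return [sum(w * x for w, x in zip(node_weights, inputs))
            for node_weights in weights]


def predict(inputs, weights):
    """Return the index of the output node with the highest value."""
    outputs = feed_forward(inputs, weights)
    return max(range(len(outputs)), key=lambda k: outputs[k])


# Hypothetical weights for two inputs and two output nodes.
example_weights = [[0.5, 0.1], [0.2, 0.9]]
winner = predict([1.0, 2.0], example_weights)
```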
Term
Back substitution
Definition
A substitution back to the original residue. Can alter measurements of distance. There are two changes in one taxon, to the same character, producing no differences. It masks changes. Common in genomes that have certain nucleotides at high frequency.
Term
Binary tree
Definition
A tree in which three edges meet at each internal node. There are n terminal nodes, where n is the number of species or molecules.
Term
Biological process ontology
Definition
Describes DNA metabolism. Ontology terms are nodes, arranged from general to specific. Genes are annotated with ontology terms. A biological process.
Term
Branches
Definition

Edges

A component of a tree. Those which connect nodes have an evolutionary divergence associated with them, generally denoted as ti. The sum of edge distance is an estimate of genetic distance.

Term
C-G rich island
Definition
A region rich in CG dinucleotides, where a C is followed by a G. Exists in many genomes in regulatory regions, often promoters. Most sequences are very poor in CG, but in promoters there are many, forming an "island". A C-G base pair has three hydrogen bonds rather than two, making the DNA more stable. The cytosines are non-methylated to prevent mutagenesis.
Term
Chameleons
Definition
Lizards which live in trees and can change colour. Of interest for vicariance biogeography, because they can be found in Madagascar and Africa, but also in surrounding areas; they must have originated in Africa and dispersed outwards by ocean.
Term
Character
Definition
A position within a sequence, in molecular evolution. Has a finite number of character states. Thought of as independent variables. Can be any trait. Historically they were morphological traits such as body parts, or enzymes. To be useful in inferring relationships, they must vary. Assumes that characters are independent and homologous, and that nucleotide sequence data is unordered.
Term
Character state
Definition
Mutually exclusive states of characters, in molecular evolution. For nucleotides they are A, T, G, and C.
Term
Characters are homologous
Definition
An important assumption made in sequence alignment. All states are assumed to have been derived from a corresponding state observed in the common ancestor of those taxa. Alignment of sequences must achieve positional homology. All nucleotides at a given position within the matrix should be homologous. With morphological characters, wings of birds are not homologous with wings of bats.
Term
Characters are independent
Definition
An assumption made in sequence alignment. The state of character 1 is assumed to have no bearing on the state of character 2. Not entirely true, but reasonable. Roughly correct, with some exceptions. If a codon must code for a specific amino acid, the nucleotide identity at the first position will have a strong bearing on the identity of the second position.
Term
Clade
Definition

Monophyletic group

All the individuals that are derived from a common ancestor. All descendant taxa from a given node. Considered to reflect true genealogies. Infamous for instability, with many names for groups.

Term
Coincidental substitution
Definition
There are two changes in two taxa, to different characters, producing one difference.
Term
Connexin 43
Definition
There is comparably high conservation of miRNA 1 binding sites across species.
Term
Convergent substitution
Definition
There are two changes in one taxon, to a different character, and one change in another taxon to that same character, producing no differences. It masks changes. Common in genomes that have certain nucleotides at high frequency.
Term
Correction method
Definition

Uses a matrix to represent sequence alignments. Starting with sequence alignments, records the frequency of nucleotide matches in a matrix.

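$$F_{xy} = \frac{1}{N}\begin{pmatrix} n_{AA} & n_{AC} & n_{AG} & n_{AT} \\ n_{CA} & n_{CC} & n_{CG} & n_{CT} \\ n_{GA} & n_{GC} & n_{GG} & n_{GT} \\ n_{TA} & n_{TC} & n_{TG} & n_{TT} \end{pmatrix}$$

$$F_{xy} = \begin{pmatrix} a & b & c & d \\ e & f & g & h \\ i & j & k & l \\ m & n & o & p \end{pmatrix}$$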
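
As a rough sketch of how such a matrix could be filled in from a pairwise alignment: rows are indexed by the nucleotide in sequence x and columns by the nucleotide in sequence y, in the order A, C, G, T. The function name and the choice to skip gaps and ambiguous bases are assumptions made for this example.

```python
NUCLEOTIDES = "ACGT"


def frequency_matrix(seq_x, seq_y):
    """Return the 4x4 matrix of aligned-pair frequencies (rows: x, columns: y)."""
    if len(seq_x) != len(seq_y):
        raise ValueError("aligned sequences must have equal length")
    counts = [[0] * 4 for _ in range(4)]
    total = 0
    for x, y in zip(seq_x.upper(), seq_y.upper()):
        # Assumption: positions with gaps or ambiguous bases are skipped.
        if x in NUCLEOTIDES and y in NUCLEOTIDES:
            counts[NUCLEOTIDES.index(x)][NUCLEOTIDES.index(y)] += 1
            total += 1
    if total == 0:
        raise ValueError("no comparable positions")
    return [[c / total for c in row] for row in counts]


F = frequency_matrix("ACGTACGT", "ACGTACGA")
# The diagonal (a, f, k, p) holds the matching pairs.
matches = sum(F[i][i] for i in range(4))
```

Here `1 - matches` is the observed proportion of differing sites.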

Term
Corrections
Definition

Corrections for evolutionary distance.

1. Correction for different substitution rates in the genome. Some domains are more conserved than others, in order to remain functional. Changes occur equally, but natural selection takes out individuals with mutations in crucial domains. 

2. Correct for nucleotide frequency in the genome. Bacterial genomes are high in CG. Substitutions between nucleotides do not occur at equal frequency.

3. Correct for analysis of protein-coding sequences. Genetic code is redundant. Calculate distance with non-synonymous and synonymous substitutions separately; synonymous substitutions produce a distance about five times greater.

Term
Data matrix X
Definition
Assigns a character state, xij, to each taxon, i, for each character, j.
Term
DAVID
Definition

Database for Annotation, Visualization, and Integrated Discovery

An online gene annotation investigation website. Contains functions of genes and gene ontology.

Term
Derived allele
Definition
The allele which is not present in the common ancestor. Enriched for genetic disorders.
Term
Enriched
Definition
A relative term. Statistical significance of enrichment is calculated with Fisher's exact test. If the proportion of miRNA target genes that are transcription factors is greater than the proportion of expressed genes that are transcription factors, then transcription factors are enriched in miRNA targets.
Term
Evolutionary distance
Definition
The mean number of substitutions per site that have occurred between two sequences since their divergence. The relatedness of individuals is a function of the similarities of their homologous characters. Differences observed may not represent the substitutions that have occurred in the past. There may be back or multiple substitutions, causing base pair differences that mask previous substitutions. The real number of substitutions per site is greater than observed differences. There are many corrections for distance.
Term
F matrix
Definition

Holds the best scores for sequence alignments between one sequence (x) ending at position i, and another sequence (y) ending at position j. A common gap penalty is -8. There are three possible transitions to move from one state to another. Optimal score of an alignment, F(i,j), is the highest possible value of a previous optimum score, and the transition probability to the new state at (i,j). Uses BLOSUM 50 tables and penalties to obtain scores. The best alignment of sequences ending in xi and yj is a previous alignment score, plus the transition score to a new alignment. To obtain the best alignment, start at the best score, and follow the traceback through the matrix. The number of columns and rows is n + 2. A very error-prone process.

F(i,j) = max {F(i - 1,j - 1) + S (xi,yj); F(i - 1,j) - 8; F(i,j - 1) - 8}

1. F(0,0) = 0. Adding both x1 and y1 residues together from a previous point.

2. F(1,0) = -8. Point back to the originating cell. Adding residue from x1 only to a previous alignment.

3. F(0,1) = -8. Point back to the originating cell. Adding residue from y1 only to a previous alignment.
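
The recurrence and traceback can be sketched in a few lines. This is a minimal illustration under stated assumptions: a simple match/mismatch score (+5 / -4) stands in for the BLOSUM 50 table, the gap penalty is the -8 used above, and rows are indexed by x and columns by y. The function names are chosen for this example.

```python
GAP = -8


def score(a, b):
    # Assumption: simple match/mismatch scores instead of BLOSUM 50.
    return 5 if a == b else -4


def needleman_wunsch(x, y):
    """Return (best score, aligned x, aligned y) for a global alignment."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * GAP          # x residues aligned to leading gaps
    for j in range(1, m + 1):
        F[0][j] = j * GAP          # y residues aligned to leading gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                          F[i - 1][j] + GAP,
                          F[i][j - 1] + GAP)
    # Traceback from the final cell back to F(0,0).
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + GAP:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))


best, aligned_x, aligned_y = needleman_wunsch("ACGT", "AGT")
```

When several predecessors give the same score, this sketch prefers the diagonal move, which picks one of the equally optimal alignments.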

Term
Fisher's exact test
Definition
Used to test if different attributes are independent within small samples. Similar to a chi square test. Used if the expected number is 5 or less. Determines if two variables are independent. Null hypothesis of independence is an odds ratio of 1. Evaluates the probability of observed odds ratio or more extreme, given the true odds ratio. Can calculate the statistical significance of enrichment. Is transcription factor membership independent of whether a gene is an miRNA target or not?
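
A one-sided (enrichment) version of the test can be sketched with the hypergeometric distribution, using only the standard library. The 2x2 layout and function names are assumptions for this example: a = targets that are transcription factors, b = targets that are not, c = non-targets that are transcription factors, d = non-targets that are not. The counts in the example call are hypothetical.

```python
from math import comb


def odds_ratio(a, b, c, d):
    """Sample odds ratio of the 2x2 table [[a, b], [c, d]] (needs b*c > 0)."""
    return (a * d) / (b * c)


def fisher_exact_greater(a, b, c, d):
    """P(count in top-left cell >= a) with all margins fixed, under independence."""
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)
    k_max = min(row1, col1)
    favourable = sum(comb(row1, k) * comb(row2, col1 - k)
                     for k in range(a, k_max + 1))
    return favourable / total_tables


# Hypothetical counts: 3 of 4 targets are TFs, 1 of 4 non-targets is a TF.
p_value = fisher_exact_greater(3, 1, 1, 3)
```

If SciPy is available, `scipy.stats.fisher_exact` with `alternative="greater"` should give the same one-sided p value.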
Term
Gaps
Definition
Not usually considered when calculating the distance between sequences. Excluded at their associated position. This undercounts sequence differences, but nucleotide differences should give a good estimate of relatedness of two sequences relative to other sequences. Makes sequences less similar, but are typically ignored. Hard to model and give penalties. Makes a difference in overall score and comparison of sequences.
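
A minimal sketch of a distance calculation that excludes gapped positions, as described above. The function name and the use of "-" as the gap symbol are assumptions for this example.

```python
def p_distance(seq_x, seq_y, gap="-"):
    """Proportion of differing sites, ignoring any position with a gap."""
    if len(seq_x) != len(seq_y):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    differences = 0
    for a, b in zip(seq_x, seq_y):
        if a == gap or b == gap:
            continue  # gapped positions are excluded, not counted as differences
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no ungapped positions to compare")
    return differences / compared


# Nine aligned columns, one of them gapped: 1 difference over 8 compared sites.
distance = p_distance("ACGT-ACGT", "ACGTTACGA")
```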
Term
Gene ontology
Definition
An annotation system for genes, to describe their molecular function, biological processes, and cellular compartmentalization. Hierarchical, with general and specific terms.
Term
Genes
Definition
Can be grouped based on biological pathways in which they participate, or with genes whose expression is correlated according to transcription factor control, and other criteria.
Term
Global alignment
Definition
Alignment of sequences from end to end. Done by Needleman-Wunsch algorithm.
Term
HIV
Definition
A virus which evolves rapidly. Every patient has a genetically diverse population of viruses. In Libya, there was a case where over 400 children became infected with HIV, and a few people who were blamed were sentenced to death. However, it was later found to have been an accident.
Term
Internal node
Definition
A component of a tree. Represents ancestors.
Term
Jukes and Cantor (JC)
Definition

A simple correction for hidden substitutions (back and multiple). Assumes all nucleotides have equal frequency, and equal rates of substitution between them. The deviation of dxy from D increases as D approaches 0.75, where dxy is the corrected distance, and D is the proportion of nucleotides that have a difference.

D = 1 - (a + f + k + p)

dxy = - (3/4) ln (1 - (4/3)D)
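
The correction can be evaluated directly from the formula above. A short sketch, with the function name chosen for this example:

```python
import math


def jukes_cantor(D):
    """Corrected distance dxy from the observed proportion of differing sites D."""
    if not 0 <= D < 0.75:
        raise ValueError("D must satisfy 0 <= D < 0.75")
    return -0.75 * math.log(1 - (4 / 3) * D)


# The gap between dxy and D widens as D moves towards 0.75.
for D in (0.05, 0.3, 0.7):
    print(D, round(jukes_cantor(D), 3))
```

The formula is undefined at D = 0.75 and beyond, which is why the sketch rejects those values.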

Term
KEGG
Definition
A database of chemical pathways.
Term
Kimura 2 (K2P)
Definition

A two-parameter model that corrects for different transition and transversion rates. Assumes equal nucleotide frequencies. If transitions and transversions occur at the expected ratio of 1:2, it produces the same calculation as Jukes and Cantor.

dxy = (1/2) ln (1 / (1 - 2P - Q)) + (1/4) ln (1 / (1 - 2Q))

P = frequency of transitions = c + h + i + n

Q = frequency of transversions = b + d + e + g + j + l + m + o
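
A short sketch of the calculation, taking the transition frequency as P and the transversion frequency as Q. The function names are chosen for this example, and a Jukes and Cantor helper is included only to check the 1:2 case described above.

```python
import math


def kimura_2p(P, Q):
    """K2P distance from transition frequency P and transversion frequency Q."""
    first = 1 - 2 * P - Q
    second = 1 - 2 * Q
    if first <= 0 or second <= 0:
        raise ValueError("P and Q are too large for the correction to be defined")
    return 0.5 * math.log(1 / first) + 0.25 * math.log(1 / second)


def jukes_cantor(D):
    """Jukes and Cantor distance, used here only for comparison."""
    return -0.75 * math.log(1 - (4 / 3) * D)


# Transitions : transversions = 1 : 2, so both models should agree.
k2p_equal_rates = kimura_2p(0.1, 0.2)
jc_same_data = jukes_cantor(0.3)

# Same total difference, but transitions in excess: K2P gives a larger distance.
k2p_transition_rich = kimura_2p(0.2, 0.1)
```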

Term
Local alignment
Definition
Alignment of subsequences. Shared domains between otherwise divergent proteins.
Term
Lopez-Valenzuela et al, 2012
Definition

"An ancestral miRNA-1304 allele present in Neanderthals regulates genes involved in enamel formation and could explain dental differences with modern humans."

Noted that the miR-1304 seed region is the site important for identifying transcript targets, and it differs between Neanderthals and humans. Which species has the derived allele? What are target genes of the human and Neanderthal miRNAs? Are genes targeted by human miRNA associated with a specific process? Examined known DNA and a new sequence from a 49,000 year old sample, to ensure that amelotin and enamelin genes have miRNA target sites in Neanderthals.

Term
Markov chain
Definition

A model which uses transition probabilities. Assumes the probability of a residue depends on the previous residue. Not intrinsically obvious how to use it. Helps obtain optimal alignments. The number of possible alignments can be astronomically high. It is impossible to evaluate them all for the best using a BLOSUM matrix and gap penalties. Instead, captures the best alignment with fewer calculations. Calculates transition probabilities between different sequence residues. Works in logs, and sums log P values.

$$P(x) = q_{x_1} \prod_{i=2}^{N} r_{x_{i-1},x_i}$$

$$\log P(x) = \log q_{x_1} + \sum_{i=2}^{N} \log r_{x_{i-1},x_i}$$

x = sequence

$q_{x_1}$ = initial probability of the first residue

$r_{x_{i-1},x_i}$ = transition probability from $x_{i-1}$ to $x_i$. Depends on the previous state

N = length of alignment x.
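
The log form above can be sketched as a running sum. The dictionaries `q` (initial probabilities) and `r` (transition probabilities, indexed as `r[previous][current]`) and all of their values are hypothetical numbers chosen for this example.

```python
import math


def markov_log_prob(x, q, r):
    """log P(x): log of the initial probability plus the summed log transitions."""
    log_p = math.log(q[x[0]])
    for previous, current in zip(x, x[1:]):
        log_p += math.log(r[previous][current])
    return log_p


# Hypothetical parameters: uniform start, residues tend to repeat.
q = {n: 0.25 for n in "ACGT"}
r = {a: {b: (0.4 if a == b else 0.2) for b in "ACGT"} for a in "ACGT"}

# P("AAC") = 0.25 * 0.4 * 0.2
log_p = markov_log_prob("AAC", q, r)
```

Summing logs avoids the numerical underflow that multiplying many small probabilities would cause on long sequences.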

Term
MATRAS
Definition

MArkov TRAnsition of protein Structure evolution

A program designed to compare 3D protein structures. Unique because of its structural similarity score and various structure comparison methods. Allows researchers to infer functional properties of recently discovered proteins, classify protein structures, and view if certain protein structures were conserved in evolution.

Term
Maximum likelihood
Definition

A criterion used in the optimization method of tree-making. The tree with the highest likelihood is favoured. Assumes that substitutions are rare. Counts each change based on its probability. The goal is to obtain the phylogenetic tree, out of all possible trees, that would be the most likely to produce the observed sequences. Likelihood of a tree corresponds to the probability that a proposed evolutionary process and hypothesized history would give rise to the observed data.

P(data | model)

Term
MCM proteins
Definition
Bind to chromatin and possess ATP-dependent DNA helicase activity.
Term
Metzker et al, 2002
Definition

"Molecular evidence of HIV-1 transmission in a criminal case"

Richard Schmidt was accused of infecting his ex-girlfriend with HIV that had been drawn from a patient under his care. He had a sample of HIV-positive blood in his fridge. The ex-girlfriend later tested positive for HIV. The HIV reverse transcriptase sequences of the victim and of Schmidt's patient were determined. A phylogenetic tree of the sequences showed that the patient's sequences formed a paraphyletic group. The patient's and victim's sequences together were monophyletic. The patient's sequences were closer to the victim's than to other Louisiana HIV sequences. This supports the conclusion that Schmidt is guilty of attempted murder.

Term
miRNA-1304
Definition
A genomic sequence that is conserved among primates, except humans. In Homo neanderthalensis, Pan troglodytes, Gorilla gorilla, Pongo pygmaeus, and Macaca mulatta, the SNP is a G, but in humans it is an A. The Neanderthal allele is ancestral, and humans have the derived sequence; this supports the idea that the miRNA is important for human evolution. Functions of genes putatively targeted by ancestral and derived miR-1304 were investigated, and sorted into groups based on biological processes. Targets of derived miR-1304 are enriched for a set of diseases, and a number of processes. The SNP occurs in the seed region of the miRNA. Involved in neurological and cellular functions. Affects genes for amelotin and enamelin, which are for tooth development. Determined if predicted target genes are enriched in certain biological functions and molecular networks. Classified genes with similar biochemical functions and genetics that are involved in the same biological processes.
Term
Molecular function ontology
Definition
Reflects categories of gene product functions. A gene product may be associated with more than one ontology term. A chemical process.
Term
Multiple substitution
Definition
A substitution in which two changes are counted as one. Can alter measurements of distance. There are two changes in one taxon, producing one difference.
Term
Neanderthal
Definition

Homo neanderthalensis

Diverged from humans 100,000 years ago. Became extinct 30,000 years ago. Had larger skulls than humans, but thinner tooth enamel.

Term
Needleman-Wunsch algorithm
Definition
Aligns sequences from end to end, globally. Requires N x N calculations at most, where N is the sequence length. Guarantees the best alignment given the scoring system. Based on a Markov chain, and uses an F matrix. Normally the answer is expressed as its log, because it is easier to deal with.
Term
Neighbor joining (NJ)
Definition
A method of using a distance matrix in algorithmic tree-making. Assumes additivity. The distance between taxa is represented by the sum of branches. Must agree with calculated distances in the matrix. Distances represented by trees will equal distances in the matrix, if assumptions are true. A popular algorithmic clustering technique. Distances do not have to be ultrametric to guarantee the correct tree, but they do have to be additive. For every set of four terminal nodes, two of the three pairwise sums of distances must be equal and at least as large as the third sum. Distance from any internal node to descendants may differ. Allows for unequal rates of evolution along lineages, which can occur in nature. Distances are more likely to be additive than ultrametric, so always use NJ over UPGMA, if it is available.
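
The additivity requirement can be checked with the four-point condition: for every set of four taxa, the two largest of the three pairwise distance sums must be equal. The sketch below only tests that condition and does not build a tree; the function names and the example distances are assumptions for this illustration.

```python
from itertools import combinations


def symmetric(pairs):
    """Expand {(a, b): distance} so both (a, b) and (b, a) can be looked up."""
    table = dict(pairs)
    table.update({(b, a): v for (a, b), v in pairs.items()})
    return table


def is_additive(taxa, d, tol=1e-9):
    """Four-point condition on every set of four taxa."""
    for a, b, c, e in combinations(taxa, 4):
        sums = sorted([d[a, b] + d[c, e],
                       d[a, c] + d[b, e],
                       d[a, e] + d[b, c]])
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True


# Distances read off a tree ((A:1, B:2):5, (C:3, D:4)) with unequal rates.
tree_like = symmetric({("A", "B"): 3, ("A", "C"): 9, ("A", "D"): 10,
                       ("B", "C"): 10, ("B", "D"): 11, ("C", "D"): 7})
# The same matrix with one distance perturbed no longer fits any tree.
perturbed = dict(tree_like)
perturbed["A", "D"] = perturbed["D", "A"] = 12
```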
Term
Non-synonymous substitution
Definition
Substitutions which change the protein amino acid sequence. Five times less common than synonymous substitutions. Produce evolutionary distances five times shorter.
Term
Nucleotide sequence data is unordered
Definition
An assumption made in sequence alignment. Multistate characters may be ordered or unordered. Nucleotide sequence data can change from one state to any other state, with no intermediates between states.
Term
Odds
Definition
The probability of an outcome, divided by the probability of that outcome not happening.
Term
Ontology
Definition
The concepts and relationships that can exist for an agent or community of agents.
Term
Ontology.org
Definition
A database of gene functions.
Term
Optimization method
Definition
A method of making a tree and evaluating a phylogeny. Optimality criteria are defined. First a tree is produced, and then it is evaluated based on certain criteria. All possible trees are evaluated and the best one is used. One tests a range of hypotheses, or different phylogenies against optimality criteria. The number of possible trees can be quite high. It can be unrealistic to use. Huge amounts of literature. Optimality criteria include parsimony, maximum likelihood, distance methods, and probabilistic methods. The preferred method if there is time and computational power to use it. One can compare alternative phylogenies to one another.
Term
P value
Definition
The probability of observing outcomes at least as extreme as the data, given that the null hypothesis is true.
Term
Parallel substitution
Definition
There are two changes at the same site, one in each taxon, to the same character, producing no difference. It masks changes. Common in genomes that have certain nucleotides at high frequency.
Term
Paraphyletic group
Definition
Arises when a group or classification with a common ancestor excludes a derived lineage. Reptiles are an example.
Term
Parsimony
Definition
A criterion used by the optimization method of tree-making. The tree which requires the smallest number of substitutions is the best. Counts each change equally. A simple explanation is better than a complex one. Ad hoc hypotheses should be avoided. The simplest explanation for shared attributes among taxa is that they have inherited these attributes from a common ancestor. Assuming homoplasy, where a character is shared between species but is not present in their common ancestor, is usually necessary.
Term
Phylogenetics
Definition
The study of the evolutionary relationships between different entities.
Term
Polyphyletic group
Definition
Arises when two unrelated entities are classified as part of the same group. New- and old-world vultures are an example. Often formed because of convergent evolution.
Term
Probabilistic model
Definition

Assumes independence of each site

$$P(x \mid M) = \prod_{i=1}^{n} q_{x_i}$$

$$\log P(x \mid M) = \sum_{i=1}^{n} \log q_{x_i}$$

x = sequence

$q_{x_i}$ = probability of the nucleotide at position i
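
Under site independence the log probability is a sum of per-site log probabilities. A minimal sketch, where the dictionary `q` of nucleotide probabilities is a set of hypothetical values for a GC-rich model:

```python
import math


def log_prob_independent(x, q):
    """Sum of per-site log probabilities; assumes every site is independent."""
    return sum(math.log(q[residue]) for residue in x)


# Hypothetical nucleotide probabilities for a GC-rich model M.
q = {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}

# P("ACGT" | M) = 0.1 * 0.4 * 0.4 * 0.1
log_p = log_prob_independent("ACGT", q)
```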

Term
Reptiles
Definition
A paraphyletic group. Includes crocodiles, lizards, and turtles, but excludes birds which share a recent common ancestor.
Term
RNA sequencing
Definition
Experiments may compare gene expression between a treatment and a control. All genes in a certain pathway may be upregulated together. Genes with functional similarity are co-expressed.
Term
Polytomy
Definition
When more than three branches connect to an internal node.
Term
Root
Definition
A component of a tree. The ancestor of all sequences that comprise the tree. Most trees generated are unrooted.
Term
Seed region
Definition
Defines the specificity of an miRNA. Changes in this region are important.
Term
Single substitution
Definition
There is one change in one taxon, producing one difference.
Term
Smith-Waterman algorithm
Definition

Identifies the optimal local alignment between two sequences. Gives the best subsequence alignments. Aligns local motifs that are shared. Can identify shared domains between otherwise divergent proteins. Traceback begins from any point in the matrix, not just in the cell within the final row and final column. Ends when a 0 is encountered. Doesn't align the whole protein together. Similar to the Needleman-Wunsch algorithm, except there is a fourth possibility, 0.

F(i,j) = max {F(i - 1,j - 1) + S(xi,yj); F(i - 1,j) - 8; F(i,j - 1) - 8; 0}
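
A minimal sketch of the local recurrence with the extra 0 option. As assumptions for this example, a simple match/mismatch score (+5 / -4) replaces a BLOSUM table, the gap penalty is -8, and the traceback starts at the highest-scoring cell and stops at the first 0.

```python
GAP = -8


def score(a, b):
    # Assumption: simple match/mismatch scores instead of a BLOSUM table.
    return 5 if a == b else -4


def smith_waterman(x, y):
    """Return (best score, aligned x segment, aligned y segment)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(0,
                          F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                          F[i - 1][j] + GAP,
                          F[i][j - 1] + GAP)
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)
    # Traceback from the best cell until a 0 is reached.
    ax, ay = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] + GAP:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


# Two otherwise different sequences sharing the motif GATTACA.
result = smith_waterman("CCCGATTACATTT", "GGGATTACAGG")
```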

Term
Substitutions
Definition
Positions in sequences have differences in frequency of substitutions. Protein motifs and the first two positions in a codon evolve more slowly than other regions. If one treats all sites equally, there is underestimation in the number of substitutions because some sites may not change while others may change frequently. Synonymous and non-synonymous substitutions should be examined separately. Includes back, multiple, single, coincidental, parallel, and convergent substitutions. Includes transitions and transversions.
Term
Synonymous substitution
Definition
A substitution that causes no change in protein amino acid sequence. Five times more common than non-synonymous substitutions. Produces evolutionary distances five times greater.
Term
T-Coffee
Definition

Tree-based Consistency Objective Function For alignment Evaluation

A multiple alignment software that uses global and local alignment software inputs to output an alignment and a phylogenetic tree. Algorithm involves weighting two libraries by building a primary library, then building an extended library where alignments in the primary library are compared to other sequences. Extended alignment is the step that contributes the most to accuracy. Very accurate, but requires a lot of computational time to run the program. Can be used to discover novel proteins via homology, and align known proteins and genomes to create phylogenetic trees.

Term
TargetRank
Definition
An miRNA target searching tool that can require perfect identity between an miRNA seed region and the target. Evaluates other attributes as well. Found 35 ancestral miR-1304 target genes, and 515 derived human targets.
Term
TargetScan
Definition
An miRNA target searching tool that can require perfect identity between an miRNA seed region and the target. Evaluates other attributes as well. Found 4 ancestral miR-1304 target genes, and 140 derived human targets.
Term
Terminal node
Definition

Leaf

Terminal taxa

A component of a tree. Represents sequences or organisms for which there are data. There is one possible tree with three terminal nodes, three possible trees with four, 15 possible with five, about 2 million possible with 10, and 2.2 x 10^20 possible with 20. The number of possible unrooted trees with n terminal nodes is:

$$B(n) = \prod_{i=3}^{n} (2i - 5)$$
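
The product can be evaluated directly to see how quickly the number of trees grows. A short sketch, with the function name chosen for this example:

```python
def num_unrooted_trees(n):
    """B(n): product of (2i - 5) for i = 3..n, the count of unrooted binary trees."""
    if n < 3:
        raise ValueError("need at least three terminal nodes")
    result = 1
    for i in range(3, n + 1):
        result *= 2 * i - 5
    return result


for n in (3, 4, 5, 10, 20):
    print(n, num_unrooted_trees(n))
```

Python integers do not overflow, so the exact count is returned even for large n.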

Term
Tooth enamel
Definition
A major difference between humans and Neanderthals. Neanderthals had thinner enamel. The ancestral miR-1304 allele targets the genes that encode amelotin and enamelin. Examined a 49,000 year old Neanderthal sample to ensure genes have miRNA target sites.
Term
Topology
Definition
A tree with a given labelling.
Term
Traceback
Definition
Arrows in an F matrix which keep track of the starting point used to calculate the best score. If it points up, add a Y residue and a gap. If it points left, add an X residue and a gap. If it points diagonally, add both residues. The pointer may point from one box to more than one other box; there may be more than one alignment with the same optimal score. Pick either one of them.
Term
Transition
Definition
Substitutions of A ↔ G and C ↔ T. Changes from purine to purine, or pyrimidine to pyrimidine. Occur more than expected. Transitions that occurred in the past are more likely to be hidden than transversions. Increase the distance estimate relative to transversions, given that there are more potential historical substitutions that mask observed differences.
Term
Transversion
Definition
Substitutions of A ↔ C, A ↔ T, C ↔ G, and G ↔ T. Changes from purine to pyrimidine or pyrimidine to purine. Occurs less than expected.
Term
Transversions: transitions ratio
Definition
There are two possible transversions and one possible transition at every nucleotide, so a ratio of 2:1 is expected. However, the ratio is often around 1:2.
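
The expected 2:1 ratio follows from counting the possible substitutions. A small sketch that classifies each change and tallies all twelve ordered substitutions; the function names are chosen for this example.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(a, b):
    """Classify a change between two different nucleotides."""
    if a == b or not {a, b} <= (PURINES | PYRIMIDINES):
        raise ValueError("expected two different nucleotides from A, C, G, T")
    pair = {a, b}
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"      # purine <-> purine or pyrimidine <-> pyrimidine
    return "transversion"        # purine <-> pyrimidine


def expected_counts():
    """Count every possible ordered substitution by type."""
    counts = {"transition": 0, "transversion": 0}
    for a in "ACGT":
        for b in "ACGT":
            if a != b:
                counts[substitution_type(a, b)] += 1
    return counts


counts = expected_counts()
```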
Term
Tree
Definition
A mathematical structure used to model actual evolutionary history of groups of sequences and organisms. Consists of nodes, including terminal and internal nodes, connected by branches, and a root. Axes can be rotated with no impact to topology, like a mobile. Trees can appear very different, yet represent the same data. There are two methods of making them: algorithmic and optimization methods.
Term
Unrooted tree
Definition
A tree with no root; the earliest point in time is unknown. There are (2n - 2) total nodes and (2n - 3) edges in an unrooted tree, where n is the number of species or molecules.
Term
UPGMA
Definition
A method of using a distance matrix in algorithmic tree-making. Formally appropriate only if the distance data are ultrametric, so that for any three taxa, two of the three distances are equal and at least as large as the third. Assumes there is a molecular clock, and that all sequences have diverged at an equal rate from their common ancestor. Length is viewed as evolutionary time. This assumption is almost never true with real data. Produces incorrect trees at some level.
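
Whether a distance matrix meets the ultrametric condition can be checked with the three-point test described above. The sketch below only tests the condition and does not build the UPGMA tree; the function names and example distances are assumptions for this illustration.

```python
from itertools import combinations


def symmetric(pairs):
    """Expand {(a, b): distance} so both (a, b) and (b, a) can be looked up."""
    table = dict(pairs)
    table.update({(b, a): v for (a, b), v in pairs.items()})
    return table


def is_ultrametric(taxa, d, tol=1e-9):
    """For every three taxa, the two largest distances must be equal."""
    for a, b, c in combinations(taxa, 3):
        distances = sorted([d[a, b], d[a, c], d[b, c]])
        if abs(distances[2] - distances[1]) > tol:
            return False
    return True


# Clock-like data: A and B are equally distant from C.
clock_like = symmetric({("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6})
# Unequal rates: B has changed more than A since their split.
unequal_rates = symmetric({("A", "B"): 3, ("A", "C"): 9, ("B", "C"): 10})
```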
Term
Vicariance biogeography
Definition
An evolutionary theory concerned with fragmentation of the environment. A major factor promoting biological evolution.
Term
Vultures
Definition
New- and old-world vultures have different origins and evolutionary histories, but both are considered to be vultures; this is a polyphyletic group.
Term
Clustal
Definition

A global multiple sequence alignment software. Can be customized by assigning penalties for gaps and mismatched nucleotides. Capable of multiple sequence alignments, phylogenies, alignment visualizations, and more.

1. Pairwise alignments done between sequences

2. Similarity matrix used for building guide trees to organize alignments

3. Combines and aligns sequences following branch order in the similarity tree

Term
Parsimony score
Definition
The sum of parsimony scores for each position in a sequence.
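
One standard way to obtain the per-position score is Fitch's algorithm, which the card does not name; it is used here as an assumption. The sketch takes a rooted binary tree written as nested tuples of taxon names, counts the minimum number of changes at each position, and sums them. The taxa and sequences are hypothetical.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, minimum changes) for one position."""
    if isinstance(tree, str):
        return {states[tree]}, 0            # leaf: its observed state, no changes
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # No shared state: one substitution is required at this node.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum of per-position parsimony scores over the whole alignment."""
    length = len(next(iter(alignment.values())))
    total = 0
    for i in range(length):
        site_states = {taxon: seq[i] for taxon, seq in alignment.items()}
        total += fitch_site(tree, site_states)[1]
    return total


alignment = {"A": "AAG", "B": "AAG", "C": "ACG", "D": "ACT"}
score_ab_cd = parsimony_score((("A", "B"), ("C", "D")), alignment)
score_ac_bd = parsimony_score((("A", "C"), ("B", "D")), alignment)
```

The tree with the lower total is the one preferred under the parsimony criterion.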