Shared Flashcard Set

Details

Applied Bioinformatics: Quiz 2
University of Guelph BIOL*3300
136
Biology
Undergraduate 3
03/18/2017

Cards

Term
-a
Definition

Anchor length

In TopHat, the minimum number of bases that need to be on either side of a junction for it to be reported: 8.

Term
A and T rich repeated sequence model
Definition

The probability of an A is 0.4, T is 0.4, C is 0.1, and G is 0.1. Relative to the random sequence model, we can score A or T as 0.4/0.25 = 1.6, and G or C as 0.1/0.25 = 0.4. Each residue is scored individually.

$P(\text{"AGCTAT"} \mid M) = (0.1)^2 (0.4)^4 = 0.000256$

Term
Actin
Definition
A highly expressed gene.
Term
Adapter
Definition
A known DNA sequence.
Term
Alignment score
Definition
Quantifies how good an alignment is. Higher scores indicate better alignments. Calculated by subtracting penalties for differences between read and reference sequences. The best possible score is 0. The minimum score threshold for an alignment to be kept depends on the read length. If a read aligns to several locations, the alignment with the best score is kept.
Term
Alignment software
Definition

Aligner

Software which does the first step of RNA sequence analysis: it aligns reads. Takes unmapped reads and attempts to find a genomic position for each read. Some reads will align across introns, and some reads will be polymorphic to the genome, caused by indels and SNPs. For a DNA sequence 100 bp long, the probability that a random sequence of the 4 nucleotides will align exactly is $0.25^{100} \approx 6.22 \times 10^{-61}$; a very small probability. One sequence may align to several sites due to gene duplications and translocations. A good alignment has very similar or identical sequences. If you are comparing an RNA sequence to the sequenced genome of the same organism, there should be a perfect alignment. If you are comparing to a reference genome, there will be around 1% of natural variation in humans. If there are too many SNPs and indels, the read won't align; this judgement is made by the software. Millions of sequenced DNA fragments are aligned, typically to a reference genome of the same species.
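
The random-match figure on this card can be reproduced with a short calculation. This is a minimal sketch, assuming each of the four nucleotides is equally likely at every position; the function name is illustrative and does not come from any aligner.

```python
def random_match_probability(length: int, p_base: float = 0.25) -> float:
    """Probability that a random sequence matches one given site at every position.

    Assumes positions are independent and each base has probability p_base.
    """
    return p_base ** length


# A 100 bp read against one fixed 100 bp site.
print(f"{random_match_probability(100):.2e}")
```

Because the probability shrinks by a factor of four with every added base, longer reads are far less likely to match a site by chance.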

Term
Alpha diversity
Definition
Diversity within one sample. The number of different types of microbes in a single sample.
Term
Amino acid frequency
Definition
In order to calculate frequency of amino acid alignments among related proteins, one would need a large, random sample of confirmed alignments to estimate probabilities. Small samples would not accurately find the true frequency of alignments. It should be a set of aligned proteins that are similar to the set of known proteins; some protein families are more highly represented than others within the dataset of known alignments. Confirmation is difficult. Should sample proteins with common ancestors. Some amino acids diverge with different frequencies. Some residues are interchangeable due to size, shape, and charge.
Term
Amino acid sequence alignment
Definition

Derivatives of score ratios have been calculated for use in the context of sequence alignment of amino acids. The ratio gives the probability of two amino acid residues being aligned in two homologous sequences (pH) divided by the probability of two amino acids being aligned in two unrelated sequences (qR). Given a known set of homologous sequences, we can calculate pH. Given the frequency of amino acids, we can calculate qR. We can score each amino acid pair.

$S = \log(p_H / q_R)$

Term
Anchor length
Definition
A parameter of RNA sequence aligning. The minimum length of a divided sequence, over an interlocus. Typically 8 base pairs.
Term
Anchor mismatch
Definition
A parameter of RNA sequence aligning.
Term
Annotation file
Definition

.gff file

Optional to use with alignment software. Tells you what the genes are.

Term
Barcoded library
Definition
For multiplexing of multiple fragments in one machine. Consists of a DNA fragment, internal adapter, barcode, and two adapters, one on each end. Barcodes are used to identify samples.
Term
Beta diversity
Definition
Diversity between more than one sample. Compares alpha diversity of samples.
Term
Biological replicate
Definition
Multiple microarray hybridizations of different biological samples that have been subjected to the same treatment. Source of most variation is biological. Has more variation, and is more meaningful than technical replicates.
Term
BLAST
Definition
A more heuristic alignment algorithm. Used to quickly identify good alignments of query sequences against target sequences.
Term
BLOSUM matrix
Definition
Scores of amino acid alignments. Derived from ungapped alignments of conserved, short protein regions from known homologous proteins found in the BLOCKS database. Given one protein has one amino acid, what is the frequency of amino acids in the other protein? Assigns alignment scores based on known alignments. Corrected and normalized. A large set of diverse proteins is used to estimate values; often high-abundance proteins are used, such as globulin.
Term
BowTie2 aligner
Definition
Alignment software. Aligns short reads without splice junctions. Alignments are stored as BAM or SAM files. Those that don't map are saved in a separate file for TopHat. Has different penalties for different mismatches when calculating alignment score: base/base mismatch is -6, base/ambiguous character (N) is -1, indel for opening the gap is -5, and extending the gap is -3. The best possible score is 0, and the software sets a minimum acceptable score to accept alignments. Mixed mode allows alignment of a single mate, throwing out one of the pairs, keeping the other. Designed to align best with the human genome.
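The penalties on this card can be turned into a small scoring function. This is a sketch, not Bowtie 2's own code: the function names are hypothetical, and two details are assumptions taken from the Bowtie 2 manual's end-to-end defaults rather than from this card: a gap of length N costs the opening penalty plus N times the extension penalty, and the minimum acceptable score is -0.6 + -0.6 x read length.

```python
# Penalties as listed on the card.
MISMATCH = 6    # base/base mismatch
N_PENALTY = 1   # base aligned to an ambiguous character (N)
GAP_OPEN = 5    # opening a gap
GAP_EXTEND = 3  # each position in the gap


def alignment_score(mismatches=0, n_positions=0, gap_lengths=()):
    """Alignment score: 0 for a perfect match, more negative for each difference."""
    score = 0
    score -= MISMATCH * mismatches
    score -= N_PENALTY * n_positions
    for length in gap_lengths:
        # Assumption: a gap of length N costs GAP_OPEN + N * GAP_EXTEND.
        score -= GAP_OPEN + GAP_EXTEND * length
    return score


def min_score(read_length, intercept=-0.6, slope=-0.6):
    """Assumed default threshold: a linear function of read length."""
    return intercept + slope * read_length


# A 100 bp read with two mismatches and one 1-base gap.
score = alignment_score(mismatches=2, gap_lengths=(1,))
print(score, min_score(100), score >= min_score(100))
```

Under these assumptions the example read scores -20 against a threshold of about -60.6, so the alignment would be kept.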
Term
C and G rich repeated sequence model
Definition

The probability of an A is 0.1, T is 0.1, C is 0.4, and G is 0.4. Used for thermophilic bacteria, where CG is enriched.

$P(\text{"AGCTAT"} \mid M) = (0.1)^4 (0.4)^2 = 0.000016$

Term
CentiMorgan (cM)
Definition
The percentage recombination frequency per generation between two loci. Measures genetic map distance. Maximal distance is 50 cM, because recombination frequency cannot be higher than 50%; they are independently assorting. Markers less than 50 cM apart are assumed to be physically linked.
Term
Chimera
Definition
A mixture of two different miRNAs, in different segments. Can be many combinations. Only down-regulates mRNA with homology to the 5' sequence.
Term
ChIP-sequence
Definition

Chromatin ImmunoPrecipitation sequencing

A method of sequence analysis that looks for associations between DNA-binding proteins and the DNA sequences they bind to. This method is often used to analyze the effects of transcription factors on specific regions of DNA. This offers the exact location and sequence to which the regulatory factor binds, which can then be used for further analysis of these regulatory factors and their effects on gene expression. Combines chromatin immunoprecipitation with high-throughput DNA sequencing.

1. Cross-linking with formaldehyde

2. Shearing via cell disruption and sonication or enzymatic digestion

3. Immunoprecipitation and enrichment

4. Sequencing

Term
CIGAR string
Definition
Information found in a SAM file. Tells you where mismatches and gaps are found in an alignment.
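A CIGAR string is a run of length/operation pairs, so it can be split with a regular expression. The sketch below uses the operation letters from the SAM specification (M, I, D, N, S, H, P, =, X); the function names are illustrative. Note that in the specification M covers both matches and mismatches, so mismatches are only explicit when = and X are used.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    if cigar == "*":  # SAM uses "*" when no CIGAR is available
        return []
    pairs = CIGAR_OP.findall(cigar)
    if "".join(n + op for n, op in pairs) != cigar:
        raise ValueError(f"not a valid CIGAR string: {cigar!r}")
    return [(int(n), op) for n, op in pairs]


def reference_span(cigar):
    """Reference bases covered by the alignment (M, D, N, = and X consume reference)."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")


# A spliced read: 50 aligned bases, a 1000 bp intron (N), 50 more aligned bases.
print(parse_cigar("50M1000N50M"))     # [(50, 'M'), (1000, 'N'), (50, 'M')]
print(reference_span("50M1000N50M"))  # 1100
```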
Term
CLUSTAL
Definition
A popular multiple sequence aligner that can be run as an application on a local machine.
Term
Concordant alignment
Definition
A paired end read that meets expectations for orientation and distance. Includes separate, overlap, and contain alignments which can happen with trimming.
Term
Convergent read
Definition
A paired end read where each end is nearby in the genome. Can help with alignment.
Term
Count tracking file
Definition
An output file of Cuffdiff. Mean estimated counts for replicates.
Term
Cuffdiff
Definition

Software which estimates gene expression and expression differences. Estimates gene and gene isoform expression, and does a pairwise comparison between samples to find significant changes in transcript abundance, splicing, and promoter use. Output files include FPKM tracking files, count tracking files, and read group tracking files. Differential splicing and promoter use is only calculated if the number of replicates is sufficient. Based on FPKM, compares the log-ratio of the gene's expression in two conditions against 0.

$Y = \log(\mathrm{FPKM}_a / \mathrm{FPKM}_b)$

Term
Cufflinks
Definition

Abundance output

Reports abundance as FPKM. Normalizes for transcript length and library size.

Term
Cysteine (C)
Definition
Tends to be conserved. Changes into different amino acids infrequently. Maintained at a high frequency between homologs.
Term
Cytosine
Definition

In eukaryotes it is often methylated in a CG context, except around genes. Methylation that occurs outside promoter regions is mutagenic, changing C into T. Using these regions, we can calculate transition frequencies. Different models can be made and compared by calculating and comparing P values for the same sequence. One can plot windows on the X axis and probabilities on the Y axis to visualise results.

$r_{ab} = P(x_i = b \mid x_{i-1} = a) = n_{ab} / n_a$
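
The transition frequency formula can be estimated directly from a sequence by counting adjacent pairs. This is a minimal sketch with an illustrative function name; it assumes that the count of a includes only positions that have a following base, so that the frequencies out of each base sum to 1.

```python
from collections import Counter


def transition_frequencies(sequence):
    """Estimate r_ab = n_ab / n_a from adjacent pairs in one sequence."""
    pair_counts = Counter(zip(sequence, sequence[1:]))  # n_ab
    first_counts = Counter(sequence[:-1])               # n_a (bases with a successor)
    return {
        (a, b): n_ab / first_counts[a]
        for (a, b), n_ab in pair_counts.items()
    }


# Toy sequence; real estimates need far more data.
for (a, b), r in sorted(transition_frequencies("ACGCGT").items()):
    print(f"r[{a}->{b}] = {r:.2f}")
```

Computing these frequencies in sliding windows along a genome gives the per-window values that the card suggests plotting.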

Term
Darwin's finches
Definition
It is proven that different beak shapes have different capacity for obtaining different foods. There are pointed beak and blunt beak populations. Birds from both populations were sequenced, and the allele frequency differences were looked for: areas in the genome with high FST.
Term
Degrees of freedom
Definition

n - 1

n = number of replicates

Term
DICER
Definition
A dsRNA processing enzyme that produces sRNA.
Term
Discordant alignment
Definition
Paired end read for which each mate has a unique alignment and expectations for orientation and distance are not met. Interesting for structural variation detection. Searching for discordant alignments can be turned off.
Term
Divergent read
Definition
A paired end read where each end is far apart in the genome, even on different chromosomes.
Term
DNA library
Definition
It is constructed by isolating RNA or DNA (for RNA convert into cDNA), which is then fragmented and undergoes adapter ligation and size selection. Fragments from a heterogeneous genome are processed together. Includes fragment, barcoded, and mate pair libraries.
Term
Dotmatcher
Definition
A dot-plot program which is like dottup, but instead of requiring exact matches it uses a scoring matrix to score word pairs, and a threshold score to determine if a dot should be drawn or not.
Term
Dottup
Definition
A dot-plot program which draws a dot when words match exactly between two sequences.
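The exact-word matching that this card describes can be sketched in a few lines. This is not the EMBOSS implementation; the function name and the default word size of 3 are assumptions for illustration.

```python
def dot_plot(seq_a, seq_b, word_size=3):
    """Return (i, j) pairs where the words starting at seq_a[i] and seq_b[j] match exactly."""
    # Index every word of seq_b by its start position.
    index = {}
    for j in range(len(seq_b) - word_size + 1):
        index.setdefault(seq_b[j:j + word_size], []).append(j)

    dots = []
    for i in range(len(seq_a) - word_size + 1):
        for j in index.get(seq_a[i:i + word_size], []):
            dots.append((i, j))
    return dots


print(dot_plot("ACGTAC", "CGTA"))  # [(1, 0), (2, 1)]
```

Runs of dots along a diagonal, such as (1, 0) followed by (2, 1), mark a shared region between the two sequences.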
Term
Endogenous hairpin genes
Definition
Exogenous sources of dsRNA, from bacteria or viruses. miRNA genes that are transcribed and processed by Dicer-mediated dsRNA processing. sRNAs guide RNA-induced silencing complex (RISC), implicating mRNA destruction and translational repression.
Term
Entropy (H)
Definition

A measure of uncertainty and randomness. If you are certain what the next nucleotide will be, entropy is low. Identified with the Shannon entropy score. Maximized when the probability of all nucleotides is equal; the maximum value is 2 bits. A bit is the unit obtained when the logarithm is taken to base 2. Highly conserved sequences have low entropy.

$H(X) = -\sum_i p_{x_i} \log_2 p_{x_i}$

X = vector string of nucleotides

i = index for each site

pxi = probability of nucleotide xi.
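
The entropy formula can be checked numerically. This is a minimal sketch with an illustrative function name; it assumes the probabilities passed in sum to 1 and treats 0 x log 0 as 0.

```python
import math


def shannon_entropy(probabilities):
    """H = -sum(p * log2(p)) in bits; terms with p == 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0, the maximum for four nucleotides
print(shannon_entropy([0.4, 0.4, 0.1, 0.1]))      # lower: an A and T rich site
print(shannon_entropy([1.0, 0.0, 0.0, 0.0]))      # zero: a perfectly conserved site
```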

Term
Exon
Definition
A coding sequence and untranslated regions.
Term
Experiment
Definition
In a simple, standard experiment, one idea is tested; two factors are compared, often treatment and control, or two different traits.
Term
FASTA file
Definition
Gives name and sequence.
Term
FASTQ file
Definition
Gives name, sequence, and quality.
Term
Flag
Definition
Information found in a SAM file. Tells you if the read is a reverse complement of the reference genome, and gives information about mate-pairs for pair-end reads.
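The flag is a single integer whose bits each answer one yes/no question about the read. The sketch below decodes it using the bit values from the SAM specification; the function name and the wording of the descriptions are illustrative.

```python
# Bit values as defined in the SAM specification.
FLAG_BITS = {
    0x1: "read is paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read on reverse strand",
    0x20: "mate on reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "secondary alignment",
    0x200: "fails quality checks",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def decode_flag(flag):
    """List the properties whose bits are set in a SAM flag."""
    return [description for bit, description in FLAG_BITS.items() if flag & bit]


print(decode_flag(16))  # ['read on reverse strand']
print(decode_flag(99))  # paired, proper pair, mate on reverse strand, first in pair
```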
Term
Foldit
Definition
A multiplayer interactive game version of Rosetta. Uses human spatial intuition to solve protein structures. Players get points based on final protein stability. Successful in increasing the activity of Diels-Alderase enzyme. Requires experienced Foldit users to work, and a lot of time and people. The enzyme must be tested to confirm results.
Term
FPKM tracking file
Definition

An output file of Cuffdiff.

1. isoforms.fpkm_tracking. Transcript FPKMs.

2. genes.fpkm_tracking. Gene FPKMs, tracks the summed FPKM of transcripts sharing a gene_id.

3. cds.fpkm_tracking. Coding sequence FPKMs, tracks the summed FPKM of transcripts sharing a p_id.

4. tss_groups.fpkm_tracking. Primary transcript FPKMs, tracks the summed FPKM of transcripts sharing a tss_id.

Term
Fragment library
Definition
Consists of a DNA fragment with two adapters, one on each end.
Term
Fragments per kilobase of exon model per million (FPKM)
Definition

A measure of RNA abundance, controlling for length of exons in the sequence. Reported by Cufflinks. Used as a measure of gene expression. Attempts to normalize the number of fragments mapped to a gene by the length of the gene and the total number of mapped fragments in the library. Measures fragments: the two reads of a paired end fragment are counted together as one fragment, and a single end read counts as one. The larger the difference in FPKM between two genes, the more confident we can be that one gene is more highly expressed than the other.

$\mathrm{FPKM} = 10^9 \cdot C / (N \cdot L)$

C = number of read pairs from transcript

N = total number of mapped read pairs in the library

L = number of exon bases for transcript.
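
The FPKM formula translates directly into code. This is a minimal sketch; the function and argument names are illustrative, and the input numbers are made up for the example.

```python
def fpkm(fragments, total_mapped_fragments, exon_bases):
    """FPKM = 1e9 * C / (N * L).

    fragments              C: read pairs (fragments) from the transcript
    total_mapped_fragments N: mapped read pairs in the whole library
    exon_bases             L: number of exon bases in the transcript
    """
    return 1e9 * fragments / (total_mapped_fragments * exon_bases)


# Hypothetical transcript: 500 fragments, 2,000 exon bases, 20 million mapped fragments.
print(fpkm(500, 20_000_000, 2_000))  # 12.5
```

The factor of 10^9 combines the per-kilobase (10^3) and per-million (10^6) scalings.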

Term
FST
Definition

The frequency of heterozygotes expected if two populations are randomly mating, compared to the frequency of heterozygotes within each separate population. Maximum value is 1, indicating no shared alleles at all; the populations are completely different. Minimum value is 0, indicating no allelic differences; the populations are exactly the same. A value of 0.2 or greater is considered a meaningful amount of difference.

$F_{ST} = (H_T - H_S) / H_T$
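
The FST formula can be worked through for a single biallelic locus. This sketch makes assumptions that the card does not state: expected heterozygosity is taken as 2p(1 - p), the two populations are weighted equally, and the function names are illustrative.

```python
def heterozygosity(p):
    """Expected heterozygosity 2p(1 - p) at a biallelic locus with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def fst(p1, p2):
    """FST = (H_T - H_S) / H_T for two equally weighted populations."""
    h_s = (heterozygosity(p1) + heterozygosity(p2)) / 2.0  # mean within populations
    h_t = heterozygosity((p1 + p2) / 2.0)                  # pooled population
    if h_t == 0.0:
        return 0.0  # the same allele is fixed in both populations
    return (h_t - h_s) / h_t


print(fst(1.0, 0.0))  # populations fixed for different alleles
print(fst(0.3, 0.3))  # identical allele frequencies
print(fst(0.8, 0.2))  # strongly differentiated
```

The first two calls reproduce the extremes described on the card: 1 when no alleles are shared and 0 when the populations are the same.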

Term
Galaxy
Definition
An online bioinformatics toolbox.
Term
Gap score
Definition
When there is an amino acid deletion in an alignment. Usually estimated ad hoc, rather than using known alignments. Gaps are assigned an arbitrary penalty, usually -8.
Term
Gene isoform
Definition
Produced from the same locus, but differs in transcription start site (TSS), untranslated regions (UTR), or protein coding sequences (CDS).
Term
Genetic map
Definition
Shows the estimated order of genes, markers, and other DNA fragments on the chromosome. The closer together genes are placed, the smaller their genetic distance; they are linked. Genes, markers, and DNA fragments ordered on a map increase its usefulness.
Term
Genome reference
Definition
A mapping template. Consists of the genome sequence (fasta format), and the genome sequence annotation (gff format). Some species have their own online communities including yeastgenome.org and maizegdb.org. There are online repositories such as ensembl.org and NCBI.
Term
Genotyping-by-sequencing (GBS)
Definition

A bioinformatics pipeline.

1. Extract DNA

2. Digest with restriction enzymes

3. Sequence

4. Analyse

Term
Green fluorescent protein (GFP) reporter
Definition
An experiment done in the late 1990s showed that when C. elegans expressing GFP reporter feeds on bacteria containing GFP double-stranded RNA, the expression is lost in almost all cells. A mutant C. elegans defective for a gene that encodes siRNA, expressing GFP, does not have silencing of GFP when it consumes the GFP dsRNA. This silencing is involved in RNA interference; this was the first clue as to the function of sRNA.
Term
Homologous sequence
Definition
Sequences that have been derived from a common ancestor. In homologous proteins, human and rat sequences are quite similar, and the probability of shared amino acids is high. Some amino acids are unlikely to be substituted with another because a substitution changes protein function.
Term
-i
Definition
In TopHat, the minimum intron length: 70.
Term
Information content (I)
Definition

The reduction in entropy after some information is received. Represented by the total height of bases in a MEME output. First measured in communication cables that went across the ocean, which carry a lot of random noise; information content is the reduction in entropy once a signal has been received.

$I(X) = H_{before} - H_{after}$
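
Information content can be computed for one column of an alignment. This is a sketch under a stated assumption: the entropy before is taken as the maximum for the alphabet (2 bits for four equally likely nucleotides), and the entropy after is the observed entropy of the column. The function names are illustrative, and no small-sample correction is applied.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (bits) of the residues observed in one alignment column."""
    total = len(column)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(column).values()
    )


def information_content(column, alphabet_size=4):
    """I = H_before - H_after, with H_before assumed to be log2(alphabet_size)."""
    h_before = math.log2(alphabet_size)
    return h_before - column_entropy(column)


print(information_content("AAAA"))  # fully conserved column
print(information_content("AACC"))  # two residues, equally frequent
print(information_content("ACGT"))  # no conservation
```

A fully conserved column carries the full 2 bits, which is why conserved positions appear as the tallest stacks in a motif logo.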

Term
Insertion/deletion (INDEL)
Definition
A difference in the genome caused by insertion or deletion.
Term
Intron
Definition
A non-coding region of the immature RNA transcript. Distance between paired end reads will be longer than expected. It is hard to align a sequence which spans an intron; fragments are in different locations. GC-AG is a common splice site. Genes do not splice over areas that are not splice sites.
Term
Isoform
Definition
Has the same mRNA from a gene, but different splice variants.
Term
Junction
Definition
Reads which are aligned in two stages: whole reads, and segments of reads. If two segments of a read align a close distance away from each other, or a middle segment fails to align, TopHat identifies a junction. GT-AG is the most common, followed by GC-AG and AT-AC splice sites. Works best with longer reads, 100 bp, and a large amount of reads, over 10 million.
Term
Kolmogorov-Smirnov test
Definition
Tests the probability that two distributions arise from one underlying distribution.
Term
-l
Definition
In TopHat, the maximum intron length: 500,000.
Term
LASSO
Definition
Least absolute shrinkage and selection operator
Term
Library
Definition
A collection of reads from a sample.
Term
Lim et al, 2005
Definition

"Microarray analysis shows that some microRNAs downregulate large numbers of target mRNAs"

Research goals were to determine if a single miRNA can reduce the transcript level of multiple mRNA molecules in a tissue-specific fashion. Injected HeLa cells with miRNA and monitored gene expression, estimating mRNA abundance with a microarray. Non-transfected HeLa cells were used as a control. Are the genes that are downregulated representative of a certain tissue? Is gene downregulation due to direct miRNA binding to mRNAs? Focused on two human miRNAs: miRNA-1 and miRNA-124. Transcript levels were significant, p < 0.001. Concluded that miRNAs do affect transcript abundance. Obtained a matrix of gene transcript abundance for all genes, from 46 tissues, and ranked level of expression. Funded by a pharmaceutical company in hopes of developing an miRNA drug that can silence disease genes. Unfortunately, what they found was not specific enough.

Term
-m
Definition
In TopHat, the maximum number of mismatches allowed in the anchor region of a spliced alignment: 0.
Term
M
Definition

The log 10 ratio of raw signals of red and green fluorescence from each spot on a microarray. A high value is associated with low intensity, and may be spurious if the numbers are small. Used to determine if treatments have an effect. For each spot, the centre of the M ratio is given by its mean, and variability by its variance. If variance is high, the mean is not a good metric.

$M = \log_{10}(\mathrm{RED} / \mathrm{GREEN}) = \log_{10} \mathrm{RED} - \log_{10} \mathrm{GREEN} = \log_{10}(\mathrm{Cy5} / \mathrm{Cy3})$

Mean: $\bar{x} = (\sum_i x_i) / n$

Variance: $s^2 = \sum_i (x_i - \bar{x})^2 / (n - 1)$
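
The M ratio, its mean, and its variance can be computed for one spot across replicate arrays. This is a minimal sketch: the intensities are hypothetical, the function name is illustrative, and `statistics.variance` is used because it divides by n - 1 as in the formula on this card.

```python
import math
from statistics import mean, variance


def m_value(red, green):
    """M = log10(red / green) for one spot."""
    return math.log10(red / green)


# Hypothetical (Cy5, Cy3) intensities for the same spot on three replicate arrays.
replicates = [(2000.0, 1000.0), (1800.0, 1000.0), (2400.0, 1000.0)]
m_values = [m_value(red, green) for red, green in replicates]

print("M per replicate:", [round(m, 3) for m in m_values])
print("mean M:", round(mean(m_values), 3))
print("variance of M:", round(variance(m_values), 5))
```

A mean M near 0 with small variance would be consistent with no difference between the red and green samples at that spot.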

Term
MA plot
Definition

Plot M versus A.

$A = (\log_{10} \mathrm{Cy5} + \log_{10} \mathrm{Cy3}) / 2 = \log_{10} \sqrt{\mathrm{Cy5} \times \mathrm{Cy3}}$

Term
MAPQ
Definition
Information found in a SAM file. The quality score, of how similar reads are to the mapped position.
Term
Markov chain
Definition

Calculates the transition probabilities between different sequence residues. Can be used to evaluate the probability of a sequence, given a model of the sequence evolution.

$P(X) = q_{x_1} \prod_{i=2}^{N} r_{x_{i-1} x_i}$

X = sequence of amino acids, nucleotides, or states

$q_{x_1}$ = probability of the first amino acid, nucleotide, or state

$r_{x_{i-1} x_i}$ = transition probability between two adjacent amino acids, nucleotides, or states

N = number of amino acids, nucleotides, or states in X
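
The Markov chain probability is the probability of the first residue multiplied by one transition probability per adjacent pair. The sketch below uses hypothetical parameters: a uniform starting distribution and a transition table in which G is rare after C. The function name is illustrative.

```python
def markov_probability(sequence, initial, transition):
    """P(X) = q[x_1] * product over i >= 2 of r[x_(i-1)][x_i]."""
    probability = initial[sequence[0]]
    for previous, current in zip(sequence, sequence[1:]):
        probability *= transition[previous][current]
    return probability


BASES = "ACGT"
# Hypothetical parameters, for illustration only.
initial = {base: 0.25 for base in BASES}
transition = {a: {b: 0.25 for b in BASES} for a in BASES}
transition["C"] = {"A": 0.3, "C": 0.3, "G": 0.1, "T": 0.3}  # CG depleted

print(markov_probability("ACG", initial, transition))  # 0.25 * 0.25 * 0.1
print(markov_probability("ACT", initial, transition))  # 0.25 * 0.25 * 0.3
```

Scoring the same sequence under two transition tables gives the two probabilities needed to compare models.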

Term
Mate pair library
Definition
Consists of two DNA fragments separated by an internal adapter, with two adapters, one on each end. Sequences are far apart in the genome.
Term
Maximum intron length
Definition
A parameter of RNA sequence aligning.
Term
Maximum mappable prefix (MMP)
Definition
The maximum length that can be read from one stretch of sequence before it does not mirror the reference. Found by STAR. Can be used as an anchor that can be extended.
Term
Messenger RNA (mRNA)
Definition
The molecule is larger than the protein for which it encodes. Only 1% of the genome is transcribed into mRNA: 30 million bp out of a 3 billion bp genome. Within a given cell, 10% of these genes are transcribed: 3 million bp.
Term
Micro RNA (miRNA)
Definition
Targets mRNA to induce silencing, affecting gene expression. Arise through imperfectly base-paired foldback structures in hairpin loops, pre-miRNAs that range from 70 to 300 nucleotides. This hairpin loop structure has been used extensively to computationally identify these genes. Acts in trans; acts at another locus, away from where it is. Tend to be shorter than siRNA, around 21 nucleotides. Includes miRNA-1 and miRNA-124. May bind to one or many genes, repressing translation, reducing transcript abundance, sometimes causing degradation. Binds to transcripts with sequence similarity, directly down-regulating them, or binds to a small number of single regulatory molecules, acting indirectly to up- or down-regulate other genes; usually it is a mixture of the two. Few transcripts have miRNA binding sites. Directly regulates many genes. Many potential miRNA binding sites are in evolutionarily conserved UTR regions. The 5' miRNA region often matches and binds to the 3'UTR of target transcripts. Even a 21-nucleotide miRNA may be similar to many off-target sites, so miRNA are not good as drugs.
Term
Microarray
Definition
Works by hybridization of two sets of RNA molecules to a known, complementary oligonucleotide or cDNA. 1 - 100 thousand genes are on a chip. Known RNA is labelled with green fluorescence, and the subject's RNA is labelled with red fluorescence. Output contains Cy3 (green) and Cy5 (red) signal values for each spot. Intensity of a spot's fluorescence is a function of the number of molecules hybridized to it. Difference in green and red fluorescence is expressed as M. Replications include biological and technical replicates.
Term
Minimum intron length
Definition
A parameter of RNA sequence aligning. Usually around 500 nucleotides in plants, and 1,000 in humans.
Term
miRNA-1
Definition
A human miRNA expressed in heart and skeletal muscle. Downregulates 96 genes in HeLa cells. Downregulates transcripts that are low in heart and skeletal muscle, however transcripts are not particularly high or low in other tissues.
Term
miRNA-124
Definition
A human miRNA expressed in the brain. Downregulates 174 genes in HeLa cells. These genes are expressed in lower levels in the brain than in other tissues. The mean level of expression is lower in the brain relative to in other tissues. However, transcripts are not particularly high or low in other tissues. Mutant sequences were designed, one in the 5' region (matches conserved target sequence), and another in a non-matching sequence; the miRNA with shared 5' ends have similar changes in gene expression. The UGCCUU sites in mRNAs targeted by miRNA-124 were 44% homologous in mice.
Term
Model
Definition

A representation of something. Includes qualitative, quantitative, and probabilistic models. Output depends on the equation and parameters used. A simple model assumes that sequence sites are independent, and sequences occur at equal frequency.

y = ax1 + b

y = ax1 + bx2 + c

y = response variable

x = predictive variable

a, b, and c = parameters

Term
Multiple expectation maximization for motif elicitation (MEME)
Definition
Used to computationally search the 3' UTRs of down-regulated transcripts for motifs or conserved sequences. Input is a group of sequences. Output provides information content; all the over-represented patterns of transcript that are down-regulated.
Term
Needleman-Wunsch algorithm
Definition
One of the first dynamic programming sequence alignment algorithms. Developed by Saul B. Needleman and Christian D. Wunsch, and published in 1970. An optimal matching algorithm and global alignment technique, optimizing alignment along the entire length of both sequences. Gives exact, optimal alignment results, but is very computationally intense, especially in high-throughput modern sequence analysis, and can be time-consuming to calculate even with a computer.
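The global alignment idea can be sketched as a dynamic programming table filled row by row, followed by a traceback. The match, mismatch, and gap values below are illustrative assumptions, not the original parameters, and a single linear gap penalty is used for simplicity.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):  # first column: a aligned against gaps only
        score[i][0] = i * gap
    for j in range(1, cols):  # first row: b aligned against gaps only
        score[0][j] = j * gap

    def pair(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,             # gap in b
                score[i][j - 1] + gap,             # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    aligned_a, aligned_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            aligned_a.append(a[i - 1]); aligned_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(a[i - 1]); aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-"); aligned_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(aligned_a)), "".join(reversed(aligned_b))


print(needleman_wunsch("ACGT", "AGT"))  # (1, 'ACGT', 'A-GT')
```

The table has (len(a) + 1) x (len(b) + 1) cells, which is why the method becomes expensive for long sequences and large read sets.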
Term
Null hypothesis
Definition
There is no difference in gene expression in transfected and non-transfected cells, and M will be zero.
Term
Operational taxonomic unit (OTU)
Definition
Assigned by QIIME. A grouping of microbes into a small taxonomic group, similar to a species. Filters raw sequencing data using quality scores. Samples are demultiplexed based on pre-assigned barcodes. OTUs are assigned based on 97% similarity, and mapped to sequence databases. Can be used to build phylogenetic relationships, and analyze and compare alpha and beta diversity.
Term
.out file
Definition

A file produced by TopHat

Exit status = 0. This is good.

Elapsed time = 5.5 hours.

Memory used = 8.6 G.

Term
.output directory
Definition

A directory produced by TopHat

Accepted_hits.bam. A list of read alignments. Needs to be converted to .sam to visualise what is here.

Align_summary.txt

Deletions.bed and insertions.bed. Lists the insertions and deletions reported by TopHat.

Junctions.bed. Tracks the junctions reported by TopHat.

Prep_reads.info

Unmapped.bam

Logs (directory) -run.log, tophat.log, et cetera

Term
P value
Definition
The probability of observing a t statistic as large or larger than the one calculated, when the null hypothesis is true. The effect of variance on the p value in two genes.
Term
Paired end (PE)
Definition
A type of sequence read where the read occurs in both directions. Longer fragments can be sequenced, with small reads from each end of the fragment. Distance between ends can validate mapping during alignment. More ideal than single ends. May be convergent or divergent reads.
Term
Phenylalanine (F)
Definition
Frequently interchangeable with tyrosine over evolutionary time.
Term
Polydot
Definition
A dot-plot program which is like dottup, but performs all-against-all comparisons for a set of sequences.
Term
Principal components analysis (PCA)
Definition
A bioinformatics technique used to reduce the dimensionality of large, multivariate datasets. Uses a combination of related variables to allow for better identification of the dataset's components. Maximizes variance while minimizing error. Identifies linear combination of variables with the widest range of values, given the values of the variables across all samples. Converts original variables into a second, summarized set. The first principal component searches for arrangements with maximum variance, and the second searches for arrangements perpendicular to the first. Can be used to correct for population stratification, and can help identify genes based on linkage disequilibrium of SNPs. Allows for efficient summarization of variance among many variables into just a few, resulting in easy visualization of data. Uses components as the axes of a graph and plots the linearly transformed value of each sample. Done with 2D and 3D scatterplots.
Term
Probabilistic model
Definition
Calculates the probability of some outcome. Climate and economic models may calculate the probability of change in temperature or rates of inflation in the future. In bioinformatics it is used to understand the probability of some observed data being produced in a specific scenario. Assigns probabilities to different outcomes. Includes a set of assumptions. Output has a different probability depending on assumptions.
Term
Probability of a single sequence
Definition

A model which assumes that residues are independent, so the probability of a series of events is the product of the probability of each event. Useful for comparing different models that have different parameter estimates. Includes A and T rich repeated sequence model and random sequence model.

$P(x \mid M) = \prod_{i=1}^{n} q_{x_i}$

x = the sequence

| = "given"

M = model that is assumed

n = length of sequence

i = pointer for the residue in the sequence

qxi = the probability of that residue
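
The product formula can be applied to the same sequence under the three nucleotide models described in this set. This is a minimal sketch; the function and dictionary names are illustrative, and the probabilities are the ones given on the model cards.

```python
import math


def sequence_probability(sequence, model):
    """P(x | M): product of the per-residue probabilities, assuming independence."""
    return math.prod(model[base] for base in sequence)


RANDOM = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
AT_RICH = {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1}
CG_RICH = {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4}

for name, model in [("random", RANDOM), ("A/T rich", AT_RICH), ("C/G rich", CG_RICH)]:
    print(f"{name}: {sequence_probability('AGCTAT', model):.6f}")
```

"AGCTAT" has four A or T residues and two C or G residues, so the A and T rich model gives it the highest probability of the three.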

Term
Profiles
Definition
A table of position-specific amino acid weights for a motif.
Term
Prosite
Definition
A database of protein families and motifs against which novel sequences can be compared.
Term
QIIME
Definition

Quantitative insights into microbial ecology

Pronounced "chime"

A free open-source software created on a Python platform. Able to demultiplex large amounts of microbial data generated by high-throughput sequencing technologies such as 454 and Illumina. Analyzes sequences from ribosomal 18S (for eukaryotes) and 16S (for prokaryotes) against databases of reference genomes in order to sort species contained within a sample to operational taxonomic units (OTUs). Allows for contrast of alpha and beta diversity. Python script is easy but limited. Has a very specialized purpose, and is good for only certain types of research.

Term
Qualitative model
Definition
A physical representation of an object, such as an eye, a leaf, or a building. A sketch or 3D model.
Term
Quantitative model
Definition

A mathematical description of a phenomenon.

Response = f(inputs)

Term
Quantitative trait (QT)
Definition
Phenotypes that vary continuously, such as height or blood pressure. Distributions imply they are controlled by multiple genes.
Term
Quantitative trait loci (QTL)
Definition
Studies attempt to find statistical association between markers and variation in phenotypic values. Markers are assigned LOD scores; the higher the score, the greater the chance that markers are associated with measured trait.
Term
R
Definition
A powerful, but challenging open-source statistical package.
Term
Random sequence model
Definition

Model 1

Assuming all nucleotides normally occur at equal frequencies. The probability of A is 0.25, G is 0.25, T is 0.25, and C is 0.25.

$P(\text{"AGCTAT"} \mid M) = (0.25)^6 = 0.000244$

Term
Read
Definition
A nucleotide sequence captured as data.
Term
Read group tracking file
Definition
An output file of Cuffdiff. Counts tracking files for each replicate.
Term
Reads per kilobase of exon model per million (RPKM)
Definition
Similar to FPKM, but measures reads instead of fragments.
Term
Recombination heatmap
Definition
Each marker is plotted against all other markers. Blue indicates low association, and red indicates high association. Markers have highest association with themselves.
Term
RNA
Definition
About 20,000 genes in the genome produce RNA transcripts, out of 35,000 genes.
Term
RNA interference (RNAi)
Definition
Mediated by sRNAs and DICER.
Term
RNA sequence analysis
Definition

Used to assay mRNA abundance within a sample. The current technology. A population of RNA fragments are converted into DNA with adapters on the ends, and then sequenced. There are 10s of millions of cDNAs, all attached to a single slide. Produces single and paired end reads. Around 10 - 20 million reads are made for different sequences. Abundance of mRNAs is estimated by number of reads. Introns are not sequenced. Some reads span exons, and need special alignment software such as TopHat, and then Cuffdiff.

1. Align everything that aligns well.

2. For remaining reads, see if they map if split over an interlocus. Parameters include anchor length, anchor mismatch, minimum intron length, and maximum intron length. SAM file output. Criteria attempt to minimize false negatives and false positives.

3. Assemble or summarize information to get an estimate of transcript abundance.

Term
Rosetta
Definition
Created by the Baker Laboratory at the University of Washington in 2005. Used to predict protein folding structures and design new proteins. Calculates the lowest energy protein design. Can be used to improve on pre-existing proteins. Calculates root mean square deviation (RMSD), and plots it against energy. Requires combined computational power of over 66,000 computers worldwide.
Term
Rubisco
Definition
A highly expressed gene.
Term
SAM file
Definition
The output of RNA sequence aligning. Information includes read ID, query ID, flag, reference ID, reference mapping position (bp), MAPQ, and CIGAR string.
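
A small sketch of reading one alignment line, assuming the standard tab-separated SAM layout in which the first six mandatory columns are QNAME, FLAG, RNAME, POS, MAPQ, and CIGAR. The record type, function name, and example line are made up for illustration.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read (query) ID
    flag: int    # bitwise flag
    rname: str   # reference ID
    pos: int     # 1-based mapping position on the reference (bp)
    mapq: int    # mapping quality
    cigar: str   # CIGAR string


def parse_sam_line(line):
    """Parse the first six columns of one SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    return SamRecord(fields[0], int(fields[1]), fields[2],
                     int(fields[3]), int(fields[4]), fields[5])


# Hypothetical alignment line, not taken from real data.
line = "read1\t0\tchr1\t100\t50\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII"
record = parse_sam_line(line)
print(record)
```

Header lines in a real SAM file start with "@" and would need to be skipped before calling the parser.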
Term
Score (S)
Definition

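Can compare two models. Comparing the models for A and T rich vs random sequences, we can score A or T as 0.4/0.25 = 1.6, and C or G as 0.1/0.25 = 0.4. For "AGCTAT", the odds ratio is $(1.6)^4 (0.4)^2 = 1.049$. A ratio above 1 (a positive log score) favours the A and T rich model.

$\frac{P(x \mid M_1)}{P(x \mid M_2)} = \prod_{i=1}^{N} \frac{p_{x_i}}{q_{x_i}}, \qquad S = \log \frac{P(x \mid M_1)}{P(x \mid M_2)} = \sum_{i=1}^{N} \log \frac{p_{x_i}}{q_{x_i}}$

Here $M_1$ is the A and T rich model with probabilities $p$, and $M_2$ is the random model with probabilities $q$.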
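
The comparison of the A and T rich model against the random model can be sketched in Python. The dictionary and function names are illustrative, and the natural log is an assumption, since the card does not state a base.

```python
import math

# Model probabilities from the cards: A and T rich vs. uniform random.
AT_RICH = {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1}
RANDOM = {base: 0.25 for base in "ACGT"}


def odds_ratio(seq, model=AT_RICH, background=RANDOM):
    """Product of per-base probability ratios, P(x | M1) / P(x | M2)."""
    ratio = 1.0
    for base in seq:
        ratio *= model[base] / background[base]
    return ratio


def log_odds(seq, model=AT_RICH, background=RANDOM):
    """Sum of per-base log ratios; equals the log of odds_ratio."""
    return sum(math.log(model[b] / background[b]) for b in seq)


ratio = odds_ratio("AGCTAT")
score = log_odds("AGCTAT")
print(round(ratio, 3), score > 0)
```

Summing logs rather than multiplying ratios avoids numerical underflow on long sequences.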

Term
Sequence motif
Definition
A pattern of nucleotides or amino acids that is conserved across diverse genomes and is known, or conjectured, to have a biological function. Useful for predicting the functions of proteins, identifying transcription factor binding sites, and identifying amino acids critical for function. A sequence with a specific function, shared with other genes. Conserved, found in all molecules with a certain function. Highly represented in a group.
Term
Serine (S)
Definition
Highly mutable.
Term
Shannon entropy score
Definition

Defines entropy for a string of nucleotides or amino acids. Probability of a nucleotide is estimated by counting the number of times a certain nucleotide occurs within the alignment, and dividing the number by the total number of nucleotides. The value is expressed in log2; multiply the natural log of p by 1.4427.

$H(X) = -\sum_{i} p_{x_i} \log_2 p_{x_i}$

$p_{x_i} = \frac{c_{x_i}}{\sum_{j \in \{A, G, C, T\}} c_{x_j}}$

$i$ = an index for each possible nucleotide

$x_i$ = the nucleotide at index $i$

$p_{x_i}$ = the probability of that nucleotide

$c_{x_i}$ = the number of counts of that nucleotide
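
As a sketch of the definition above, the entropy of one alignment column can be computed by counting each nucleotide and dividing by the total. The function name is illustrative. `math.log2` is used directly, which is equivalent to multiplying the natural log by 1.4427.

```python
import math
from collections import Counter


def shannon_entropy(column):
    """Shannon entropy in bits of a string of nucleotides or amino acids."""
    counts = Counter(column)
    total = sum(counts.values())
    entropy = 0.0
    for count in counts.values():
        p = count / total          # estimated probability of this symbol
        entropy -= p * math.log2(p)
    return entropy


conserved = shannon_entropy("AAAA")   # one symbol only: no uncertainty
variable = shannon_entropy("ACGT")    # four equally likely symbols
print(conserved, variable)
```

A fully conserved column scores 0 bits, and a column with all four nucleotides equally represented scores the maximum of 2 bits.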

Term
Single end
Definition
A type of sequence read where the fragment is read in one direction only. Short reads are hard to align to a reference genome and often have multiple mapping sites. Longer reads can overcome this limitation.
Term
Single nucleotide polymorphism (SNP)
Definition
A difference at a single position in the genome.
Term
Small interfering RNA (siRNA)
Definition
Arise from exogenous dsRNA through processing by DICER, and often require the activity of RNA dependent RNA polymerase (RDR). Foreign dsRNA can trigger gene silencing through siRNAs. Transfection into some animals causes a reduction in off-target mRNA transcripts with partial complementarity to the siRNA, as well as in the intended transcripts. This suggests that they may influence transcript levels in animals by partial binding. Act in cis: they act where they are, at that locus. Tend to be longer than miRNA, around 24 nucleotides.
Term
Small RNA (sRNA)
Definition
RNA molecules 21 - 24 nucleotides long, encoded by the genome. Mediate RNAi. There is potential for use as a therapeutic agent, to silence disease-causing genes. Includes siRNA and miRNA.
Term
Splice variants
Definition
RNA transcripts of the same gene that are produced by alternative splicing.
Term
STAR
Definition

Spliced Transcripts Alignment to a Reference

An RNA sequence alignment software. Effective when the transcript has introns, mismatches, and indels. High speed, sensitivity, and precision. Can read complex RNA arrangements. Low computational overhead, but uses more memory (RAM).

1. Seed searching for maximum mappable prefix

2. Clustering, stitching, and scoring
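
A deliberately naive sketch of the first step, seed searching for the maximum mappable prefix. STAR itself searches an uncompressed suffix array; plain string search is used here only to show the idea. The function names and the toy genome are hypothetical.

```python
def maximum_mappable_prefix(read, genome):
    """Return (length, position) of the longest read prefix found exactly in genome."""
    best_len, best_pos = 0, -1
    for length in range(1, len(read) + 1):
        pos = genome.find(read[:length])
        if pos == -1:
            break  # a longer prefix cannot match if this one does not
        best_len, best_pos = length, pos
    return best_len, best_pos


def seed_read(read, genome):
    """Split a read into consecutive seeds as (read_offset, length, genome_pos)."""
    seeds = []
    offset = 0
    while offset < len(read):
        length, pos = maximum_mappable_prefix(read[offset:], genome)
        if length == 0:
            offset += 1  # skip a base that maps nowhere
            continue
        seeds.append((offset, length, pos))
        offset += length
    return seeds


# Toy genome: two exons separated by an intron of G's.
genome = "GATTACA" + "GGGGGGGGGG" + "CTTCAGT"
spliced_read = "TTACA" + "CTTCA"   # read spanning the exon-exon junction
seeds = seed_read(spliced_read, genome)
print(seeds)
```

The first seed stops where the read leaves the first exon, and the search restarts from the unmapped remainder. The second step (clustering, stitching, and scoring) would then join such seeds into one spliced alignment.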

Term
T test
Definition

A standard method for identifying significant changes in microarray data. The t statistic reflects the idea that the importance of a mean depends on the spread of the values used to compute it. Given that the null hypothesis is true, t values will fall within a range defined by the t distribution and degrees of freedom. Approximately normally distributed. The larger the absolute value of t, the more confidence there is in a difference between treatments.

$t = \frac{\bar{x} \sqrt{n}}{s}$

$T = \frac{E(\log Y)}{\mathrm{var}(\log Y)}$
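
A minimal sketch of the first formula, the mean times the square root of n, divided by the sample standard deviation. It assumes a one-sample test against a mean of zero, as would apply to log expression ratios. The function name and data are invented.

```python
import math
import statistics


def t_statistic(values):
    """One-sample t statistic against a null mean of zero."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)   # sample standard deviation (n - 1)
    return mean * math.sqrt(n) / s


# Hypothetical log ratios for one gene across three replicates.
log_ratios = [1.0, 2.0, 3.0]
t = t_statistic(log_ratios)
print(t)
```

The same mean gives a smaller t when the replicates are more spread out, which is why the spread of values matters as much as the mean itself.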

Term
Technical replicate
Definition
Microarray hybridizations or RNA sequencing outputs of the same biological sample. mRNA may be extracted more than once from a single individual.
Term
TopHat
Definition
Splice-aware RNA alignment software with sensitivity to potential introns. Uses Bowtie to align reads to the genome over introns. Identifies exon-exon splice junctions. Can be used with or without an annotation file; you do not need to have a file that tells you where existing splice junctions are located. Reads are aligned to the genome, and split segment alignments (small exons) and coverage islands are merged with exons and aligned to junctions. Has extra criteria to reflect the intron-spanning nature of reads: -a, -m, -i, and -I. Produces an .out file and an .output directory. Often misses things.
Term
Transcription factor
Definition
Expressed at low levels. Around 10% of all cell proteins are regulatory molecules.
Term
Transforming growth factor alpha (TGF-alpha)
Definition
Secreted by many human tumours. Overexpression of the gene is associated with cancer in mammals. Encodes a transmembrane precursor from which the mature polypeptide is released by proteolytic cleavage. Competes with epidermal growth factor (EGF) for protein binding sites.
Term
Transition probability
Definition
The probability that a certain nucleotide (C) is followed by a certain other nucleotide (A, C, T, or G). Different regions of the genome have different transition probabilities.
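
As an illustration, transition probabilities can be estimated from a sequence by counting how often each nucleotide follows each other nucleotide. This is a sketch under that simple counting assumption. The function name and the example sequence are made up.

```python
from collections import Counter, defaultdict


def transition_probabilities(seq):
    """Estimate P(next nucleotide | current nucleotide) from adjacent pairs."""
    pair_counts = defaultdict(Counter)
    for current, following in zip(seq, seq[1:]):
        pair_counts[current][following] += 1
    probs = {}
    for current, counts in pair_counts.items():
        total = sum(counts.values())
        probs[current] = {nxt: n / total for nxt, n in counts.items()}
    return probs


# In this toy sequence, C is followed by A twice, and by G and T once each.
probs = transition_probabilities("CACGCACT")
print(probs["C"])
```

Running the same function on sequences from different regions of a genome would give different tables, which is what makes transition probabilities useful for telling regions apart.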
Term
Tryptophan (W)
Definition
Tends to be conserved. Changes to different amino acids infrequently.
Term
Tyrosine (Y)
Definition
Frequently interchangeable with phenylalanine over evolutionary time.
Term
ChIA-PET
Definition

Chromatin Interaction Analysis by Paired-End Tag

A method of studying chromatin interactions genome-wide. Maps interacting regions at high resolution, showing where they occur and at what frequency. Helps in understanding the regulation of gene expression.
