# Shared Flashcard Set

## Details

COMP3017 - Lecture 5 - Query Processing
COMP3017 - Lecture 5 - Query Processing
30
Computer Science
05/02/2014

Term
 Draw the diagram for Query Processing
Definition
 [image]
Term
 What are the types of query plans and their characteristics?
Definition
 Logical query plan Alegebraic representation of query Operators taken from relational algebra Is abstract Physical query plan Algorithms selected for each operator in plan Execution order for specific operators
Term
 Please show in the diagra where does query optimization take place
Definition
 [image]
Term
 What does the 'selecting logical query plan' stage consist of?
Definition
 It takes place after the parse query, before select physical query plan, and it consists of Generate Logical Query plan Rewrite logical query plan
Term
 What are Physical Plan Operators?
Definition
 Algorithm that implements one of the basic relational operations that are used in query plans   e.g. Relational algebra has join operator. How the join is carried out depends on >Structure of relations, > Size of relations, > Presence of indexes and hashes, etc
Term
 What is the computational model for Pyhsical plan operators?
Definition
 Need to choose good physical plan operators: Estimate "cost" of each operator Key measure of cost is the number of disk accesses (far more costly than main memory accesses) Assumption: arguments of operator are on disk, result is in main memory
Term
 What are the cost parameters of Physical Plan operators?
Definition
 M: Main memory available for buffers S(R): Size of a tuple of relation R B(R): Blocks used to store relation R T(R): Number of tuples in relation R (Cardinality of R) V(R,a): Number of distinct values for attribute a in relation R
Term
 What is a clustered 1) File, 2) relation and 3) index?
Definition
 Tuples from different relations that can be joined (on particular attribute values) stored in blocks together. [R1 R2 S1 S2] [R3 R4 S3 S4] Tuples from relation are stored together in blocks, but not necessarily sorted [R1 R2 R3 R4] [R5 R6 R7 R8] Index allows tuples to be read in an order that corresponde to physical order
Term
 What does scanning consist of?
Definition
 e.g. Read all tuples of relation R (that satisfy some predicate) Two variants Table Scan  Index Scan
Term
 What are the characteristics of Table scan?
Definition
 Tuples arranged in blocks All blocks known to the system Possible to get blocks one at a time I/O Cost B(R) disk accesses, if R is clustered T(R) disk accesses, if R is not clustered
Term
 What are the characteristics of Index Scan?
Definition
 An index exists on some attribute of R Use index to find all blocks holding R Retreive blocks for R I/O Cost B(R) + B(IR) disk accesses if clustered B(R) >> B(IR), so treat as only B(R) T(R) disk accesses if not clustered
Term
 What are the main characteristics of one-pass algorithms?
Definition
 Read data from disk only once Typically require that at least one argument fits in main memory Three broad categories: Unary, tuple at a time (i.e. select, project) Unary, full relation (i.e. duplicate elimination, grouping) Binary, full relation
Term
 Describe the one-pass unary, tuple at a time
Definition
 Algorithm: Foreach block of R: Copy block to input buffer Perform operation (select, project) on each tuple in block Move selected/projected tuples to output buffer Cost: In general, B(R) or T(R) disk accesses depending on clustering If operator is a select that compares an attribute to a constant and index exists for attributes used in select, <= 1
Term
 Describe the Unary, Full-relation, one pass algorithm and cost
Definition
 Algorithm (Normal): foreach block of R Copy block to input buffer update accumulator move tuples to output buffer Algorithm (Duplicate Elimination) foreach block of R: copy block to input buffer foreach tuple in block If tuple is not in accumulator copy to output buffer else copy to accumulator Requires M >= B(d(R))+1 block of main memory 1 block for input buffer M-1 blocks for accumulator (records each tuple seen so far) Accumulator implemented as in-memory data structure (tree, hash) If fewer than B(d(R)) blocks available, thrashing likely Cost is B(R) disk accesses
Term
 What are the characteristics of grouping in a unary, full relation?
Definition
 Grouping operators: min, max, sum, count, avg Accumulator contains per-group values Output contains per-group values Cost is B(R) disk accesses
Term
 What are the characteristics of the Binary, Full-relation?
Definition
 Union, intersection, difference, product, join   In general, cost is B(R) + B(S). (R, S are operand relations)   Requirement for one pass operation: min(B(R), B(S)) < M-1
Term
 What are the characteristics of the join, binary ful relation?
Definition
 Two relations, R(X,Y) and S(Y,Z), B(S)
Term
 What are the characteristics of nested-loop joins?
Definition
 Also known as iteration join Assumming that we're joining the relations R,S on attribute C, the algorithm is as follows: foreach tuple r in R foreach tuple s in S if r.C = s.C then output r,s pair
Term
 What are some factors that affect cost?
Definition
 Tuples of a relation stored physically together (clustered) Relations sorted by join attribute Indexes exist?
Term
 Solve a join between relations R1, R2 on attribute C through a tuple-based nested loop join with the following elements: T(R1) = 10,000 T(R2) = 5,000 S(R1) = S(R2) = 1/10 block M = 101 blocks
Definition
 Relations not contiguous One disk access per tuple Cost for each tuple in R1=cost to read tuple + cost to read R2 Total Cost =  T(R1)*(1+T(R2)) = 10,000 * (1+5,000) =  50,010,000 disk accesses It is possible to do better by using all available main memory (M=101), read outer relation R1 in chunks of 100 blocks, and read all inner relation2 (using 1 block) + join
Term
 Solve a join between relations R1, R2 on attribute C through a block-based nested loop join with the following elements: T(R1) = 10,000 T(R2) = 5,000 S(R1) = S(R2) = 1/10 block M = 101 blocks
Definition
 Cost to read one 100-block chunk of R1 = 100*1/S(R1) = 1,000 disk accesses Cost to process each chunk = 1000 + T(R2) = 6,000 Total cost = T(R1)/ 1000 * 6000 = 60,000 disk accesses It's possible to do better by reversing the join order ( i.e. R1 becomes the inner relation, R2 becomes the outer relation).
Term
 Solve a join between relations R1, R2 on attribute C through a join reordering with the following elements: T(R1) = 10,000 T(R2) = 5,000 S(R1) = S(R2) = 1/10 block M = 101 blocks
Definition
 We are reversing the join order ( i.e. R1 becomes the inner relation, R2 becomes the outer relation). Cost to read one 100-block chunk of R2=100*1/S(R2)=1,000 disk accesses Cost to process each chunk = 1000+T(R1) = 11,000 Total cost = T(R1)/1000*11,000 = 55,000 disk accesses It's possible to do better if the tuples in each relation are contiugous (i.e. clustered)
Term
 Solve a join between relations R1, R2 on attribute C through Contiguous Relations with the following elements: T(R1) = 10,000 T(R2) = 5,000 S(R1) = S(R2) = 1/10 block M = 101 blocks
Definition
 Tuples in each relation are contiugous (i.e. clustered) B(R1)=1000, B(R2)=500 Cost to read one 100-block chunk of R2=100 disk accesses Cost to process each chunk = 100+B(R1)=1100 Total cost=B(R2)/100*1,100=5,500 disk accesses We can do better if both relations are contiguous and sorted by C, the join attribute
Term
 Solve a join between relations R1, R2 on attribute C through a merge join with the following elements: T(R1) = 10,000 T(R2) = 5,000 S(R1) = S(R2) = 1/10 block M = 101 blocks
Definition
 Both relations are contiguous and sorted by C, the join attribute   Total cost = B(R1)+B(R2)=1000+500 = 1,500 disk accesses   Problem is if R1 and R2 aren't sorted by C, as it is needed to sort them.
Term
 Solve a join between relations R1, R2 on attribute C through a Two Pass Algorithm with the following elements: T(R1) = 10,000 T(R2) = 5,000 S(R1) = S(R2) = 1/10 block M = 101 blocks
Definition
 If R1 and R2 are not sorted by C, we need to sort them - this can be done with mergesort: Foreach 100 block chunk of R Read chunk  sort in memory Write to disk Read all chunks + merge + write out
Term
 What is the cost of the merge sort?
Definition
 Each tuple is read, written, read, written... so: Sort cost R1: 4x1000 = 4,000 disk accesses Sort cost R2: 4x500 = 2,000 disk accesses
Term
 Solve a join between relations R1, R2 on attribute C through a Merge Join with sort with the following elements: T(R1) = 10,000 T(R2) = 5,000 S(R1) = S(R2) = 1/10 block M = 101 blocks
Definition
 R1,R2 contiguous but usorted: Total Cost = sort cost + join cost = 6,000 + 1,500 = 7,500 disk accesses Nested loop cost = 5,500 disk accesses Merge join does not necessarily pay off If R1=10,000blocks contiugous and R2=5,000blocks not ordered   Nested loop cost = 505,000 disk accesses Merge join cost = 75,000 disk accesses   In this case merge join with sort is better
Term
 What are query trees?
Definition
 Are trees that represent queries visually by breaking them down into components:   SELECT PNUMBER,DNUM,LNAME,ADDRESS,DATE FROM PROJECT,DEPARTMENT, EMPLOYEE WHERE DNUM=DNUMBER AND MGRSSN=SSN AND PLOCATION='Stafford'   [image]
Term
 What are the optimization steps?
Definition
 Heuristic approach Start with canonical form Move o operators down the tree Reorder subtrees to put most restrictive o first Combine x and o to form |><| Move pi operators down the tree
Term
 Show the diagram for the Query plan, and show where to choose heuristics
Definition
 [image]
Supporting users have an ad free experience!