Description
Question 1. Estimate the cost of following operations with the information given below. All costs
should be in number of pages (read or written).
Recall that when the final result is found, it is put in the output buffer. This operation has no
additional cost.
R(A,B,C,D,E) TUPLES(R)= 100,000 PAGES(R)= 2,000
S(F,G,A) TUPLES(S) = 3,000,000 PAGES(S)= 6,000
(a) Block-nested loop join for R ./ S with M=101 blocks. For join ordering, choose the lowest cost
one and only show the result of that join.
Answer: cost = P AGES(R) + [P AGES(R)
M−1
] × P AGES(S) = 2000 + 20 × 6000 = 12200
(b) External sorting of S using M=60 blocks.
Answer: cost of read and merge = 6000 × 2 = 12000
6000/60 = 100 groups which is larger than M = 60, so repeat to merge groups.
Total cost = 12000 + 12000 + 6000 = 30000
(c) External sorting of S using M=100 blocks.
Answer: cost of read and merge = 6000 × 2 = 12000
6000/100 = 60 groups which is smaller than M = 60, sort directly
Total cost = 12000 + 6000 = 18000
(d) Hash join for R ./ S with M=101 blocks. Describe how join works and any assumptions you
make in your computation.
Assume tuples in R and S will uniformly distribute to 101 blocks. If a buckets after hashing
will not fit in memory, then you will use block nested loop join.
Answer: hash cost = 2 × (2000 + 6000) = 16000 R buckets: 2000/101 = 20 pages/bucket,
S buckets: 6000/101 = 60 pages/bucket
so enough memory to allocate
Total cost = 16000 + 2000 + 6000 = 24000
(e) Sort merge join for R ./ S with M=101 blocks. Sort each relation first and then join.
Note that if the number of partially sorted groups is smaller than allocated memory (in this
case 101 blocks), they can be sorted and joined together in a single step.
Answer: cost of sort R = 2 × 4000 + 2 × 4000 = 8000
cost of sort S(only step 1 of external sort) = 6000 × 2 =12000
Then use 60 blocks to do sort merge and join for S and 1 block to read R
Total cost = 8000 + 12000 + 8000 = 28000
1
Question 2. Estimate the cost of query Q1 using different query plans. All costs should be in
number of pages. Assume sufficient amount of memory is allocated for each query plan.
R(A,B,C,D,E) TUPLES(R)= 100,000 PAGES(R)= 2,000
Q1: SELECT A,B FROM R WHERE C>10 AND D=25
Index I1 with three levels (root,internal,leaf) on R(C,D,A) with 800
nodes in the leaf level
Index I2 with three levels (root,internal,leaf) on R(D,A,B,C) with 1,500
nodes in the leaf level
Index I3 with three levels (root,internal,leaf) on R(C) with 300
nodes in the leaf level
Index I4 with three levels (root,internal,leaf) on R(D) with 250
nodes in the leaf level
Expected number of tuples of R for conditions:
Tuples(R.C>10)=20,000
Tuples(R.D=25)=1,000
Tuples(R.C>10 and R.D=25)=50
(a) Plan 1: sequential scan over R.
Answer: Because of sequential scan, cost = PAGES(R) = 2000
(b) Plan 2: using index I1.
Answer: Index I1 has 800 nodes in the leaf level, so 10000/800 = 125 tuples/leave node.
Because condition 1 in Q1 is C > 10 which is a range, we need to go over all leave node with
C > 10, and cost of this step = 20000/125 = 160 to 161 pages.
Then because B is not in I1, so we need to scan all satisfied tuples with the maximum cost of
50 pages.
Total cost = 2 + (160 to 161) + 50 = 212 to 213 pages
(c) Plan 3: using index I2.
Answer: Index I2 has 1500 nodes in the leaf level, so 10000/1500 = 66 tuples/leave node.
First we need to go over all leave node with D = 25, and cost of this step = 1000/66 = 16 to
17 pages.
Then because A and B are in I2, so we do not need to scan all satisfied tuples.
Total cost = 2 + 16 to 17 = 18 to 19 pages
(d) Plan 4: using index I3.
Answer: Index I3 has 3000 nodes in the leaf level, so 10000/3000 = 333 tuples/leave node.
First we need to go over all leave node with C > 25, and cost of this step = 20000/333 = 60
to 61 pages.
Then because A and B and D are not in I3, so we need to scan all satisfied tuples.
Total cost = 2 + (60 to 61) + 20000 = 20062 to 20063 pages
2
(e) Plan 5: using index I4.
Answer: Index I4 has 250 nodes in the leaf level, so 10000/250 = 400 tuples/leave node.
First we need to go over all leave node with D = 25, and cost of this step = 1000/400 = 3 to
4 pages.
Then because A, B and C are not in I4, so we do need to scan all satisfied tuples.
Total cost = 2 + (3 to 4) + 1000 = 1005 to 1006 pages
(f) Plan 6: using index I3 and I4 both.
Answer: Index I2 has 1500 nodes in the leaf level, so 10000/1500 = 66 tuples/leave node.
From part (d), cost to get tuples with (C > 10) = 62 to 63 pages.
From part (e), cost to get tuples with (D = 25) = 5 to 6 pages.
Then we do intersection for above two group of tuples to find tuples satisfy both conditions.
Total cost = (62 to 63) + (5 to 6) + ((60 to 61) + (3 to 4)) + 50 = 180 to 184 pages
3
Question 3. Estimate the size (number of tuples) of the following queries:
Q1: select * from games where id = 21;
Q2: select * from contestants where gameid = 2345;
Q3: select * from contestants where shortname = ‘Gilbert’;
Q4: select * from contestants where gameid = 2345 and shortname = ‘Gilbert’;
Q5: select * from games g, contestants c
where g.gameid=c.gameid and c.shortname=’Gilbert’;
Q6: select * from responses where iscorrect = True;
Q7: select * from responses where shortname = ‘Gilbert’ and gameid=’5881′;
Q8: select * from responses where shortname = ‘Gilbert’ or gameid=’5881′;
Q9: select * from responses r, contestants c
where c.shortname=r.shortname;
Q10: select * from responses r, contestants c
where c.shortname=r.shortname and c.gameid=r.gameid;
using the following statistics (VALUES is the same DISTINCT values):
Tuples(Games) = 10,000
Values(Games.gameid) = 10,000
Values(Games.id) = 40
Values(Games.airdate) = 10,000
Tuples(Contestants)=30,000
Values(Contestants.shortname)=3,000
Values(Contestants.fullname)=18,000
Values(Contestants.gameid)=10,000
Tuples(Responses)=740,000
Values(Responses.gameid)=10,000
Values(Responses.clueid)=520,000
Values(Responses.shortname)=3,000
Values(Responses.iscorrect)=2
These are close to the real statistics, but they are not identical. However, the following is a great
educational exercise (not required for solving the homework). You can get the statistics for the
real data and compare the estimates to the real results (just by running select count(*) queries).
When do the estimates are far off from reality and why?
4
Answer:
Q1 1000 ×
1
40 = 250
Q2 30000 ×
1
10000 = 3
Q3 30000 ×
1
3000 = 10
Q4 30000 ×
1
10000×3000 = 1
Q5 10000 ×
30000
3000×10000 = 10
Q6 740000
2 = 370000
Q7 740000 ×
1
3000×10000 = 1
Q8 740000 × (1 −
1
10000 ) × (1 −
1
3000 ) = 321
Q9 30000 ×
740000
3000 = 7400000
Q10 30000 ×
740000
3000×10000 = 740
5
Question 4. Estimate the cost of the following query plans for the same query. Which one is the
cheapest plan?
The abbreviations used:
BNLJ Block nested loop join
SMJ Sort merge join
O-T-F On the fly (pipelined operation)
Sort External sort
I-O-S Index only scan
SS Sequential scan
For simplicity, we will give you the size of the result of each operation. You can assume that there
are appropriate projections to reduce the size of intermediate relations which are included in the
given sizes.
Furthermore, assume that the join operation is over a foreign key (so for each S.D, there is a single
R.A value). This simply makes the computation of sort merge join much simpler after a sort.
PAGES(R) = 100
PAGES(S) = 800
PAGES(σR.B>20 R) = 25
PAGES(R ./R.A=S.D S) = 700
PAGES((σR.B>20 R) ./R.A=S.D S) = 175
Index scan for Plan 2 uses an index on R.B,R.A,R.C. Assume that index has 2 levels, root and 100
leaf pages. The selectivity of the condition σR.B>20 R is 1/4.
Note that operations in each query plan are pipelined from lower operations to upper operations,
by placing the output of one operation to the input buffer of the next operation.
Answer:
1. Cost of BNLJ = 100 + 800 ×
100
50 = 1700
Cost of sort = 700
50 = 14 < 50
Total cost = 1700 + 700 × 2 + 700 = 3800
2. Cost of selection = 1 + 25 = 26
Cost of BNLJ = 25 + 800 ×
25
50 = 425
Cost of sort = 175
50 = 4 < 50
Total cost = 26 + 425 + 2 × 175 + 175 = 976
3. Cost of selection = 100
Cost of sort 1 = 100
50 × 2 + 100 = 300
Cost of sort 2 = 800
50 × 2 + 800 = 2400
Cost of SMJ = 300 + 2400 = 2700
Total cost = 100 + 300 + 2400 + 2700 = 5500
6