## Description

Question 1. Estimate the cost of following operations with the information given below. All costs

should be in number of pages (read or written).

Recall that when the final result is found, it is put in the output buffer. This operation has no

additional cost.

R(A,B,C,D,E) TUPLES(R)= 100,000 PAGES(R)= 2,000

S(F,G,A) TUPLES(S) = 3,000,000 PAGES(S)= 6,000

(a) Block-nested loop join for R ./ S with M=101 blocks. For join ordering, choose the lowest cost

one and only show the result of that join.

Answer: cost = P AGES(R) + [P AGES(R)

M−1

] × P AGES(S) = 2000 + 20 × 6000 = 12200

(b) External sorting of S using M=60 blocks.

Answer: cost of read and merge = 6000 × 2 = 12000

6000/60 = 100 groups which is larger than M = 60, so repeat to merge groups.

Total cost = 12000 + 12000 + 6000 = 30000

(c) External sorting of S using M=100 blocks.

Answer: cost of read and merge = 6000 × 2 = 12000

6000/100 = 60 groups which is smaller than M = 60, sort directly

Total cost = 12000 + 6000 = 18000

(d) Hash join for R ./ S with M=101 blocks. Describe how join works and any assumptions you

make in your computation.

Assume tuples in R and S will uniformly distribute to 101 blocks. If a buckets after hashing

will not fit in memory, then you will use block nested loop join.

Answer: hash cost = 2 × (2000 + 6000) = 16000 R buckets: 2000/101 = 20 pages/bucket,

S buckets: 6000/101 = 60 pages/bucket

so enough memory to allocate

Total cost = 16000 + 2000 + 6000 = 24000

(e) Sort merge join for R ./ S with M=101 blocks. Sort each relation first and then join.

Note that if the number of partially sorted groups is smaller than allocated memory (in this

case 101 blocks), they can be sorted and joined together in a single step.

Answer: cost of sort R = 2 × 4000 + 2 × 4000 = 8000

cost of sort S(only step 1 of external sort) = 6000 × 2 =12000

Then use 60 blocks to do sort merge and join for S and 1 block to read R

Total cost = 8000 + 12000 + 8000 = 28000

1

Question 2. Estimate the cost of query Q1 using different query plans. All costs should be in

number of pages. Assume sufficient amount of memory is allocated for each query plan.

R(A,B,C,D,E) TUPLES(R)= 100,000 PAGES(R)= 2,000

Q1: SELECT A,B FROM R WHERE C>10 AND D=25

Index I1 with three levels (root,internal,leaf) on R(C,D,A) with 800

nodes in the leaf level

Index I2 with three levels (root,internal,leaf) on R(D,A,B,C) with 1,500

nodes in the leaf level

Index I3 with three levels (root,internal,leaf) on R(C) with 300

nodes in the leaf level

Index I4 with three levels (root,internal,leaf) on R(D) with 250

nodes in the leaf level

Expected number of tuples of R for conditions:

Tuples(R.C>10)=20,000

Tuples(R.D=25)=1,000

Tuples(R.C>10 and R.D=25)=50

(a) Plan 1: sequential scan over R.

Answer: Because of sequential scan, cost = PAGES(R) = 2000

(b) Plan 2: using index I1.

Answer: Index I1 has 800 nodes in the leaf level, so 10000/800 = 125 tuples/leave node.

Because condition 1 in Q1 is C > 10 which is a range, we need to go over all leave node with

C > 10, and cost of this step = 20000/125 = 160 to 161 pages.

Then because B is not in I1, so we need to scan all satisfied tuples with the maximum cost of

50 pages.

Total cost = 2 + (160 to 161) + 50 = 212 to 213 pages

(c) Plan 3: using index I2.

Answer: Index I2 has 1500 nodes in the leaf level, so 10000/1500 = 66 tuples/leave node.

First we need to go over all leave node with D = 25, and cost of this step = 1000/66 = 16 to

17 pages.

Then because A and B are in I2, so we do not need to scan all satisfied tuples.

Total cost = 2 + 16 to 17 = 18 to 19 pages

(d) Plan 4: using index I3.

Answer: Index I3 has 3000 nodes in the leaf level, so 10000/3000 = 333 tuples/leave node.

First we need to go over all leave node with C > 25, and cost of this step = 20000/333 = 60

to 61 pages.

Then because A and B and D are not in I3, so we need to scan all satisfied tuples.

Total cost = 2 + (60 to 61) + 20000 = 20062 to 20063 pages

2

(e) Plan 5: using index I4.

Answer: Index I4 has 250 nodes in the leaf level, so 10000/250 = 400 tuples/leave node.

First we need to go over all leave node with D = 25, and cost of this step = 1000/400 = 3 to

4 pages.

Then because A, B and C are not in I4, so we do need to scan all satisfied tuples.

Total cost = 2 + (3 to 4) + 1000 = 1005 to 1006 pages

(f) Plan 6: using index I3 and I4 both.

Answer: Index I2 has 1500 nodes in the leaf level, so 10000/1500 = 66 tuples/leave node.

From part (d), cost to get tuples with (C > 10) = 62 to 63 pages.

From part (e), cost to get tuples with (D = 25) = 5 to 6 pages.

Then we do intersection for above two group of tuples to find tuples satisfy both conditions.

Total cost = (62 to 63) + (5 to 6) + ((60 to 61) + (3 to 4)) + 50 = 180 to 184 pages

3

Question 3. Estimate the size (number of tuples) of the following queries:

Q1: select * from games where id = 21;

Q2: select * from contestants where gameid = 2345;

Q3: select * from contestants where shortname = ‘Gilbert’;

Q4: select * from contestants where gameid = 2345 and shortname = ‘Gilbert’;

Q5: select * from games g, contestants c

where g.gameid=c.gameid and c.shortname=’Gilbert’;

Q6: select * from responses where iscorrect = True;

Q7: select * from responses where shortname = ‘Gilbert’ and gameid=’5881′;

Q8: select * from responses where shortname = ‘Gilbert’ or gameid=’5881′;

Q9: select * from responses r, contestants c

where c.shortname=r.shortname;

Q10: select * from responses r, contestants c

where c.shortname=r.shortname and c.gameid=r.gameid;

using the following statistics (VALUES is the same DISTINCT values):

Tuples(Games) = 10,000

Values(Games.gameid) = 10,000

Values(Games.id) = 40

Values(Games.airdate) = 10,000

Tuples(Contestants)=30,000

Values(Contestants.shortname)=3,000

Values(Contestants.fullname)=18,000

Values(Contestants.gameid)=10,000

Tuples(Responses)=740,000

Values(Responses.gameid)=10,000

Values(Responses.clueid)=520,000

Values(Responses.shortname)=3,000

Values(Responses.iscorrect)=2

These are close to the real statistics, but they are not identical. However, the following is a great

educational exercise (not required for solving the homework). You can get the statistics for the

real data and compare the estimates to the real results (just by running select count(*) queries).

When do the estimates are far off from reality and why?

4

Answer:

Q1 1000 ×

1

40 = 250

Q2 30000 ×

1

10000 = 3

Q3 30000 ×

1

3000 = 10

Q4 30000 ×

1

10000×3000 = 1

Q5 10000 ×

30000

3000×10000 = 10

Q6 740000

2 = 370000

Q7 740000 ×

1

3000×10000 = 1

Q8 740000 × (1 −

1

10000 ) × (1 −

1

3000 ) = 321

Q9 30000 ×

740000

3000 = 7400000

Q10 30000 ×

740000

3000×10000 = 740

5

Question 4. Estimate the cost of the following query plans for the same query. Which one is the

cheapest plan?

The abbreviations used:

BNLJ Block nested loop join

SMJ Sort merge join

O-T-F On the fly (pipelined operation)

Sort External sort

I-O-S Index only scan

SS Sequential scan

For simplicity, we will give you the size of the result of each operation. You can assume that there

are appropriate projections to reduce the size of intermediate relations which are included in the

given sizes.

Furthermore, assume that the join operation is over a foreign key (so for each S.D, there is a single

R.A value). This simply makes the computation of sort merge join much simpler after a sort.

PAGES(R) = 100

PAGES(S) = 800

PAGES(σR.B>20 R) = 25

PAGES(R ./R.A=S.D S) = 700

PAGES((σR.B>20 R) ./R.A=S.D S) = 175

Index scan for Plan 2 uses an index on R.B,R.A,R.C. Assume that index has 2 levels, root and 100

leaf pages. The selectivity of the condition σR.B>20 R is 1/4.

Note that operations in each query plan are pipelined from lower operations to upper operations,

by placing the output of one operation to the input buffer of the next operation.

Answer:

1. Cost of BNLJ = 100 + 800 ×

100

50 = 1700

Cost of sort = 700

50 = 14 < 50
Total cost = 1700 + 700 × 2 + 700 = 3800
2. Cost of selection = 1 + 25 = 26
Cost of BNLJ = 25 + 800 ×
25
50 = 425
Cost of sort = 175
50 = 4 < 50
Total cost = 26 + 425 + 2 × 175 + 175 = 976
3. Cost of selection = 100
Cost of sort 1 = 100
50 × 2 + 100 = 300
Cost of sort 2 = 800
50 × 2 + 800 = 2400
Cost of SMJ = 300 + 2400 = 2700
Total cost = 100 + 300 + 2400 + 2700 = 5500
6