Description
This assignment relies on the lectures
• SQL Part 1 and SQL Part 2 (Pure SQL);
• Relational Algebra (RA);
• joins and semijoins;
• Translating Pure SQL queries into RA expressions; and
• Query optimization
with particular focus on the last two lectures.
To turn in your assignment, you will need to upload to Canvas a single
file with name assignment3.sql which contains the necessary SQL statements that solve the problems in this assignment. The assignment3.sql
file must be so that the AI’s can run it in their PostgreSQL environment.
You should use the Assignment-Script-2021-Fall-Assignment3.sql file
to construct the assignment3.sql file. (note that the data to be used for
this assignment is included in this file.) In addition, you will need to upload
a separate assignment3.txt file that contains the results of running your
queries. Finally, you need to upload a file assignment3.pdf that contains
the solutions to the problems that require it.
The problems that need to be included in the assignment3.sql are
marked with a blue bullet •. The problems that need to be included in the
assignment3.pdf are marked with a red bullet •. (You should aim to use
Latex to construct your .pdf file.)
1
Database schema and instances
For the problems in this assignment we will use the following database
schema:1
Person(pid, pname, city)
Company(cname, headquarter)
Skill(skill)
worksFor(pid, cname, salary)
companyLocation(cname, city)
personSkill(pid, skill)
hasManager(eid, mid)
Knows(pid1, pid2)
In this database we maintain a set of persons (Person), a set of
companies (Company), and a set of (job) skills (Skill). The pname
attribute in Person is the name of the person. The city attribute
in Person specifies the city in which the person lives. The cname attribute in Company is the name of the company. The headquarter
attribute in Company is the name of the city wherein the company has
its headquarter. The skill attribute in Skill is the name of a (job)
skill.
A person can work for at most one company. This information is
maintained in the worksFor relation. (We permit that a person does
not work for any company.) The salary attribute in worksFor specifies
the salary made by the person.
The city attribute in companyLocation indicates a city in which
the company is located. (Companies may be located in multiple cities.)
A person can have multiple job skills. This information is maintained in the personSkill relation. A job skill can be the job skill of
multiple persons. (A person may not have any job skills, and a job skill
may have no persons with that skill.)
A pair (e, m) in hasManager indicates that person e has person m as
one of his or her managers. We permit that an employee has multiple
managers and that a manager may manage multiple employees. (It
is possible that an employee has no manager and that an employee is
1The primary key, which may consist of one or more attributes, of each of these relations
is underlined.
2
not a manager.) We further require that an employee and his or her
managers must work for the same company.
The relation Knows maintains a set of pairs (p1, p2) where p1 and p2
are pids of persons. The pair (p1, p2) indicates that the person with pid
p1 knows the person with pid p2. We do not assume that the relation
Knows is symmetric: it is possible that (p1, p2) is in the relation but
that (p2, p1) is not.
The domain for the attributes pid, pid1, pid2, salary, eid, and
mid is integer. The domain for all other attributes is text.
We assume the following foreign key constraints:
• pid is a foreign key in worksFor referencing the primary key pid
in Person;
• cname is a foreign key in worksFor referencing the primary key
cname in Company;
• cname is a foreign key in companyLocation referencing the primary key cname in Company;
• pid is a foreign key in personSkill referencing the primary key
pid in Person;
• skill is a foreign key in personSkill referencing the primary
key skill in Skill;
• eid is a foreign key in hasManager referencing the primary key
pid in Person;
• mid is a foreign key in hasManager referencing the primary key
pid in Person;
• pid1 is a foreign key in Knows referencing the primary key pid in
Person; and
• pid2 is a foreign key in Knows referencing the primary key pid in
Person
3
Pure SQL and RA SQL
In this assignemt, we distinguish between Pure SQL and RA SQL.
Below we list the only features that are allowed in Pure SQL and in
RA SQL.
In particular notice that
• join, NATURAL join, and CROSS join are not allowed in Pure
SQL.
• The predicates [not] IN, SOME, ALL, [not] exists are not allowed
in RA SQL.
The only features allowed in Pure SQL
select … from … where
WITH …
union, intersect, except operations
exists and not exists predicates
IN and not IN predicates
ALL and SOME predicates
VIEWs that can only use the above RA SQL features
The only features allowed in RA SQL
select … from … where
WITH …
union, intersect, except operations
join … ON …, natural join, and CROSS join operations
VIEWs that can only use the above RA SQL features
commas in the from clause are not allowed
4
1 Theoretical problems related to query translation and optimization
1. Consider two RA expressions E1 and E2 over the same schema.
Furthermore, consider an RA expression F with a schema that is
not necessarily the same as that of E1 and E2.
Consider the following if-then-else query:
if F = ∅ then return E1
else return E2
So this query evaluates to the expression E1 if F = ∅ and to the
expression E2 if F 6= ∅.
We can formulate this query in SQL as follows2
:
select e1.*
from E1 e1
where not exists (select distinct row() from F)
union
select e2.*
from E2 e1
where exists (select distinct row() from F);
Remark 1 The subquery query
select distinct row() from F
returns the empty set if F = ∅ and returns the tuple () if F 6= ∅.
3
In RA, this query can be written as
π()(F).
I.e., the projection of F on an empty list of attributes.
• In function of E1, E2, and F, write an RA expression in standard
notation that expresses this if-then-else query.4
2
In this SQL query E1, E2, and F denote SQL queries corresponding to the RA expressions E1, E2, and F, respectively.
3The tuple () is often referred to as the empty tuple, i.e., the tuple without components.
It is akin to the empty string in the theory of formal languages. I.e., the string wihout
alphabet characters.
4Hint: consider using the Pure SQL to RA SQL translation algorithm.
5
2. Let R(x) be a unary relation that can store a set of integers R.
Consider the following boolean SQL query:
select not exists(select 1
from R r1, R 2
where r1.x <> r2.x) as fewerThanTwo;
This boolean query returns the constant “true” if R has fewer
than two elements and returns the constant “false” otherwise.
• Using the insights you gained from Problem 1, write an RA
expression in standard notation that expresses the above boolean
SQL query.5
3. In the translation algorithm from Pure SQL to RA we tacitly
assumed that the argument of each set predicate was a (possibly parameterized) Pure SQL query that did not use a union,
intersect, nor an except operation.
In this problem, you are asked to extend the translation algorithm
from Pure SQL to RA such that the set predicates [not] exists
are eliminated that have as an argument a Pure SQL query (possibly with parameters) that uses a union, intersect, or except
operation.
More specifically, consider the following types of queries using the
[not] exists set predicate.
select L1(r1,…,rn)
from R1 r1, …, Rn rn
where C1(r1,…,rn) and
[not] exists (select L2(s1,…,sm)
from S1 s1,…, S1 sm
where C2(s1,…,sm,r1,…,rn)
[union | intersect | except]
select L3(t1,…,tk)
from T1 t1, …, Tk tk
where C3(t1,…,tk,r1,…,rn))
Observe that there are six cases to consider:
(a) exists (… union …)
5Hint: recall that, in general, a constant value “a” can be represented in RA by an
expression of the form (A: a). (Here, A is some arbitrary attribute name.) Furthermore,
recall that we can express (A: a) in SQL as “select a as A”. Thus RA expressions for
the constants “true” and “false” can be the expressions (A: true) and (A: false),
respectively.
6
(b) exists (… intersect …)
(c) exists (… except …)
(d) not exists (… union …)
(e) not exists (… intersect …)
(f) not exists (… except …)
• Show how such SQL queries can be translated to equivalent RA
expressions in standard notation. Be careful in the translation
since you should take into account that projections do not in
general distribute over intersections and over set differences.
To get practice, first consider the following special case where
n = 1, m = 1, and k = 1. I.e., the following case: 6
select L1(r)
from R r
where C1(r) [not] exists (select L2(s)
from S s
where C2(s,r)
[union | intersect | except]
select L3(t)
from T t
where C3(t,r))
4. • Let R be a relation with schema (a, b, c) and let S be a relation
with schema (d, e).
Prove, from first principles7
, the correctness of the following rewrite
rule:
πa,d(R ./c=d S) = πa,d(πa,c(R) ./c=d πd(S)).
5. • Consider the same rewrite rule
πa,d(R ./c=d S) = πa,d(πa,c(R) ./c=d πd(S))
as in problem 4.
Furthermore assume that S has primary key d and that R has
foreign key c referencing this primary key in S.
How can you simplify this rewrite rule? Argue why this rewrite
rule is correct.
6Once you can handle this case, the general case is a similar.
7
In particular, do not use the rewrite rule of pushing projections over joins. Rather,
use Predicate Logic or TRC to provide a proof.
7
2 Translating Pure SQL queries to RA expressions
and optimized RA expressions
In this section, you are asked to translate Pure SQL queries into RA
SQL queries as well as in standard RA expressions using the translation
algorithm given in class. You are required to show the intermediate
steps that you took during the translation. After the translation, you
are asked to optimize the resulting RA expressions.
You can use the following letters, or indexed letters, to denote relation names in RA expressions:
P, P1, P2, · · · Person
C, C1, C2, · · · Company
S, S1, S2, · · · Skill
W, W1, W2, · · · worksFor
cL, cL1, cL2, · · · companyLocation
pS, pS1, pS2, · · · personSkill
hM, hM1, hM2, · · · hasManager
K, K1, K2, · · · Knows
We illustrate what is expected using an example.
Example 1 Consider the query “ Find each (p, c) pair where p is the
pid of a person who works for a company c located in Bloomington and
whose salary is the lowest among the salaries of persons who work for
that company.
A possible formulation of this query in Pure SQL is
select w.pid, w.cname
from worksfor w
where w.cname in (select cl.cname
from companyLocation cl
where cl.city = ’Bloomington’) and
w.salary <= ALL (select w1.salary
from worksfor w1
where w1.cname = w.cname);
which is translated to8
select q.pid, q.cname
from (select w.*
from worksfor w
where w.cname in (select cl.cname
from companyLocation cl
where cl.city = ’Bloomington’)
intersect
8Translation of ‘and’ in the ‘where’ clause.
8
select w.*
from worksfor w
where w.salary <= ALL (select w1.salary from worksfor w1 where w1.cname = w.cname)) q; which is translated to9 select q.pid, q.cname from (select w.* from worksfor w, companyLocation cl where w.cname = cl.cname and cl.city = ’Bloomington’ intersect (select w.* from worksfor w except select w.* from worksfor w, worksfor w1 where w.salary > w1.salary and w1.cname = w.cname)) q;
which is translated to10
select q.pid, q.cname
from (select w.*
from worksfor w, (select cl.* from companyLocation cl where cl.city = ’Bloomington’) cl
where w.cname = cl.cname
intersect
(select w.*
from worksfor w
except
select w.*
from worksfor w, worksfor w1
where w.salary > w1.salary and w1.cname = w.cname)) q;
which is translated to the RA SQL query11
select q.pid, q.cname
from (select w.*
from worksfor w
natural join (select cl.* from companyLocation cl where cl.city = ’Bloomington’) cl
intersect
(select w.*
from worksfor w
except
select w.*
from worksfor w join worksfor w1 on (w.salary > w1.salary and w1.cname = w.cname))) q;
This RA SQL query can be formulated as an RA expression in standard notation as follows:
πW.pid,W.cname(E ∩ (W − F))
where
E = πW.∗(W ./ σcity=Bloomington(cL))
and
F = πW.∗(W ./W.salary>W1.salary ∧ W1.cname=W.cname W1).
We can now commence the optimization.
9Translation of ‘in’ and ‘<= ALL’. 10Move ’constant’ condition. 11Introduction of natural join and join. 9 Step 1 Observe the expression E ∩(W −F). This expression is equivalent with (E ∩ W) − F. Then observe that, in this case, E ⊆ W. Therefore E ∩ W = E, and therefore E ∩(W − F) can be replaced by E − F. So the expression for the query becomes πW.pid,W.cname(E − F). Step 2 We now concentrate on the expression E = πW.∗(W ./ σcity=Bloomington(cL)). We can push the projection over the join and get πW.∗(W ./ πcname(σcity=Bloomington(cL))). Which further simplifies to W n σcity=Bloomington(cL). We will call this expression Eopt . Step 3 We now concentrate on the expression F = πW.∗(W ./W.salary>W1.salary ∧ W1.cname=W.cname W1).
We can push the projection over the join and get the expression
πW.∗(W ./W.salary>W1.salary ∧ W1.cname=W.cname πW1.cname,W1.salary(W1)).
We will call this expression F
opt
.
Therefore, the fully optimized RA expression is
πW.pid,W.cname(E
opt − F
opt).
I.e.,
πW.pid,W.cname(W n σcity=Bloomington(cL)−
πW.∗(W ./W.salary>W1.salary ∧ W1.cname=W.cname πW1.cname,W1.salary(W1))).
10
We now turn to the problems in this section.
6. Consider the query “Find the cname and headquarter of each company that employs persons who earn less than 55000 and who do
not live in Bloomington.”
A possible way to write this query in Pure SQL is
select c.cname, c.headquarter
from company c
where c.cname in (select w.cname
from worksfor w
where w.salary < 55000 and
w.pid = SOME (select p.pid
from person p
where p.city <> ’Bloomington’));
(a) • Using the Pure SQL to RA SQL translation algorithm,
translate this Pure SQL query to an equivalent RA SQL
query. Show the translation steps you used to obtain your
solution.
(b) • Optimize this RA SQL query and provide the optimized
expression in standard RA notation. Specify at least three
conceptually different rewrite rules that you used during the
optimization.
7. Consider the query “Find the pid of each person who has all-butone job skill.”
A possible way to write this query in Pure SQL is
select p.pid
from person p
where exists (select 1
from skill s
where (p.pid, s.skill) not in (select ps.pid, ps.skill
from personSkill ps)) and
not exists (select 1
from skill s1, skill s2
where s1.skill <> s2.skill and
(p.pid, s1.skill) not in (select ps.pid, ps.skill
from personSkill ps) and
(p.pid, s2.skill) not in (select ps.pid, ps.skill
from personSkill ps));
(a) • Using the Pure SQL to RA SQL translation algorithm,
translate this Pure SQL query to an equivalent RA SQL
11
query. Show the translation steps you used to obtain your
solution.
(b) • Optimize this RA SQL query and provide the optimized
expression in standard RA notation. Specify at least three
conceptually different rewrite rules that you used during the
optimization.
8. Consider the query “Find the pid and name of each person who
works for a company located in Bloomington but who does not
know any person who lives in Chicago.”
A possible way to write this query in Pure SQL is
select p.pid, p.pname
from person p
where exists (select 1
from worksFor w, companyLocation c
where p.pid = w.pid and w.cname = c.cname and c.city = ’Bloomington’) and
p.pid not in (select k.pid1
from knows k
where exists (select 1
from person p
where k.pid2 = p.pid and p.city = ’Chicago’));
(a) • Using the Pure SQL to RA SQL translation algorithm,
translate this Pure SQL query to an equivalent RA SQL
query. Show the translation steps you used to obtain your
solution.
(b) • Optimize this RA SQL query and provide the optimized
expression in standard RA notation. Specify at least three
conceptually different rewrite rules that you used during the
optimization.
12
9. Consider the query “Find the cname and headquarter of each company that (1) employs at least one person and (2) whose workers who make at most 70000 have both the programming and AI
skills.”
A possible way to write this query in Pure SQL is
select c.cname, c.headquarter
from company c
where exists (select 1 from worksfor w where w.cname = c.cname) and
not exists (select 1
from worksfor w
where w.cname = c.cname and w.salary <= 70000 and
(w.pid not in (select ps.pid from personskill ps where skill = ’Programming’) or
w.pid not in (select ps.pid from personskill ps where skill = ’AI’)));
(a) • Using the Pure SQL to RA SQL translation algorithm,
translate this Pure SQL query to an equivalent RA SQL
query. Show the translation steps you used to obtain your
solution.
(b) • Optimize this RA SQL query and provide the optimized
expression in standard RA notation. Specify at least three
conceptually different rewrite rules that you used during the
optimization.
10. Consider the following Pure SQL query.
select p.pid, exists (select 1
from hasManager hm1, hasManager hm2
where hm1.mid = p.pid and hm2.mid = p.pid and
hm1.eid <> hm2.eid)
from Person p;
This query returns a pair (p, t) if p is the pid of a person who
manages at least two persons and returns the pair (p, f) otherwise.12
(a) • Using the insights gained from Problem 2, translate this
Pure SQL query to an equivalent RA SQL query.
(b) • Optimize this RA SQL query and provide the optimized
expression in standard RA notation.
12t represent the boolean value true and f represents the boolean value false.
13