CS520 - Data Integration, Warehousing, and Provenance - 2023 Fall

Course Webpage for CS520 - 2023 Fall taught by Boris Glavic

Notes taken during Lecture

Below you find the notes taken in class. I am also using hand written notes that you can find here: https://www.evernote.com/shard/s131/sh/07c1ca72-d8e4-4880-a1c4-191953171027/b0IkbmPwompLoW7JlrMI0MHpOSfq7UIyUETyFD9yOtLXZGYcMMPHsQxC4A

Class <2023-08-22 Tue>

design schemas (schema heterogenity)

attributes

  • person: name, phone, dateofbirth
  • address: streetname, zipcode, apt, city, state, country

table (relational database)

option 1
  • person(pid,name, phone, dateofbirth, aid)
  • address(aid,streetname, zipcode, apt, city, state, country)
  • constraints:

    • FK: person.aid to address.aid
    • PK: person.pid and address.aid
option 2
  • multiple addresses per person
  • person(pid,name, phone, dateofbirth)
  • address(aid,pid,streetname, zipcode, apt, city, state, country)
option 3
  • multiple addresses per person
  • person(pid,name, phone, dateofbirth)
  • address(aid,streetname, zipcode, apt, city, state, country)
  • livesat(pid,aid)

semi-structured (JSON)

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
[
	{
		"pid": 1,
		"name": "Peter",
		"phone": "312-333-3333",
		"dateofbirth": "2023-01-01",
		"addresses": [
			{
				"aid" : 1,
				"streetname": "asdkasdjh",
				"zipcode": 60612,
				"apt": "13000b",
				"city": "Chicago",
				"state": "IL",
				"country": "USA"
			}
		]
	},
	{
		"pid": 2,
		"name": "Alice",
		"phone": "312-555-3333",
		"dateofbirth": "2023-01-01",
		"addresses": [
			{
				"aid" : 1,
				"streetname": "asdkasdjh",
				"zipcode": 60612,
				"apt": "13000b",
				"city": "Chicago",
				"state": "IL",
				"country": "USA"
			}
		]
	}
]

Class <2023-08-24 Thu>

Boolean logic

syntax

  • Set of Variable \(\mathbb{V}\)
  • atomic formula:

    • variable x (take boolean values)
    • constant { FALSE, TRUE }
  • formula:

    • any atomic formula is a formula
    • if \(\phi_1\) is a formula and \(\phi_2\) is a formula, then following are also formulas:

      • \(\phi_1 \land \phi_2\) (AND)
      • \(\phi_1 \lor \phi_2\) (OR)
      • \(\neg \phi_1\) (NOT)

semantics

  • Assignment \(\varphi: \mathbb{V} \to \{ \bot, \top \}\)
  • Given assignment, a formula evaluate to either \(\{\bot,\top\}\)
  • \(\varphi(x)\) = constant assigned to \(x\) by \(\varphi\)
  • \(\varphi(c) = c\)
  • \(\varphi(\phi_1 \land \phi_2) = \top\) if \(\varphi(\phi_1) = \top\) and \(\varphi(\phi_2) = \top\), otherwise return \(\bot\)
  • \(\varphi(\phi_1 \lor \phi_2) = \top\) if \(\varphi(\phi_1) = \top\) or \(\varphi(\phi_2) = \top\), otherwise return \(\bot\)
  • \(\varphi(\phi_1 \rightarrow \phi_2) = \top\) if \(\varphi(\phi_1) = \top\) and \(\varphi(\phi_2) = \top\) or \(\varphi(\phi_1) = \bot\), otherwise return \(\bot\)
  • \(\varphi(\neg \phi_1) = \top\) if \(\varphi(\phi_1) = \bot\), otherwise \(\top\)
  • tautologies: formula is true for every assignment

    • \(x \lor \neg x\)
  • satisfiability: if there exists an assignment for which the formula is true

    • example satisfiable formula \(x \lor \neg x\)
    • example not satisfiable formula \(x \land \neg x\)

Class <2023-09-05 Tue>

FO logic

examples

integers
  • predicates \(<(\cdot, \cdot)\), \(-(\cdot, \cdot)\)
  • domain \(\mathbb{D} = \{0,1,2,-1,-2\}\)
  • model \(<\)
0 1
-1 1

\(-\)

1 -1
-1 1
0 0
2 -2
  • \(\psi = \forall x: \exists y: -(x,y)\)
  • \(\psi[x = 1]\) then \(\exists y: -(1,y)\), yes, for y = -1
  • \(\exists x: <(x,0)\)
  • \(\forall x: \exists y: <(x,y)\)
  • \(<(x,0)\), can be interpreted as a query (return all \(x\) smaller than 0)
relations
  • database schema (predicates): takes(sid,cid) student(sid,name,major), course(cid,title, major)

takes

sid cid
1 1
1 2
2 1

student

sid name major
1 Peter CS
2 Bob CS

course

cid title major
1 CS520 CS
2 CS101 CS
  • \(\exists y,z: student(x,y,z)\)
1
SELECT sid FROM student;
  • \(\exists y,z,a: student(x,y,z) \land takes(x,a)\)
1
SELECT DISTINCT sid FROM student s, takes t WHERE s.sid = t.sid
  • \(\forall x: \exists y,z,a: student(x,y,z) \rightarrow takes(x,a)\)

some logical equivalences

  • \(a \vee (b \land c) \equiv (a \vee b) \land (a \lor c)\)
  • \(\neg (a \land b) \equiv \neg a \lor \neg b\)
  • \(\forall x: \phi(x) \equiv \neg \exists x: \neg \phi(x)\)
  • \(x \rightarrow y \equiv \neg x \lor y\)

integrity constraints

FDs

  • \(a \rightarrow b\)
  • example FD zip -> city over table address(zip,city)

*Expressing the FD in FO and rewritting the universal quantification into double negation to be able to express it in SQL *

  • \(\forall x,y,z: address(x,y) \land address(x,z) \rightarrow y = z\)
  • \(\neg \exists x,y,z: \neg( address(x,y) \land address(x,z) \rightarrow y = z)\) (\(\forall x: \phi(x) \equiv \neg \exists x: \neg \phi(x)\))
  • \(\neg \exists x,y,z: \neg( \neg(address(x,y) \land address(x,z)) \lor (y = z))\) (\(x \rightarrow y \equiv \neg x \lor y\))
  • \(\neg \exists x,y,z: (address(x,y) \land address(x,z) \land (y \neq z)\) (\(\neg \neg x \equiv x\) and \(\neg y = z \equiv y \neq z\))

Translation into SQL

1
SELECT NOT EXISTS (SELECT * FROM address a1, address a2 WHERE a1.zip = a2.zip AND z1.city <> z2.city);
zip city
60616 Chicago
60616 New York

Datalog

1
Q(X) :- R(X,Y), X < Y.
1
Q(X) :- R(X,Y), R(Y,Z).

connection

in out
Lake Madison
Madison IIT
IIT Hyde Park
Hyde Park IIT
1
Q(X) :- connection('Hyde Park', X).

Q

X
IIT
1
Q(X) :- connection(X, 'Hyde Park').

Q

X
IIT
  • 2 hops
1
Q(Start,End) :- connection(Start,Z), connection(Z, End).
  • 3 hops
1
Q(Start, End) :- connection(Start,Z), connection(Z, A), connection(A, End).
  • 2 hops or 3 hops
1
2
3
4
Qtwo(Start,End) :- connection(Start,Z), connection(Z, End).
Qthree(Start, End) :- Qtwo(Start,A), connection(A, End).
Q(Start,End) :- Qtwo(Start,End).
Q(Start,End) :- Qthree(Start,End).
  • 3 hops that are not 2 hops
1
2
3
Qtwo(Start,End) :- connection(Start,Z), connection(Z, End).
Qthree(Start, End) :- Q2(Start,A), connection(A, End).
Q(Start,End) :- Qthree(Start,End), not Qtwo(Start,End).
  • unsafe w/o negation
1
Q(X) :- R(Y).
  • unsafe negation (all variables in negative atoms, have to appear in positive atoms in the body)
1
Q(X) :- S(X), not R(X).
  • recursion
1
2
TC(X,Y) :- connection(X,Y).
TC(X,Y) :- TC(X,Z), connection(Z,Y).
  • \(I_0 = I\)
  • \(I_{i+1} = T_P(I_{i}) \cup I_{i}\)
  • \(I^{*} = I_n\) such that \(I_n = I_{n-1}\)
1
2
TC(X,Y) :- connection(X,Y).
TC(X,Y) :- TC(X,Z), TC(Z,Y).
1
2
connection('IIT', 'Hyde Park').
connection('IIT', 'Downtown').

Class <2023-09-12 Tue>

detecting constraint (FD) violations

  • zip -> city
  • P(zip,city)
  • self join on LHS check that at least one RHS attribute is different
1
2
3
4
SELECT EXISTS
         (SELECT 1
            FROM P p1, P p2
          WHERE p1.zip = p2.zip AND p1.city != p2.city)
  • zip, county -> city, region is equivalent zip, country -> city, zip, country -> region
  • A(zip,city,country,region)
1
2
3
4
SELECT EXISTS
         (SELECT 1
            FROM A a1, A a2
          WHERE a1.zip = a2.zip AND a1.country = a2.country AND  (a1.city != a2.city OR a2.region != a2.region)

find rows in violations

  • zip -> city
  • P(zip,city)
  • self join on LHS check that at least one RHS attribute is different
1
2
3
SELECT p1.zip AS zip1, p1.city AS city1, p2.zip AS zip2, p2.city AS city2
   FROM P p1, P p2
  WHERE p1.zip = p2.zip AND p1.city != p2.city AND p1.city < p2.city

find violations to denial constraints

  • $\forall \vec{x}: \neg \Phi(\vec{x})$
  • where $\Phi$ conjunction over relational atoms (predicates) and comparisons
  • E(eid,name,salary)
  • $\forall \neg (E(x,y,z) \land z < 0)$
  • $\forall \neg (E(\_,\_,z) \land z < 0)$
1
2
3
SELECT eid, salary
FROM E
WHERE salary < 0

Class <2023-09-14 Thu>

Incomplete Database

  • \(\mathbb{D} = \{D_1, \ldots, D_n\}\)
  • \(Q(\mathbb{D}) = \{Q(D_1, \ldots, Q(D_n) \}\) possible answer semantics
  • \(certain(Q,\mathbb{D})\) - certain answer

    • \(\{ t \mid \forall D \in \mathbb{D}: t \in Q(D) \}\)

Class <2023-09-19 Tue>

Class <2023-09-21 Thu>

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
INSERT INTO person2
SELECT Name,
       City AS address,
       Office-contact AS office-phone,
       NULL AS office-address,
       NULL AS home-phone
FROM person p, address a
WHERE p.Address = a.id;

INSERT INTO person2
SELECT Name,
       NULL AS address,
       NULL AS office-phone,
       NULL AS office-address,
       NULL AS home-phone
FROM person p;

INSERT INTO person2
SELECT NULL AS Name,
       City AS address,
       Office-contact AS office-phone,
       NULL AS office-address,
       NULL AS home-phone
FROM address a;
1
2
3
4
5
SELECT Name,
       City AS address,
       Office-contact AS office-phone
FROM person p, address a
WHERE p.Address = a.id;

Class <2023-09-26 Tue>

LOCAL SCHEMA
----------------------------------------
Person
   Name

GLOBAL SCHEMA
----------------------------------------

P2
   Name
   Age

MAPPING
----------------------------------------

Person.Name -> P2.Name

Person(X) = P2(X,Y).

Person

Name
Peter
Bob

P2

Name Age
Peter 1
Bob 1
Name Age
Peter 90
Bob 92

bag semantics example

  • Person(X) = P2(X,Y)

Person

Name
Peter
Peter
Bob
Bob

P2

Name Age
Peter 1
Peter 12
Bob 1
Bob 1

Class <2023-09-28 Thu>

1
2
3
Q(X,Y,Z) :- Person(X,Y,Z,_,_).
<=
Q(A,B,C) :- Person(A,X), Address(X,B,C).

Testing query equivalence for query answering with views

1
2
3
4
5
v2(I,T,Y) :- movie(I,T,Y,G), Y >= 1950, G=comedy.
v3(I,D) :- director(I,D), actor(I,D).
q2(T,Y,D) :- v2(I,T,Y), v3(I,D).

q(T,Y,D) :=  movie(I,T,Y,G), Y >= 1950, G=comedy, director(I,D), actor(I,D).
  • test containment

    1
    2
    3
    
    q2unfold(T,Y,D) :- movie(I,T,Y,G), Y >= 1950, G=comedy, director(I,D), actor(I,D).
    
    q(T,Y,D) :=  movie(I,T,Y,G), Y >= 1950, G=comedy, director(I,D), actor(I,D).
  • containment mapping q2unfold to q
I -> I
T -> T
Y -> Y
G -> G
D -> D
1
2
3
v6(T,Y) :- movie(I,T,Y,G), Y >= 1950, G=comedy.
v3(I,D) :- director(I,D), actor(I,D).
q1(T,Y,D) :- v6(T,Y), v3(I,D).
1
2
q(T,Y,D) :=  movie(I,T,Y,G), Y >= 1950, G=comedy, director(I,D), actor(I,D).
q1(T,Y,D) :- movie(I2,T,Y,G), Y >= 1950, G=comedy, director(I,D), actor(I,D).

Class <2023-10-03 Tue>

Class <2023-10-05 Thu>

Query containment

  • variable mapping $h$ from Q1 to Q2
  • head mapped to head
  • for each body atom \( R(\vec{x}) \) of Q1, \( h(R(\vec{x}) \in body(Q2) \)
  • test head to head first, this give you a partial containment mapping or you detect failure early on
  • for body

    • have to map to goals with the same predicate

      • start with goals whose predicates are unique in the query (or have very few occurrences)
      • utilize the partial containment mapping from the head-to-head mapping
  • \( Q_1 \rightarrow Q_4 \)
left var right var
Z A
A B
X C
Y C
  • \( Q_2 \to Q_1 \)
left var right var
Y
Z
A
B
  • \( Q_4 \to Q_3 \)
left var right var
A X
B Y
C
1
q4(X,Y) :- R(X,XXX), R(XXX,X), S(X,Y).
  • cannot map XXX to both A and Z.
  • \( Q_4' \to Q_3 \)
1
Q4(A,B) :- R(A,C), R(D,A), S(A,B).
left var right var
A X
B Y
C Z
D A
1
2
3
4
q1(X,Y) :- r(X,Y).
q2(X,Y) :- r(X,Y), r(Y,Z).
q3(X,Y) :- r(X,Y), r(X,Y).
q4(X,Y) :- r(X,Y), r(X,Z).
  • =Q4(c1,c2) :- r(c1,c2), r(c1,c2).
  • \( Q_1 \to Q_3 \)
left var right var
X X
Y Y
  • \( Q_4 \to Q_1 \)
left var right var
X X
Y Y
Z Y
1
2
Qa(X,X) :- R(X,X).
Qb(X,Y) :- R(X,Y), R(X,X).
  • Qa to Qb
left var right var
X N/A

no containment mapping exists

  • Qb to Qa
left var right var
X X
Y X

R(a1,a2)

a1 a2
1 1
1 2

Class <2023-10-10 Tue>

Constraint generation

Schema

  • *Camera*(brand,price,model,shutterspeed,zoom)
  • *Producer*(pid,name,revenue)

create FD

  • create functional dependency brand,model -> price

\[ \forall b,m,p_1,p_2,s_1,s_2,z_1,z_2: Camera(b,p_1,m,s_1,z_1) \land Camera(b,p_2,m,s_2,z_2) \rightarrow p_1 = p_2 \]

\[ \forall b_1,b_2,m_1,m_2,p_1,p_2,s_1,s_2,z_1,z_2: Camera(b_1,p_1,m_1,s_1,z_1) \land Camera(b_2,p_2,m_2,s_2,z_2) \] \[ \land b_1 = b_2 \land m_1 = m_2 \rightarrow p_1 = p_2 \]

create FK

  • create FK Camera.brand to Producer.pid

\[ \forall b,p,m,s,z: Camera(b,p,m,s,z) \rightarrow \exists n,r: Producer(b,n,r) \]

non-FD and FK constraints

  • if there are two cameras from the same brand and the first camera costs more than the second, then the shutterspeed of the first has to be higher than the shutterspeed of the second

\[ \forall: b_1,b_2,m_1,m_2,p_1,p_2,s_1,s_2,z_1,z_2: Camera(b_1,p_1,m_1,s_1,z_1) \land Camera(b_2,p_2,m_2,s_2,z_2) \] \[ \land b_1 = b_2 \land p_1 > p_2 \rightarrow s_1 > s_2 \]

  • if there two cameras from the same brand with a price $p_1$ for the first and $p_1 < p_2$ for the second, then there has to exist a third camera from the same brand such that $p_1 < p_3 < p_2$

\[ \forall: b_1,b_2,m_1,m_2,p_1,p_2,s_1,s_2,z_1,z_2: Camera(b_1,p_1,m_1,s_1,z_1) \land Camera(b_2,p_2,m_2,s_2,z_2) \] \[ \land b_1 = b_2 \land p_1 < p_2 \rightarrow \exists m_3,p_3,s_3,z_3: Camera(b_1,p_3,m_3,s_3,z_3) \land p_1 < l_3 \land p_3 < p_2 \]

Containment mappings

1
2
3
Q1 ( X ) : - R (X , Z ) , R (X , X ) , R (Z , A ).
Q2 ( X ) : - R (X , Y ) , R (Y , Z ).
Q3 ( X ) : - R (X , Z ) , R (Z , A ) , R (A , Y ).

Q1 to Q2

  • R(X,X) does not exist (head to head enforces X -> X
X -> X
Z ->
A ->

Q2 to Q1

X -> X
Y -> X
Z -> X

Q3 to Q1

X -> X
Y -> X
Z -> X
A -> X
X -> X
Y -> A
Z -> X
A -> Z

R(X,X), R(X,Z), R(Z,A)

Q2 to Q3

X -> X
Y -> Z
Z -> A

further examples

1
2
Q1(X) :- R(X), S(X)
Q2(X) :- R(X)
Q1 to Q2
  • no mapping exists
X -> X
Q2 to Q1
X -> X

Schema matching

  • Source: Person(Name, Address, Age)
  • Target: P2(LastName, City, Salary)
  • use only the schema (attribute names) and match if normalized edit distance < 0.6

Person

Name address age

P2

LastName City Salary
  • only Name and LastName matches (every other pair has edit distance < 0.8

Datalog

Fall 2016 - 1.1.1

Write a Datalog program that returns product titles (attribute title of relation product).

1
q(T) :- product(_,_,T,_).

Fall 2016 - 1.1.2

Write a Datalog program that returns the description and status of bugs for product “VCleaner”.

1
q(D,S) :- product(P,V,vcleaner,_), bug(_,P,V,D,S).

Fall 2016 - 1.1.4

Write a Datalog program that returns the names of supportes that are not assigned to any open bugs (attribute status).

1
2
opensup(N) :- supporter(N,_,_), bugassignment(N,B), bug(B,_,_,_,open).
q(N) :- supporter(N,_,_), not opensup(N).

universal quantification

Write a Datalog program the returns names of supporter that are assigned to all bug.

\[ \forall \vec{x}: \phi(\vec{x}) \Leftrightarrow \neg \exists \vec{x}: \neg \phi(\vec{x}) \]

1
2
missingbug(N) :- supporter(N,_,_), bug(B,_,_,_,_,_), not bugassignment(N,B).
q(N) :- supporter(N,_,_), not missingbug(N).

recursive query

Write a Datalog program that returns pairs of supporters such that either they directly share a bug assignment (they are assigned to the same bug) or we can find a sequence of other supporters connecting them via shared bug assignments.

1
2
3
shares(N1,N2) :- supporter(N1,_,_), supporter(N2,_,_), bugassignment(N1,B), bugassignment(N2,B).
q(N1,N2) :- shares(N1,N2).
q(N1,N2) :- q(N1,N3), shares(N3,N2).

tree recursion

  • table Tree(parent,left,right)
  • table Root(node)

Tree

parent left right
1 2 3
2 4 5
5 6 7

Root

node
1
  • compute ca(X,Y,Z) that contains all Z such that Z is a common ancestor of X, Y
1
2
3
4
5
path(X,Y) :- tree(X,Y,_).
path(X,Y) :- tree(X,_,Y).
path(X,Y) :- path(X,Z), path(Z,Y).

ca(X,Y,Z) :- path(Z,X), path(Z,Y).

translation to relational algebra

  • union the translation of all rules with the same head
  • if there are multiple IDB predicates, then translate based on topological sort
  • how to translate a rule:

    • for positive rules (no negation)

      • rename the attributes of goals in the body to match the names of variables of the goal (using the renaming operator $\rho$)
      • join goals in the body, then selection for comparison predicates, then projection for head

= R(A,B)

1
2
Q(X) :- R(X,Y), Y < 10.
Q(X) :- R(_,X).
  • in this case we can skip the renaming step to simplify the query

\[ \pi_{A}(\sigma_{B < 10}(R)) \cup \pi_{B}(R) \]

  • or using renaming:

\[ \pi_{X}(\sigma_{Y < 10}(\rho_{A\rightarrow X,B\rightarrow Y}(R))) \cup \pi_{X}(\rho_{A\rightarrow Z1,B\rightarrow X}(R)) \]

Class <2023-10-17 Tue>

universal data exchange solutions

\[ \forall x: S(x) \rightarrow \exists y: R(x,y). \]

I = { S(1), S(2) }

J = { R(1,N1), R(2,N2) }

iJ1 = { R(1,1), R(2,4) }
J2 = { R(1,2), R(2,6), R(3,3) }

homomorphisms from J to J1 and J2

h1: 1 -> 1
    2 -> 2
   N1 -> 1
   N2 -> 4
h2: 1 -> 1
    2 -> 2
   N1 -> 2
   N2 -> 6

query evaluation

Q() :- R(1,X).

  • evaluation over universal solution J
  • compose h with h1
h: X -> N1
Q() :- R(1,N1).
h1(h): X -> 1
Q() :- R(1,1).

Class <2023-10-19 Thu>

star schema

dimension tables

  • Location(lid, country, state, city , zip, shop), PK: lid
lid country state city zip shop
1 USA IL Chicago 60615 branch#1
2 USA IL Chicago 60615 branch#2
..
1000234 USA NY Buffalo 45554 branch#1412
  • Time(tid,year,month,day), PK tid
tid year month day
1 2015 01 01
2 2015 01 02
  • product(pid,type, brand, product), PK pid
pid type brand product
1 coffee tschibo soft blend

fact table

  • fact(lid,tid,pid,cost,revenue) PK: lid,tid,pid

    • FK lid to location
    • FK tid to time
    • FK pid to product
lid tid pid cost sales revenue
2 1 1 0.50 23 3.50

grouping queries

give me the total sales, the sales per year, and the sales per month

1
2
3
4
5
SELECT sum(sales), NULL AS year, NULL AS month FROM fact
UNION ALL
SELECT sum(sales), year, NULL AS month FROM fact GROUP BY year
UNION ALL
SELECT sum(sales), year, month FROM fact GROUP BY year, month;
1
SELECT sum(sales), year, month FROM fact GROUP BY ROLLUP (year,month);
  • ambiguity: NULL has two meanings:

    • we did not group on this attribute
    • the group-by value is NULL
1
2
3
4
5
SELECT sum(sales), 0 AS gb_year, 0 AS gb_month, NULL AS year, NULL AS month FROM fact
UNION ALL
SELECT sum(sales), 1 AS gb_year, 0 AS gb_month, year, NULL AS month FROM fact GROUP BY year
UNION ALL
SELECT sum(sales), 1 AS gb_year, 1 AS gb_month, year, month FROM fact GROUP BY year, month;
1
SELECT sum(sales), GROUPING(year) AS gb_year, GROUPING(month) AS gb_month FROM fact GROUP BY ROLLUP(year,month);
1
2
3
4
5
6
7
8
SELECT sum(sales), NULL AS year, NULL AS month, NULL AS city, NULL AS zip FROM fact
UNION ALL
SELECT sum(sales), NULL AS year, NULL AS month, NULL AS city, zip FROM fact
UNION ALL
SELECT sum(sales), NULL AS year, NULL AS month, city, zip FROM fact
UNION ALL
SELECT sum(sales), NULL AS year, month, NULL AS city, zip FROM fact
...

Class <2023-10-24 Tue>

1
SELECT sum(sales), year FROM fact GROUP BY ROLLUP(year)
  • equivalent to
1
2
3
SELECT sum(sales), NULL AS year FROM fact
UNION ALL
SELECT sum(sales), year FROM fact GROUP BY year -- query 2
  • we have the result grouped on year (query 2)

\[ a_1 + a_2 + a_3 + b_1 + b_2 + b_3 + b_4 \]

\[ a < b \Rightarrow h(a) < h(b) \]

\[ A = c \Rightarrow h(A) = h(c) \]

\[ n \cdot \log n \]

  • 10 partitions

\[ 10 \cdot (\frac{n}{10} \cdot \log \frac{n}{10}) = n \cdot \log \frac{n}{10} \]

Class <2023-10-26 Thu>

  • machine chance of failure p per hour
  • 1000 machine, probability that none fails
  • for independent events:

\[ P(A \cap B) = P(A) \cdot P(B) \]

\[ P(none) = (1-p)^{1000} \]

  • 1000 machines, the probability of at one failing (assuming independence)

\[ P(\text{1-or-more}) = 1 - P(none) = 1 - (1- p)^{1000} \]

1
2
3
4
5
p = 0.001
n = 1000
p_one_or_more = 1 - ((1 - p) ** n)

print(p_one_or_more)
0.6323045752290363

Class <2023-10-31 Tue>

A B
1 2
3 5
1 7
3 6
1
SELECT sum(B), A FROM R GROUP BY A;
A B
1 2
1 7
3 5
3 6

word count

  • input {(id,text)} where text is a document
  • output {(word,cnt)} where cnt is the number of times the word appears in all the doc
  • map: {(id,text)} -> { (word, 1) | word in text.split(" ") }
  • reduce: {(word,1), ...., (word,1), ...} -> {(word, SUM(v1, .., vn))}

-(1,"The green fox jumped over the moon") -> {(The,1), (green,1), (fox,1), (jumped,1) ...}

Class <2023-11-02 Thu>

relational operators as map reduce jobs

  • selection: just map $f$

    • $\sigma_\theta$ as $f(t) = \{t\}$ if $t \models \theta$ else $\emptyset$
  • projection: just map $f$

    • $\Pi_A$ as $f(t) = \{t.A\}$
  • aggregation: mapper $f$ and reducer $g$

    • $\gamma_{G,agg(a)}$ $f(t) = \{(t.G, a)\}$ and $g((k,[a_1, \ldots, a_n])) = \{ (k,agg([a_1, \ldots, a_n])) \}$
    • $\gamma_{b,sum(a)}$
  • join: both map $f$ and reduce $g$

    • $R \bowtie_{R.a = S.b} S$
    • $f_R(t) = \{ (t.a, (1,t)) \}$
    • $f_S(t) = \{(t.b,(2,t)) \}$
    • $g$

R

a c
1 1
1 2
2 4

S

b d
1 4
1 5
3 5
key value
1 (1,1,1)
1 (1,1,2)
1 (2,1,4)
1 (2,1,5)
2 (1,2,4)
3 (2,3,5)
  • union:

Class <2023-11-09 Thu>

A B TID
1 2 t1
1 3 t2
1 4 t3
2 3 t4
1
SELECT DISTINCT A FROM R;
1
Q(X) :- R(X,Y).

\[ \pi_{A}(R) \]

A TID
1 o1
2 o2

\[ WHY(Q,D,o_1) = \{ \{t_1\}, \{t_2\}, \{t_3\}, \{t_1, t_2\}, \{t_2,t_3\}, \{t_1,t_4\}, \ldots\} \]

\[ MWHY(Q,D,o_1) = \{ \{t_1\}, \{t_2\}, \{t_3\} \} \]

Class <2023-11-20 Mon>

\[ D'\,\text{is minimal if}\, \forall D'': D'' \subset D': D''\,\text{is not a witness} \]

Brute force algorithm (MWhy)

  • Input: database \(D\), query \(Q\), and tuple \(t \in Q(D)\)
  • Output: \(MWhy(Q,D,t)\)
1
2
3
4
5
MWHY = emptyset
forall D' subsets D:
    if t in Q(D'):
       MWHY.add(D')
return minimize(MWHY)
  • runtime complexity: \(O(2^n)\) for a database \(D\) with \(n\) rows

derive W (set of witnesses) using the top-down rules.

\(\pi_A(R)\)

  • rule for \(\pi\)

\[ W(\pi_{A}(R),o_1,D) = W(R,t_1,D) \cup W(R,t_2,D) \cup W(R,t_3,D) \]

  • rule table access \(R\)

\[ = \{\{t_1\}\} \cup \{\{t_2\}\} \cup \{\{t_3\}\} \]

\[ = \{\{t_1\}, \{t_2\}, \{t_3\}\} \]

\(\pi_{A}(R) \bowtie_{A=C} S\)

R

A B tid
1 10 r1
1 20 r2
2 30 r3
3 40 r4

S

C D tid
1 a s1
1 b s2
2 c s3

$\pi_C(S)$

C tid
1 c1
2 c2

\[\pi_{A}(R)\]

A tid
1 p1
2 p2
3 p3

\(Q_{j} = \pi_{A}(R) \bowtie_{A=C} \pi_{C}(S)\)

A C tid
1 1 o1
2 2 o3
  • \(W(Q_j, o1)\)
  • rule for \(\bowtie\)

\[ W(\pi_{A}(R) \bowtie_{A=C} \pi_{C}(S), o_1) = \{ (w_1 \cup w_2) \mid w_1 \in W(\pi_A(R), p_1) \land w_2 \in W(\pi_{C}(S),c_1) \} \]

  • rule for \(\pi\) and \(R\) (table access)

\[ W(\pi_{A}(R) \bowtie_{A=C} \pi_{C}(S), o_1) = \{ (w_1 \cup w_2) \mid w_1 \in \{\{r_1\},\{r_2\}\} \land w_2 \in \{\{s_1\},\{s_2\}\} \} \]

\[ W(\pi_{A}(R) \bowtie_{A=C} \pi_{C}(S), o_1) = \{ \{r_1,s_1\},\{r_2,s_1\} \} \cup \{\{r_1,s_2\},\{r_2,s_2\}\} \] \[ = \{ \{r_1,s_1\},\{r_2,s_1\}, \{r_1,s_2\},\{r_2,s_2\}\} \]

annotated databases

natural numbers \(\mathbb{N}\) (bag semantics)

R

A B
1 1
1 1
1 2
2 2
2 2

R using \(\mathbb{N}\) annotation

A B \(\mathbb{N}\)
1 1 2
1 2 1
2 2 2

MWhy annotations \(\mathcal{P}(\mathcal{P}(D))\)

A C \(\mathcal{P}(\mathcal{P}(D))\)
1 1 \(\{ \{r_1,s_1\},\{r_2,s_1\}, \{r_1,s_2\},\{r_2,s_2\}\}\)
2 2 \(\{\{r_3,s_3\}\}\)

\(\mathbb{N}\)-relation

  • \(\mathcal{U} = \{1,2\}\)
  • 2-ary \(\mathbb{N}\) -relation over \(\mathcal{U}\)

R

A B
1 1
1 1
2 2

R \(\mathbb{N}\) -relation

  • \(R((1,1)) = 2\)
  • \(R((1,2)) = 0\)
  • \(R((2,1)) = 0\)
  • \(R((2,2)) = 1\)
A B \(\mathbb{N}\)
1 1 2
2 2 1

\(\mathcal{P}(\mathcal{P}(D))\), \(\emptyset\) as zero

A C \(\mathcal{P}(\mathcal{P}(D))\)
1 1 \(\{ \{r_1,s_1\},\{r_2,s_1\}, \{r_1,s_2\},\{r_2,s_2\}\}\)
2 2 \(\{\{r_3,s_3\}\}\)
1 2 \(\emptyset\)
2 2 \(\emptyset\)

Semiring \(\mathbb{B}\)

  • \(\mathbb{B} = (\{false,true\}, \lor, \land, false, true)\)
  • \(k_1 \lor k_2 = k_2 \lor k_1\)
\(k_1\) \(k_2\) \(k_1 \lor k_2\) \(k_2 \lor k_1\)
false false false false
false true true true
true false true true
true true true true
  • \(k \lor false = k\)
  • \(k \land true = k\)
  • \(k \land false = false\)
  • \(f(x) = 2x\)

Queries over annotated DBs

\(\mathbb{B}\) (set semantics)

  • \(\mathbb{B} = (\{false,true\}, \lor, \land, false, true)\)

R

A B \(\mathbb{B}\)
1 1 true
1 2 true
1 3 true
2 2 true

\(\pi_A(\sigma_{B < 3}(R))\)

  • \(\sigma_{B < 3}(R)\)
A B \(\mathbb{B}\)
1 1 \(true \land true = true\)
1 2 \(true \land true = true\)
1 3 \(true \land false = false\)
2 2 \(true \land true = true\)
  • \(\pi_A(\sigma_{B < 3}(R))\)
A \(\mathbb{B}\)
1 \(true \lor true \lor false = true\)
2 \(true\)

\(\mathbb{N}\) (bag semantics)

  • \(\mathbb{N} = (\mathbb{N}, +, \cdot, 0, 1)\)

R

A B \(\mathbb{N}\)
1 1 2
1 2 4
1 3 1
2 2 5

\(\pi_A(\sigma_{B < 3}(R))\)

  • \(\sigma_{B < 3}(R)\)
A B \(\mathbb{N}\)
1 1 \(2 \cdot 1 = 2\)
1 2 \(4 \cdot 1 = 4\)
1 3 \(1 \cdot 0 = 0\)
2 2 \(5 \cdot 1 = 5\)
  • \(\pi_A(\sigma_{B < 3}(R))\)
A \(\mathbb{N}\)
1 \(2 + 4 + 0 = 6\)
2 \(5\)
A B
1 1
1 1
1 2
1 2
1 2
1 2
1 3
2 2
2 2
2 2
2 2
2 2

\(\pi_A(\sigma_{B < 3}(R))\)

A
1
1
1
1
1
1
2
2
2
2
2

Class <2023-11-21 Tue>

R

A B TID
1 1 t1
1 2 t2
1 3 t3

\(\pi_A(R)\)

A TID
1 o1

\[ Why(Q,D,o_1) = \{ \{t_1\}, \{t_2\}, \{t_3\}, \{t_1, t_2\}, \{t_1,t_3\}, \{t_2,t_3\}, \{t_1, t_2, t_3\} \} \]

n = 3
subsets = 2^n
nonsubsets = subsets - 1
n = 3 => n: 3
subsets = 2^n => subsets: 8
nonsubsets = subsets - 1 => nonsubsets: 7

nonsubsets: 7

n = 10000
subsets = 2^n
nonsubsets = subsets - 1
n = 10000 => n: 10,000
subsets = 2^n => subsets: 19,950,631,168,807,583,848,837,421,626,835,850,838,234,968,318,861,924,548,520,089,498,529,438,830,221,946,631,919,961,684,036,194,597,899,331,129,423,209,124,271,556,491,349,413,781,117,593,785,932,096,323,957,855,730,046,793,794,526,765,246,551,266,059,895,520,550,086,918,193,311,542,508,608,460,618,104,685,509,074,866,089,624,888,090,489,894,838,009,253,941,633,257,850,621,568,309,473,902,556,912,388,065,225,096,643,874,441,046,759,871,626,985,453,222,868,538,161,694,315,775,629,640,762,836,880,760,732,228,535,091,641,476,183,956,381,458,969,463,899,410,840,960,536,267,821,064,621,427,333,394,036,525,565,649,530,603,142,680,234,969,400,335,934,316,651,459,297,773,279,665,775,606,172,582,031,407,994,198,179,607,378,245,683,762,280,037,302,885,487,251,900,834,464,581,454,650,557,929,601,414,833,921,615,734,588,139,257,095,379,769,119,277,800,826,957,735,674,444,123,062,018,757,836,325,502,728,323,789,270,710,373,802,866,393,031,428,133,241,401,624,195,671,690,574,061,419,654,342,324,638,801,248,856,147,305,207,431,992,259,611,796,250,130,992,860,241,708,340,807,605,932,320,161,268,492,288,496,255,841,312,844,061,536,738,951,487,114,256,315,111,089,745,514,203,313,820,202,931,640,957,596,464,756,010,405,845,841,566,072,044,962,867,016,515,061,920,631,004,186,422,275,908,670,900,574,606,417,856,951,911,456,055,068,251,250,406,007,519,842,261,898,059,237,118,054,444,788,072,906,395,242,548,339,221,982,707,404,473,162,376,760,846,613,033,778,706,039,803,413,197,133,493,654,622,700,563,169,937,455,508,241,780,972,810,983,291,314,403,571,877,524,768,509,857,276,937,926,433,221,599,399,876,886,660,808,368,837,838,027,643,282,775,172,273,657,572,744,784,112,294,389,733,810,861,607,423,253,291,974,813,120,197,604,178,281,965,697,475,898,164,531,258,434,135,959,862,784,130,128,185,406,283,476,649,088,690,521,047,580,882,615,823,961,985,770,122,407,044,330,583,075,869,039,319,604,603,404,973,156,583,208,672,105,913,300,903,752,823,415,539,745,394,397,715,257,455,290,510,212,310,947,321,610,753,474,825,740,775,273,986,348,298,498,340,756,937,955,646,638,621,874,569,499,279,016,572,103,701,364,433,135,817,214,311,791,398,222,983,845,847,334,440,270,964,182,851,005,072,927,748,364,550,578,634,501,100,852,987,812,389,473,928,699,540,834,346,158,807,043,959,118,985,815,145,779,177,143,619,698,728,131,459,483,783,202,081,474,982,171,858,011,389,071,228,250,905,826,817,436,220,577,475,921,417,653,715,687,725,614,904,582,904,992,461,028,630,081,535,583,308,130,101,987,675,856,234,343,538,955,409,175,623,400,844,887,526,162,643,568,648,833,519,463,720,377,293,240,094,456,246,923,254,350,400,678,027,273,837,755,376,406,726,898,636,241,037,491,410,966,718,557,050,759,098,100,246,789,880,178,271,925,953,381,282,421,954,028,302,759,408,448,955,014,676,668,389,697,996,886,241,636,313,376,393,903,373,455,801,407,636,741,877,711,055,384,225,739,499,110,186,468,219,696,581,651,485,130,494,222,369,947,714,763,069,155,468,217,682,876,200,362,777,257,723,781,365,331,611,196,811,280,792,669,481,887,201,298,643,660,768,551,639,860,534,602,297,871,557,517,947,385,246,369,446,923,087,894,265,948,217,008,051,120,322,365,496,288,169,035,739,121,368,338,393,591,756,418,733,850,510,970,271,613,915,439,590,991,598,154,654,417,336,311,656,936,031,122,249,937,969,999,226,781,732,358,023,111,862,644,575,299,135,758,175,008,199,839,236,284,615,249,881,088,960,232,244,362,173,771,618,086,357,015,468,484,058,622,329,792,853,875,623,486,556,440,536,962,622,018,963,571,028,812,361,567,512,543,338,303,270,029,097,668,650,568,557,157,505,516,727,518,899,194,129,711,337,690,149,916,181,315,171,544,007,728,650,573,189,557,450,920,330,185,304,847,113,818,315,407,324,053,319,038,462,084,036,421,763,703,911,550,639,789,000,742,853,672,196,280,903,477,974,533,320,468,368,795,868,580,237,952,218,629,120,080,742,819,551,317,948,157,624,448,298,518,461,509,704,888,027,274,721,574,688,131,594,750,409,732,115,080,498,190,455,803,416,826,949,787,141,316,063,210,686,391,511,681,774,304,792,596,709,376
nonsubsets = subsets - 1 => nonsubsets: 19,950,631,168,807,583,848,837,421,626,835,850,838,234,968,318,861,924,548,520,089,498,529,438,830,221,946,631,919,961,684,036,194,597,899,331,129,423,209,124,271,556,491,349,413,781,117,593,785,932,096,323,957,855,730,046,793,794,526,765,246,551,266,059,895,520,550,086,918,193,311,542,508,608,460,618,104,685,509,074,866,089,624,888,090,489,894,838,009,253,941,633,257,850,621,568,309,473,902,556,912,388,065,225,096,643,874,441,046,759,871,626,985,453,222,868,538,161,694,315,775,629,640,762,836,880,760,732,228,535,091,641,476,183,956,381,458,969,463,899,410,840,960,536,267,821,064,621,427,333,394,036,525,565,649,530,603,142,680,234,969,400,335,934,316,651,459,297,773,279,665,775,606,172,582,031,407,994,198,179,607,378,245,683,762,280,037,302,885,487,251,900,834,464,581,454,650,557,929,601,414,833,921,615,734,588,139,257,095,379,769,119,277,800,826,957,735,674,444,123,062,018,757,836,325,502,728,323,789,270,710,373,802,866,393,031,428,133,241,401,624,195,671,690,574,061,419,654,342,324,638,801,248,856,147,305,207,431,992,259,611,796,250,130,992,860,241,708,340,807,605,932,320,161,268,492,288,496,255,841,312,844,061,536,738,951,487,114,256,315,111,089,745,514,203,313,820,202,931,640,957,596,464,756,010,405,845,841,566,072,044,962,867,016,515,061,920,631,004,186,422,275,908,670,900,574,606,417,856,951,911,456,055,068,251,250,406,007,519,842,261,898,059,237,118,054,444,788,072,906,395,242,548,339,221,982,707,404,473,162,376,760,846,613,033,778,706,039,803,413,197,133,493,654,622,700,563,169,937,455,508,241,780,972,810,983,291,314,403,571,877,524,768,509,857,276,937,926,433,221,599,399,876,886,660,808,368,837,838,027,643,282,775,172,273,657,572,744,784,112,294,389,733,810,861,607,423,253,291,974,813,120,197,604,178,281,965,697,475,898,164,531,258,434,135,959,862,784,130,128,185,406,283,476,649,088,690,521,047,580,882,615,823,961,985,770,122,407,044,330,583,075,869,039,319,604,603,404,973,156,583,208,672,105,913,300,903,752,823,415,539,745,394,397,715,257,455,290,510,212,310,947,321,610,753,474,825,740,775,273,986,348,298,498,340,756,937,955,646,638,621,874,569,499,279,016,572,103,701,364,433,135,817,214,311,791,398,222,983,845,847,334,440,270,964,182,851,005,072,927,748,364,550,578,634,501,100,852,987,812,389,473,928,699,540,834,346,158,807,043,959,118,985,815,145,779,177,143,619,698,728,131,459,483,783,202,081,474,982,171,858,011,389,071,228,250,905,826,817,436,220,577,475,921,417,653,715,687,725,614,904,582,904,992,461,028,630,081,535,583,308,130,101,987,675,856,234,343,538,955,409,175,623,400,844,887,526,162,643,568,648,833,519,463,720,377,293,240,094,456,246,923,254,350,400,678,027,273,837,755,376,406,726,898,636,241,037,491,410,966,718,557,050,759,098,100,246,789,880,178,271,925,953,381,282,421,954,028,302,759,408,448,955,014,676,668,389,697,996,886,241,636,313,376,393,903,373,455,801,407,636,741,877,711,055,384,225,739,499,110,186,468,219,696,581,651,485,130,494,222,369,947,714,763,069,155,468,217,682,876,200,362,777,257,723,781,365,331,611,196,811,280,792,669,481,887,201,298,643,660,768,551,639,860,534,602,297,871,557,517,947,385,246,369,446,923,087,894,265,948,217,008,051,120,322,365,496,288,169,035,739,121,368,338,393,591,756,418,733,850,510,970,271,613,915,439,590,991,598,154,654,417,336,311,656,936,031,122,249,937,969,999,226,781,732,358,023,111,862,644,575,299,135,758,175,008,199,839,236,284,615,249,881,088,960,232,244,362,173,771,618,086,357,015,468,484,058,622,329,792,853,875,623,486,556,440,536,962,622,018,963,571,028,812,361,567,512,543,338,303,270,029,097,668,650,568,557,157,505,516,727,518,899,194,129,711,337,690,149,916,181,315,171,544,007,728,650,573,189,557,450,920,330,185,304,847,113,818,315,407,324,053,319,038,462,084,036,421,763,703,911,550,639,789,000,742,853,672,196,280,903,477,974,533,320,468,368,795,868,580,237,952,218,629,120,080,742,819,551,317,948,157,624,448,298,518,461,509,704,888,027,274,721,574,688,131,594,750,409,732,115,080,498,190,455,803,416,826,949,787,141,316,063,210,686,391,511,681,774,304,792,596,709,375
  • minimize

\[ MWhy(Q,D,o_1) = \{ \{t_1\}, \{t_2\}, \{t_3\} \} \]

R

A B TID
a 1 r1
a 2 r2
b 1 r3

S

C D TID
1 d s1
2 d s2
3 d s3

\[ \pi_{A,D}(R \bowtie_{B=C} S) \]

A D TID
a d o1
b d o2

\[ MWhy(Q,o_1) = \{ \{r_1,s_1\}, \{r_2,s_2\} \} \]

\[ MWhy(Q,o_2) = \{ \{r_3, s_1\} \} \]

\[ D' \subset D: Q(D') = Q(D) = o_1 \in Q(D') \land o_2 \in Q(D') \]

\[ MWhy(Q) = \{ \{r_1,r_3,s_1\}, \{r_2,s_2,r_3,s_1\} \} \]

Class <2023-11-28 Tue>

\[ \pi_{A}(R) \]

R

A B \(\mathbb{N}[X]\)
1 1 x1
1 2 x2
1 3 x3
2 1 x4
A \(\mathbb{N}[X]\)
1 x1 + x2 + x3
2 x4
  • \(\mathbb{B}\) - set semantics
A B \(\mathbb{N}[X]\)
1 1 x1=true
1 2 x2=false
1 3 x3=true
2 1 x4=false
A B
1 1
1 3
A \(\mathbb{B}\)
1 \(true \lor false \lor true = true\)
2 \(false\)
  • \(\mathbb{N}\) - bag semantics
A B
1 1
1 1
1 2
1 3
1 3
1 3
2 1
2 1
A B \(\mathbb{N}[X]\)
1 1 x1=2
1 2 x2=1
1 3 x3=3
2 1 x4=2
A \(\mathbb{N}[X]\)
1 2 + 1 + 3 = 6
2 2

R

A \(\mathbb{N}[X]\)
1 r1
2 r2
3 r3

S

C \(\mathbb{N}[X]\)
1 s1
2 s2
4 s3
  • \[R \cup S\]
A \(\mathbb{N}[X]\)
1 r1 + s1
2 r2 + s2
3 r3
4 s3
  • \[R \bowtie_{A=C} S\]
A C \(\mathbb{N}[X]\)
1 1 r1 * s1
2 2 r2 * s2

CAP Theorem

  • Consistency
  • Availability
  • Partition tolerance

S1

item counter
coke 2
beer 2

S2

item counter
coke 3
beer 1

S1 merge S2

item counter
coke 3
beer 2
Last updated on 14 Aug 2023
Published on 14 Aug 2023