CS520 - Data Integration, Warehousing, and Provenance - 2022 Spring

Course Webpage for CS520 - 2022 Spring taught by Boris Glavic

Notes taken during Lecture

Below you find the notes taken in class. I am also using hand written notes that you can find here: https://www.evernote.com/l/AIOM1BQDeRBPzJzmVaJsngnE4-8CYA48p7g

<2022-01-12 Wed>

design a person and address schema

SQL

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
-- option 1
CREATE TABLE person (
  fname,
  lname,
  adress1,
  adress2,
  city,
  state,
  zip
  );

-- option 2
CREATE TABLE person (
  name,
  ssn,
  PRIMARY KEY(ssn),
  );

CREATE TABLE address (
  city,
  zip,
  street,
  PRIMARY KEY(city,zip,street),
);

CREATE TABLE livesat (
  ssn,
  city,
  zip,
  street
  );
-- option 3
CREATE TABLE person (
  fname,
  lname,
  zip
  );

Person

Name SSN zip city
Peter 111 60616 Chicago
Peter 111 60657 Chicago
Alice 222 60615 Chicago
Bob 333 60657 Chicago

JSON

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
[
{
    "Person":
    {
        "name": "Peter",
        "SSN": 111
        "Zip": 60614
    }
},
{
    "Person":
    {
        "name": "Peter",
        "SSN": 111
        "Zip": 60657
    }
}
]

// option 2
{
    "Name": "Peter",
	"SSN" : "111"
}

// option 3
[
    {
    "name" : "Peter" ,
    "SSN" : "111",
    "ZIP" : ["60616","60657"],
    "City" : "Chicago"
}

    {
    "name" : "Peter" ,
    "SSN" : "111",
    "ZIP" : ["60616","60657", "60645"],
    "City" : ["Chicago", "Chicago", "New York"]
    }


]

// option 4
{
    "name": "Peter",
    "ssn": "111",
    "addresses": [
        {"zip": "60616", "city": "Chicago"},
        {"zip": "60657", "city": "Chicago"}
    ]
}

// option 4
{
    "persons": [
        {
        "name": "Peter",
        "ssn": "111"
    }],
    "addresses": [
        {"zip": "60616", "city": "Chicago", "ssn": "111"},
        {"zip": "60657", "city": "Chicago", "ssn": "111"}
    ]
}

FO logic

signature

  • animal(X) and plan(X) and bird(X) and canfly(X) and canswim(X)

model

  • $\mathbb{D} = \{ penguin, swallow, cow, human, tree \}$
  • $bird =\{ (swallow), (penguin) \}$
  • $animal = \{ (cow), (swallow), (penguin), (human) \}$
  • $plant = \{ (tree) \}$
  • $canfly = \{ (swallow) \}$
  • $canswin = \{ (penguin) \}$
  • $equals = \{ (swallow, swallow), (penguin, penguin), \ldots$

formulas

Can Swallows fly?

$$canfly(swallow)$$ $$\forall s: swallow(s) \rightarrow canfly(s)$$ $$\forall x: equals(x,swallow) \rightarrow canfly(x)$$

All birds can fly

$$\forall x: canfly(x)$$ - this does not work, it requires that everything can fly not just birds

$$\forall x: bird(x) \rightarrow canfly(x)$$

No plant can fly

$$\forall x: plant(x) \rightarrow \neg canfly(x)$$

<2022-01-19 Wed>

<2022-01-26 Wed>

1
2
3
SELECT A
FROM R,S
WHERE A = B;
1
2
SELECT A
FROM R JOIN S ON (A=B);

a -> b -> c

a - b - c

1
2
Q(X) :- R(X,Y).
Q'(A) :- R(A,SOMETHING).

Variable mapping Q to Q': $\Psi_1$ = { X -> SOMETHING, Y -> A }

  • mapping head to head: { X -> A, Y -> A }
  • body contained in body: $\Psi_1(R(X,Y))$ = R(A,A)
  • mapping head to head: { X -> A, Y -> SOMETHING }
  • body cantained in body: $\Psi_2(R(X,Y))$ = R(A,SOMETHING) Q1() :- R(a,b), R(c,b). Q2() :- R(x,y).
  • Containment mapping: x-> a, y -> b

    • database D { R(1,1) }
  • a -> 1, b -> 1, c -> 1 x -> 1, y -> 1

    Q1() :- R(1,1), R(1,1).

    Q1() :- R(a,b), R(c,b).

    Q2() :- R(x,y).

<2022-01-31 Mon>

<2022-02-02 Wed>

Does DB fulfill constraint

R: A -> B

1
2
Q() :- R(X,Y), R(Z,A), X=Z, Y != A.
QFULL() :- not Q().
  • this does not work:
1
Q() :- R(X,Y), R(Z,A), X=Z, Y=A.
extend the example
1
Q(X,Y,Z,A,B,C) :- R(X,Y,Z), R(A,B,C), X=A, Y<B.

<2022-02-07 Mon>

<2022-02-09 Wed>

<2022-02-14 Mon>

1
2
3
4
5
6
7
SELECT p.Name AS Name,
       a.City AS Address,
       a.Office-contact AS Office-phone
--         ? AS Office-address,
--         ? AS Home-phone
FROM Person p, Address a
WHERE p.Adress = a.Id
1
2
3
4
5
6
CREATE VIEW personTarget AS (
SELECT p.Name AS Name,
       a.City AS Address,
       a.Office-contact AS Office-phone
FROM Person p, Address a
WHERE p.Address = a.Id);
1
SELECT Name FROM personTarget;
1
2
3
4
5
SELECT Name FROM (SELECT p.Name AS Name,
       a.City AS Address,
       a.Office-contact AS Office-phone
FROM Person p, Address a
WHERE p.Address = a.Id) AS persontTarget;

<2022-02-16 Wed>

<2022-02-23 Wed>

<2022-02-28 Mon>

1
2
3
4
5
6
WITH numst AS (SELECT count(*) as nums, dept FROM student GROUP BY dept),
     topd AS (SELECT nums, dept
                FROM numst
               ORDER BY nums DESC
               LIMIT 1)
SELECT * FROM topd, topd;
1
2
3
4
5
6
7
8
9
WITH numst AS (SELECT count(*) as nums, dept FROM student GROUP BY dept),
SELECT * FROM (SELECT nums, dept
                FROM numst
               ORDER BY nums DESC
               LIMIT 1),
              (SELECT nums, dept
                FROM numst
               ORDER BY nums DESC
               LIMIT 1);

<2022-03-02 Wed>

<2022-03-21 Mon>

Question 1.1

1
2
Q(Lastname,GPA) :- student(_,_,Lastname,Major,GPA), Major = cs.
Q(Lastname,GPA) :- student(_,_,Lastname,cs,GPA).

Question 1.2

1
2
Q(F,L) :- student(S,F,L,_,_), interest(S,hacking).
Q(F,L) :- student(S,F,L,_,_), interest(S,surfing).

Question 1.3

1
2
QnotHack(S) :- interest(S,A), A <> hacking.
Q(S) :- interest(S,hacking), not QnotHack(S).

Question 1.4

1
2
Qso(S,L,O) :- student(S,_,L,_,_), interest(S,A), event(_,O,A,_).
Q(S1,L1,S2,L2) :- Qso(S1,L1,O), Qso(S2,L2,O), S1 < S2.

Question 1.5

  • student 1 interest is $I_1$ and student 2 interest is $I_2$, we want $I_1 = I_2$,

    • which is $(I_1 - I_2) \cup (I_2 - I_1) = \emptyset$.
    • $\forall e: e \in I_1 \Leftrightarrow e \in I_2$

set in DL

  • table $R(A,B)$ and table $S(c,d)$ and we want $R - S$
1
Q(X,Y) :- R(X,Y), not S(X,Y).

set difference approach

1
2
3
4
5
Qsid(S) :- student(S,_,_,_,_).
QDiff(S1,S2) :- interest(S1,A1), Qsid(S2), not interest(S2,A1).
Qsymdiff(S1,S2) :- QDiff(S1,S2).
Qsymdiff(S1,S2) :- QDiff(S2,S1).
Q(S1,S2) :- Qsid(S1), Qsid(S2), not Qsymdiff(S1,S2), S1 < S2.
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
WITH sids AS
    (SELECT sid FROM student),
  qdiff (SELECT i.sid AS sid1, s.sid AS sid2
         FROM interest i, sids s
         WHERE NOT EXISTS (SELECT * FROM interest i2
                            WHERE i2.activity = i.activity),
  qsymdiff AS (SELECT * FROM qdiff UNION SELECT sid2, sid1 FROM qdiff)
SELECT s1.sid AS sid1, s2.sid AS sid2
FROM sids s1, sids s2
WHERE NOT EXISTS (SELECT * FROM qsymdiff q WHERE s1.sid = q.sid1 AND s2.sid = q.sid2) AND s1.sid < s2.sid;
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
WITH sids AS
    (SELECT sid FROM student),
  qdiff (SELECT i.sid AS sid1, s.sid AS sid2
         FROM interest i, sids s
         WHERE NOT EXISTS (SELECT * FROM interest i2
                            WHERE i2.activity = i.activity)),
  qsymdiff AS (SELECT * FROM qdiff UNION SELECT sid2, sid1 FROM qdiff)
(SELECT s1.sid AS sid1, s2.sid AS sid2
FROM sids s1, sids s2
WHERE s1.sid < s2.sid)
EXCEPT
(SELECT * FROM qsymdiff q));
1
2
3
4
Q1(I1,I2,A) :- interest(I1,A), interest(I2,A).
Qextra(I1,I2) :- Q1(I1,I2,A), Q1(I1,I2,A2), interest(I1,A3), A <> A3, A <> A2.
Qextra(I1,I2) :- Q1(I1,I2,A), Q1(I1,I2,A2), interest(I2,A3), A2 <> A3, A <> A2.
Q(I1,I2) :- Q1(I1,I2,A), not Qextra(I1,I2).
students activity
1 c1
1 c2
1 c3
2 c1
2 c2
2 c3

universal quantification

  • $\forall x: \phi(x) \Leftrightarrow \neg \exists x \neg \phi(x)$

<2022-03-28 Mon>

<2022-04-06 Wed>

  • time: (TID, year, month)
  • location: (LID, state, city)
  • fact: (TID, LID, num_call)

breakdown per time

  • total calls
  • calls per year
  • calls per month
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
-- total
SELECT sum(num_calls) AS total_calls FROM fact
-- total per year
SELECT sum(num_calls) AS total_calls, year
FROM time t, fact f
WHERE t.TID = f.TID
GROUP BY year
-- total per month
SELECT sum(num_calls) AS total_calls, year, month
FROM time t, fact f
WHERE t.TID = f.TID
GROUP BY year, month
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
WITH total_calls AS(
SELECT sum(num_calls) AS total_calls, NULL AS year, NULL AS month, 0 AS month_present, 0 AS year_present
  FROM fact),
year_calls AS (
SELECT sum(num_calls) AS total_calls, year, NULL AS month, 0 AS month_present, 1 AS year_present
FROM time t, fact f
WHERE t.TID = f.TID
GROUP BY year),
month_calls AS (
SELECT sum(num_calls) AS total_calls, year, month, 1 AS month_present, 1 AS year_present
FROM time t, fact f
WHERE t.TID = f.TID
GROUP BY year, month)
SELECT * FROM total_calls
UNION ALL
SELECT * FROM year_calls
UNION ALL
SELECT * FROM month_calls
  • reuse computation
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
WITH
month_calls AS (
SELECT sum(num_calls) AS total_calls, year, month, 1 AS month_present, 1 AS year_present
FROM time t, fact f
WHERE t.TID = f.TID
GROUP BY year, month),

year_calls AS (
SELECT sum(total_calls) AS total_calls | year | NULL AS month | 0 AS month_present | 1 AS year_present |
FROM month_calls
GROUP BY year),

total_calls AS(
SELECT sum(total_calls) AS total_calls, NULL AS year, NULL AS month, 0 AS month_present, 0 AS year_present
  FROM year_calls)

SELECT * FROM total_calls
UNION ALL
SELECT * FROM year_calls
UNION ALL
SELECT * FROM month_calls
  • using special grouping constructs
1
2
3
4
SELECT sum(num_calls) AS total_calls, year, month, GROUPING(year) AS grp_year, GROUPING(month) AS grp_month
FROM time t, fact f
WHERE t.TID = f.TID
GROUP BY ROLLUP(year, month)

cumulative sums

  • for 2016, give me per month the cummulative calls per month
1
2
3
4
5
6
SELECT month, sum(total_calls) OVER (ORDER BY month) as cumul_calls
FROM
  (SELECT sum(num_calls) AS total_calls, month
     FROM fact f, time t
    WHERE f.TID = t.tid AND t.year = 2016
    GROUP BY month) monthly
  • write the query without window functions
1
2
3
4
5
6
7
8
9
WITH monthly AS (SELECT ROWNUM AS rid, sum(num_calls) AS total_calls, month
     FROM fact f, time t
    WHERE f.TID = t.tid AND t.year = 2016
    GROUP BY month
    ORDER BY month)
SELECT f.month, sum(s.total_calls)
FROM monthly f, monthly s
WHERE f.rid >= s.rid
GROUP BY f.month
  • for every year, give me per month the cummulative calls within that year
1
2
3
4
5
SELECT year, month, sum(total_calls) OVER (PARTITION BY year ORDER BY month) AS cumul_calls
FROM
  (SELECT sum(num_calls) AS total_calls, year, month
     FROM fact f, time t
    GROUP BY year, month) monthly
  • name of the student with the highest GPA: student(name,GPA,major)

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    
    SELECT name
    FROM student
    ORDER BY gpa DESC
    LIMIT 1; -- if multiple students have the highest gpa, the only one is retured
    
    WITH maxgpa AS (SELECT gap AS mgpa FROM student ORDER BY gpa DESC LIMIT 1),
    SELECT name
    FROM student, maxgpa
    WHERE GPA = mgpa;
    
    WITH maxgpa AS (SELECT max(GPA) AS mgpa FROM student),
    SELECT name
    FROM student, maxgpa
    WHERE GPA = mgpa;
    
    SELECT name FROM (
    (SELECT dense_rank() OVER (ORDER BY gpa DESC) AS rank, name FROM student)
    WHERE rank = 1;
    
    SELECT name
    FROM (SELECT name, max(gpa) OVER () AS mgpa, gpa
          FROM student)
    WHERE mgpa = gpa

<2022-04-13 Wed>

<2022-04-18 Mon>

  • failure during 1 hour: p = 0.001
  • probability that at least one of our 1000 machines fails during 1 hour:

    • 1 - probability that none of them fails: (1-p)^1000 =~ 0.36
    • 1 - p = 0.999
  • in general the probability to get exactly $m$ failures out of $n$ nodes: $$\binom{n}{m} \cdot p^m \cdot (1-p)^{n-m}$$
  • to get at least $m$ failures out of $n$ nodes $$\sum_{i=m}^{n} \binom{n}{i} \cdot p^i \cdot (1-p)^{n-i}$$
Last updated on 29 Dec 2021
Published on 29 Dec 2021