CS525 - Advanced Database Organization - 2023 Spring

Course webpage for CS525 - 2023 Spring taught by Boris Glavic

Lecture Notes for CS525

avg

init

1
2
3

count = 0;
sum = 0;

update(t)

1
2
count ++
sum += t.A

close

1
return sum / count
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
class SumAgg():

    def __init__(self):
        self.sum = 0
        self.count = 0

    def update(self,a):
        self.sum += a
        self.count += 1

    def close(self):
        return self.sum / self.count


agg = SumAgg()

for i in range(1,100):
    agg.update(i)

print(agg.close())
50.0

preaggregation

A B
1 1
5 2
2 1
6 2
4 1
7 2
count
6
1
SELECT count(*) FROM r;

Thread 1

A B
1 1
5 2
2 1
count
3

Thread 2

A B
6 2
4 1
7 2
count
3

Thread 3

final count
  • Preaggregation for …

    • count is sum of partial count
    • sum is sum of partial sum
    • min is min of partial min

Infinitely many plans

1
2
3
FROM R
FROM R NATURAL NATURAL JOIN (SELECT DISTINCT * FROM R)
FROM R NATURAL NATURAL JOIN (SELECT DISTINCT * FROM R) NATURAL JOIN (SELECT DISTINCT * FROM R)
1
WHERE R.A = S.B AND f(R.C,S.D)

Transactions & Schedules

1
UDPATE student SET gpa = gpa * 1.05 WHERE major = 'CS';
tid name major gpa
t1 X CS 3.5
t2 Y CS 1.5
t3 Z CS 2.3
t4 A BIO 1.0
t5 B BIO 4.0

Exam questions

  • how to solve schedules?
  • external sorting and hash join I/O + other operators

Non exam questions

  • postgres plans

I/O estimations

sorting

  • for sorting we need number of blocks of R B(R) and the number of blocks of memory M
  • B(R) is less then M

    • read R then write R and use any in memory sorting algo. $2 \cdot B(R)$
  • run generation: read chunks of size $M$ into memory and sort them, then write them back

    • number of I/O $2 \cdot B(R)$
    • $\lceil \frac{B(R)}{M} \rceil$ generated runs
  • merging merge $M-1$ runs at a time: reduced number of runs by a factor of $M-1$ in each merge iteration

    • $i = \lceil{\log_{M-1}(\lceil \frac{B(R)}{M} \rceil)\rceil$ merge phases
    • $2 \cdot B(R)$
  • total I/O

    • $(1 + i) \cdot 2 \cdot B(R)$

joins

  • input: B(R), B(S), M

block nested loop

hash join

  • if $B(R) < M$ or $B(S) < M$ load smaller table into memory, build hashtable, scan through the other

    • WLOG R is smaller $B(R) + B(S)$
  • partition both tables into $M-1$ fragments -> the fragments are of size $\lceil \frac{B(R)}{M-1} \rceil$ and $\lceil \frac{B(S)}{M-1} \rceil$

    • how much I/O $2 \cdot (B(R) + B(S))$
  • number of partition phases $i = \lceil \log_{M-1}(min(B(R),B(S))\rceil - 1$

$$(2 \cdot i + 1) \cdot (B(R) + B(S))$$

schedules

recoverability

  • T1 reads from T2 if T1 reads a data item A and T2 was the last one to write the data item A before the read
  • a schedule is recoverable if for every transaction Ti that reads from a transaction Tj the commit of Tj is before the commit of Ti
1
2
3
4
5
1: r2(A), -- 4
2: w1(A), -- 10
3: w2(A), -- 5
4: c2,
5: c1
Time A
0 4
1 4
2 10
3 5
4 5
5 5
  • not recoverable
1
2
3
4
5
1: w1(A), -- 10
2: r2(A), -- 10
3: w2(A), -- 11
4: c2,
5: a1
Time A after T1 abort
0 4 4
1 10 4
2 10 4
3 11 5
4 11 5
5 11 5
  • recoverable
1
2
3
4
5
1: w1(A), -- 10
2: r2(A), -- 10
3: w2(A), -- 11
4: c1,
5: c2

cascadeless

  • T1 reads from T2 if T1 reads a data item A and T2 was the last one to write the data item A before the read
  • a schedule is cascadeless if for every transaction Ti that reads from a transaction Tj the commit of Tj is before the read of Ti

    • we only read committed data
  • check for this wi(A) …. ci …. rj(A)
1
2
3
4
5
1: w1(A), -- 10
2: r2(A), -- 10
3: w2(A), -- 11
4: a1, -- have to abort T2 now
5: a2
1
w1(A), w2(A), w3(A)
1
r1(A), r2(A), r3(A)

strictness

  • a schedule is strict if every transaction is only reading and writing committed data
  • w1(A) …. c1 …. w2(A)

conflict-serializable

  • construct conflict graph

    • conflicting operations

      • same data item
      • different transactions
      • at least one of them is a read
      • pi(X) … qj(X) means there is an edge from i to j

examples

1
r2(A),r2(E),w1(C),r1(E),r2(C),c2,w3(A),w3(C),c3,r1(D),c1

s #+LABEL: fig:cg.jpg

cg.jpg

cg.jpg.jpg

1
r2(E),r1(A),r2(C),r1(B),r1(C),r2(A),w2(C),r2(B),c2,r1(C),w1(E),w1(A),c1

non2pl.jpg

2PL

examples

  • a schedule (here we need locks early
1
w1(A), w2(A), w1(B)
1
l1(A), w1(A), u1(A) -- T1 now in shrinking phase, l2(A), w2(A), l1(B) -- not possible T1 in shrinking phase, w1(B)
1
l1(A), l1(B), w1(A), u1(A) -- T1 now in shrinking phase, l2(A), w2(A), w1(B), u1(B) ...
  • a schedule (here we need locks late)
1
w1(C), w2(A), w1(A)
1
l1(A), l1(C), w1(C), u1(A), l2(A), w2(A),u2(A), l1(A) -- not allowed already in shrinking phase, w1(A)
1
l1(C), w1(C), l2(A), w2(A),u2(A), l1(A), w1(A)

new stuff

  1. postgres plans (externally and internally) - 1
  2. modern index structures (zone maps, Bw-trees, skip lists, LSM) - 4
  3. distributed query / transaction processing - 2
  4. in-memory databases - 6
  5. sorting and joins in postgres (when to use what) - 3
Last updated on 2 Jan 2023
Published on 2 Jan 2023