# CS525 - Advanced Database Organization - 2023 Spring

Course webpage for CS525 - 2023 Spring taught by Boris Glavic

# Lecture Notes for CS525

## avg

### init

 1 2 3   count = 0; sum = 0;

### update(t)

 1 2  count ++ sum += t.A

### close

 1  return sum / count
  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20  class SumAgg(): def __init__(self): self.sum = 0 self.count = 0 def update(self,a): self.sum += a self.count += 1 def close(self): return self.sum / self.count agg = SumAgg() for i in range(1,100): agg.update(i) print(agg.close())
50.0


## preaggregation

A B
1 1
5 2
2 1
6 2
4 1
7 2
count
6
 1  SELECT count(*) FROM r;

A B
1 1
5 2
2 1
count
3

A B
6 2
4 1
7 2
count
3

final count
• Preaggregation for …

• count is sum of partial count
• sum is sum of partial sum
• min is min of partial min

## Infinitely many plans

 1 2 3  FROM R FROM R NATURAL NATURAL JOIN (SELECT DISTINCT * FROM R) FROM R NATURAL NATURAL JOIN (SELECT DISTINCT * FROM R) NATURAL JOIN (SELECT DISTINCT * FROM R)
 1  WHERE R.A = S.B AND f(R.C,S.D)

## Transactions & Schedules

 1  UDPATE student SET gpa = gpa * 1.05 WHERE major = 'CS';
tid name major gpa
t1 X CS 3.5
t2 Y CS 1.5
t3 Z CS 2.3
t4 A BIO 1.0
t5 B BIO 4.0

## Exam questions

• how to solve schedules?
• external sorting and hash join I/O + other operators

## Non exam questions

• postgres plans

## I/O estimations

### sorting

• for sorting we need number of blocks of R B(R) and the number of blocks of memory M
• B(R) is less then M

• read R then write R and use any in memory sorting algo. $2 \cdot B(R)$
• run generation: read chunks of size $M$ into memory and sort them, then write them back

• number of I/O $2 \cdot B(R)$
• $\lceil \frac{B(R)}{M} \rceil$ generated runs
• merging merge $M-1$ runs at a time: reduced number of runs by a factor of $M-1$ in each merge iteration

• $i = \lceil{\log_{M-1}(\lceil \frac{B(R)}{M} \rceil)\rceil$ merge phases
• $2 \cdot B(R)$
• total I/O

• $(1 + i) \cdot 2 \cdot B(R)$

### joins

• input: B(R), B(S), M

#### hash join

• if $B(R) < M$ or $B(S) < M$ load smaller table into memory, build hashtable, scan through the other

• WLOG R is smaller $B(R) + B(S)$
• partition both tables into $M-1$ fragments -> the fragments are of size $\lceil \frac{B(R)}{M-1} \rceil$ and $\lceil \frac{B(S)}{M-1} \rceil$

• how much I/O $2 \cdot (B(R) + B(S))$
• number of partition phases $i = \lceil \log_{M-1}(min(B(R),B(S))\rceil - 1$

$$(2 \cdot i + 1) \cdot (B(R) + B(S))$$

## schedules

### recoverability

• T1 reads from T2 if T1 reads a data item A and T2 was the last one to write the data item A before the read
• a schedule is recoverable if for every transaction Ti that reads from a transaction Tj the commit of Tj is before the commit of Ti
 1 2 3 4 5  1: r2(A), -- 4 2: w1(A), -- 10 3: w2(A), -- 5 4: c2, 5: c1
Time A
0 4
1 4
2 10
3 5
4 5
5 5
• not recoverable
 1 2 3 4 5  1: w1(A), -- 10 2: r2(A), -- 10 3: w2(A), -- 11 4: c2, 5: a1
Time A after T1 abort
0 4 4
1 10 4
2 10 4
3 11 5
4 11 5
5 11 5
• recoverable
 1 2 3 4 5  1: w1(A), -- 10 2: r2(A), -- 10 3: w2(A), -- 11 4: c1, 5: c2

• T1 reads from T2 if T1 reads a data item A and T2 was the last one to write the data item A before the read
• a schedule is cascadeless if for every transaction Ti that reads from a transaction Tj the commit of Tj is before the read of Ti

• we only read committed data
• check for this wi(A) …. ci …. rj(A)
 1 2 3 4 5  1: w1(A), -- 10 2: r2(A), -- 10 3: w2(A), -- 11 4: a1, -- have to abort T2 now 5: a2
 1  w1(A), w2(A), w3(A)
 1  r1(A), r2(A), r3(A)

### strictness

• a schedule is strict if every transaction is only reading and writing committed data
• w1(A) …. c1 …. w2(A)

### conflict-serializable

• construct conflict graph

• conflicting operations

• same data item
• different transactions
• at least one of them is a read
• pi(X) … qj(X) means there is an edge from i to j

#### examples

 1  r2(A),r2(E),w1(C),r1(E),r2(C),c2,w3(A),w3(C),c3,r1(D),c1

 1  r2(E),r1(A),r2(C),r1(B),r1(C),r2(A),w2(C),r2(B),c2,r1(C),w1(E),w1(A),c1

### 2PL

#### examples

• a schedule (here we need locks early
 1  w1(A), w2(A), w1(B)
 1  l1(A), w1(A), u1(A) -- T1 now in shrinking phase, l2(A), w2(A), l1(B) -- not possible T1 in shrinking phase, w1(B)
 1  l1(A), l1(B), w1(A), u1(A) -- T1 now in shrinking phase, l2(A), w2(A), w1(B), u1(B) ...
• a schedule (here we need locks late)
 1  w1(C), w2(A), w1(A)
 1  l1(A), l1(C), w1(C), u1(A), l2(A), w2(A),u2(A), l1(A) -- not allowed already in shrinking phase, w1(A)
 1  l1(C), w1(C), l2(A), w2(A),u2(A), l1(A), w1(A)

## new stuff

1. postgres plans (externally and internally) - 1
2. modern index structures (zone maps, Bw-trees, skip lists, LSM) - 4
3. distributed query / transaction processing - 2
4. in-memory databases - 6
5. sorting and joins in postgres (when to use what) - 3
Last updated on 2 Jan 2023
Published on 2 Jan 2023