{"cells":[{"cell_type":"markdown","metadata":{},"source":"Hashtables\n==========\n\n"},{"cell_type":"markdown","metadata":{},"source":["## Agenda\n\n"]},{"cell_type":"markdown","metadata":{},"source":["- Discussion: pros/cons of array-backed and linked structures\n- Comparison to `set` and `dict`\n- The *map* ADT\n- Direct lookups via *Hashing*\n- Hashtables\n\n- Collisions and the \"Birthday problem\"\n\n- Runtime analysis & Discussion\n\n"]},{"cell_type":"markdown","metadata":{},"source":["## Discussion: pros/cons of array-backed and linked structures\n\n"]},{"cell_type":"markdown","metadata":{},"source":["Between the array-backed and linked list we have:\n\n1. $O(1)$ indexing (array-backed)\n2. $O(1)$ appending (array-backed & linked)\n3. $O(1)$ insertion/deletion without indexing (linked)\n4. $O(N)$ linear search (unsorted)\n5. $O(\\log N)$ binary search, when sorted (only array-backed lists)\n\n"]},{"cell_type":"markdown","metadata":{},"source":["## Comparison to `set` and `dict`\n\n"]},{"cell_type":"markdown","metadata":{},"source":["The `set` and `dict` types don't support positional access (i.e., by\nindex), but do support lookup/search. 
How do they fare compared to\nlists?\n\n"]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[],"source":["import timeit\n\ndef lin_search(lst, x):\n    return x in lst\n\ndef bin_search(lst, x):\n    # assumes lst is sorted\n    low = 0\n    hi = len(lst)-1\n    while low <= hi:\n        mid = (low + hi) // 2\n        if x < lst[mid]:\n            hi = mid - 1\n        elif x > lst[mid]:\n            low = mid + 1\n        else:\n            return True\n    else:\n        return False\n\ndef set_search(st, x):\n    return x in st\n\n\ndef dict_search(dct, x):\n    return x in dct"]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[],"source":["%matplotlib inline\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport random\n\nns = np.linspace(100, 10_000, 50, dtype=int)\n\nts_linsearch = [timeit.timeit('lin_search(lst, lst[-1])',\n                              setup='lst = list(range({})); random.shuffle(lst)'.format(n),\n                              globals=globals(),\n                              number=100)\n                for n in ns]\n\nts_binsearch = [timeit.timeit('bin_search(lst, 0)',\n                              setup='lst = list(range({}))'.format(n),\n                              globals=globals(),\n                              number=100)\n                for n in ns]\n\n\nts_setsearch = [timeit.timeit(#'set_search(st, 0)',\n                              'set_search(st, {})'.format(random.randrange(n)),\n                              setup='lst = list(range({})); random.shuffle(lst);'\n                                    'st = set(lst)'.format(n),\n                              globals=globals(),\n                              number=100)\n                for n in ns]\n\nts_dctsearch = [timeit.timeit(#'dict_search(dct, 0)',\n                              'dict_search(dct, {})'.format(random.randrange(n)),\n                              setup='lst = list(range({})); random.shuffle(lst);'\n                                    'dct = {{x:x for x in lst}}'.format(n),\n                              globals=globals(),\n                              number=100)\n                for n in ns]"]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[],"source":["plt.plot(ns, ts_linsearch, 'or')\nplt.plot(ns, ts_binsearch, 'sg')\nplt.plot(ns, ts_setsearch, 'db')\nplt.plot(ns, ts_dctsearch, 'om');"]},{"cell_type":"markdown","metadata":{},"source":["![img](c330ab22e12845d0448e5bd0545018bfe8504f19.png)\n\nIt looks like sets and dictionaries support lookup in constant 
time!\nHow?!\n\n"]},{"cell_type":"markdown","metadata":{},"source":["## The `map` ADT\n\n"]},{"cell_type":"markdown","metadata":{},"source":["We will focus next on the \"*map*\" abstract data type (aka \"associative\narray\" or \"dictionary\"), which is used to associate keys (which must be\nunique) with values. Python's `dict` type is an implementation of the\nmap ADT.\n\nGiven an implementation of a map, it is trivial to implement a *set* on\ntop of it (how?).\n\nHere's a simple map API:\n\n"]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[],"source":["class MapDS:\n    def __init__(self):\n        self.data = {}\n\n    def __setitem__(self, key, value):\n        self.data[key] = value\n\n\n\n    def __getitem__(self, key):\n        return self.data[key]\n\n    def __contains__(self, key):\n        return key in self.data"]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[],"source":["m = MapDS()\nm['batman'] = 'bruce wayne'\nm['superman'] = 'clark kent'\nm['spiderman'] = 'peter parker'"]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[],"source":["m['batman']"]},{"cell_type":"markdown","metadata":{},"source":[" 'bruce wayne'\n\n"]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[],"source":["m['batman'] = 'tony stark'"]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[],"source":["m['batman']"]},{"cell_type":"markdown","metadata":{},"source":[" 'tony stark'\n\nHow do we make the leap from linear runtime complexity to constant?!\n\n"]},{"cell_type":"markdown","metadata":{},"source":["## Direct lookups via *Hashing*\n\n"]},{"cell_type":"markdown","metadata":{},"source":["Hashes (a.k.a. 
hash codes or hash values) are simply numerical values\ncomputed for objects.\n\n"]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[],"source":["hash('hello')"]},{"cell_type":"markdown","metadata":{},"source":[" -954384285558724197\n\n"]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[],"source":["hash('batman')"]},{"cell_type":"markdown","metadata":{},"source":[" 5465486877337622348\n\n"]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[],"source":["hash('batmen')"]},{"cell_type":"markdown","metadata":{},"source":[" 8014717029909393586\n\n"]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[],"source":["[hash(s) for s in ['different', 'objects', 'have', 'very', 'different', 'hashes']]"]},{"cell_type":"markdown","metadata":{},"source":[" [2429629120202328647,\n 8372779892654583019,\n -8906997482930836953,\n 853381216711768263,\n 2429629120202328647,\n 7225739097362972930]\n\n"]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[],"source":["[hash(s)%100 for s in ['different', 'objects', 'have', 'very', 'different', 'hashes']]"]},{"cell_type":"markdown","metadata":{},"source":[" [47, 19, 47, 63, 47, 30]\n\n"]},{"cell_type":"markdown","metadata":{},"source":["## Hashtables\n\n"]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[],"source":["class Hashtable:\n    def __init__(self, n_buckets):\n        # one independent list per bucket ([[]] * n_buckets would alias a single list)\n        self.buckets = [[] for _ in range(n_buckets)]\n\n    def __setitem__(self, key, val):\n        h = hash(key)\n        bucket = self.buckets[h % len(self.buckets)]\n        for k in bucket:\n            if k[0] == key:\n                k[1] = val  # key already present: update in place\n                return\n        bucket.append([key, val])\n\n    def __getitem__(self, key):\n        h = hash(key)\n        for k in self.buckets[h % len(self.buckets)]:\n            if k[0] == key:\n                return k[1]\n        raise KeyError(f\"key {key} not in hashtable\")\n\n    def __contains__(self, key):\n        try:\n            _ = self[key]\n            return True\n        except KeyError:\n            return False"]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[],"source":["ht = 
Hashtable(100)\nht['spiderman'] = 'peter parker'\nht['batman'] = 'bruce wayne'\nht['superman'] = 'clark kent'"]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[],"source":["ht['spiderman']"]},{"cell_type":"markdown","metadata":{},"source":[" 'peter parker'\n\n"]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[],"source":["ht['batman']"]},{"cell_type":"markdown","metadata":{},"source":[" 'bruce wayne'\n\n"]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[],"source":["ht['superman']"]},{"cell_type":"markdown","metadata":{},"source":[" 'clark kent'\n\n"]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[],"source":["ht['superman'] = 'bob'\nht['superman']"]},{"cell_type":"markdown","metadata":{},"source":[" 'bob'\n\n"]},{"cell_type":"markdown","metadata":{},"source":["## On Collisions\n\n"]},{"cell_type":"markdown","metadata":{},"source":["### The \"Birthday Problem\"\n\n"]},{"cell_type":"markdown","metadata":{},"source":["Problem statement: Given $N$ people at a party, how likely is it that at\nleast two people will have the same birthday?\n\n"]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[],"source":["def birthday_p(n_people):\n    p_inv = 1\n    for n in range(365, 365-n_people, -1):\n        p_inv *= n / 365\n    return 1 - p_inv"]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[],"source":["birthday_p(3)"]},{"cell_type":"markdown","metadata":{},"source":[" 0.008204165884781345\n\n"]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[],"source":["1-364/365*363/365"]},{"cell_type":"markdown","metadata":{},"source":[" 0.008204165884781456\n\n"]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[],"source":["n_people = range(1, 80)\nplt.plot(n_people, [birthday_p(n) for n in 
n_people]);"]},{"cell_type":"markdown","metadata":{},"source":["![img](d764b2052716d768ffe045cf53c2b0c13c9c5cb6.png)\n\n"]},{"cell_type":"markdown","metadata":{},"source":["### General collision statistics\n\n"]},{"cell_type":"markdown","metadata":{},"source":["Repeat the birthday problem, but with a given number of values and\n\"buckets\" that are allotted to hold them. How likely is it that two or\nmore values will map to the same bucket?\n\n"]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[],"source":["def collision_p(n_values, n_buckets):\n    p_inv = 1\n    for n in range(n_buckets, n_buckets-n_values, -1):\n        p_inv *= n / n_buckets\n    return 1 - p_inv"]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[],"source":["collision_p(23, 365) # same as birthday problem, for 23 people"]},{"cell_type":"markdown","metadata":{},"source":[" 0.5072972343239857\n\n"]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[],"source":["collision_p(10, 100)"]},{"cell_type":"markdown","metadata":{},"source":[" 0.37184349044470544\n\n"]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[],"source":["collision_p(100, 1000)"]},{"cell_type":"markdown","metadata":{},"source":[" 0.9940410733677595\n\n"]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[],"source":["# keeping number of values fixed at 100, but vary number of buckets: visualize probability of collision\nn_buckets = range(100, 100001, 1000)\nplt.plot(n_buckets, [collision_p(100, nb) for nb in n_buckets]);"]},{"cell_type":"markdown","metadata":{},"source":["![img](8f125191a3fc94123848d21d57e9a7ae712b566a.png)\n\n"]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[],"source":["def avg_num_collisions(n, b):\n \"\"\"Returns the expected number of collisions for n values uniformly distributed\n over a hashtable of b buckets. 
Based on (fairly) elementary probability theory.\n (Pay attention in MATH 474!)\"\"\"\n return n - b + b * (1 - 1/b)**n"]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[],"source":["avg_num_collisions(28, 365)"]},{"cell_type":"markdown","metadata":{},"source":[" 1.011442040700615\n\n"]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[],"source":["avg_num_collisions(1000, 1000)"]},{"cell_type":"markdown","metadata":{},"source":[" 367.6954247709637\n\n"]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[],"source":["avg_num_collisions(1000, 10000)"]},{"cell_type":"markdown","metadata":{},"source":[" 48.32893558556316\n\n"]},{"cell_type":"markdown","metadata":{},"source":["## Dealing with Collisions\n\n"]},{"cell_type":"markdown","metadata":{},"source":["To deal with collisions in a hashtable, we simply create a \"chain\" of\nkey/value pairs for each bucket where collisions occur. The chain needs\nto be a data structure that supports quick insertion; a natural choice is\nthe linked list!\n\n"]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[],"source":["class Hashtable:\n    class Node:\n        def __init__(self, key, val, next=None):\n            self.key = key\n            self.val = val\n            self.next = next\n\n    def __init__(self, n_buckets=1000):\n        self.buckets = [None] * n_buckets\n\n    def __setitem__(self, key, val):\n        bidx = hash(key) % len(self.buckets)\n        n = self.buckets[bidx]\n        while n:\n            if n.key == key:\n                n.val = val  # key already present: update in place\n                return\n            n = n.next\n        # new key: prepend a node to this bucket's chain\n        self.buckets[bidx] = self.Node(key, val, next=self.buckets[bidx])\n\n    def __getitem__(self, key):\n        bidx = hash(key) % len(self.buckets)\n        n = self.buckets[bidx]\n        while n:\n            if n.key == key:\n                return n.val\n            n = n.next\n        raise KeyError(key)\n\n    def __contains__(self, key):\n        try:\n            _ = self[key]\n            return True\n        except KeyError:\n            return False"]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[],"source":["ht = Hashtable(1)\nht['batman'] = 'bruce wayne'\nht['superman'] = 'clark kent'\nht['spiderman'] = 'peter 
parker'"]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[],"source":["ht['batman']"]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[],"source":["ht['superman']"]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[],"source":["ht['spiderman']"]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[],"source":["def ht_search(ht, x):\n    return x in ht\n\ndef init_ht(size):\n    ht = Hashtable(size)\n    for x in range(size):\n        ht[x] = x\n    return ht\n\nns = np.linspace(100, 10_000, 50, dtype=int)\nts_htsearch = [timeit.timeit('ht_search(ht, 0)',\n                             #'ht_search(ht, {})'.format(random.randrange(n)),\n                             setup='ht = init_ht({})'.format(n),\n                             globals=globals(),\n                             number=100)\n               for n in ns]"]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[],"source":["plt.plot(ns, ts_binsearch, 'ro')\nplt.plot(ns, ts_htsearch, 'gs')\nplt.plot(ns, ts_dctsearch, 'b^');"]},{"cell_type":"markdown","metadata":{},"source":["![img](6c2abfa5acc0a1b88f6131ede360364e10c9ce2a.png)\n\n"]},{"cell_type":"markdown","metadata":{},"source":["## Loose ends\n\n"]},{"cell_type":"markdown","metadata":{},"source":["### Iteration\n\n"]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[],"source":["class Hashtable(Hashtable):\n    def __iter__(self):\n        pass"]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[],"source":["ht = Hashtable(100)\nht['batman'] = 'bruce wayne'\nht['superman'] = 'clark kent'\nht['spiderman'] = 'peter parker'"]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[],"source":["for k in ht:\n    print(k)"]},{"cell_type":"markdown","metadata":{},"source":[" \n TypeErrorTraceback (most recent call last)\n in \n ----> 1 for k in ht:\n 2 print(k)\n \n TypeError: iter() returned non-iterator of type 'NoneType'\n\n"]},{"cell_type":"markdown","metadata":{},"source":["### Key ordering\n\n"]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[],"source":["ht = Hashtable()\nd 
= {}\nfor x in 'banana apple cat dog elephant'.split():\n d[x[0]] = x\n ht[x[0]] = x"]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[],"source":["for k in d:\n print(k, '=>', d[k])"]},{"cell_type":"markdown","metadata":{},"source":[" b => banana\n a => apple\n c => cat\n d => dog\n e => elephant\n\n"]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[],"source":["for k in ht:\n print(k, '=>', ht[k])"]},{"cell_type":"markdown","metadata":{},"source":[" \n TypeErrorTraceback (most recent call last)\n in \n ----> 1 for k in ht:\n 2 print(k, '=>', ht[k])\n \n TypeError: iter() returned non-iterator of type 'NoneType'\n\n"]},{"cell_type":"markdown","metadata":{},"source":["### Load factor & Rehashing\n\n"]},{"cell_type":"markdown","metadata":{},"source":["It is clear that the ratio of the number of keys to the number of\nbuckets (known as the **load factor**) can have a significant effect on\nthe performance of a hashtable.\n\nA fixed number of buckets doesn't make sense, as it might be wasteful\nfor a small number of keys, and also scale poorly to a relatively large\nnumber of keys. And it also doesn't make sense to have the user of the\nhashtable manually specify the number of buckets (which is a low-level\nimplementation detail).\n\nInstead: a practical hashtable implementation would start with a\nrelatively small number of buckets, and if/when the load factor\nincreases beyond some threshold (typically 1), it *dynamically increases\nthe number of buckets* (typically to twice the previous number). This\nrequires that all existing keys be *rehashed* to new buckets (why?).\n\n"]},{"cell_type":"markdown","metadata":{},"source":["### Uniform hashing\n\n"]},{"cell_type":"markdown","metadata":{},"source":["Ultimately, the performance of a hashtable also heavily depends on\nhashcodes being *uniformly distributed* — i.e., where, statistically,\neach bucket has roughly the same number of keys hashing to it. 
Designing\nhash functions that do this is an algorithmic problem that's outside the\nscope of this class!\n\n"]},{"cell_type":"markdown","metadata":{},"source":["## Runtime analysis & Discussion\n\n"]},{"cell_type":"markdown","metadata":{},"source":["For a hashtable with $N$ key/value entries, we have the following\n*worst-case runtime complexity*:\n\n- Insertion: $O(N)$\n- Lookup: $O(N)$\n- Deletion: $O(N)$\n\nAssuming uniform hashing and the rehashing behavior described above, it is\nalso possible to prove that hashtables have $O(1)$ *amortized runtime\ncomplexity* for the above operations. Proving this is also beyond the\nscope of this class (but it is consistent with the empirical timings above).\n\n"]},{"cell_type":"markdown","metadata":{},"source":["## Vocabulary list\n\n"]},{"cell_type":"markdown","metadata":{},"source":["- hashtable\n- hashing and hashes\n- collision\n- hash buckets & chains\n- birthday problem\n- load factor\n- rehashing\n\n---\n\n"]},{"cell_type":"markdown","metadata":{},"source":["## Addendum: On *Hashability*\n\n"]},{"cell_type":"markdown","metadata":{},"source":["Remember: *a given object must always hash to the same value*. This is\nrequired so that we can always map the object to the same hash bucket.\n\nThe hashcode for a collection of objects is usually computed from the\nhashcodes of its contents, e.g., the hash of a tuple is a function of\nthe hashes of the objects in said tuple:\n\n"]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[],"source":["hash(('two', 'strings'))"]},{"cell_type":"markdown","metadata":{},"source":[" 4246727162495154915\n\nThis is useful. 
It allows us to use a tuple, for instance, as a key for\na hashtable.\n\nHowever, if the collection of objects is *mutable* (i.e., we can\nalter its contents), this means that we can potentially change its\nhashcode.\n\nIf we were to use such a collection as a key in a hashtable, and alter\nthe collection after it's been assigned to a particular bucket, this\nleads to a serious problem: the collection may now be in the wrong\nbucket (as it was assigned to a bucket based on its original hashcode)!\n\nFor this reason, only immutable types are, by default, hashable in\nPython. So while we can use integers, strings, and tuples (of hashable\nitems) as keys in dictionaries, lists (which are mutable) cannot be used.\nIndeed, Python marks built-in mutable types as \"unhashable\", e.g.,\n\n"]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[],"source":["hash([1, 2, 3])"]},{"cell_type":"markdown","metadata":{},"source":[" \n TypeErrorTraceback (most recent call last)\n in \n ----> 1 hash([1, 2, 3])\n \n TypeError: unhashable type: 'list'\n\nThat said, Python does support hashing on instances of custom classes\n(which are mutable). This is because the default hash function\nimplementation does not rely on the contents of instances of custom\nclasses. 
E.g.,\n\n"]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[],"source":["class Student:\n    def __init__(self, fname, lname):\n        self.fname = fname\n        self.lname = lname"]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[],"source":["s = Student('John', 'Doe')\nhash(s)"]},{"cell_type":"markdown","metadata":{},"source":[" 298582137\n\n"]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[],"source":["s.fname = 'Jane'\nhash(s) # same as before mutation"]},{"cell_type":"markdown","metadata":{},"source":[" 298582137\n\nWe can change the default behavior by providing our own hash function in\n`__hash__`, e.g.,\n\n"]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[],"source":["class Student:\n    def __init__(self, fname, lname):\n        self.fname = fname\n        self.lname = lname\n\n    def __hash__(self):\n        return hash(self.fname) + hash(self.lname)"]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[],"source":["s = Student('John', 'Doe')\nhash(s)"]},{"cell_type":"markdown","metadata":{},"source":[" 7828797879385466672\n\n"]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[],"source":["s.fname = 'Jane'\nhash(s)"]},{"cell_type":"markdown","metadata":{},"source":[" -7042091445038950747\n\nBut be careful: instances of this class are no longer suitable for use\nas keys in hashtables (or dictionaries), if you intend to mutate them\nafter using them as keys!\n\n"]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[],"source":[""]}],"metadata":{"org":null,"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.5.2"}},"nbformat":4,"nbformat_minor":0}