Design a Word Count for 50 billion docs

October 20, 2021 chang

hash(term)%r gives random server for term. If I change the hash((ord(term[0])-ord(‘a’))//3), starting a, b, c terms storing in server 1, def storing server 2. Totally, you needs to have 26//3 +1 = 9 servers. The benefit is this is globally sorted from server 1 to server 9 by terms.

System Design

Design a Search Engine

October 20, 2021 chang

1. Inverted Index in memory
Small amount of data that you can load all terms and docid in memory
21bytes terms on avg and 1000 docids with 4 bytes int. And if you have 50 % of 4GB ram commodity machines, you can handle 80,000,000 terms.

Covid19:[2, 17, 678, 898, 9886]
sockcrush:[56, 89, 222, 678, 5678, 98760]
.
.
.

2. Inverted Index on Disk

Term:docid in a file

I	1		Bearish	2
Covid19	1		Covid19	1
I	1		Crash	2
You	1	SORT ==>	I	1
Told	1		I	1
Stock	2		Stock	2
Bearish	2		Told	1
Crash	2		You	1

You can do it by External sort and Merge using Files.
Merge can be multiple phases due to # of files. It merges to one global sorted file. 
k-way external sort-merge O(logk)*O(nk) with min-heap with k values.
Related leetcode problem

5. Inverted Index on Muti Disks-Map Reduce (altogether with GFS, External sort and merge to ss Table)
50 billion docs

a. Distributed File System (GFS) because the SS Table(inverted index) is huge.

Master knows its chunks and can answer whic machine the filename “ss table” file resides and location of the files in the machine.
SS Table file : c1 -> s1, s1′, s1”
c2 -> s2,s2′, s2”
c3 -> s3, s3′, s3”
cn -> sn, sn’, sn”

b. Map-Reduce (Batch processing)

Chunck servers has files with Term:dockid from crowlers

Chunk server 1	Chunk server 2	Chunk server 3	Chunk server 4	…	Chunk m
a, 1	a, 100	a, 320	a, 450
aa, 1	aaa, 100	aaa, 320	apple, 450
abc, 2	abc,100	abc, 320	avid, 450
apple,2	..	…	..
..	..	…	…
..	..	…	…

Mappers load the file term:docid and save it to reducer files and get ready for send the data to reducer. each reducer maps to H(term)%R from term. So each reducer get all the same term from all chunk servers. And then Reducer can do external sort and merge to get SS table file for H(term)%R.When does reducer know that it gets all the data from chunk server is because of the scheduler. Mapper can sort and reduce duplicates then it can reduce datacenter’s network traffiic.

6. ETL (Extract, Transfer, Load)

By External sort and Merge, we have one global sorted Term and docID.
Chang: 20
Chang: 67
Chang: 187

From one global sorted Term and docID, The goal is to create SS Table file (Sorted String Table):
Chang: 20, 67, 187,…
Cheng: 456, 578, 2898,…

Loading the SS table to RAM: Instead of loading all the DocIDs (not feasible too big), we add only offset to save the RAM space.
Dictionary in RAM
Chang: 1245 (offset fro the file)
Cheng: 3467 (offset from the file)

Option2: In a block of the SS Table, we can set the offset of the block. Not all terms in dictionary
Dictionary in RAM
Chang: 1245 (offset)
Dannis: 3953 (offset)
Ellis: 5688 (offset)

If you tries to find Cheng, you go Chang first and search from the Cheng’s offset in the file
Benefits: you can even save more space.

7. App tier (imported from cache or DB)

Search: Covid19 and stock crush

Get the docIDs for “Covid19” and the docIDs for “stock crush” and merge

Covid19 : [3, 23, 56, 67, 8,123,234,332,432,455,478,543,546,568,590, …….]

Stock crush : [33, 35,37,34,36,38,45,58,688,690,698,874,890,911,1221,1322, ….]

Problem. The client might not need to have all the document in advance. He might need to a few most relevant documents and that is it. But we are doing get all the docIDs from both and do and query.

Scatter-Gather (It is good to have in Cache)

It can serve to retrieve data faster. By chunking the long list data into range, it can serve faster.

python coding question

LRU Cache

October 18, 2021 chang

Design a data structure that follows the constraints of a Least Recently Used (LRU) cache.

Implement the LRUCache class:

LRUCache(int capacity) Initialize the LRU cache with positive size capacity.
int get(int key) Return the value of the key if the key exists, otherwise return -1.
void put(int key, int value) Update the value of the key if the key exists. Otherwise, add the key-value pair to the cache. If the number of keys exceeds the capacity from this operation, evict the least recently used key.
The functions get and put must each run in O(1) average time complexity.
What is considered as `used`?
Both get and put are considered as used.

Example 1:

Input
[“LRUCache”, “put”, “put”, “get”, “put”, “get”, “put”, “get”, “get”, “get”]
[[2], [1, 1], [2, 2], [1], [3, 3], [2], [4, 4], [1], [3], [4]]
Output
[null, null, null, 1, null, -1, null, -1, 3, 4]

Explanation
LRUCache lRUCache = new LRUCache(2);
lRUCache.put(1, 1); // cache is {1=1}
lRUCache.put(2, 2); // cache is {1=1, 2=2}
lRUCache.get(1); // return 1
lRUCache.put(3, 3); // LRU key was 2, evicts key 2, cache is {1=1, 3=3}
lRUCache.get(2); // returns -1 (not found)
lRUCache.put(4, 4); // LRU key was 1, evicts key 1, cache is {4=4, 3=3}
lRUCache.get(1); // return -1 (not found)
lRUCache.get(3); // return 3
lRUCache.get(4); // return 4

class Node:
    def __init__(self, key, value, left=None, right=None):
        self.key = key
        self.value = value
        self.left = left
        self.right = right


class LRUCache(object):

    def __init__(self, capacity):
        self.cap = capacity
        self.head, self.tail = None, None
        self.keyMap = dict()

    def get(self, key):
        if key not in self.keyMap:
            return -1
        node = self.keyMap[key]
        self.removeNode(node)
        self.addingNodeFront(node)
        return node.value

    def put(self, key, value):
        if key in self.keyMap:
            # Move the key in the first.
            keyNode = self.keyMap[key]
            self.removeNode(keyNode)
            self.addingNodeFront(keyNode)

        else:
            # first check the capacity.
            if len(self.keyMap) < self.cap:
                # if less than capacity, add the key in the first
                newNode = Node(key, value)
                if self.head == None:
                    self.head = newNode
                    self.tail = newNode
                else:
                    self.addingNodeFront(newNode)
                    # add the node to Mapp
                self.keyMap[key] = newNode
            else:
                # else remove the last in the list and add the new key in the first. and add the node in keymap
                tmp = self.tail
                self.tail = tmp.left
                self.tail.right = None
                self.keyMap.pop(tmp.key)

                newNode = Node(key, value)
                self.addingNodeFront(newNode)
                self.keyMap[key] = newNode

        print(self.keyMap)

    def addingNodeFront(self, node):
        tmp = self.head
        self.head = node
        node.right = tmp
        tmp.left = node

    def removeNode(self, node):
        _prev = node.left
        _next = node.right

        if _prev == None:

            self.head = node.right
            _next.left = None

        elif _next == None:
            self.tail = node.left
            _prev.right = None

        else:
            _prev.right = _next
            _next.left = _prev

Test Code
lRUCache = LRUCache(2)
lRUCache.put(1, 1)  # cache is {1=1}
lRUCache.put(2, 2)  # cache is {1=1, 2=2}
print(lRUCache.get(1))  # // return 1
lRUCache.put(3, 3)  # // LRU key was 2, evicts key 2, cache is {1=1, 3=3}
print(lRUCache.get(2))  # // returns -1 (not found)
lRUCache.put(4, 4)  # // LRU key was 1, evicts key 1, cache is {4=4, 3=3}
# lRUCache.get(1);    // return -1 (not found)
# lRUCache.get(3);    // return 3
# lRUCache.get(4);    // return 4

Chang's Home

Monthly Archives: October 2021

Design a Word Count for 50 billion docs

Design a Search Engine

LRU Cache