Design a Word Count for 50 billion docs

hash(term)%r  gives random server for term. If I change the hash((ord(term[0])-ord(‘a’))//3), starting a, b, c terms storing in server 1, def storing server 2. Totally, you needs to have 26//3 +1 = 9 servers. The benefit is this is globally sorted from server 1 to server 9 by terms. 

 

Design a Search Engine

1. Inverted Index in memory
Small amount of data that you can load all terms and docid in memory
21bytes terms on avg and 1000 docids with 4 bytes int. And if you have 50 % of 4GB ram commodity machines, you can handle 80,000,000 terms.

Covid19:[2, 17, 678, 898, 9886]
sockcrush:[56, 89, 222, 678, 5678, 98760]
.
.
.

2. Inverted Index on Disk

Term:docid in a file

I1 Bearish2
Covid191 Covid191
I1 Crash2
You1SORT ==>I1
Told1 I1
Stock2 Stock2
Bearish2 Told1
Crash2 You1
You can do it by External sort and Merge using Files.
Merge can be multiple phases due to # of files. It merges to one global sorted file.
k-way external sort-merge O(logk)*O(nk) with min-heap with k values.
Related leetcode problem
 
5. Inverted Index on Muti Disks-Map Reduce (altogether with GFS, External sort and merge to ss Table)
50 billion docs

a. Distributed File System (GFS) because the SS Table(inverted index) is huge.

Master knows its chunks and can answer whic machine the filename “ss table” file resides and location of the files in the machine.
SS Table file : c1 -> s1, s1′, s1”
              c2 -> s2,s2′, s2”
              c3 -> s3, s3′, s3”
              cn -> sn, sn’, sn”

b. Map-Reduce (Batch processing)

Chunck servers has files with Term:dockid from crowlers

Chunk server 1Chunk server 2Chunk server 3Chunk server 4Chunk m
a, 1a, 100a, 320a, 450  
aa, 1aaa, 100aaa, 320apple, 450  
abc, 2abc,100abc, 320avid, 450  
apple,2....  
....  
....  

Mappers load the file term:docid and save it to reducer files and get ready for send the data to reducer. each reducer maps to H(term)%R from term. So each reducer get all the same term from all chunk servers. And then Reducer can do external sort and merge to get SS table file for H(term)%R.When does reducer know that it gets all the data from chunk server is because  of the scheduler.  Mapper can sort and reduce duplicates then it can reduce datacenter’s network traffiic.

6. ETL (Extract, Transfer, Load) 

By External sort and Merge, we have one global sorted Term and docID.
Chang: 20
Chang: 67
Chang: 187

From one global sorted Term and docID, The goal is to create SS Table file (Sorted String Table):
Chang: 20, 67, 187,…
Cheng: 456, 578, 2898,…

Loading the SS table to RAM: Instead of loading all the DocIDs (not feasible too big), we add only offset to save the RAM space.
Dictionary in RAM
Chang: 1245 (offset fro the file)
Cheng: 3467 (offset from the file)

Option2: In a block of the SS Table, we can set the offset of the block. Not all terms in dictionary
Dictionary in RAM
Chang: 1245 (offset)
Dannis: 3953 (offset)
Ellis: 5688 (offset)

If you tries to find Cheng, you go Chang first and search from the Cheng’s offset in the file
Benefits: you can even save more space.

7. App tier (imported from cache or DB)

Search: Covid19 and stock crush

Get the docIDs for “Covid19” and the docIDs for “stock crush” and merge 

Covid19 : [3, 23, 56, 67, 8,123,234,332,432,455,478,543,546,568,590, …….]

Stock crush : [33, 35,37,34,36,38,45,58,688,690,698,874,890,911,1221,1322, ….]

Problem. The client might not need to have all the document in advance. He might need to a few most relevant documents and that is it. But we are doing get all the docIDs from both and do and query.

Scatter-Gather (It is good to have in  Cache)

 

It can serve to retrieve data faster. By chunking the long list data into range, it can serve faster. 

LRU Cache

Design a data structure that follows the constraints of a Least Recently Used (LRU) cache.

Implement the LRUCache class:

LRUCache(int capacity) Initialize the LRU cache with positive size capacity.
int get(int key) Return the value of the key if the key exists, otherwise return -1.
void put(int key, int value) Update the value of the key if the key exists. Otherwise, add the key-value pair to the cache. If the number of keys exceeds the capacity from this operation, evict the least recently used key.
The functions get and put must each run in O(1) average time complexity.
What is considered as `used`?
Both get and put are considered as used.

Example 1:

Input
[“LRUCache”, “put”, “put”, “get”, “put”, “get”, “put”, “get”, “get”, “get”]
[[2], [1, 1], [2, 2], [1], [3, 3], [2], [4, 4], [1], [3], [4]]
Output
[null, null, null, 1, null, -1, null, -1, 3, 4]

Explanation
LRUCache lRUCache = new LRUCache(2);
lRUCache.put(1, 1); // cache is {1=1}
lRUCache.put(2, 2); // cache is {1=1, 2=2}
lRUCache.get(1); // return 1
lRUCache.put(3, 3); // LRU key was 2, evicts key 2, cache is {1=1, 3=3}
lRUCache.get(2); // returns -1 (not found)
lRUCache.put(4, 4); // LRU key was 1, evicts key 1, cache is {4=4, 3=3}
lRUCache.get(1); // return -1 (not found)
lRUCache.get(3); // return 3
lRUCache.get(4); // return 4

class Node:
def __init__(self, key, value, left=None, right=None):
self.key = key
self.value = value
self.left = left
self.right = right


class LRUCache(object):

def __init__(self, capacity):
self.cap = capacity
self.head, self.tail = None, None
self.keyMap = dict()

def get(self, key):
if key not in self.keyMap:
return -1
node = self.keyMap[key]
self.removeNode(node)
self.addingNodeFront(node)
return node.value

def put(self, key, value):
if key in self.keyMap:
# Move the key in the first.
keyNode = self.keyMap[key]
self.removeNode(keyNode)
self.addingNodeFront(keyNode)

else:
# first check the capacity.
if len(self.keyMap) < self.cap:
# if less than capacity, add the key in the first
newNode = Node(key, value)
if self.head == None:
self.head = newNode
self.tail = newNode
else:
self.addingNodeFront(newNode)
# add the node to Mapp
self.keyMap[key] = newNode
else:
# else remove the last in the list and add the new key in the first. and add the node in keymap
tmp = self.tail
self.tail = tmp.left
self.tail.right = None
self.keyMap.pop(tmp.key)

newNode = Node(key, value)
self.addingNodeFront(newNode)
self.keyMap[key] = newNode

print(self.keyMap)

def addingNodeFront(self, node):
tmp = self.head
self.head = node
node.right = tmp
tmp.left = node

def removeNode(self, node):
_prev = node.left
_next = node.right

if _prev == None:

self.head = node.right
_next.left = None

elif _next == None:
self.tail = node.left
_prev.right = None

else:
_prev.right = _next
_next.left = _prev
Test Code
lRUCache = LRUCache(2)
lRUCache.put(1, 1) # cache is {1=1}
lRUCache.put(2, 2) # cache is {1=1, 2=2}
print(lRUCache.get(1)) # // return 1
lRUCache.put(3, 3) # // LRU key was 2, evicts key 2, cache is {1=1, 3=3}
print(lRUCache.get(2)) # // returns -1 (not found)
lRUCache.put(4, 4) # // LRU key was 1, evicts key 1, cache is {4=4, 3=3}
# lRUCache.get(1); // return -1 (not found)
# lRUCache.get(3); // return 3
# lRUCache.get(4); // return 4