1. Overview of Big Data and NoSQL technologies:
Compared with a relational database, a search engine performing information-retrieval tasks does not require full understanding of a complex database schema or join queries, and it can do fuzzy matching when keywords are missing. It also has better support for unstructured data (documents, spreadsheets, etc.).
2. Requirements of searching:
-To allow users to recognize and observe objects and activities –> clustering and classification;
-To give users a better experience browsing useful information –> interest correlation;
-To summarize the data as histograms, charts, and timeline-based diagrams –> topic discovery and summarization;
3. Indexing
-Index an object/document by its features (keywords), such as image tags or a fingerprint ID (see the sketch after this list);
-Inverted indexing (feature –> objects) can be a way to search for an object, but it is very costly for a large collection of data.
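As a concrete illustration, here is a minimal feature-index sketch in Python (the `docs` data is made up for the example), mapping each keyword to the documents that contain it:

```python
from collections import defaultdict

# Toy documents; in practice the features could be image tags, fingerprint IDs, etc.
docs = {1: "big data nosql", 2: "nosql search engine", 3: "big search"}

# Build the index: feature (keyword) -> set of object/document ids.
index = defaultdict(set)
for doc_id, text in docs.items():
    for keyword in text.split():
        index[keyword].add(doc_id)

print(index["nosql"])   # {1, 2} -- lookup by a single feature is cheap
print(index["search"])  # {2, 3}
```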
4. Locality-Sensitive Hashing (LSH) method for object searching
-Find similar pairs among n objects in roughly O(n) time (linear), instead of O(n²) pairwise comparisons;
-Hash each object x to h(x), so that:
if x = y or x ≈ y, then h(x) = h(y) with high probability; conversely, if x and y are dissimilar, h(x) ≠ h(y) with high probability.
-Using a normal hash table to map a large number n of objects to a small number m of hash buckets, the search time is independent of n (see the sketch below).
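A minimal random-hyperplane LSH sketch in Python (the plane count and data are illustrative assumptions, not from the notes): similar vectors land in the same bucket with high probability, so a lookup touches one bucket instead of all n objects.

```python
import numpy as np
from collections import defaultdict

def lsh_key(vectors, planes):
    """One bit per hyperplane: the sign of the projection.
    Vectors with a small angle between them get identical bits with high probability."""
    bits = vectors @ planes.T > 0
    return [tuple(row) for row in bits]

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 64))   # 1000 objects, 64 features each
planes = rng.normal(size=(16, 64))   # 16 random hyperplanes -> 16-bit keys

buckets = defaultdict(list)          # hash table: key -> object ids
for i, key in enumerate(lsh_key(data, planes)):
    buckets[key].append(i)

# Candidate matches for a query are just its bucket -- independent of n.
query = data[0] + 0.01 * rng.normal(size=64)   # near-duplicate of object 0
print(buckets[lsh_key(query[None, :], planes)[0]])  # very likely contains 0
```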
5. LSH applications:
-fingerprint matching;
-grouping tweets by similarity;
-analyzing document duplication;
-finding time-series patterns in big data;
-resolving people's identities from multiple inputs
Part II
1. Information is about surprise: a message informing us of an event that has probability p carries log2(1/p) bits of information. That is to say, common messages get shorter/simpler expressions, while rare messages are usually longer.
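A quick numeric check of the formula:

```python
import math

def self_information_bits(p):
    """Bits of information in a message for an event of probability p."""
    return -math.log2(p)  # equivalently log2(1/p)

print(self_information_bits(0.5))    # 1.0 bit: a fair coin flip
print(self_information_bits(1/256))  # 8.0 bits: rarer events carry more information
```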
2. Shannon mutual information model
In order to maximize the mutual information, the transmitted signal should match the context of the receiver. For example, when a user wants to buy a specific type of product and the ads shown are for that specific product type (reflected in the number of times those ads are clicked), the mutual information is highest.
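For reference, the mutual information between transmitted signal X and received signal Y is:

$$I(X;Y) = \sum_{x,y} p(x,y)\,\log_2 \frac{p(x,y)}{p(x)\,p(y)}$$

It is largest when observing Y (say, the clicked ad) tells us the most about X (the user's intent).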
Example: Google's AdSense provides the inverse search, mapping "pages to keywords" rather than a search engine's "query words to pages".
Part III
Question: Why Big-Data Technologies?
Compared with new big-data technologies, the traditional distributed system has the following shortcomings:
- Not fault-tolerant at scale;
- The variety of data types makes relational DB technology complicated;
- Data must be continuously archived to prevent unlimited growth;
- Parallelism was an add-on;
- Limited computing capability;
- Price-performance challenge
Solutions: MapReduce and a distributed file system (DFS)
MapReduce
What is MapReduce?
MapReduce is a programming model for performing distributed computation on large amounts of data, and an execution framework for processing data on a server cluster.
Why Large Data?
- Because large-scale data leads to better algorithms and systems for solving real-world problems.
- HOW? By organizing computations on a cluster of machines. –> MapReduce
Why MapReduce?
- Scales up to Internet-scale big data;
- Analytics over user-behavior data, such as ever-growing user-submitted requests, logs, etc. This can be used for business-intelligence analysis, such as data warehousing, data mining, recommendation, etc.
MapReduce Implementation
MapReduce is a technique for the data-parallel paradigm with a message-passing workflow: we specify what to do in the 'map' and 'reduce' phases, and leave the detailed message passing to the MapReduce framework.
When the data set is extremely large, even plain map and reduce functions are not efficient enough: each piece of intermediate data a mapper emits needs to propagate to all reducers. With extra combiners, which apply a local reduce to each mapper's output before the shuffle, the amount of data sent across the network shrinks (see the sketch below).
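A minimal word-count sketch in Python of the map/combine/reduce contract (the function names are illustrative; a real framework performs the shuffle between phases):

```python
from collections import defaultdict
from itertools import groupby

def map_phase(document):
    for word in document.split():
        yield (word, 1)                    # emit (key, value) pairs

def combine(pairs):
    """Local, per-mapper reduction: shrinks what is shuffled to reducers."""
    local = defaultdict(int)
    for word, count in pairs:
        local[word] += count
    return local.items()

def reduce_phase(word, counts):
    return (word, sum(counts))

# Simulate two mappers, each followed by a combiner.
docs = ["big data big ideas", "data flows to big clusters"]
shuffled = sorted(pair for d in docs for pair in combine(map_phase(d)))
# The framework groups by key before reducing.
for word, group in groupby(shuffled, key=lambda kv: kv[0]):
    print(reduce_phase(word, (c for _, c in group)))
```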
Roles in a distributed web system:
There is exactly one NameNode in each cluster; it manages the namespace, filesystem metadata, and access control. You can also set up an optional SecondaryNameNode, which periodically checkpoints the NameNode's metadata for fault tolerance. The rest of the machines in the cluster act as both DataNodes and TaskTrackers. DataNodes hold the system data; each DataNode manages its own locally scoped storage, i.e., its local hard disk. TaskTrackers carry out the map and reduce operations.
HDFS, GFS and big-data storage
-Large data is stored as chunks of files across chunk servers, and chunks are replicated across nodes for failure recovery –> consistency
-A read operation needs to know which node stores the requested data –> the NameNode tells the client which chunk server holds it
-A write operation inserts the data on a primary chunk server and replicates it across other nodes; when a write fails, retry it or store the data in another chunk. (Once done, contact the NameNode to update the metadata.)
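A toy Python model of the NameNode's metadata role described above (all names hypothetical; real HDFS adds leases, write pipelines, and heartbeats):

```python
import random

class ToyNameNode:
    """Holds only metadata: which chunk servers store each chunk of a file."""
    def __init__(self, servers, replication=3):
        self.servers = servers
        self.replication = replication
        self.chunk_map = {}   # (filename, chunk_no) -> [server, ...]

    def allocate(self, filename, chunk_no):
        # Pick `replication` distinct servers for this chunk.
        placed = random.sample(self.servers, self.replication)
        self.chunk_map[(filename, chunk_no)] = placed
        return placed          # the client writes the data to these servers

    def lookup(self, filename, chunk_no):
        # Read path: tell the client where to fetch the chunk.
        return self.chunk_map[(filename, chunk_no)]

nn = ToyNameNode(servers=[f"datanode-{i}" for i in range(8)])
print(nn.allocate("logs/2014-10-01.txt", 0))  # e.g. ['datanode-3', ...]
print(nn.lookup("logs/2014-10-01.txt", 0))    # same servers on read
```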
NoSQL and MySQL
-MySQL issue: transactions to maintain consistency control (locks, transaction logs, etc.)
-MySQL storage: B+Tree data structure for indexing; disk management handled by the RDBMS
-MySQL overhead: as the dataset grows larger and table joins become complex, queries become slow
HBase
-Column-based DB: columns are mapped as page projections
-OLAP: Online Analytical Processing
-Why not MySQL: transaction processing is not needed for analytics (ACID properties are unnecessary), so complex join statements and indexes become less relevant.
-NoSQL vs. in-memory databases:

| NoSQL | In-memory Database |
| --- | --- |
| No ACID transactions | Real-time transactions |
| Restricted joins | Complex joins |
| Columnar storage | |
| Sharded indexing rather than traditional indexing | Various indexes |
The column-family design allows several columns under the same category attribute. New data is written as new columns/versions rather than overwriting old values, which allows creating snapshots of the data.
The distributed-filesystem nature of NoSQL allows high-performance parallel operations (large parallel inserts/reads are efficient).
Aggregation queries are efficient thanks to this parallel computation.
HBase distributes records across servers based on a single (row) key; to enable effective queries on other attributes in NoSQL, a secondary index is needed (see the sketch below).
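A sketch in Python of why the secondary index helps: records are sharded by row key, so querying any other attribute would otherwise require scanning every server (all names illustrative):

```python
# Records sharded by row key (user id); HBase can fetch by this key directly.
shards = {
    "server-1": {"user:001": {"city": "Austin"}, "user:002": {"city": "Boston"}},
    "server-2": {"user:003": {"city": "Austin"}},
}

# Secondary index: attribute value -> row keys, maintained alongside writes.
city_index = {"Austin": ["user:001", "user:003"], "Boston": ["user:002"]}

def find_by_city(city):
    # Without the index this would scan every shard; with it we jump to the keys.
    return city_index.get(city, [])

print(find_by_city("Austin"))  # ['user:001', 'user:003']
```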
MongoDB
Document-based. Can use any underlying file system. Data is stored in shards; full-text indexing and MapReduce are supported.
Writes: MongoDB uses the 'eventual consistency' principle, unlike HBase or an RDBMS (where a write does not succeed until replication is done).
Reads use timestamps to return the latest written result (see the sketch below).
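A minimal sketch of timestamp-based "read the latest" under eventual consistency (hypothetical replica data; real MongoDB handles this through replica sets and read preferences):

```python
# Each replica may lag; every write carries a timestamp (version).
replicas = [
    {"key": "cart:42", "value": ["book"], "ts": 100},
    {"key": "cart:42", "value": ["book", "pen"], "ts": 130},  # newest write
    {"key": "cart:42", "value": ["book"], "ts": 100},         # not yet replicated
]

def read_latest(replicas):
    # Pick the version with the highest timestamp across replicas.
    return max(replicas, key=lambda r: r["ts"])["value"]

print(read_latest(replicas))  # ['book', 'pen']
```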
SQL is hard to map onto MapReduce directly; available solutions are Pig Latin and HiveQL.