Web Intelligence and Big Data

Notes on Web Intelligence & Big Data

Part I

1. Overview of Big Data and NoSQL technologies:
Compared with a relational database, a search engine performing information retrieval does not require full understanding of a complex database schema or joined queries, and it can fuzzy-match queries with missing keywords. It also has better support for unstructured information (documents, spreadsheets, etc.).
2. Requirements of searching:
-To allow users to recognize and observe objects and activities –> clustering and classification;
-To give users a better experience when browsing useful information –> interest correlation;
-To summarize the data as histograms, charts and timeline-based diagrams –> topic discovery and summarization;
3. Indexing
-Index an object/document by its features (keywords), such as image tags or a fingerprint ID;
-Inverted indexing (from features back to objects) can be a way to search for objects, but it is very costly for a large collection of data.
4. Locality Sensitive Hashing (LSH) method for object search (a Java sketch follows the applications list below)
-Compare n objects for similar pairs in roughly O(n) time (linear time), instead of examining all O(n^2) pairs;
-Hash object x to h(x) so that:
if x = y or x ≈ y, then h(x) = h(y) with high probability; conversely, if x and y are dissimilar, h(x) ≠ h(y) with high probability.
-Using a hash table to map a large number n of objects into a small number m of hash buckets, the time to look up a candidate match is independent of n.
5. LSH applications:
-fingerprint matching;
-grouping tweets by similarity;
-analyzing document duplication;
-finding patterns in large time-series data;
-resolving people's identities from multiple inputs.
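To make the hashing idea concrete, here is a minimal Java sketch of one common LSH scheme, MinHash for set similarity (not from the course notes; the class name, number of hash functions and the cheap seeded hash are all illustrative choices):

 import java.util.*;

 // Minimal MinHash-based LSH sketch (illustrative, not a production implementation).
 public class MinHashLSH {
     private final int numHashes;
     private final int[] seeds;

     public MinHashLSH(int numHashes) {
         this.numHashes = numHashes;
         this.seeds = new Random(42).ints(numHashes).toArray();
     }

     // Signature: for each hash function, the minimum hash value over the set's tokens.
     public int[] signature(Set<String> tokens) {
         int[] sig = new int[numHashes];
         Arrays.fill(sig, Integer.MAX_VALUE);
         for (String t : tokens) {
             for (int i = 0; i < numHashes; i++) {
                 int h = (t.hashCode() ^ seeds[i]) * 0x9E3779B9; // cheap seeded hash
                 if (h < sig[i]) sig[i] = h;
             }
         }
         return sig;
     }

     // Similar sets agree on a large fraction of signature positions,
     // which estimates their Jaccard similarity.
     public static double estimatedJaccard(int[] a, int[] b) {
         int same = 0;
         for (int i = 0; i < a.length; i++) if (a[i] == b[i]) same++;
         return (double) same / a.length;
     }

     // LSH bucketing: hash one band (slice) of the signature; objects sharing a
     // bucket in any band become candidate pairs, so lookup cost is independent of n.
     public static int bucketKey(int[] sig, int bandStart, int rowsPerBand) {
         return Arrays.hashCode(Arrays.copyOfRange(sig, bandStart, bandStart + rowsPerBand));
     }

     public static void main(String[] args) {
         MinHashLSH lsh = new MinHashLSH(128);
         int[] s1 = lsh.signature(new HashSet<>(Arrays.asList("big", "data", "web", "intelligence")));
         int[] s2 = lsh.signature(new HashSet<>(Arrays.asList("big", "data", "web", "mining")));
         System.out.println(estimatedJaccard(s1, s2)); // high for similar sets
     }
 }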

Part II

1. Information is about surprise: a message informing us of an event that has probability p carries log2(1/p) = -log2(p) bits of information. That is to say, common messages get shorter/simpler expressions, while rare messages are usually longer.
2. Shannon's mutual information model
To maximize mutual information, the transmitted signal should match the context of the receiver. For example, when a user wants to buy a specific type of product and the ads shown are the ones most frequently clicked for that product type, the mutual information between the query and the ad is highest.
Example: Google's AdSense maintains the inverse mappings between "pages to keywords" and "query words to pages".
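In symbols (standard definitions, included here for reference): the self-information of an event with probability p(x), and the mutual information between a transmitted signal X and the receiver's context Y, are

 I(x) = -\log_2 p(x)

 I(X;Y) = \sum_{x,y} p(x,y)\,\log_2 \frac{p(x,y)}{p(x)\,p(y)}

Mutual information is maximized when knowing X tells you as much as possible about Y, which is the formal version of "the transmitted signal should match the receiver's context."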

Part III

Question: Why Big-Data Technologies?
Compared with new big-data technologies, traditional distributed systems have the following shortcomings:
  1. Not fault-tolerant at scale;
  2. The variety of data types makes relational DB technology complicated;
  3. Data needs to be continuously archived to prevent unlimited growth;
  4. Parallelism was an add-on;
  5. Limited computing capability;
  6. Price-performance challenges.
Solutions: MapReduce and a distributed file system (DFS)
MapReduce
There are two broad approaches to parallel programming: shared-memory processing and high-level message passing.

The shared-memory approach is easy to implement but hard to scale; message passing is harder to implement but scales well across distributed machines and avoids global lock issues.
What is MapReduce?
MapReduce is a programming model for performing distributed computation on large amounts of data, and an execution framework for processing that data on a server cluster.
Why Large Data?
  • Because large amounts of data lead to better algorithms and systems for solving real-world problems.
  • HOW? By organizing computations on a cluster of machines –> MapReduce
Why MapReduce?
  • Scales up to Internet-scale big data;
  • Analytics of user behavior data, such as ever-growing user requests, logs, etc., which can be used for business intelligence analysis: data warehousing, data mining, recommendation, and so on.
MapReduce Implementation
MapReduce is a technique for the data-parallel paradigm built on a message-passing workflow: we specify what to do in the 'map' and 'reduce' steps, and leave the detailed message passing to the MapReduce framework.
When the data set is extremely large, even a plain MapReduce job may not be efficient enough, because each piece of map output needs to propagate across the network to the reducers. Adding combiners, which pre-aggregate map output locally before the shuffle, reduces the data sent to the reducers and improves efficiency (the word-count sketch below registers its reducer as a combiner for exactly this reason).
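As a concrete illustration, here is the canonical word-count job written against the Hadoop MapReduce API (a minimal sketch; input and output paths come from the command line, and the combiner simply reuses the reducer):

 import java.io.IOException;
 import java.util.StringTokenizer;
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.IntWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Job;
 import org.apache.hadoop.mapreduce.Mapper;
 import org.apache.hadoop.mapreduce.Reducer;
 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

 public class WordCount {
   // Map: emit (word, 1) for every token in the input line.
   public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
     private final static IntWritable one = new IntWritable(1);
     private final Text word = new Text();
     public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
       StringTokenizer itr = new StringTokenizer(value.toString());
       while (itr.hasMoreTokens()) {
         word.set(itr.nextToken());
         context.write(word, one);
       }
     }
   }

   // Reduce (also used as combiner): sum the counts for each word.
   public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
     private final IntWritable result = new IntWritable();
     public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
       int sum = 0;
       for (IntWritable val : values) sum += val.get();
       result.set(sum);
       context.write(key, result);
     }
   }

   public static void main(String[] args) throws Exception {
     Configuration conf = new Configuration();
     Job job = Job.getInstance(conf, "word count");
     job.setJarByClass(WordCount.class);
     job.setMapperClass(TokenizerMapper.class);
     job.setCombinerClass(IntSumReducer.class); // pre-aggregate before the shuffle
     job.setReducerClass(IntSumReducer.class);
     job.setOutputKeyClass(Text.class);
     job.setOutputValueClass(IntWritable.class);
     FileInputFormat.addInputPath(job, new Path(args[0]));
     FileOutputFormat.setOutputPath(job, new Path(args[1]));
     System.exit(job.waitForCompletion(true) ? 0 : 1);
   }
 }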

Roles in Distributed web system:
There is exactly one NameNode in each cluster, which manages the namespace, filesystem metadata, and access control. You can also set up an optional SecondaryNameNode, used for periodic handshaking with NameNode for fault tolerance. The rest of the machines within the cluster act as both DataNodes and TaskTrackers. The DataNode holds the system data; each data node manages its own locally scoped storage, or its local hard disk. The TaskTrackers carry out map and reduce operations.

HDFS, GFS and big-data storage
-Large data is stored as chunks of files across chunk servers, and chunks are replicated across nodes for failure recovery –> consistency;
-A read operation needs to know which node stores the requested data –> the NameNode tells the client which chunk server to contact (sketched below);
-A write operation sends data to a primary chunk server and the write is replicated to the other replicas; on failure, it retries or stores the data in another chunk (once done, the client contacts the NameNode to update the metadata).
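For example, a client-side read through the HDFS Java API looks roughly like this (the cluster address and file path are placeholders); the NameNode lookup happens inside the client library, so application code never has to locate DataNodes itself:

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.FSDataInputStream;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.IOUtils;

 public class HdfsReadExample {
     public static void main(String[] args) throws Exception {
         Configuration conf = new Configuration();
         conf.set("fs.defaultFS", "hdfs://namenode:8020"); // placeholder cluster address
         FileSystem fs = FileSystem.get(conf);
         // The client asks the NameNode for block locations, then streams the
         // blocks directly from the DataNodes that hold them.
         try (FSDataInputStream in = fs.open(new Path("/data/sample.txt"))) {
             IOUtils.copyBytes(in, System.out, 4096, false);
         }
     }
 }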
NoSQL and MySQL
-MySQL issue: transactions for consistency control (locking, transaction logs, etc.) add overhead
-MySQL storage: B+Tree data structure for indexing; disk management handled by the RDBMS
-MySQL overhead: when the dataset becomes larger and table joins become complex, queries slow down
HBase
-Column-oriented DB: queries that project a few columns only read those columns' pages
-OLAP: Online Analytical Processing
-Why not MySQL: transaction processing is not needed for analytics (ACID properties are unnecessary), so complex join statements and indexes become less relevant.
-NoSQL vs. in-memory databases:
  • No ACID transactions vs. real-time transactions;
  • Restricted joins vs. complex joins;
  • Columnar storage with sharded indexing (rather than traditional indexing) vs. various kinds of indexes.
The columnar nature allows several columns under the same category (column-family) attribute. New data is written by adding new column versions, which makes it possible to create snapshots of the data.
The distributed-filesystem nature of NoSQL allows high-performance parallel operations (large parallel inserts/reads are efficient),
and aggregation queries are efficient thanks to parallel computing.
HBase distributes records across servers based on a single row key; to enable efficient queries on other attributes in NoSQL, a secondary index is needed (see the row-key lookup sketch below).
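As an illustration, a row-key lookup with the HBase Java client looks roughly like this (the table, column family and row-key names are made up for this sketch):

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.hbase.HBaseConfiguration;
 import org.apache.hadoop.hbase.TableName;
 import org.apache.hadoop.hbase.client.Connection;
 import org.apache.hadoop.hbase.client.ConnectionFactory;
 import org.apache.hadoop.hbase.client.Get;
 import org.apache.hadoop.hbase.client.Result;
 import org.apache.hadoop.hbase.client.Table;
 import org.apache.hadoop.hbase.util.Bytes;

 public class HBaseGetExample {
     public static void main(String[] args) throws Exception {
         Configuration conf = HBaseConfiguration.create();
         try (Connection conn = ConnectionFactory.createConnection(conf);
              Table table = conn.getTable(TableName.valueOf("users"))) {
             // Fast path: the row key determines which region server holds the record.
             Get get = new Get(Bytes.toBytes("user#123"));
             Result result = table.get(get);
             byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
             System.out.println(Bytes.toString(name));
             // Querying by a non-key attribute would need a full scan or a
             // manually maintained secondary index table.
         }
     }
 }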
MongoDB
Document-based. Can use any underlying file system. Data is stored in shards, with support for full-text indexing. Supports MapReduce.
Writes: MongoDB uses the 'eventual consistency' principle, unlike HBase or an RDBMS (where writes don't succeed until replication is done).
Reads use timestamps to return the latest written result.
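For illustration, a minimal insert-and-read with the modern MongoDB Java driver (a newer driver API than what existed when these notes were written; database, collection and field names are placeholders):

 import com.mongodb.client.MongoClient;
 import com.mongodb.client.MongoClients;
 import com.mongodb.client.MongoCollection;
 import com.mongodb.client.MongoDatabase;
 import org.bson.Document;

 public class MongoExample {
     public static void main(String[] args) {
         try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
             MongoDatabase db = client.getDatabase("test");
             MongoCollection<Document> users = db.getCollection("users");

             // Insert a document; how many replicas must acknowledge the write
             // (and where reads go) is tunable via write concerns / read preferences.
             users.insertOne(new Document("name", "alice").append("score", 95));

             Document found = users.find(new Document("name", "alice")).first();
             System.out.println(found);
         }
     }
 }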
SQL is hard to map onto MapReduce directly; available solutions include Pig Latin and HiveQL.

Generate object models from xsd file using Eclipse JAXB plugin

JAXB (Java Architecture for XML Binding) is a library with an integrated binding compiler for XML Schema, which lets Java code access and process XML elements without writing extra parsers, and also supports unmarshalling XML documents into Java objects. The xjc component in the JAXB library provides an easy way to convert between XML Schema Definitions (XSD) and object models. With the maven-jaxb-plugin component in Eclipse, bean/object files can be generated automatically from a user-defined xsd file (see the binding-process diagram in the official Oracle documentation).

This post illustrated a simple example of this conversion in Eclipse.

The first step is to add the JAXB maven plugin for Eclipse by adding the following dependency to the pom.xml file:

 <dependency>
 <groupId>com.sun.tools.xjc.maven2</groupId>
 <artifactId>maven-jaxb-plugin</artifactId>
 <version>1.1.1</version>
 </dependency>

The following userInfo xsd file defines a simple schema of user info with a rating list, which records the user's rating history.

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
 targetNamespace="http://thoughtforge.net/model" xmlns:tns="http://thoughtforge.net/model"
 attributeFormDefault="unqualified" elementFormDefault="qualified"
 version="1.0">
 
 <xsd:element name="userInfo" type="tns:userInfo" />
 <xsd:complexType name="userInfo">
 <xsd:sequence>
 <xsd:element name="name">
 <xsd:simpleType>
 <xsd:restriction base="xsd:string">
 <xsd:maxLength value="50" />
 </xsd:restriction>
 </xsd:simpleType>
 </xsd:element>
 <xsd:element name="address">
 <xsd:simpleType>
 <xsd:restriction base="xsd:string">
 <xsd:maxLength value="50" />
 </xsd:restriction>
 </xsd:simpleType>
 </xsd:element>
 <xsd:element name="ratingList" type="tns:ratings"
 maxOccurs="unbounded">
 </xsd:element>
 </xsd:sequence>
 </xsd:complexType>
 <xsd:complexType name="ratings">
 <xsd:sequence>
 <xsd:element name="ratingItemId" type="xsd:string" />
 <xsd:element name="ratingScore" type="xsd:int" />
 </xsd:sequence>
 </xsd:complexType>
</xsd:schema>

Right-click this userInfo.xsd file, choose 'Generate… –> JAXB classes', and specify the destination package; it will generate the ObjectFactory.java, package-info.java, Ratings.java and UserInfo.java classes. By specifying the element attribute <maxOccurs="unbounded">, JAXB automatically recognizes the aggregation nature of the rating element and creates a List type for this field.
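Once the classes are generated, a quick round trip through the JAXB runtime might look like the sketch below; the file name is a placeholder and the package name assumes xjc's default derivation from the target namespace (it depends on the destination package you chose):

 import java.io.File;
 import javax.xml.bind.JAXBContext;
 import javax.xml.bind.JAXBElement;
 import javax.xml.bind.Marshaller;
 import javax.xml.bind.Unmarshaller;
 import javax.xml.namespace.QName;
 import net.thoughtforge.model.UserInfo; // assumed package for the generated class

 public class JaxbRoundTrip {
     public static void main(String[] args) throws Exception {
         JAXBContext ctx = JAXBContext.newInstance(UserInfo.class);

         // XML -> object: xjc may wrap the root in a JAXBElement when the complex
         // type is declared separately from the global element, so unwrap defensively.
         Unmarshaller unmarshaller = ctx.createUnmarshaller();
         Object root = unmarshaller.unmarshal(new File("userInfo.xml"));
         UserInfo user = (root instanceof JAXBElement)
                 ? ((JAXBElement<UserInfo>) root).getValue()
                 : (UserInfo) root;
         System.out.println(user.getName());

         // Object -> XML: wrap in a JAXBElement carrying the element name from the schema.
         Marshaller marshaller = ctx.createMarshaller();
         marshaller.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, Boolean.TRUE);
         marshaller.marshal(new JAXBElement<UserInfo>(
                 new QName("http://thoughtforge.net/model", "userInfo"), UserInfo.class, user),
                 System.out);
     }
 }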

The full description of XSD element types can be found here.

Things to think about:

1) How to define generic data type and generate corresponding classes;

2) Find tools to derive an xsd from a raw xml response.

Reference: this post and this tutorial

Configure Apache XML-RPC with Spring 3

XML-RPC is a client/server communication protocol that wraps client requests and server responses in XML messages. By specifying a method name and parameters, registered server-side services can be invoked to perform computational tasks or backend operations, and the result is returned to the client as an XML response. To avoid modifying existing code when new services are required, the Spring 3 framework is integrated, whose 'IoC' and 'DI' features make this possible. As an after-work practice, I followed the quick tutorials online (see the end of this article) and built a simple XML-RPC project. The main components include:

  1. Maven: helps resolve the project dependencies with unique versions, preventing discrepancies among releases;
  2. Spring 3: manages dependency injection (DI) and object management, which greatly reduces the implementation overhead of object lifecycle control;
  3. Apache XML-RPC: an XML communication protocol over HTTP to implement remote procedure calls, see http://ws.apache.org/xmlrpc/
  4. Tomcat 7.0 web server: provides the HTTP servlet container on which XML-RPC relies to transmit messages;
  5. Apache JMeter: a protocol testing tool, which can be used to test various kinds of web-based communication, see http://jmeter.apache.org/ .

The first step is creating a Maven-based project in Eclipse (install the m2eclipse plugin first); it will auto-generate the project skeleton (by choosing 'create archetype sample project') and the pom.xml script. The required dependencies are listed at http://pastebin.com/NLiPusVj .

The next step is to add the Dynamic Web Module facet to the Maven project, so that WEB-INF and META-INF content can be generated and used for servlet configuration. Simply right-click the project, choose the 'Project Facets' option in the properties panel, and check 'Dynamic Web Module' to complete the setup. Create a new Tomcat 7.0 server instance and bind the project resources to the server. The generated web.xml (create it if it does not exist) should look like http://pastebin.com/3Y0YfTg3 , where applicationContext.xml located in classpath:/ioc/ holds the Spring context and bean definitions, and Spring's DispatcherServlet is a Spring-compatible HTTP servlet used to transmit HTTP requests/responses. The <url-pattern> tag in the <servlet-mapping> element defines the access endpoint for all incoming requests.

<servlet-mapping>
 <servlet-name>xmlrpc</servlet-name>
 <url-pattern>/xmlrpc/*</url-pattern>
 </servlet-mapping>

So the final access point on the local server is http://localhost:8080/{projectName}/xmlrpc/{mapped_service_suffix}, with an XML POST body like the following:

<?xml version="1.0"?>
<methodCall>
 <methodName>publisher.insert</methodName>
 <params>
 <param>
 <value><int>3</int></value>
 </param>
 </params>
</methodCall>

The Spring 3 framework injects a handler mapping object

Map<String, IHandler> handlers

into the XmlRpcServerController class, and the SpringHandlerMapping class registers the handlers' public methods.

A client request pointed at the mapped service endpoint will invoke the following 'serve' method, which basically dispatches the request to the corresponding registered service:

@RequestMapping(value="/1", method=RequestMethod.POST)
 public void serve(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException{
 server.execute(request, response);
 }

The actual service implementation is in the PublishHandler class (implementing the IHandler base interface). The @Service annotation defines the set of RPC methods and provides the prefix of the method name, so in this example the access URL for insert is http://localhost:8080/XmlRpc/xmlrpc/1 and the method name is 'publisher.insert'. If the method also accepts other primitive parameters, they should be set inside the <params> group tag as well.
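Besides JMeter, the same call can be issued programmatically with the Apache XML-RPC client library; the sketch below reuses the endpoint and parameter from the example above, and the return value depends on what the handler's insert method actually returns:

 import java.net.URL;
 import org.apache.xmlrpc.client.XmlRpcClient;
 import org.apache.xmlrpc.client.XmlRpcClientConfigImpl;

 public class XmlRpcClientExample {
     public static void main(String[] args) throws Exception {
         XmlRpcClientConfigImpl config = new XmlRpcClientConfigImpl();
         config.setServerURL(new URL("http://localhost:8080/XmlRpc/xmlrpc/1"));

         XmlRpcClient client = new XmlRpcClient();
         client.setConfig(config);

         // Method name is "<handler prefix>.<method>"; parameters go in an Object[].
         Object result = client.execute("publisher.insert", new Object[] { 3 });
         System.out.println(result);
     }
 }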

The complete applicationContext.xml is at http://pastebin.com/e7kU0Y5q .

The <context:component-scan base-package="com.xmlrpc.service" /> tag defines the scope Spring scans for injectable components, and the <context:annotation-config /> tag means Spring annotations such as @Autowired and @Controller can be used to mark the role of a field/method/class.

[NOTE] This project mostly follows the blog posts Apache XML-RPC Service with Spring 3 and XML-RPC + Maven + Spring, but I'm still investigating the purpose of the SpringRequestProcessorFactoryFactory class and the dedicated handlerMapping data structure described in those posts. I'll keep this post updated with what I find out.

My first published android app– ‘Tips Widget’

I'm always tired of calculating how much tip to pay based on the restaurant bill, especially when we have to split it among multiple people. The mobile built-in calculator might be a good option, but the step-by-step multiplication and addition are still troublesome to me. Motivated by this, I decided to build a simple Android widget to get all these steps done in a few clicks.

Here's the first version of the Tips calculator widget; it was published on the Google market a while ago.

Tips Widget v1.0

It's simple and straightforward, integrating basic settings such as rate, tax and a confirm button. However, feedback and suggestions came back from users after it was published to the Google market. The primary drawbacks are the following:

  1. Poor user interface design: the elements and fonts are misaligned and therefore not attractive or intuitive, and the user doesn't know which fields are required;
  2. Heavy reliance on the system's built-in soft keyboard, which usually pops up as a huge full keyboard with alphabetical letters, various symbols and useless formatting buttons, which makes users dizzy;
  3. Too many input fields, which combined with #2 means users have to enter a value, dismiss the soft keyboard, click on the next field, and repeat the same steps over and over, which is awkward;
  4. Inefficient use of the screen: most text fields are full of blank space that occupies most of the screen without much use;
  5. Users get no feedback after they type or click on something; some means could be taken to improve their experience.

Based on this feedback, I decided to redesign and reimplement this small widget, just as an off-work practice. I spent about 5-6 nights working on it, did some on-paper prototype design, user surveys with friends and web research, and finally came up with the 2nd version of Tips Widget, which is also available on the Google Play Market now.

Tips Widget v2.0

To open the settings panel for rate configuration, click 'show options' and the panel will 'fly in'.

Tips Widget v2.0

The features include:

  1. Support for bill splitting;
  2. A user-friendly interface with an embedded number keyboard and screen vibration on each click;
  3. Clickable arrow groups on the text fields;
  4. An optimized user interface, including a leather-texture background and customized component styles;
  5. Animation on display and interaction.

Here are the planned improvements for the next update:

  1. Add a secondary display layout for horizontal (landscape) view, to avoid the scrolling that may hurt the user experience;
  2. Changeable application themes, including button styles and backgrounds;
  3. Payment history support, so you know your historical spend on bills.

Visit the Google Play Market for this free Tips Widget, and let me know how you like it, so I can make improvements accordingly in the next release :)

When I got stuck in Draw Something…

I love playing Draw Something, and my younger brother always gives me a hard time when he draws something; here are some pictures he drew.


With the increasing number of words outside my vocabulary, it's sometimes hard to get a PASS. Motivated by this, I thought it would be good practice to make a simple Android app to help me do the trick.

I’m gradually modifying the algorithm to improve its efficiency.

1. At first I tried generating the full permutation list of the given characters at each length, then called the Bing Translate API to get their Chinese meanings (using Bing because Google no longer provides free Translate API usage since last June. ToT…)

However, in this case the permutation list has size N! when the character list contains N elements, and making N! HTTP calls to the API just to check whether a word exists is obviously time-consuming;

2. Then I found a dictionary txt file with about 237,000 words here. Rather than making HTTP calls separately, I look up each combination in the dictionary; if it exists, I add it to a List<String>, and after that, iterate through the list and issue Bing API calls to get the translations.

This way the performance improved a bit, but still not quite enough.

3. Then I tried another way instead of permuting the character list: this time I do only a single scan over the dictionary. For each word, I sort its letters alphabetically first and check whether it is the LCS (Longest Common Subsequence) with the sorted character list, i.e. whether the sorted word appears as a subsequence of it; if so, I add the word to the List<String>. For example, 'road' is a possible result of 'android' because LCS('road'.sort(), 'android'.sort()) == 'road'.sort() == 'ador'.
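Here is a rough Java sketch of that check (class and method names are mine): a word is buildable from the character list when its sorted letters form a subsequence of the sorted character list.

 import java.util.Arrays;

 public class WordFromLetters {
     // True if every character of 'word' can be taken from 'letters'
     // (i.e. sorted(word) is a subsequence of sorted(letters)).
     static boolean canBuild(String word, String letters) {
         char[] w = word.toCharArray();
         char[] l = letters.toCharArray();
         Arrays.sort(w);
         Arrays.sort(l);
         int i = 0;
         for (int j = 0; j < l.length && i < w.length; j++) {
             if (w[i] == l[j]) i++;
         }
         return i == w.length;
     }

     public static void main(String[] args) {
         System.out.println(canBuild("road", "android"));  // true
         System.out.println(canBuild("zebra", "android")); // false
     }
 }

A single scan over the dictionary then calls canBuild once per dictionary word against the puzzle's character list.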

With this method the performance is quite acceptable; it took about 10 seconds to finish a 5-word calculation for a 10-character list.


4. The next two changes I plan to research on are:

a. Replace the text dictionary with a SQLite dict table in Android, and query for the word plus its translation stored as a table column as well;

b. Based on a, add a 'word frequency' value for each entry and use 'ORDER BY' and 'TOP' for the lookup; hopefully it will improve the lookup speed further.

In a later update to this post I'll describe some of the string algorithms in this 'tiny' app.

How to round to two decimal places in Java?

In my development of the tips calculator for Android, there's a step that rounds a double to the second decimal place (hundredths), e.g. 15.4321 to 15.43, or 1.4783 to 1.48, etc.

Here are four ways to do this, cited from a developer's tech blog:

 import java.math.BigDecimal;
 import java.text.DecimalFormat;
 import java.text.NumberFormat;

 public class format {
   double f = 111231.5585;

   // 1. BigDecimal with an explicit rounding mode
   public void m1() {
     BigDecimal bg = new BigDecimal(f);
     double f1 = bg.setScale(2, BigDecimal.ROUND_HALF_UP).doubleValue();
     System.out.println(f1);
   }

   // 2. DecimalFormat pattern
   public void m2() {
     DecimalFormat df = new DecimalFormat("#.00");
     System.out.println(df.format(f));
   }

   // 3. String.format
   public void m3() {
     System.out.println(String.format("%.2f", f));
   }

   // 4. NumberFormat with a maximum of two fraction digits
   public void m4() {
     NumberFormat nf = NumberFormat.getNumberInstance();
     nf.setMaximumFractionDigits(2);
     System.out.println(nf.format(f));
   }

   public static void main(String[] args) {
     format f = new format();
     f.m1();
     f.m2();
     f.m3();
     f.m4();
   }
 }

The result:

 111231.56 
 111231.56 
 111231.56 
 111,231.56

Difference between sort() and sort_by() in Ruby

In Ruby, if you want to sort an array, say ary = [1, 5, 3, 2, 4], you can do it in the following ways:

1. Use sort():

ary.sort { |a, b| a <=> b }

'<=>' is Ruby's combined comparison (spaceship) operator; here it specifies that a should be placed before b when a is less than b. This sorting method works well when the array is small, but performance degrades as the array grows, because the comparison block is evaluated over and over during the sort — on the order of 500 block calls for 100 elements and nearly 10,000 for 1,000 elements.

2. Use sort_by():

ary.sort_by { |x| x.to_i }

This approach is faster for expensive keys because, prior to sorting, it maps every element to its key and stores the pairs in a temporary array; the sort then compares these precomputed keys, so the key computation is applied only once per item.

See the following content cited from Gayle’s blog:

Ruby Sorting – When and Why to use sort_by()

September 28, 2009 — Gayle


When I read the rdoc on sort_by, I understood the general idea that sort_by is more efficient in some situations. The specifics on why were still over my head, so I wasn't planning to get into specifics during my recent talk on sorting. Yet just a few hours before my talk Jim Weirich was still trying to cajole me into using big words like "Schwartzian Transformation" in my talk because, he teased, "using big words makes you sound important" :)

The good thing is that this gave me a chance to talk it out with him, and actually understand it for real. It was too late for me to add that into my talk a few hours before I was to give it, but I do want to talk about it here now that I understand.

sort_by() is good if the values you’re sorting on require some kind of complex calculation or operation to get their value.

Let’s say you have an aquarium, and you save the dates of when each fish is born in a database. Later, you want to sort the list of fish by age. But you must calculate the age based on the birth date. So the Fish class has an age method:

class Fish
...
  def age
    (Date.today - birthday).to_i
  end
end

So when you sort like this:

>> fishes = Fish.find(:all)

>> fishes.sort do |a, b|
>>   a.age <=> b.age
>> end

It will calculate age over and over as it sorts. And if you’ve studied sorting algorithms, you know that the items in the list are compared with other list items repeatedly until it can be determined where the items go in the ordered list. So using this way of sorting, the age will be calculated a lot!

When you use sort_by() instead:

>> fishes.sort_by do |a|
>>   a.age
>> end

It does 3 things:

1. It will first go through each item in fishes, calculate age, and put those values into a temporary array keyed by the value. Let’s say we have 3 fish, one 300 days old, one 365 days old, and one 225 days old. The temporary array looks like this

[[300, #<Fish:A>], [365, #<Fish:B>], [225, #<Fish:C>]]

2. The complex calculation is now done, once for each fish. It sorts this temporary array by the first item in each sub array. Meaning, it sorts by the numbers 300, 365 and 225, without recalculating them.

[[225, #<Fish:C>], [300, #<Fish:A>], [365, #<Fish:B>]]

3. Lastly, it goes back through the array, grabbing the 2nd array elements (the actual Fish objects) and putting them in order into a flattened 1-dimensional array

[#<Fish:C>, #<Fish:A>, #<Fish:B>]

So, that is how you end up with a sorted array without recalculating values more than you need to. And that is why sort_by() can be more efficient.

Reading notes for “High Performance MySQL”

Chap 3: Schema Optimization and indexing

3.1 Schema optimization and data type

1. Choose the smallest schema/data-type size that fits: smaller is generally faster because it uses less disk, memory, and CPU cache;

2. Choose simple data types: integers are cheaper than characters, because character comparison involves character sets and collations. Also choose the built-in date-time types rather than CHAR for dates, and integers for IP addresses;

3. Avoid NULL if possible, because nullable columns are harder to index and optimize; substitute zero, an empty string or a special value instead;

4. Choose the proper datatype: TIMESTAMP uses half the storage space of DATETIME. If no negative numbers are needed, an unsigned int doubles the upper bound of the integer range.

5. CHAR and VARCHAR:

VARCHAR saves space: it uses a variable amount of storage, dynamic according to the actual content, plus 1 or 2 extra bytes to record the length, so VARCHAR(10) uses up to 11 bytes of storage. (When the largest length is much longer than the average length, use VARCHAR.)

The CHAR type always stores data at a fixed length; use it when content is short or when all values are nearly the same length (MD5, CRC32 hashes), because it doesn't need the extra length byte and generates less fragmentation.

6. BLOB and TEXT:

  • BLOB cannot specify a collation and character set, while TEXT can. MySQL does not sort TEXT and BLOB by their full length, but only by the first max_sort_length bytes of the column
  • How to avoid on-disk temp tables?
    • The Memory storage engine doesn't support BLOB and TEXT, so queries involving these two types need to first create a MyISAM temp table on disk, which is slow. To prevent this, we can sort by SUBSTRING(column, length) so the temp table can be created in memory rather than on disk. (We should use EXPLAIN to check whether the Extra column contains 'Using temporary'.)

7. Performance impact from ORM framework

  • ORMs often hide the complexity of the database from developers; however, they are sometimes not carefully designed to use a proper data storage approach, and simply 'dump' your data as separate rows or timestamp-based versions, which wastes space and reduces efficiency
  • In this case, you need to carefully verify that the framework scales well

3.2 Indexing basics

  • Storage: MyISAM uses B-Tree, while InnoDB uses B+Tree.
  • B-Tree allows finding the entry without scanning whole table, and it only traces from root to leaves using pointers, because parent node stores the lower and upper bound of child nodes
  • How can B-Tree help you find / sort data?
  • A B-Tree can find data with a common prefix, match a range of values, match an exact first part plus a range on the following part, or answer index-only (covering) queries

3.3 Hash indexing and hash collision in MySQL

  • The probability of hash collision in MySQL is 1% in 93000 values (CRC32 or MD5);
  • To avoid collision, should use two columns in where clause, for example
    mysql> SELECT word, crc FROM words WHERE crc = CRC32('GNU') AND word = 'gnu';
  • Spatial (R-Tree) indexing is a geospatial based indexing and require MySQL GIS functions such as MBRCONTAINS()
  • Full-text indexes: contains B-Tree index on the same column, for MATCH AGAINST operations, not ordinary WHERE clause operations (?)

Android dev is cool!

About one month ago I began studying Android app development. Actually I was fascinated by mobile dev a long time ago: in 2009 I was working at the educational technology office of MSU and studied iOS dev for nearly a month, but since I didn't technically own a MacBook, which was a prerequisite of iOS dev, I finally gave up after coding some simple demos. Compared with iOS, Android dev seems easier to start with, because almost 'everyone' knows Java better than Obj-C, and the $99 annual developer fee plus a MacBook can be waived using coupon 'android' :)

All in all, here are some ongoing projects I am working on, which solely serve as learning exercises; the aim is to get familiar with the dev workflow, different APIs and the internal architecture. The IDE is Eclipse Pulsar with the latest Android SDK installed.

1. Celsius <–>Fahrenheit Unit Converter

I haven't known how to calculate Fahrenheit temperatures for my past 25 years, and can't even judge whether it's cold or not from an 'F' value, so I built my first Android app like this.

It uses common interface components like text fields, radio buttons and buttons, and adds menu support for clearing the text content.


NEXT STEP: Add a web sync function integrated with the location service to obtain the current temperature at startup;

2. Location sensor and distance calculation

To learn the Google Maps API and the Android location service, I downloaded the latest Google Maps dev package and got a Google API key; plug it into the xml config file and we are ready to go.

This app can track your current location and get the distance from the place where you last checked in (under development). Key components involved include LocationManager, MapActivity, LOCATION_SERVICE as a system service, distanceBetween(), etc.


NEXT STEP: Use the built-in Android DB SQLite to record the places you've checked in before, or import them from gallery photos (with GPS tags), so you can easily view the history of visited places.

3. Ball Balance

Inspired by previous balance-ball games, this app shows two balls on the screen, controlled by the gravity sensor.

Initially the two balls would overlap each other when their coordinates became the same; now it has boundary checks and collision detection to avoid weird effects;

Key points include BMP image and layer drawing, and thread run/yield control by a timer: essentially the movement of a ball is implemented by repeatedly drawing the canvas and the ball with changing coordinates, so we need to lock the thread to draw the canvas, draw the balls in the current time slice (every 50 ms), and then yield the thread.


NEXT STEP: Add support for adding more balls by touch (using the composite and factory patterns); add a meaningful background with 'holes' to emulate the real game (drop the ball when it moves over a hole).

4. Gesture recognition

Android supports multi-touch, so it can detect not only a simple screen touch but also long presses and multi-pointer gestures like fling or two-finger pinch/expand. Here it detects a fling when the speed is larger than 100 px per second, in each direction.
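A minimal version of that detection, using the standard GestureDetector wiring (the 100 px/s threshold mirrors the value above; the activity and log tag names are illustrative):

 import android.app.Activity;
 import android.os.Bundle;
 import android.util.Log;
 import android.view.GestureDetector;
 import android.view.MotionEvent;

 public class GestureActivity extends Activity {
     private static final int SWIPE_MIN_VELOCITY = 100; // px per second, as above
     private GestureDetector detector;

     @Override
     protected void onCreate(Bundle savedInstanceState) {
         super.onCreate(savedInstanceState);
         detector = new GestureDetector(this, new GestureDetector.SimpleOnGestureListener() {
             @Override
             public boolean onFling(MotionEvent e1, MotionEvent e2,
                                    float velocityX, float velocityY) {
                 // Direction comes from the sign of the dominant velocity component.
                 if (Math.abs(velocityX) > SWIPE_MIN_VELOCITY) {
                     Log.d("Gesture", "fling " + (velocityX > 0 ? "right" : "left"));
                 } else if (Math.abs(velocityY) > SWIPE_MIN_VELOCITY) {
                     Log.d("Gesture", "fling " + (velocityY > 0 ? "down" : "up"));
                 }
                 return true;
             }
         });
     }

     @Override
     public boolean onTouchEvent(MotionEvent event) {
         // Forward raw touch events to the detector so it can recognize gestures.
         return detector.onTouchEvent(event);
     }
 }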


NEXT STEP: TBD

How to make a CSV file unicode-friendly?

During my recent work, I needed to export some MySQL table data into CSV files. From the wiki, a CSV (Comma-Separated Values) file is basically just a dump of rows of values delimited by commas (and optionally quoted), whose content you can view using notepad or any text editor.

It sounds pretty straightforward to export data field values into CSV; however, it's not as easy as you'd think when the table values contain Unicode-only characters such as Korean and Chinese letters: if you want to view them in Microsoft Excel, you'll have trouble.
For example, when you export data or edit a cell value with Korean text, it displays correctly at first.


Save it ignoring the warning window, close and reopen the file, and you will see the Korean characters replaced by unreadable symbols.


Why? Because by default Excel uses the ASCII or ANSI character set to decode the Unicode-encoded CSV file, which leads to all foreign letters being displayed as question marks or unrecognizable symbols, and they cannot be restored or converted back to the original values, because the character data has been corrupted at this step.

It took me a long time to find a workaround. I tried using the Ruby 'iconv' library with the following code, but it didn't work.

require 'iconv'

utf8_csv = File.open("utf8file.csv").read

# gotta be careful with the weird parameters order: TO, FROM !
ansi_csv = Iconv.iconv("LATIN1", "UTF-8", utf8_csv).join

File.open("ansifile.csv", "w") { |f| f.puts ansi_csv }

I also tried converting the encoding to UTF-16BE (big-endian) / UTF-16LE (little-endian) using Ruby's built-in file-opening conversion as follows, still no luck:

file = File.read("ansifile.csv", "r:UTF16BE:UTF8")

or further

new_file = file.force_encode("UTF-8")

Finally, I found the solution: add a Byte-Order Mark (BOM) to 'mark' the encoding of the CSV file, and it works pretty well. Refer here.
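For reference, the same idea expressed as a small Java sketch (the file name and sample rows are placeholders): write the UTF-8 BOM first, then the UTF-8 encoded CSV content, and Excel will detect the encoding correctly.

 import java.io.FileOutputStream;
 import java.io.OutputStreamWriter;
 import java.io.Writer;
 import java.nio.charset.StandardCharsets;

 public class BomCsvWriter {
     public static void main(String[] args) throws Exception {
         try (FileOutputStream out = new FileOutputStream("utf8file.csv");
              Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
             writer.write('\uFEFF');       // UTF-8 BOM (EF BB BF on disk)
             writer.write("이름,점수\n");   // Korean header row
             writer.write("홍길동,95\n");   // sample data row
         }
     }
 }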