## February 28, 2012

### Big Data, Why Matrix is important?

We feels the beauty of harmony and convenience of order from regular array that can be found in the Library or Parking lot. Like this, the matrix is applied not only to mathematical problems but also to problems in our real life as a useful concept. For examples, account book, items or goods management, encryption and decryption, population analysis, statistical data analysis, quantitative business analysis, and the transportation network analysis, ..., etc.

The matrix is everywhere, it is all around us.

The same is true of the Cyber world. The matrix is an essential part of information management and analysis. Just think of Amazon bookstore, Foursquare, Google Maps/Places, Social network services and its traffic flow networks or user in/out flows, ..., etc. Log data. The only difference is scale, Local Vs. World-wide. In shortly, Big Data! Do you love this term?

Wait! What is Matrix?

In mathematics, the matrix is an rectangular array of numbers or letters arranged in rows and columns. An m x n matrix has m rows and n columns. If m = n, we call it a square matrix. Two matrices are equal if they have the same number of rows and columns, and the corresponding entries in every position are equal. We usually represent a matrix as follows:

Let assume the A and B  be both m x n matrices, then A + B is defined by [aij + bij]. The product of matrices A and B is de ned if A is m x k and B is k x n matrices. In other words, AB is de fined if the column number of A is the same as the row number of B. The dimensionality of AB is m x n. The entries in AB are de fined by cij = ai1b1j + ai2b2j + ... + aikbkj.

What is use of matrix arithmetic in real world? Maybe we've already learned in high school e.g., gets the total cost from the product of matrices.

BigTable

The Google's BigTable was born for this reason or to store huge semi-structured (WWW) data. I'm mentioning this for one reason: how to store a very large matrix?

When I looked at their paper for the first time, I thought that is a sparse matrices storage for large link graph data or spatial data. I may be wrong or right. However, I'm still think it's good one for matrix storing. Because, it allows random access read/write and its column oriented design allows to read one specific column vector effectively, to store sparse matrix data efficiently. Certainly, there are many advantages than using of flat files.

Here's good news, there are BigTable clone open source software: HBase, Cassandra, and Accumulo.

Matrix computations on Big Data

When people does talk about Big Data, It is always - "extract value from the data". What's the meaning of this? In the past, we relied on intuition and luck. But now, to forecast and re-act more scientific and correct, we should have to extract valuable patterns and information, previously hidden within Big Data! That's all.

As you already know, there's a many good open source solutions for Big Data such as Hadoop, Hive, HBase..., etc. So then, how does Big Data solutions extract valuable patterns and information? Well, the value is relative. You should use your own math. MapReduce may be enough or not.

The math or mining tool is still in the beginning stages in the Big Data world. The matrix is everywhere from simple statistics analysis to more complex machine learning algorithms or recommendation systems, but there is not suitable computing engine for Matrix (and also graph) computations yet. WTF!?

With this, I can't mining anything.

Recently, the message passing stuffs like BSP (Bulk Synchronous Parallel) and MPI are came back again because of limited MapReduce capacity.

A notable example is the Apache Hama, which is a pure BSP(Bulk Synchronous Parallel) computing framework on top of HDFS (Hadoop Distributed File System) for massive scientific computations such as matrix, graph and network algorithms.

Oh, It's time to go to check out the Hama.