Edward J. Yoon's Blog: 02/2012

Big Data: Why Is the Matrix Important?

We feel the beauty of harmony and the convenience of order in the regular arrays we find in a library or a parking lot. In the same way, the matrix is a useful concept not only for mathematical problems but also for problems in real life: account books, inventory management, encryption and decryption, population analysis, statistical data analysis, quantitative business analysis, transportation network analysis, and so on.

The matrix is everywhere. It is all around us.

The same is true of the cyber world. The matrix is an essential part of information management and analysis. Just think of the Amazon bookstore, Foursquare, Google Maps/Places, social network services with their traffic flow networks and user in/out flows, log data, and so on. The only difference is scale: local vs. worldwide. In short, Big Data! Do you love this term?

Wait! What is a matrix?

In mathematics, a matrix is a rectangular array of numbers or letters arranged in rows and columns. An m x n matrix has m rows and n columns. If m = n, we call it a square matrix. Two matrices are equal if they have the same number of rows and columns and the corresponding entries in every position are equal. We usually write a matrix as A = [a_ij], where a_ij is the entry in row i and column j.

Let A and B both be m x n matrices; then A + B is defined by [a_ij + b_ij]. The product of matrices A and B is defined if A is an m x k matrix and B is a k x n matrix. In other words, AB is defined if the number of columns of A equals the number of rows of B. The dimensions of AB are m x n, and its entries are defined by c_ij = a_i1 b_1j + a_i2 b_2j + ... + a_ik b_kj.
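The two definitions above can be sketched in a few lines of plain Java. This is only an illustration: the class name MatrixArithmetic and its method names are made up for this post, and the code assumes small, non-empty, rectangular int arrays that fit in memory.

```java
import java.util.Arrays;

class MatrixArithmetic {
  // Entry-wise sum [a_ij + b_ij]; both inputs must be m x n.
  // Assumes non-empty rectangular arrays (illustration only).
  static int[][] add(int[][] a, int[][] b) {
    if (a.length != b.length || a[0].length != b[0].length) {
      throw new IllegalArgumentException("A and B must have the same shape");
    }
    int m = a.length, n = a[0].length;
    int[][] c = new int[m][n];
    for (int i = 0; i < m; i++) {
      for (int j = 0; j < n; j++) {
        c[i][j] = a[i][j] + b[i][j];
      }
    }
    return c;
  }

  // Product AB, where A is m x k and B is k x n; the result is m x n.
  static int[][] multiply(int[][] a, int[][] b) {
    int m = a.length, k = b.length, n = b[0].length;
    if (a[0].length != k) {
      throw new IllegalArgumentException("columns of A must equal rows of B");
    }
    int[][] c = new int[m][n];
    for (int i = 0; i < m; i++) {
      for (int j = 0; j < n; j++) {
        int sum = 0;
        for (int p = 0; p < k; p++) {
          sum += a[i][p] * b[p][j]; // c_ij = a_i1 b_1j + ... + a_ik b_kj
        }
        c[i][j] = sum;
      }
    }
    return c;
  }

  public static void main(String[] args) {
    int[][] a = {{1, 2, 3}, {4, 5, 6}};      // 2 x 3
    int[][] b = {{7, 8}, {9, 10}, {11, 12}}; // 3 x 2
    System.out.println(Arrays.deepToString(multiply(a, b))); // [[58, 64], [139, 154]]
  }
}
```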

What is the use of matrix arithmetic in the real world? We probably already learned one in high school: getting the total cost from a product of matrices.
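Here is a tiny sketch of that high-school example. The numbers are invented for illustration: each row of the quantity matrix is one customer's order, each column is one item, and multiplying by the price column gives each customer's total cost. The class name TotalCost is hypothetical.

```java
import java.util.Arrays;

class TotalCost {
  // Multiplies a (customers x items) quantity matrix by an (items x 1) price column.
  static int[] totals(int[][] quantities, int[] prices) {
    int[] result = new int[quantities.length];
    for (int i = 0; i < quantities.length; i++) {
      if (quantities[i].length != prices.length) {
        throw new IllegalArgumentException("each order needs one quantity per item");
      }
      for (int j = 0; j < prices.length; j++) {
        result[i] += quantities[i][j] * prices[j];
      }
    }
    return result;
  }

  public static void main(String[] args) {
    int[][] quantities = {{2, 1, 0}, {0, 3, 4}}; // made-up orders: 2 customers, 3 items
    int[] prices = {3, 5, 2};                    // made-up unit prices
    System.out.println(Arrays.toString(totals(quantities, prices))); // [11, 23]
  }
}
```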

BigTable

Google's BigTable was born for this reason, or rather to store huge amounts of semi-structured (WWW) data. I mention it for one reason: how do you store a very large matrix?

When I first read their paper, I thought it described sparse matrix storage for large link graph data or spatial data. I may be right or wrong. Either way, I still think it's a good fit for storing matrices, because it allows random-access reads and writes, and its column-oriented design makes it efficient to read one specific column vector and to store sparse matrix data. It certainly has many advantages over flat files.
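To make the idea concrete, here is a minimal in-memory sketch of column-oriented sparse storage. It is not the BigTable or HBase API; the class SparseColumnStore and its methods are invented for illustration. It only shows the three properties mentioned above: random-access reads and writes, cheap access to a single column vector, and no space spent on zero entries.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

class SparseColumnStore {
  // column index -> (row index -> non-zero value); zeros are never stored
  private final Map<Integer, SortedMap<Integer, Double>> columns = new HashMap<>();

  // Random-access write; writing 0.0 removes the entry.
  void set(int row, int col, double value) {
    if (value == 0.0) {
      SortedMap<Integer, Double> column = columns.get(col);
      if (column != null) {
        column.remove(row);
        if (column.isEmpty()) {
          columns.remove(col);
        }
      }
      return;
    }
    columns.computeIfAbsent(col, k -> new TreeMap<>()).put(row, value);
  }

  // Random-access read; missing entries read as 0.0.
  double get(int row, int col) {
    SortedMap<Integer, Double> column = columns.get(col);
    if (column == null) {
      return 0.0;
    }
    Double value = column.get(row);
    return value == null ? 0.0 : value;
  }

  // Read-only view of the non-zero entries of one column vector, ordered by row.
  SortedMap<Integer, Double> column(int col) {
    SortedMap<Integer, Double> column = columns.get(col);
    if (column == null) {
      return Collections.<Integer, Double>emptySortedMap();
    }
    return Collections.unmodifiableSortedMap(column);
  }

  // Number of entries actually stored, regardless of the matrix dimensions.
  int storedEntries() {
    int count = 0;
    for (SortedMap<Integer, Double> column : columns.values()) {
      count += column.size();
    }
    return count;
  }

  public static void main(String[] args) {
    SparseColumnStore m = new SparseColumnStore();
    m.set(1000000, 3, 2.5); // a huge row index costs nothing extra
    m.set(7, 3, 1.0);
    m.set(7, 9, 4.0);
    System.out.println(m.column(3));
    System.out.println(m.storedEntries());
  }
}
```

A real store would also persist and distribute the columns; this sketch only mirrors the access pattern.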

Here's the good news: there are open source BigTable clones such as HBase, Cassandra, and Accumulo.

Matrix computations on Big Data

When people talk about Big Data, it is always about "extracting value from the data". What does that mean? In the past, we relied on intuition and luck. Now, to forecast and react more scientifically and accurately, we have to extract the valuable patterns and information previously hidden within Big Data. That's all.

As you already know, there are many good open source solutions for Big Data, such as Hadoop, Hive, and HBase. So how do Big Data solutions extract valuable patterns and information? Well, value is relative. You have to use your own math. MapReduce may or may not be enough.

Math and mining tools are still in their early stages in the Big Data world. The matrix is everywhere, from simple statistical analysis to more complex machine learning algorithms and recommendation systems, but there is no suitable computing engine for matrix (and graph) computations yet. WTF!?

With this, I can't mine anything.

Recently, message-passing models like BSP (Bulk Synchronous Parallel) and MPI have been making a comeback because of the limits of MapReduce.

A notable example is Apache Hama, a pure BSP (Bulk Synchronous Parallel) computing framework on top of HDFS (Hadoop Distributed File System) for massive scientific computations such as matrix, graph, and network algorithms.

Oh, it's time to go check out Hama.

Pregel clone package on top of Apache Hama

Today, I finished testing the new Graph package of Apache Hama and its examples on a fully distributed cluster with 2 racks and 512 cores.

The new Graph API is a complete clone of Google's Pregel, and its performance is also quite good. The Hama 0.5 release will provide a really powerful BSP computing engine and a lot of new features. :D

Here's the full source code of the Single Source Shortest Path example:

/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.hama.examples;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hama.HamaConfiguration;
import org.apache.hama.bsp.HashPartitioner;
import org.apache.hama.bsp.SequenceFileInputFormat;
import org.apache.hama.bsp.SequenceFileOutputFormat;
import org.apache.hama.graph.Edge;
import org.apache.hama.graph.GraphJob;
import org.apache.hama.graph.Vertex;
import org.apache.hama.graph.VertexArrayWritable;
import org.apache.hama.graph.VertexWritable;

public class SSSP {
  public static final String START_VERTEX = "shortest.paths.start.vertex.name";

  public static class ShortestPathVertex extends Vertex<IntWritable> {

    public ShortestPathVertex() {
      this.setValue(new IntWritable(Integer.MAX_VALUE));
    }

    public boolean isStartVertex() {
      String startVertex = getConf().get(START_VERTEX);
      return this.getVertexID().equals(startVertex);
    }

    @Override
    public void compute(Iterator<IntWritable> messages) throws IOException {
      // The start vertex is at distance 0; every other vertex starts at "infinity".
      int minDist = isStartVertex() ? 0 : Integer.MAX_VALUE;

      // Take the smallest distance offered by any incoming message.
      while (messages.hasNext()) {
        IntWritable msg = messages.next();
        if (msg.get() < minDist) {
          minDist = msg.get();
        }
      }

      // If a shorter path was found, record it and propagate it to the neighbors.
      if (minDist < this.getValue().get()) {
        this.setValue(new IntWritable(minDist));
        for (Edge e : this.getOutEdges()) {
          sendMessage(e, new IntWritable(minDist + e.getCost()));
        }
      }
    }
  }

  private static void printUsage() {
    System.out.println("Usage: <startnode> <input> <output> [tasks]");
    System.exit(-1);
  }

  public static void main(String[] args) throws IOException,
      InterruptedException, ClassNotFoundException {
    if (args.length < 3)
      printUsage();

    // Graph job configuration
    HamaConfiguration conf = new HamaConfiguration();
    GraphJob ssspJob = new GraphJob(conf, SSSP.class);
    // Set the job name
    ssspJob.setJobName("Single Source Shortest Path");

    conf.set(START_VERTEX, args[0]);
    ssspJob.setInputPath(new Path(args[1]));
    ssspJob.setOutputPath(new Path(args[2]));

    if (args.length == 4) {
      ssspJob.setNumBspTask(Integer.parseInt(args[3]));
    }

    ssspJob.setVertexClass(ShortestPathVertex.class);
    ssspJob.setMaxIteration(Integer.MAX_VALUE);
    ssspJob.setInputFormat(SequenceFileInputFormat.class);
    ssspJob.setInputKeyClass(VertexWritable.class);
    ssspJob.setInputValueClass(VertexArrayWritable.class);

    ssspJob.setPartitioner(HashPartitioner.class);
    ssspJob.setOutputFormat(SequenceFileOutputFormat.class);
    ssspJob.setOutputKeyClass(Text.class);
    ssspJob.setOutputValueClass(IntWritable.class);

    long startTime = System.currentTimeMillis();
    if (ssspJob.waitForCompletion(true)) {
      System.out.println("Job Finished in "
          + (double) (System.currentTimeMillis() - startTime) / 1000.0
          + " seconds");
    }
  }
}
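
If you want to see what compute() does superstep by superstep without a cluster, here is a small single-JVM simulation of the same logic. It does not use the Hama API at all: the class LocalSsspSimulation, the int-array edge encoding, and the sample graph are all made up for illustration, and it assumes non-negative edge costs so that the loop terminates.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LocalSsspSimulation {
  // edges[v] lists the out-edges of vertex v as {target, cost} pairs.
  static int[] shortestPaths(int[][][] edges, int start) {
    int n = edges.length;
    int[] value = new int[n];
    Arrays.fill(value, Integer.MAX_VALUE); // same initial value as ShortestPathVertex

    List<List<Integer>> inbox = emptyBoxes(n); // messages for the current superstep
    while (true) {
      List<List<Integer>> outbox = emptyBoxes(n); // messages for the next superstep
      boolean sent = false;
      for (int v = 0; v < n; v++) {
        // Same steps as compute(): pick the minimum offered distance.
        int minDist = (v == start) ? 0 : Integer.MAX_VALUE;
        for (int msg : inbox.get(v)) {
          minDist = Math.min(minDist, msg);
        }
        // On improvement, store it and notify the neighbors.
        if (minDist < value[v]) {
          value[v] = minDist;
          for (int[] e : edges[v]) {
            outbox.get(e[0]).add(minDist + e[1]);
            sent = true;
          }
        }
      }
      if (!sent) {
        break; // no messages in flight: nothing can change anymore
      }
      inbox = outbox; // the "barrier": messages become visible in the next superstep
    }
    return value;
  }

  private static List<List<Integer>> emptyBoxes(int n) {
    List<List<Integer>> boxes = new ArrayList<>();
    for (int i = 0; i < n; i++) {
      boxes.add(new ArrayList<>());
    }
    return boxes;
  }

  public static void main(String[] args) {
    // 0 -> 1 (cost 4), 0 -> 2 (cost 1), 2 -> 1 (cost 2), 1 -> 3 (cost 5); vertex 4 is isolated.
    int[][][] edges = {{{1, 4}, {2, 1}}, {{3, 5}}, {{1, 2}}, {}, {}};
    System.out.println(Arrays.toString(shortestPaths(edges, 0)));
  }
}
```

Unreachable vertices keep Integer.MAX_VALUE, just as they would in the job above.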

My guitar playing

Received a new car as compensation

I was having engine-stall problems with my car, a 2011 Z4 sDrive30i, and finally received a new one from BMW as compensation. Yahoo!

My new car is a 2012 Z4 sDrive35i with twin turbo, a 7-speed dual-clutch transmission, an alpine white body, red seats, the full sound package, and other options.

I dislike turbo engines (more precisely, turbo lag), but this one is quite good, except for the crack-prone alloy wheels (you have to watch nervously for every pothole) and the rough engine braking (the car behind might kick your ass if you lift off the gas pedal suddenly).

This car has awesome performance, unbelievably quick response, new +/- paddle shifters, and a backfire that sounds like a beast... (If you listen very closely, you may hear the sound of a Benz SLS AMG.) Overall, ★★★★☆.

Compared to the older (1st generation) Z4, this car has gained convenience and plenty of output but lost the spark and a sprinter's explosive power. In short, this is a sporty daily car. If you're looking for a hardcore machine, you should buy a Lotus or a Porsche.

Additionally, this car is a great fit for people who live in mountainous areas, because the cornering is fantastic. In my case, S. Korea, 70% of the country is mountainous and winding roads are everywhere (I drive this even in the winter). If you live in a big country, you'll soon be friends with spine specialists.

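My name in a Cassandra book

OK, this is a bit petty of me, but let me brag a little.
Today I discovered for the first time that my name appears in a Cassandra book. Heh.

Just wait.
I'm going to be the first Korean to publish a technical IT book with the world's largest publisher.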

Stability equals death

Biologically, being alive means maintaining instability. Our cells pump out sodium (Na) and take in potassium (K) until they die. Between the start and the end of all things, there's only instability.

Why shouldn't we pursue a more unstable life?
.
.
.
.
.
(But ... I don't seem ready to give up my stable life yet.)

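With Big Data, internalizing know-how is the key.

In my time in the IT industry, I've seen roughly three kinds of really annoying work.

1) First, coming in to work before dawn to fiddle with the DB.

For example, whenever a feature (however minor) is added to a blog service, or the planners want a report, you inevitably have to change the RDBMS schema or fire off a heavy query. So you just put up a "temporary maintenance" notice before dawn and do the DB work. And what about when the data grows or a failure hits? It's exactly these annoying problems that create the demand for schema-free storage, ad-hoc query processing, and fault tolerance, and that drive the evolution of NoSQL technology.

2) Second, 4 GB web log files drop onto the web servers by the dozen.

There are probably places that can't keep up with the log files, so they compress them right away, dump them to tape, and delete them. Then, when some failure occurs, of course you can't dig through the past logs, so you set the log level to debug and stare at the screen like an idiot until it reproduces. Maybe that's why people go wild over technologies like huge distributed file systems and log mining. Aside 1: I hear that in some US state they do crime prediction with Facebook and Twitter timelines (it seems it could even help prevent things like suicide) ... whereas in Korea, I suspect part-timers would just be lurking on the Nakkomsu Twitter feed.

3) Third, decision makers always want supporting evidence.

When you plan out some problem or service/product and escalate it, the decision maker wants evidence. The surest evidence speaks in numbers. Shut up and use the math. There must be plenty of developers who installed MySQL, loaded in the data, and hammered it with queries just to produce those statistics.

Well, anyway, you could say the evolution of Big Data technology was inevitable in this way. It's not really a problem we've only just run into today; the symptoms had been showing up here and there for five years. What were we doing while the Yankees were building NoSQL? The main culprits are the managers who nagged us to crank out output fast, fast, fast. They should take a hard look at themselves, and not blame us developers for incompetence.

.. Now and then people tout analyzing social data to raise service quality as the killer app .. I think that has its limits.

Just as a Q&A service where you simply ask and a person answers directly beats the documents a search engine recommends via PageRank, and just as the news your Twitter friends endlessly bring you is tastier than even the best personalized news recommendation system, ... the claim that applying Big Data analysis raises service quality is a bit .. (the possibility exists, of course) fuzzy somehow, like the Semantic Web.

Anyway, whatever you do, in the end the key to Big Data is internalizing know-how. So you've brought in an external solution; now what are you going to do with it?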

Terminate AWS instances with Java SDK

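    import java.util.ArrayList;
    import java.util.List;

    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.services.ec2.AmazonEC2Client;
    import com.amazonaws.services.ec2.model.DescribeInstancesResult;
    import com.amazonaws.services.ec2.model.Instance;
    import com.amazonaws.services.ec2.model.Reservation;
    import com.amazonaws.services.ec2.model.TerminateInstancesRequest;

    // AWSAccessKeyId and SecretAccessKey are your own credential strings.
    BasicAWSCredentials awsCredentials = new BasicAWSCredentials(
        AWSAccessKeyId, SecretAccessKey);

    AmazonEC2Client ec2Client = new AmazonEC2Client(awsCredentials);
    ec2Client.setEndpoint("ec2.us-east-1.amazonaws.com"); // Region endpoint

    List<String> instancesToTerminate = new ArrayList<String>();

    // Collect the ID of every instance returned for this account and region.
    // Careful: nothing is filtered here, so all of them will be terminated.
    DescribeInstancesResult result = ec2Client.describeInstances();
    List<Reservation> reservations = result.getReservations();
    for (Reservation reservation : reservations) {
      List<Instance> instances = reservation.getInstances();
      for (Instance instance : instances) {
        System.out.println("Terminating: " + instance.getInstanceId());
        instancesToTerminate.add(instance.getInstanceId());
      }
    }

    // Skip the call when no instances were found.
    if (!instancesToTerminate.isEmpty()) {
      TerminateInstancesRequest term = new TerminateInstancesRequest();
      term.setInstanceIds(instancesToTerminate);
      ec2Client.terminateInstances(term);
    }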
