olexiy prokhorenko’s blog

Olexiy Prokhorenko  //  I wear many hats and like different things. Dream and evolve, explore what else I can do. I want to build something nice.

My interests: Mobile and Web, Technology, Entrepreneurship, Startups, Business, User Experience and Human Interfaces, Lean and Agile Methodologies, Self-improvement.

Dec 19 / 10:00pm

Fun with #ruby and #hbase for beginners.

This blog post is mostly educational for myself, as I am just getting my hands on HBase, so these are "first steps" for anybody who has never tried it before (like me). My plan is to build a simple application using HBase for storage (certainly with no expectation of making it "production ready", but it will be using HBase, so it will be "scalable" :-) and describe all the steps right here.

Let's begin with the simplest part. I am using Mac OS X, so either follow the same steps or adjust them for your OS. First comes our lovely process of installing new toys. I assume that we all have Java already, so I am not covering that. To install HBase, download it from http://hadoop.apache.org/hbase/ and then do what I do:

o-macair:~ olexiy$ cd /usr/local/
o-macair:local olexiy$ tar -zxvf ~/Desktop/hbase-0.20.2.tar.gz
...
o-macair:local olexiy$ ln -sf hbase-0.20.2 hbase
o-macair:local olexiy$ ls -al | grep hba
lrwxr-xr-x   1 olexiy  wheel   12 Dec  9 21:23 hbase -> hbase-0.20.2
drwxr-xr-x@ 16 olexiy  wheel  544 Nov 10 10:24 hbase-0.20.2

It's that simple, and the second step is just my own preference: I create a symlink for my own convenience. Now, let's verify that the environment variables are set properly:

o-macair:~ olexiy$ echo $JAVA_HOME
/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home
o-macair:~ olexiy$ echo $HBASE_HOME
/usr/local/hbase
o-macair:~ olexiy$ echo $PATH
/opt/local/bin:/opt/local/sbin:/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home/bin:/usr/local/hbase/bin:/usr/local/mysql/bin:/usr/local/maven/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin

So far so good, but it is not enough. To get going fast we cut a corner and define JAVA_HOME directly in HBase's conf:

o-macair:~ olexiy$ cat /usr/local/hbase/conf/hbase-env.sh | grep JAVA_HOME
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home

And also let's open "System Preferences", then "Sharing", and check that "Remote Login" is enabled. It should be, because it enables SSH, which is required before we can launch HBase.

Now we have everything we want and need, so let's move forward. We have a dilemma: what are we going to use for our application, Java or Ruby? I decided to stick with Ruby, as my test code will be very simple, and I am not sure I want to write Java here (though I have to agree it is the more natural choice).

But deciding which library to use from Ruby is tough. I thought of using Ruby's HBase gem, which is linked from HBase's website. You can check it out here http://github.com/sishen/hbase-ruby. But I knew that there is also another gem, created by the guys from Rapleaf. Here is the original blog post I found http://blog.rapleaf.com/dev/?p=16 and it gives a simple example of how the gem could be used, but you'll also see "kind of" the same example down below. In the end I found myself using a third option, http://github.com/greglu/hbase-ruby/, which is the "official" gem modified to be used with HBase's Stargate REST server.

o-macair:hbase-with-ruby-first-try olexiy$ sudo gem install hbase-ruby -s http://gemcutter.org
Building native extensions.  This could take a while...
Successfully installed json-1.2.0
Successfully installed hbase-ruby-1.1.3
2 gems installed
Installing ri documentation for json-1.2.0...
Installing ri documentation for hbase-ruby-1.1.3...
Installing RDoc documentation for json-1.2.0...
Installing RDoc documentation for hbase-ruby-1.1.3...

Fast and simple. We seem to have everything now. Oh, sorry - I didn't mention that Ruby comes pre-installed with my Mac OS X:

o-macair:~ olexiy$ ruby -v
ruby 1.8.7 (2008-08-11 patchlevel 72) [universal-darwin10.0]

But I guess it's not that important, especially since we have already installed the HBase gem :-)

So, what kind of application are we going to build, what do you think? Something scalable, big and certainly very new? Maybe next time. I decided to build something better. Like a forum! :-) So, our requirements for the forum are the following:
- It definitely needs to have users. Each user has a name and an email.
- It should have topics for discussion. Each topic has a name and a description.
- And we need comments. Each comment is posted by a user on some topic and has a body.

Disclaimer: please, I do understand that a forum is a simple, stupid and easy example. That's why I chose it. And that's why I do not care if you think an RDBMS works better for this kind of application. It's an example.

Fairly simple, don't you think? So, let's roll up our sleeves, and here we go. We should start with reading the documentation, though. Very helpful are http://wiki.apache.org/hadoop/Hbase/DataModel, http://wiki.apache.org/hadoop/Hbase/HbaseArchitecture and the short but nice read http://jimbojw.com/wiki/index.php?title=Understanding_HBase_and_BigTable. There is also the basic BigTable document from Google http://labs.google.com/papers/bigtable.html but it's way too dry and basic, so I am not sure I got anything useful from it. ;-) But I find it necessary to suggest going through a presentation which was very helpful for me.

Before we even jump into coding we need to build the storage structure. We have our requirements for the project listed above, so let's think (in terms of HBase) about what we need to have. As my background is mostly with RDBMS, I am almost a hundred percent sure I didn't build the correct architecture for HBase, but what the heck, you learn from mistakes. ;-)

Launch HBase:

o-macair:~ olexiy$ start-hbase.sh
Password:
localhost: starting zookeeper, logging to /usr/local/hbase/bin/../logs/hbase-olexiy-zookeeper-o-macair.out
starting master, logging to /usr/local/hbase/bin/../logs/hbase-olexiy-master-o-macair.out
Password:
localhost: starting regionserver, logging to /usr/local/hbase/bin/../logs/hbase-olexiy-regionserver-o-macair.out

Nice. My tiny MacBook Air feels so powerful now that it's running HBase. Just kidding :-) To create tables and stuff we want to use the HBase Shell http://wiki.apache.org/hadoop/Hbase/Shell - a tiny tool, but it helps to get things going faster.

Let's do that one by one. First, we want a table called "userstable".

o-macair:~ olexiy$ hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Version: 0.20.2, r834515, Tue Nov 10 10:07:05 PST 2009
hbase(main):001:0> create 'userstable', {NAME => 'maininfo'}, {NAME => 'additionalinfo'}
...
0 row(s) in 6.2910 seconds
hbase(main):002:0> scan 'userstable'
09/12/13 14:38:56 DEBUG client.HConnectionManager$TableServers: Cached location address: 192.168.1.64:64336, regioninfo: REGION => {NAME => 'userstable,,1260743863388', STARTKEY => '', ENDKEY => '', ENCODED => 619697624, TABLE => {{NAME => 'userstable', FAMILIES => [{NAME => 'additionalinfo', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}, {NAME => 'maininfo', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}}
09/12/13 14:38:56 DEBUG client.HTable$ClientScanner: Creating scanner over userstable starting at key ''
09/12/13 14:38:56 DEBUG client.HTable$ClientScanner: Advancing internal scanner to startKey at ''
09/12/13 14:38:56 DEBUG client.HConnectionManager$TableServers: Cache hit for row <> in tableName userstable: location server 192.168.1.64:64336, location region name userstable,,1260743863388
ROW                          COLUMN+CELL                                                                     
09/12/13 14:38:56 DEBUG client.HTable$ClientScanner: Finished with scanning at REGION => {NAME => 'userstable,,1260743863388', STARTKEY => '', ENDKEY => '', ENCODED => 619697624, TABLE => {{NAME => 'userstable', FAMILIES => [{NAME => 'additionalinfo', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}, {NAME => 'maininfo', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}}
0 row(s) in 0.0170 seconds
hbase(main):003:0>

That's quite a thing we did. Let's cover everything line by line. So our first command was:

create 'userstable', { NAME => 'maininfo' }, { NAME => 'additionalinfo' }

In HBase Shell this means: create a table named 'userstable' which has two column families, 'maininfo' and 'additionalinfo'. A column family, put in simple words, is a predefined group under which columns are stored. In our case we intend to keep "full name" and "password" in 'maininfo', and "city" and "state" in 'additionalinfo'. So, it's a way of organizing things.
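
To picture how a table, its row keys and its column families fit together, here is a minimal in-memory sketch in plain Ruby. It does not talk to HBase at all; the nested hash and the `put_cell` and `get_row` helpers are hypothetical names used only to illustrate the layout, and the city value is made up.

```ruby
# In-memory picture of an HBase table: row key -> "family:qualifier" -> value.
# Illustration only; put_cell and get_row are made-up helper names.
FAMILIES = %w[maininfo additionalinfo].freeze

def put_cell(table, row_key, column, value)
  family = column.split(':', 2).first
  # HBase rejects writes to column families that were not declared up front.
  raise ArgumentError, "unknown column family: #{family}" unless FAMILIES.include?(family)
  (table[row_key] ||= {})[column] = value
  table
end

def get_row(table, row_key)
  table.fetch(row_key, {})
end

userstable = {}
put_cell(userstable, '20091213093540', 'maininfo:fullname', 'Olexiy Prokhorenko')
put_cell(userstable, '20091213093540', 'additionalinfo:city', 'Sometown') # made-up value
puts get_row(userstable, '20091213093540').inspect
```

The families are fixed when the table is created, while the qualifiers after the colon (fullname, city and so on) can be added freely per row.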

Next useful line is:

scan 'userstable'

This gives us everything about and from this table. We ran it to be sure that our table is really there, and of course it has nothing inside yet. Let's fix that and fill our table with some data:

hbase(main):004:0> put 'userstable', '20091213093540', 'maininfo:fullname', 'Olexiy Prokhorenko'
hbase(main):005:0> put 'userstable', '20091213093540', 'maininfo:email', 'olexiy@prokhorenko.us'    
hbase(main):006:0> put 'userstable', '20091213093540', 'maininfo:password', 'zup3rpazzv0rd'    
hbase(main):007:0> get 'userstable', '20091213093540'
...
COLUMN                       CELL                                                                            
 maininfo:email              timestamp=1260756864019, value=olexiy@prokhorenko.us                            
 maininfo:fullname           timestamp=1260756857570, value=Olexiy Prokhorenko                               
 maininfo:password           timestamp=1260756870079, value=zup3rpazzv0rd                                    
3 row(s) in 0.6730 seconds

Again, a step-by-step explanation. The first important line:

put 'userstable', '20091213093540', 'maininfo:fullname', 'Olexiy Prokhorenko'

So we save a record in table 'userstable', in the row with key '20091213093540' (more details in a second on why it is so ugly), in the 'maininfo' family, in the column 'fullname', and the value is my full name, Olexiy Prokhorenko. The ID (or row key) for every user (and also every topic, etc. -- we will keep the same idea everywhere) is a kind of timestamp, YYYYMMDDhhmmss. It is simple enough to keep, and will work pretty fine unless users get created more often than once a second.
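
A minimal sketch of generating such a row key in Ruby, assuming nothing more than the standard `Time#strftime`. The helper name `new_row_key` is hypothetical and is not part of any HBase client.

```ruby
# Build a row key in the YYYYMMDDhhmmss form used for users and topics.
# new_row_key is a hypothetical helper name, not part of any HBase client.
def new_row_key(time = Time.now)
  time.strftime('%Y%m%d%H%M%S')
end

key = new_row_key(Time.utc(2009, 12, 13, 9, 35, 40))
puts key
```

Because every part of the key is zero-padded to a fixed width, comparing keys as plain strings gives the same order as comparing the times themselves.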

The second and third lines:

put 'userstable', '20091213093540', 'maininfo:email', 'olexiy@prokhorenko.us'    
put 'userstable', '20091213093540', 'maininfo:password', 'zup3rpazzv0rd'

These do the same, except that we save the email and password.
The last line gives us the warm feeling of a table being filled:

hbase(main):008:0> get 'userstable', '20091213093540'

Our table knows about our user 'olexiy@prokhorenko.us'!!! Woo-hoo! How cool is that?

Now, as we are getting more comfortable, let's create a table for topics and fill it with a couple of records (please note that the output of the commands is omitted for convenience). Note that we are also creating one more user (with a very bad password :-) to make our life more interesting!

hbase(main):009:0> put 'userstable', '20091213120030', 'maininfo:fullname', 'John Axe'
hbase(main):010:0> put 'userstable', '20091213120030', 'maininfo:email', 'westla7@gmail.com'
hbase(main):011:0> put 'userstable', '20091213120030', 'maininfo:password', 'john'

Okay, another user created.

hbase(main):012:0> create 'topicstable', { NAME => 'content' } 
hbase(main):013:0> put 'topicstable', '20091213161745', 'content:name', 'Blackberry Bold 9700'
hbase(main):014:0> put 'topicstable', '20091213161745', 'content:description', 'Discussion about Blackberry Bold 9700 and probably all other Blackberry phones.'

Now we have created 'topicstable', where we keep the content. The row key follows the same idea as above.

hbase(main):015:0> create 'commentstable', { NAME => 'content' }, { NAME => 'postinginfo' }
hbase(main):016:0> put 'commentstable', '20091213161745-20091213120030-20091213172000', 'content:body', 'Hey, my first comment!'
hbase(main):017:0> put 'commentstable', '20091213161745-20091213120030-20091213172000', 'postinginfo:author', '20091213120030'
hbase(main):018:0> put 'commentstable', '20091213161745-20091213120030-20091213172000', 'postinginfo:topic', '20091213161745'
hbase(main):019:0> put 'commentstable', '20091213161745-20091213120030-20091213172000', 'postinginfo:replyto', ''

Almost the same here. We create 'commentstable', which keeps the comments. Each comment knows its author and the topic it was posted in, and 'postinginfo:replyto' keeps track of whether the comment is a reply to another comment. Please note, I decided to use a composite row key for comments. The first part of it (before the first - sign) is the topic ID, the second part is the user's ID, and the last part is the actual comment ID. With this scheme we are fine unless the same user comments on the same topic more often than once a second.
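
The composite key can be sketched in Ruby as a simple join and split on the - sign. The helper names `comment_row_key` and `parse_comment_row_key` are hypothetical; the sketch relies on the IDs themselves never containing a - sign, which holds for the timestamp IDs used here.

```ruby
COMMENT_KEY_SEPARATOR = '-'

# Compose "<topic ID>-<user ID>-<comment ID>" (hypothetical helper).
def comment_row_key(topic_id, user_id, comment_id)
  [topic_id, user_id, comment_id].join(COMMENT_KEY_SEPARATOR)
end

# Split a composite key back into its three parts (hypothetical helper).
def parse_comment_row_key(row_key)
  topic_id, user_id, comment_id = row_key.split(COMMENT_KEY_SEPARATOR, 3)
  { topic: topic_id, user: user_id, comment: comment_id }
end

key = comment_row_key('20091213161745', '20091213120030', '20091213172000')
puts key
puts parse_comment_row_key(key).inspect
```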

hbase(main):020:0> put 'commentstable', '20091213161745-20091213093540-20091213221000', 'content:body', 'I can comment on your comment!'
hbase(main):021:0> put 'commentstable', '20091213161745-20091213093540-20091213221000', 'postinginfo:author', '20091213093540'
hbase(main):022:0> put 'commentstable', '20091213161745-20091213093540-20091213221000', 'postinginfo:topic', '20091213161745'
hbase(main):023:0> put 'commentstable', '20091213161745-20091213093540-20091213221000', 'postinginfo:replyto', '20091213161745-20091213120030-20091213172000'
hbase(main):024:0> put 'commentstable', '20091213161745-20091213093540-20091213221250', 'content:body', 'And I can leave my own comment on topic...'
hbase(main):025:0> put 'commentstable', '20091213161745-20091213093540-20091213221250', 'postinginfo:author', '20091213093540'
hbase(main):026:0> put 'commentstable', '20091213161745-20091213093540-20091213221250', 'postinginfo:topic', '20091213161745'
hbase(main):027:0> put 'commentstable', '20091213161745-20091213093540-20091213221250', 'postinginfo:replyto', ''

So we dropped in two more comments by our first user. Everything looks kind of fun and we have everything we need, but let's think about how we fulfill these two requirements:
1. How do we show all comments by a user?
2. How do we show all comments on a topic?

The first thing that comes to my mind is using a normal 'scan'. Here we go:

hbase(main):028:0> scan 'commentstable', { STARTROW => '20091213161745-', STOPROW => '20091213161745-99999999999999-99999999999999' }

This will give us all comments on the topic.

hbase(main):029:0> scan 'commentstable', { STARTROW => '20091213161745-20091213120030-', STOPROW => '20091213161745-20091213120030-99999999999999' }

This lets us see all comments by a user on a specific topic. As you can tell, -20091213120030- is the part where we mention the ID of the user with email 'westla7@gmail.com'.
We still have one unresolved problem. How do we show all comments by a user across all topics? We could run the same kind of scan with the first part going from 00000000000000-... to 99999999999999-... and then filter for the entries with our user ID, but that does not sound too nice. Frankly speaking, I do not really like this solution. It's neither scalable nor pretty.
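
To see why the first two queries are cheap and the third is not, here is a small simulation in plain Ruby. It assumes only that HBase keeps rows sorted by key and that STOPROW is exclusive; `scan_range` is a hypothetical helper working on an array of keys, not a client call.

```ruby
# Simulate a range scan over sorted row keys (hypothetical helper).
# The stop row is exclusive, as in an HBase scan.
def scan_range(row_keys, start_row, stop_row)
  row_keys.sort.select { |key| key >= start_row && key < stop_row }
end

COMMENT_KEYS = [
  '20091213161745-20091213120030-20091213172000',
  '20091213161745-20091213093540-20091213221000',
  '20091213161745-20091213093540-20091213221250'
].freeze

topic = '20091213161745'
user  = '20091213120030'

# All comments on a topic: one contiguous range of keys.
on_topic = scan_range(COMMENT_KEYS, "#{topic}-", "#{topic}-99999999999999-99999999999999")

# All comments by a user on that topic: a narrower contiguous range.
by_user_on_topic = scan_range(COMMENT_KEYS, "#{topic}-#{user}-", "#{topic}-#{user}-99999999999999")

# All comments by a user across topics: the user ID sits in the middle of
# the key, so every key has to be visited and filtered.
by_user = COMMENT_KEYS.select { |key| key.split('-')[1] == user }

puts on_topic.size
puts by_user_on_topic.size
puts by_user.size
```

The first two queries touch only the keys sharing a prefix, while the last one has to walk the whole table, which is what makes it unattractive as the table grows.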

So, don't you think that we could create a reference table? Say, something like the following:

hbase(main):030:0> create 'usersreferencestable', { NAME => 'comment' }

And imagine that when we created the comments we also added the following for every comment:

hbase(main):031:0> put 'usersreferencestable', '20091213120030', 'comment:20091213161745-20091213120030-20091213172000', '20091213161745'
hbase(main):032:0> put 'usersreferencestable', '20091213093540', 'comment:20091213161745-20091213093540-20091213221250', '20091213161745'
hbase(main):033:0> put 'usersreferencestable', '20091213093540', 'comment:20091213161745-20091213093540-20091213221000', '20091213161745'

Yes, that's total denormalization, keeping the same data twice, but let's see. Now, to get all comments by the user with email 'olexiy@prokhorenko.us', whose ID is 20091213093540, we simply do:

hbase(main):034:0> get 'usersreferencestable', '20091213093540'
09/12/13 23:55:34 DEBUG client.HConnectionManager$TableServers: Cache hit for row <> in tableName usersreferencestable: location server 192.168.1.64:64336, location region name usersreferencestable,,1260777299770
COLUMN                       CELL                                                                            
 comment:20091213161745-2009 timestamp=1260777318126, value=20091213161745                                   
 1213093540-20091213221000                                                                                   
 comment:20091213161745-2009 timestamp=1260777312319, value=20091213161745                                   
 1213093540-20091213221250                                                                                   
2 row(s) in 0.0120 seconds

As a result we get the comment ID as part of the column name, and the value is the topic ID, which we may want to use for easier access. That totally makes us happy.
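
The idea of writing to both tables whenever a comment is created can be sketched in plain Ruby. The two hashes are in-memory stand-ins for 'commentstable' and 'usersreferencestable', and `add_comment` and `comment_keys_for_user` are hypothetical helper names; a real version would issue the corresponding puts against HBase, and the 'postinginfo:replyto' column is left out to keep the sketch short.

```ruby
# In-memory stand-ins for 'commentstable' and 'usersreferencestable'.
def add_comment(comments, references, topic_id, user_id, comment_id, body)
  row_key = [topic_id, user_id, comment_id].join('-')
  comments[row_key] = {
    'content:body'       => body,
    'postinginfo:author' => user_id,
    'postinginfo:topic'  => topic_id
  }
  # Denormalized copy: one column per comment in the user's row,
  # with the topic ID as the value.
  (references[user_id] ||= {})["comment:#{row_key}"] = topic_id
  row_key
end

# Read a single row of the reference table and strip the family prefix.
def comment_keys_for_user(references, user_id)
  references.fetch(user_id, {}).keys.map { |column| column.sub(/\Acomment:/, '') }.sort
end

comments = {}
references = {}
add_comment(comments, references, '20091213161745', '20091213120030', '20091213172000', 'Hey, my first comment!')
add_comment(comments, references, '20091213161745', '20091213093540', '20091213221000', 'I can comment on your comment!')
add_comment(comments, references, '20091213161745', '20091213093540', '20091213221250', 'And I can leave my own comment on topic...')
puts comment_keys_for_user(references, '20091213093540')
```

The price is that every comment is written twice and the two writes are not atomic, so the application code has to keep the tables in step.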

At this point we have our database created and filled with data, so let's try to create a Ruby application that does what we need: list topics, list comments on a topic, and list comments by user (you can use http://github.com/westla7/hbase-with-ruby-first-try/blob/master/hbase-shell-execute.txt to run all the commands mentioned above).

By the way, our Ruby application will access HBase through the REST API, which the server supports. However, you need to remember to start the REST server first, so do:

o-macair:~ olexiy$ hbase rest start
Starting restServer
2009-12-17 21:20:46.886::INFO:  Logging to STDERR via org.mortbay.log.StdErrLog
2009-12-17 21:20:46.973::INFO:  jetty-6.1.14
2009-12-17 21:20:47.866::INFO:  Started SocketConnector@0.0.0.0:60050
...

But as I am using the http://github.com/greglu/hbase-ruby/ gem, which is for the Stargate server, we need to launch that one instead, and it takes a little more effort:

o-macair:~ olexiy$ cd /usr/local/hbase
o-macair:hbase olexiy$ cp contrib/stargate/hbase-*-stargate.jar lib/
o-macair:hbase olexiy$ cp contrib/stargate/lib/* lib/

And now we launch the server:

o-macair:~ olexiy$ hbase org.apache.hadoop.hbase.stargate.Main
2009-12-19 19:31:47.827::INFO:  Logging to STDERR via org.mortbay.log.StdErrLog
2009-12-19 19:31:47.886::INFO:  jetty-6.1.14
2009-12-19 19:31:48.183::INFO:  Started SocketConnector@0.0.0.0:8080
...

Keep it open or send it to the background, whichever you prefer.
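
For a rough idea of what a REST call for a single row looks like without any gem, here is a sketch using Ruby's standard Net::HTTP. It only builds the request and does not send it. The base URL assumes Stargate on localhost port 8080, as in the log above, and the `/<table>/<row key>` path layout is my understanding of Stargate's row resource, so please check it against the Stargate documentation for your version. `row_request` is a hypothetical helper name.

```ruby
require 'net/http'
require 'uri'

# Assumed base URL: Stargate listening on localhost:8080.
STARGATE_BASE = 'http://localhost:8080'

# Build (but do not send) a GET request for one row (hypothetical helper).
def row_request(table, row_key, base = STARGATE_BASE)
  uri = URI.parse("#{base}/#{table}/#{row_key}")
  request = Net::HTTP::Get.new(uri.request_uri)
  request['Accept'] = 'application/json' # ask for JSON rather than the default
  [uri, request]
end

uri, request = row_request('userstable', '20091213093540')
puts "#{request.method} #{uri}"

# To actually send it (needs a running Stargate server):
#   response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
```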

Now, point your browser to http://github.com/westla7/hbase-with-ruby-first-try and check out the code.

We begin with http://github.com/westla7/hbase-with-ruby-first-try/blob/master/load_user.rb, which is a script that loads a user and says "Hello" to them. It shows that we can very easily access a user's data and manipulate it.

But loading is not enough, so next comes http://github.com/westla7/hbase-with-ruby-first-try/blob/master/load_save_user.rb, which creates a user object and saves it into HBase. That gives us much more power.
Once you run it, you'll get a user created with the ID shown in the output.
By the way, if you don't like it, feel free to:

hbase(main):035:0> deleteall 'userstable', '20091219103439'

(where '20091219103439' is your created user ID, which you saw in the output)

But we are not going to stop here. What we want now is to get a list of all the users and topics we have. Open http://github.com/westla7/hbase-with-ruby-first-try/blob/master/list_data.rb to see how we can get the list of topics we have.

Last, but not least - we are moving on to the simplest use case possible for our application. We want to list all the topics, then list all the comments on every topic, along with the name of the user who posted each comment. Sounds big 'n crazy? Not really. Run http://github.com/westla7/hbase-with-ruby-first-try/blob/master/complete_list.rb and enjoy:

o-macair:hbase-with-ruby-first-try olexiy$ ruby complete_list.rb
"Blackberry Bold 9700": Discussion about Blackberry Bold 9700 and probably all other Blackberry phones.
   Olexiy Prokhorenko: I can comment on your comment! (reply to: Hey, my first comment!)
   Olexiy Prokhorenko: And I can leave my own comment on topic...
   John Axe: Hey, my first comment!
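
The listing above is essentially a join done in application code. Here is a minimal sketch of that logic over in-memory hashes holding the same data we put into the tables earlier. It is not the script from the repository, just an illustration, and `complete_list` is a hypothetical function name.

```ruby
USERS = {
  '20091213093540' => { 'maininfo:fullname' => 'Olexiy Prokhorenko' },
  '20091213120030' => { 'maininfo:fullname' => 'John Axe' }
}.freeze

TOPICS = {
  '20091213161745' => {
    'content:name'        => 'Blackberry Bold 9700',
    'content:description' => 'Discussion about Blackberry Bold 9700 and probably all other Blackberry phones.'
  }
}.freeze

COMMENTS = {
  '20091213161745-20091213120030-20091213172000' => {
    'content:body' => 'Hey, my first comment!',
    'postinginfo:author' => '20091213120030', 'postinginfo:replyto' => ''
  },
  '20091213161745-20091213093540-20091213221000' => {
    'content:body' => 'I can comment on your comment!',
    'postinginfo:author' => '20091213093540',
    'postinginfo:replyto' => '20091213161745-20091213120030-20091213172000'
  },
  '20091213161745-20091213093540-20091213221250' => {
    'content:body' => 'And I can leave my own comment on topic...',
    'postinginfo:author' => '20091213093540', 'postinginfo:replyto' => ''
  }
}.freeze

# Build the output lines: each topic, then its comments in row key order.
def complete_list(topics, comments, users)
  lines = []
  topics.sort_by { |topic_id, _| topic_id }.each do |topic_id, topic|
    lines << %("#{topic['content:name']}": #{topic['content:description']})
    comments.sort_by { |row_key, _| row_key }.each do |row_key, comment|
      # A topic's comments all share the "<topic ID>-" key prefix.
      next unless row_key.start_with?("#{topic_id}-")
      author = users.fetch(comment['postinginfo:author'])['maininfo:fullname']
      line = "   #{author}: #{comment['content:body']}"
      parent = comments[comment['postinginfo:replyto']]
      line += " (reply to: #{parent['content:body']})" if parent
      lines << line
    end
  end
  lines
end

puts complete_list(TOPICS, COMMENTS, USERS)
```

Note that the comments come out in row key order, so within a topic they are grouped by user ID first and only then by time.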

And finally, for the sake of 'usersreferencestable', which I created and didn't use, here is an example of how it's possible to show comments by user. To me it looks better, though it is a little more wordy - http://github.com/westla7/hbase-with-ruby-first-try/blob/master/other_complete_list.rb

o-macair:hbase-with-ruby-first-try olexiy$ ruby other_complete_list.rb
User "Olexiy Prokhorenko" says:
   I can comment on your comment!
   And I can leave my own comment on topic...

User "John Axe" says:
   Hey, my first comment!

User "Test User" says:
   (nothing, 0 comments)

On this note, I hope this gave you some overview of what you can do with HBase, and maybe, in general, of what "key-value NoSQL" stores can bring you. There is some headache, and obviously redundancy of data, but it helps to accomplish numerous other things. Before, it was all about the SQL; now it's all about your code (again :-)

Filed under  //  development   hadoop   hbase   ruby  
