Since I am still pursuing my degree in Computer Science, I am taking some classes at a local university. One of the classes I am taking this semester is Big Data. For this class, I have a project in which I have to evaluate MemcacheDB, a key-value database.
This particular article is going to be split into 3 parts:
Part 1 - Introduction
First things first: "What the hell is MemcacheDB and why do I need it?" Both are great questions, and I'll attempt to answer them to the best of my ability.
Let's start with "What"
MemcacheDB is first and foremost a database. Unlike MySQL, SQLite and others, MemcacheDB is a NoSQL system, meaning that instead of relational tables it stores data in structures such as trees, hashes, or key-value pairs (KVP); MemcacheDB itself uses key-value pairs. This approach is very useful when you need fast data access from your application.
MemcacheDB is a variant of Memcached, which keeps all your data as key-value pairs in RAM, thus providing lightning-fast access to it. The main difference between MemcacheDB and Memcached is that MemcacheDB provides persistent storage on your disk. MemcacheDB is built on top of Berkeley DB. That means that you get the best of both worlds: the Memcached protocol to access data and Berkeley DB features that include transac
I hope that was a somewhat satisfying answer to the "What" question. Now, let's take a brief look at the "Why".
To answer the "Why" question, let's revisit the infamous memory hierarchy pyramid:
Image courtesy of http://www.ict4u.net/
As you can see, the higher in the pyramid you are, the less space you have (or the more expensive it gets), but you get REALLY fast access times. Since RAM is only 2 levels away from the CPU, its performance is pretty darn good. So, if you can spare a couple of gigs of RAM to serve that frequently accessed data, your application's performance should improve dramatically! Here are some benchmark results.
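To make that payoff concrete, here is a minimal sketch of the cache-aside idea the argument relies on. It is not MemcacheDB code: a plain Python dict stands in for the RAM-backed store, and `slow_lookup` is a hypothetical placeholder for an expensive disk or network read.

```python
# A dict plays the role of the in-RAM key-value store (assumption for illustration).
cache = {}
backend_hits = 0  # counts how often we fall through to the slow tier


def slow_lookup(key):
    """Hypothetical stand-in for an expensive disk or network read."""
    global backend_hits
    backend_hits += 1
    return "value-for-%s" % key


def cached_lookup(key):
    """Serve from RAM when possible; otherwise fetch once and remember it."""
    if key not in cache:
        cache[key] = slow_lookup(key)
    return cache[key]


# Five requests for the same key reach the slow tier only once.
results = [cached_lookup("user:42") for _ in range(5)]
print("backend hits: %d" % backend_hits)
```

The more often the same keys are requested, the larger the share of reads that never leave RAM.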
Part 2 - Installation
Before we dive into the installation process, here is my setup:
- Box: VirtualBox VM
- OS: Ubuntu 12.04.4 LTS Desktop Edition
- RAM: 4GB
- HDD Size: 20GB
Based on the instructions provided here, it was supposed to be pretty close to "copy/paste and you're done". Sadly, it proved to be slightly more complicated.
Right off the bat, the URL for Berkeley DB in the INSTALL file is no longer valid. The correct URL is as follows: http://www.oracle.com/database/berkeley-db/db/index.html. While that is not a real problem, it was slightly annoying.
After downloading the latest version of BerkeleyDB (6.0.30 as of this writing), I continued on with the rest of the installation instructions.
As I got to the section of installing MemcacheDB, I started getting some compile errors...
cd memcachedb-1.2.0
./configure --enable-threads
I got the following output
...
checking for libevent directory... (system)
checking for library containing db_create... no
configure: error: cannot find libdb.so in /usr/local/BerkeleyDB.4.7/lib
After some research, I found out that there is an option in the configure script to specify the location of your BerkeleyDB installation. Happy that I had found a solution to the problem, I ran:
./configure --enable-threads --with-bdb=/usr/local/BerkeleyDB.6.0
make
gcc -DHAVE_CONFIG_H -I. -I/usr/local/BerkeleyDB.6.0/include -g -O2 -MT bdb.o -MD -MP -MF .deps/bdb.Tpo -c -o bdb.o bdb.c
bdb.c: In function ‘bdb_env_init’:
bdb.c:208:23: error: ‘DB_ENV’ has no member named ‘repmgr_set_local_site’
bdb.c:215:27: error: ‘DB_ENV’ has no member named ‘repmgr_add_remote_site’
make: *** [bdb.o] Error 1
make: Leaving directory `/root/memcachedb-1.2.0'
make: *** [all-recursive] Error 1
make: Leaving directory `/root/memcachedb-1.2.0'
make: *** [all] Error 2
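The errors show that the DB_ENV structure in Berkeley DB 6.0 no longer has the repmgr_set_local_site and repmgr_add_remote_site members that bdb.c calls. As it turns out, when the instructions say that MemcacheDB requires Berkeley DB 4.7 or later, they actually mean Berkeley DB 4.7 through 5.1.29. So after reinstalling Berkeley DB with the latest supported version (5.1.29), everything worked!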
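If you want to catch this before running make, a small version check can help. The sketch below only encodes the range I ran into (4.7 through 5.1.29); the helper names are made up for illustration, and how you obtain the installed version string is left to you.

```python
# Bounds taken from this article: 5.1.29 built fine, 6.0.30 did not.
# I did not test every release in between, so treat the range as an assumption.
MIN_SUPPORTED = (4, 7)
MAX_SUPPORTED = (5, 1, 29)


def parse_version(text):
    """Turn '5.1.29' into (5, 1, 29) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def is_supported(text):
    """True if the version falls inside the range described above."""
    return MIN_SUPPORTED <= parse_version(text) <= MAX_SUPPORTED


for candidate in ("4.7", "5.1.29", "6.0.30"):
    verdict = "ok" if is_supported(candidate) else "unsupported"
    print("%s -> %s" % (candidate, verdict))
```

Comparing tuples of integers avoids the classic string-comparison trap where "5.10" sorts before "5.9".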
Part 3 - Usage
Now that you've got your MemcacheDB installed, it is time to start using it!
First things first: we need to launch the instance. For this blog article, I am going to use the sample command (as per the INSTALL instructions).
memcachedb -p21201 -d -r -H /data1/21201 -N -v >/data1/21201.log 2>&1
So, what do all these parameters actually mean, you ask?
- -p21201 - Port number on which the database listens for incoming requests (in this case, 21201)
- -d - Run in daemon mode
- -r - Maximize core file limit
- -H /data1/21201 - Location of the database on your local computer
- -N - Enables DB_TXN_NOSYNC, which improves performance by not flushing the transaction log to disk on every commit, at the cost of some durability
- -v - Enables verbose mode
- >/data1/21201.log 2>&1 - Redirects all output from stdout and stderr to the /data1/21201.log file
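If you would rather start the server from a script than from the shell, the same flags can be assembled in Python. This is only a sketch: `build_command` and `launch` are names I made up, the flags are exactly the ones listed above, and the shell redirection is replaced by handing an open log file to the subprocess module.

```python
import subprocess


def build_command(port, data_dir):
    """Build the memcachedb argument list from the flags described above."""
    return [
        "memcachedb",
        "-p%d" % port,   # port to listen on
        "-d",            # run in daemon mode
        "-r",            # maximize core file limit
        "-H", data_dir,  # database home directory
        "-N",            # enable DB_TXN_NOSYNC
        "-v",            # verbose mode
    ]


def launch(port, data_dir, log_path):
    """Start the server, sending stdout and stderr to log_path.

    Stands in for the shell's '>file 2>&1' redirection.
    """
    with open(log_path, "ab") as log:
        return subprocess.call(build_command(port, data_dir), stdout=log, stderr=log)


# Only print the command here; call launch() yourself once memcachedb is installed.
print(" ".join(build_command(21201, "/data1/21201")))
```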
For more options, type in:
Now that we have our server up and running, let's start populating it with data!
There are 2 ways of doing so: through telnet, or through a client library from your own code.
In order to access your database instance via telnet, you need to execute the following command:
telnet 127.0.0.1 21201 #or whatever IP or Hostname and port number of your db instance might be
Once you are connected, you can start issuing commands. Basic supported commands are as follows:
- get - Command for retrieving data. Takes one or more keys and returns all found items
- set - Most common command. Store this data, possibly overwriting any existing data. New items are at the top of the LRU
- add - Store this data, only if it does not already exist. New items are at the top of the LRU. If an item already exists and an add fails, it promotes the item to the front of the LRU anyway
- replace - Store this data, but only if the data already exists. Almost never used, and exists for protocol completeness (set, add, replace, etc)
- append - Add this data after the last byte in an existing item. This does not allow you to extend past the item limit. Useful for managing lists
- prepend - Same as append, but adding new data before existing data
- incr / decr - Increment and decrement. If the stored item is the string representation of a 64-bit integer, you may run incr or decr to modify that number. Both commands take only positive values; they do not accept negative ones
- delete - Removes an item from the cache, if it exists
- stats - Server statistics
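What you type into telnet is just the Memcached text protocol, so it helps to see what a set and a get look like on the wire. The sketch below builds the request bytes and parses a get reply without touching the network; the function names are my own, and it covers only plain set and get (no cas, no binary values).

```python
def build_set(key, value, flags=0, exptime=0):
    """Encode a 'set' request: a header line, then the raw data block."""
    data = value.encode("utf-8")
    header = "set %s %d %d %d\r\n" % (key, flags, exptime, len(data))
    return header.encode("utf-8") + data + b"\r\n"


def build_get(*keys):
    """Encode a 'get' request for one or more keys."""
    return ("get %s\r\n" % " ".join(keys)).encode("utf-8")


def parse_get_response(raw):
    """Decode a get reply: 'VALUE <key> <flags> <bytes>' plus data, closed by END."""
    items = {}
    pos = 0
    while True:
        line_end = raw.index(b"\r\n", pos)
        line = raw[pos:line_end].decode("utf-8")
        if line == "END":
            return items
        _, key, _flags, size = line.split(" ")
        start = line_end + 2
        items[key] = raw[start:start + int(size)].decode("utf-8")
        pos = start + int(size) + 2  # skip the data and its trailing \r\n


# A canned reply, as the server would send it for 'get key1'.
reply = b"VALUE key1 0 6\r\nvalue1\r\nEND\r\n"
print(parse_get_response(reply))
```

To talk to a real instance, you would send these bytes over a socket connected to the host and port from the telnet command; a successful set is answered with STORED.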
Here's a blog post that does a nice job of explaining how to use telnet with MemcacheD: Examples of Memcached telnet commands
As per the MemcacheD wiki page, there is a large variety of MemcacheD(B) clients, ranging from Perl to Java to Python. Lately, I've been spending a lot of time writing code in Python, so I am going to demonstrate MemcacheDB access with Python. For this blog post, I am going to use pylibmc.
To install this module, we have a couple of approaches:
sudo apt-get install python-pylibmc
or
sudo pip install pylibmc
Once everything has been installed, you can start having fun with MemcacheDB and Python!
Below is a sample script that demonstrates the basic operations (get, set, update, delete):
from pylibmc import Client

hosts = ["localhost:21201"]
mc = Client(hosts)

# demonstrate setting and getting a value
mc.set("key1", "value1")
mc["key2"] = "value2"
print("Value of key1 is: %s" % mc.get("key1"))
print("Value of key2 is: %s" % mc["key2"])

# demonstrate update
mc["key2"] = "updated value2"
print("Value of key2 is: %s" % mc["key2"])

# demonstrate key deletion
mc.delete("key2")
print("key2 does not exist" if mc.get("key2") is None else "key2 exists")
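Setting keys one at a time is fine for a demo, but for populating the database with a larger data set you may prefer to write in batches. Here is a rough sketch built on the client's set_multi method, which, as far as I know, returns the list of keys that could not be stored. The `load_in_batches` helper and its batch size are my own invention, not part of pylibmc.

```python
def load_in_batches(mc, items, batch_size=100):
    """Write a dict of items through mc.set_multi in fixed-size chunks.

    mc is any client object with a set_multi(mapping) method.
    Returns the keys the client reported as failed.
    """
    failed = []
    keys = sorted(items)
    for start in range(0, len(keys), batch_size):
        batch = dict((k, items[k]) for k in keys[start:start + batch_size])
        failed.extend(mc.set_multi(batch))
    return failed


# Example usage against a running instance (commented out so that the
# sketch can be read and imported without a server):
#
#   from pylibmc import Client
#   mc = Client(["localhost:21201"])
#   data = dict(("key%d" % i, "value%d" % i) for i in range(1000))
#   print("failed keys: %s" % load_in_batches(mc, data))
```

Keeping the client as a parameter also makes the helper easy to try out with a stand-in object before pointing it at a real database.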
Running the script above yields the following results:
Overall, this was a huge headache to get up and running. After overcoming the installation hurdle, I ran into a lot of issues trying to add a key to the database. Then I started getting crashes due to segmentation faults. After doing a lot of pointless research on those errors, I decided to start from scratch, and it finally worked. Also, during my research into these problems, I saw that a lot of people were unhappy with Berkeley DB. I'm not quite sure why, but they were trying to move away from it.
Anyhow, I hope this was somewhat helpful.