Uncategorized | Kishore Gopalakrishna

Apache Helix – Year in review 2013

December 16, 2013 gkishore

My flight to Peru got overbooked, and they offered us another nonstop flight 6 hours later. I gladly took the offer. Since I had already packed everything, I had nothing to do. Instead of spending time on Facebook and Twitter, I thought of writing about all the exciting stuff that happened around Apache Helix in 2013.

Becoming Open Source and Joining Apache Incubator

Helix was open sourced in October 2012 during SOCC. Along with the open source announcement, we entered the Apache Incubator. Initially I thought it was a big mistake, primarily due to the Apache release process. It took us 3-4 months to make a release. The only code change I made was changing the package name from com.linkedin to org.apache. Thanks to our mentors (Olivier Lamy and Patrick Hunt) for helping us make a first release that passed all the standard Apache checks. I was always cursing, “How can the process of making a release be this difficult even after hundreds of projects have gone through the same path?” and I bet every podling goes through the same phase. I really don’t know why the first release is so difficult and takes so long, but here is a tip: “don’t fight it, just turn off your brain cells and do whatever it takes to make the first release”. Once everything is scripted, subsequent releases are pretty easy. Since that first release we have made 3 more, each done by a different person, and things went smoothly each time.

Entering Apache Incubation provides a lot of benefits, which is why many projects choose to do it. ApacheCon provides a very good venue to showcase the project and interact with other Apache project members.

The First Use Case

After we made the first release, it took a while for us to get the first use case outside of LinkedIn. There were a few interesting discussions on the mailing list. It was a great lesson in how people think about distributed systems. I found that there is a strong hesitancy to think in terms of state machines and transitions. To be honest, it is hard to define systems in terms of FSMs and constraints. It makes one spend more time at the design board than coding, while most of us want to get to coding ASAP. We decided to write some recipes to showcase the common patterns that we all use: Master-Slave, Leader-Standby, Online-Offline. That allowed users to visualize how their system would look when modeled in terms of FSMs and constraints. Following that we had our first production usage: a Wall Street finance company used the Master-Slave pattern and was able to quickly put a distributed system together. We received great feedback:

In fact, the first day of PROD we benefited from Helix since we had to force kill our Master after some emergency change.
The change was made on the Slave, restarted the Slave, then killed the Master to failover to the new Slave.
Everything worked perfectly.

Here is a list of things that Helix is great at
1-It works
2-It’s very flexible
3-The documentation is very good
4-It’s open source, I need to read the code even with good documentation
5-You and your team is very professional and responsive

Things that I like to see
1-I need it to be even more flexible. In particular I need to not fail back to Master. If I restarted a failed Master, I need it to come back up as a Slave instead.
2-I need native DR support.

New Features

Pluggable Rebalancer

The first external use taught us a lot about what users want from the system: an “ability to control the behavior of the framework”. The Helix controller was not flexible enough to plug in custom rebalancers. This motivated us to clean up some code and make the rebalancer pluggable.

http://helix.incubator.apache.org/site-releases/0.6.2-incubating-site/tutorial_user_def_rebalancer.html
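To make the idea of a pluggable rebalancer concrete, here is a minimal Python sketch of the kind of decision a user-defined rebalancer could make. It is not the Helix API: the function name compute_assignment, its parameters, and the data shapes are all hypothetical, chosen only for illustration. The policy it shows is the one requested in the feedback above, where a restarted Master rejoins as a Slave instead of failing back.

```python
from typing import Dict, List, Optional, Set


def compute_assignment(
    preference: Dict[str, List[str]],
    live: Set[str],
    current_master: Dict[str, Optional[str]],
) -> Dict[str, Dict[str, str]]:
    """Hypothetical "no fail-back" policy (illustration only, not Helix code).

    preference:     partition -> replica nodes in preferred order
    live:           nodes that are currently up
    current_master: partition -> node holding MASTER right now (or None)
    """
    assignment: Dict[str, Dict[str, str]] = {}
    for partition, replicas in preference.items():
        # Only live replicas can be given a state.
        alive = [node for node in replicas if node in live]
        incumbent = current_master.get(partition)
        if incumbent in alive:
            # Keep the current master, even if a more preferred node is back.
            master = incumbent
        elif alive:
            # No live master: promote the most preferred live replica.
            master = alive[0]
        else:
            master = None
        assignment[partition] = {
            node: ("MASTER" if node == master else "SLAVE") for node in alive
        }
    return assignment
```

Under these assumptions, if n1 is the preferred node but n2 took over while n1 was down, calling compute_assignment after n1 restarts leaves n2 as MASTER and brings n1 back as SLAVE. A policy that always picked alive[0] would instead fail back to n1.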

Helix Agent

We also realized that many users need this pattern to support non-JVM-based systems. We wrote a standalone Helix agent that can act as a proxy to any process. This was really simple and got us our first big usage outside of LinkedIn, at Box. They used the standalone Helix agent to manage Node.js processes. Unfortunately, we don’t yet have any documentation on how to use Helix Agent.

https://git-wip-us.apache.org/repos/asf?p=incubator-helix.git;a=tree;f=helix-agent
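Since there is no documentation yet, the following Python sketch only illustrates the general idea of a proxy that stands between a cluster manager and an arbitrary process. It assumes the proxy's job is to run one external command per state transition and to report success from the exit code. The class ProcessProxy, its methods, and the placeholder commands are hypothetical and are not taken from the Helix agent code.

```python
import subprocess
import sys
from typing import Dict, List, Tuple

Transition = Tuple[str, str]

# Placeholder commands (assumption): stand-ins for real start/stop scripts.
SUCCEED = [sys.executable, "-c", "pass"]
FAIL = [sys.executable, "-c", "import sys; sys.exit(1)"]


class ProcessProxy:
    """Hypothetical proxy that maps state transitions to external commands."""

    def __init__(self, commands: Dict[Transition, List[str]], timeout_s: float = 10.0):
        self.commands = commands
        self.timeout_s = timeout_s
        self.state = "OFFLINE"  # every managed process starts out offline

    def on_transition(self, from_state: str, to_state: str) -> bool:
        """Run the command for this transition; move to to_state only on success."""
        if from_state != self.state:
            raise ValueError(f"expected to be in {from_state}, but state is {self.state}")
        argv = self.commands.get((from_state, to_state))
        if argv is None:
            raise KeyError(f"no command configured for {from_state}->{to_state}")
        # The managed process can be written in any language; only the
        # exit code of the command is interpreted here.
        result = subprocess.run(argv, timeout=self.timeout_s)
        if result.returncode == 0:
            self.state = to_state
            return True
        return False
```

With a mapping such as {("OFFLINE", "ONLINE"): SUCCEED, ("ONLINE", "OFFLINE"): SUCCEED}, calling on_transition("OFFLINE", "ONLINE") runs the command and leaves the proxy in ONLINE. A non-zero exit code leaves the state unchanged so the caller can retry or raise an alert.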

Python Participant and Spectator

Thanks to Kanak and Jason for writing a Python Helix agent. It allows one to build distributed systems in Python. This is also a great way to build sharded, fault-tolerant MySQL and PostgreSQL systems. We are already seeing great interest in the Python-based Helix agent.

https://pypi.python.org/pypi/pyhelix

Biggest Weakness

docs, docs, docs.

I can easily say that a lot of people have shied away from Helix because of poor documentation, javadocs and APIs. We have so many hidden gems that it has reached a point where even our team members within LinkedIn don’t know what Helix can do. Unfortunately, making Helix easy to use was never our priority. The only excuse we have is: “Like all developers, we are lazy about writing documentation; it’s not challenging enough”. Part of the reasoning was also that if someone is building a distributed system using Helix, they have to look at the code anyway. In an effort to improve things, we dedicated the 0.7.0 release to making our APIs better and would love to get feedback on that.

Skuld

Fortunately or unfortunately, I came across Aphyr (Kyle Kingsbury), who thrashes every distributed system out there (don’t get me wrong, he supports it with relevant data points). He tried using Helix for his Skuld project and had all kinds of problems. I spent a weekend trying to remotely debug the issues and had a really hard time understanding what was going on. Finally, we found that it was because of missing transitions in the state model definition. Helix assumes that one defines the state model correctly and has no validation whatsoever built in. After we fixed the state model to include the missing transitions, everything worked. However, it gave me a dose of how difficult it is for others to use Helix. Since then we have added more logging and improved our documentation. Special thanks to Aphyr for writing the clj-helix library (a Clojure wrapper around Helix).

Skuld https://github.com/Factual/skuld
clj-helix (https://github.com/Factual/clj-helix) // you can see Helix + ZooKeeper in the major drawbacks section :-). I hope it’s more to do with ZooKeeper than Helix.
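The missing-transition problem described above is the kind of mistake a small check can catch before a cluster is started. The Python sketch below is a hypothetical validator, not something that ships with Helix. It assumes a state model is just a list of state names plus (from, to) transition pairs, and it assumes that every state should be reachable from every other state through some chain of transitions. That is a simplification, since a real model may have terminal states that would need to be exempted.

```python
from collections import deque
from typing import Dict, List, Set, Tuple


def reachable(start: str, transitions: List[Tuple[str, str]]) -> Set[str]:
    """Return every state reachable from start by following transitions (BFS)."""
    graph: Dict[str, Set[str]] = {}
    for src, dst in transitions:
        graph.setdefault(src, set()).add(dst)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def validate_state_model(states: List[str], transitions: List[Tuple[str, str]]) -> List[str]:
    """Return human-readable problems; an empty list means the model passed."""
    problems: List[str] = []
    known = set(states)
    # Transitions must only mention declared states.
    for src, dst in transitions:
        for name in (src, dst):
            if name not in known:
                problems.append(f"transition {src}->{dst} uses undefined state {name}")
    # Assumption: any state should be able to reach any other state.
    for src in states:
        for dst in sorted(known - reachable(src, transitions)):
            problems.append(f"no path from {src} to {dst}")
    return problems
```

For a Master-Slave model that declares OFFLINE->SLAVE, SLAVE->MASTER and SLAVE->OFFLINE but forgets MASTER->SLAVE, this check reports that MASTER cannot reach OFFLINE or SLAVE, which points straight at the missing transition.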

The Best News (Instagram Using Helix to Build IG Direct)

This came as a real surprise to me. Thanks to Aphyr for suggesting this to Rick Branson in spite of the tough time he had using Helix. The best part was that they built the system using Helix without asking us a single question, and I only got to know about it after they launched the system in production. I never thought that someone would build a system using Helix without asking any questions!!!

Inside LinkedIn

Meanwhile, inside LinkedIn, Helix adoption continues to grow; it now manages various OLTP, OLAP, streaming and search systems.

Other use cases

Apart from Box and Instagram, Helix is used in other open source systems:

JBoss workbench: http://docs.jboss.org/jbpm/v6.0/userguide/wb.WorkbenchHighAvailability.html. Thanks to Alexandre Porcelli from Red Hat for driving this.
BookKeeper: built-in metadata store (in progress)
Apache S4: Enhance cluster management in Apache S4 https://issues.apache.org/jira/browse/S4-110

Graduation to Apache TLP

Hopefully Helix will graduate to a Top Level Project in December 2013. If approved, I will be the “Vice President” (without any pay). This will allow us to make releases more often. It should also help with adoption: for some reason, people not familiar with the Apache process think that a project in Incubation is not ready for production use.

Next Play

2014 will be exciting and interesting. There are a lot of frameworks being built on top of low-level frameworks, and Helix is designed for just those use cases, as it prevents system builders from “reinventing the wheel” every time. Helix provides high-level APIs so that implementers can think in terms of their cluster instead of primitives in consensus protocols. Here are some areas that we are exploring:

Getting Helix working to make new systems distributed, like RocksDB
Managing the entire life cycle of the cluster with the help of systems like Apache YARN and Apache Mesos
Creating a way for Helix components to efficiently communicate among themselves
Integrating with systems like Riemann to allow Helix to support automated failure response through monitoring
Improving on high-level APIs introduced in the 0.7.0 alpha release

Happy Holidays!

Categories: Uncategorized Tags: Apache Helix, cluster management, distributed systems