CI at Oribi

At Oribi, as at many other fast-paced startups, we strive to get our code to the end user as fast as possible. This means that once a developer pushes new changes, they could be running in production within minutes.

This post is a high-level overview of our continuous integration environment, some of the design decisions and tradeoffs we considered when building it, and a shallow introduction to the tooling and technologies we utilize.

TL;DR – Standardize your deployment, automate as much as possible, limit your technological stack and pick infrastructure tools and technologies that are powerful yet simple to learn and use.

Intro

When we first started designing our CI process, we had several goals in mind:

  • Getting code to production should be simple.
  • Automate as much as possible.
  • Be fearless (but not careless).

The first and second bullets basically mean that, for a given service or orchestration, a developer only needs to push a button for the code to be deployed. The third describes a state of mind more than a concrete implementation, and it is one of the guiding principles set by Avi (our VP R&D) – push code to production as soon as you consider its user-facing feature complete. We would rather see code break in production than have it accumulating digital dust in a forgotten VCS branch or pull request. As such, our systems see dozens of updates across multiple services in a given day. An astute reader may comment that having code break production is potentially worse than waiting for it to pass certain reviews – and she would make a good point. That’s what rollbacks are for, but I’ll get to that later.

Technology

In order to run a successful CI process with a small team such as ours (8 engineers at the moment), we had to keep things simple and, most importantly, uniform. As is common nowadays, we work in a microservices-oriented infrastructure and therefore attempt to break components down into isolated units of work and responsibility.

Our services are written using Spring Boot on the backend and a combination of React + MobX on the frontend. We run on AWS compute and database infrastructure, and all our services are shipped as Docker images. Fairly simple, and fairly standard. While it’s always exciting to try new tools and technologies, there are advantages to keeping our stack basic:

  • Everyone can contribute everywhere; our code structure is standardized and therefore familiar even if you haven’t touched a particular service before.
  • The build & deploy process is always the same, regardless of the service.
  • Setting up a new service is straightforward and we don’t have to patch things and handle edge cases to ship it.
  • Ops problems become familiar and odds are someone on your team has seen them before.

For the continuous integration part, we use the Jenkins CI server – it’s free, powerful, has a great community and is very simple to install and maintain. Our automation is written using Ansible; we use it for anything from configuring servers and deploying code to simply running commands and maintenance tasks on remote hosts. Ansible is relatively straightforward to learn and offers a very extensive tool belt in the form of internal and external modules (AWS and Docker modules, for example).

Process

When we create a new service, we set up a few things:

A GitHub Repo – for the most part this is a standard Java project. It also contains a dead simple Dockerfile: copy a prebuilt JAR and run it.
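
For illustration, such a Dockerfile can be as short as the following sketch – the base image, JAR path and port here are assumptions, not our actual file:

    # "Copy a prebuilt JAR and run it" – an illustrative Dockerfile sketch.
    FROM openjdk:8-jre
    COPY target/service.jar /opt/service/service.jar
    EXPOSE 8080
    ENTRYPOINT ["java", "-jar", "/opt/service/service.jar"]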

A Jenkins build job – triggered by a webhook when code is pushed to GitHub, this job builds the project’s JAR, builds the Docker image and pushes it to our Docker registry. It’s worth noting that while we maintain our own registry, there are many cloud-based registries like Amazon ECR, docker.com or Google’s Container Registry.
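
In shell terms, such a build step can look roughly like the following – the registry and project names are made up, and a Maven build is assumed:

    # Illustrative build-and-push steps, not our exact job definition.
    REV=$(git rev-parse --short HEAD)      # tag images with the git revision that triggered the build

    ./mvnw package                         # build the Spring Boot JAR

    docker build -t registry.example.com/user-service:$REV .
    docker tag registry.example.com/user-service:$REV registry.example.com/user-service:latest

    docker push registry.example.com/user-service:$REV
    docker push registry.example.com/user-service:latest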

An Ansible deploy configuration – a small JSON file containing metadata for the service’s deploy: the EC2 tags for the servers that will run the code, the project’s Docker settings (ports to expose or volumes to mount), any environment variables that need to be set, etc. A typical configuration file might read like this:
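
(The snippet below is an illustrative sketch only – the field names and values are hypothetical, not our actual schema.)

    {
      "service_name": "user-service",
      "hosts": { "ec2_tag_role": "user-service", "ec2_tag_env": "production" },
      "docker": {
        "image": "registry.example.com/user-service",
        "ports": ["8080:8080"],
        "volumes": ["/var/log/user-service:/logs"]
      },
      "env": {
        "SPRING_PROFILES_ACTIVE": "production"
      },
      "health_check_path": "/health"
    }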

A typical service is usually deployed with a variation of this kind of configuration.

Set up EC2 instances to run the code – The next step is to launch one or more machines to run the application. We have a generic machine image (AMI) that’s already configured with everything we need – basically Docker plus miscellaneous monitoring and convenience utilities – and we run an Ansible playbook that sets up all the relevant tags and environment properties these instances need to thrive in our deployment ecosystem. This step can and will be further automated… Later.
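
For illustration, the tagging part of such a playbook might use Ansible’s ec2_tag module along these lines – the region, tag names and the instance_id variable are hypothetical:

    # Illustrative task: tag a freshly launched instance so deploy playbooks can find it.
    - hosts: localhost
      connection: local
      tasks:
        - name: Tag the new instance with its service role and environment
          ec2_tag:
            region: us-east-1
            resource: "{{ instance_id }}"   # hypothetical variable holding the instance ID
            state: present
            tags:
              role: user-service
              env: production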

A Jenkins deploy job – We set up a deploy job for each service. This job simply runs an Ansible playbook that does the following for all servers in a given deploy group:

  1. Pull the service’s docker image from our docker registry onto the target machine. It will pull the latest docker image, unless a specific revision is specified1.
  2. Remove the current server instance from the service’s load balancer (if applicable).
  3. Boot the service using the configuration parameters from earlier.
  4. Run a basic health check – in our case we check that cURL returns a 200 status for a specific endpoint.
  5. Register the instance back with the load balancer.

Our deploy playbook runs on all the hosts specified in the deploy configuration (the “hosts” key). Further, we use Ansible’s serial property to trigger a rolling deployment – meaning the deployment will first attempt a single host and, if successful, proceed to the remaining instances. This way, if the deployment breaks, you still have instances running the previous revision and your system remains alive.
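
Put together, a rolling-deploy playbook along those lines can look roughly as follows – the module choices, variable names and values are illustrative assumptions, not our actual playbook:

    # Illustrative rolling deploy – hosts come from the deploy configuration's "hosts" key.
    - hosts: "{{ deploy_hosts }}"
      serial: 1                      # one host at a time: a rolling deployment
      tasks:
        - name: Pull the service image (latest unless a specific revision is given)
          command: "docker pull registry.example.com/user-service:{{ revision | default('latest') }}"

        - name: Remove the instance from its load balancer (if applicable)
          local_action:
            module: ec2_elb
            instance_id: "{{ ec2_instance_id }}"   # hypothetical variable
            ec2_elbs: "{{ elb_name }}"
            state: absent
            region: us-east-1

        - name: Boot the service with the configured ports and environment variables
          docker_container:
            name: user-service
            image: "registry.example.com/user-service:{{ revision | default('latest') }}"
            ports: ["8080:8080"]
            env:
              SPRING_PROFILES_ACTIVE: production
            state: started
            recreate: yes

        - name: Basic health check – expect HTTP 200 from a specific endpoint
          uri:
            url: "http://localhost:8080/health"
            status_code: 200
          register: health
          retries: 10
          delay: 5
          until: health.status == 200

        - name: Register the instance back with the load balancer
          local_action:
            module: ec2_elb
            instance_id: "{{ ec2_instance_id }}"
            ec2_elbs: "{{ elb_name }}"
            state: present
            region: us-east-1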

It’s worth noting that we still consider some of these steps a bit tedious and plan to automate the entire service creation/deployment/configuration procedure, but that’s still on our TODO list. Early phase startups constantly have to weigh engineering tasks’ priorities against those of the actual product, and considering our limited developer resources we deemed this feature somewhat less urgent.

Monitoring and Notifications

Even though the build/deploy pipeline is mostly automated, it’s still important to keep tabs on how your stack is performing. We measure a plethora of metrics at nearly every point in our code (these are collected by Graphite) and then define graphs and alerts over this data. A notification is sent to the appropriate Slack channel when problems arise – our team is usually quick to respond and issues are resolved within minutes. We also use Slack to receive notifications on any code push, build or deploy job launched. This way everyone is up to speed on everything that’s going on, including non-developers; our marketing, UI and product teammates receive real-time Slack notifications on any user-facing changes made to the app.
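
For a concrete flavor – the hosts, metric names and webhook URL below are placeholders, not our actual setup – Graphite accepts metrics over a simple plaintext protocol, and Slack messages can be posted through an incoming webhook:

    # Send a metric to Graphite's plaintext interface ("name value timestamp" on port 2003).
    echo "oribi.user_service.signups.count 1 $(date +%s)" | nc graphite.internal.example.com 2003

    # Post a deploy notification to a Slack channel via an incoming webhook.
    curl -X POST -H 'Content-type: application/json' \
      --data '{"text":"user-service deployed revision abc1234 to production"}' \
      https://hooks.slack.com/services/T0000/B0000/XXXXXXXX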

Final Words

We don’t consider this system to be perfect – far from it. It still involves some manual work and occasional tuning, but it was fairly simple to set up and is easy to maintain. All things considered, it took about two engineer-weeks to design and build.

If you’re just getting started with CI, consider building something basic that solves most of your needs, and improving from there. This is especially true for small teams with limited resources – we want to spend most of our time building the actual product and strive to reduce the overhead around our process. For us, our CI system provided a lot of value very quickly, and at a fairly low cost.

Tips (in no particular order):

  • Pick an IT automation tool and write a playbook/recipe for any task you or someone else might have to do more than once. Ideally, you shouldn’t have to SSH into remote machines (we still do, sometimes, but a man can dream).
  • Tag your docker images with something sensible like a git revision, so you’ll be able to easily refer to them and understand their content. And, should things break, rollback is simply a matter of figuring out the latest stable build.
  • Tag your services with that same revision so you’re able to understand which version of your application is currently running. Each of our services has an endpoint that returns the git revision of the currently deployed code (a minimal sketch of such an endpoint follows these tips).
  • Use external service providers where you can, depending on your financial capabilities. We use AWS and GitHub, which saves a lot of IT overhead.
  • Use rolling deployments coupled with service redundancy2 – it has saved us on more than one occasion when a certain deploy has gone belly up.
  • Ensure your tech stack is uniform and lean – this will unify and simplify your build/deploy process, and save you the trouble of dealing with edge cases and patching.
  • Keep your team up to speed on the internals of your CI process; it will become useful when things break and you’re out of the office ;).
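
For illustration, a revision endpoint of that kind can be as small as the following Spring Boot controller – the class name, property name and path are hypothetical:

    // Minimal sketch of a "which revision is running?" endpoint.
    // The git.revision property is assumed to be baked into the build by the CI job.
    import org.springframework.beans.factory.annotation.Value;
    import org.springframework.web.bind.annotation.GetMapping;
    import org.springframework.web.bind.annotation.RestController;

    @RestController
    public class VersionController {

        @Value("${git.revision:unknown}")
        private String gitRevision;

        @GetMapping("/version")
        public String version() {
            return gitRevision;
        }
    }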

1 We tag our docker images using the git revision for the commit that triggered the build. This is especially useful when you need to roll back changes to a specific version, or debug a build locally.

2 Maintain at least 2 instances for each service, and deploy your code to each separately – such that if one fails during deployment, the rest are still alive and active.