How CodeGuard gives back
We’ve talked in the past about how we’ve leveraged Amazon’s compute services to build a scalable and efficient architecture to serve our customers. The biggest downside to this configuration from an infrastructure standpoint was that each server had its own public IP address. That may not sound like a big deal, but when we’re starting and stopping hundreds of servers per day, it made it difficult for us to tell customers with certainty which IP addresses would be used to perform their website backups.
Many hosting providers recommend or even require that the IP addresses of servers initiating SFTP, FTP, MySQL, or SSH connections be added to a whitelist or firewall rule. This meant that there were some customers we could not serve, and some that, understandably, did not want to allow all traffic from Amazon’s public cloud. Today that all changes!
(Actually, we’ve been slowly transitioning and load-testing the new infrastructure components over the last several weeks, but today we’re ready to share it with everyone!)
Here is what you need to know if you would like to add our IP addresses to your firewall or otherwise whitelist connections from our service. All of our outbound connections are now originating from these IP addresses:
We will reduce this list over the next few months, but we want to provide a transition period for those customers that may need to update their existing MySQL whitelist or firewall configurations. The most up-to-date list of IP addresses will be maintained on our Support Center if you need to reference them in the future. That’s all you need to know, but if you’re interested in the technical details, read on!
It’s no secret that we use Amazon Web Services to power CodeGuard. We’re big fans of their products and continue to be impressed with the pace of development that they maintain. What we set out to do was move our backup servers from the public EC2 cloud into a Virtual Private Cloud (VPC). This would allow us to concentrate our backup resources behind a single gateway server that provides Network Address Translation (NAT). This NAT instance can be assigned a public IP address, so all outgoing traffic from our backup servers to our customers would originate from a single IP address. This approach would give us the desired external IP control while still allowing us to spread our workload across many servers.
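For the curious, here’s a minimal sketch of that routing arrangement using Python and boto3. The resource IDs are placeholders for illustration, not our actual infrastructure:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Placeholder IDs for illustration only.
PRIVATE_ROUTE_TABLE = "rtb-0123456789abcdef0"  # route table for the backup-server subnet
NAT_INSTANCE_ID = "i-0123456789abcdef0"        # NAT instance holding the public IP

# A NAT instance forwards traffic that isn't addressed to it, so the
# EC2 source/destination check has to be disabled on that instance.
ec2.modify_instance_attribute(
    InstanceId=NAT_INSTANCE_ID,
    SourceDestCheck={"Value": False},
)

# Default route: all outbound traffic from the backup servers leaves the
# VPC through the NAT instance, so customers see a single source IP.
ec2.create_route(
    RouteTableId=PRIVATE_ROUTE_TABLE,
    DestinationCidrBlock="0.0.0.0/0",
    InstanceId=NAT_INSTANCE_ID,
)
```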
Unfortunately, executing this approach was not as straightforward as we had hoped, primarily due to our use of Amazon’s S3 service for backup storage. There are no special accommodations made within the VPC for access to S3, which means that all S3 traffic would also go through our NAT. We move several terabytes of data in and out of Amazon’s S3 service every day, and our testing showed that with the S3 traffic going through our NAT, we would have a maximum server-to-NAT ratio of about 5:1 – that is, 5 servers processing backups for each NAT. Any more than that and we saturate the NAT’s network connection and start dropping traffic. At our current scale, that would mean at least 30 NAT servers at peak and, consequently, 30+ IP addresses for our customers to manage. Needless to say, we didn’t like that solution – it was expensive, inefficient, and very complex to scale automatically.
Fortunately, in late November, Amazon’s relentless development machine quietly rolled out a small bit of functionality that allowed us to change directions with this project. What they provided was a single API endpoint that publishes an up-to-date list of IP ranges for Amazon’s services. From this, we could determine which IP ranges contain our S3 endpoints, and our backup servers could then be configured to route S3-bound traffic directly to those ranges rather than through the NAT. With this arrangement we were able to achieve a server-to-NAT ratio of more than 200:1. That gives us a comfortable margin for our current workload and plenty of room to continue growing.
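Here’s a minimal sketch of that technique. The ip-ranges endpoint and its JSON fields are Amazon’s published feed; the resource IDs are placeholders, and this assumes a setup where the backup servers also have public IPs so return traffic from S3 can reach them directly:

```python
import json
import urllib.request

import boto3

# Amazon publishes its IP ranges at this well-known endpoint.
IP_RANGES_URL = "https://ip-ranges.amazonaws.com/ip-ranges.json"

with urllib.request.urlopen(IP_RANGES_URL) as resp:
    ip_ranges = json.load(resp)

# Keep only the S3 prefixes for our region.
s3_prefixes = [
    p["ip_prefix"]
    for p in ip_ranges["prefixes"]
    if p["service"] == "S3" and p["region"] == "us-east-1"
]

ec2 = boto3.client("ec2", region_name="us-east-1")

# Placeholder IDs for illustration only.
PRIVATE_ROUTE_TABLE = "rtb-0123456789abcdef0"
INTERNET_GATEWAY_ID = "igw-0123456789abcdef0"

# Send S3-bound traffic straight out the Internet gateway; everything
# else still follows the default route through the NAT instance.
for prefix in s3_prefixes:
    ec2.create_route(
        RouteTableId=PRIVATE_ROUTE_TABLE,
        DestinationCidrBlock=prefix,
        GatewayId=INTERNET_GATEWAY_ID,
    )
```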
This is functionality that’s been on our list for a long time now and we’re happy to cross it off. For those of you that have been waiting, we appreciate your patience and hope that this provides some added insight to our approach.
The ability to schedule the time that a particular backup runs is something that we’ve been happy to help customers with through our support team, but until today we had not exposed this functionality in our dashboard. Why the long beta period?
The short answer is: scale. This is a class of problems that is easy to solve when there are not many operations competing for resources, but becomes much more difficult as the number of operations increases. Imagine a simple case where a single backup has to run once per day and there is a dedicated server for this task. In this contrived scenario, the backup could be scheduled to run at any time, and since the server it’s running on is idle all day, you can almost guarantee that the backup operation will start at the scheduled time and complete successfully. In a more realistic scenario, imagine a server that has a backup scheduled to run at a particular time, but that is also hosting an active website and a database. The backup could still be scheduled to run at any time, but the certainty that the operation will start as scheduled and complete successfully diminishes.
In our case, we’re running anywhere between 10 and 150 servers to service the more than 200,000 backups we perform on a daily basis. So, how do you solve the challenge of scheduling while maintaining CodeGuard-levels of reliability? We make sure that we have the spare capacity available to run the backups at the times they are scheduled, using an infrastructure management service that we’ve developed internally. This service, which we affectionately call Steward, watches after all of our servers, the backup operations running on them, and the queue of pending backup and restore operations. When the need arises to add capacity to handle the upcoming load, Steward starts more servers to accommodate it. Similarly, as backups finish and the servers become idle, Steward shuts them down. This arrangement allows us to have just-in-time resources available for all of our scheduled, on-demand and unscheduled backup and restore operations. Steward has also helped with server configuration management, versioning, deployment, fault-tolerance and cost reduction, but those are topics for another post!
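Steward itself is internal, but at the heart of any service like it is a capacity decision. Here’s a deliberately simplified, hypothetical sketch in Python – the constant and function are ours for illustration, not Steward’s actual code:

```python
# Assumed per-server throughput for this sketch; the real figure varies
# with backup size and type.
BACKUPS_PER_SERVER = 50

def plan_capacity(pending_backups: int, running_servers: int,
                  idle_servers: int) -> int:
    """Return the change in server count: positive to launch, negative to stop."""
    # Ceiling division: how many servers the current queue requires.
    needed = -(-pending_backups // BACKUPS_PER_SERVER)
    if needed > running_servers:
        return needed - running_servers       # launch enough to cover the queue
    surplus = running_servers - needed
    return -min(surplus, idle_servers)        # only stop servers that are idle

# Example: 1,200 queued backups, 20 servers running, 3 of them idle
# -> plan_capacity(1200, 20, 3) == 4, i.e. launch four more servers.
```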
In addition to the back-end infrastructure changes, we have also updated our dashboard to reflect the new scheduling ability. Not only do we have an option for it on the website backup settings page, but we’ve updated all of the times and reporting functionality to accurately reflect the backup times in the customer’s local time zone. If you’ve ever tried to schedule a meeting with a colleague or customer in a different time zone, you know that this is not as easy as it sounds! We wanted our interface to be very clear about the scheduling time to ensure that there is no confusion for customers, regardless of what part of the planet they happen to be on relative to our servers.
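One common way to keep this sane – not necessarily the exact mechanics of our dashboard – is to store every schedule in UTC and convert only at display time. A small sketch:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

# The schedule lives in UTC; only the display layer converts it.
scheduled_utc = datetime(2015, 3, 7, 9, 0, tzinfo=timezone.utc)

# Hypothetical customer time zone setting.
customer_tz = ZoneInfo("America/New_York")

local_time = scheduled_utc.astimezone(customer_tz)
print(local_time.strftime("%I:%M %p %Z"))
# -> "04:00 AM EST": the same instant, shown in the customer's local time.
```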
A quick note about database backups – currently, database backups for a particular website will run at the same time as the website backup. For backup consistency, especially for database-backed applications like WordPress, Joomla!, and Drupal, it’s important that the backups of the database and file content are taken in close proximity to each other. We have worked very hard to ensure that our website file backup and database backup processes impose minimal load on your server and, therefore, your running applications. If you are concerned about load or are using legacy MySQL storage engines, you can always schedule your website and database backups to occur at an off-peak time.
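To make the storage-engine point concrete, here’s a hypothetical example (the host and database names are made up, and this is a generic mysqldump invocation rather than our backup pipeline). With InnoDB, a dump can take a consistent snapshot inside a single transaction without long table locks; legacy engines such as MyISAM must lock tables for the duration of the dump, which is why off-peak scheduling matters more for them:

```python
import subprocess

# Dump the database with a consistent, low-impact snapshot (InnoDB).
with open("wordpress_db.sql", "wb") as out:
    subprocess.run(
        [
            "mysqldump",
            "--single-transaction",  # consistent snapshot without long locks
            "--host", "db.example.com",
            "--user", "backup",
            "wordpress_db",
        ],
        check=True,
        stdout=out,
    )
```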
Ready to give backup scheduling a try? Check out the article in our support center for detailed instructions.
At CodeGuard, we’ve been able to push a high quality product to market quickly and scale it reliably. Today, we would like to share what happens on a daily basis to make this type of rapid production possible.
Everyone in the office has different schedules and different priorities. While each team member’s daily routine is different and changes from day to day, here’s what a typical day at CodeGuard for one of our employees might look like.
8:00 Wake up, eat, and get ready for work
9:00 Arrive at work, respond to emails, and plan for the day
9:30 Begin testing new selective restore features
10:00 Gather with the rest of the team for our daily scrum meeting
10:15 Modify the selective restore pipeline to address an unhandled case
10:45 Write an article for our weekly blog post
12:00 Review and implement suggestions made by another team member
12:30 Eat lunch
1:00 More testing for the new selective restore features
1:30 Monthly meeting with supervisor
2:00 Final round of testing for the new selective restore features
3:00 Deploy new selective restore features
3:15 Final testing for the new internal performance metrics dashboard
4:00 Deploy new internal dashboard
4:15 Celebrate two successful deploys by ringing our ceremonial CodeGuard gong
4:30 Leave early and enjoy the holiday weekend
Now that you’ve seen what a typical day at CodeGuard looks like, let’s talk about our development process.
Our day begins with our daily morning scrum meeting. For those unfamiliar with Agile development practices, a scrum meeting is a way for members of a team to share updates regarding ongoing projects. In our scrum meetings we form a circle and each member shares a high-level overview of the tasks they completed yesterday, the tasks they plan to complete today, and the tasks they need help with in order to move forward. These meetings generally last 10-15 minutes. The rest of the day involves completing the tasks covered during our morning scrum. Typical tasks include planning for new projects, developing current projects, or testing completed projects.
Planning begins at our conference table with each project participant expressing their own thoughts and concerns about project execution. Because our team is small, there is no designated project leader. Instead, one team member is asked to research the project beforehand and help guide the discussion. Team members are encouraged to contribute, and sitting idly is not an option. One of the great things about working for a small company is that we have the freedom to try new things. For example, although our main application is written in Ruby, a few of our most recent projects have been written in Go. This level of flexibility would be hard to find at a larger company.
Once we have a plan, we break the project apart into manageable tasks and begin development. We are lucky enough to have access to top-of-the-line tools for writing software. Each employee has their own workstation consisting of an adjustable-height desk, a 27-inch Retina display, and the latest MacBook Pro. Although each team member is free to choose their own development environment, most of us gravitate towards command-line tools like vim and tmux. We use GitHub issues to keep track of what is being worked on and by whom.
After a task is completed, it must be submitted for review. Depending on the importance of the task, one or several other team members will test and review changes in our staging environment before submitting the final changes to our production servers. In addition to automated testing, the review process often involves analyzing code for uncaught syntactic and logical errors.
Although we spend most of our time working hard to improve our service, being a CodeGuardian isn’t all work and no play. To keep things lighthearted and fun, we have Happy Hour every Friday afternoon, and we routinely plan outings to nearby events. Our most recent adventure involved participating as extras in a big movie production being filmed in downtown Atlanta.