Today marks 10 years since the first version of memcached was committed to the LiveJournal source tree by Brad Fitzpatrick (original commit here).
Back then it was written in Perl! But it still served a valuable role, providing fast response times and reducing database load. In those 10 years it has become one of the most widely used pieces of technology and an important part of the modern web stack. It has also progressed a long way from the Perl implementation: the recent Symposium on Networked Systems Design and Implementation (NSDI), one of the top systems research conferences, had two papers on memcached, one by Facebook (here) and one by Carnegie Mellon and Intel (here).
So happy birthday Memcached! And congratulations to Brad Fitzpatrick for producing such an influential piece of software (among his many other awesome achievements).
Thanks to Will Clayton for the great birthday cake photo.
Today Heroku announced that they have expanded their platform to include a European data center. We are happy to confirm that MemCachier is fully available in the new European data center, offering the same quality of service, features and price that our users have come to expect and love with Heroku’s existing US data center.
Using MemCachier in Europe on Heroku
The process is the same as provisioning MemCachier in the US. To provision a 500MB cache, simply execute the following command:
$ heroku addons:add memcachier:500
No additional flags are needed as Heroku figures out from your application which region MemCachier will be provisioned in.
Safe Harbor Compliance
MemCachier is not yet a registered participant in the Safe Harbor program. We are looking into this right now and expect to be able to make a decision and lay out a roadmap for this soon.
We love hearing from all our customers and potential customers; it’s the best way to make MemCachier a service that our customers love. If you have any questions about the European availability, or any other comments, please email us at email@example.com. We don’t expect any difference between the US and European regions, but we will be updating our documentation over the coming weeks to better fit this brave new multi-datacenter world for Heroku users.
On the 12th of March between 2am and 11am PST, MemCachier suffered a series of outages and performance problems for many of our production customers. This post briefly summarizes the cause of the problem and the actions that have been taken since to address it.
Firstly, at around 2:11am PST one of our largest clusters saw a surge in demand for memory and computing resources, largely caused by two fairly large customers signing up and using their caches for the first time. While this should have been fine, there was an unknown timing bug in the communication between the cluster and the provisioner. The provisioner is the server that collects resource usage statistics from the cluster, decides when to provision new machines, and makes some simple scheduling and migration decisions to distribute demand efficiently. The bug stemmed from the interaction between needing to provision new machines to handle the extra demand while, simultaneously, one of our virtual machines was failing (with greatly degraded performance) due to issues with Amazon EC2 that were outside our control.
The end result was the provisioner getting into a confused state and making the incorrect decision not to bring new machines into the cluster quickly enough. As a result, a few of the machines had processes on them that ran out of memory and restarted. This is completely unacceptable, but in itself it would not have caused a large impact for many of our customers, as most of them correctly treat MemCachier as a cache and so can withstand data loss. However, as is generally the case with such systems and outages, this behaviour triggered a recently introduced issue elsewhere in the system.
The second, follow-on issue was that the then-deployed version of MemCachier’s proxy layer (the layer that manages communication from each front-end server to all backends) didn’t handle restarted backends correctly. While it normally detected broken sockets and removed them from its list of known servers, the latest version had introduced a bug in that code, so broken sockets to now non-existent backend processes stuck around. This meant a request would sometimes silently fail when the proxy tried to send it over a broken socket. This second issue caused the bulk of the problems that customers experienced, as our monitoring infrastructure wasn’t well enough equipped to detect it. A request would or wouldn’t work depending on the key, since the key determines which backend node is chosen. Our range of tested keys picked up some of these broken connections, but not all of them, and the silent-error behaviour muddled things further.
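The failure mode is easiest to see in a small sketch. The following is purely illustrative Python (not MemCachier’s actual proxy code, which this post doesn’t show), with a toy byte-sum hash standing in for a real hash function: the proxy picks a backend by hashing the key, and it must evict a backend whose socket breaks, otherwise every request hashing to that backend keeps failing.

```python
class Proxy:
    """Toy key-hashing proxy. `backends` maps a backend name to a
    send(key) callable; a real proxy would hold TCP sockets instead,
    and a raised OSError stands in for a write on a broken socket."""

    def __init__(self, backends):
        self.backends = dict(backends)

    def _pick(self, key):
        # The key determines which backend serves it, which is why a
        # single broken backend makes only some keys fail.
        names = sorted(self.backends)
        return names[sum(key.encode()) % len(names)]

    def request(self, key):
        if not self.backends:
            raise RuntimeError("no live backends")
        name = self._pick(key)
        try:
            return self.backends[name](key)
        except OSError:
            # The behaviour the buggy release lost: evict the broken
            # backend and retry, instead of failing silently.
            del self.backends[name]
            return self.request(key)
```

The bug described above is equivalent to skipping the `del` line: the dead entry lingers in `self.backends`, so any key hashing to it fails while all other keys work, which is exactly why only some tested keys exposed the problem.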
The result of all of this was the most serious problem we have faced yet, and the first one that resulted from mistakes by us rather than problems flowing on from the hardware and network layers. We’d like to apologise deeply to all our customers. This is unacceptable and we are ashamed of it.
We have taken a number of actions since the problem occurred to improve our processes to ensure such issues don’t occur in the future. These include:
- The entire code base has been reviewed and improved, with a lot of attention given to how components interconnect. We have greatly improved our testing of these interactions and documented various invariants and properties we rely on.
- We’ve spent the last two weeks improving our testing infrastructure and monitoring systems. A large part of this is the ability to replay recorded (real) production data and synthesized data that models various scenarios.
- We’ve formalized our release process. While one previously existed, it had evolved over time and the culture around it wasn’t careful enough. Various checkpoints and reviews now exist to ensure no such bugs make it into future production releases.
We have made some great changes in the last two weeks, but these are just the beginning of a process of continual improvement to our code, culture and practices. We are happy and grateful to have so many wonderful users. Thank you.
At 3:36pm PST we saw a large number of failures across our production cluster. We quickly found that a single machine (MC4) was responsible. We run a distributed cluster that scales extremely well but also ties many machines together, such that a single failing machine can cause unresponsiveness throughout much of the cluster. We are actively working on strategies to manage this coupling better.
By 3:40pm MC4 had been largely isolated from the rest of the cluster, bringing all other machines back to full performance. We then migrated all customers off MC4 over the next few minutes, completing this operation by 3:44pm PST.
The issue with MC4 is not yet fully understood. Its responsiveness to ICMP (ping) packets went from the expected 0.3ms range to 500ms, already at the limit of typical memcache timeouts. We hadn’t changed anything on that machine prior to the incident, and load was normal, so at this point we suspect an issue with Amazon’s network or underlying hardware. We will continue to investigate and let you know as we learn more.
We apologize to all affected customers.
MemCachier can now power your WordPress object cache thanks to a plugin Per Søderlind created. The WordPress object cache is used to cache computationally expensive operations such as complex database queries. By using Per’s MemCachier WordPress plugin, you can speed up page load times for your readers and get all the added benefits of using MemCachier to power your cache — easy setup, high availability, and an analytics dashboard.
To get started with the MemCachier WordPress plugin, visit the WordPress MemCachier plugin page and click on “installation”. The instructions are simple, and although they’re tailored specifically to AppFog, the MemCachier WordPress plugin will work on any of our supported partner platforms. As long as you’ve configured the MEMCACHIER_SERVERS, MEMCACHIER_USERNAME, and MEMCACHIER_PASSWORD environment variables, the plugin will work.
Today we took part in a webinar hosted by Joyent covering the topic of Platform-as-a-Service vs. Do-it-yourself. It was great fun and covered some important questions about MemCachier, what the team here believes in, and why we are proud to be part of the PaaS movement. You can find a recording of the webinar here.
We need to be very careful when we write and deploy software. If we ship a critical bug or screw up a release, our cache might go down and take down our customers’ websites with it. Everything we do is focused on offering a powerful, stable service to our customers, and we’ve developed a software engineering process to encourage this.
We have several guidelines we follow when we’re writing and deploying software:
- Ship code when it’s finished and no sooner.
- Code review every diff unless you pair programmed.
- Communicate asynchronously when possible. Try not to interrupt others.
- Canary release when possible.
- Test and benchmark everything.
- Be an artist and write beautiful code.
- Support decisions with data and reason.
We also optimize for uninterrupted hacking: no meetings, no tracking tickets, no burn down charts, no sprints, no mandated work hours, no mandated work location, no vacation policy, and no deadlines.
All of this may seem counterintuitive. Usually careful code is written under heavy organization and process — think commercial airplane code. Yet our process has worked very well for us. Our recent outages weren’t caused by bad code or release mistakes. The code we write has thus far made its way safely to the memcache requests of our customers.
Obviously not breaking our service is very important to us. But so is our happiness. The culture we’ve adopted has given us incredible freedom to be happy. We like exploring programming languages and different designs and encodings. We care about the stack from the hardware level of NICs, caches and registers, up to assembly and higher-level languages. We read assembly code one minute, Ruby the next, and then RFCs on DNS and TCP/IP after that. We value our code and our knowledge, and we’ve chosen a culture that accommodates them.
If you’re an engineer and you want to work in a hacker culture, you should come work with us at MemCachier. We’re hiring.
Image credit: hacking the Gibson
We’re doing this with three other great companies (in addition to Joyent themselves) that may be of interest to MemCachier customers:
- Skookum Digital Works - A leading custom Mobile & Web App development company.
- Message Bus - A service for scalable and reliable delivery of email and mobile messaging.
- PodOmatic - A large podcast hosting website geared towards ease-of-use and light-weight social features.
You can sign up for the free webinar here, as well as check out the bios of the speakers.
Some of our customers in Amazon’s us-east-1 region (Virginia) noticed degraded performance and partial outages starting Friday, 11/16 through Sunday, 11/18, and again on Wednesday, 11/21. We want to take this opportunity to explain what happened.
Friday, 11/16 through Sunday, 11/18
On Friday we started noticing what looked like a DDoS attack. Two of our machines would be totally fine one second, then be hit instantly with 16,000 TCP connections. The spike caused new TCP connections to be rejected, which meant degraded performance for customers who attempted to create new TCP connections at that time. Most memcache clients use persistent TCP connections, so most of our customers didn’t experience downtime. However, because the cluster was so overloaded, some customers experienced slower performance. This largely affected our proxy servers (the servers customers actually connect to), so only a subset of customers were affected while most of the cluster continued operating normally.
By Friday evening the attacks had subsided and we were able to breathe a little and take a close look at what was happening. Upon inspection, we noticed that the majority of these 16,000 TCP connections were in a CLOSE_WAIT state, meaning the client had closed the connection, but our server was still holding on to the file descriptor. Upon further investigation, we found that the degraded performance was causing memcache clients to timeout, causing them to retry a connection, which created a new TCP connection. This snowballed and resulted in an effective DDoS attack from our customers. The degraded performance was due to a large customer who experienced a massive traffic spike.
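A standard client-side defence against this kind of retry storm is to back off between reconnect attempts instead of retrying immediately. Here is a minimal sketch in Python (illustrative only; the post doesn’t prescribe a specific client fix) of exponential backoff with jitter, so thousands of clients don’t hammer a recovering server in lockstep:

```python
import random
import time

def connect_with_backoff(connect, max_attempts=5, base_delay=0.05):
    """Call `connect` (any callable that returns a connection or
    raises OSError on failure) with exponential backoff and jitter.
    Immediate retries are what turned client timeouts into a flood of
    fresh TCP connections; spacing retries out gives the server room
    to recover, and the random jitter desynchronizes the clients."""
    for attempt in range(max_attempts):
        try:
            return connect()
        except OSError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            delay = base_delay * (2 ** attempt)
            time.sleep(delay + random.uniform(0, delay))
```

Each failed attempt doubles the wait, so a brief server stall costs a client a few short sleeps rather than producing a new half-open connection every timeout interval.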
We have implemented a few measures to help prevent one customer from degrading other customers’ performance. We’ve also booted several more high-CPU, high-IO servers to help spread the load. We’ve also improved our status page to be more informative and useful, with more changes coming soon.
Wednesday, 11/21
This outage was caused by a misbehaving machine. We had recently turned on a new machine (mc6.ec2.memcachier.com) to expand the cluster. Part of our provisioning process is running tests on the new machine to ensure that it’s operating normally. However, our tests didn’t catch an issue with mc6: it had an abnormally slow network connection, which caused the rest of the cluster to respond slowly as well. mc6 contains data nodes, meaning other servers such as mc1 make requests to it to retrieve data. Sitting at the lowest level of a (shallow) distributed architecture, it caused the slow response times to ripple through the rest of the cluster. This slowness caused many memcache clients to time out, forcing new TCP connections, which overwhelmed all of our servers. The end result was very similar to Friday, but worse in that the whole cluster was affected for a short while.
We removed the misbehaving machine and the cluster immediately started operating as normal.
We’re deeply sorry for the outages, and we’re grateful for your patience. We understand our customers place a lot of trust in us: if we go down, often they go down, too. We take this responsibility very seriously.
We worked very hard over the weekend and Thanksgiving week to manage these problems and learn from them. We have several changes coming that will improve the performance and reliability of our service, allowing us to grow to the next level. We will announce them shortly once we start deploying them. Finally, we’ll continue to do everything we can to offer you the best memcache service available.
We made a slight change to the analytics dashboard to show 30 days of usage data. Previously we were only showing 7 days. Take a look at your dashboard and tell us what you think.
PaaS customers can access their analytics dashboard by visiting their PaaS provider’s website and clicking on the MemCachier add-on. Other customers can get access to their analytics by logging in to my.memcachier.com, clicking on their application, and then clicking “analytics”.
We’re always trying to improve our analytics dashboard, so don’t be shy in giving us feedback.