Is OpenStack really for you? The aftermath of a failed attempt

This post comes after a failed attempt to help a small Swiss ISP build its cloud offering. The market for selling Internet access is consolidating around the big players, so the owner believed that over the next 5-10 years his business would shift to reselling Virtual Private Servers (VPS), and he thought OpenStack could help him with this new business.
I’ve probably seen more failed projects than happy endings; however, most of the time the failure is not due to OpenStack at all. And this was (unfortunately) the case here as well.
You probably know how much I love OpenStack and that I’ve been a strong supporter since my former boss Mark Shuttleworth put me on the project when I was at Canonical (Ubuntu) in early 2011. But let’s face it: OpenStack is not for everybody. And it’s not a matter of the size of the business, nor of the money you put into the project, but rather of the mindset with which you embrace OpenStack.
Two years ago, when I published my book “OpenStack Explained”, I wrote that “the reality is that OpenStack is just a technology and it enables you to do more if you embrace its philosophy. This requires a company to change deeply in the way IT is conceived”.
Even though I’m an experienced consultant, my biggest mistake was not analyzing the company deeply before starting the project. I took them at their word when they said they had “long experience with Linux”, had “tried Ceph deeply” and were “masters of networking”. It turned out that wasn’t true.
So here are a few suggestions based on what went wrong in this project:

Real savings are in automation and no vendor lock-in. If you are looking for “something like VMware, but cheaper”, my suggestion is to either reconsider VMware or go to other virtualization projects like Proxmox or oVirt / Red Hat Enterprise Virtualization (RHEV). The real advantage of OpenStack is the extreme automation of your infrastructure and the freedom from any hardware/software vendor.
OpenStack requires care and attention. Don’t think of OpenStack as a “point and click solution”: it’s definitely not. The project is meant to be a full stack for building a cloud, like Amazon Web Services (AWS) on your premises, so you need to accept its complexity. Live (at the moment) with the fact that you need to upgrade every six months and that you need enterprise-level monitoring and operations.
Invest in people with good Linux skills. I can’t stress this enough. You can’t just get by with somebody who “has installed Ubuntu” or another distribution and pretends to be a Linux “super-hero”. You really need someone who knows the Linux system down to its roots, including the storage and network subsystems. Basically, you need to find a geek.
You need to have a dedicated team. This is somewhat linked to the points above, but managers or company owners most of the time think that they can “survive” with the existing people. An OpenStack project, however, requires people with focus. Especially at the beginning, you will need to tune a lot of parameters according to your needs. Three people on the team is the bare minimum, but consider that a mid-sized ISP usually has around 10 team members to cover shifts and holidays.
Invest in tested/certified hardware. Hardware incompatibilities can be a nightmare: during this experience, I had a lot of hardware issues, like the CPU freezing due to an incompatibility with the motherboard or NVMe faults due to a cheap PCI adapter. I wasted a lot of days (and nights) proving it was a hardware issue. If you need to save money, get hardware with less performance, but reliable and rock-solid.
Get the right storage for you. OpenStack can use a variety of block storage backends for the virtual machines. If you are attracted to Ceph because of the savings, you need to know that, according to my calculations, you need at least a few terabytes before it becomes cheap for you.
Ceph is like a ship: the bigger, the better. A cruise ship is far more stable than a dinghy, because its size keeps the vessel steady even when a thunderstorm hits it. Ceph follows exactly the same concept. If the cluster is small and has just a bunch of disks, then you don’t get much performance and it’s prone to losing quorum or, worse, data. Bigger clusters deliver far more performance and stability, and can recover better from any error that may occur.
Do an extensive PoC/test phase. Proofs of Concept and test phases should be taken very seriously: use this phase to get acquainted with the technology and go for a deep dive with an experienced consultant. Try to understand in the test phase whether you and, above all, your team are ready for OpenStack. The longer this phase is, the fewer surprises you will get in the pre-production and production stages.
If you’re going public, invest in network protection solutions. If you’re an ISP, you and your users will likely be a target of DDoS attacks. Use the appropriate techniques to protect your infrastructure. I know it’s basic stuff, but not everybody gets it.

Unfortunately, there’s no happy ending in this story. Everything that could go wrong went wrong. To summarize, the cluster had multiple hardware failures, partly due to an unplanned relocation of the equipment. It got worse when I discovered that nobody inside the company had sufficient Linux knowledge, even for some basic troubleshooting.
The situation was against all OpenStack best practices, therefore I (sadly) told the owner that I couldn’t be of any further help until they reshaped the company, and I suggested either going back to VMware or investigating other “point-and-click” solutions.
After six months, the OpenStack cluster was decommissioned and the hardware was reassigned to other customers.
OpenStack is a fantastic framework for building your own cloud services and is in use by a lot of customers in production. Now, if you’re thinking of having OpenStack on your premises, the question I have is: is OpenStack really for you?

2017-08-08