TL;DR: Work forced me to deal with a server that was as far away from well-maintained as possible. Being angry about that got me to think about what a server, a system or a service even was & led me to try and find the smallest common denominators.
I recently had to bring an application back to life that had died. An application that I was neither familiar with nor responsible for. But challenges are fun, even when they are caused by non-paying customers getting nervous.
After consulting the scarce internal documentation and the somewhat less scarce external documentation, I decided that the most likely fix was simply restarting the application and hoping for the best - which requires a sudo systemctl restart $service, nothing more. That sounds easy enough.
Except for the fact that, as it turned out, it ended up taking nearly 40 minutes to be able to send that command to the server. To spare you the boring details and a lot of swearing I’ll summarize the events for you:
- As it turns out the server (which, by the way, is quite beefy) isn’t really maintained by anyone. Technically there is a service owner, but they see themselves as (somewhat) responsible for the application, not the machine underneath.
- This means that the server is not under configuration management. And the documentation doesn’t really include who has access.
- Despite assurances by the service owner that I should have an account on the machine that was decidedly not the case.
- The operations team then found out, by looking at an old script they had once used to access the machine, that there were only two users on the machine. One being the personal user of the person responsible for the application in question (.. you’ll never guess under which user the application was running for absolutely no reason at all), the other being root.
- Nobody really had an idea how to access root, until the operations team found a random password named $application in the last corner of their password manager.
- Luckily / sadly, depending on which professional hat I wear, that was the password that finally enabled me to log into the machine and to restart $service.
After my initial annoyance / anger had passed I realized that while I was displeased by the system not being what I considered “production ready”, I did not really have a concrete idea what that would mean, or what criteria it would be defined by - so I got to thinking. And the result of said thinking is what you can read from here on.
Disclaimer: This is highly opinionated and also talking about the issue with regards to a business environment. There can be, and most likely are, more relaxed criteria for personal systems.
For the sake of this post I define a ‘system’ as follows:
A system is the basic functional setup of hardware and software providing everything that is needed to implement computing performance for one or more application/s that run on top of it.
To be considered “ready for production” it has to fulfill the following criteria.
- The system has to have a unique hostname that is “backed” by a DNS record of type A and / or AAAA.
- Ideally the hostname is generic and hints at the purpose of the system, with numbers used to differentiate between different hosts that serve the same purpose (e.g. web01 and web02).
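A quick way to sanity-check the DNS part is asking the system resolver. A minimal sketch - the hostname is purely illustrative, and note that getent also consults /etc/hosts, so for a strict DNS-only answer you would reach for dig or drill instead:

```shell
#!/bin/sh
# Check that a hostname resolves to an A or AAAA record.
# getent ahosts queries the system resolver and covers both record types
# (it may also answer from /etc/hosts, hence "sanity check", not proof).
has_dns_record() {
    getent ahosts "$1" > /dev/null 2>&1
}

# "localhost" stands in for the host's real FQDN here
if has_dns_record "localhost"; then
    echo "localhost resolves"
else
    echo "localhost does NOT resolve"
fi
```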
- Network access vectors are defined and properly secured
- This includes at least one possibility for Out-of-Band-access. Depending on the type of system this could be an “external” management interface (e.g. HP iLO), the management console of the hypervisor (as, e.g., provided by ESXi) or tools that allow inspection of containers.
- The system is configured through the use of a configuration management system, not manually
- The configuration is to be checked into a version control system
- Secrets utilized by the system are properly secured in some form of digital vault or secret store
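As a small companion check for the “not manually configured” point, here is a sketch that assumes the configuration directory is tracked in git (the way etckeeper does for /etc) and reports any unmanaged drift - the path and tooling are assumptions, not a prescription:

```shell
#!/bin/sh
# Report whether a config directory matches what is checked into git.
# Any difference means someone changed the machine outside of
# configuration management.
check_drift() {
    dir=$1
    if ! git -C "$dir" rev-parse --is-inside-work-tree > /dev/null 2>&1; then
        echo "not under version control"
    elif [ -z "$(git -C "$dir" status --porcelain)" ]; then
        echo "clean"
    else
        echo "drift detected"
    fi
}

# /etc is the classic candidate when using etckeeper
check_drift /etc
```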
- The operating system that is running on the system must be supported by the
vendor, with at the very least security updates still being provided.
- The scope and schedule of updates can be adjusted in accordance with operational and / or business needs
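On Debian / Ubuntu this can be as simple as enabling unattended upgrades - one hedged example, with other distributions offering equivalents (dnf-automatic, zypper patches, ..):

```
# /etc/apt/apt.conf.d/20auto-upgrades — Debian / Ubuntu example
APT::Periodic::Update-Package-Lists "1";
APT::Periodic::Unattended-Upgrade "1";
```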
- The system has the correct time zone configured and synchronizes its clock against a reliable (set of) timeserver(s).
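For the time synchronization part, a minimal example assuming systemd-timesyncd (chrony or ntpd fill the same role), pointed at the public NTP pool - in a business environment you would more likely point at internal timeservers:

```
# /etc/systemd/timesyncd.conf — example, assuming systemd-timesyncd
[Time]
NTP=0.pool.ntp.org 1.pool.ntp.org 2.pool.ntp.org
FallbackNTP=3.pool.ntp.org
```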
- The system is covered by monitoring and alerting
- This includes both availability monitoring as well as collection and long-term storage of system information for the purpose of capacity planning
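To make the capacity planning point concrete, here is a toy sketch of the kind of check a monitoring system would run and alert on - the threshold and mount point are arbitrary illustrations, not recommendations:

```shell
#!/bin/sh
# Warn when a filesystem crosses a usage threshold (in percent).
check_disk() {
    mount_point=$1
    threshold=$2
    # df -P gives a stable, single-line-per-filesystem format;
    # column 5 is the usage percentage ("42%"), stripped of its "%"
    used=$(df -P "$mount_point" | awk 'NR==2 { sub(/%/, "", $5); print $5 }')
    if [ "$used" -ge "$threshold" ]; then
        echo "ALERT: $mount_point at ${used}% (threshold ${threshold}%)"
    else
        echo "OK: $mount_point at ${used}%"
    fi
}

check_disk / 90
```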
- All relevant aspects of the system are subject to remote logging
- Examples of relevant aspects include audit logs, command logging, access logs, ..
- Ideally the logs are fed into a log management system that makes accessing and analysing the logs trivial, but are at the very least stored securely
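With rsyslog, forwarding everything to a central host is a two-line affair - “loghost.example.com” is a placeholder for whatever receives your logs:

```
# /etc/rsyslog.d/50-forward.conf — forward all facilities to a central host
# @@ means TCP, a single @ would mean UDP
*.* @@loghost.example.com:514
```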
- All relevant data is backed up in accordance with the 3-2-1 rule
- This includes regular restore tests, both on a small scale and with regards to disaster recovery testing
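A small-scale restore test can be automated in a handful of lines. A sketch of the idea - back up, restore into a fresh location, verify the contents match - with all paths temporary and illustrative:

```shell
#!/bin/sh
set -eu
# Automated small-scale restore test.
src=$(mktemp -d)
dst=$(mktemp -d)
backup=$(mktemp)

echo "important data" > "$src/data.txt"

# "backup": archive the source directory
tar -C "$src" -czf "$backup" .

# "restore": unpack into a fresh location
tar -C "$dst" -xzf "$backup"

# verify: the restored tree must match the original
if diff -r "$src" "$dst" > /dev/null; then
    echo "restore test passed"
else
    echo "restore test FAILED"
fi
```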
- There has to be a standardized way for spare and replacement parts to be procured in case of a physical system.
- Ideally this is done through the original vendor as part of a maintenance contract, but if a proper process exists “buying stuff from eBay” can be tolerable under some circumstances.
- There has to be a system owner that is, or a group of people that are, responsible for maintaining all aspects of the system.
- If a group of people share the responsibility the group itself can consist of sub-groups that are responsible for different parts of the system (e.g. the network team is responsible for connectivity, the datacenter team for hardware replacements, ..)
- There has to be central documentation about the system. The documentation might be distributed (e.g. through version control), but there is only one source that is considered to be authoritative. At the very least the documentation needs to contain information about the following aspects:
- The technical details of the system (for example hardware details, networking details, .. / in case of virtual- or containerization: information about the resources allocated by the hypervisor)
- The physical location of the system (in case of virtual- or containerization: information about or the link to the documentation of the underlying hypervisor)
- Changes made to the system have to be cleared through some form of change management
- “4 eye principle” might be appropriate, but a formalized change management policy is preferred
- The system has to have an assigned production level in accordance with the different levels defined by the responsible owner (e.g. Production, Staging, Pre-Production, Testing, Development, ..)
- This can, but does not necessarily have to, include an availability level.
- This can, but does not necessarily have to, include the maintenance level.
- This can, but does not necessarily have to, include SLIs / SLOs/ SLAs.
- The system has to be covered by Disaster Recovery policies and Business Continuity plans, as well as an Incident Response policy
All of that describes a system, but what about a ‘service’? Interestingly I struggle a lot with defining proper criteria for the levels above the operating system. I came up with the following set of definitions / requirements, but I’d be more than happy to add things:
A service is everything that runs on top of the operating system / within a container, providing continuous functionality based on code and / or data, such as an application or a script.
On top of the criteria for systems that can be applied to services as well (especially with regards to documentation, ownership and coverage by different policies) I could only come up with these points:
- The service is deployed in a reproducible manner.
- The service is supplied with adequate resources by the underlying system, such as data storage, processing power and network bandwidth.
- Authentication is done only over secure connections and by utilizing secure authentication mechanisms that adhere to current best practices.
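For SSH-based access specifically, one common reading of “secure authentication mechanisms” is key-only logins - a hedged excerpt, not a complete hardening guide:

```
# /etc/ssh/sshd_config (excerpt) — disable password logins in favour of keys
PasswordAuthentication no
PubkeyAuthentication yes
PermitRootLogin prohibit-password
```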
And then there’s another stage that would be interesting to me, the code-level of the application. Unfortunately I do not have any professional experience with software development, so I neither can nor should comment on this. If any of you reading this would be willing to chime in with their opinion on criteria for what constitutes code that is ready for production - I’d be delighted!
The same goes for corrections or additions to the list above. I tried to be careful and deliberate, including everything that I can reasonably think of and putting it into coherent, meaningful writing. But I’m well aware that there are aspects that I might not have thought about, so please let me know if there’s something you think I missed.