10 Quick Checks for Easily Preventable Server Failures
Some checks are easy, some are hard. All sysadmins have to prioritize how much time they are going to invest in monitoring their systems. Often it is not feasible to test the restorability of a backup each week. There are, however, some things which are so quick & easy to check and prevent issues with such a fatal impact that they are well worth your time.
Create a list with these checks so that you can frequently go through it and make sure everything is running smoothly. You can use this list as a template in Checkpanel and make sure that you never again forget one of these basic checks.
Disk Space Running Out
Is there enough free disk space available on the server or is it at risk of running out? You should always make sure that there is a comfortable safety margin.
One of the worst things that can happen to your server is for it to run out of disk space. Without disk space your server might become unresponsive, simply crash or anything in between. Uploads from a CMS might fail because the uploaded files cannot be stored. FTP uploads fail. Incoming emails cannot be stored. Cron jobs fail because they cannot store the results of their processes. The exact failure modes vary significantly depending on the use case, but none of them is good.
What makes running out of disk space so dreadful is that often the server will not fully crash right away. Instead it will continue working but might fail in subtle, not immediately obvious ways. In the worst case this can lead to silent data corruption.
For added horror, the resulting errors might not even be logged because there is no disk space to store the logs...
Fortunately, such disasters can be easily prevented by frequently checking the available disk space. This should take no more than 10 seconds.
On Linux, you can use the command 'df' to list the available space on all partitions. On Windows, just use the Explorer.
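If you would rather script this check than run it by hand, the following is a minimal sketch using Python's standard library. The function name low_disk_space and the 15% safety margin are assumptions chosen for illustration, not a recommendation from this article; pick a margin that fits your server.

```python
import shutil


def low_disk_space(path="/", min_free_fraction=0.15):
    """Return (is_low, free_fraction) for the file system containing `path`.

    `min_free_fraction` is a hypothetical safety margin (15% by default).
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    free_fraction = usage.free / usage.total
    return free_fraction < min_free_fraction, free_fraction


if __name__ == "__main__":
    is_low, free = low_disk_space("/")
    warning = " - WARNING: below safety margin" if is_low else ""
    print(f"/ has {free:.0%} free{warning}")
```

Call the function once per mount point you care about, since each partition can fill up independently.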
Partial RAID failure
Verify that all disks in your RAID array are fully functional. On Linux, you can use mdadm to detect failures. On Windows, you can refer to Disk Management.
Most servers protect themselves against disk failures by using redundant disks in a RAID array. If one disk fails the data is still secure on the other disk(s). Obviously, this requires at least two working disks.
Unfortunately, not every system will notify you automatically when a drive in the array has failed. So it might be that a drive has silently failed and the array has only one working drive remaining. Obviously, with only one drive there is no redundancy and you are just one failure away from a fatal storage outage.
If possible you should set up your system to alert you automatically if a drive fails. If not, the state of a RAID array can still be checked manually in seconds.
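For Linux software RAID (md), one way to do this is to look at /proc/mdstat, where each array reports a status such as [UU] (all members up) or [U_] (one member missing). The sketch below parses that text; the function name degraded_arrays is made up for this example, and it assumes the usual mdstat layout, so treat it as a starting point rather than a complete monitor. Hardware RAID controllers need their vendor's tools instead.

```python
import re


def degraded_arrays(mdstat_text):
    """Return the names of md arrays whose status shows a missing member.

    Expects the text of /proc/mdstat. A status like [UU] means all members
    are up; an underscore, as in [U_], marks a failed or missing member.
    """
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\S*) : ", line)
        if header:
            current = header.group(1)  # e.g. "md0"
            continue
        status = re.search(r"\[([U_]+)\]", line)
        if current and status and "_" in status.group(1):
            degraded.append(current)
    return degraded
```

On a live system you could call it as degraded_arrays(open("/proc/mdstat").read()) and alert if the returned list is not empty.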
Pending security updates
Check for new security updates and apply them.
This is probably preaching to the choir, but it is important to always install security updates as soon as possible. New vulnerabilities are often exploited in a matter of hours after becoming public, and sometimes even before becoming public. Exploitation is automated and hackers have databases of vulnerable systems. Time is of the essence.
The cost of a missed security update can be disastrous. At worst, hackers might extract customer data and you not only have to deal with cleaning and restoring your system but also with the PR nightmare that follows.
Fortunately, installing the latest security updates for your operating system is usually a matter of minutes.
In principle, the same applies to security updates for installed applications. Depending on the application this might take a little longer, though.
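As a rough illustration for Debian or Ubuntu systems (an assumption; the article does not name a distribution), pending security updates can be spotted in the output of a simulated upgrade such as "apt-get -s upgrade". Packages that would be installed appear on lines starting with "Inst", and security updates typically mention a "-security" archive. The function name and the sample lines in the comments are illustrative only.

```python
def pending_security_updates(simulated_output):
    """Return package names from `apt-get -s upgrade` output that look like
    security updates.

    Assumes Debian/Ubuntu-style lines such as:
      Inst libssl3 [old-version] (new-version Ubuntu:22.04/jammy-security [amd64])
    """
    packages = []
    for line in simulated_output.splitlines():
        parts = line.split()
        # "Inst <package> ..." marks a package that would be upgraded.
        if len(parts) >= 2 and parts[0] == "Inst" and "security" in line.lower():
            packages.append(parts[1])
    return packages
```

Feeding it the captured output of the simulated upgrade gives a list you can review before applying the updates with your usual package manager workflow.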
SSL/TLS Certificate Expiration
Open your site in a browser via https and check if the browser shows any problems. Also open the certificate details and make sure that it will not expire soon (e.g. not within the next month).
If you are running a web server with enabled SSL/TLS encryption (https) you need to make sure that the required certificates are always valid. Certificates expire after some time and need to be renewed before the expiration date. Your server must never use an expired certificate. Using an expired certificate would show a blocking error message in browsers, effectively shutting your site down.
Most certificate sellers will send frequent reminders once a certificate nears its expiration date. It is still best to check independently and regularly that the currently deployed certificate has enough time left before it expires.
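An independent check like this can also be scripted. The sketch below uses Python's standard ssl module to read the deployed certificate's expiry date and compute the days remaining. The function names are invented for this example, and the hostname you pass in is up to you.

```python
import socket
import ssl
import time


def days_left(not_after, now=None):
    """Days until `not_after`, a date string in the format returned by
    getpeercert(), e.g. "Jan  5 09:34:43 2018 GMT"."""
    expires = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expires - now) / 86400


def fetch_not_after(hostname, port=443, timeout=10):
    """Connect to the server and return its certificate's notAfter field."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()["notAfter"]
```

For example, days_left(fetch_not_after("www.example.com")) returns the remaining days for that host. Note that with the default context the connection itself raises an error if the certificate has already expired, which is a useful signal too.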
File system backups failing
Make sure the backup job is running correctly. Manually check if all recent backups are actually where they are expected to be.
This is a rather obvious one, but should nevertheless not be neglected. You frequently need to make sure that your automatic backups of the server file system are still being created.
Once set up these backup jobs are usually pretty robust. There can still be some unforeseen edge cases which introduce problems (e.g. a full disk – see above). Also, there's always Murphy's Law.
It is also important to frequently check if the backups are actually restorable. Unfortunately, this usually cannot be checked quickly, so it is out of the scope of this article. There is an item for this on our server maintenance checklist template though.
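Checking that recent backups are where they are expected to be can be partly automated. The following sketch reports the age of the newest file in a backup directory; the function name, the directory and the file pattern are placeholders, since the article does not describe a particular backup layout.

```python
import time
from pathlib import Path


def newest_backup_age_hours(backup_dir, pattern="*"):
    """Return the age in hours of the newest matching file in `backup_dir`,
    or None if no file matches."""
    files = [p for p in Path(backup_dir).glob(pattern) if p.is_file()]
    if not files:
        return None
    newest_mtime = max(p.stat().st_mtime for p in files)
    return (time.time() - newest_mtime) / 3600
```

With a nightly backup job, a result of None or an age well above 24 hours suggests the job has stopped producing files. This only shows that a file exists, not that it can be restored.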
SQL database not backed up to file system
Check that the databases are frequently dumped to the file system. Make sure the timing is in sync with the file system backup so that the dumps are included in it.
For many servers, a file system backup won't be enough. Many systems hold data in memory that is not covered by file system backups and needs to be dumped to disk separately. Most commonly these will be database servers like MySQL, PostgreSQL or MS SQL. Although databases are not only stored in memory but also on disk, you cannot be sure that the data on disk is complete at all times. If you try to restore a backup of the data files taken from an active (not cleanly shut down) database server, you might only be able to restore corrupt data or none at all.
You need to make sure that all data from your databases is cleanly dumped onto the file system before the file system backup runs so that it will be included in that backup. Most commonly this will be done with a tool like mysqldump (for MySQL servers).
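A dump step of this kind might look like the sketch below, which wraps mysqldump in a small Python function. The function names and the dump path are hypothetical, and the example assumes that credentials come from a MySQL option file rather than the command line. The --single-transaction option gives a consistent snapshot for InnoDB tables only; other storage engines need a different approach.

```python
import subprocess


def mysqldump_command(dump_path):
    """Build a mysqldump command line that dumps all databases to `dump_path`."""
    return [
        "mysqldump",
        "--all-databases",
        "--single-transaction",  # consistent snapshot for InnoDB tables
        f"--result-file={dump_path}",
    ]


def dump_databases(dump_path):
    """Run the dump and raise CalledProcessError if mysqldump fails."""
    subprocess.run(mysqldump_command(dump_path), check=True)
```

Schedule the dump so that it finishes before the file system backup starts, and check the dump file's size and timestamp as part of this item.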
Remote backup transfer failing
Connect to your remote backup space and check if the most recent backups are there.
Usually backups are not only stored on the local machine but also copied to remote storage. If this transfer fails, anything that takes out your local storage will also wipe out your backups. So you should not only check if your file system backups and database backups are created, you should also make sure that the transfer to the remote location succeeds.
SQL replication out of sync
Check that the slave servers are connected and in sync. If you have an automated monitoring / alerting process for replication, make sure that it is still operational.
If you have set up SQL replication, one of the worst things that can happen is for one of the replication slaves to silently lose synchronization with the master server. Ideally, you should have an automated monitoring process in place that will alert you immediately should any of the slaves fall out of sync. Still, it pays to frequently double-check that everything is running smoothly.
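For MySQL, the fields to look at are in the output of SHOW SLAVE STATUS on each slave. The sketch below evaluates one such status row, passed in as a dictionary. How you fetch the row (client library, command-line client) is left open, and the function name and the 60-second lag threshold are assumptions for illustration. Newer MySQL versions use SHOW REPLICA STATUS with renamed fields, so adjust the keys accordingly.

```python
def replication_problems(status, max_lag_seconds=60):
    """Return a list of problems found in one SHOW SLAVE STATUS row.

    `status` maps column names to values, e.g.
    {"Slave_IO_Running": "Yes", "Slave_SQL_Running": "Yes",
     "Seconds_Behind_Master": 0}
    """
    problems = []
    if status.get("Slave_IO_Running") != "Yes":
        problems.append("I/O thread is not running")
    if status.get("Slave_SQL_Running") != "Yes":
        problems.append("SQL thread is not running")
    lag = status.get("Seconds_Behind_Master")
    if lag is None:
        # MySQL reports NULL when the lag cannot be determined.
        problems.append("replication lag is unknown")
    elif int(lag) > max_lag_seconds:
        problems.append(f"slave is {int(lag)} seconds behind")
    return problems
```

An empty list means both replication threads are running and the reported lag is within the threshold. It does not prove that the data itself is identical on master and slave.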
Domain Name Expiration
Check that there is still plenty of time left until your current registration period expires. When the expiration date approaches, double-check that the renewal of the domain is in progress.
Fortunately, almost all domain registrars renew domain registrations automatically without the need for manual intervention. Still, every once in a while some unfortunate chain of events leads to an unexpected domain expiration. The most prominent victim in recent times has been Google, but Microsoft has also had embarrassing experiences.
The risk of this happening to you is low, but there are few scenarios which can cause more damage than a lost domain name. After all, it is not possible to make backups of a domain name. So to be on the safe side, it is best to pay a little attention around the expiration time of your domain(s) and make sure that everything is running according to plan.
Where to go from here
Now that you know what to look out for, make sure that you never forget to check these things again. You can use this list as a template in Checkpanel and let yourself be reminded frequently to check it. Give it a try, it's free!
If you are looking for more checks that might require a little more effort to execute, go on to our complete server maintenance checklist.
Questions? Suggestions? Feedback? We're always happy to help and eager for your unfiltered opinion. Just drop us a line or leave a comment below!