Taking steps to avoid data centre outages should be one of the highest priorities for data centre managers, whether they’re operating a private or colocation facility.

Data centres today have more tools than ever that help them protect their infrastructure and keep their systems up and running to deliver superior levels of server uptime.

A data centre is more likely to perform at its optimum level when it implements a set of high-quality working practices, complemented by a comprehensive support and maintenance regime.

For all the emphasis on the technical challenges of maintaining server uptime, a substantial body of research by the ‘Uptime Institute’ points to human error as one of the top causes of data centre downtime.

In fact, some of the most high-profile data centre outages experienced by major companies over the last few years can be traced back to either an accident or outright negligence.

One such outage at the content delivery network Fastly took down thousands of websites across multiple countries for nearly an hour, including big names such as Amazon, Twitter and Spotify.

Other failures can be measured by pure financial impact, such as the outage that affected over two million TSB customers and resulted in the bank paying over £370 million in compensation. Another was at British Airways, where an outage left 75,000 passengers stranded and the airline facing a compensation bill of £150 million.

It’s not all about human error, though. Other factors, such as cooling and cabling, play their part.

Cabling remains the foundation for many high-performance and high-functioning data centres, so a failure in the cabling system puts the entire data centre at risk.

Regular checks on the cabling are needed to help ensure that cables are not too tightly packed together and that there are no sharp bends in them. In addition, poorly constructed cable that performs badly or suffers from near-end crosstalk can greatly impact a data centre’s performance.

Overheating can cause a data centre to fail too. When equipment gets too hot, it shuts down to protect itself, creating downtime. Overheating occurs for several reasons, such as when not enough cold air reaches the cold aisle in a cold-aisle containment system.

Another example is when there’s a lack of airflow throughout the cabinets or the cooling system redundancy is lost.

To minimise the chances of a cooling failure, periodic checks on cooling equipment should be carried out to ensure that everything is operating to the standard expected. We recommend investing in an environmental monitoring system that will alert you as soon as temperatures begin to drift towards unsafe levels.
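As a rough illustration of the kind of alerting an environmental monitoring system performs, here is a minimal Python sketch. It is not taken from any particular product: the `Reading` type, the `check_readings` function, the sensor name and both thresholds (a 27 °C warning level and a 0.5 °C per minute rate of rise) are assumptions chosen for the example, and real limits should come from your equipment specifications.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    sensor: str     # hypothetical sensor label, e.g. "cold-aisle-1"
    minute: int     # minutes since monitoring started
    celsius: float  # measured inlet temperature


def check_readings(readings, warn_at=27.0, max_rise_per_min=0.5):
    """Return alert messages for readings that are too hot or rising too fast.

    Both thresholds are illustrative assumptions, not recommended values.
    """
    alerts = []
    last_seen = {}  # sensor name -> previous Reading for that sensor
    for r in readings:
        # Absolute check: the temperature has reached the warning level.
        if r.celsius >= warn_at:
            alerts.append(f"{r.sensor}: {r.celsius:.1f} C at or above {warn_at:.1f} C")
        # Trend check: the temperature is climbing quickly, even if still in range.
        prev = last_seen.get(r.sensor)
        if prev is not None and r.minute > prev.minute:
            rate = (r.celsius - prev.celsius) / (r.minute - prev.minute)
            if rate > max_rise_per_min:
                alerts.append(f"{r.sensor}: rising {rate:.2f} C/min")
        last_seen[r.sensor] = r
    return alerts


if __name__ == "__main__":
    sample = [
        Reading("cold-aisle-1", 0, 22.0),
        Reading("cold-aisle-1", 5, 23.0),
        Reading("cold-aisle-1", 10, 26.5),
        Reading("cold-aisle-1", 15, 27.5),
    ]
    for message in check_readings(sample):
        print(message)
```

Checking the rate of rise as well as the absolute level means a warning can be raised while temperatures are still within range, which leaves more time to respond.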

The simplest oversight can result in some serious downtime that can be both difficult and costly to recover from, which is why data centres should regularly undertake a risk assessment to mitigate the chances of failure.

Some of these common mistakes include:

  • Activation of the emergency power-off (EPO) switch
  • Changing the temperature units from Fahrenheit to Celsius
  • Disconnecting power cords from equipment
  • Overloading a circuit
  • Not following standard protocol or procedures

With today’s powerful data centre infrastructure management (DCIM) tools, facilities can monitor the overall health of their own equipment as well as co-located assets. While it may not be possible to predict every failure, sophisticated algorithms can monitor equipment performance continually to anticipate when hardware is reaching the end of its lifecycle or is likely to break down.
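Commercial DCIM tools use far more sophisticated methods, but the underlying idea of watching a metric for gradual decline can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the function names `drift_ratio` and `needs_review`, the window sizes, the 20% drop threshold and the sample UPS battery runtime figures are hypothetical, not drawn from any real DCIM product.

```python
from statistics import mean


def drift_ratio(samples, baseline_n=5, recent_n=5):
    """Ratio of the recent average to the baseline average for one metric."""
    if len(samples) < baseline_n + recent_n:
        raise ValueError("not enough samples to compare baseline and recent windows")
    baseline = mean(samples[:baseline_n])  # earliest readings: how it behaved when healthy
    recent = mean(samples[-recent_n:])     # latest readings: how it behaves now
    if baseline == 0:
        raise ValueError("baseline average is zero")
    return recent / baseline


def needs_review(samples, max_drop=0.2, **windows):
    """True when the metric has fallen by more than max_drop from its baseline."""
    return drift_ratio(samples, **windows) < 1.0 - max_drop


if __name__ == "__main__":
    # Hypothetical UPS battery runtime (minutes) measured in periodic tests.
    runtime_minutes = [30, 30, 29, 30, 31, 28, 26, 24, 23, 22, 21]
    if needs_review(runtime_minutes):
        print("Battery runtime has degraded; schedule a replacement.")
```

A flag raised this way is only a prompt for an engineer to investigate. It gives the team time to plan a replacement rather than react to a failure.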

When these problems are identified, data centre personnel can plan to replace outdated equipment without having to take critical systems offline. With the right redundancies and backups in place, even the most unexpected failure can be managed without compromising network performance.

To minimise these and other risks, data centre managers should invest in training, engagement and documentation for their employees, whether they are refurbishing, moving, or designing and building a new data centre. Defining ownership and assigning specific tasks to teams also gives clarity of oversight across the data centre.

Only by implementing a set of high-quality working practices, supported by a comprehensive support and maintenance regime, can a data centre perform at its optimum level with limited risk of downtime. Remember: prevention is far better than a cure.

To discuss your facility, get in touch with our team of data centre experts.