Veeam Backup & Replication Monitoring Feature: Tape Drive Alerts

Hi,

Every day is a school day, and today I found out about something really cool that Veeam does, which I never knew about because “It Just Works”.

I had an alert generated from a customer system today that I had never seen before. Now I’ve seen plenty of alerts for different backup issues, whether they’re caused by networks, BSODs, disk space constraints, etc., but this completely new one surprised me.

A Tape Drive Alert, but of an unexpected variety

Image Description: “The voltage supply to the tape drive is outside the specified range...”

Warning: “TapeDrive alert: The voltage supply to the tape drive is outside the specified range.”

As I said above, I’d never seen this warning before! I didn’t know that Veeam was tracking such attributes of the tape drives it uses. So I set about looking up the root cause of the problem and busted out some “Google-Fu” to find out who else had hit these issues in the past, which led me to this page of Veeam documentation:

Tape Drive Alerts – Veeam Backup Guide for vSphere

Image Description: Overview of the Tape Drive Alerts page’s main table.

This page lists all of the alert codes, along with the severity of each issue, a description of the issue and what has caused it.

Now, some of these alerts you might expect to see, such as “Media Error”, “your data is at risk” or read/write warnings, but some of the warnings here really go above and beyond what Veeam NEEDS to look at to do its job. Examples:

  • Cooling fan failure
  • Voltage supply low/high / Power Consumption low/high
  • Humidity
  • Firmware
  • Redundant power supply failure

Ok cool, but why is this a big deal?

As technologies have evolved, we’ve added more abstraction layers and more isolation between tiers. That’s great from some perspectives but restrictive from others. These tiers are often unaware of faults or early warning signs above or below them, for example:

  • In the storage world, an operating system doesn’t know that a disk has failed in a RAID array if the array is managed by a RAID controller. As long as it can continue to read and write, it’s unaware of a problem with the underlying layer.
  • In the networking world, a device doesn’t know that a highly available route has lost one of its paths. The traffic still flows, so it is unaware.
  • In the computing world, a virtual machine doesn’t know that the physical host it resides upon has lost a redundant power supply. It is still running, so it is unaware.

Because of scenarios like these, we expect little interaction between our systems, and we must either aggregate this information ourselves or use dedicated reporting systems to do it for us. Some exceptions exist of course, but awareness of other layers is largely siloed. This is a missed opportunity, as awareness of each of these scenarios could provide benefits.

Now, Veeam has its own monitoring and reporting product, Veeam ONE, so surely this would be the perfect time to talk about that and how it will collect all of this information? Wrong!

Veeam Backup & Replication actually performs some key monitoring tasks itself, with these tape drive alerts being one of them. Realistically, Veeam only needs to concern itself with whether data is being read or written successfully, and normally we would expect this siloed approach to our infrastructure: as long as Veeam can write a backup to tape, its job is done, right?

Veeam goes above and beyond here. Sensor information provided by the tape drive is fed back into the job report, so if the tape drive is reporting that the humidity is too high, Veeam will tell you! Loss of power redundancy? Veeam will tell you that too! And Veeam is going to put that information straight into the output of the tape jobs impacted by it, so you can immediately see what percentage of your tape jobs are at risk!
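If you copy or export the messages from your tape job reports, a few lines of Python can give you a rough version of that percentage. Treat this as a minimal sketch rather than anything Veeam provides: the `job_messages` dictionary, the job names and the `summarize_tape_alerts` function are all made up for illustration, and the only thing taken from the real product is the “TapeDrive alert:” prefix seen in the warning earlier in this post.

```python
from collections import Counter

# Prefix as it appeared in the warning shown earlier in this post.
ALERT_PREFIX = "TapeDrive alert:"


def summarize_tape_alerts(job_messages):
    """Return (percent_of_jobs_affected, Counter of alert texts).

    job_messages is a hypothetical structure: a dict mapping a job name
    to the list of message lines copied from that job's report.
    """
    alert_counts = Counter()
    affected_jobs = set()
    for job_name, messages in job_messages.items():
        for message in messages:
            if ALERT_PREFIX in message:
                # Keep only the text that follows the prefix.
                text = message.split(ALERT_PREFIX, 1)[1].strip()
                alert_counts[text] += 1
                affected_jobs.add(job_name)
    total = len(job_messages)
    percent = 100.0 * len(affected_jobs) / total if total else 0.0
    return percent, alert_counts


# Example data, invented for illustration only.
sample = {
    "Tape Job A": [
        "Warning: TapeDrive alert: The voltage supply to the tape drive "
        "is outside the specified range.",
    ],
    "Tape Job B": ["Job finished successfully."],
    "Tape Job C": ["Job finished successfully."],
    "Tape Job D": ["Job finished successfully."],
}

percent, counts = summarize_tape_alerts(sample)
print(f"{percent:.0f}% of tape jobs reported a drive alert")
for text, count in counts.most_common():
    print(f"{count} x {text}")
```

Counting jobs rather than individual messages is deliberate here: a single drive fault can show up in many lines, but what matters for scoping the risk is how many jobs touched the affected drive.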

Conclusion

Credit needs to be given to Veeam for including this in their base product and not hiding it away behind an additional monitoring tool. Using these sensors to detect potential issues allows customers to proactively replace failing components and protect their data availability. I’ve gone through the help center version history and can see this alert table was first added in version 9.5u4, so the functionality has existed for some time.

Image By: Dan-Cristian Pădureț

By micoolpaul

Technical Consultant at Nexus Open Systems. Focusing on Veeam, VMware & Microsoft Productivity and Infrastructure stacks.
