9 Jan 2020

BACKUP FAILED OR RAN OUT OF TIME

Backups usually comprise at least two elements. In the first place, a point-in-time copy of primary data taken on a repeated cycle (daily, monthly or weekly). This is followed by backups of all subsequent transactions (differential backup) or of the logs(transactional backup).
In the event that primary data storage is lost or becomes unusable, the data can be restored from the full backup and then brought up-to-date to the point where access was lost, by applying either the most recent differential backup or transaction log backup (or both if available).

Backup should execute as quickly as possible to avoid clashes where one backup has failed to complete, and another is scheduled to start. As well, the backup process is highly I/O intensive and will affect regular SQL server response times while running.
There are three different types of backup for SQL databases, which should be used in combination.

One option of backup is doing a snapshot which is a point-in-time copy of a system, application, database, or file system. This snapshot captures the entire state of the system at a specific moment, including data, configuration settings, and databases.

Our DBA experts recommend doing SQL backups. There are three different types of backup for SQL databases, which should be used in combination:

Full database backup – provides a complete copy of the database at a single point in time, to which the database can subsequently be restored.

Differential backup performs the same operations as a full backup but only captures the data that has changed or been added since the previous full backup. It is cumulative, and successive differential backups after a full backup will include all the data stored in the previous differential backup plus subsequent changes. It, therefore, will increase in size as more data is changed or added until the next full backup.

Transaction Log backups are performed in a sequence, with each link capturing changes since the prior transaction log backup. Any subsequent full or differential backups will not break the log chain and the next transaction log backup will be from the last transaction log backup and not the last full backup.

Restoring data to a backup copy may be required in the following scenarios:

Logical corruption: – Data can become corrupted through application software bugs, storage software errors, and hardware failures such as a server crash.
Human error: – An administrator may delete a file or directory, a user could erase a set of emails or even records from an application, etc.
Hardware failure: – Failure scenarios can include hard disk drive (HDD) or flash drive failure (multiple failures can cause data loss even when RAID is used), server failure, or storage array failure.
Catastrophic Hardware loss: – Possibly the worst scenario is an event such as fire that renders hardware inoperable and permanently unrecoverable.

Find out how you can save hours of work when investigating the root cause of this issue.

Symptoms:

Backup failed or ran slow

Impact: Critical

Where normal operations are interrupted and data needs to be restored, the further back in time the last backup was done, the more complicated and longer the recovery will be.

Expected behavior:

SQL Full backup should execute successfully at least once per day.

Transactional log and differential backups should be executed regularly in intervening periods and often enough to minimize the time needed in case restoration is needed.

For log backups, Our experts recommend doing them every 15 minutes.

Possible causes of failure

1- Hardware failure Priority: Critical
Most of the causes and failure conditions for backup hardware are the same as for other kinds of hardware.

Problem identification:

Look for error messages or slow backup performance. Identifying it requires a complete understanding of the backup system and its components.

Hands-on approach

Get the answer in just seconds!

Hands-on approach

Look for error messages. Error messages can be generated by the backup software or the operating system. It might be challenging and take time.
Look for incomplete backups, it might occur due to malfunction of hardware components. Even for an expert DBA, finding out about it might be hard.
If the backup process is running slower than usual, it could be a sign that the hardware components are not functioning optimally. Probably this check must be done with an expert DBA.
Look for a failure in backup verification, it could be a sign that the hardware components are not functioning well. Probably this check must be done with an expert DBA.

Get the answer in just seconds!

Our solution immediately sends alerts when a specific hardware issue is identified.

It might be disk I/O related, or CPU interrupts related, or a specific system error.

Recommended action :
Track the failure back through the device chain, starting from the source server (through the network if using remote devices) to the backup hardware. Repair/replace any faulty components or shift backups onto different resources.

2- Network failure Priority: High
Backing up over a network increases overall efficiency by reducing the number of backup devices. However, it also introduces another point of failure in the backup process.

Problem identification:

Look for error messages or slow backup performance while monitoring network performance.

Hands-on approach

Get the answer in just seconds!

Hands-on approach

Track the backup process. If the backup process is running slower than usual, it could be an indication of network issues related to high latency or settings factors. This might be done by an expert DBA and might take time to follow up.
Look for timeouts. If the backup process times out or fails to complete within the allotted backup window, it could be an indication of network connectivity issues. This probably requires an expert DBA and tracking with profiling tools such as SQL server profiler.
Look for error messages, it might be related to network issues. This required skilled personnel that knows how to look for it.
Look for incomplete backups. It might occur due to network connectivity issues.

Get the answer in just seconds!

In our solution, each query connection has a summary of the wait resource; therefore, it’s easy to track what caused slow query performance.

In addition, tracking errors with our solution is easy to do so you can check in minutes the error types impacting the network.

Recommended action :
Check and restore network connections on both the server and backup device. Replace any failed components. If necessary, yet unrecommended, shift backups onto local hard-wired resources.

3- No available disk space Priority: High
Space on the drive where backups are stored has run out. A common cause is that the database grew. Because of this growth, it needs more space for the backup than is available. Another cause is choosing to create a separate new backup file for every backup so that multiple copies reside on the same backup drive.

Problem identification:

Identify the backups’ root paths, look for the SQL job failure message, run a full scan of the disk’s content, and look for errors caused because of not enough disk space.

Hands-on approach

Get the answer in just seconds!

Hands-on approach

Look for job failure messages. For that, you have to enter the SQL Server Management Studio, navigate to SQL Server Agent, locate the correct job name, and find it and its error messages. Usually, the error message will mention that the full disk has caused the issue. This task might take time, requiring a DBA that knows the backup routines and jobs.
Look for slow system performance, it might be caused because of low disk space. It requires knowledge of routine performance and might be hard to track.
Run a full scan of the disk content using the correct programs, and locate if there is a file that takes higher space than usual.

Get the answer in just seconds!

You’ll be immediately notified if there is a low disk available space.

Simply check the backups’ root paths and error logs in order to find out more information about this issue.

Recommended action :
Monitor available drive space, relative to database sizes. If historical versions are retained on the backup server, delete the older backup files. If necessary, add drive space.

4- Inactive agent service Priority: Low.

When backups are done with SQL Server through a SQL job, the SQL server agent service must be running during the complete backup process.

Problem identification:

Check if the SQL server service agent was activated during the backups procedure.

Hands-on approach

Get the answer in just seconds!

Hands-on approach

The next steps vary depending on the operating system you’re using. For Windows OS, enter Task Manager Services, and check if the SQL Server Agent service is running. For Linux, you can check the service status by using ‘systemctl’ command, which shows the current service status and short period history of its state. You can try using ‘journalctl’ command , for viewing logs. You can check it also on SQL Server Management Configuration.
For Windows OS, check in the event viewer when the service stopped running and if it was at the same time when the backups procedure failed. .For Linux, you can check logs using ‘journalctl’ command.
Look if windows updates are scheduled automatically, we would not recommend an automatic schedule for updates that might cause a server’s reboot.

Get the answer in just seconds!

Our solution immediately sends a notification once a service was stopped, if it’s still stopped, and what happened during that time.

Recommended action:

Activate the SQL Server Agent service, and cancel unnecessary system updates that might occur during the backup operation.

Possible causes of slow backup

1- Large size of backup Priority: High
In the case of a full backup, a common cause is database growth. In the case of transactional backup, a long gap since the previous full backup increases the size and time to complete.

Problem identification

Look for the data file size of the databases and check if the growth is rapid or if the size is significantly higher than before.

Hands-on approach

Get the answer in just seconds!

Hands-on approach

Write a script that shows the outcome of the descending size of data files or check each database manually using the SQL Server Management Studio. Otherwise, check it in the data drive, if you know what it is. it might be hard since you don’t have a history of sizes.
Check older backup files and compare whether the backup files grew for the last attempts. These checks might take time and require knowledge of the current backup and data components.

Get the answer in just seconds!

Our solution provides an easy way of knowing when issues are happening in parallel.

It identifies and alerts about data growth, which might happen in parallel with the backup alert.

With our solution, it’s also easy to know the growth rate of a data file and where it’s located.

Recommended action :
Requires analysis and immediate action – preferably while the backup is still running to get all available statistics and metrics.

2- Slow network Priority: High

Once the network is slow, it might affect the backup process speed. It might decrease the efficiency of the backup operations.

Problem identification

Track the network performance and check whether once it’s slow, then the backup operation is slower.

Hands-on approach

Get the answer in just seconds!

Hands-on approach

Track the backup process and the network performance in parallel. If the backup process is running slower than usual, it could be an indication of network issues related to high latency or settings factors. This might be done by an expert DBA and might be tiring to follow-up, up since most backup processes happen at night time.
Look for incomplete backups, it might occur due to network connectivity issues.

Get the answer in just seconds!

It’s easy to track the network performance with our solution.

You just receive an additional alert if issues are happening parallel.

Recommended action :
When backing up across the network, there can be all sorts of contentions and bottlenecks. Review our posts on Network Latency and Network Jitter metrics here.