19 May 2024

ORACLE: BACKUP FAILURE

Oracle Recovery Manager (RMAN) is the standard backup and recovery tool for Oracle databases and the one most commonly used.

Backups usually comprise at least two elements. The first is a point-in-time copy of primary data taken on a repeated cycle (daily, weekly, or monthly). The second is a set of backups of the database structures, data files, or logs.

If primary data storage is lost or becomes unusable, the data can be restored from the full backup and then brought up-to-date to the point where access was lost by applying the most recent archive log or control file backups.

Backups should execute as quickly as possible to avoid clashes where one backup has not yet completed and another is scheduled to start. Also, the backup process is highly I/O intensive and will affect normal database server response times while running.

One option is to create a snapshot, a point-in-time copy of a system, application, database, or file system. In Oracle, snapshots are typically associated with two main functionalities: materialized views and Oracle’s flashback technology. A materialized view is a database object containing a query’s results. Oracle provides several flashback features that allow users to view, query, and recover past states of the database.

Our DBA experts recommend backups instead of snapshots to guarantee full recovery in case of need. Three different types of backup for Oracle databases should be used in combination.

A full database backup provides a complete copy of the database at a single point in time, to which the database can subsequently be restored. It contains the entire database, including data files and archived redo logs. It allows comprehensive recovery in the event of data loss, corruption, or disaster. You must have it in order to restore to a specific point in time.

A CTL (Control File) RMAN backup contains data about the database structure, settings, file locations, and configuration.

Archive Log backups are performed in a sequence, with each link capturing changes since the prior backup. They are essential for maintaining data consistency and ensuring the recoverability of important organizational databases.
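To illustrate how the backup types combine during a point-in-time recovery, here is a minimal Python sketch. It is a simplification under stated assumptions: each backup is represented only by its completion time, and the function name `restore_plan` is hypothetical. In practice RMAN selects the backups itself; the sketch only shows the reasoning.

```python
from datetime import datetime


def restore_plan(full_backups, archive_log_backups, target):
    """Pick the full backup to restore and the archive log backups to apply.

    Assumption: every backup is identified by its completion time, and an
    archive log backup covers all changes since the previous backup.
    """
    usable = [t for t in full_backups if t <= target]
    if not usable:
        raise ValueError("no full backup exists at or before the target time")
    base = max(usable)  # most recent full backup not later than the target

    logs = []
    for finished in sorted(t for t in archive_log_backups if t > base):
        logs.append(finished)
        if finished >= target:
            break  # this log backup already covers the target time
    return base, logs


if __name__ == "__main__":
    fulls = [datetime(2024, 5, 18), datetime(2024, 5, 19)]
    logs = [datetime(2024, 5, 19, hour) for hour in (1, 2, 3, 4)]
    base, to_apply = restore_plan(fulls, logs, datetime(2024, 5, 19, 2, 30))
    print("restore:", base)
    print("apply:", to_apply)
```

If a link in the archive log chain is missing, the plan cannot reach the target time, which is why the chain needs to be unbroken.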

Restoring data from a backup copy may be required in the following scenarios:

Logical corruption: Data can become corrupted through application software bugs, storage software errors, and hardware failures such as a server crash.

Human error: An administrator may delete a file or directory, a user could erase a set of emails or even records from an application, etc.

Hardware failure: Failure scenarios can include hard disk drive (HDD) or flash drive failure (multiple failures can cause data loss even when RAID is used), server failure, or storage array failure.

Find out how you can save hours of work when investigating the root cause of backup failures

Symptoms:

The backup failed or ran slow.

Impact: Critical

When normal operations are interrupted and data needs to be restored, the older the last backup is, the longer and more complicated the recovery will be.

Expected behavior:

Oracle Full backup and CTL backups should execute successfully at least once daily.

Archive log backups should execute regularly in the intervening periods, every 15 minutes to 1 hour, which is often enough to minimize the time required when a restore is needed.
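The expected behavior can be expressed as a simple freshness check. The sketch below is illustrative only: the thresholds come from the text above (one day for full and CTL backups, one hour as the upper bound for archive logs), while the function name, the dictionary keys, and the timestamps are hypothetical.

```python
from datetime import datetime, timedelta

# Maximum acceptable age of the last successful backup, per backup type.
MAX_AGE = {
    "full": timedelta(days=1),
    "control_file": timedelta(days=1),
    "archive_log": timedelta(hours=1),
}


def overdue_backups(last_success, now):
    """Return the backup types whose last successful run is missing or too old."""
    overdue = []
    for kind, max_age in MAX_AGE.items():
        finished = last_success.get(kind)
        if finished is None or now - finished > max_age:
            overdue.append(kind)
    return overdue


if __name__ == "__main__":
    now = datetime(2024, 5, 19, 12, 0)
    last_success = {
        "full": datetime(2024, 5, 18, 23, 0),
        "control_file": datetime(2024, 5, 18, 23, 5),
        "archive_log": datetime(2024, 5, 19, 10, 30),
    }
    print(overdue_backups(last_success, now))
```

Where the timestamps come from (the RMAN catalog, job logs, or a monitoring tool) is left open here.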

Possible causes of backup failure

1- Hardware failure Priority: Critical

Most of the causes and failure conditions for backup hardware are the same as for other kinds of hardware. The failure might be related to inaccessible storage, incorrect settings, etc.

Problem identification:

Look for error messages or slow backup performance. Identifying it requires a complete understanding of the backup system and its components.

Hands-on approach

Look for error messages. The backup software or the operating system can generate error messages. This might be a time-consuming and frustrating task.
Look for incomplete backups. These might occur due to malfunctioning hardware components, and even for an expert DBA, finding out about them might be challenging.
If the backup process runs slower than usual, it could be a sign that the hardware components are not functioning optimally. An expert DBA should perform this check.
Look for a failure in backup verification. It could be a sign that the hardware components are not functioning well. This is also a task for an expert DBA.
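Searching the backup output for error messages by hand can be sped up with a small script. The following Python sketch scans log text for the `RMAN-` and `ORA-` prefixed codes that RMAN and the database use when reporting errors. The function name and the sample log excerpt are illustrative assumptions, not output from a real system.

```python
import re

# Errors are reported as a prefix plus a five-digit code, e.g. RMAN-03009.
ERROR_PATTERN = re.compile(r"\b(?:RMAN|ORA)-\d{5}\b")


def find_backup_errors(log_text):
    """Return (line_number, code, line) for every error code found in the log."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        for match in ERROR_PATTERN.finditer(line):
            hits.append((number, match.group(0), line.strip()))
    return hits


if __name__ == "__main__":
    # Illustrative excerpt only; the file name is hypothetical.
    sample = (
        "Starting backup\n"
        "RMAN-03009: failure of backup command on ORA_DISK_1 channel\n"
        'ORA-19502: write error on file "/backup/db_01.bkp"\n'
    )
    for number, code, line in find_backup_errors(sample):
        print(number, code, line)
```

The codes found this way still need interpreting, but they narrow down which part of the device chain to inspect first.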

Get the answer in just seconds!

AimBetter identifies and alerts about the specific hardware issue, whether disk, I/O, or CPU interrupts related, or a specific system error.

Recommended action:
Track the failure back through the device chain, from the source server (through the network if using remote devices) to the backup hardware. Repair/replace any faulty components or shift backups onto different resources.

2- Network failure Priority: High
Backing up over a network increases overall efficiency by reducing the number of backup devices. However, it also introduces another point of failure in the backup process.

Problem identification:

Look for error messages or slow backup performance while monitoring network performance.

Hands-on approach

Track the backup process. If it is running slower than usual, it could indicate network issues related to high latency or settings factors. An expert DBA might do this, and it might take time to follow up.
Look for timeouts. If the backup process times out or fails to complete within the allotted backup window, it could indicate network connectivity issues. This probably requires an expert DBA and tracking with profiling tools such as AWR (Automatic Workload Repository) or the OEM (Oracle Enterprise Manager) diagnostic tools.
Look for error messages; they might be related to network issues. This requires skilled personnel who know how to look for them. Common places to check include system logs, such as “/var/log/messages” on Linux or Event Viewer on Windows.
Look for incomplete backups. It might occur due to network connectivity issues.
For Linux environments, you can use diagnostic tools such as the “netstat” or “ss” commands, which can display network connections and statistics. In addition, the Linux system logs various network-related events and warnings, which can be useful for investigation.
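As a quick first check before digging into logs, you can test whether the backup target or database host accepts TCP connections at all and how long the connection takes. This is a minimal sketch using only the Python standard library. The function name and the host name in the usage comment are hypothetical, and port 1521 is only the usual default for the Oracle listener, so adjust it to your environment.

```python
import socket
import time


def tcp_connect_time(host, port, timeout=5.0):
    """Return the seconds needed to open a TCP connection, or None on failure.

    A refused connection, an unreachable host, and a timeout all return None.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None


# Example usage (hypothetical host name):
#   elapsed = tcp_connect_time("db-host.example.com", 1521)
#   print("unreachable" if elapsed is None else f"connected in {elapsed:.3f}s")
```

A successful connection does not prove the backup path is healthy, but a failure or a very slow connect points at the network rather than at RMAN.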

Get the answer in just seconds!

Our solution provides an easy way to identify the exact root cause of a problem. Each query connection has a summary of its wait resource, so it’s easy to track what is causing slow query performance. In addition, tracking errors with our solution is easy.

Recommended action:
Check and restore network connections on both the server and the backup device. Replace any failed components. If necessary, although not recommended, shift backups onto local hard-wired resources.

3- No available disk space Priority: High
Space on the drive where backups are stored has run out. A common cause is database growth: the backup now needs more space than is available. Another cause is creating a separate new backup file for every backup, so that multiple copies accumulate on the same backup drive.

Problem identification:

Identify the backup root directory, look for RMAN job failure messages in the error logs, run a full scan of the disk’s content, and look for errors caused by insufficient disk space.

Hands-on approach

Look for RMAN procedure messages. To do this, log in to the Oracle database using SQL*Plus or SQL Developer, locate the correct job name, and find its error messages. Usually, the error message mentions that a full disk caused the issue. This task might take time and requires a DBA who knows the backup routines and jobs.
Another way to find RMAN backup job failure messages is through the RMAN logs. By default, RMAN writes its output to standard output, but it can be redirected to a log file, so the location depends on how the backup job was set up. Database diagnostic logs, such as the alert log, are typically located in the diag directory under the Oracle base directory. System logs are also worth checking, such as “/var/log/messages” on Linux or Event Viewer on Windows.
Look for slow system performance. It might be caused by low disk space. It requires knowledge of routine performance and might be hard to track.
Run a full scan of the disk content using the appropriate tools to find out whether any file takes more space than usual. For Windows OS, you can also check manually in File Explorer. For Linux, run “df -h” to see used and free space per file system, “fdisk -l” to list all disk devices, or the “smartctl” command, which reports analysis and health information about disks.
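A free-space check can also be scripted so that it runs before each backup starts. The sketch below uses Python's standard `shutil.disk_usage`. The function name, the 10% safety margin, and the path in the usage comment are assumptions for illustration; the expected backup size would have to come from your own history of backup sizes.

```python
import shutil


def has_room_for_backup(backup_dir, expected_backup_bytes, safety_margin=0.10):
    """Return (enough_space, free_bytes) for the file system holding backup_dir.

    safety_margin adds headroom on top of the expected backup size.
    """
    usage = shutil.disk_usage(backup_dir)
    required = expected_backup_bytes * (1 + safety_margin)
    return usage.free >= required, usage.free


# Example usage (hypothetical path and size):
#   ok, free = has_room_for_backup("/backup", 200 * 1024**3)
#   if not ok:
#       print(f"only {free / 1024**3:.1f} GiB free, backup is likely to fail")
```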

Get the answer in just seconds!

You’ll be immediately notified if available disk space is low!

Simply check the backup root and error logs for more information about this issue.

Recommended action:
Monitor available drive space relative to database sizes. If historical versions are retained on the backup server, delete the older backup files. If necessary, add drive space.

4- RMAN backups are not configured. Priority: Low.

To perform RMAN backups, you must configure RMAN settings within the Oracle Database instance. This includes setting up RMAN parameters, specifying backup destinations, defining backup schedules, and managing backup policies.

This cause is relevant to Oracle database environments that currently have misconfigured RMAN backups; in this case, AimBetter will raise an alert about it.

Problem identification:

Check whether RMAN backups are configured.

Hands-on approach

Log in to the Oracle database with the required permissions using SQL Developer or SQL*Plus and check the RMAN configuration with a query. You can also verify the backup settings and history, if they exist.
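One possible form of such a check is sketched below in Python. It queries the `V$RMAN_CONFIGURATION` and `V$RMAN_BACKUP_JOB_DETAILS` views through any DB-API style cursor. How the cursor is obtained (for example, from the python-oracledb driver) is an assumption and is not shown, and the function name is hypothetical. Note that, as far as we know, `V$RMAN_CONFIGURATION` lists only settings that were explicitly changed from their defaults, so an empty result does not by itself prove that backups are missing; the job history is the stronger signal.

```python
# Persistent RMAN settings that were explicitly configured.
CONFIG_QUERY = "SELECT name, value FROM v$rman_configuration ORDER BY conf#"

# Most recent RMAN backup jobs, newest first.
HISTORY_QUERY = (
    "SELECT input_type, status, start_time, end_time "
    "FROM v$rman_backup_job_details ORDER BY start_time DESC"
)


def rman_overview(cursor, history_limit=10):
    """Collect RMAN settings and recent backup jobs through a DB-API cursor."""
    cursor.execute(CONFIG_QUERY)
    settings = cursor.fetchall()

    cursor.execute(HISTORY_QUERY)
    recent_jobs = cursor.fetchmany(history_limit)

    return {
        "settings": settings,
        "recent_jobs": recent_jobs,
        "has_backup_history": bool(recent_jobs),
    }
```

The account used needs permission to read these views, which typically means a DBA-level login.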

Get the answer in just seconds!

From the moment you start using AimBetter, you will be immediately notified about a database backup that doesn’t exist.

Recommended action:

You need to configure RMAN settings within the Oracle Database instance. This should be done by a DBA.

5- Inactive Oracle services. Priority: Low.

The Oracle services must run during the complete backup process. Both the Oracle Database service and the Oracle Listener service (which allows connections to the Oracle database) must be active.

You can check whether they are active by using SQL*Plus.

Problem identification:

Check if the Oracle services were activated during the backup procedure.

Hands-on approach

Check that both Oracle services are active: Database and Listener. For Windows OS, open the Services tab in Task Manager. For Linux, on systemd-based distributions, a command such as “systemctl is-active service_name” lets you check whether a service is active by its name. You can also check it by using SQL*Plus.
Check the system logs for when the service stopped running and whether it was at the same time as the backup procedure failed.
Check whether system updates are scheduled automatically. We would not recommend an automatic schedule for updates that might cause a server to reboot.
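On a systemd-based Linux server, the service check can be scripted as in the following sketch. It wraps `systemctl is-active`, which exits with status 0 and prints `active` when the unit is running. The unit names in the usage comment are hypothetical, since the actual names depend on how Oracle was installed on your server; the `run` parameter exists only so the function can be tried without systemd.

```python
import subprocess


def service_is_active(unit_name, run=subprocess.run):
    """Return True when `systemctl is-active <unit_name>` reports "active"."""
    result = run(
        ["systemctl", "is-active", unit_name],
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit code just means "not active"
    )
    return result.returncode == 0 and result.stdout.strip() == "active"


# Example usage (hypothetical unit names):
#   for unit in ("oracle-database.service", "oracle-listener.service"):
#       print(unit, "active" if service_is_active(unit) else "NOT active")
```

On a machine without systemd, the call raises `FileNotFoundError`, so this check applies only to distributions that use `systemctl`.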

Get the answer in just seconds!

AimBetter sends immediate notifications once there are changes on your server. You can then see when the service was stopped, whether it’s still stopped, and what happened during that time.

Recommended action:

If currently stopped, activate the Oracle Database and Listener services and cancel unnecessary system updates that might occur during the backup operation.

Possible causes of slow backup

1- Large size of backup Priority: High
In the case of a full backup, a common cause is database growth. In the case of an archive log backup, a long gap since the previous backup increases the size and the time to complete.

Problem identification

Look at the database data files’ sizes and check if the growth is rapid or the size is significantly higher than before.

Hands-on approach

Write a script that lists the data files in descending order of size, or check the Oracle database manually using Oracle SQL Developer. Otherwise, check the data drive directly if you know which one it is. This might be hard since you don’t have a history of sizes.
Check older backup files and compare whether they grew over the last few runs. Those checks might take time and require knowledge of the current backup and data components.
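Both checks can be approximated at the file-system level with a short script. The sketch below lists the largest files under a directory in descending order of size and computes the percentage growth between consecutive backup sizes. The function names and the path in the usage comment are hypothetical, and the backup sizes have to be supplied from your own records.

```python
from pathlib import Path


def largest_files(directory, limit=10):
    """Return (size_in_bytes, path) pairs, largest first."""
    sized = [
        (path.stat().st_size, str(path))
        for path in Path(directory).rglob("*")
        if path.is_file()
    ]
    return sorted(sized, reverse=True)[:limit]


def growth_percent(sizes):
    """Percentage change between consecutive backup sizes, oldest first."""
    changes = []
    for old, new in zip(sizes, sizes[1:]):
        if old > 0:  # skip empty backups to avoid dividing by zero
            changes.append(round((new - old) / old * 100, 1))
    return changes


# Example usage (hypothetical data directory):
#   for size, path in largest_files("/u01/oradata"):
#       print(f"{size / 1024**3:8.2f} GiB  {path}")
```

A sudden jump in `growth_percent` between two runs suggests that the slowdown follows data growth rather than a hardware or network problem.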

Get the answer in just seconds!

AimBetter provides an easy way of knowing when issues are happening in parallel.

It identifies data growth and raises an alert about it, which might coincide with a backup alert. With our solution, it’s easy to know the growth rate of a data file and where it’s located.

Recommended action:
This requires analysis and immediate action, preferably while the backup is still running, to obtain all available statistics and metrics.

2- Slow network Priority: High

A slow network can slow down the backup process and reduce the efficiency of backup operations.

Problem identification

Track network performance and check whether the backup operation slows down whenever the network is slow.

Hands-on approach

Track the backup process and the network performance in parallel. If the backup process is running slower than usual, it could indicate network issues related to high latency or settings factors. An expert DBA might do this, and it might be tiring to follow up since most backup processes happen at night.
Look for incomplete backups. They might occur due to network connectivity issues.
For both Windows and Linux systems, you can check whether the network is slower than usual.
For Windows, use a network monitoring tool to check how much traffic flows through your network and which applications or devices use the most bandwidth. Bear in mind that most tools only help to pinpoint when a problem starts and do not let you compare time frames. In addition, you can review network logs to look for unusual patterns. This can also take time.
For Linux, look at the system logs for network-related events and warnings, and check network statistics and connections. The “traceroute” command can help identify the route packets take to reach a destination and highlight the network hops along the way.
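Once you have collected latency measurements during a backup window (for example, from repeated ping or connection tests), a few summary numbers make it easier to compare one night with another. The sketch below is one simple way to do that. The function name is hypothetical, and jitter is defined here as the mean absolute difference between consecutive samples, which is an assumption rather than the only possible definition.

```python
from statistics import mean


def latency_summary(samples_ms):
    """Summarize latency samples (in milliseconds, in the order measured)."""
    if len(samples_ms) < 2:
        raise ValueError("at least two samples are needed")
    # Jitter: how much the latency changes from one sample to the next.
    diffs = [abs(b - a) for a, b in zip(samples_ms, samples_ms[1:])]
    return {
        "mean_ms": mean(samples_ms),
        "max_ms": max(samples_ms),
        "jitter_ms": mean(diffs),
    }


if __name__ == "__main__":
    print(latency_summary([10, 12, 11, 15]))
```

Comparing these figures for a normal night and a slow night shows whether the backup slowdown coincides with worse network conditions.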

Get the answer in just seconds!

With AimBetter, it’s easy to track network performance. You receive multiple alerts to determine if issues are occurring simultaneously.

Recommended action:
When backing up across the network, there can be all sorts of contention and bottlenecks. Review our posts on Network Latency and Network Jitter metrics.