SQL Server: Job complete does not always mean job success!

by Henry Kaye | July 14, 2018

Introduction

It may be comforting to look at your SQL Server log and see that an important job like Archive had completed successfully, but there are times when a process deep inside the engine failed.  This means the intended recovery function in turn failed to completely rollback and did not report this. In such a case, the outside function could report that it finished correctly.  There may be no overt sign of the problem until it manifested in some more serious condition later.

Such a scenario occurred in one company recently.  The Archiving procedure was designed to use the TRY – CATCH logic which generally acts as an error detector.  In this particular instance, an error was detected and the archive procedure was halted and a rollback was kicked-off. So far, so good.  However, the unexpected happened, and the rollback itself failed to complete.  This was not reported back to the calling archive process, so that it reported itself as having completed satisfactorily.

The result of this failure was a gradual decline in server response times. Initially this was interpreted as being due to steady increasing load, which could only be addressed by increasing resources such as hardware, memory and storage, and maybe a whole new infrastructure.

Before making such an expensive decision, the company asked AimBetter to investigate, using its advanced diagnostic solution together with the skills of the Ryltech Expert Support team. One immediate advantage that AimBetter has over other solutions is that a complete history of all monitored metrics over the previous 30 days can be examined so that a ‘real-time’ picture of what had happened is easily retrieved.  This means there was no need to reconstruct an artificial model to try an replicate the exact conditions that caused the actual problem.

The AimBetter way

AimBetter monitors hundreds of processes both inside the SQL engine and in the operating environment in real-time and will detect the occurrence of any condition where a metric returns a value that crosses a threshold.  Using the Query Analysis process in the first place, an examination of the screen presented below will show that AimBetter detected an issue with the ROLLBACK TRANSACTION process, which is flagged as an exception on our console.

However, the SQL Server log show no abnormal behavior, and reports the Archive process as having run successfully at the same time:

In discussion with the customer, they explained that in the Archiving process they use the TRY – CATCH clause to invoke the rollback on exception.  What is evident here is that the worst scenario has come up – the procedure failed internally and rollback of the archived rows back into the production table missed some data, even though the procedure reported a success.

The effect was bad performance, worsening slowly day by day simply because there was no warning. There was a time they thought to upgrade the server “due the data usage growth” since nothing pointed to a deeper cause.  Along came AimBetter, and in less than one day we found the reason, using the single Query Analysis screen seen above.  The logical fix we followed was simple:

  • AimBetter identified an exceptionally long Query – 30 minutes compared with 3 minutes on a prior run
  • We advised to KILL this query, and to fix the code that invoked it
  • The fix is a basic one that should be in all TRY – CATCH logic – to include a RAISEERROR condition inside the code.  This will trap any failure and ensure that the calling function is aware of the failure. As a result, the log will now report Failure instead of Success

Summary

In just a few steps, AimBetter was able to advise the customer of how the problem had arisen, and how to prevent re-occurrence.  In this way, a waste of substantial amount of money and time was avoided with a few simple steps.

AimBetter works with the most comprehensive range of database metrics and does extensive analysis. Our advances dashboard presents simple visual output that presents the complete picture and enables logical steps to be followed for detection and solution of any problems.  Even without extensive training in database administration, IT personnel can get a handle on all performance issues from a single monitor screen, and concentrate on issues needing attention.

How AimBetter delivers.

AimBetter extends control from its central dashboard over every server under your control from a single screen. In our central agent we collect data from over 400 primary performance metrics and analyze the results to detect any abnormal patterns. There is only one place that needs to be observed, and only exceptional behavior, in the form or alerts, needs attention.  No longer do managers have to spend valuable time looking through masses of raw information in order to detect things that need action.

AimBetter allows you to concentrate on the important issues without making you search through multiple screens and reports – alerting you to real issues in real-time.

In general, AimBetter works for you by:

  • Monitoring core SQL database performance and CPU, memory, storage and network behavior
  • Reporting immediately on exceptional readings
  • Enabling comparison of current with historical metrics
  • Displaying it all in a single comprehensive dashboard
  • Offering the advice of our Expert Support team, with many decades of experience in monitoring and managing SQL Server environments

AimBetter is the best tool in the market that helps you handle malfunctioning code and doesn’t require you to use complex procedures in order to troubleshoot. Additionally it monitors all of your database operations for you automatically, therefore it saves you a lot of time and effort and is the best solution for your Db management.