SQL Query causing CPU overload
April 2018
Introduction
Recently, one of our customers was presented with a sudden warning of abnormal behavior on their database server. The case illustrates the value of having the right tools in place: first, to pick up quickly that something needs attention, and then to have enough information immediately at hand to spend a minimum of time investigating and instead go straight to the root cause and fix it.
The “secret weapon” in this case is AimBetter’s monitoring and alert mechanism, which raised the necessary flags and pointed to the proper remedy in just a few steps.
Background information
The company has a large number of data collection terminals spread over a wide geographic area, as well as many on-site internet-connected devices. These devices communicate in real-time with central servers, and an essential requirement is rapid response, measured in microseconds. Failure to meet this standard can have a major impact on the company’s operations and profitability, which means that management places high demands on the database administrators.
Summary of this study
We give a complete description of all the details in the rest of this study below, but as a simple step-by-step summary, the following gives you a clear picture of how this problem was solved:
- AimBetter monitor console alerts on exceptional CPU load (notification via email to the customer, as well as to the AimBetter expert support team)
- AimBetter support DBA opens the alert and sees that the cause is deadlocks on one SQL instance
- Analysis of current queries shows a very large number of queries in wait-for status
- QAnalyzer shows the guilty query (OtherEvents) consuming more than 50% of CPU, with nearly 2,000 executions
- Using the Performance monitor, we eliminate other possible explanations – e.g. new users, additional activity
- The Performance monitor for the number of queries run per CPU core shows an exceptional increase
- DBA advises the customer’s administration to look specifically at the query and examine how it could raise core load so suddenly
- Conclusion by the customer that the data stream to the terminal was missing or out-of-sequence
- Data integrity and normal functionality restored
- AimBetter advises the customer that the way to prevent this in the future is to route traffic via an API service instead of directly to the SQL server
- The whole process, from start to completion, took less than 10 minutes!
It is important to note that all of this happens in real-time on a single console, with AimBetter support and the customer’s administrators able to see and respond to exactly the same console information, without any need for over-the-shoulder or VPN access into the customer’s site.
The symptoms
The company received an alert from the AimBetter monitoring system that CPU load had risen quickly above the critical level defined in their thresholds. Such notifications are triggered automatically by AimBetter’s service to alert on-site administrators via email with details of the behavior requiring attention.
The investigation
Our own experience has shown that a primary culprit for sudden jumps in CPU load is a wayward query. Clicking on the alert bar shows the causes of the CPU misbehavior – in this case, an exceptionally high number of deadlocks, along with a CPU load of more than 85%.
This is an immediate confirmation that queries are the likely cause, and leads to the following display of query status:
The obvious point of interest on this screen is the large number of query instances in a wait-for status. The task here is, therefore, to analyze these queries to find out why they cannot complete.
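AimBetter gathers this display automatically; for readers who want to approximate the same snapshot by hand, a minimal sketch against SQL Server’s dynamic management views (the filter and columns here are illustrative, not AimBetter’s internal query) might look like this:

```sql
-- Snapshot of currently executing requests that are in a wait state,
-- showing what each one waits on and which session (if any) blocks it.
SELECT r.session_id,
       r.status,
       r.wait_type,
       r.wait_time AS wait_time_ms,
       r.blocking_session_id,
       t.text      AS query_text
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.status = 'suspended'   -- i.e. in a wait-for state
ORDER BY r.wait_time DESC;
```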
AimBetter’s QAnalyzer function is a fast and accurate analyzer that takes you down to the basic structure of all queries running at the time of interest.
From this screen, the obvious fact is that one query (OtherEvents) is on its own consuming more than 50% of CPU and has run nearly two thousand times. The question to be answered is whether this is typical behavior for this query, or whether something has changed in this instance.
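QAnalyzer produces this ranking directly. A rough hand-rolled equivalent – purely illustrative, and limited to whatever is still in the plan cache – aggregates cached statement statistics by worker (CPU) time:

```sql
-- Top cached statements by total CPU (worker) time, with execution counts.
SELECT TOP (10)
       qs.total_worker_time / 1000 AS total_cpu_ms,
       qs.execution_count,
       SUBSTRING(t.text, (qs.statement_start_offset / 2) + 1,
                 ((CASE qs.statement_end_offset
                        WHEN -1 THEN DATALENGTH(t.text)
                        ELSE qs.statement_end_offset
                   END - qs.statement_start_offset) / 2) + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS t
ORDER BY qs.total_worker_time DESC;
```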
Using the Performance monitoring screens, the patterns of four separate, relevant metrics can easily be presented in order to draw an accurate picture of how this event progressed.
First, we look at the CPU load, which triggered this whole event:
We see here CPU load increasing nearly 400% in the space of barely one hour.
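As a point of reference, SQL Server itself keeps a short rolling history of CPU utilization in its scheduler-monitor ring buffer. A commonly used sketch for reading it (approximate, and covering only recent samples) is:

```sql
-- Recent SQL Server CPU utilization samples (one per minute) read from
-- the scheduler-monitor ring buffer; useful for reconstructing the spike.
SELECT TOP (60)
       DATEADD(ms, CONVERT(int, rb.[timestamp] - si.ms_ticks), GETDATE()) AS sample_time,
       rb.record.value('(./Record/SchedulerMonitorEvent/SystemHealth/ProcessUtilization)[1]', 'int') AS sql_cpu_pct,
       rb.record.value('(./Record/SchedulerMonitorEvent/SystemHealth/SystemIdle)[1]', 'int')         AS idle_pct
FROM (SELECT [timestamp], CONVERT(xml, record) AS record
      FROM sys.dm_os_ring_buffers
      WHERE ring_buffer_type = 'RING_BUFFER_SCHEDULER_MONITOR'
        AND record LIKE '%<SystemHealth>%') AS rb
CROSS JOIN sys.dm_os_sys_info AS si
ORDER BY sample_time DESC;
```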
Next, to determine whether the increase in CPU load was the result of some normal rise in load from user activity, we examined the frequency of batch requests:
Although the number of requests has increased in the same timeframe, the magnitude of the increase is much lower – not quite reaching 50% over a longer period.
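Batch request frequency can also be sampled by hand from SQL Server’s performance counters. Since the counter is cumulative, a rough manual check has to difference two samples over a known window; the 10-second window below is arbitrary:

```sql
-- "Batch Requests/sec" is a cumulative counter, so sample it twice
-- and divide the difference by the sampling window.
DECLARE @v1 bigint, @v2 bigint;

SELECT @v1 = cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Batch Requests/sec';

WAITFOR DELAY '00:00:10';   -- arbitrary 10-second window

SELECT @v2 = cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Batch Requests/sec';

SELECT (@v2 - @v1) / 10 AS batch_requests_per_sec;
```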
Next, in order to determine whether the extra load was coming from another source – namely, new users connecting – we looked at the number of connections to the database over the same period:
In this case, although connections jumped by 200% in the short-term, the overall levels are not significantly higher than historical levels 24 hours earlier, so the likely conclusion is that most of the increase is due to normal time-of-day activity rather than some exceptional event.
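A quick manual approximation of the same metric – how many user sessions are connected, and from where – can be sketched from the sessions DMV (the grouping columns are an illustrative choice):

```sql
-- Current user connections grouped by login and originating host,
-- to spot any sudden influx of new users or machines.
SELECT s.login_name,
       s.host_name,
       COUNT(*) AS session_count
FROM sys.dm_exec_sessions AS s
WHERE s.is_user_process = 1
GROUP BY s.login_name, s.host_name
ORDER BY session_count DESC;
```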
Finally, the related picture of the number of queries run per CPU core:
Here the evidence is very strong – an increase of more than 500% in the space of a few minutes.
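This kind of per-core pressure can also be observed directly in SQL Server’s scheduler DMV; a minimal sketch is:

```sql
-- One row per CPU scheduler: a consistently high runnable_tasks_count
-- means tasks are queuing for CPU time on that core.
SELECT scheduler_id,
       current_tasks_count,
       runnable_tasks_count,
       work_queue_count
FROM sys.dm_os_schedulers
WHERE status = 'VISIBLE ONLINE'
ORDER BY runnable_tasks_count DESC;
```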
The Diagnosis
Putting these together, we were able to advise the customer’s database managers that the extra CPU load was coming from a single query (OtherEvents), and that we could rule out a sudden increase in new users (the number of connections had increased, but not enough to trigger such a dramatic jump in CPU load). Similarly, the number of active users was not enough to trigger the load (the number of batch requests was higher, but only about 100% above historical levels – not enough to result in a 700% jump in task load). The final, and conclusive, evidence is the large increase in queries per core: this metric jumped from around 20K to 140K in a time frame exactly matching the jump in CPU load. This is relevant because the query is designed for parallel, multi-threaded execution.
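That parallelism is visible in the DMVs as well: a session running a parallel query owns more than one worker task. A minimal, illustrative check for such sessions is:

```sql
-- Sessions owning more than one worker task, i.e. queries that have
-- gone parallel and are fanning work out across CPU cores.
SELECT t.session_id,
       COUNT(*) AS task_count
FROM sys.dm_os_tasks AS t
WHERE t.session_id IS NOT NULL
GROUP BY t.session_id
HAVING COUNT(*) > 1
ORDER BY task_count DESC;
```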
With this information, the customer’s administrator knew exactly where to look for a problem. The conclusion had to be that some data inside a session with one of the terminals was missing. This is likely to be caused by sequentially labeled data streams either arriving out of sequence or with packets missing, resulting in data being locked in memory and blocking all the other dependencies. The customer’s experienced administrators now knew what to look for – events that could cause data to go missing – and where to look.
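When queries pile up behind locked data like this, a practical first step is to find the head of each blocking chain – a session that blocks others but is not itself blocked. A rough sketch of that check:

```sql
-- Head blockers: sessions that block other sessions but are not
-- themselves waiting on anyone (excluding intra-query self-blocking).
SELECT w.blocking_session_id AS head_blocker,
       COUNT(*)              AS sessions_blocked
FROM sys.dm_os_waiting_tasks AS w
WHERE w.blocking_session_id IS NOT NULL
  AND w.blocking_session_id <> w.session_id
  AND w.blocking_session_id NOT IN (SELECT session_id
                                    FROM sys.dm_os_waiting_tasks
                                    WHERE blocking_session_id IS NOT NULL
                                      AND session_id IS NOT NULL)
GROUP BY w.blocking_session_id
ORDER BY sessions_blocked DESC;
```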
The cure
Within a few minutes, the customer’s administrators had identified which terminal was faulty, and could release all the pending queries as well as repair the communication integrity of the data stream to the terminal that had triggered this event. Further, we were able to advise on how to prevent this from happening in the future: by channeling data through a front-end API service rather than directly into the database server, blockages arising from a single terminal being out-of-sync can be prevented.
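For completeness: once the offending session has been identified, SQL Server’s KILL command releases everything queued behind it. The session id below is purely hypothetical:

```sql
-- Terminate the blocking session (137 is a hypothetical id standing in
-- for the faulty terminal's session); queries waiting behind it resume.
KILL 137;
```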
AimBetter does it better.
Firstly, abnormal patterns are highlighted in real-time, with notification to our Expert Support Team and the user’s administrators as soon as the problem is registered.
Secondly, the analysis of the event happens in real-time. There is no need to reconstruct the circumstances – everything that was happening is stored inside our database and can be recalled, analyzed and displayed in a single operation without the need to directly access the SQL server itself.
Finally, AimBetter is a complete service. Proper health maintenance for a data environment has many similarities to personal health. The saying “prevention is better than cure” applies equally in both cases, and any good practitioner will prefer to identify problems before they manifest and take appropriate action. AimBetter is the best diagnostic tool for the database environment because it does both of those things – it minimizes problems by keeping your database healthy, and whenever something goes wrong, it is immediately useful in finding the cause, and pointing you in the right direction for correction and cure.