Here is a great example of how one customer was experiencing a severe performance drop. Using our unique toolbox for monitoring and analyzing the whole environment, together with our support team’s tremendous experience, we were able to nail the cause, and recommend a fix, within a matter of a few minutes.

THE CHALLENGE

For one of our customers, users started complaining that SQL Server performance suddenly dropped and access slowed down. This happened during a period of normal user activity, and no other exceptional systems functions outside SQL appeared to be the cause.

Where do you start looking?

In a complex environment such as SQL Server, simply knowing where to start looking when resolving performance issues is more than half the battle.

Native network and application-based monitoring tools that come built-in to SQL Server require substantial ‘hands-on’ management, and are inherently granular, meaning that you need to try all possibilities one-by-one to build a picture ‘from the ground up’ without any sense of where to start. You can waste a lot of time and effort before honing-in on the actual cause.
In this case, there are multiple potentially suspect areas, and typically several specialist teams need to get involved – in network, DB administration, application design and storage management at the very least.

AimBetter Console dashboard provides immediate, real-time identification, explanation and resolution of all database, system and application performance challenges.
AimBetter monitors more than 400 system and application metrics, detects situations of unexpected behavior, and presents a full graphic picture giving priority to issues that need attention.

AimBetter Alert Notification is the standard kick-off point, providing a simple overview of any aspect of the server than is performing outside of your expected thresholds. Simply by following the flow of alerts, and applying our relevant knowledge and experience, we not could only say WHAT was happening, but WHY is had happened, and HOW to fix it to prevent re-occurrence.

THE AimBetter SOLUTION

The universal starting point for AimBetter is our monitor dashboard, identifying all aspects of the environment that are behaving outside of normal parameters.

 

 

 

In this case, the alert console is showing a burst of repeating situations that form a recognizable pattern : CPU saturation, followed by alert of long query time (in real terms, the long query is causing the CPU load, not the other way round, but the alert trigger on CPU is much more sensitive so it shows up first.) In our comprehensive monitor, CPU load can typically be associated with a wide range of causes, but most of these will also trigger independent AimBetter alerts on their own.  Some examples are:

• Operating System issues (network load, CPU load etc.)
• SQL efficiency issues (low Page Life Expectancy, low buffer cache hit ratio, high level of paging etc.)

So the co-incidence of CPU load and long query time, and the lack of any other pointers to more general CPU issues leads directly to the conclusion that the first place to look is in the Query analysis function built-in to AimBetter monitor. Directly from the dashboard, we switched into Query Analysis, by jumping to the query screen. This view assembles summaries of the queries that were running in the actual time window we were looking at.

 

 

Here you can see that there’s an immediate point of interest here (circled in red). Unlike the other SELECT commands, one has created massive volume of cache reads (12.5gb). So the next thing to do is to look at what that SELECT function is doing. It’s just one click away!

This case could have needed a deep investigation by your application design team to identify what, if anything, there could be in this snippet of code causing such a slow-down.  But this is where having access to the AimBetter Expert Support Team makes the difference.  The team’s wide experience in all matters relating to SQL, and our ability to think “out-of-the-box” when confronted by unusual patterns, in this case lead us to a quick solution, and brought this matter to a successful conclusion for our customer in a matter of hours.

If you are interested in the technical details, see the exact steps to our solution in the next section.

This customer is using an advanced ERP solution that has been highly specialized for their customer base of over 8,000 companies across a wide range of industries in 40 countries.  We have worked closely with many of these customers to introduce our monitoring platform into their environments (for more information, see our recent article  here.)  As a result, our Expert Support Team has extensive knowledge of how it works and of our customers’ own implementation of their software package.

From experience, we know that in cases where a JOIN between two tables is using a common key (in the code the common key highlighted on the snapshot is DOC), it is best-practice to use the COALESCE clause in T-SQL (The COALESCE function in SQL returns the first non-NULL expression among its arguments.)  If the developer implementing the package misses that, it defaults to the CROSS JOIN mechanism.

Inside Microsoft SQL Server engine, a table JOIN command can produce unexpected results in instances that use CROSS JOIN conditions in a SELECT query. (See Microsoft’s technical paper  here.) The implications of what could happen in the case of a CROSS JOIN are summarized by saying that a cross join produces a result set that is the number of rows in the first table multiplied by the number of rows in the second table.

To take a simple example, if your query is joining table A which has 25 rows with table B which has 20 rows, the result set has 500 rows.  However, if the two tables are extremely unbalanced, the result set may be highly volatile.  Take a moment to look at the following theoretical scenarios:


From this example, in the most extreme case, a CROSS JOIN to a table that has grown only modestly (by 10 rows) will produce a result set that has suddenly grown by 10 million rows!  The impact of this scenario exactly matches what our customer experienced, and shows up in the metrics we saw on our console.

The solution was relatively simple – to change the appropriate section of code inside the application, to explicitly define any join statement so as to avoid the possibility of a cross join query. The unique aspect of our solution was knowing where to start looking for the cause, and not having to expend time, energy and resources checking up on areas that were not directly contributing to this specific behavior.