Analyzing PSSDIAG logs without using SQLNEXUS

In this blog we will discuss about how we can analyze the PSSDIAG logs without using SQLnexus tool.  In this post I will talk in reference to my blog on troubleshooting high CPU issues. Please make sure you open it while reading this blog because I have followed the same steps to analyze the PSSDIAG.

After installing and configuring the PSSDIAG as mentioned in my blog post, here are the next steps we have to follow.

After configuring the PSSDIAG , you will get a cab named pssd.cab in the chosen location:

image

Please make sure the cab is saved in some folder because we will have to extract files and it will become easier to manage .

Here is how it will look when the files are extracted:

image

Before running the PSSDIAG data collection , we have to make sure that the issue is happening  or going to happen and we will have to capture the logs for at least 5-10 minutes “whilst” the issue is happening. It will help us to get enough data for analysis.

To start the PSSDIAG we have to double click the PSSDIAG.cmd and it will look like :

image

Once you see the below text in the command prompt , that means now the PSSDIAG collection process has started.

image

Below is the scenario of high CPU for which I had collected the PSSDIAG logs.

image

Now lets look at the collected logs in the folder named “output” which will be created in the same location from where pssdiag.cmd was run.

The output folder will look like:

image

There will be lots of files in the output and I will discuss about the major files which will help us to find out the cause of high CPU. I will follow my blog on troubleshooting high CPU issues step by step to troubleshoot.

Step 1: – When you see high CPU utilization on the server check in task manager, which process is consuming high CPU. If it’s SQL server then open Perfmon and add the counter for Processor: % Privileged time, % Processor time and %user time

If you see the %privileged time to be high like hovering around 40% and above then it could be some driver issue. To troubleshoot these kinds of issues you could try:

Using Xperf: – http://blogs.msdn.com/b/ntdebugging/archive/2010/03/22/using-xperf-to-root-cause-cpu-consumption.aspx — Video; http://msdn.microsoft.com/en-us/performance/cc709422

If it’s user time which is high then find the process which is consuming high CPU.
image

In this case which apparently is , now we have to check the process consuming high CPU:

image

In this case SQL is consuming around 60% of the CPU and rest is consumed by the remaining processes. As SQL is major contributor to the high CPU at the system and is consistently consuming the same amount.

Another way to find out how much CPU SQL server is consuming – Open the file(from the output folder) , name ending with SQL_2005_Perf_Stats_Startup.OUT.

Look for: Recent SQL Processor Utilization (Health Records)  – you will get the output as follows:

runtime                         system_idle_cpu sql_cpu_utilization
——————————  ————— ——————-
2012-03-23T16:26:33.137        90                   1
2012-03-23T16:26:33.137        64                  20
2012-03-23T16:26:33.137        58                  25
2012-03-23T16:26:33.137        56                  25
2012-03-23T16:26:33.137        55                  25
2012-03-23T16:26:33.137        57                  26
2012-03-23T16:26:33.137        57                  26
2012-03-23T16:26:33.137        44                  35
2012-03-23T16:26:33.137        14                  53
2012-03-23T16:26:33.137         4                  60
2012-03-23T16:26:33.137         3                  59
2012-03-23T16:26:33.137         3                  59
2012-03-23T16:26:33.137         2                  58
2012-03-23T16:26:33.137         3                  58
2012-03-23T16:26:33.137         8                  54
2012-03-23T16:26:33.137        10                  53
2012-03-23T16:26:33.137         2                  56

Step 2. Lets move to the next step of finding the architecture which in our case is x64.

Snippet from Msinfo2 file:

Item    Value
OS Name    Microsoft Windows Server 2008 R2 Enterprise
Version    6.1.7600 Build 7600
System Name    ANMOL-PC
System Type    x64-based PC
Processor    Intel(R) Core(TM) i7 CPU         860  @ 2.80GHz, 2933 Mhz, 4 Core(s), 8 Logical Processor(s)

Step 3: – Check if there are any trace flags enabled:

To get this we can check in the SQL error log at the startup of SQL server. the file name is SQL_Base_Errorlog_Shutdown.TXT. In this case we don’t have any trace flags(snippet from error logs):

2012-03-23 11:50:55.060 Server Microsoft SQL Server 2008 (RTM) – 10.0.1600.22 (X64)
Jul  9 2008 14:17:44
Copyright (c) 1988-2008 Microsoft Corporation
Enterprise Edition (64-bit) on Windows NT 6.1 <X64> (Build 7600: )
2012-03-23 11:50:55.060 Server (c) 2005 Microsoft Corporation.


Step 4
: – Check the below mentioned SQL parameters

To check the SP_configure output we can check the file : –  name ending with sp_sqldiag_Shutdown.OUT

In this case:

max degree of parallelism                     0          64            0           0
max server memory (MB)                       16  2147483647         2048        2048
min server memory (MB)                        0  2147483647            0           0
priority boost                                0           1            0           0

In this case we are safe because we have:
1. Set max server memory
2. Not set Priority boost
3. As we have 8 processors, we don’t need to set any value for this unless required.


Step 5
: – Run DBCC memorystatus and check for the USERSTORE_TOKENPERM memory clerk.

To check the DBCC memorystatus, please check the file:  filename ending with DBCC_MEMORYSTATUS_Startup.OUT

In our case, it is normal:

USERSTORE_TOKENPERM (node 0)             KB
—————————————- ———–
VM Reserved                              0
VM Committed                             0
Locked Pages Allocated                   0
SM Reserved                              0
SM Committed                             0
SinglePage Allocator                     584 <=very less so no problem here.
MultiPage Allocator                      288

CACHESTORE_SQLCP (node 0)                KB
—————————————- ———–
VM Reserved                                        0
VM Committed                                       0
Locked Pages Allocated                             0
SM Reserved                                        0
SM Committed                                       0
SinglePage Allocator                           10128 <= Not very high
MultiPage Allocator                             2152

MEMORYCLERK_SQLQERESERVATIONS (node 0)   KB
—————————————- ———–
VM Reserved                                        0
VM Committed                                       0
Locked Pages Allocated                             0
SM Reserved                                        0
SM Committed                                       0
SinglePage Allocator                          374256 <= High based on the allocated memory
MultiPage Allocator                                0


Step 6
: – If the memory consumed by CACHESTORE_SQLCP is high then we check compilations and re-compilations which in our case is not required.

But I will still mention about the file in the PSSDIAG output which you will need to check for procedure cache pollution:  File name will end with :- SQL_2005_Mem_Stats_Startup.OUT and you will have to search for the string: –   — proccache_pollution

The output will look like :

runtime     plan_count  total_size_in_bytes  cacheobjtype       objtype    usecounts
———– ———– ——————– —————— ———————-
2011-08-22        48768           2898108416 Compiled Plan      Adhoc                4
2011-08-22         5762            281034752 Compiled Plan      Adhoc                1
2011-08-22         5755           1100980224 Compiled Plan      Adhoc                1
2011-08-22         4125            103612416 Compiled Plan      Adhoc                1
2011-08-22         1380            396263424 Compiled Plan      Adhoc                1
2011-08-22         1379            200916992 Compiled Plan      Adhoc                1
2011-08-22         1379            291463168 Compiled Plan      Adhoc                1
2011-08-22         1378            149831680 Compiled Plan      Adhoc                1
2011-08-22         1166            107454464 Compiled Plan      Adhoc                4
2011-08-22         1056            158662656 Compiled Plan      Adhoc                1
2011-08-22         1056             72105984 Compiled Plan      Adhoc                1
2011-08-22         1054             48947200 Compiled Plan      Adhoc                1
2011-08-22          157             11329536 Compiled Plan      Prepared             1
2011-08-22          109              4751360 Compiled Plan      Adhoc                1

But this is just dummy data to show you how it looks like. There will be more columns e.g. short_qry_text which will show you the query for which the parameterization is required.

In the above output, we have to look at the plan_count , size and the use counts which will simply mean that the query for which a new plan was created per execution. For more explanation, please check step 6 in  troubleshooting high CPU issues.

Step 7:- This relates to our issue i.e. lots of hashing and sorting is happening :

image

If we see, the output exactly matches with :

MEMORYCLERK_SQLQERESERVATIONS (node 0) KB
—————————————- ———–
VM Reserved 0
VM Committed 0
Locked Pages Allocated 0
SM Reserved 0
SM Committed 0
SinglePage Allocator 374256
MultiPage Allocator 0

Now, the time to check worktables/sec and workfiles/sec

Worktables/sec are high(~15) based on the number of batches/sec (~ 2 ) – which simply lots of sorting spilling to tempdb :

image

Workfiles/sec are are very low (It shows that there was no hashing spilled to the tempdb) :

image

In essence:

1. No compilation and re-compilation which means no parameterization issue.
2. No configuration issues at SQL server end.
3. we need to find out a query which might have cause the issue.

Please note : – The values are low because I have run small batched to reproduce high CPU.

Now we will move in the direction of hunting the query which is causing the trouble.We will make use of SQL server 2008’s feature query plan hash to drill down the issue. In the file – name ending with  Perf_Stats_Snapshot_Shutdown.OUT. Look for top 10 CPU by query_hash

Here is the output you will get:

runtime     total_worker_time    sample_statement_text
———-  ——————– ———————————-
2012-03-23            1683404272 insert into test (select * from test123 order by col1)
2012-03-23               4432252 SELECT * FROM sys.sysperfinfo
2012-03-23               2461141 SELECT   CONVERT (varchar(30), @ru
2012-03-23               1595088 select db_id() as dbid,     case
2012-03-23               1495085 SELECT CONVERT (varchar(30), @runt
2012-03-23               1062060 SELECT       CONVERT (varchar(30),
2012-03-23               1026058 SELECT CONVERT (varchar(30), @runt
2012-03-23                905051 SELECT CONVERT (varchar(30), @runt
2012-03-23                686038 SELECT      sess.session_id, req.r
2012-03-23                684039 select TOP 10000 * from sys.dm_exe

In the above output, if we see the query which is causing the high CPU is: insert into test (select * from test123 order by col1)

We can tune this query to avoid excessive sorting and which will further help to avoid high CPU issue.

I will keep adding more content in this blog as I find more interesting scenario of high CPU.

Troubleshoot SQL server high CPU issue without using profile traces

The scope of these issues is very wide but I would discuss about the approach which I follow to resolve these issues. I have resolved more than 50% of the cases by using this approach without even looking at profiler traces. I hope this will help you troubleshooting issues in your environment.

It’s specific to SQL server 2008 – 64 bit environments but could be tried on 32 bit environments too.

Let’s discuss about how to start it step by step.

Step 1: – When you see high CPU utilization on the server check in task manager, which process is consuming high CPU. If it’s SQL server then open perfmon and add the counter for Processor: % Privileged time, % Processor time and %user time

If you see the %privileged time to be high like hovering around 40% and above then it could be some driver issue. To troubleshoot these kinds of issues you could try:

Using Xperf: – http://blogs.msdn.com/b/ntdebugging/archive/2010/03/22/using-xperf-to-root-cause-cpu-consumption.aspx — Video; http://msdn.microsoft.com/en-us/performance/cc709422

If it’s user time which is high then find the process which is consuming high CPU. If it’s SQL server which is consuming high CPU then the below steps will be applicable for troubleshooting:

Step 2: – First and very important one is to know what’s the architecture i.e. 64 bit (X64 or IA 64) or 32 bit(x86):

Check the configuration using command: – Xp_msver

In this blog I will talk specifically about SQL server 2008 64-bit.

Step 3: – Check if there are any trace flags enabled:

dbcc tracestatus (-1); It’s always good to check the reason of enabling the trace flags, if enabled.

Step 4: – Check the below mentioned SQL parameters

sp_configure'advanced',1
go
reconfigure with override
go
sp_configure
go

  1. Priority      Boost: – it’s      not recommended to be enabled as it can cause high CPU utilization. If      it’s enabled (value =1) please disable(set value to 0) it – it’s a static      parameter which needs SQL restart
  2. Max server memory : – If we have 64-bit architecture, please make sure you have set the value for it leaving 3-4 GB for the OS. This sometimes contributes indirectly to cause high CPU.
  3. Max degree  of parallelism: – Although this parameter doesn’t make much of a      difference but sometimes I have seen SPID stuck in      ACCESS_METHODS_SCAN_RANGE_GENERATOR, cxpacket or other parallel waittypes.      If you see any such waits when you run this query :Select a.session_id, a.wait_type, a.wait_time, a.* from sys.dm_exec_requests a, sys.dm_exec_sessions b where a.session_id=b.session_id and b.is_user_process=1

If you don’t see cxpacket or parallel waits, please make sure you set the MAXDOP as per the recommended practices:-  http://blogs.msdn.com/b/sqltips/archive/2005/09/14/466387.aspx

If in your case waittypes to be any of related to parallelism and the waittime is more than 20 ms reduce the parallelism to e.g. if 8 then to 4 and so on unless the waits go away.

Please note: – Changing the MAXDOP because of cxpacket waits is a temporary solution. 

Step 5: – Run DBCC memorystatus and check for the

1. USERSTORE_TOKENPERM memory clerk. If the value is high say above 400 MB then follow the article: –

http://blogs.msdn.com/b/psssql/archive/2008/06/16/query-performance-issues-associated-with-a-large-sized-security-cache.aspx
http://support.microsoft.com/kb/955644

2.  CACHESTORE_SQLCP memory cachestore : – This cachestore is responsible to keep the execution plans for ad-hoc SQL queries,autoparameterized plans, and prepared plans. If the size of this cache becomes abnormally high, then there are fair chances of huge number of queries getting compiled because no plan reuse.  In PSSDIAG logs, we have called it as procedure cache pollution. If this is consuming lots of memory then follow step 6.

3. MEMORYCLERK_SQLQERESERVATIONS memory clerk : – This memory clerk is use for hashing and sorting by SQL server and if the memory seems less the resultsets are spilled over to tempdb while executing query. If the result sets spill to tempdb, we check in  perfmon  for  object – Access methodsWorktables/sec (for sorting)  and  Access methodsWork files/sec(hashing). If this memory clerk is consuming lots of memory then follow step 7.

Step 6
: – If  CACHESTORE_SQLCP  is high then next thing we have to check is compilations and recompilations :

  1. Open      perfmon and add the counters: –

MSSQL: SQL statistics à Batch requests/s, compilations/s and recompilations/s.

If you notice compilations/sec are greater than 20% then there is definitely fair amount of chances for Procedure cache pollution. To confirm that please run the query:

WITH cached_plans (cacheobjtype, objtype, usecounts, size_in_bytes, dbid, objectid, short_qry_text) AS
(
SELECT p.cacheobjtype, p.objtype, p.usecounts, size_in_bytes, s.dbid, s.objectid,
CONVERT (nvarchar(100), REPLACE (REPLACE (
CASE
-- Special cases: handle NULL s.[text] and 'SET NOEXEC'
WHEN s.[text] IS NULL THEN NULL
WHEN CHARINDEX ('noexec', SUBSTRING (s.[text], 1, 200)) > 0 THEN SUBSTRING (s.[text], 1, 40)
-- CASE #1: sp_executesql (query text passed in as 1st parameter)
WHEN (CHARINDEX ('sp_executesql', SUBSTRING (s.[text], 1, 200)) > 0)
THEN SUBSTRING (s.[text], CHARINDEX ('exec', SUBSTRING (s.[text], 1, 200)), 60)
-- CASE #3: any other stored proc -- strip off any parameters
WHEN CHARINDEX ('exec ', SUBSTRING (s.[text], 1, 200)) > 0
THEN SUBSTRING (s.[text], CHARINDEX ('exec', SUBSTRING (s.[text], 1, 4000)),
CHARINDEX (' ', SUBSTRING (SUBSTRING (s.[text], 1, 200) + ' ', CHARINDEX ('exec', SUBSTRING (s.[text], 1, 500)), 200), 9) )
-- CASE #4: stored proc that starts with common prefix 'sp%' instead of 'exec'
WHEN SUBSTRING (s.[text], 1, 2) IN ('sp', 'xp', 'usp')
THEN SUBSTRING (s.[text], 1, CHARINDEX (' ', SUBSTRING (s.[text], 1, 200) + ' '))
-- CASE #5: ad hoc UPD/INS/DEL query (on average, updates/inserts/deletes usually
-- need a shorter substring to avoid hitting parameters)
WHEN SUBSTRING (s.[text], 1, 30) LIKE '%UPDATE %' OR SUBSTRING (s.[text], 1, 30) LIKE '%INSERT %'
OR SUBSTRING (s.[text], 1, 30) LIKE '%DELETE %'
THEN SUBSTRING (s.[text], 1, 30)
-- CASE #6: other ad hoc query
ELSE SUBSTRING (s.[text], 1, 45)
END
, CHAR (10), ' '), CHAR (13), ' ')) AS short_qry_text
FROM sys.dm_exec_cached_plans p
CROSS APPLY sys.dm_exec_sql_text (p.plan_handle) s
)
SELECT COUNT(*) AS plan_count, SUM (size_in_bytes) AS total_size_in_bytes,
cacheobjtype, objtype, usecounts, dbid, objectid, short_qry_text
FROM cached_plans
GROUP BY cacheobjtype, objtype, usecounts, dbid, objectid, short_qry_text
HAVING COUNT(*) > 100
ORDER BY COUNT(*) DESC
RAISERROR ('', 0, 1) WITH NOWAIT

In the output you will need check:

Plan_count : – number of times the similar plan was found in the procedure cache

use count : – How many times the plan was reused and in the output of this query you will see the use_count = 1 most of the times which will tell that the query was not being reused.

Total_size_in_bytes: – It’s the memory consumed by the execution plans of the similar kinds.

For the description of the counters please check: – http://msdn.microsoft.com/en-us/library/aa173933(v=sql.80).aspx

In many cases you will find this to be the problem.

If you see the procedure cache pollution then you need to check the memory clerk: CACHESTORE_SQLCP in “dbcc memorystatus” command output (run in query analyzer).

In 64 bit system I have seen this memory to grow up to 8GB or more which is sign of excessive compilation. After you confirm this, there can be two possibilities

  1. Memory      pressure which can cause cleanup of the execution plans and make SQL to      compile the plan and in this case there will be no procedure cache      pollution.
  2. There are      excessive Ad-hoc queries which are causing this compilation and you will      see high pollution in the queries.

For point number 2, there are steps which could be followed:

  1. Identify      the text and database of the query which is being compiled in the output      of the above mentioned query and set parameterization to forced
  2. Parameterize      the ad-hoc query :- http://msdn.microsoft.com/en-us/library/ee343986(v=sql.100).aspx

If you see high re-compilations then check the article : – http://msdn.microsoft.com/en-us/library/ee343986(v=sql.100).aspx .

Also, make sure statistics of the tables are updated with fullscan. Query for that is:
EXEC sp_MSforeachtable 'UPDATE STATISTICS ? WITH FULLSCAN'

To check the statistics updated date, run the query:

EXEC sp_msforeachtable ‘select object_name(object_id) tablename,name as index_name,stats_date(object_id,index_id) as statistics_update_date from sys.indexes where Object_id=object_id(“?”)’

If after implementing the above action plans issue doesn’t resolve then:

7.  Check for the hashing and sorting in SQL server:

Open perfmon and add the counters: – MSSQL: memory manager (Granted workspace area) or check the output of the query:

select * from sys.dm_os_memory_clerks where type=’MEMORYCLERK_SQLQERESERVATIONS’

If the value is very high e.g. more than like 3-4 GB and I have seen this value to go till 10 GB in high memory machines. It’s a clear sign of big hash tables /huge sorting being done.

Then run the query (only for sql server 2008) : –

  1. Run the query :

print '-- top 10 CPU by query_hash --'

select getdate() as runtime, * into tbl_QueryHashByCPU

from

(

SELECT TOP 10 query_hash, COUNT (distinct query_plan_hash) as ‘distinct query_plan_hash count’,

sum(execution_count) as ‘execution_count’,

sum(total_worker_time) as ‘total_worker_time’,

SUM(total_elapsed_time) as ‘total_elapsed_time’,

SUM (total_logical_reads) as ‘total_logical_reads’,

max(REPLACE (REPLACE (SUBSTRING (st.[text], qs.statement_start_offset/2 + 1,

CASE WHEN qs.statement_end_offset = -1 THEN LEN (CONVERT(nvarchar(max), st.[text]))

ELSE qs.statement_end_offset/2 – qs.statement_start_offset/2 + 1

END), CHAR(13), ‘ ‘), CHAR(10), ‘ ‘)) AS sample_statement_text

FROM sys.dm_exec_query_stats AS qs

CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st

group by query_hash

ORDER BY sum(total_worker_time) DESC

) t

go

select * from tbl_QueryHashByCPU
go

In the above query’s output you will get to know the TOP CPU query. You can check the execution plan of the query by running the query:

select a.query_plan_hash , b.* from sys.dm_exec_requests a

cross apply sys.dm_exec_text_query_plan(a.plan_handle,0,-1) b

where a.session_id=<SPID>  ß Pass the SPID accordingly

Find out which one is on the TOP and then run the below query to find the missing indexes:

DECLARE @runtime datetime

DECLARE @cpu_time_start bigint, @cpu_time bigint, @elapsed_time_start bigint, @rowcount bigint

DECLARE @queryduration int, @qrydurationwarnthreshold int

DECLARE @querystarttime datetime

SET @runtime = GETDATE()

SET @qrydurationwarnthreshold = 5000

PRINT ”

PRINT ‘===============================================================================================’

PRINT ‘Missing Indexes: ‘

PRINT ‘The “improvement_measure” column is an indicator of the (estimated) improvement that might ‘

PRINT ‘be seen if the index was created. This is a unitless number, and has meaning only relative ‘

PRINT ‘the same number for other indexes. The measure is a combination of the avg_total_user_cost, ‘

PRINT ‘avg_user_impact, user_seeks, and user_scans columns in sys.dm_db_missing_index_group_stats.’

PRINT ”

PRINT ‘– Missing Indexes –‘

SELECT CONVERT (varchar, @runtime, 126) AS runtime,

mig.index_group_handle, mid.index_handle,

CONVERT (decimal (28,1), migs.avg_total_user_cost * migs.avg_user_impact * (migs.user_seeks + migs.user_scans)) AS improvement_measure,

‘CREATE INDEX missing_index_’ + CONVERT (varchar, mig.index_group_handle) + ‘_’ + CONVERT (varchar, mid.index_handle)

+ ‘ ON ‘ + mid.statement

+ ‘ (‘ + ISNULL (mid.equality_columns,”)

+ CASE WHEN mid.equality_columns IS NOT NULL AND mid.inequality_columns IS NOT NULL THEN ‘,’ ELSE ” END + ISNULL (mid.inequality_columns, ”)

+ ‘)’

+ ISNULL (‘ INCLUDE (‘ + mid.included_columns + ‘)’, ”) AS create_index_statement,

migs.*, mid.database_id, mid.[object_id]

FROM sys.dm_db_missing_index_groups mig

INNER JOIN sys.dm_db_missing_index_group_stats migs ON migs.group_handle = mig.index_group_handle

INNER JOIN sys.dm_db_missing_index_details mid ON mig.index_handle = mid.index_handle

WHERE CONVERT (decimal (28,1), migs.avg_total_user_cost * migs.avg_user_impact * (migs.user_seeks + migs.user_scans)) > 10

ORDER BY migs.avg_total_user_cost * migs.avg_user_impact * (migs.user_seeks + migs.user_scans) DESC

PRINT ”

GO
For information on the DMV’s used in the query, please check http://msdn.microsoft.com/en-us/library/ms345421.aspx

You will definitely find some index recommendations for the TOP queries. Create the indexes and see if that makes the difference.

If creating the indexes doesn’t help then you will have to find the execution plan of the Top CPU query and try to tune it.