RCA for SQL resources failover on Windows cluster

I was working on case where we had to find the root cause of the SQL resources failover to the other node. It was actually mix of SQL restart on the same machine and SQL resources failover.

Agenda of this blog is as follows:

1. Scenarios where SQL restarts on the same node and it fails over to the other node
2. Approach for the root cause analysis of the issue
3. What data to capture to find the RCA(for next occurrence)

1. Scenarios where SQL restarts on the same node and it fails over to the other node :- I have noticed it be confusing for many customers to understand why sometimes SQL restarts on the same node and sometimes it fails over. I have seen may people have confusion on this. To clarify the doubt: –

image

we have group property where the parameters are

Threshold : – It’s the number of resource failures same cluster group can withstand before causing the failover.
e.g. if in a group I have 5 resources and they fail 2 times , they will be restarted both the times on the same node. If the next failure happens the group will be failed over to the other node.

Period : –  It means that if there are 10 failures in a group within 6 hours time, the resources will be kept in the failed state and you will need to manually bring them online. If affect group option for the resource is set , the group will be failed over to the other node.

Now for SQL in general:

image

If we see the property of SQL resource , we have 2 options
1. Do not restart: – It means that if SQL resource fails even once, it will simply be in the failed state.
2. Restart – It means that if SQL resource fails , it will be restarted 3 times on the same node(It will override the setting of 10 which is for overall failures in group mentioned in the first section) . If we select the affect group option(see in the pic) then it will failover the entire group to the other node after the failure 4th time. If not, the SQL resource will be in the failed state.

Moving further on the approach to find the root cause analysis of the cluster failover or SQL resource restart on the same node.

2. Approach for the root cause analysis of the issue(RCA) : –  we will start with the basics on how to find the RCA. There can be multiple reasons of SQL restart/failover , some of the major ones are as follows:

a). SQL server instance crash
b). SQL server connectivity issue
c). Unrequired “Affect the group” setting
d). Disk/network resource failure

a). SQL server instance crash : –  To find out which one of the above is the problem, it’s always good to start with the SQL server error log. If we see something like :

I. Access violation
II. Non-yielding scheduler
III. Some memory dump

generated just before the crash and you don’t see :

2012-03-25 22:16:48.49 spid11s SQL Server is terminating in response to a ‘stop’ request from Service Control Manager. This is an informational message only. No user action is required.

– it will simply mean that there was some issue from the SQL end which caused the crash.  The troubleshooting will remain same as done for SQL crashes.

2. SQL server connectivity issue: – The start will be same i.e. check SQL server error logs and check the SQL server error logs at the time of failover.

If the error logs say :

2012-03-25 22:16:48.49 spid11s     SQL Server is terminating in response to a ‘stop’ request from Service Control Manager. This is an informational message only. No user action is required.

This will simply mean the failover has occurred because of the remaining reasons mentioned in the starting of this point i.e.

I. SQL server connectivity issue
II. unrequired “Affect the group” setting
III. Disk/network resource failure

Now question arises how do we come to know that the issue is due to connectivity?

In application logs if we see something like:

[sqsrvres] printODBCError: sqlstate = 08S01; native error = 0; message = [Microsoft][SQL Native Client]Communication link failure
[sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed
[sqsrvres] printODBCError: sqlstate = 08S01; native error = 0; message = [Microsoft][SQL Native Client]Communication link failure
[sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed
[sqsrvres] printODBCError: sqlstate = 08S01; native error = 0; message = [Microsoft][SQL Native Client]Communication link failure
[sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed
[sqsrvres] printODBCError: sqlstate = 08S01; native error = 0; message = [Microsoft][SQL Native Client]Communication link failure
[sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed
[sqsrvres] printODBCError: sqlstate = 08S01; native error = 0; message = [Microsoft][SQL Native Client]Communication link failure
[sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed
[sqsrvres] OnlineThread: QP is not online.
[sqsrvres] printODBCError: sqlstate = 08S01; native error = 2746; message = [Microsoft][SQL Native Client]Communication link failure
[sqsrvres] printODBCError: sqlstate = 08S01; native error = 2746; message = [Microsoft][SQL Native Client]TCP Provider: An existing connection was forcibly closed by the remote host.

In SQL error logs : –

SQL Server is terminating in response to a ‘stop’ request from Service Control Manager. This is an informational message only. No user action is required.

In cluster logs: –

ERR  SQL Server <SQL Server (Dummy)>: [sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed
ERR  SQL Server <SQL Server (Dummy)>: [sqsrvres] printODBCError: sqlstate = 08S01; native error = 2746; message = [Microsoft][SQL Native Client]TCP Provider: An existing connection was forcibly closed by the remote host.

ERR  SQL Server <SQL Server (Dummy)>: [sqsrvres] printODBCError: sqlstate = 08S01; native error = 2746; message = [Microsoft][SQL Native Client]Communication link failure
ERR  SQL Server <SQL Server (Dummy)>: [sqsrvres] OnlineThread: QP is not online.
ERR  SQL Server <SQL Server (Dummy)>: [sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed
ERR  SQL Server <SQL Server (Dummy)>: [sqsrvres] printODBCError: sqlstate = 08S01; native error = 0; message = [Microsoft][SQL Native Client]Communication link failure
ERR  SQL Server <SQL Server (Dummy)>: [sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed
ERR  SQL Server <SQL Server (Dummy)>: [sqsrvres] printODBCError: sqlstate = 08S01; native error = 0; message = [Microsoft][SQL Native Client]Communication link failure
ERR  SQL Server <SQL Server (Dummy)>: [sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed
ERR  SQL Server <SQL Server (Dummy)>: [sqsrvres] printODBCError: sqlstate = 08S01; native error = 0; message = [Microsoft][SQL Native Client]Communication link failure
ERR  SQL Server <SQL Server (Dummy)>: [sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed
ERR  SQL Server <SQL Server (Dummy)>: [sqsrvres] printODBCError: sqlstate = 08S01; native error = 0; message = [Microsoft][SQL Native Client]Communication link failure
ERR  SQL Server <SQL Server (Dummy)>: [sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed
ERR  SQL Server <SQL Server (Dummy)>: [sqsrvres] printODBCError: sqlstate = 08S01; native error = 0; message = [Microsoft][SQL Native Client]Communication link failure

If there is no error of any resource failure prior to this in the cluster logs and no errors in SQL error logs which can relate to performance etc., it’s likely to be a connectivity issue. But It’s still an assumption because we don’t know yet what could have caused it. We will discuss about which data to collect in the next section of the blog.

But for now, how we can make sure we haven’t missed on basics before we move to the next phase of collecting more logs.

Here are few very common configuration issues which I have seen on many cluster which can cause failovers. It covers the scenarios of SQL server crashes mentioned in the first section and connectivity issues. In short, It has to checked as  basics of troubleshooting:


1. Sp_configure output : –

affinity I/O mask    -2147483648    2147483647    0    0
affinity mask    -2147483648    2147483647    0    0

Please make sure you don’t assign same processors for the above parameters. To understand why , please check the blog:- http://blogs.msdn.com/b/psssql/archive/2010/11/19/how-it-works-io-affinity-mask-should-i-use-it.aspx

lightweight pooling    0    1    0    0

If we set the above parameter to 1 , it will enable fiber mode for SQL server. It’s not a recommended setting in general unless tested. It can be disabled to check if that helps to run SQL in normal mode without any failover.

max degree of parallelism    0    64    0    0

If we haven’t set  the value of this parameter and and we have more than 16 logical processors, we need to set this value to 8 or below 8(If HT is enabled equal to the number of processors).

max server memory (MB)    16    2147483647    102400    102400

If you have 64 bit SQL server, then please cap this value after 3-4 GB for the OS.

priority boost    0    1    1    1

This setting should strictly be disabled on clustered environment, the reason being it can make the cluster environment very unstable.

Once we are sure of the above, we have to check a very important parameter i.e. Page file size. I have seen cluster sinking because of the page file size set to default i.e. 2 GB . 

On 32-Bit  systems it should be 1.5 the size of the physical memory on the machine and
On 64 Bit systems It should be in double figures and to calculate the size of the page file on 64 bit, please check the article: – http://support.microsoft.com/kb/889654 .

I have heard people saying that we have huge RAM then why we need page file as we may not do paging. It may hold true in some scenario when we have oversized the memory. But proactively, we need to set the page file to toughly some double sized figure e.g. 15-20 GB. Because when paging happens on 64 bit machines it’s in GBs and if the page file is not large enough to handle the space, the cluster fails.

TCP/IP chimney : –

1. Please disable TCP Chimney at the Windows Level and NIC Level on all servers and all network cards. Please follow the steps below to disable TCP Chimney at Windows :

a. netsh int tcp set global chimney=disabled

b. netsh int tcp set global rss=disabled

c. netsh int tcp set global netdma=disabled

· After the above is done, go into ncpa.cpl, go into the advanced network settings for your card and disable the Chimney offload settings. This is specific to the card but will usually look something like this.

· Go into your network connection properties and click on “Configure”

clip_image001

· In the advanced tab, set Large offload to disabled for all IP versions

clip_image002

· Ensure Receive Side Scaling is disabled.

clip_image003

· Ensure TCP and UDP checksum offload is disabled for all IP versions.

clip_image004

· Reboot the machine to ensure these settings take effect.

Also, please check the article :- http://blogs.msdn.com/b/psssql/archive/2010/02/21/tcp-offloading-again.aspxfor more information.

2. Update the NIC driver to the latest version on all the servers and all network cards.

3. If there is a switch device on the network, please ensure the network switch and two servers’ network interface cards (NICs) have the same duplex settings. I would suggest you hard code the duplex settings to “100Mbps full” or “1000Mbps full”. For more information, you may refer to:
Network switch with full-duplex incompatible with half-duplex network interface cards (NICs)http://support.microsoft.com/kb/224131
3. unrequired “Affect the group” setting : – I have used “unrequired” term because I have seen people having DTC /fulltext resource in SQL group and they check affect the group setting in the resource property. What happens in those scenarios is when the DTC resource fails it takes along all it’s brother and sisters. As a result, we see application is failing to connect to  and SQL of course failed over/restarted.

Here is what you see in System windows event logs:

ClusSvc,Information,Failover Mgr ,1205,N/A,BLRS2R12-17,”The Cluster Service failed to bring the Resource Group “”Group 0″” completely online or offline.”

ClusSvc,Information,Failover Mgr ,1153,N/A,BLRS2R12-17,Cluster service is attempting to failover the Cluster Resource Group ‘Group 0’ from node BLRS2R12-17 to node

Service Control Manager,Information,None,7036,N/A,BLRS2R12-17,The SQL Server (MSSQLSERVER) service entered the stopped state.

Service Control Manager,Information,None,7035,DOM256581administrator,BLRS2R12-17,The SQL Server (MSSQLSERVER) service was successfully sent a stop control.

Service Control Manager,Information,None,7036,N/A,BLRS2R12-17,The SQL Server Agent (MSSQLSERVER) service entered the stopped state.

Service Control Manager,Information,None,7035,DOM256581administrator,BLRS2R12-17,The SQL Server Agent (MSSQLSERVER) service was successfully sent a stop control.

ClusSvc,Information,Failover Mgr ,1203,N/A,BLRS2R12-17,”The Cluster Service is attempting to offline the Resource Group “”Group 0″”.”

Service Control Manager,Information,None,7036,N/A,BLRS2R12-17,The SQL Server FullText Search (MSSQLSERVER) service entered the stopped state.

If we see in the above logs, after the full-text resource failed, cluster service took entire group offline and brought it back online.  In these kinds of cases we need to disable the affect the group setting and troubleshoot the Fulltext or any resource causing the group failure.

4. Disk/network resource failure  : –  In these scenarios, it’s mandatory for the cluster group to fail because unless we have the disks or network, we can’t work. Anyways, Affect the group settings is enabled by default and it makes sense to have this setting.  For these kinds of failure you will see the error:

Service Control Manager,Information,None,7036,N/A,BLRS2R12-17,The SQL Server (MSSQLSERVER) service entered the stopped state.

Service Control Manager,Information,None,7035,DOM256581administrator,BLRS2R12-17,The SQL Server (MSSQLSERVER) service was successfully sent a stop control.

Service Control Manager,Information,None,7036,N/A,BLRS2R12-17,The SQL Server Agent (MSSQLSERVER) service entered the stopped state.

Service Control Manager,Information,None,7035,DOM256581administrator,BLRS2R12-17,The SQL Server Agent (MSSQLSERVER) service was successfully sent a stop control.

Cluster resource ‘SQL IP Address 1 (v256581)’ in Resource Group ‘Group 0’ failed.
If the failover happens because of anything except SQL server, we have got a good idea on why the issue could have happened. If anything mentioned above doesn’t help and you want to capture the logs for the next occurrence, here is what you could do:

What data to capture to find the RCA:

1. Collect MPS reports  : – http://www.microsoft.com/download/en/details.aspx?id=24745
2. Enable the failoveranalysis.sql
We have  FailoverAnalysis.sql under the install folder and remote DAC is enabled {remote DAC is disabled by default on cluster. Look at sp_configure}. Run the following command in console :

cluster resource “SQL Server” /priv sqldmvscripttimeout=85000

85000 the 85 sec is the script timeout.

3. In cluster failover scenarios, the SQL Server resource DLL now can obtain a dump file before the failover occurs. When the SQL Server resource DLL determines that a SQL Server resource has failed, the SQL Server resource DLL uses the Sqldumper.exe utility to obtain a dump file of the SQL Server process.

(http://support.microsoft.com/default.aspx?scid=kb;EN-US;917825)

cluster resource “SQL Server” /priv SqlDumperDumpFlags = 0x8100

cluster resource “SQL Server” /priv SqlDumperDumpPath= DirectoryPath

cluster resource “SQL Server” /priv SqlDumperDumpTimeOut= Timeout

To verify that the settings have been enabled, you can run the following command:

cluster resource “SQL Server” /priv “

If we configure above mentioned steps, we will get all the information we need to find out the RCA of the cluster failure/failover.

4. If the issue is because of SQL server performance then, you could capture the PSSDIAG as mentioned in the blog: – http://sql-blogs.com/2012/02/18/configure-pssdiag-to-capture-high-cpu-issues/

In summary: – we discussed about:
1. What issues can cause cluster failover.
2. What basics we need to avoid failover.
3. What data to capture to find RCA if nothing in the blog helps.

In my next blog, I will come up with how to analyze the captured logs.

HTH!

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s