Recently, I delivered a level 100+ presentation to some of my colleagues in which I discussed troubleshooting Service Broker. In this post, I will cover the basic steps you can follow to troubleshoot Service Broker issues.
Let’s build some background on Service Broker first. You know level 100 of Service Broker if you know:
1. Message types: http://msdn.microsoft.com/en-us/library/ms187744
2. Contracts: http://msdn.microsoft.com/en-us/library/ms178528
3. Services: http://msdn.microsoft.com/en-us/library/ms190332
4. Queues: http://msdn.microsoft.com/en-us/library/ms190495
5. Routes: http://msdn.microsoft.com/en-us/library/ms186742
6. Remote service binding: http://msdn.microsoft.com/en-us/library/ms178024.aspx
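To make the first four objects concrete, here is a minimal sketch that ties them together and sends one message on a dialog. All object names are hypothetical and the script is illustrative, not production code:

```sql
-- Minimal sketch: message type -> contract -> queues -> services.
CREATE MESSAGE TYPE [//Demo/RequestMsg] VALIDATION = WELL_FORMED_XML;
CREATE CONTRACT [//Demo/Contract] ([//Demo/RequestMsg] SENT BY INITIATOR);
CREATE QUEUE InitiatorQueue;
CREATE QUEUE TargetQueue;
CREATE SERVICE InitiatorService ON QUEUE InitiatorQueue;
CREATE SERVICE TargetService ON QUEUE TargetQueue ([//Demo/Contract]);

-- Send a message on a new dialog conversation:
DECLARE @h UNIQUEIDENTIFIER;
BEGIN DIALOG CONVERSATION @h
    FROM SERVICE InitiatorService
    TO SERVICE 'TargetService'
    ON CONTRACT [//Demo/Contract]
    WITH ENCRYPTION = OFF;
SEND ON CONVERSATION @h MESSAGE TYPE [//Demo/RequestMsg] (N'<request/>');
```

With both services in the same database, the message lands directly on TargetQueue; routes and remote service bindings only come into play across databases or instances.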
- Dialog security — Encrypts messages in an individual dialog conversation and verifies the identities of participants in the dialog. Dialog security also provides remote authorization and message integrity checking. Dialog security establishes authenticated and encrypted communication between two services.
- Transport security — Prevents unauthorized databases from sending Service Broker messages to databases in the local instance. Transport security establishes an authenticated network connection between two databases.
For more information, please check: – http://technet.microsoft.com/en-us/library/ms166079(SQL.105).aspx
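Transport security rides on a Service Broker endpoint. A sketch of one using Windows authentication follows; the endpoint name and port are illustrative (4022 is just the conventional choice):

```sql
-- Sketch of a Service Broker endpoint with Windows transport security.
-- Name and port are examples; pick what fits your environment.
CREATE ENDPOINT BrokerEndpoint
    STATE = STARTED
    AS TCP (LISTENER_PORT = 4022)
    FOR SERVICE_BROKER (AUTHENTICATION = WINDOWS, ENCRYPTION = REQUIRED);
```

Both instances need an endpoint like this (plus CONNECT permission for the other side's login) before cross-instance dialogs can flow.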
Service Broker Activation
The article linked below describes the activation event and the strategies for receiving and responding to it.
In my experience, most customers use internal activation, so I will cover troubleshooting only for internal activation.
For more information, please check :- http://msdn.microsoft.com/en-us/library/ms171617(v=SQL.105).aspx
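With internal activation, SQL Server launches a stored procedure whenever messages arrive on the queue. A sketch of wiring it up (queue and procedure names are hypothetical):

```sql
-- Attach an activation procedure to the queue; SQL Server starts it
-- when messages arrive and can scale up to MAX_QUEUE_READERS copies.
ALTER QUEUE TargetQueue
    WITH ACTIVATION (
        STATUS = ON,
        PROCEDURE_NAME = dbo.usp_ProcessTargetQueue,
        MAX_QUEUE_READERS = 4,
        EXECUTE AS OWNER);
```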
There are a few very important catalog views and DMVs we have to be aware of:
sys.conversation_endpoints: This view shows the state of each conversation, e.g. CO (conversing), DI (disconnected inbound), etc. An entry remains here until the conversation is cleaned up.
sys.transmission_queue: When the initiator and target are in the same database, it carries only those messages that were routed to a service which doesn’t exist.
For a remote configuration (initiator and target on separate instances), a message stays here until it is acknowledged by the target, or until the conversation’s retention expires.
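When messages are piling up, a quick way to see where and why is to group the transmission queue by destination and status (the columns below are standard sys.transmission_queue columns):

```sql
-- Summarize stuck outbound messages by destination service and error text.
SELECT to_service_name,
       transmission_status,
       COUNT(*) AS stuck_messages
FROM sys.transmission_queue
GROUP BY to_service_name, transmission_status
ORDER BY stuck_messages DESC;
```

An empty transmission_status usually means the message left the initiator fine and the problem is on the target side.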
sys.service_queues: Use this to find out whether a queue has been disabled (e.g. by a poison message) and which activation stored procedure is attached. Check the is_receive_enabled and is_enqueue_enabled columns; if the value is 0, the queue is disabled. For the activation stored procedure, check the activation_procedure and is_activation_enabled columns.
Here is a very good MCM video by Bob: http://technet.microsoft.com/en-us/sqlserver/gg313771.aspx. It clarifies all the core Service Broker concepts.
We now have enough background to troubleshoot Service Broker issues. Let’s move on to the troubleshooting:
Troubleshooting : –
When Service Broker is configured correctly, the issues it causes can be very well hidden when they do appear. It is a bit like the movie Shooter, in which Bob Lee Swagger shoots someone from a long distance and, boom, someone drops. So, be careful.
The issues will look something like this. Out of memory:
2010-02-17 07:28:43.06 Server Failed allocate pages: FAIL_PAGE_ALLOCATION 1
2010-02-17 07:28:43.06 spid14s Error: 701, Severity: 17, State: 123.
2010-02-17 07:28:43.06 spid14s There is insufficient system memory to run this query.
When you drill down to the cause of the issue, you may find something like:
VM Reserved = 0 KB
VM Committed = 0 KB
AWE Allocated = 0 KB
SM Reserved = 0 KB
SM Committed = 0 KB
SinglePage Allocator = 1523880 KB –>Consuming most of the memory
MultiPage Allocator = 0 KB
I have mentioned the resolution of this issue in the article: – http://blogs.msdn.com/b/batala/archive/2010/11/23/sql-server-encounters-memory-pressure-and-generate-701-error-due-to-incorrect-configuration-of-service-broker.aspx
Tempdb running out of space: I was recently working on a scenario where we found system SPID 24 had been running something for the last 24 hours and was blocking other sessions.
As it was a system SPID, we couldn’t even trace it using a Profiler trace. We did some general troubleshooting and found the input buffer showed some stored procedure. There were no waits on the SPID.
Still no clue what was going on. We didn’t even know whether the customer was using Service Broker, and the customer was getting a little aggressive about it. Then, as always, my best friend SQLDumper came to the rescue. I used this blog to generate a stack dump: https://dbcouncil.net/2012/01/25/complete-explanation-of-sql-server-memory-dumps-and-its-types/ (the command was dbcc stackdump(-1)).
I took a memory dump of all the SPIDs and checked what was going on. We found that the thread for SPID 24 was a Service Broker thread running an activation stored procedure to process the messages on the queue. The procedure was running extremely slowly, so the messages were not getting processed and the queue was building up.
Snippet of thread stack:
Child-SP RetAddr Call Site
00000042`23c8a370 00000000`0098d789 sqlservr!LatchBase::AcquireInternal+0x7b
00000042`23c8a410 00000000`0098e1ca sqlservr!BUF::AcquireLatch+0x8d
00000042`23c8a730 00000000`0098e0a0 sqlservr!BPool::Get+0xc7
00000042`23c8a790 00000000`00a1152a sqlservr!PageRef::Fix+0xbc
00000042`23c8a7f0 00000000`00f4dd8a sqlservr!FixPageNotThruLinkage+0x11c
00000042`23c8a890 00000000`00f4d3d7 sqlservr!HeapPageManager::GetNextPage+0x1f9
00000042`23c8a940 00000000`009abb6f sqlservr!HeapDataSetSession::GetNextRowValuesInternal+0x551
00000042`23c8b380 00000000`009aba0b sqlservr!RowsetNewSS::FetchNextRow+0x14a
00000042`23c8b470 00000000`009e345b sqlservr!CQScanRowsetNew::GetRowWithPrefetch+0x47
00000042`23c8b4c0 00000000`00a11a0b sqlservr!CQScanTableScanNew::GetRow+0x8a
00000042`23c8b500 00000000`009b2224 sqlservr!CQScanNLJoinNew::GetRowHelper+0x1c2
00000042`23c8b540 00000000`009feac4 sqlservr!CQScanStreamAggregateNew::GetRowHelper+0x1c9
00000042`23c8b580 00000000`00a03dc7 sqlservr!CQScanTopNew::GetRow+0xf3
00000042`23c8b5e0 00000000`009e3835 sqlservr!CQScanUpdateNew::GetRow+0x6b
00000042`23c8b680 00000000`009e53a8 sqlservr!CQueryScan::GetRow+0x69
00000042`23c8b6b0 00000000`00a09ee9 sqlservr!CXStmtQuery::ErsqExecuteQuery+0x78a
00000042`23c8b840 00000000`00a09d42 sqlservr!CXStmtDML::XretDMLExecute+0x222
00000042`23c8b8e0 00000000`009bef92 sqlservr!CXStmtDML::XretExecute+0xa5
00000042`23c8b910 00000000`009c0b28 sqlservr!CMsqlExecContext::ExecuteStmts<1,1>+0x375
00000042`23c8ba20 00000000`009c02fc sqlservr!CMsqlExecContext::FExecute+0x983
00000042`23c8bba0 00000000`01e70190 sqlservr!CSQLSource::Execute+0x7b5
00000042`23c8bcd0 00000000`01e6fae6 sqlservr!CStmtExecProc::XretLocalExec+0x24c
00000042`23c8bd40 00000000`01e6dadd sqlservr!CStmtExecProc::XretExecExecute+0x416
00000042`23c8c3f0 00000000`009bef92 sqlservr!CXStmtExecProc::XretExecute+0x19
00000042`23c8c430 00000000`009c0b28 sqlservr!CMsqlExecContext::ExecuteStmts<1,1>+0x375
00000042`23c8c540 00000000`009c02fc sqlservr!CMsqlExecContext::FExecute+0x983
00000042`23c8c6c0 00000000`01e70190 sqlservr!CSQLSource::Execute+0x7b5
00000042`23c8c7f0 00000000`01e6fae6 sqlservr!CStmtExecProc::XretLocalExec+0x24c
00000042`23c8c860 00000000`01e6dadd sqlservr!CStmtExecProc::XretExecExecute+0x416
00000042`23c8cf10 00000000`009bef92 sqlservr!CXStmtExecProc::XretExecute+0x19
00000042`23c8cf50 00000000`009c0b28 sqlservr!CMsqlExecContext::ExecuteStmts<1,1>+0x375
00000042`23c8d060 00000000`009c02fc sqlservr!CMsqlExecContext::FExecute+0x983
00000042`23c8d1e0 00000000`015c5c0f sqlservr!CSQLSource::Execute+0x7b5
00000042`23c8d310 00000000`015c513b sqlservr!CSbActivationProcTask::Execute+0x7db
00000042`23c8e230 00000000`016279ee sqlservr!CSbActivationBaseTask::Invoke+0x2b
00000042`23c8e260 00000000`01627596 sqlservr!CSbTask::TaskStart+0x412
00000042`23c8f490 00000000`0097e7d8 sqlservr!CSbTask::BeginTaskStart+0x1a
00000042`23c8f4d0 00000000`0097e5b2 sqlservr!SOS_Task::Param::Execute+0x12a
00000042`23c8f5e0 00000000`0097e3f7 sqlservr!SOS_Scheduler::RunTask+0x96
00000042`23c8f640 00000000`00e9dd56 sqlservr!SOS_Scheduler::ProcessTasks+0x128
00000042`23c8f6b0 00000000`00e9e0b5 sqlservr!SchedulerManager::WorkerEntryPoint+0x2b6
00000042`23c8f790 00000000`00e9e31c sqlservr!SystemThread::RunWorker+0xcc
00000042`23c8f7d0 00000000`00e9df5a sqlservr!SystemThreadDispatcher::ProcessWorker+0x2db
00000042`23c8f880 00000000`74df37d7 sqlservr!SchedulerManager::ThreadEntryPoint+0x173
00000042`23c8f920 00000000`74df3894 msvcr80!_callthreadstartex+0x17
00000042`23c8f950 00000000`7750f33d msvcr80!_threadstartex+0x84
00000042`23c8f980 00000000`77642ca1 kernel32!BaseThreadInitThunk+0xd
00000042`23c8f9b0 00000000`00000000 ntdll!RtlUserThreadStart+0x1d
After tuning the stored procedure a little, the issue was resolved. And there is one more classic tempdb issue, which is explained in this blog: http://itknowledgeexchange.techtarget.com/sql-server/writeup-of-a-strange-service-broker-and-tempdb-problem/
In summary, we have to start with something like this:
1. Check the configuration of Service Broker. Find out how it is configured using these catalog views and DMVs:
print 'Get the databases where Broker is enabled'
select name, is_broker_enabled, service_broker_guid from sys.databases
print 'Queues configured for the service broker database'
select name, activation_procedure, is_activation_enabled, is_receive_enabled, is_enqueue_enabled from sys.service_queues
print 'Services configured for the service broker database'
select * from sys.services
print 'Routes configured for the service broker database'
select * from sys.routes
print 'Remote service bindings'
select * from sys.remote_service_bindings
print 'Top 1500 messages in flight to the target in the transmission queue - check transmission_status'
select top(1500) * from sys.transmission_queue
print 'Number of messages in flight to the target in the transmission queue'
select COUNT(*) from sys.transmission_queue
print 'Status of top 1500 conversation endpoints'
select top(1500) * from sys.conversation_endpoints
print 'Total number of conversation endpoints'
select COUNT(*) from sys.conversation_endpoints
print 'Certificates for the service broker'
select * from sys.certificates
print 'Endpoints configured for the service broker'
select * from sys.tcp_endpoints
print 'Tasks waiting and status of the queues'
select * from sys.dm_broker_queue_monitors
print 'Contracts and their usage'
select * from sys.service_contracts
select * from sys.service_contract_message_usages
select * from sys.service_contract_usages
You will have to run these queries in every database where Service Broker is enabled, or on an on-demand basis.
If you analyze the output of the above views, it will give you a clear understanding of what Service Broker is up to.
Once we know the issue, e.g. messages are not being processed or the queue has been disabled by a poison message (discussed in the later part of this blog), you will have to take the necessary actions.
The transmission_status column in sys.transmission_queue and the status column in sys.conversation_endpoints will give you a pointer to what’s going on. If you don’t get anything concrete, then we need to capture all the Service Broker trace events along with RPC:Starting and RPC:Completed, SP:StmtStarting and SP:StmtCompleted, and SP:Starting and SP:Completed, and then look for the errors to find out what’s going on.
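On newer builds you can capture a similar picture with Extended Events instead of Profiler. The broker event names below exist in SQL Server 2012 and later, but verify them against sys.dm_xe_objects on your build before relying on them:

```sql
-- First check which broker events your build actually exposes:
SELECT name
FROM sys.dm_xe_objects
WHERE object_type = 'event' AND name LIKE 'broker%';

-- A sketch of a session capturing queue-disable and activation failures;
-- the session and file names are illustrative.
CREATE EVENT SESSION BrokerTrace ON SERVER
    ADD EVENT sqlserver.broker_queue_disabled,
    ADD EVENT sqlserver.broker_activation_task_aborted
    ADD TARGET package0.event_file (SET filename = N'BrokerTrace');
ALTER EVENT SESSION BrokerTrace ON SERVER STATE = START;
```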
Below are the steps which you could consider to resolve the issue once the architecture is identified.
For troubleshooting initial configuration or dialog delivery issues, here is a snippet from a very nice blog written by Remus Rusanu, who is a champ at Service Broker:
1. The sender cannot send the message, for whatever reason. If this is the case, the transmission_status column in sys.transmission_queue will contain an error message that will point at the problem. The appropriate action depends on the error being displayed.
Common problems include security problems (no database master key, no remote service binding, no certificates etc.), classification problem (no route for the target service etc.) or adjacent transport connection issues (connection handshake errors, unreachable target host etc.)
2. The sender does send the message and the message reaches the target but the target does not accept the message. In this case, the sender’s transmission_status will be empty. To diagnose this issue, you must attach the Profiler to the target machine and enable the following events: ‘Broker/Broker:Conversation’, ‘Broker/Broker:Message Undeliverable’ and ‘Broker/Broker:Remote Message Acknowledgement’. When the message arrives, you will see the event ‘Broker/Broker:Remote Message Acknowledgement‘ with the EventSubClass ‘Message with Acknowledgement Received‘ followed by ‘Broker/Broker:Message Undeliverable‘ event. The TextData of this last event will contain an error message that will point at the problem.
Common problems in this case are security problems (you must additionally turn on the ‘Audit Security/Audit Broker Conversation‘ event in the Profiler to investigate these; the TextData should pinpoint the failure cause), typos in service names or the broker instance id, or disabled target queues.
To read more, please check this blog: http://rusanu.com/2005/12/20/troubleshooting-dialogs/
Once the issues related to initial configuration are sorted out, there is a very famous issue: the poisoned (disabled) queue.
We all understand queues, but what is this poisoned queue?
Well, when a message is received from the queue and the receiving transaction rolls back five times in a row, poison message handling kicks in and the queue is disabled. From then on, whenever a message reaches the queue it won’t be processed, i.e. the activation procedure won’t run.
Whenever you see that the queue has suddenly started to build up, run this query:
select name,is_enqueue_enabled,is_receive_enabled,is_activation_enabled,activation_procedure from sys.service_queues
If the queue has been disabled, the is_enqueue_enabled and is_receive_enabled columns will show 0, which indicates a poison message has disabled the queue. Now that we know the queue is disabled, what could be the reason? Sometimes the activation stored procedure fails to process a message sent by the initiator, and if that happens five times in a row, the queue is disabled.
To find the cause of the issue, we need to capture all the Service Broker trace events along with RPC:Starting and RPC:Completed, SP:StmtStarting and SP:StmtCompleted, and SP:Starting and SP:Completed. Then we will have to enable the queue again by running:
alter queue <queue_name> with status = on
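To keep a bad message from disabling the queue again, the activation procedure can catch processing errors and park the offending message instead of rolling back repeatedly. A sketch of that pattern follows; the procedure, handler, queue, and dead-letter table names are all hypothetical:

```sql
CREATE PROCEDURE dbo.usp_ProcessTargetQueue
AS
BEGIN
    DECLARE @h UNIQUEIDENTIFIER, @mt SYSNAME, @body VARBINARY(MAX);
    WHILE (1 = 1)
    BEGIN
        BEGIN TRANSACTION;
        -- Wait up to 5 seconds for a message, then exit if the queue is empty.
        WAITFOR (RECEIVE TOP (1)
                     @h    = conversation_handle,
                     @mt   = message_type_name,
                     @body = message_body
                 FROM TargetQueue), TIMEOUT 5000;
        IF @@ROWCOUNT = 0
        BEGIN
            ROLLBACK TRANSACTION;
            BREAK;
        END;
        BEGIN TRY
            EXEC dbo.usp_HandleMessage @mt, @body;  -- hypothetical real processing
            COMMIT TRANSACTION;
        END TRY
        BEGIN CATCH
            -- Dead-letter the message instead of rolling back 5 times in a row,
            -- which would disable the queue.
            INSERT INTO dbo.BrokerDeadLetters (conversation_handle, message_type, body, error_text)
            VALUES (@h, @mt, @body, ERROR_MESSAGE());
            END CONVERSATION @h WITH ERROR = 1 DESCRIPTION = 'processing failed';
            COMMIT TRANSACTION;
        END CATCH;
    END;
END;
```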
To remove the poison messages, please refer to: http://msdn.microsoft.com/en-us/library/ms166137(v=sql.105).aspx
Further reading:
· SQL Server Service Broker Team Blog
· SQLCAT Tech Note on SSB DB Identity Uniqueness
· Reusing Conversation
· Recycling Conversation
· Resending Messages
· Dynamic Routing
· Service List Manager
· Troubleshooting dialogs
· Error Handling