Hung scheduler and heartbeat question

microsoft.public.sqlserver.clustering


frankm
Hung scheduler and heartbeat question - 12-30-2003, 08:00 AM

Scenario:
SQL Server 2000 SP3a, Windows 2000 Advanced Server, 2-node cluster
(active/passive).

If the active node experiences a
"Error 17883 - Process 0:0 (490) UMS Context 0x12AEB9A0 appears to be
non-yielding on Scheduler 0."
condition and SQL Server stops or severely slows its responses to users, is
it possible that the SQL Server heartbeat (which I believe is an @@VERSION
call or similar) could still be succeeding while users are experiencing
problems, so that the instance appears dead to users but never fails over?

My guess is that SQL Server could still respond to the heartbeat call while
users are having response problems.

Frankm



Yuan Shao
RE: Hung scheduler and heartbeat question - 12-30-2003, 10:13 PM

Hi Frankm,

My name is Michael, and I would like to thank you for using the Microsoft
newsgroups.

As I understand it, your active node experienced a 17883 error, and you want
to know whether the active node might not be regarded as dead even though
its performance is very poor. If I have misunderstood, please feel free to
let me know.

Based on my research, error 17883 is a new error that was added in SQL
Server 2000 Service Pack 3. It is seen very often in SQL Server Support.
Error 17883 basically means that a scheduler in SQL Server might have
stopped responding. It most likely indicates a performance problem on the
active node.

As far as I know, the situation you described is possible. From a SQL Server
perspective, the node hosting the SQL Server resource does a LooksAlive
check every 5 seconds. This is a lightweight check to see whether the
service is running, and it may succeed even if the instance of SQL Server is
not operational. The IsAlive check is more thorough and involves running a
SELECT @@SERVERNAME Transact-SQL query against the server to determine
whether the server itself is available to respond to requests; it does not
guarantee that the user databases are up. If this query fails, the IsAlive
check retries five times and then attempts to reconnect to the instance of
SQL Server. If all five retries fail, the SQL Server resource fails.
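
To make the difference between the two checks concrete, here is a minimal Python sketch of the behavior described above. It is only a toy model, not the real cluster resource code: `Instance`, `QueryFailed`, `looks_alive`, `is_alive`, and the `run_query` callable are hypothetical names chosen for illustration, and counting one initial attempt plus five retries is an assumption based on the description above.

```python
from dataclasses import dataclass
from typing import Callable


class QueryFailed(Exception):
    """Raised by the hypothetical query function when the server does not answer."""


@dataclass
class Instance:
    """Toy stand-in for a clustered SQL Server instance (hypothetical)."""
    service_running: bool
    # Assumption: run_query raises QueryFailed on an error or timeout.
    run_query: Callable[[str], str]


def looks_alive(instance: Instance) -> bool:
    # Lightweight check: only asks whether the service is running.
    return instance.service_running


def is_alive(instance: Instance, retries: int = 5) -> bool:
    # Thorough check: one initial attempt plus `retries` further attempts.
    # The trivial query must actually return for the check to pass.
    for _ in range(1 + retries):
        try:
            instance.run_query("SELECT @@SERVERNAME")
            return True
        except QueryFailed:
            continue
    return False


def hung(_sql: str) -> str:
    # Models an instance that never answers the heartbeat query.
    raise QueryFailed("no response")


def degraded(_sql: str) -> str:
    # Models an instance that is slow for users but still answers a trivial query.
    return "NODE1"
```

In this model, `Instance(True, hung)` passes `looks_alive` but fails `is_alive`, while `Instance(True, degraded)` passes both. The second case is the one asked about: users can be suffering while the heartbeat query still returns, so no failover is triggered.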

However, because this is a performance problem, failing over from one node
to the other usually will not solve it. I recommend finding the root cause
of error 17883 on the active node rather than relying on failover.

Because of the 17883 error message, please check whether you have applied
Microsoft Security Bulletin MS03-031. If not, I would like you to apply it
first.

For additional information about how to obtain that security patch from the
Microsoft Download Center, click the following article numbers to view the
articles in the Microsoft Knowledge Base:

815495 MS03-031: Cumulative Security Patch for SQL Server
http://support.microsoft.com/?id=815495

821277 MS03-031: Security Patch for SQL Server 2000 Service Pack 3
http://support.microsoft.com/?id=821277

Note: Make sure that you read the security bulletin articles completely and
thoroughly before you apply the security patch. This hotfix must be applied
on each client computer that is experiencing the problem.

Also, due to the complexity of this issue, it would be best to contact
Microsoft Product Support Services by telephone so that a dedicated
Support Professional can assist with your request. To obtain the phone
numbers for a specific technology request, please take a look at the web
site listed below.
http://support.microsoft.com/default...S;PHONENUMBERS

Thanks for using the Microsoft newsgroups.

Regards,

Michael Shao
Microsoft Online Partner Support
Get Secure! - www.microsoft.com/security
This posting is provided "as is" with no warranties and confers no rights.


frankm
Re: Hung scheduler and heartbeat question - 12-31-2003, 11:03 AM

Thank you. So I can still get a heartbeat while the database is inaccessible
from a user standpoint.

I will check to see if MS03-031 was applied, and schedule it if it was not.

We applied the KB810185 patch on Sunday, and on Monday we had this problem.
Could there be a connection?

We would like to find the root cause, but I have not seen anything that would
help track down where to look. When this happened, outside connections to
SQL Server seemed to be fine, while at the same time users on the box saw
diminishing performance over a 2-hour period. Then SQL Server and the OS
seemed to stop responding, and at that point the box was rebooted. All we
have is the single 17883 error as the last message in the SQL Server
errorlog, plus various timeout-type errors from apps and processes in the
event log. From here it is anyone's guess: no stack dumps, no log entries to
point at anything, just slow and diminishing performance until the end.






Yuan Shao
Re: Hung scheduler and heartbeat question - 01-02-2004, 01:50 AM

Hi Frankm,

Thanks for your feedback. In my experience, a performance problem that comes
with error 17883 usually cannot be solved by failover. I recommend narrowing
down the issue and solving the problem on the active node.

In SQL Server 2000, error 17883 occurs when a scheduler is not yielding,
that is, not passing control from one thread to another. Such a situation is
most likely, but not always, a bug in SQL Server 2000. It is most likely a
bug because a thread is supposed to yield the scheduler every so often.
However, it may not always be the result of a bug, because the cause can
also be an external application's CPU consumption or a hardware failure.
Therefore, it is hard to say whether the hotfix (KB810185) is related to
this performance problem until we have done further research.
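
As a rough illustration of why a non-yielding worker matters under cooperative scheduling, here is a small Python sketch. It is a toy model built on generators, not how SQL Server's UMS scheduler or its 17883 detection is actually implemented: the names `well_behaved`, `hog`, and `run`, the notion of "work units", and the `max_units` threshold are all assumptions made up for the example.

```python
from collections import deque
from typing import Deque, Generator, List, Tuple

# Each yield reports the work units a task used since its previous yield.
Task = Generator[int, None, None]


def well_behaved(slices: int) -> Task:
    for _ in range(slices):
        yield 1  # does a little work, then hands the scheduler back


def hog(units: int) -> Task:
    yield units  # does all of its work in one stretch before yielding


def run(tasks: List[Tuple[str, Task]], max_units: int = 10) -> List[str]:
    """Round-robin over tasks and return the names of tasks that
    used more than max_units between two yields."""
    queue: Deque[Tuple[str, Task]] = deque(tasks)
    offenders: List[str] = []
    while queue:
        name, task = queue.popleft()
        try:
            used = next(task)  # nothing else runs until this task yields
        except StopIteration:
            continue  # task finished; drop it from the queue
        if used > max_units:
            offenders.append(name)
        queue.append((name, task))
    return offenders
```

Calling `run([("a", well_behaved(3)), ("b", hog(500))])` flags only `"b"`. While `"b"` holds the scheduler, `"a"` cannot make progress, which in this toy model corresponds to the slow or stalled responses users see when one worker does not yield.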

Due to the complexity of this kind of issue, I strongly suggest that you
contact Microsoft Product Support Services by telephone so that a dedicated
Support Professional can assist with your request. To obtain the phone
numbers for a specific technology request, please take a look at the web
site listed below.
http://support.microsoft.com/default...S;PHONENUMBERS

Regards,

Michael Shao
Microsoft Online Partner Support
Get Secure! - www.microsoft.com/security
This posting is provided "as is" with no warranties and confers no rights.

