dbTalk Databases Forums  

Failover doesn't work properly with network problems.

microsoft.public.sqlserver.clustering


#1 Chris Carmichael - 01-22-2004, 01:56 PM

Hello All,

We have a SQL cluster in active/passive configuration. Unfortunately, our
network has been having issues. More specifically, the line protocol on the
Cisco 3550 that the primary node on this cluster is connected to seems to
drop off temporarily from time to time.

So the problem is that when this happens, failover does not work. It hangs
up on the first node. When this happens, you can't connect to Cluster
Administrator to see what is going on. Yet if you physically turn off the
primary node, then failover will complete. We tried testing failover by
unplugging the network cable from the primary node. When you do this, the
failover happens without incident.

Our take on it is that clustering will only work if you have a failure on
level 1 of the OSI layers?!?! That hardly seems right for a system as robust
as SQL. Yet a level one failure kicks off the failover perfectly. Then if
we lose the line protocol on the switch (which would be layer 3-4), it
doesn't. So it appears as though the network connection is up to the
server, but you can't communicate with it on the public side. Our heartbeat
connection is simply a crossover cable, so I don't think that is the
problem.
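
One way to tell "the link is up" apart from "the server is reachable" is to
probe the service from another machine on the public network. A minimal
Python sketch, assuming the SQL virtual server accepts TCP connections. The
function name tcp_reachable and whatever host and port you pass in are
illustrative only, not part of any cluster tooling:

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    This tests end-to-end reachability of a listening service. It says
    nothing about physical link state on either end.
    """
    try:
        # create_connection resolves the name and completes the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # covers refused connections, timeouts, and name resolution failures
        return False
```

Run from a client on the public side against the virtual server's name and
SQL port (1433 by default, adjust for your instance), this should return
False while the switch port is misbehaving, even if the NIC on the node
still shows a link as described above.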

We have been pulling out our hair on this for 2 weeks. Does anyone here
have any suggestions? Or is there a way to force the cluster to fail over
EVERYTHING if any one resource dies?

Thank you!

Chris




#2 Geoff N. Hiten - 01-22-2004, 03:12 PM

Do any of the following help?
http://support.microsoft.com/default.aspx?scid=kb;EN-US;242600

http://support.microsoft.com/default.aspx?scid=kb;EN-US;176320

http://support.microsoft.com/default.aspx?scid=kb;en-us;286342&Product=winsvr2003

You didn't specify the host OS, so I included links for 2000 and 2003. If
you are on NT4, upgrade it now.

My take is that the LooksAlive and IsAlive tests fail, forcing the failover,
but the IP address is still alive on the net, preventing the virtual server
from coming up on the second node. If this is the case, you should see
duplicate IP address errors in the second node's event log.
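
For anyone unfamiliar with the two tests: LooksAlive is the cheap, frequent
check and IsAlive is the thorough, less frequent one. The Python sketch
below only illustrates that two-tier pattern. It is not how the Cluster
service is implemented; the class name, the intervals, and the rule that a
failed cheap check is confirmed by the thorough one are assumptions made
for the example:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TwoTierMonitor:
    """Hypothetical two-tier health monitor (cheap check plus thorough check)."""

    looks_alive: Callable[[], bool]      # cheap check, run often
    is_alive: Callable[[], bool]         # thorough check, run rarely
    looks_alive_interval: float = 5.0    # seconds, assumed value
    is_alive_interval: float = 60.0      # seconds, assumed value
    _last_looks: float = float("-inf")
    _last_is: float = float("-inf")

    def poll(self, now: float) -> str:
        """Return 'ok', 'failed', or 'idle' for the tick at time `now`."""
        if now - self._last_is >= self.is_alive_interval:
            # thorough check is due; it also counts as a cheap check
            self._last_is = now
            self._last_looks = now
            return "ok" if self.is_alive() else "failed"
        if now - self._last_looks >= self.looks_alive_interval:
            self._last_looks = now
            if self.looks_alive():
                return "ok"
            # cheap check failed: confirm with the thorough check
            self._last_is = now
            return "ok" if self.is_alive() else "failed"
        return "idle"
```

The point for this thread: if the cheap check only looks at something like
link state, a node whose NIC still has a link but cannot be reached on the
public side keeps passing until a thorough, end-to-end check runs.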

--
Geoff N. Hiten
Microsoft SQL Server MVP
Senior Database Administrator
Careerbuilder.com




#3 Chris Carmichael - 01-22-2004, 07:21 PM

Thanks for the input. I looked at these scenarios and nothing seems to help.
In looking at the event logs, it looks as though the second node tried
unsuccessfully to take control of the disk array 6 times, then turned off
its cluster service. Since the IP address had already failed over, that
left the server in a half-failover state. This does not happen if we just
unplug the network cable though?!?! Any other ideas?

Thank You again,

Chris

#4 Geoff N. Hiten - 01-23-2004, 09:09 AM

What is your hardware and OS config? I am aware of at least one combination
that has problems releasing the SCSI reservation when there are
communication issues.

--
Geoff N. Hiten
Microsoft SQL Server MVP
Senior Database Administrator
Careerbuilder.com




#5 Dan Johnson - 01-23-2004, 11:34 AM

Chris, I am having the same problem: a SQL2000/Win2000 server config with a
Cisco 6500 switch. I have 3 other clusters with the same config NOT having
an issue. Diags on the primary cluster node show no issue with the NICs.
The switch is showing no alignment errors or dropped packets to these ports.
I do believe that this is a network issue, and I am going to fail over to
the second node next weekend and run there for a while to rule out the
server.
Drop me an email if you have any luck!
djohnson (AT) shc (DOT) org

#6 Chris Carmichael - 01-23-2004, 07:23 PM

We are running the cluster on two PowerEdge 6650 servers, with quad
processors and 8 GB of RAM each. They both connect to a Dell PowerVault
220S. The OS is Windows 2000 and the database is SQL 2000.

#7 Geoff N. Hiten - 01-24-2004, 09:20 PM

Hmmm. SCSI Cluster. Not my favorite config, but you work with what you
got.

Obvious question #1: was this put together by Dell and certified as a valid
cluster? If not, at least tell me that all the parts, ESPECIALLY the PERC
cards, are on the Cluster HCL. If not, you may have found out why it didn't
make the list.

You may want to open a case with PSS if the cluster is certified. This is
beginning to sound like it may be more than a newsgroup help situation.
Note, if it is not a certified cluster, PSS likely won't touch it.

--
Geoff N. Hiten
Microsoft SQL Server MVP
Senior Database Administrator
CareerBuilder.com

