#2
Hello All,

We have a SQL cluster in an active/passive configuration. Unfortunately, our network has been having issues. More specifically, the line protocol on the Cisco 3550 that the primary node of this cluster is connected to drops temporarily from time to time.

The problem is that when this happens, failover does not work. It hangs on the first node, and you can't connect to Cluster Administrator to see what is going on. Yet if you physically turn off the primary node, failover completes. We also tested failover by unplugging the network cable from the primary node, and in that case failover happens without incident.

Our take on it is that clustering will only work if you have a failure at layer 1 of the OSI model?!?! That hardly seems right for a system as robust as SQL. A layer 1 failure kicks off the failover perfectly, but if we lose the line protocol on the switch (which would be layer 3-4), it doesn't. So it appears that the network connection to the server is up, but you can't communicate with it on the public side. Our heartbeat connection is simply a crossover cable, so I don't think that is the problem.

We have been pulling our hair out over this for 2 weeks. Does anyone here have any suggestions? Or is there a way to force the cluster to fail over EVERYTHING if any one resource dies?

Thank you!
Chris
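The contrast described in the post above (cable unplugged versus line protocol down) comes down to two different signals: whether the link is up, and whether traffic actually gets through. The following is a minimal Python sketch of how those two signals could be combined in a monitoring script. The helper names `tcp_reachable` and `classify` and the labels they return are assumptions made for this illustration; they are not part of the cluster software.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def classify(link_up: bool, reachable: bool) -> str:
    """Combine a link-state signal with an end-to-end probe result.

    The labels are hypothetical and chosen for this illustration only.
    """
    if not link_up:
        return "link down"            # e.g. network cable unplugged
    if not reachable:
        return "link up, no traffic"  # e.g. port up but nothing gets through
    return "healthy"
```

A check that only looks at link state would report the second case as fine, which matches the symptom described: the connection looks up to the server, but nothing can reach it on the public side.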
#3
Do any of the following help?

http://support.microsoft.com/default.aspx?scid=kb;EN-US;242600
http://support.microsoft.com/default.aspx?scid=kb;EN-US;176320
http://support.microsoft.com/default.aspx?scid=kb;en-us;286342&Product=winsvr2003

You didn't specify the host OS, so I included links for 2000 and 2003. If you are on NT4, upgrade it now.

My take is that the LooksAlive and IsAlive tests fail, forcing the failover, but the IP address is still alive on the net, preventing the virtual server from coming up on the second node. If this is the case, you should see duplicate IP address errors in the second node's event log.

--
Geoff N. Hiten
Microsoft SQL Server MVP
Senior Database Administrator
Careerbuilder.com
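The LooksAlive and IsAlive tests mentioned in the reply above can be pictured as a cheap, frequent poll backed by a thorough, less frequent one. The sketch below is a toy model in Python, not the cluster service's implementation. The class name `ResourceMonitor`, the two callables, and the 5-second and 60-second intervals are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ResourceMonitor:
    looks_alive: Callable[[], bool]  # cheap check, run often
    is_alive: Callable[[], bool]     # thorough check, run less often
    looks_alive_every: int = 5       # seconds (assumed interval)
    is_alive_every: int = 60         # seconds (assumed interval)

    def first_failure(self, horizon: int) -> Optional[int]:
        """Simulate polling for `horizon` seconds.

        Returns the second at which a check first fails, or None if
        every check passes within the horizon.
        """
        for t in range(1, horizon + 1):
            if t % self.is_alive_every == 0:
                ok = self.is_alive()
            elif t % self.looks_alive_every == 0:
                ok = self.looks_alive()
            else:
                continue  # no check scheduled this second
            if not ok:
                return t  # this is where a failover would be triggered
        return None


# A fault that only the thorough check notices is detected later
# than one the cheap check can see.
subtle = ResourceMonitor(looks_alive=lambda: True, is_alive=lambda: False)
obvious = ResourceMonitor(looks_alive=lambda: False, is_alive=lambda: False)
healthy = ResourceMonitor(looks_alive=lambda: True, is_alive=lambda: True)
```

In this model a failed check only says that a failover should start. Whether the resources can then come online on the other node is a separate question, which is the point of the duplicate IP address theory.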
#4
Thanks for the input. I looked at these scenarios and nothing seems to help. Looking at the event logs, it appears that the second node tried unsuccessfully to take control of the disk array 6 times, then shut down its cluster service. Since the IP address had already failed over, that left the server in a half-failed-over state. This does not happen if we just unplug the network cable, though?!?!

Any other ideas?

Thank you again,
Chris
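The half-failover state described in the reply above can be reproduced in a toy model: resources are brought online one at a time, each with a bounded number of attempts, and the move stops at the first resource that never comes online. This is a sketch under assumptions, not how the cluster service is implemented. The function `bring_group_online`, the resource names, and the retry limit of 6 (taken from the count seen in the event log) are illustrative only.

```python
from typing import Callable, Dict, List, Tuple


def bring_group_online(
    resources: List[Tuple[str, Callable[[], bool]]],
    max_attempts: int = 6,
) -> Dict[str, str]:
    """Try to bring each resource online in order.

    Each resource gets up to `max_attempts` tries. If one never comes
    online it is marked "failed" and the remaining resources are left
    "offline", which models a group stuck halfway through a failover.
    """
    state = {name: "offline" for name, _ in resources}
    for name, try_online in resources:
        for _ in range(max_attempts):
            if try_online():
                state[name] = "online"
                break
        else:
            # All attempts were used up without success.
            state[name] = "failed"
            break  # give up on the rest of the group
    return state


attempts = []


def disk_held_elsewhere() -> bool:
    """Hypothetical disk that the other node never lets go of."""
    attempts.append(1)
    return False


state = bring_group_online([
    ("IP Address", lambda: True),
    ("Disk", disk_held_elsewhere),
    ("SQL Server", lambda: True),
])
print(state)
```

In this model the IP address has already moved by the time the disk gives up, so clients see the address answer on a node that cannot serve the database.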
#6
What is your hardware and OS config? I am aware of at least one combination that has problems releasing the SCSI reservation when there are communication issues.

--
Geoff N. Hiten
Microsoft SQL Server MVP
Senior Database Administrator
Careerbuilder.com
#7
We are running the cluster on two Dell PowerEdge 6650 servers, with quad processors and 8 GB of RAM each. Both connect to a Dell PowerVault 220S. The OS is Windows 2000 and the database is SQL 2000.