#1
#2
I'm seeing some issues on a two-node cluster (Node and Disk Majority quorum). Failover works GORGEOUSLY. No problems. The groups go back and forth like a volleyball game. But when I shut one of the nodes down, the node that's gone down won't give up its disk reservations to the remaining node. As a result, the group fails over, but SQL can't come up on the remaining node because that node can't bring the failed instance's drives online. There's something about shutting down . . .

The first time this happened it drove us N.U.T.S. It led us to evict the node that couldn't get the drives back, and that led to a total cluster rebuild over the New Year's Eve weekend. As we learned the second time it happened, all we would have needed to do the first time was reboot the node that had locked up the drives. The second time, we checked through all the iSCSI details and reached that conclusion about the reboot.

BOTH TIMES the shutdown was supposedly orderly. The first time, I shut one node down so I could set Terminal Services to disabled and boot without it. The instance failed over properly to the remaining node, but it couldn't come back up. The second time, I just shut a node down instead of logging off. (Hey, don't tell me you've never done this in your life.) My understanding is that whether the shutdown is deliberate or fault-related, the group should fail over. Have you ever heard of anything like this?

The first time this happened, the cluster was running on Fibre Channel, but we had just moved the witness disk to the iSCSI SAN we were migrating to. We had the data on FC, and the MSDTC and witness disks on iSCSI. That time, a run of the validation wizard showed the witness disk was the problem child that couldn't be brought online. The simultaneous availability check on W: failed after my reboot. Sadly, I don't remember exactly which node owned it when I shut down.

The second time, the problem wasn't with the witness disk. One of our SQL instance's Data, Temp, and Master volumes couldn't come online.

BTW, the nodes are running W2K8 SP2 and SQL2K8 SP1, and are connected to an EqualLogic iSCSI SAN. ALSO, MAYBE important: all our data locations are drives mounted on small parent volumes.
#3
Sounds like a glitch in the drivers or in the SAN firmware. Are you sure you have the SAN set to allow multiple connections to the target LUNs?

As far as using "Anchor" LUNs and mount points, I do that as a matter of course. Make sure the mount point volumes are dependent on the Anchor LUN and that SQL is dependent on all LUNs (anchor and mount point) so that everything comes online in the correct sequence.
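
To make the sequencing point concrete, here is a minimal Python sketch of how a dependency graph fixes the order in which resources come online. The resource names are hypothetical, and the code only models the ordering; it does not call any Windows clustering API or reflect how the cluster service is actually implemented.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical resource names. Each key maps to the set of resources
# that must already be online before the key itself can come online.
DEPENDENCIES = {
    "Anchor LUN": set(),
    "Data mount point": {"Anchor LUN"},
    "Temp mount point": {"Anchor LUN"},
    "Master mount point": {"Anchor LUN"},
    "SQL Network Name": set(),
    "SQL Server": {
        "Anchor LUN",
        "Data mount point",
        "Temp mount point",
        "Master mount point",
        "SQL Network Name",
    },
}


def online_order(dependencies):
    """Return one valid bring-online order: dependencies always come first."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for step, resource in enumerate(online_order(DEPENDENCIES), start=1):
        print(step, resource)
```

With every disk listed as a dependency of SQL Server, the anchor always precedes its mount points and SQL Server always comes last. If SQL Server depended only on the anchor and the network name, nothing in the graph would stop it from starting before a mount point volume was online.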

You are correct, the group should fail over on a host node shutdown whether the shutdown is controlled or not. A couple of questions. Are the iSCSI connections on dedicated NICs? If so, are these iSCSI NICs or ordinary network NICs?

If iSCSI is on separate NICs, are all the other protocols disabled on those NICs?
#4
Geoff, thanks for your response. Please see my responses inline. I look forward to testing out different configurations to combat this problem as downtime allows. Downtime is scarce at all times, but maybe if we come up with a plan we can test out a change.

On 1/20/10 1:44 PM, in article OTAaoCgmKHA.4312 (AT) TK2MSFTNGP05 (DOT) phx.gbl, "Geoff N. Hiten" <SQLCraftsman (AT) gmail (DOT) com> wrote:

> Sounds like a glitch in the drivers or in the SAN firmware. Are you sure you have the SAN set to allow multiple connections to the target LUNs?

Yes, that was the first thing I thought of.

> As far as using "Anchor" LUNs and mount points, I do that as a matter of course. Make sure the mount point volumes are dependent on the Anchor LUN and that SQL is dependent on all LUNs (anchor and mount point) so that everything comes online in the correct sequence.

I read somewhere this wasn't strictly necessary with W2K8 and SQL2K8. In the previous generation, with W2K3 and SQL2K5, I had made sure that SQL depended on every single drive. Was it in Books Online that I was advised one needn't obsess about SQL dependencies? I think I read something to the effect of "Make sure SQL depends on the name and the parent disk, and then Windows clustering will do the rest." If this is not true, adding the dependencies may be a quick fix.

> You are correct, the group should fail over on a host node shutdown whether the shutdown is controlled or not. A couple of questions. Are the iSCSI connections on dedicated NICs? If so, are these iSCSI NICs or ordinary network NICs?

Yes, they are on dedicated NICs. They are not strictly iSCSI NICs, but their drivers advertise an iSCSI capability, not merely TOE. They are HP NC360Ts.

> If iSCSI is on separate NICs, are all the other protocols disabled on those NICs?

Hmmm. What do you mean exactly? We're standardized on IPv4. I could unbind IPv6. But I think I need to keep the HP Network Configuration Utility installed to set the cards to use jumbo frames and flow control.

Thanks for getting back so quickly.