dbTalk Databases Forums  

Extremely high % Disk Time .. How is this possible??

microsoft.public.sqlserver.clustering microsoft.public.sqlserver.clustering


Discuss Extremely high % Disk Time .. How is this possible?? in the microsoft.public.sqlserver.clustering forum.



Reply
 
Thread Tools Display Modes
  #1  
Old   
Tim
 
Posts: n/a

Default Extremely high % Disk Time .. How is this possible?? - 02-09-2005 , 10:59 AM






Hi Group

SYSTEM:

Clustered SQL Server 2000 on Windows 2000 (2 servers), with RAID 1+0, 6
drives (Total of 24 20GB hard drives with mirroring/striping). 4 GB RAM /
Server. All fiber back end, with Fibre Channel Storage Hub 7 ( @ 100
MB/sec ). Our SQL servers have fiber cards @ 400 MB/sec connected to hub.
About 50 DBases, ranging in size from about 50-100 MB up to 6 GB. They are
shared by 15 clients (they are fairly balanced over a 24 hour period, when 1
client is busy, the other isn't). The total size of dbases is ~ 65 GB, Log
File ~ 35 GB. 100-250 connected users (to our application), with 300-500
SQL connections. Dual XEON 2.8Ghz Processors / Server.

PROBLEM:

We are experiencing disk problems but are having a very difficult time
diagnosing 'exactly' what is the cause and solution. On the data drives,
thru PerfMon, the % Disk Times are averaging between 250 and 1500, with
peaks of over 30,000 (How do I get 30,000 % disk use??). Average disk
queues are setting at about 100-250 during the problem times, but
occasionally see it spike well over 1000. The transactions/sec hover
between 50-100. Processor hovers between 10 and 40%, spiking to 70-80%.
The disk bytes/sec for the data drives average betwwen 1 and 6 MBytes/sec.

I know there are a lot of variables that can be at play here, but where
should I start? Is the system overall too small? Over worked?? We did have
memory issues but increased the RAM and there no longer appears to be buffer
cache problems and we have not yet exceeded the available RAM. We have
removed most of our temp tables (in favor of table variables) and removed a
lot of the cursors that were in the system. We moved our reporting server
to it's own box (completely seperate from prod servers). We reduced the log
file backup intervals and staggered our database backup jobs. We converted
portions of our reporting to Analysis services. All with short term
marginal improvements. We have to speed our system up before we start
losing clients. Does it sound/appear as though the next option would be to
scale out? scale up? Goto a SAN environment? And info and advise would be
greatly appreciated!

If you got this far, thanks for the time in reading this long post!!! And am
anxious to read any responses. Thank you.



Reply With Quote
  #2  
Old   
Andrew J. Kelly
 
Posts: n/a

Default Re: Extremely high % Disk Time .. How is this possible?? - 02-09-2005 , 12:16 PM






It sure sounds like your storage array is not able to handle the load.
Having a Raid 1+0 is great but only having 6 drives is on the low end.
Since you are not on a SAN you only have the cache on the controllers
themselves to help deal with all the writes. Make sure the cache on the
controllers is set to 100% write back and not read. It also sounds like you
only have a single drive array and everything is on it (logs , data and
tempdb). That also is not good. You should have at least a Raid1 or
Raid1+0 for just the log files. The spikes can be caused by lots of things
but you are probably seeing some of this due to checkpoints. You can use
the Checkpoint pages per second perfmon counter to correlate this to the
disk queue spikes. You might also try using connection pooling instead of
that many individual connections. You will need multiple pools to handle
the different dbs but it should cut down the actual number of connections
quite a bit.Have you ran any traces to see what kind of queries are being
run and who the worst offenders are? You may have a few really poorly tuned
queries that are hurting everything else.


--
Andrew J. Kelly SQL MVP


"Tim" <tim (AT) pds (DOT) notanaddress.ca> wrote

Quote:
Hi Group

SYSTEM:

Clustered SQL Server 2000 on Windows 2000 (2 servers), with RAID 1+0, 6
drives (Total of 24 20GB hard drives with mirroring/striping). 4 GB RAM /
Server. All fiber back end, with Fibre Channel Storage Hub 7 ( @ 100
MB/sec ). Our SQL servers have fiber cards @ 400 MB/sec connected to hub.
About 50 DBases, ranging in size from about 50-100 MB up to 6 GB. They
are
shared by 15 clients (they are fairly balanced over a 24 hour period, when
1
client is busy, the other isn't). The total size of dbases is ~ 65 GB,
Log
File ~ 35 GB. 100-250 connected users (to our application), with 300-500
SQL connections. Dual XEON 2.8Ghz Processors / Server.

PROBLEM:

We are experiencing disk problems but are having a very difficult time
diagnosing 'exactly' what is the cause and solution. On the data drives,
thru PerfMon, the % Disk Times are averaging between 250 and 1500, with
peaks of over 30,000 (How do I get 30,000 % disk use??). Average disk
queues are setting at about 100-250 during the problem times, but
occasionally see it spike well over 1000. The transactions/sec hover
between 50-100. Processor hovers between 10 and 40%, spiking to 70-80%.
The disk bytes/sec for the data drives average betwwen 1 and 6 MBytes/sec.

I know there are a lot of variables that can be at play here, but where
should I start? Is the system overall too small? Over worked?? We did
have
memory issues but increased the RAM and there no longer appears to be
buffer
cache problems and we have not yet exceeded the available RAM. We have
removed most of our temp tables (in favor of table variables) and removed
a
lot of the cursors that were in the system. We moved our reporting server
to it's own box (completely seperate from prod servers). We reduced the
log
file backup intervals and staggered our database backup jobs. We
converted
portions of our reporting to Analysis services. All with short term
marginal improvements. We have to speed our system up before we start
losing clients. Does it sound/appear as though the next option would be
to
scale out? scale up? Goto a SAN environment? And info and advise would
be
greatly appreciated!

If you got this far, thanks for the time in reading this long post!!! And
am
anxious to read any responses. Thank you.





Reply With Quote
  #3  
Old   
Tim
 
Posts: n/a

Default Re: Extremely high % Disk Time .. How is this possible?? - 02-10-2005 , 07:50 AM



I do have 5 logical drives, 1 for logs and 1 for dbases for each server, and
a quorum drive. Been evaluating the benefits of adding a dedicated array
for tempDB and indexes. Roughly 30-40 % of ransactions occur in tempDB
(still have a lot of tempDbases as it seems to be the most efficient way to
maintain/handle the passing of tables between stored procedures. In many
cases, there is no easy way to eliminate the passing of the tables as a
couple of stored procedures (the big offenders) require a large amount of
processing.


"Andrew J. Kelly" <sqlmvpnooospam (AT) shadhawk (DOT) com> wrote

Quote:
It sure sounds like your storage array is not able to handle the load.
Having a Raid 1+0 is great but only having 6 drives is on the low end.
Since you are not on a SAN you only have the cache on the controllers
themselves to help deal with all the writes. Make sure the cache on the
controllers is set to 100% write back and not read. It also sounds like
you
only have a single drive array and everything is on it (logs , data and
tempdb). That also is not good. You should have at least a Raid1 or
Raid1+0 for just the log files. The spikes can be caused by lots of
things
but you are probably seeing some of this due to checkpoints. You can use
the Checkpoint pages per second perfmon counter to correlate this to the
disk queue spikes. You might also try using connection pooling instead of
that many individual connections. You will need multiple pools to handle
the different dbs but it should cut down the actual number of connections
quite a bit.Have you ran any traces to see what kind of queries are being
run and who the worst offenders are? You may have a few really poorly
tuned
queries that are hurting everything else.


--
Andrew J. Kelly SQL MVP


"Tim" <tim (AT) pds (DOT) notanaddress.ca> wrote in message
news:%23v7AhlsDFHA.2804 (AT) TK2MSFTNGP14 (DOT) phx.gbl...
Hi Group

SYSTEM:

Clustered SQL Server 2000 on Windows 2000 (2 servers), with RAID 1+0, 6
drives (Total of 24 20GB hard drives with mirroring/striping). 4 GB RAM
/
Server. All fiber back end, with Fibre Channel Storage Hub 7 ( @ 100
MB/sec ). Our SQL servers have fiber cards @ 400 MB/sec connected to
hub.
About 50 DBases, ranging in size from about 50-100 MB up to 6 GB. They
are
shared by 15 clients (they are fairly balanced over a 24 hour period,
when
1
client is busy, the other isn't). The total size of dbases is ~ 65 GB,
Log
File ~ 35 GB. 100-250 connected users (to our application), with
300-500
SQL connections. Dual XEON 2.8Ghz Processors / Server.

PROBLEM:

We are experiencing disk problems but are having a very difficult time
diagnosing 'exactly' what is the cause and solution. On the data
drives,
thru PerfMon, the % Disk Times are averaging between 250 and 1500, with
peaks of over 30,000 (How do I get 30,000 % disk use??). Average disk
queues are setting at about 100-250 during the problem times, but
occasionally see it spike well over 1000. The transactions/sec hover
between 50-100. Processor hovers between 10 and 40%, spiking to 70-80%.
The disk bytes/sec for the data drives average betwwen 1 and 6
MBytes/sec.

I know there are a lot of variables that can be at play here, but where
should I start? Is the system overall too small? Over worked?? We did
have
memory issues but increased the RAM and there no longer appears to be
buffer
cache problems and we have not yet exceeded the available RAM. We have
removed most of our temp tables (in favor of table variables) and
removed
a
lot of the cursors that were in the system. We moved our reporting
server
to it's own box (completely seperate from prod servers). We reduced the
log
file backup intervals and staggered our database backup jobs. We
converted
portions of our reporting to Analysis services. All with short term
marginal improvements. We have to speed our system up before we start
losing clients. Does it sound/appear as though the next option would be
to
scale out? scale up? Goto a SAN environment? And info and advise
would
be
greatly appreciated!

If you got this far, thanks for the time in reading this long post!!!
And
am
anxious to read any responses. Thank you.







Reply With Quote
  #4  
Old   
Geoff N. Hiten
 
Posts: n/a

Default Re: Extremely high % Disk Time .. How is this possible?? - 02-10-2005 , 08:32 AM



As Andrew pointed out, it is the physical drives that appear to be limited.
Slicing them into logical drives may look better but it does not increase IO
capability. Separating data and logs onto different physical disks with the
correct RAID layouts will increase performance. Andrew also suggested
profiling for performance to find if there are just a few really ugly
queries that are killing your system. I agree completely.

As for the original question on Percent IO Time, that counter has been
largely irrelevent since SCSI Overlapped IO was introduced way back when.
Caching controllers really killed it. I prefer to look at disk queue length
and read bytes/write bytes per second as true measures of disk performance.

I believe that you will be running with write cache off in a non-SAN cluster
to prevent data loss during a cluster failover. This is a big performance
killer and one major reason I do not recommend a cluster without a SAN. All
indicators are that you need to modify your IO system. You may find other
performance bottlenecks as you remove the IO limit, but for right now, that
is the worst offender.

--
Geoff N. Hiten
Microsoft SQL Server MVP
Senior Database Administrator
Careerbuilder.com

I support the Professional Association for SQL Server
www.sqlpass.org

"Tim" <tim (AT) pds (DOT) notanaddress.ca> wrote

Quote:
I do have 5 logical drives, 1 for logs and 1 for dbases for each server,
and
a quorum drive. Been evaluating the benefits of adding a dedicated array
for tempDB and indexes. Roughly 30-40 % of ransactions occur in tempDB
(still have a lot of tempDbases as it seems to be the most efficient way
to
maintain/handle the passing of tables between stored procedures. In many
cases, there is no easy way to eliminate the passing of the tables as a
couple of stored procedures (the big offenders) require a large amount of
processing.


"Andrew J. Kelly" <sqlmvpnooospam (AT) shadhawk (DOT) com> wrote in message
news:e65DCOtDFHA.2632 (AT) TK2MSFTNGP12 (DOT) phx.gbl...
It sure sounds like your storage array is not able to handle the load.
Having a Raid 1+0 is great but only having 6 drives is on the low end.
Since you are not on a SAN you only have the cache on the controllers
themselves to help deal with all the writes. Make sure the cache on the
controllers is set to 100% write back and not read. It also sounds like
you
only have a single drive array and everything is on it (logs , data and
tempdb). That also is not good. You should have at least a Raid1 or
Raid1+0 for just the log files. The spikes can be caused by lots of
things
but you are probably seeing some of this due to checkpoints. You can
use
the Checkpoint pages per second perfmon counter to correlate this to the
disk queue spikes. You might also try using connection pooling instead
of
that many individual connections. You will need multiple pools to
handle
the different dbs but it should cut down the actual number of
connections
quite a bit.Have you ran any traces to see what kind of queries are
being
run and who the worst offenders are? You may have a few really poorly
tuned
queries that are hurting everything else.


--
Andrew J. Kelly SQL MVP


"Tim" <tim (AT) pds (DOT) notanaddress.ca> wrote in message
news:%23v7AhlsDFHA.2804 (AT) TK2MSFTNGP14 (DOT) phx.gbl...
Hi Group

SYSTEM:

Clustered SQL Server 2000 on Windows 2000 (2 servers), with RAID 1+0,
6
drives (Total of 24 20GB hard drives with mirroring/striping). 4 GB
RAM
/
Server. All fiber back end, with Fibre Channel Storage Hub 7 ( @ 100
MB/sec ). Our SQL servers have fiber cards @ 400 MB/sec connected to
hub.
About 50 DBases, ranging in size from about 50-100 MB up to 6 GB.
They
are
shared by 15 clients (they are fairly balanced over a 24 hour period,
when
1
client is busy, the other isn't). The total size of dbases is ~ 65
GB,
Log
File ~ 35 GB. 100-250 connected users (to our application), with
300-500
SQL connections. Dual XEON 2.8Ghz Processors / Server.

PROBLEM:

We are experiencing disk problems but are having a very difficult time
diagnosing 'exactly' what is the cause and solution. On the data
drives,
thru PerfMon, the % Disk Times are averaging between 250 and 1500,
with
peaks of over 30,000 (How do I get 30,000 % disk use??). Average disk
queues are setting at about 100-250 during the problem times, but
occasionally see it spike well over 1000. The transactions/sec hover
between 50-100. Processor hovers between 10 and 40%, spiking to
70-80%.
The disk bytes/sec for the data drives average betwwen 1 and 6
MBytes/sec.

I know there are a lot of variables that can be at play here, but
where
should I start? Is the system overall too small? Over worked?? We
did
have
memory issues but increased the RAM and there no longer appears to be
buffer
cache problems and we have not yet exceeded the available RAM. We
have
removed most of our temp tables (in favor of table variables) and
removed
a
lot of the cursors that were in the system. We moved our reporting
server
to it's own box (completely seperate from prod servers). We reduced
the
log
file backup intervals and staggered our database backup jobs. We
converted
portions of our reporting to Analysis services. All with short term
marginal improvements. We have to speed our system up before we start
losing clients. Does it sound/appear as though the next option would
be
to
scale out? scale up? Goto a SAN environment? And info and advise
would
be
greatly appreciated!

If you got this far, thanks for the time in reading this long post!!!
And
am
anxious to read any responses. Thank you.









Reply With Quote
  #5  
Old   
Andrew J. Kelly
 
Posts: n/a

Default Re: Extremely high % Disk Time .. How is this possible?? - 02-10-2005 , 09:18 AM



Completely agree with Geoff.

--
Andrew J. Kelly SQL MVP


"Geoff N. Hiten" <SRDBA (AT) Careerbuilder (DOT) com> wrote

Quote:
As Andrew pointed out, it is the physical drives that appear to be
limited.
Slicing them into logical drives may look better but it does not increase
IO
capability. Separating data and logs onto different physical disks with
the
correct RAID layouts will increase performance. Andrew also suggested
profiling for performance to find if there are just a few really ugly
queries that are killing your system. I agree completely.

As for the original question on Percent IO Time, that counter has been
largely irrelevent since SCSI Overlapped IO was introduced way back when.
Caching controllers really killed it. I prefer to look at disk queue
length
and read bytes/write bytes per second as true measures of disk
performance.

I believe that you will be running with write cache off in a non-SAN
cluster
to prevent data loss during a cluster failover. This is a big performance
killer and one major reason I do not recommend a cluster without a SAN.
All
indicators are that you need to modify your IO system. You may find other
performance bottlenecks as you remove the IO limit, but for right now,
that
is the worst offender.

--
Geoff N. Hiten
Microsoft SQL Server MVP
Senior Database Administrator
Careerbuilder.com

I support the Professional Association for SQL Server
www.sqlpass.org

"Tim" <tim (AT) pds (DOT) notanaddress.ca> wrote in message
news:%23kwePe3DFHA.2608 (AT) TK2MSFTNGP10 (DOT) phx.gbl...
I do have 5 logical drives, 1 for logs and 1 for dbases for each server,
and
a quorum drive. Been evaluating the benefits of adding a dedicated array
for tempDB and indexes. Roughly 30-40 % of ransactions occur in tempDB
(still have a lot of tempDbases as it seems to be the most efficient way
to
maintain/handle the passing of tables between stored procedures. In many
cases, there is no easy way to eliminate the passing of the tables as a
couple of stored procedures (the big offenders) require a large amount of
processing.


"Andrew J. Kelly" <sqlmvpnooospam (AT) shadhawk (DOT) com> wrote in message
news:e65DCOtDFHA.2632 (AT) TK2MSFTNGP12 (DOT) phx.gbl...
It sure sounds like your storage array is not able to handle the load.
Having a Raid 1+0 is great but only having 6 drives is on the low end.
Since you are not on a SAN you only have the cache on the controllers
themselves to help deal with all the writes. Make sure the cache on
the
controllers is set to 100% write back and not read. It also sounds
like
you
only have a single drive array and everything is on it (logs , data and
tempdb). That also is not good. You should have at least a Raid1 or
Raid1+0 for just the log files. The spikes can be caused by lots of
things
but you are probably seeing some of this due to checkpoints. You can
use
the Checkpoint pages per second perfmon counter to correlate this to
the
disk queue spikes. You might also try using connection pooling instead
of
that many individual connections. You will need multiple pools to
handle
the different dbs but it should cut down the actual number of
connections
quite a bit.Have you ran any traces to see what kind of queries are
being
run and who the worst offenders are? You may have a few really poorly
tuned
queries that are hurting everything else.


--
Andrew J. Kelly SQL MVP


"Tim" <tim (AT) pds (DOT) notanaddress.ca> wrote in message
news:%23v7AhlsDFHA.2804 (AT) TK2MSFTNGP14 (DOT) phx.gbl...
Hi Group

SYSTEM:

Clustered SQL Server 2000 on Windows 2000 (2 servers), with RAID 1+0,
6
drives (Total of 24 20GB hard drives with mirroring/striping). 4 GB
RAM
/
Server. All fiber back end, with Fibre Channel Storage Hub 7 ( @ 100
MB/sec ). Our SQL servers have fiber cards @ 400 MB/sec connected to
hub.
About 50 DBases, ranging in size from about 50-100 MB up to 6 GB.
They
are
shared by 15 clients (they are fairly balanced over a 24 hour period,
when
1
client is busy, the other isn't). The total size of dbases is ~ 65
GB,
Log
File ~ 35 GB. 100-250 connected users (to our application), with
300-500
SQL connections. Dual XEON 2.8Ghz Processors / Server.

PROBLEM:

We are experiencing disk problems but are having a very difficult
time
diagnosing 'exactly' what is the cause and solution. On the data
drives,
thru PerfMon, the % Disk Times are averaging between 250 and 1500,
with
peaks of over 30,000 (How do I get 30,000 % disk use??). Average
disk
queues are setting at about 100-250 during the problem times, but
occasionally see it spike well over 1000. The transactions/sec hover
between 50-100. Processor hovers between 10 and 40%, spiking to
70-80%.
The disk bytes/sec for the data drives average betwwen 1 and 6
MBytes/sec.

I know there are a lot of variables that can be at play here, but
where
should I start? Is the system overall too small? Over worked?? We
did
have
memory issues but increased the RAM and there no longer appears to be
buffer
cache problems and we have not yet exceeded the available RAM. We
have
removed most of our temp tables (in favor of table variables) and
removed
a
lot of the cursors that were in the system. We moved our reporting
server
to it's own box (completely seperate from prod servers). We reduced
the
log
file backup intervals and staggered our database backup jobs. We
converted
portions of our reporting to Analysis services. All with short term
marginal improvements. We have to speed our system up before we
start
losing clients. Does it sound/appear as though the next option would
be
to
scale out? scale up? Goto a SAN environment? And info and advise
would
be
greatly appreciated!

If you got this far, thanks for the time in reading this long post!!!
And
am
anxious to read any responses. Thank you.











Reply With Quote
Reply




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2012, Jelsoft Enterprises Ltd.