Strange Lockups on IDS 7.31.UD8

comp.databases.informix

#1
Anthony
Strange Lockups on IDS 7.31.UD8 - 08-16-2005, 09:00 AM

Hey all,

We finally upgraded to UD8 about four months ago, after some fairly
thorough testing. However, we're having some problems that seem to be
mostly unexplainable.

Quick background on the system:

The system is a dual Athlon 1900 MP system w/ 3.5GB of RAM and five
hard disks. The dbspaces are all stored on the first two drives, which
are RAID-1. There is a second RAID-1 array which we dump backups to,
and the last drive is a spare. We're running Redhat Linux 7.3 and a
2.4.20 kernel. The instance is about 11.5GB in size, not counting the
root or the tmp spaces, and it's growing at a quick clip (it has doubled
in size since June 1).

About two weeks ago, our production instance of IDS just quit
responding. onmonitor and dbaccess would time out looking for the db
instance. However, the log file simply showed completed checkpoints,
similar to:

07:31:58 Checkpoint Completed: duration was 0 seconds.
07:31:58 Checkpoint loguniq 9612, logpos 0x9e4018

Logs were continuing to run / pile up, even though no one could access
the database. However, they weren't showing up any faster / slower
than normal. I was unable to kill it w/ an onmode -k, either. After
waiting about 20 minutes, I did a kill -9, which showed a:

09:08:10 The Master Daemon Died
09:08:12 PANIC: Attempting to bring system down

I crossed my fingers, and it came back up just fine.

This has happened TWICE in the past few weeks. The system is running,
the logs are fine, but it is completely unresponsive. In fact, the system
load is usually low (around 0.21 to 0.30) when this happens.


I'm not sure if it's related, but what is becoming more and more
frustrating is that my indexes are continually being corrupted. It
would seem that we have a bad index or two EVERY day now. For instance,
when I came in this morning, I had the following waiting for me:

09:44:58 IBM Informix Dynamic Server Version 7.31.UD8
09:44:58 Who: Session(19, informix (AT) corp (DOT) inventconnect.com, -1, 2146684976)
Thread(154, sqlexec, 7ff2b3c0, 1)
File: rspartn.c Line: 1867
09:44:58 Results: Could not complete operation on 'development:"root".invention'
09:44:58 Action: Run 'oncheck -cDI development:"root".invention'
09:45:02 See Also: /db0/tmp/af.482edd9, shmem.482edd9.0

I ran oncheck; it found the bad index (no problems w/ the table check)
and fixed it. I ran it again and it found no problems. However, I did the
exact same thing last Thursday, on the same table.

About two weeks ago, I did a complete oncheck of the dbspace. I ran both
-cD and -cI, and neither found any problems. We're running Art's dostats
program to tune the indexes about once every week or two.

I don't know if it helps, but the output of onstat_rau_ur.sh shows:
[root@corp azure]# ./onstat_rau_ur.sh
Read Utilization (UR): 96.697 %
Bufwaits Ratio (BR): 0.693446 %
The UR should ideally be very near 100%. The higher the better.
The BR should be below 7%. The lower the better.

When we were running under 7.31.UD2, the UR was 99.95% and the Bufwaits
was closer to 2%. Not sure what happened.
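
The source of onstat_rau_ur.sh is not shown in this thread, so the following is only a sketch of the formulas commonly quoted for these two ratios, assuming the counters are read from "onstat -p". The function names and the sample counter values are hypothetical and chosen only to land near the figures above.

```python
def read_ahead_utilization(ra_pgsused, ixda_ra, idx_ra, da_ra):
    """Percentage of read-ahead pages that were actually used.

    Assumed formula: RA-pgsused / (ixda-RA + idx-RA + da-RA) * 100.
    """
    pages_read_ahead = ixda_ra + idx_ra + da_ra
    if pages_read_ahead == 0:
        return 0.0
    return 100.0 * ra_pgsused / pages_read_ahead


def bufwaits_ratio(bufwaits, pagreads, bufwrits):
    """Buffer waits as a percentage of page reads plus buffer writes.

    Assumed formula: bufwaits / (pagreads + bufwrits) * 100.
    """
    total = pagreads + bufwrits
    if total == 0:
        return 0.0
    return 100.0 * bufwaits / total


# Hypothetical counters, not taken from the poster's system.
ur = read_ahead_utilization(ra_pgsused=9670, ixda_ra=4000, idx_ra=1000, da_ra=5000)
br = bufwaits_ratio(bufwaits=70, pagreads=6000, bufwrits=4000)
print(f"UR: {ur:.2f} %  BR: {br:.2f} %")
```

Comparing the raw counters between two dates, rather than only the ratios, makes it easier to see whether a drop in UR comes from more read-ahead being issued or from fewer of those pages being used.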

I'd be happy to post any stats / onconfig files that you might need to
help me solve these issues.

Thanks for your help.

--Anthony


#2
Martin Fuerderer
Re: Strange Lockups on IDS 7.31.UD8 - 08-16-2005, 11:03 AM

Hi,

During the upgrade, did you also change the machine and OS?
Specifically, did you change from a 1-CPU system to a 2-CPU system?
If that's the case, check your configuration, especially the
onconfig file parameters SINGLE_CPU_VP, NUMCPUVPS and
MULTIPROCESSOR, for correctness. If they are still set for a
1-CPU system, then correcting them may solve your problem.
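
A minimal sketch of that consistency check, assuming the onconfig file uses the usual "NAME value # comment" layout with each setting on one line. The helper names, the physical_cpus argument and the rules encoded here are illustrative only; the NUMCPUVPS comparison in particular is a rule of thumb rather than a hard requirement.

```python
def parse_onconfig(text):
    """Return {PARAMETER: value} from onconfig-style text, ignoring comments."""
    params = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        parts = line.split(None, 1)
        if len(parts) == 2:
            params[parts[0]] = parts[1].strip()
    return params


def check_cpu_settings(params, physical_cpus):
    """List apparent mismatches between the CPU parameters and the hardware."""
    multiprocessor = int(params.get("MULTIPROCESSOR", 0))
    numcpuvps = int(params.get("NUMCPUVPS", 1))
    single_cpu_vp = int(params.get("SINGLE_CPU_VP", 0))

    problems = []
    if physical_cpus > 1 and multiprocessor == 0:
        problems.append("MULTIPROCESSOR is 0 on a multi-CPU machine")
    if single_cpu_vp != 0 and numcpuvps != 1:
        problems.append("SINGLE_CPU_VP is non-zero but NUMCPUVPS is not 1")
    if numcpuvps > physical_cpus:
        problems.append("NUMCPUVPS exceeds the number of physical CPUs")
    return problems


# Example settings for a two-CPU machine.
SAMPLE = """
MULTIPROCESSOR 1 # 0 for single-processor, 1 for multi-processor
NUMCPUVPS 2 # Number of user (cpu) vps
SINGLE_CPU_VP 0 # If non-zero, limit number of cpu vps to one
"""

print(check_cpu_settings(parse_onconfig(SAMPLE), physical_cpus=2))
```

An empty list means none of the three assumed rules was violated; it says nothing about the rest of the configuration.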

Regards,
Martin
--
Martin Fuerderer
IBM Informix Development Munich, Germany
Information Management



#3
Anthony
Re: Strange Lockups on IDS 7.31.UD8 - 08-16-2005, 04:37 PM

The machine is the same one that we've been using for about ... 2
years. We've been running IDS on it the whole time.

However, that doesn't mean those values haven't been wrong the whole
time. The values for the parameters you asked about are:

MULTIPROCESSOR 1 # 0 for single-processor, 1 for multi-processor
NUMCPUVPS 2 # Number of user (cpu) vps
SINGLE_CPU_VP 0 # If non-zero, limit number of cpu vps to one

I also noticed the index errors from last Thursday look more like:

09:44:58 Assert Failed: ptmap
09:44:58 IBM Informix Dynamic Server Version 7.31.UD8
09:44:58 Who: Session(19, informix (AT) corp (DOT) inventconnect.com, -1, 2146684976)
Thread(154, sqlexec, 7ff2b3c0, 1)
File: rspartn.c Line: 1867
09:44:58 Results: Could not complete operation on 'development:"root".invention'
09:44:58 Action: Run 'oncheck -cDI development:"root".invention'
09:45:02 See Also: /db0/tmp/af.482edd9, shmem.482edd9.0
09:46:35 I/O bad request chunk 4095, pagenum 1048575, pagecnt 1
09:46:39 listener-thread: err = -25587: oserr = 0: errstr = : Network receive failed.

However, running an oncheck -cDI doesn't show any problems (other than
the bad index). I would expect to see something about a bad chunk.
Isn't that what the above error states?

--Anthony


#4
Dirk Moolman
RE: Strange Lockups on IDS 7.31.UD8 - 08-17-2005, 01:30 AM

Quote:
The machine is the same one that we've been using for about ... 2
years. We've been running IDS on it the whole time.

However, that doesn't mean those values haven't been wrong the whole
time. The values for the parameters you asked about are:
[snip]

Quote:
However, running an oncheck -cDI doesn't show any problems (other than
the bad index). I would expect to see something about a bad chunk.
Isn't that what the above error states?

--Anthony


Did you do an in-place upgrade? You could be hitting a bug where old
pages were not properly converted and are now causing the index corruption.
I had a similar problem many years ago, where some (not all) of my
indexes, especially the larger ones, were corrupted, and only a couple
of days after the upgrade.

If it was in-place, I would suggest unloading and reloading these tables,
or at least dropping and rebuilding the indexes.


Dirk



#5
Superboer
Re: Strange Lockups on IDS 7.31.UD8 - 08-17-2005, 02:08 AM

In a 'normal' running environment you should not get index corruption
followed by an engine crash.

Sounds like you are hitting a bug or may have hardware problems.

What is in /db0/tmp/af.482edd9?
I guess a page dump.

I/O bad request chunk 4095 -->> max chunks is 2048, so either
the data on disk is screwed up, or your OS (controller) is screwing it up,
or you have run into a major bug.
Contact TS for this!!
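
The two numbers in that message are worth a closer look: 4095 is 2^12 - 1 and 1048575 is 2^20 - 1, so both fields have every bit set. The sketch below assumes a 32-bit page address split into a 12-bit chunk number and a 20-bit page offset. That layout is an assumption made for illustration and is not confirmed anywhere in this thread. Under that assumption the request decodes to address 0xFFFFFFFF (-1 as a signed 32-bit value), which would look more like an invalid or uninitialized pointer than a real location on disk.

```python
CHUNK_BITS = 12  # assumed width of the chunk-number field
PAGE_BITS = 20   # assumed width of the page-offset field


def pack_address(chunk, pagenum):
    """Combine chunk number and page offset into one 32-bit value."""
    return (chunk << PAGE_BITS) | pagenum


def unpack_address(addr):
    """Split a 32-bit value back into (chunk, pagenum)."""
    return addr >> PAGE_BITS, addr & ((1 << PAGE_BITS) - 1)


def as_signed32(value):
    """Interpret an unsigned 32-bit value as a signed one."""
    return value - (1 << 32) if value & (1 << 31) else value


addr = pack_address(4095, 1048575)
print(hex(addr))             # 0xffffffff
print(as_signed32(addr))     # -1
print(unpack_address(addr))  # (4095, 1048575)
```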

Superboer.



#6
Ben Thompson
Re: Strange Lockups on IDS 7.31.UD8 - 08-17-2005, 02:11 AM

Anthony wrote:
Quote:
Hey all,

We finally upgraded to UD8 about four months ago, after some fairly
thorough testing. However, we're having some problems that seem to be
mostly unexplainable.

Quick background on the system:

The system is a dual Athlon 1900 MP system w/ 3.5GB of RAM, and (5)
Hard Disks. The dbspaces are all stored on the first two drives, which
are RAID-1. There is a second RAID-1 array which we dump backups too,
and the last drive is a spare. We're running Redhat Linux 7.3, and a
2.4.20 kernel. The instance is about 11.5GB in size, not counting the
root or the tmp spaces. And it's growing at a quick clip (doubled in
size from June 1).

Not what you were asking, but a RAID 10 array using four discs may be
better for you in performance terms than two RAID-1 mirrors.

Quote:
About two weeks ago, our production instance of IDS just quit
responding. onmonitor and dbaccess would time out looking for the db
instance. However, the log files simply showed a Completed log,
similar to:

07:31:58 Checkpoint Completed: duration was 0 seconds.
07:31:58 Checkpoint loguniq 9612, logpos 0x9e4018

Logs were continuing to run / pile up, even though no one could access
the database. However, they weren't showing up any faster / slower
than normal. I was unable to kill it w/ an onmode -k, either. After
waiting about 20 minutes, I did a kill -9, which showed a:

09:08:10 The Master Daemon Died
09:08:12 PANIC: Attempting to bring system down

Crossing my fingers, it came back up just fine.

This has happened TWICE in the past few weeks. The system is running,
the logs are fine, but completely unresponsive. In fact, the system
load is usually under 0.21 or 0.30 when this happens.

Are the logical logs not being backed up? Please post "onstat -l".
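
As a rough illustration of what to look for in that output: if every logical log is used and none has been backed up, the server can block until a log backup frees one, which would fit an instance that is up but unresponsive. The sketch below only inspects the flags column and assumes that "U" marks a used log, "B" a backed-up log and "C" the current log. The flag strings in the sample are hypothetical, not taken from the poster's system.

```python
def needs_backup(flags):
    """True for a log that is used, not backed up, and not the current log."""
    return "U" in flags and "B" not in flags and "C" not in flags


def count_unbacked(flag_list):
    """Count the logs that are still waiting for a backup."""
    return sum(1 for flags in flag_list if needs_backup(flags))


# Hypothetical flags column from five logical logs.
sample = ["U-B----", "U-B----", "U------", "U---C-L", "F------"]
print(count_unbacked(sample))  # 1
```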

Quote:
I'm not sure if it's related, but it's becoming more and more
frustrating, is the fact that my indexes are being continually
corrupted. It would seem that we have a bad index or two EVERY day
now. For instance, when I came in this morning, I had the following
waiting for me:

09:44:58 IBM Informix Dynamic Server Version 7.31.UD8
09:44:58 Who: Session(19, informix (AT) corp (DOT) inventconnect.com, -1,
2146684976)
Thread(154, sqlexec, 7ff2b3c0, 1)
File: rspartn.c Line: 1867
09:44:58 Results: Could not complete operation on
'development:"root".invention'
09:44:58 Action: Run 'oncheck -cDI development:"root".invention'
09:45:02 See Also: /db0/tmp/af.482edd9, shmem.482edd9.0

I ran oncheck, it found the bad index (no problems w/ the table check)
and fixed it. Ran it again, no problems found. However, I did the
exact same thing last Thursday, on the same table.

If you're suffering disc problems then you may see something in
/var/log/messages
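
A small sketch of how such a scan could be automated. The patterns are assumptions and far from exhaustive, since the exact wording depends on the kernel and the disk driver; the function names are hypothetical.

```python
import re

# Assumed, non-exhaustive patterns; adjust them to what the kernel and
# drivers on the machine actually log.
SUSPECT = re.compile(r"i/o error|medium error|bad sector", re.IGNORECASE)


def suspicious_lines(lines):
    """Return the lines that match any of the assumed disk-trouble patterns."""
    return [line.rstrip("\n") for line in lines if SUSPECT.search(line)]


def scan_file(path="/var/log/messages"):
    """Scan a log file and return its suspicious lines."""
    with open(path, errors="replace") as handle:
        return suspicious_lines(handle)
```

Calling scan_file() as a user who can read the log returns the matching lines; comparing their timestamps with the assert failures in online.log would show whether the two line up.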

If you have technical support (since you upgraded I am guessing you do)
you could raise a case and send them the af.482edd9 file. This may be
your best option rather than asking on this group.

Quote:
About two weeks ago, I did a complete oncheck of the dbspace. Both -cD
and -cI, and neither found any problems. We're running Art's dostats
program to tune the indexes, about once every week or two.

I don't know if it helps, but an output of onstat_rau_ur.sh shows:
[root@corp azure]# ./onstat_rau_ur.sh
Read Utilization (UR): 96.697 %
Bufwaits Ratio (BR): 0.693446 %
The UR should ideally be very near 100%. The higher the better.
The BR should be below 7%. The lower the better.

When we were running under 7.31.UD2, the UR was 99.95% and the Bufwaits
was closer to 2%. Not sure what happened.

What is BUFFERS set to? It would be helpful to post a full ONCONFIG
file. Also please post "onstat -p". If you have a lot more data than
previously then the read utilisation may well be falling.

Quote:
I'd be happy to post any stats / onconfig files that you might need to
help me solve this issue(s).

Thanks for your help.

--Anthony


#7
Martin Fuerderer
Re: Strange Lockups on IDS 7.31.UD8 - 08-17-2005, 04:10 AM

Hi,

The onconfig file parameters look fine.

Check the AF-file (/db0/tmp/af.482edd9) as advised
in the online.log. If that doesn't clarify things, you should
probably open a "Case" with IBM Informix Tech Support
to have them help analyze the cause of this. They may
also request the SHM dump (file shmem.482edd9.0) for
analysis.

I'm not sure whether the message about the "bad I/O request" is
directly connected to the preceding messages, but it may well
be. Tech Support will be able to tell by the chunk and page
number ...

I've seen systems that had such "transient I/O errors" not because
of the I/O subsystem, but because of processor cache latency
that showed different content of the same page to different CPUs.
And that is just one possibility ...

The listener-thread message most probably has a different cause.

Regards,
Martin
--
Martin Fuerderer
IBM Informix Development Munich, Germany
Information Management



#8
Dave Griffen
Re: Strange Lockups on IDS 7.31.UD8 - 08-17-2005, 05:05 PM

The "I/O bad request chunk 4095" message could be due to a corrupted index
pointing to a nonexistent chunk.
In any case, if oncheck -cDI confirms a bad index, drop and recreate the
index.



