#2
Hey all,

We finally upgraded to UD8 about four months ago, after some fairly thorough testing. However, we're having some problems that seem to be mostly unexplainable.

Quick background on the system: it is a dual Athlon 1900 MP system w/ 3.5GB of RAM and five hard disks. The dbspaces are all stored on the first two drives, which are RAID-1. There is a second RAID-1 array which we dump backups to, and the last drive is a spare. We're running Red Hat Linux 7.3 and a 2.4.20 kernel. The instance is about 11.5GB in size, not counting the root or the tmp spaces, and it's growing at a quick clip (doubled in size since June 1).

About two weeks ago, our production instance of IDS just quit responding. onmonitor and dbaccess would time out looking for the db instance. However, the log files simply showed checkpoints completing, similar to:

07:31:58 Checkpoint Completed: duration was 0 seconds.
07:31:58 Checkpoint loguniq 9612, logpos 0x9e4018

Logs were continuing to run / pile up, even though no one could access the database. However, they weren't showing up any faster / slower than normal. I was unable to kill it w/ an onmode -k, either. After waiting about 20 minutes, I did a kill -9, which showed:

09:08:10 The Master Daemon Died
09:08:12 PANIC: Attempting to bring system down

I crossed my fingers, and it came back up just fine. This has happened TWICE in the past few weeks. The system is running, the logs are fine, but it is completely unresponsive. In fact, the system load is usually under 0.21 or 0.30 when this happens.

I'm not sure if it's related, but what is becoming more and more frustrating is that my indexes are being continually corrupted. It would seem that we have a bad index or two EVERY day now. For instance, when I came in this morning, I had the following waiting for me:

09:44:58 IBM Informix Dynamic Server Version 7.31.UD8
09:44:58 Who: Session(19, informix@corp.inventconnect.com, -1, 2146684976)
         Thread(154, sqlexec, 7ff2b3c0, 1)
         File: rspartn.c Line: 1867
09:44:58 Results: Could not complete operation on 'development:"root".invention'
09:44:58 Action: Run 'oncheck -cDI development:"root".invention'
09:45:02 See Also: /db0/tmp/af.482edd9, shmem.482edd9.0

I ran oncheck; it found the bad index (no problems w/ the table check) and fixed it. I ran it again, and no problems were found. However, I did the exact same thing last Thursday, on the same table.

About two weeks ago, I did a complete oncheck of the dbspace, both -cD and -cI, and neither found any problems. We're running Art's dostats program to tune the indexes about once every week or two.

I don't know if it helps, but the output of onstat_rau_ur.sh shows:

[root@corp azure]# ./onstat_rau_ur.sh
Read Utilization (UR): 96.697 %
Bufwaits Ratio (BR): 0.693446 %
The UR should ideally be very near 100%. The higher the better.
The BR should be below 7%. The lower the better.

When we were running under 7.31.UD2, the UR was 99.95% and the Bufwaits Ratio was closer to 2%. Not sure what happened.

I'd be happy to post any stats / onconfig files that you might need to help me solve these issues. Thanks for your help.

--Anthony
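For anyone who wants to reproduce the two ratios by hand, here is a minimal sketch. The formulas are an assumption: they are the ones commonly circulated for these two metrics, computed from the `onstat -p` counters, and are not taken from onstat_rau_ur.sh itself, which is not shown in this thread. The function names and the sample counter values are hypothetical.

```python
def read_ahead_utilization(ra_pgsused, ixda_ra, idx_ra, da_ra):
    """Percent of read-ahead pages that were actually used.

    Assumed formula: 100 * RA-pgsused / (ixda-RA + idx-RA + da-RA),
    using the counter names printed by `onstat -p`.
    """
    pages_read_ahead = ixda_ra + idx_ra + da_ra
    if pages_read_ahead == 0:
        return None  # no read-ahead activity yet, ratio is undefined
    return 100.0 * ra_pgsused / pages_read_ahead


def bufwaits_ratio(bufwaits, pagreads, bufwrits):
    """Percent of buffer operations that had to wait for a buffer.

    Assumed formula: 100 * bufwaits / (pagreads + bufwrits).
    """
    buffer_ops = pagreads + bufwrits
    if buffer_ops == 0:
        return None  # no buffer activity yet, ratio is undefined
    return 100.0 * bufwaits / buffer_ops


if __name__ == "__main__":
    # Hypothetical counters, NOT values from the system described above.
    ur = read_ahead_utilization(ra_pgsused=967, ixda_ra=500, idx_ra=300, da_ra=200)
    br = bufwaits_ratio(bufwaits=7, pagreads=600, bufwrits=400)
    print(f"Read Utilization (UR): {ur:.3f} %")
    print(f"Bufwaits Ratio (BR): {br:.3f} %")
```

Because the counters are cumulative since the last `onstat -z` or server restart, comparing figures between UD2 and UD8 is only meaningful if both were sampled over a similar workload window.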
#5
The machine is the same one that we've been using for about 2 years. We've been running IDS on it the whole time. However, that doesn't mean those values haven't been wrong the whole time. The values for the parameters you asked about are:

MULTIPROCESSOR 1 # 0 for single-processor, 1 for multi-processor
NUMCPUVPS 2 # Number of user (cpu) vps
SINGLE_CPU_VP 0 # If non-zero, limit number of cpu vps to one

I also noticed the index errors from last Thursday look more like:

09:44:58 Assert Failed: ptmap
09:44:58 IBM Informix Dynamic Server Version 7.31.UD8
09:44:58 Who: Session(19, informix@corp.inventconnect.com, -1, 2146684976)
         Thread(154, sqlexec, 7ff2b3c0, 1)
         File: rspartn.c Line: 1867
09:44:58 Results: Could not complete operation on 'development:"root".invention'
09:44:58 Action: Run 'oncheck -cDI development:"root".invention'
09:45:02 See Also: /db0/tmp/af.482edd9, shmem.482edd9.0
09:46:35 I/O bad request chunk 4095, pagenum 1048575, pagecnt 1
09:46:39 listener-thread: err = -25587: oserr = 0: errstr = : Network receive failed.

However, running an oncheck -cDI doesn't show any problems (other than the bad index). I would expect to see something about a bad chunk. Isn't that what the above error states?

--Anthony
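A side note on the numbers in the "I/O bad request" line, offered as a sketch rather than a diagnosis. The arithmetic below is certain: 4095 and 1048575 are the largest values that fit in 12 and 20 bits. The interpretation is an assumption: if this server version packs a page address into 32 bits as a 12-bit chunk number plus a 20-bit page offset, then the pair is simply an all-ones value (0xFFFFFFFF), which would look more like an invalid or uninitialized page reference than a real chunk on disk. The constant and function names are hypothetical.

```python
# Assumed layout of a 32-bit page address: high 12 bits = chunk number,
# low 20 bits = page offset within the chunk. This split is an assumption,
# not something stated in the log or in this thread.
CHUNK_BITS = 12
PAGE_BITS = 20


def pack_page_address(chunk, pagenum):
    """Combine a chunk number and page offset into one 32-bit value."""
    if not 0 <= chunk < (1 << CHUNK_BITS):
        raise ValueError("chunk does not fit in CHUNK_BITS")
    if not 0 <= pagenum < (1 << PAGE_BITS):
        raise ValueError("pagenum does not fit in PAGE_BITS")
    return (chunk << PAGE_BITS) | pagenum


def is_all_ones(chunk, pagenum):
    """True when every bit of the packed 32-bit address is set."""
    return pack_page_address(chunk, pagenum) == 0xFFFFFFFF


if __name__ == "__main__":
    # Values copied from the log line: chunk 4095, pagenum 1048575.
    print(hex(pack_page_address(4095, 1048575)))
    print(is_all_ones(4095, 1048575))
```

If that reading is right, it would be consistent with oncheck finding nothing wrong with any actual chunk, since no real chunk 4095 would be involved.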