dbTalk Databases Forums  

Problems with long checkpoints

comp.databases.informix comp.databases.informix


Discuss Problems with long checkpoints in the comp.databases.informix forum.



Reply
 
Thread Tools Display Modes
  #1  
Old   
Sean Baker
 
Posts: n/a

Default Problems with long checkpoints - 09-27-2011 , 02:05 PM






Hello,

We're stuck without a dba after some massive layoffs. I know a little, butI'm no dba.

IDS 11.50.FC6 on RHEL 5.3

We have a development instance that is suddenly going into long checkpoints(43 minutes, 61 minutes, 90 minutes!). We've rebooted the server and restarted the instance, and even did a restore from our production server to the dev server. It lasts for a while (maybe 30 minutes) then goes into long checkpoints.

One of our developers has been creating new dbspaces, and it seems that might be the cause. Although he's added a few over the past several months without any problems. After the restore today, he added a dbspace, and we'restuck in long checkpoints again:

IBM Informix Dynamic Server Version 11.50.FC6 -- On-Line (CKPT INP) -- Up 01:39:05 -- 6638456 Kbytes

I've listed the last few log entries from today below. Can anyone give me some obvious things to check? Or some better data to provide to the list for some help?

Thanks,

Sean

Log:

11:31:41 Checkpoint Completed: duration was 0 seconds.
11:31:41 Tue Sep 27 - loguniq 15415, logpos 0x33f1018, timestamp: 0x4f4c8892 Interval: 209340

11:31:41 Maximum server connections 12
11:31:41 Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns blocked0, Plog used 0, Llog used 2

11:35:02 Space 'sm_dbs0' added.
11:35:02 Checkpoint Completed: duration was 0 seconds.
11:35:02 Tue Sep 27 - loguniq 15415, logpos 0x33fb2e4, timestamp: 0x4f4c8a3d Interval: 209341

11:35:02 Maximum server connections 12
11:35:02 Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns blocked0, Plog used 16, Llog used 10

11:35:13 Checkpoint Completed: duration was 0 seconds.
11:35:13 Tue Sep 27 - loguniq 15415, logpos 0x33fe018, timestamp: 0x4f4c8b0c Interval: 209342

11:35:13 Maximum server connections 12
11:35:13 Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns blocked0, Plog used 0, Llog used 3

11:40:40 Checkpoint Completed: duration was 1 seconds.
11:40:40 Tue Sep 27 - loguniq 15415, logpos 0x3867018, timestamp: 0x4f4cdac8 Interval: 209343

11:40:40 Maximum server connections 12
11:40:40 Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns blocked0, Plog used 134, Llog used 1129

11:44:58 Adjusting LRU for bufferpool - id 0 size 2k
11:44:58 Old max 60.0 min 50.0 New max 59.4 min 49.5
11:46:07 Adjusting LRU for bufferpool - id 0 size 2k
11:46:07 Old max 59.4 min 49.5 New max 58.8 min 49.0

Reply With Quote
  #2  
Old   
jrenaut
 
Posts: n/a

Default Re: Problems with long checkpoints - 09-27-2011 , 02:40 PM






On Sep 27, 2:05*pm, Sean Baker <SBa... (AT) moneymailer (DOT) com> wrote:
Quote:
Hello,

We're stuck without a dba after some massive layoffs. *I know a little,but I'm no dba.

IDS 11.50.FC6 on RHEL 5.3

We have a development instance that is suddenly going into long checkpoints (43 minutes, 61 minutes, 90 minutes!). *We've rebooted the server and restarted the instance, and even did a restore from our production server to the dev server. *It lasts for a while (maybe 30 minutes) then goes intolong checkpoints.

One of our developers has been creating new dbspaces, and it seems that might be the cause. *Although he's added a few over the past several months without any problems. *After the restore today, he added a dbspace, andwe're stuck in long checkpoints again:

IBM Informix Dynamic Server Version 11.50.FC6 * * -- On-Line (CKPT INP) -- Up 01:39:05 -- 6638456 Kbytes

I've listed the last few log entries from today below. *Can anyone giveme some obvious things to check? *Or some better data to provide to the list for some help?

Thanks,

Sean

Log:

11:31:41 *Checkpoint Completed: *duration was 0 seconds.
11:31:41 *Tue Sep 27 - loguniq 15415, logpos 0x33f1018, timestamp: 0x4f4c8892 Interval: 209340

11:31:41 *Maximum server connections 12
11:31:41 *Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns blocked 0, Plog used 0, Llog used 2

11:35:02 *Space 'sm_dbs0' added.
11:35:02 *Checkpoint Completed: *duration was 0 seconds.
11:35:02 *Tue Sep 27 - loguniq 15415, logpos 0x33fb2e4, timestamp: 0x4f4c8a3d Interval: 209341

11:35:02 *Maximum server connections 12
11:35:02 *Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns blocked 0, Plog used 16, Llog used 10

11:35:13 *Checkpoint Completed: *duration was 0 seconds.
11:35:13 *Tue Sep 27 - loguniq 15415, logpos 0x33fe018, timestamp: 0x4f4c8b0c Interval: 209342

11:35:13 *Maximum server connections 12
11:35:13 *Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns blocked 0, Plog used 0, Llog used 3

11:40:40 *Checkpoint Completed: *duration was 1 seconds.
11:40:40 *Tue Sep 27 - loguniq 15415, logpos 0x3867018, timestamp: 0x4f4cdac8 Interval: 209343

11:40:40 *Maximum server connections 12
11:40:40 *Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns blocked 0, Plog used 134, Llog used 1129

11:44:58 *Adjusting LRU for bufferpool - id 0 size 2k
11:44:58 * * * *Old max 60.0 min 50.0 * *New max 59.4 min 49.5
11:46:07 *Adjusting LRU for bufferpool - id 0 size 2k
11:46:07 * * * *Old max 59.4 min 49.5 * *New max 58.8 min 49.0
Well, we can't really get much from what you've posted as the
checkpoint duration's listed in that snipped of MSGPATH file are 0
seconds. How about you post onstat -g ckp output after a long
checkpoint happens? Perhaps that will be more helpful.

Reply With Quote
  #3  
Old   
Sean Baker
 
Posts: n/a

Default RE: Problems with long checkpoints - 09-27-2011 , 02:48 PM



Quote:
-----Original Message-----
From: informix-list-bounces (AT) iiug (DOT) org [mailto:informix-list-
bounces (AT) iiug (DOT) org] On Behalf Of jrenaut
Sent: Tuesday, September 27, 2011 12:41 PM
To: informix-list (AT) iiug (DOT) org
Subject: Re: Problems with long checkpoints

On Sep 27, 2:05*pm, Sean Baker <SBa... (AT) moneymailer (DOT) com> wrote:
Hello,

We're stuck without a dba after some massive layoffs. *I know a
little, but I'm no dba.

IDS 11.50.FC6 on RHEL 5.3

We have a development instance that is suddenly going into long
checkpoints (43 minutes, 61 minutes, 90 minutes!). *We've rebooted the
server and restarted the instance, and even did a restore from our
production server to the dev server. *It lasts for a while (maybe 30
minutes) then goes into long checkpoints.

One of our developers has been creating new dbspaces, and it seems
that might be the cause. *Although he's added a few over the past
several months without any problems. *After the restore today, he added
a dbspace, and we're stuck in long checkpoints again:

IBM Informix Dynamic Server Version 11.50.FC6 * * -- On-Line (CKPT
INP) -- Up 01:39:05 -- 6638456 Kbytes

I've listed the last few log entries from today below. *Can anyone
give me some obvious things to check? *Or some better data to provide
to the list for some help?

Thanks,

Sean

Log:

11:31:41 *Checkpoint Completed: *duration was 0 seconds.
11:31:41 *Tue Sep 27 - loguniq 15415, logpos 0x33f1018, timestamp:
0x4f4c8892 Interval: 209340

11:31:41 *Maximum server connections 12
11:31:41 *Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns
blocked 0, Plog used 0, Llog used 2

11:35:02 *Space 'sm_dbs0' added.
11:35:02 *Checkpoint Completed: *duration was 0 seconds.
11:35:02 *Tue Sep 27 - loguniq 15415, logpos 0x33fb2e4, timestamp:
0x4f4c8a3d Interval: 209341

11:35:02 *Maximum server connections 12
11:35:02 *Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns
blocked 0, Plog used 16, Llog used 10

11:35:13 *Checkpoint Completed: *duration was 0 seconds.
11:35:13 *Tue Sep 27 - loguniq 15415, logpos 0x33fe018, timestamp:
0x4f4c8b0c Interval: 209342

11:35:13 *Maximum server connections 12
11:35:13 *Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns
blocked 0, Plog used 0, Llog used 3

11:40:40 *Checkpoint Completed: *duration was 1 seconds.
11:40:40 *Tue Sep 27 - loguniq 15415, logpos 0x3867018, timestamp:
0x4f4cdac8 Interval: 209343

11:40:40 *Maximum server connections 12
11:40:40 *Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns
blocked 0, Plog used 134, Llog used 1129

11:44:58 *Adjusting LRU for bufferpool - id 0 size 2k
11:44:58 * * * *Old max 60.0 min 50.0 * *New max 59.4 min 49.5
11:46:07 *Adjusting LRU for bufferpool - id 0 size 2k
11:46:07 * * * *Old max 59.4 min 49.5 * *New max 58.8 min 49.0

Well, we can't really get much from what you've posted as the
checkpoint duration's listed in that snipped of MSGPATH file are 0
seconds. How about you post onstat -g ckp output after a long
checkpoint happens? Perhaps that will be more helpful.
Will do. Although we're going on an hour since the last checkpoint...

Reply With Quote
  #4  
Old   
jrenaut
 
Posts: n/a

Default Re: Problems with long checkpoints - 09-27-2011 , 03:06 PM



On Sep 27, 2:48*pm, Sean Baker <SBa... (AT) moneymailer (DOT) com> wrote:
Quote:
-----Original Message-----
From: informix-list-boun... (AT) iiug (DOT) org [mailto:informix-list-
boun... (AT) iiug (DOT) org] On Behalf Of jrenaut
Sent: Tuesday, September 27, 2011 12:41 PM
To: informix-l... (AT) iiug (DOT) org
Subject: Re: Problems with long checkpoints

On Sep 27, 2:05*pm, Sean Baker <SBa... (AT) moneymailer (DOT) com> wrote:
Hello,

We're stuck without a dba after some massive layoffs. *I know a
little, but I'm no dba.

IDS 11.50.FC6 on RHEL 5.3

We have a development instance that is suddenly going into long
checkpoints (43 minutes, 61 minutes, 90 minutes!). *We've rebooted the
server and restarted the instance, and even did a restore from our
production server to the dev server. *It lasts for a while (maybe 30
minutes) then goes into long checkpoints.

One of our developers has been creating new dbspaces, and it seems
that might be the cause. *Although he's added a few over the past
several months without any problems. *After the restore today, he added
a dbspace, and we're stuck in long checkpoints again:

IBM Informix Dynamic Server Version 11.50.FC6 * * -- On-Line (CKPT
INP) -- Up 01:39:05 -- 6638456 Kbytes

I've listed the last few log entries from today below. *Can anyone
give me some obvious things to check? *Or some better data to provide
to the list for some help?

Thanks,

Sean

Log:

11:31:41 *Checkpoint Completed: *duration was 0 seconds.
11:31:41 *Tue Sep 27 - loguniq 15415, logpos 0x33f1018, timestamp:
0x4f4c8892 Interval: 209340

11:31:41 *Maximum server connections 12
11:31:41 *Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns
blocked 0, Plog used 0, Llog used 2

11:35:02 *Space 'sm_dbs0' added.
11:35:02 *Checkpoint Completed: *duration was 0 seconds.
11:35:02 *Tue Sep 27 - loguniq 15415, logpos 0x33fb2e4, timestamp:
0x4f4c8a3d Interval: 209341

11:35:02 *Maximum server connections 12
11:35:02 *Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns
blocked 0, Plog used 16, Llog used 10

11:35:13 *Checkpoint Completed: *duration was 0 seconds.
11:35:13 *Tue Sep 27 - loguniq 15415, logpos 0x33fe018, timestamp:
0x4f4c8b0c Interval: 209342

11:35:13 *Maximum server connections 12
11:35:13 *Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns
blocked 0, Plog used 0, Llog used 3

11:40:40 *Checkpoint Completed: *duration was 1 seconds.
11:40:40 *Tue Sep 27 - loguniq 15415, logpos 0x3867018, timestamp:
0x4f4cdac8 Interval: 209343

11:40:40 *Maximum server connections 12
11:40:40 *Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns
blocked 0, Plog used 134, Llog used 1129

11:44:58 *Adjusting LRU for bufferpool - id 0 size 2k
11:44:58 * * * *Old max 60.0 min 50.0 * *New max 59.4 min49.5
11:46:07 *Adjusting LRU for bufferpool - id 0 size 2k
11:46:07 * * * *Old max 59.4 min 49.5 * *New max 58.8 min49.0

Well, we can't really get much from what you've posted as the
checkpoint duration's listed in that snipped of MSGPATH file are 0
seconds. *How about you post onstat -g ckp output after a long
checkpoint happens? *Perhaps that will be more helpful.

Will do. *Although we're going on an hour since the last checkpoint...
Well, if it's happening right now, how about you post and onstat -g
ath and then find the thread id for the main_loop() thread and post an
onstat -g stk <thread id of main_loop() thread> to see what main_loop
is doing, or what it might be waiting for.

Jacques Renaut
IBM Informix Advanced Support
APD Team

Reply With Quote
  #5  
Old   
Sean Baker
 
Posts: n/a

Default RE: Problems with long checkpoints - 09-27-2011 , 03:15 PM



Quote:
-----Original Message-----
From: informix-list-bounces (AT) iiug (DOT) org [mailto:informix-list-
bounces (AT) iiug (DOT) org] On Behalf Of jrenaut
Sent: Tuesday, September 27, 2011 1:06 PM
To: informix-list (AT) iiug (DOT) org
Subject: Re: Problems with long checkpoints

On Sep 27, 2:48*pm, Sean Baker <SBa... (AT) moneymailer (DOT) com> wrote:
-----Original Message-----
From: informix-list-boun... (AT) iiug (DOT) org [mailto:informix-list-
boun... (AT) iiug (DOT) org] On Behalf Of jrenaut
Sent: Tuesday, September 27, 2011 12:41 PM
To: informix-l... (AT) iiug (DOT) org
Subject: Re: Problems with long checkpoints

On Sep 27, 2:05*pm, Sean Baker <SBa... (AT) moneymailer (DOT) com> wrote:
Hello,

We're stuck without a dba after some massive layoffs. *I know a
little, but I'm no dba.

IDS 11.50.FC6 on RHEL 5.3

We have a development instance that is suddenly going into long
checkpoints (43 minutes, 61 minutes, 90 minutes!). *We've rebooted
the
server and restarted the instance, and even did a restore from our
production server to the dev server. *It lasts for a while (maybe
30
minutes) then goes into long checkpoints.

One of our developers has been creating new dbspaces, and it
seems
that might be the cause. *Although he's added a few over the past
several months without any problems. *After the restore today, he
added
a dbspace, and we're stuck in long checkpoints again:

IBM Informix Dynamic Server Version 11.50.FC6 * * -- On-Line
(CKPT
INP) -- Up 01:39:05 -- 6638456 Kbytes

I've listed the last few log entries from today below. *Can
anyone
give me some obvious things to check? *Or some better data to
provide
to the list for some help?

Thanks,

Sean

Log:

11:31:41 *Checkpoint Completed: *duration was 0 seconds.
11:31:41 *Tue Sep 27 - loguniq 15415, logpos 0x33f1018,
timestamp:
0x4f4c8892 Interval: 209340

11:31:41 *Maximum server connections 12
11:31:41 *Checkpoint Statistics - Avg. Txn Block Time 0.000, #
Txns
blocked 0, Plog used 0, Llog used 2

11:35:02 *Space 'sm_dbs0' added.
11:35:02 *Checkpoint Completed: *duration was 0 seconds.
11:35:02 *Tue Sep 27 - loguniq 15415, logpos 0x33fb2e4,
timestamp:
0x4f4c8a3d Interval: 209341

11:35:02 *Maximum server connections 12
11:35:02 *Checkpoint Statistics - Avg. Txn Block Time 0.000, #
Txns
blocked 0, Plog used 16, Llog used 10

11:35:13 *Checkpoint Completed: *duration was 0 seconds.
11:35:13 *Tue Sep 27 - loguniq 15415, logpos 0x33fe018,
timestamp:
0x4f4c8b0c Interval: 209342

11:35:13 *Maximum server connections 12
11:35:13 *Checkpoint Statistics - Avg. Txn Block Time 0.000, #
Txns
blocked 0, Plog used 0, Llog used 3

11:40:40 *Checkpoint Completed: *duration was 1 seconds.
11:40:40 *Tue Sep 27 - loguniq 15415, logpos 0x3867018,
timestamp:
0x4f4cdac8 Interval: 209343

11:40:40 *Maximum server connections 12
11:40:40 *Checkpoint Statistics - Avg. Txn Block Time 0.000, #
Txns
blocked 0, Plog used 134, Llog used 1129

11:44:58 *Adjusting LRU for bufferpool - id 0 size 2k
11:44:58 * * * *Old max 60.0 min 50.0 * *New max 59.4 min 49.5
11:46:07 *Adjusting LRU for bufferpool - id 0 size 2k
11:46:07 * * * *Old max 59.4 min 49.5 * *New max 58.8 min 49.0

Well, we can't really get much from what you've posted as the
checkpoint duration's listed in that snipped of MSGPATH file are 0
seconds. *How about you post onstat -g ckp output after a long
checkpoint happens? *Perhaps that will be more helpful.

Will do. *Although we're going on an hour since the last
checkpoint...

Well, if it's happening right now, how about you post and onstat -g
ath and then find the thread id for the main_loop() thread and post an
onstat -g stk <thread id of main_loop() thread> to see what main_loop
is doing, or what it might be waiting for.

Jacques Renaut
IBM Informix Advanced Support
APD Team
The developer tried dropping the dbspace he created, so now we're blocked with "DBS_DROP". Here's the onstat -g ath for the main_loop():

IBM Informix Dynamic Server Version 11.50.FC6 -- On-Line (CKPT INP) -- Up 02:48:10 -- 6638456 Kbytes
BlockedBS_DROP

Threads:
tid tcb rstcb prty status vp-class name
18 15f37a548 15e56f028 3 sleeping secs: 1 5cpu main_loop()

And here's onstat -g stk 18:

IBM Informix Dynamic Server Version 11.50.FC6 -- On-Line (CKPT INP) -- Up 02:49:56 -- 6638456 Kbytes
BlockedBS_DROP

Stack for thread: 18 main_loop()
base: 0x000000015f3c6000
len: 69632
pc: 0x0000000001073d0e
tos: 0x000000015f3d5a40
state: sleeping
vp: 4

0x0000000001073d0e (/opt/IBM/informix/bin/oninit) yield_processor_mvp
0x000000000107556b (/opt/IBM/informix/bin/oninit) mt_yield
0x0000000000bb951a (/opt/IBM/informix/bin/oninit) dskflush_bfpool
0x0000000000bb98c2 (/opt/IBM/informix/bin/oninit) dskflush
0x0000000000b56165 (/opt/IBM/informix/bin/oninit) checkpoint
0x00000000005336e9 (/opt/IBM/informix/bin/oninit) main_loop
0x000000000108560f (/opt/IBM/informix/bin/oninit) startup

Thanks,

Sean.

Reply With Quote
  #6  
Old   
Sean Baker
 
Posts: n/a

Default RE: Problems with long checkpoints - 09-27-2011 , 03:30 PM



Quote:
-----Original Message-----
From: informix-list-bounces (AT) iiug (DOT) org [mailto:informix-list-
bounces (AT) iiug (DOT) org] On Behalf Of jrenaut
Sent: Tuesday, September 27, 2011 1:06 PM
To: informix-list (AT) iiug (DOT) org
Subject: Re: Problems with long checkpoints

On Sep 27, 2:48*pm, Sean Baker <SBa... (AT) moneymailer (DOT) com> wrote:
-----Original Message-----
From: informix-list-boun... (AT) iiug (DOT) org [mailto:informix-list-
boun... (AT) iiug (DOT) org] On Behalf Of jrenaut
Sent: Tuesday, September 27, 2011 12:41 PM
To: informix-l... (AT) iiug (DOT) org
Subject: Re: Problems with long checkpoints

On Sep 27, 2:05*pm, Sean Baker <SBa... (AT) moneymailer (DOT) com> wrote:
Hello,

We're stuck without a dba after some massive layoffs. *I know a
little, but I'm no dba.

IDS 11.50.FC6 on RHEL 5.3

We have a development instance that is suddenly going into long
checkpoints (43 minutes, 61 minutes, 90 minutes!). *We've rebooted
the
server and restarted the instance, and even did a restore from our
production server to the dev server. *It lasts for a while (maybe
30
minutes) then goes into long checkpoints.

One of our developers has been creating new dbspaces, and it
seems
that might be the cause. *Although he's added a few over the past
several months without any problems. *After the restore today, he
added
a dbspace, and we're stuck in long checkpoints again:

IBM Informix Dynamic Server Version 11.50.FC6 * * -- On-Line
(CKPT
INP) -- Up 01:39:05 -- 6638456 Kbytes

I've listed the last few log entries from today below. *Can
anyone
give me some obvious things to check? *Or some better data to
provide
to the list for some help?

Thanks,

Sean

Log:

11:31:41 *Checkpoint Completed: *duration was 0 seconds.
11:31:41 *Tue Sep 27 - loguniq 15415, logpos 0x33f1018,
timestamp:
0x4f4c8892 Interval: 209340

11:31:41 *Maximum server connections 12
11:31:41 *Checkpoint Statistics - Avg. Txn Block Time 0.000, #
Txns
blocked 0, Plog used 0, Llog used 2

11:35:02 *Space 'sm_dbs0' added.
11:35:02 *Checkpoint Completed: *duration was 0 seconds.
11:35:02 *Tue Sep 27 - loguniq 15415, logpos 0x33fb2e4,
timestamp:
0x4f4c8a3d Interval: 209341

11:35:02 *Maximum server connections 12
11:35:02 *Checkpoint Statistics - Avg. Txn Block Time 0.000, #
Txns
blocked 0, Plog used 16, Llog used 10

11:35:13 *Checkpoint Completed: *duration was 0 seconds.
11:35:13 *Tue Sep 27 - loguniq 15415, logpos 0x33fe018,
timestamp:
0x4f4c8b0c Interval: 209342

11:35:13 *Maximum server connections 12
11:35:13 *Checkpoint Statistics - Avg. Txn Block Time 0.000, #
Txns
blocked 0, Plog used 0, Llog used 3

11:40:40 *Checkpoint Completed: *duration was 1 seconds.
11:40:40 *Tue Sep 27 - loguniq 15415, logpos 0x3867018,
timestamp:
0x4f4cdac8 Interval: 209343

11:40:40 *Maximum server connections 12
11:40:40 *Checkpoint Statistics - Avg. Txn Block Time 0.000, #
Txns
blocked 0, Plog used 134, Llog used 1129

11:44:58 *Adjusting LRU for bufferpool - id 0 size 2k
11:44:58 * * * *Old max 60.0 min 50.0 * *New max 59.4 min 49.5
11:46:07 *Adjusting LRU for bufferpool - id 0 size 2k
11:46:07 * * * *Old max 59.4 min 49.5 * *New max 58.8 min 49.0

Well, we can't really get much from what you've posted as the
checkpoint duration's listed in that snipped of MSGPATH file are 0
seconds. *How about you post onstat -g ckp output after a long
checkpoint happens? *Perhaps that will be more helpful.

Will do. *Although we're going on an hour since the last
checkpoint...

Well, if it's happening right now, how about you post and onstat -g
ath and then find the thread id for the main_loop() thread and post an
onstat -g stk <thread id of main_loop() thread> to see what main_loop
is doing, or what it might be waiting for.

Jacques Renaut
IBM Informix Advanced Support
APD Team
The checkpoint finally finished. Here's onstat -g ckp:

----------
IBM Informix Dynamic Server Version 11.50.FC6 -- On-Line -- Up 03:02:03-- 6638456 Kbytes

AUTO_CKPTS=On RTO_SERVER_RESTART=Off

Critical Sections Physical Log Logical Log
Clock Total Flush Block # Ckpt Wait Long # Dirty Dskflu Total Avg Total Avg
Interval Time Trigger LSN Time Time Time Waits Time Time Time Buffers /Sec Pages /Sec Pages /Sec
209328 11:06:09 *Reorg 15415:0x312f018 0.0 0.0 0.0 1 0.0 0.0 0.0 6 6 30 30 24 24
209329 11:06:10 *Reorg 15415:0x314b018 0.0 0.0 0.0 1 0.0 0.0 0.0 12 12 53 53 28 28
209330 11:06:19 *Reorg 15415:0x32657bc 0.0 0.0 0.0 2 0.0 0.0 0.0 18 18 85 9 283 31
209331 11:06:20 *Reorg 15415:0x326b7bc 0.0 0.0 0.0 1 0.0 0.0 0.0 8 8 24 24 6 6
209332 11:06:21 *Reorg 15415:0x3282018 0.0 0.0 0.0 1 0.0 0.0 0.0 6 6 29 29 22 22
209333 11:06:22 *Reorg 15415:0x329a018 0.0 0.0 0.0 1 0.0 0.0 0.0 11 11 49 49 24 24
209334 11:06:28 *Reorg 15415:0x33cc7bc 0.0 0.0 0.0 1 0.0 0.0 0.0 14 14 114 19 307 51
209335 11:06:29 *Reorg 15415:0x33d4018 0.0 0.0 0.0 1 0.0 0.0 0.0 5 5 26 26 7 7
209336 11:11:39 CKPTINTVL 15415:0x33e0018 0.0 0.0 0.0 0 0.0 0.0 0.0 117 117 65 0 12 0
209337 11:16:40 CKPTINTVL 15415:0x33e2018 0.0 0.0 0.0 0 0.0 0.0 0.0 10 10 0 0 2 0
209338 11:21:40 CKPTINTVL 15415:0x33ed018 0.0 0.0 0.0 0 0.0 0.0 0.0 43 43 35 0 11 0
209339 11:26:40 CKPTINTVL 15415:0x33ef018 0.0 0.0 0.0 0 0.0 0.0 0.0 10 10 2 0 2 0
209340 11:31:41 CKPTINTVL 15415:0x33f1018 0.0 0.0 0.0 0 0.0 0.0 0.0 9 9 0 0 2 0
209341 11:35:02 *Admin 15415:0x33fb2e4 0.0 0.0 0.0 1 0.0 0.0 0.0 2 2 16 0 10 0
209342 11:35:13 *User 15415:0x33fe018 0.0 0.0 0.0 1 0.0 0.0 0.0 0 0 0 0 3 0
209343 11:40:40 CKPTINTVL 15415:0x3867018 1.1 1.1 0.0 0 0.0 0.0 0.0 897 831 134 0 1129 3
209344 13:23:25 CKPTINTVL 15415:0x477779c 5865.4 5718.6 0.0 1 0.8 24.7 24.7 1818113 317 332 1 4270 13
209345 13:23:26 *Admin 15415:0x4917018 0.2 0.0 0.0 25 0.0 0.2 0.2 151 151 28120 4 416 0
209346 13:23:26 *Admin 15415:0x492a018 0.0 0.0 0.0 1 0.0 0.0 0.0 15 15 37 37 19 19
209347 13:24:11 *User 15415:0x492c018 0.0 0.0 0.0 1 0.0 0.0 0.0 0 0 0 0 2 0

Max Plog Max Llog Max Dskflush Avg Dskflush Avg Dirty Blocked
pages/sec pages/sec Time pages/sec pages/sec Time
768 368 5719 43 153 0

Based on the current workload, the physical log might be too small
to accommodate the time it takes to flush the buffer pool during
checkpoint processing. The server might block transactions during checkpoints.
If the server blocks transactions, increase the physical log size to
at least 8838144 KB.
----------

Not sure if the warning at the end is real or not.

Thanks,

Sean.

Reply With Quote
  #7  
Old   
jrenaut
 
Posts: n/a

Default Re: Problems with long checkpoints - 09-27-2011 , 03:34 PM



On Sep 27, 3:15*pm, Sean Baker <SBa... (AT) moneymailer (DOT) com> wrote:
Quote:
-----Original Message-----
From: informix-list-boun... (AT) iiug (DOT) org [mailto:informix-list-
boun... (AT) iiug (DOT) org] On Behalf Of jrenaut
Sent: Tuesday, September 27, 2011 1:06 PM
To: informix-l... (AT) iiug (DOT) org
Subject: Re: Problems with long checkpoints

On Sep 27, 2:48*pm, Sean Baker <SBa... (AT) moneymailer (DOT) com> wrote:
-----Original Message-----
From: informix-list-boun... (AT) iiug (DOT) org [mailto:informix-list-
boun... (AT) iiug (DOT) org] On Behalf Of jrenaut
Sent: Tuesday, September 27, 2011 12:41 PM
To: informix-l... (AT) iiug (DOT) org
Subject: Re: Problems with long checkpoints

On Sep 27, 2:05*pm, Sean Baker <SBa... (AT) moneymailer (DOT) com> wrote:
Hello,

We're stuck without a dba after some massive layoffs. *I know a
little, but I'm no dba.

IDS 11.50.FC6 on RHEL 5.3

We have a development instance that is suddenly going into long
checkpoints (43 minutes, 61 minutes, 90 minutes!). *We've rebooted
the
server and restarted the instance, and even did a restore from our
production server to the dev server. *It lasts for a while (maybe
30
minutes) then goes into long checkpoints.

One of our developers has been creating new dbspaces, and it
seems
that might be the cause. *Although he's added a few over the past
several months without any problems. *After the restore today, he
added
a dbspace, and we're stuck in long checkpoints again:

IBM Informix Dynamic Server Version 11.50.FC6 * * -- On-Line
(CKPT
INP) -- Up 01:39:05 -- 6638456 Kbytes

I've listed the last few log entries from today below. *Can
anyone
give me some obvious things to check? *Or some better data to
provide
to the list for some help?

Thanks,

Sean

Log:

11:31:41 *Checkpoint Completed: *duration was 0 seconds.
11:31:41 *Tue Sep 27 - loguniq 15415, logpos 0x33f1018,
timestamp:
0x4f4c8892 Interval: 209340

11:31:41 *Maximum server connections 12
11:31:41 *Checkpoint Statistics - Avg. Txn Block Time 0.000, #
Txns
blocked 0, Plog used 0, Llog used 2

11:35:02 *Space 'sm_dbs0' added.
11:35:02 *Checkpoint Completed: *duration was 0 seconds.
11:35:02 *Tue Sep 27 - loguniq 15415, logpos 0x33fb2e4,
timestamp:
0x4f4c8a3d Interval: 209341

11:35:02 *Maximum server connections 12
11:35:02 *Checkpoint Statistics - Avg. Txn Block Time 0.000, #
Txns
blocked 0, Plog used 16, Llog used 10

11:35:13 *Checkpoint Completed: *duration was 0 seconds.
11:35:13 *Tue Sep 27 - loguniq 15415, logpos 0x33fe018,
timestamp:
0x4f4c8b0c Interval: 209342

11:35:13 *Maximum server connections 12
11:35:13 *Checkpoint Statistics - Avg. Txn Block Time 0.000, #
Txns
blocked 0, Plog used 0, Llog used 3

11:40:40 *Checkpoint Completed: *duration was 1 seconds.
11:40:40 *Tue Sep 27 - loguniq 15415, logpos 0x3867018,
timestamp:
0x4f4cdac8 Interval: 209343

11:40:40 *Maximum server connections 12
11:40:40 *Checkpoint Statistics - Avg. Txn Block Time 0.000, #
Txns
blocked 0, Plog used 134, Llog used 1129

11:44:58 *Adjusting LRU for bufferpool - id 0 size 2k
11:44:58 * * * *Old max 60.0 min 50.0 * *New max 59.4min 49.5
11:46:07 *Adjusting LRU for bufferpool - id 0 size 2k
11:46:07 * * * *Old max 59.4 min 49.5 * *New max 58.8min 49.0

Well, we can't really get much from what you've posted as the
checkpoint duration's listed in that snipped of MSGPATH file are 0
seconds. *How about you post onstat -g ckp output after a long
checkpoint happens? *Perhaps that will be more helpful.

Will do. *Although we're going on an hour since the last
checkpoint...

Well, if it's happening right now, how about you post and onstat -g
ath and then find the thread id for the main_loop() thread and post an
onstat -g stk <thread id of main_loop() thread> to see what main_loop
is doing, or what it might be waiting for.

Jacques Renaut
IBM Informix Advanced Support
APD Team

The developer tried dropping the dbspace he created, so now we're blockedwith "DBS_DROP". *Here's the onstat -g ath for the main_loop():

IBM Informix Dynamic Server Version 11.50.FC6 * * -- On-Line (CKPT INP) -- Up 02:48:10 -- 6638456 Kbytes
BlockedBS_DROP

Threads:
*tid * * tcb * * * * * * *rstcb * * * * * *prty status * * * * * * * *vp-class * * * name
*18 * * *15f37a548 * * * *15e56f028 * * * *3 * *sleeping secs: 1 * * * *5cpu * * * * main_loop()

And here's onstat -g stk 18:

IBM Informix Dynamic Server Version 11.50.FC6 * * -- On-Line (CKPT INP) -- Up 02:49:56 -- 6638456 Kbytes
BlockedBS_DROP

Stack for thread: 18 main_loop()
*base: 0x000000015f3c6000
* len: * 69632
* *pc: 0x0000000001073d0e
* tos: 0x000000015f3d5a40
state: sleeping
* *vp: 4

0x0000000001073d0e (/opt/IBM/informix/bin/oninit) yield_processor_mvp
0x000000000107556b (/opt/IBM/informix/bin/oninit) mt_yield
0x0000000000bb951a (/opt/IBM/informix/bin/oninit) dskflush_bfpool
0x0000000000bb98c2 (/opt/IBM/informix/bin/oninit) dskflush
0x0000000000b56165 (/opt/IBM/informix/bin/oninit) checkpoint
0x00000000005336e9 (/opt/IBM/informix/bin/oninit) main_loop
0x000000000108560f (/opt/IBM/informix/bin/oninit) startup

Thanks,

Sean.
Well main_loop is at the point where it's trying to flush the buffer
pool, so if that's taking a long time, it could possibly be some I/O
problem. You could try and look at onstat -g iof and see if you see
any dbspace/chunk with really bad avg. time. You could also look at
onstat -F output and see if you see the "Chunk" column changing (which
would indicate that flushers are changing which chunk they are
flushing) or if they are not changing then which chunks are being
flushed which might be the ones that are having IO problems (not
likely errors but slow IO performance).

Jacques Renaut
IBM Informix Advanced Support
APD Team

Reply With Quote
  #8  
Old   
jrenaut
 
Posts: n/a

Default Re: Problems with long checkpoints - 09-27-2011 , 04:12 PM



On Sep 27, 3:30*pm, Sean Baker <SBa... (AT) moneymailer (DOT) com> wrote:
Quote:
-----Original Message-----
From: informix-list-boun... (AT) iiug (DOT) org [mailto:informix-list-
boun... (AT) iiug (DOT) org] On Behalf Of jrenaut
Sent: Tuesday, September 27, 2011 1:06 PM
To: informix-l... (AT) iiug (DOT) org
Subject: Re: Problems with long checkpoints

On Sep 27, 2:48*pm, Sean Baker <SBa... (AT) moneymailer (DOT) com> wrote:
-----Original Message-----
From: informix-list-boun... (AT) iiug (DOT) org [mailto:informix-list-
boun... (AT) iiug (DOT) org] On Behalf Of jrenaut
Sent: Tuesday, September 27, 2011 12:41 PM
To: informix-l... (AT) iiug (DOT) org
Subject: Re: Problems with long checkpoints

On Sep 27, 2:05*pm, Sean Baker <SBa... (AT) moneymailer (DOT) com> wrote:
Hello,

We're stuck without a dba after some massive layoffs. *I know a
little, but I'm no dba.

IDS 11.50.FC6 on RHEL 5.3

We have a development instance that is suddenly going into long
checkpoints (43 minutes, 61 minutes, 90 minutes!). *We've rebooted
the
server and restarted the instance, and even did a restore from our
production server to the dev server. *It lasts for a while (maybe
30
minutes) then goes into long checkpoints.

One of our developers has been creating new dbspaces, and it
seems
that might be the cause. *Although he's added a few over the past
several months without any problems. *After the restore today, he
added
a dbspace, and we're stuck in long checkpoints again:

IBM Informix Dynamic Server Version 11.50.FC6 * * -- On-Line
(CKPT
INP) -- Up 01:39:05 -- 6638456 Kbytes

I've listed the last few log entries from today below. *Can
anyone
give me some obvious things to check? *Or some better data to
provide
to the list for some help?

Thanks,

Sean

Log:

11:31:41 *Checkpoint Completed: *duration was 0 seconds.
11:31:41 *Tue Sep 27 - loguniq 15415, logpos 0x33f1018,
timestamp:
0x4f4c8892 Interval: 209340

11:31:41 *Maximum server connections 12
11:31:41 *Checkpoint Statistics - Avg. Txn Block Time 0.000, #
Txns
blocked 0, Plog used 0, Llog used 2

11:35:02 *Space 'sm_dbs0' added.
11:35:02 *Checkpoint Completed: *duration was 0 seconds.
11:35:02 *Tue Sep 27 - loguniq 15415, logpos 0x33fb2e4,
timestamp:
0x4f4c8a3d Interval: 209341

11:35:02 *Maximum server connections 12
11:35:02 *Checkpoint Statistics - Avg. Txn Block Time 0.000, #
Txns
blocked 0, Plog used 16, Llog used 10

11:35:13 *Checkpoint Completed: *duration was 0 seconds.
11:35:13 *Tue Sep 27 - loguniq 15415, logpos 0x33fe018,
timestamp:
0x4f4c8b0c Interval: 209342

11:35:13 *Maximum server connections 12
11:35:13 *Checkpoint Statistics - Avg. Txn Block Time 0.000, #
Txns
blocked 0, Plog used 0, Llog used 3

11:40:40 *Checkpoint Completed: *duration was 1 seconds.
11:40:40 *Tue Sep 27 - loguniq 15415, logpos 0x3867018,
timestamp:
0x4f4cdac8 Interval: 209343

11:40:40 *Maximum server connections 12
11:40:40 *Checkpoint Statistics - Avg. Txn Block Time 0.000, #
Txns
blocked 0, Plog used 134, Llog used 1129

11:44:58 *Adjusting LRU for bufferpool - id 0 size 2k
11:44:58 * * * *Old max 60.0 min 50.0 * *New max 59.4min 49.5
11:46:07 *Adjusting LRU for bufferpool - id 0 size 2k
11:46:07 * * * *Old max 59.4 min 49.5 * *New max 58.8min 49.0

Well, we can't really get much from what you've posted as the
checkpoint duration's listed in that snipped of MSGPATH file are 0
seconds. *How about you post onstat -g ckp output after a long
checkpoint happens? *Perhaps that will be more helpful.

Will do. *Although we're going on an hour since the last
checkpoint...

Well, if it's happening right now, how about you post and onstat -g
ath and then find the thread id for the main_loop() thread and post an
onstat -g stk <thread id of main_loop() thread> to see what main_loop
is doing, or what it might be waiting for.

Jacques Renaut
IBM Informix Advanced Support
APD Team

The checkpoint finally finished. *Here's onstat -g ckp:

----------
IBM Informix Dynamic Server Version 11.50.FC6 * * -- On-Line -- Up 03:02:03 -- 6638456 Kbytes

AUTO_CKPTS=On * RTO_SERVER_RESTART=Off *

* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * Critical Sections * * * * * * * * * * * * *Physical Log * *Logical Log * *
* * * * * *Clock * * * * * * * * * * * * * * * * *Total Flush Block # * * *Ckpt *Wait *Long *# Dirty * Dskflu *Total * *Avg * *Total * *Avg *
Interval * Time * * *Trigger * *LSN * * * * * * * Time *Time *Time *Waits *Time *Time *Time *Buffers * /Sec * *Pages * */Sec * Pages * */Sec *
209328 * * 11:06:09 *Reorg * * *15415:0x312f018 * 0.0 * 0.0* 0.0 * 1 * * *0.0 * 0.0 * 0.0 * 6 * * * * 6 * * * 30 * * * 30 * * 24 * * * 24 * *
209329 * * 11:06:10 *Reorg * * *15415:0x314b018 * 0.0 * 0.0* 0.0 * 1 * * *0.0 * 0.0 * 0.0 * 12 * * * *12 * * *53 * * * 53 * * 28 * * * 28 * *
209330 * * 11:06:19 *Reorg * * *15415:0x32657bc * 0.0 * 0.0* 0.0 * 2 * * *0.0 * 0.0 * 0.0 * 18 * * * *18 * * *85 * * * 9 * * *283 * * *31 * *
209331 * * 11:06:20 *Reorg * * *15415:0x326b7bc * 0.0 * 0.0* 0.0 * 1 * * *0.0 * 0.0 * 0.0 * 8 * * * * 8 * * * 24 * * * 24 * * 6 * * * *6 * *
209332 * * 11:06:21 *Reorg * * *15415:0x3282018 * 0.0 * 0.0* 0.0 * 1 * * *0.0 * 0.0 * 0.0 * 6 * * * * 6 * * * 29 * * * 29 * * 22 * * * 22 * *
209333 * * 11:06:22 *Reorg * * *15415:0x329a018 * 0.0 * 0.0* 0.0 * 1 * * *0.0 * 0.0 * 0.0 * 11 * * * *11 * * *49 * * * 49 * * 24 * * * 24 * *
209334 * * 11:06:28 *Reorg * * *15415:0x33cc7bc * 0.0 * 0.0* 0.0 * 1 * * *0.0 * 0.0 * 0.0 * 14 * * * *14 * * *114 * * *19 * * 307 * * *51 * *
209335 * * 11:06:29 *Reorg * * *15415:0x33d4018 * 0.0 * 0.0* 0.0 * 1 * * *0.0 * 0.0 * 0.0 * 5 * * * * 5 * * * 26 * * * 26 * * 7 * * * *7 * *
209336 * * 11:11:39 *CKPTINTVL *15415:0x33e0018 * 0.0 * 0.0 * 0.0 * 0 * * *0.0 * 0.0 * 0.0 * 117 * * * 117 * * 65 * * * 0 * * *12 * * * 0 * *
209337 * * 11:16:40 *CKPTINTVL *15415:0x33e2018 * 0.0 * 0.0 * 0.0 * 0 * * *0.0 * 0.0 * 0.0 * 10 * * * *10 * * *0 * * * *0 * * *2 * * * *0 * *
209338 * * 11:21:40 *CKPTINTVL *15415:0x33ed018 * 0.0 * 0.0 * 0.0 * 0 * * *0.0 * 0.0 * 0.0 * 43 * * * *43 * * *35 * * * 0 * * *11 * * * 0 * *
209339 * * 11:26:40 *CKPTINTVL *15415:0x33ef018 * 0.0 * 0.0 * 0.0 * 0 * * *0.0 * 0.0 * 0.0 * 10 * * * *10 * * *2 * * * *0 * * *2 * * * *0 * *
209340 * * 11:31:41 *CKPTINTVL *15415:0x33f1018 * 0.0 * 0.0 * 0.0 * 0 * * *0.0 * 0.0 * 0.0 * 9 * * * * 9 * * * 0 * * * *0 * * *2 * * * *0 * *
209341 * * 11:35:02 *Admin * * *15415:0x33fb2e4 * 0.0 * 0.0* 0.0 * 1 * * *0.0 * 0.0 * 0.0 * 2 * * * * 2 * * * 16 * * * 0 * * *10 * * * 0 * *
209342 * * 11:35:13 *User * * * 15415:0x33fe018 * 0.0 * 0.0* 0.0 * 1 * * *0.0 * 0.0 * 0.0 * 0 * * * * 0 * * * 0 * * * *0 * * *3 * * * *0 * *
209343 * * 11:40:40 *CKPTINTVL *15415:0x3867018 * 1.1 * 1.1 * 0.0 * 0 * * *0.0 * 0.0 * 0.0 * 897 * * * 831 * * 134 * * *0 * * *1129 * * 3 * *
209344 * * 13:23:25 *CKPTINTVL *15415:0x477779c * 5865.4 5718.60.0 * 1 * * *0.8 * 24.7 *24.7 *1818113 * 317 * * 332 * * *1 * * *4270 * * 13 * *
209345 * * 13:23:26 *Admin * * *15415:0x4917018 * 0.2 * 0.0* 0.0 * 25 * * 0.0 * 0.2 * 0.2 * 151 * * * 151 * *28120 * *4 * * *416 * * *0 * *
209346 * * 13:23:26 *Admin * * *15415:0x492a018 * 0.0 * 0.0* 0.0 * 1 * * *0.0 * 0.0 * 0.0 * 15 * * * *15 * * *37 * * * 37 * * 19 * * * 19 * *
209347 * * 13:24:11 *User * * * 15415:0x492c018 * 0.0 * 0.0* 0.0 * 1 * * *0.0 * 0.0 * 0.0 * 0 * * * * 0 * * * 0 * * * *0 * * *2 * * * *0 * *

Max Plog * * * Max Llog * * * Max Dskflush * Avg Dskflush * Avg Dirty * * *Blocked * * *
pages/sec * * *pages/sec * * *Time * * * * * pages/sec * * *pages/sec * * *Time * * * *
768 * * * * * *368 * * * * * *5719 * * * * * 43 * * * * * * 153 * * * * * *0 * * ** * *

Based on the current workload, the physical log might be too small
to accommodate the time it takes to flush the buffer pool during
checkpoint processing. *The server might block transactions during checkpoints.
If the server blocks transactions, increase the physical log size to
at least 8838144 KB.
----------

Not sure if the warning at the end is real or not.

Thanks,

Sean.
Taking a look at your onstat -g ckp output your system goes from <
1000 dirty buffers per checkpoint, to the checkpoint that takes ~5800
seconds and it said you have 1.8 million dirty buffers for that
checkpoint. So it would seem like you are doing something on your
system that is doing tons of page modifications, and so at the time of
the checkpoint, those 1.8 million pages need to be flushed, which it
taking a long time. It might be an slow IO issue, or you might just
need to tune your page cleaners to do more aggressive cleaning by
changing your LRU min/max settings. Or you need to find out why your
system suddenly dirtied up 1.8 million pages if that's not supposed to
be happening.

Jacques Renaut
IBM Informix Advanced Support
APD Team

Reply With Quote
  #9  
Old   
Sean Baker
 
Posts: n/a

Default RE: Problems with long checkpoints - 09-27-2011 , 04:48 PM



Quote:
-----Original Message-----
From: informix-list-bounces (AT) iiug (DOT) org [mailto:informix-list-
bounces (AT) iiug (DOT) org] On Behalf Of jrenaut
Sent: Tuesday, September 27, 2011 2:12 PM
To: informix-list (AT) iiug (DOT) org
Subject: Re: Problems with long checkpoints

On Sep 27, 3:30*pm, Sean Baker <SBa... (AT) moneymailer (DOT) com> wrote:
-----Original Message-----
From: informix-list-boun... (AT) iiug (DOT) org [mailto:informix-list-
boun... (AT) iiug (DOT) org] On Behalf Of jrenaut
Sent: Tuesday, September 27, 2011 1:06 PM
To: informix-l... (AT) iiug (DOT) org
Subject: Re: Problems with long checkpoints

On Sep 27, 2:48*pm, Sean Baker <SBa... (AT) moneymailer (DOT) com> wrote:
-----Original Message-----
From: informix-list-boun... (AT) iiug (DOT) org [mailto:informix-list-
boun... (AT) iiug (DOT) org] On Behalf Of jrenaut
Sent: Tuesday, September 27, 2011 12:41 PM
To: informix-l... (AT) iiug (DOT) org
Subject: Re: Problems with long checkpoints

On Sep 27, 2:05*pm, Sean Baker <SBa... (AT) moneymailer (DOT) com> wrote:
Hello,

We're stuck without a dba after some massive layoffs. *I know
a
little, but I'm no dba.

IDS 11.50.FC6 on RHEL 5.3

We have a development instance that is suddenly going into
long
checkpoints (43 minutes, 61 minutes, 90 minutes!). *We've
rebooted
the
server and restarted the instance, and even did a restore from
our
production server to the dev server. *It lasts for a while
(maybe
30
minutes) then goes into long checkpoints.

One of our developers has been creating new dbspaces, and it
seems
that might be the cause. *Although he's added a few over the
past
several months without any problems. *After the restore today,
he
added
a dbspace, and we're stuck in long checkpoints again:

IBM Informix Dynamic Server Version 11.50.FC6 * * -- On-Line
(CKPT
INP) -- Up 01:39:05 -- 6638456 Kbytes

I've listed the last few log entries from today below. *Can
anyone
give me some obvious things to check? *Or some better data to
provide
to the list for some help?

Thanks,

Sean

Log:

11:31:41 *Checkpoint Completed: *duration was 0 seconds.
11:31:41 *Tue Sep 27 - loguniq 15415, logpos 0x33f1018,
timestamp:
0x4f4c8892 Interval: 209340

11:31:41 *Maximum server connections 12
11:31:41 *Checkpoint Statistics - Avg. Txn Block Time 0.000,
#
Txns
blocked 0, Plog used 0, Llog used 2

11:35:02 *Space 'sm_dbs0' added.
11:35:02 *Checkpoint Completed: *duration was 0 seconds.
11:35:02 *Tue Sep 27 - loguniq 15415, logpos 0x33fb2e4,
timestamp:
0x4f4c8a3d Interval: 209341

11:35:02 *Maximum server connections 12
11:35:02 *Checkpoint Statistics - Avg. Txn Block Time 0.000,
#
Txns
blocked 0, Plog used 16, Llog used 10

11:35:13 *Checkpoint Completed: *duration was 0 seconds.
11:35:13 *Tue Sep 27 - loguniq 15415, logpos 0x33fe018,
timestamp:
0x4f4c8b0c Interval: 209342

11:35:13 *Maximum server connections 12
11:35:13 *Checkpoint Statistics - Avg. Txn Block Time 0.000,
#
Txns
blocked 0, Plog used 0, Llog used 3

11:40:40 *Checkpoint Completed: *duration was 1 seconds.
11:40:40 *Tue Sep 27 - loguniq 15415, logpos 0x3867018,
timestamp:
0x4f4cdac8 Interval: 209343

11:40:40 *Maximum server connections 12
11:40:40 *Checkpoint Statistics - Avg. Txn Block Time 0.000,
#
Txns
blocked 0, Plog used 134, Llog used 1129

11:44:58 *Adjusting LRU for bufferpool - id 0 size 2k
11:44:58 * * * *Old max 60.0 min 50.0 * *New max 59..4 min
49.5
11:46:07 *Adjusting LRU for bufferpool - id 0 size 2k
11:46:07 * * * *Old max 59.4 min 49.5 * *New max 58..8 min
49.0

Well, we can't really get much from what you've posted as the
checkpoint duration's listed in that snipped of MSGPATH file
are 0
seconds. *How about you post onstat -g ckp output after a long
checkpoint happens? *Perhaps that will be more helpful.

Will do. *Although we're going on an hour since the last
checkpoint...

Well, if it's happening right now, how about you post and onstat -g
ath and then find the thread id for the main_loop() thread and post
an
onstat -g stk <thread id of main_loop() thread> to see what
main_loop
is doing, or what it might be waiting for.

Jacques Renaut
IBM Informix Advanced Support
APD Team

The checkpoint finally finished. *Here's onstat -g ckp:

----------
IBM Informix Dynamic Server Version 11.50.FC6 * * -- On-Line -- Up
03:02:03 -- 6638456 Kbytes

AUTO_CKPTS=On * RTO_SERVER_RESTART=Off


Critical Sections * * * * * * * * * * * * *Physical Log * *Logical Log

* * * * * *Clock * * * * * * * * * * * * * * * * *Total Flush Block #
* * *Ckpt *Wait *Long *# Dirty * Dskflu *Total * *Avg* *Total * *Avg
Interval * Time * * *Trigger * *LSN * * * * * ** Time *Time *Time
*Waits *Time *Time *Time *Buffers * /Sec * *Pages * */Sec * Pages
*/Sec
209328 * * 11:06:09 *Reorg * * *15415:0x312f018 * 0.0 * 0..0 * 0.0 * 1
* * *0.0 * 0.0 * 0.0 * 6 * * * * 6 * * * 30 ** * 30 * * 24 * * * 24

209329 * * 11:06:10 *Reorg * * *15415:0x314b018 * 0.0 * 0..0 * 0.0 * 1
* * *0.0 * 0.0 * 0.0 * 12 * * * *12 * * *53 ** * 53 * * 28 * * * 28

209330 * * 11:06:19 *Reorg * * *15415:0x32657bc * 0.0 * 0..0 * 0.0 * 2
* * *0.0 * 0.0 * 0.0 * 18 * * * *18 * * *85 ** * 9 * * *283 * * *31

209331 * * 11:06:20 *Reorg * * *15415:0x326b7bc * 0.0 * 0..0 * 0.0 * 1
* * *0.0 * 0.0 * 0.0 * 8 * * * * 8 * * * 24 ** * 24 * * 6 * * * *6
209332 * * 11:06:21 *Reorg * * *15415:0x3282018 * 0.0 * 0..0 * 0.0 * 1
* * *0.0 * 0.0 * 0.0 * 6 * * * * 6 * * * 29 ** * 29 * * 22 * * * 22

209333 * * 11:06:22 *Reorg * * *15415:0x329a018 * 0.0 * 0..0 * 0.0 * 1
* * *0.0 * 0.0 * 0.0 * 11 * * * *11 * * *49 ** * 49 * * 24 * * * 24

209334 * * 11:06:28 *Reorg * * *15415:0x33cc7bc * 0.0 * 0..0 * 0.0 * 1
* * *0.0 * 0.0 * 0.0 * 14 * * * *14 * * *114 * * *19 * * 307 * * *51

209335 * * 11:06:29 *Reorg * * *15415:0x33d4018 * 0.0 * 0..0 * 0.0 * 1
* * *0.0 * 0.0 * 0.0 * 5 * * * * 5 * * * 26 ** * 26 * * 7 * * * *7
209336 * * 11:11:39 *CKPTINTVL *15415:0x33e0018 * 0.0 * 0.0* 0.0 * 0
* * *0.0 * 0.0 * 0.0 * 117 * * * 117 * * 65 * ** 0 * * *12 * * * 0
209337 * * 11:16:40 *CKPTINTVL *15415:0x33e2018 * 0.0 * 0.0* 0.0 * 0
* * *0.0 * 0.0 * 0.0 * 10 * * * *10 * * *0 * * * *0 * * *2 * * * *0
209338 * * 11:21:40 *CKPTINTVL *15415:0x33ed018 * 0.0 * 0.0* 0.0 * 0
* * *0.0 * 0.0 * 0.0 * 43 * * * *43 * * *35 ** * 0 * * *11 * * * 0
209339 * * 11:26:40 *CKPTINTVL *15415:0x33ef018 * 0.0 * 0.0* 0.0 * 0
* * *0.0 * 0.0 * 0.0 * 10 * * * *10 * * *2 * * * *0 * * *2 * * * *0
209340 * * 11:31:41 *CKPTINTVL *15415:0x33f1018 * 0.0 * 0.0* 0.0 * 0
* * *0.0 * 0.0 * 0.0 * 9 * * * * 9 * * * 0 * * * *0 * * *2 * * * *0
209341 * * 11:35:02 *Admin * * *15415:0x33fb2e4 * 0.0 * 0..0 * 0.0 * 1
* * *0.0 * 0.0 * 0.0 * 2 * * * * 2 * * * 16 ** * 0 * * *10 * * * 0
209342 * * 11:35:13 *User * * * 15415:0x33fe018 * 0.0 * 0..0 * 0.0 * 1
* * *0.0 * 0.0 * 0.0 * 0 * * * * 0 * * * 0 * * * *0 * * *3 * * * *0
209343 * * 11:40:40 *CKPTINTVL *15415:0x3867018 * 1.1 * 1.1* 0.0 * 0
* * *0.0 * 0.0 * 0.0 * 897 * * * 831 * * 134 * * *0 * * *1129 * * 3
209344 * * 13:23:25 *CKPTINTVL *15415:0x477779c * 5865.4 5718..6 0.0
1 * * *0.8 * 24.7 *24.7 *1818113 * 317 * * 332 * * *1 * * *4270 * * 13

209345 * * 13:23:26 *Admin * * *15415:0x4917018 * 0.2 * 0..0 * 0.0
25 * * 0.0 * 0.2 * 0.2 * 151 * * * 151 * * 28120 * *4 * * *416 * * *0

209346 * * 13:23:26 *Admin * * *15415:0x492a018 * 0.0 * 0..0 * 0.0 * 1
* * *0.0 * 0.0 * 0.0 * 15 * * * *15 * * *37 ** * 37 * * 19 * * * 19

209347 * * 13:24:11 *User * * * 15415:0x492c018 * 0.0 * 0..0 * 0.0 * 1
* * *0.0 * 0.0 * 0.0 * 0 * * * * 0 * * * 0 * * * *0 * * *2 * * * *0

Max Plog * * * Max Llog * * * Max Dskflush * Avg Dskflush* Avg Dirty
* * *Blocked
pages/sec * * *pages/sec * * *Time * * * * * pages/sec * * *pages/sec
* * *Time
768 * * * * * *368 * * * * * *5719 * * * * * 43 * * * * * * 153
* * *0

Based on the current workload, the physical log might be too small
to accommodate the time it takes to flush the buffer pool during
checkpoint processing. *The server might block transactions during
checkpoints.
If the server blocks transactions, increase the physical log size to
at least 8838144 KB.
----------

Not sure if the warning at the end is real or not.

Thanks,

Sean.

Taking a look at your onstat -g ckp output your system goes from
1000 dirty buffers per checkpoint, to the checkpoint that takes ~5800
seconds and it said you have 1.8 million dirty buffers for that
checkpoint. So it would seem like you are doing something on your
system that is doing tons of page modifications, and so at the time of
the checkpoint, those 1.8 million pages need to be flushed, which it
taking a long time. It might be an slow IO issue, or you might just
need to tune your page cleaners to do more aggressive cleaning by
changing your LRU min/max settings. Or you need to find out why your
system suddenly dirtied up 1.8 million pages if that's not supposed to
be happening.

Jacques Renaut
IBM Informix Advanced Support
APD Team
You may be onto something regarding the slow I/O. The developer has been creating these test dbspaces on an NFS volume for several months, but there was an upgrade to the server holding the NFS volume last week. Here's onstat -g iof, I'm only including the output for the main dbspace and the new one:

----------
IBM Informix Dynamic Server Version 11.50.FC6 -- On-Line (CKPT INP) -- Up 01:07:04 -- 6638456 Kbytes

AIO global files:
gfd pathname bytes read page reads bytes write page writes io/s
6 dbs1 4337414144 2117867 3074048 1501 2156.6
op type count avg. time
seeks 0 N/A
reads 221019 0.0005
writes 1028 0.0015
kaio_reads 0 N/A
kaio_writes 0 N/A

9 sm_dbspace0 8747304960 4271145 892133376 435616 17.2
op type count avg. time
seeks 0 N/A
reads 10781 0.0878
writes 353054 0.0571
kaio_reads 0 N/A
kaio_writes 0 N/A
----------

The avg. time for sm_dbspace0 is quite a bit worse than dbs1. As a shot inthe dark, we're going to create the dbspace on the same storage that dbds1is on, just to rule out the NFS volume. Here's onstat -F, but I'm not sure how to interpret the results, other than to say nothing changes if I run it over and over:

----------
IBM Informix Dynamic Server Version 11.50.FC6 -- On-Line (CKPT INP) -- Up 01:13:31 -- 6638456 Kbytes


Fg Writes LRU Writes Chunk Writes
0 545128 13310

address flusher state data # LRU Chunk Wakeups Idle Tim
15e56f880 0 L 1d 3 8 2266 2257..890
15e5700d8 1 L 9 2 5 2261 2257..565
15e570930 2 L 5 4 3 2261 2257..660
15e571188 3 C 7 2 2 2258 2257..661
15e5719e0 4 L 15 2 0 2351 2351..992
15e572238 5 L d 6 0 2386 2382..430
15e572a90 6 L 13 3 0 2317 2316..740
15e5732e8 7 L 11 3 0 2383 2382..432
15e573b40 8 L 23 9 0 2437 2430..166
15e574398 9 L 19 4 0 2328 2327..008
15e574bf0 10 L 21 3 0 2391 2390..153
15e575448 11 L 25 2 0 2392 2391..980
15e575ca0 12 L 1 3 0 2393 2392..360
15e5764f8 13 L 1f 2 0 2264 2263..785
15e576d50 14 L 3 6 0 2284 2279..906
15e5775a8 15 L 27 2 0 2316 2316..650
states: Exit Idle Chunk Lru
----------

Thanks,

Sean.

Reply With Quote
  #10  
Old   
jrenaut
 
Posts: n/a

Default Re: Problems with long checkpoints - 09-27-2011 , 09:43 PM



Quote:
Taking a look at your onstat -g ckp output your system goes from
1000 dirty buffers per checkpoint, to the checkpoint that takes ~5800
seconds and it said you have 1.8 million dirty buffers for that
checkpoint. *So it would seem like you are doing something on your
system that is doing tons of page modifications, and so at the time of
the checkpoint, those 1.8 million pages need to be flushed, which it
taking a long time. *It might be an slow IO issue, or you might just
need to tune your page cleaners to do more aggressive cleaning by
changing your LRU min/max settings. *Or you need to find out why your
system suddenly dirtied up 1.8 million pages if that's not supposed to
be happening.

Jacques Renaut
IBM Informix Advanced Support
APD Team

You may be onto something regarding the slow I/O. *The developer has been creating these test dbspaces on an NFS volume for several months, but there was an upgrade to the server holding the NFS volume last week. *Here's onstat -g iof, I'm only including the output for the main dbspace and thenew one:

----------
IBM Informix Dynamic Server Version 11.50.FC6 * * -- On-Line (CKPT INP) -- Up 01:07:04 -- 6638456 Kbytes

AIO global files:
gfd pathname * * * * bytes read * * page reads *bytes write* *page writes io/s
6 * dbs1 * * * * * * 4337414144 * * 2117867 * * 3074048 * * * *1501 * * * *2156.6
* * * * op type * * count * * * * *avg. time
* * * * seeks * * * 0 * * * * * * *N/A
* * * * reads * * * 221019 * * * * 0.0005
* * * * writes * * *1028 * * * * * 0.0015
* * * * kaio_reads *0 * * * * * * *N/A
* * * * kaio_writes 0 * * * * * * *N/A

9 * sm_dbspace0 * * *8747304960 * * 4271145 * * 892133376* * *435616 * * *17.2
* * * * op type * * count * * * * *avg. time
* * * * seeks * * * 0 * * * * * * *N/A
* * * * reads * * * 10781 * * * * *0.0878
* * * * writes * * *353054 * * * * 0.0571
* * * * kaio_reads *0 * * * * * * *N/A
* * * * kaio_writes 0 * * * * * * *N/A
----------

The avg. time for sm_dbspace0 is quite a bit worse than dbs1. *As a shot in the dark, we're going to create the dbspace on the same storage that dbds1 is on, just to rule out the NFS volume. *Here's onstat -F, but I'm not sure how to interpret the results, other than to say nothing changes if I run it over and over:

----------
IBM Informix Dynamic Server Version 11.50.FC6 * * -- On-Line (CKPT INP) -- Up 01:13:31 -- 6638456 Kbytes

Fg Writes * * LRU Writes * *Chunk Writes
0 * * * * * * 545128 * * * *13310 * * * *

address * * * * * flusher *state * *data * * # LRU * *Chunk * *Wakeups *Idle Tim
15e56f880 * * * *0 * * * *L * * * *1d * * *3 * * * *8 * * * *2266 * * 2257.890
15e5700d8 * * * *1 * * * *L * * * *9 * * * *2 * * * *5 * * * *2261 * * 2257.565
15e570930 * * * *2 * * * *L * * * *5 * * * *4 * * * *3 * * * *2261 * * 2257.660
15e571188 * * * *3 * * * *C * * * *7 * * * *2 * * * *2 * * * *2258 * * 2257.661
15e5719e0 * * * *4 * * * *L * * * *15 * * *2 * * * *0 * * * *2351 * * 2351.992
15e572238 * * * *5 * * * *L * * * *d * * * *6 * * * *0 * * * *2386 * * 2382.430
15e572a90 * * * *6 * * * *L * * * *13 * * *3 * * * *0 * * * *2317 * * 2316.740
15e5732e8 * * * *7 * * * *L * * * *11 * * *3 * * * *0 * * * *2383 * * 2382.432
15e573b40 * * * *8 * * * *L * * * *23 * * *9 * * * *0 * * * *2437 * * 2430.166
15e574398 * * * *9 * * * *L * * * *19 * * *4 * * * *0 * * * *2328 * * 2327.008
15e574bf0 * * * *10 * * * L * * * *21 * * * 3* * * *0 * * * *2391 * * 2390.153
15e575448 * * * *11 * * * L * * * *25 * * * 2* * * *0 * * * *2392 * * 2391.980
15e575ca0 * * * *12 * * * L * * * *1 * * * *3 * * * *0 * * * *2393 * * 2392.360
15e5764f8 * * * *13 * * * L * * * *1f * * * 2* * * *0 * * * *2264 * * 2263.785
15e576d50 * * * *14 * * * L * * * *3 * * * *6 * * * *0 * * * *2284 * * 2279.906
15e5775a8 * * * *15 * * * L * * * *27 * * * 2* * * *0 * * * *2316 * * 2316.650
* * * states: Exit Idle Chunk Lru
----------

Thanks,

Sean.
Ok, I was mistaken in my original post about the onstat -F. I thought
the chunk number the flusher was cleaning was in the column labeled
"Chunk", but it's actually in the column labeled data. So if the
"state" column is "C" (for chunk) it would be trying to clean the
chunk number from the "data" column. So in your above output cleaner
#3 is assigned to clean chunk #7, and it appears it's the only cleaner
that is currently assigned to write out any chunks for the
checkpoint. If the checkpoint has been going on for a long time at
this point, it might be safe if you assume that it's the only chunk
that has not finished flushing. So I think that would be the chunk
that you'd be having the IO slowness on. I'm hoping it maps up to
some chunk in the sm_dbspace0 (can find the chunk numbers in onstat -d
output) that you've mentioned is NSF mounted, and does seem to have a
fairly slow avg time for write operations and it seems to be doing a
lot of them.

Jacques Renaut
IBM Informix Advanced Support
APD Team

Reply With Quote
Reply




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2012, Jelsoft Enterprises Ltd.