* Mon losing touch with OSDs
@ 2013-02-15  3:29 Chris Dunlop
  2013-02-15  4:57 ` Sage Weil
  0 siblings, 1 reply; 25+ messages in thread
From: Chris Dunlop @ 2013-02-15  3:29 UTC (permalink / raw)
  To: ceph-devel

G'day,

In an otherwise seemingly healthy cluster (ceph 0.56.2), what might cause the
mons to lose touch with the osds?

I imagine a network glitch could cause it, but I can't see any issues in any
other system logs on any of the machines on the network.

Having (mostly?) resolved my previous "slow requests" issue
(http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/13076) at around
13:45, there were no problems until the mon lost osd.0 at 20:26 and lost osd.1
5 seconds later:

ceph-mon.b2.log:
2013-02-14 20:11:19.892060 7fa48d4f8700  0 log [INF] : pgmap v2822096: 576 pgs: 576 active+clean; 407 GB data, 835 GB used, 2889 GB / 3724 GB avail
2013-02-14 20:11:21.719513 7fa48d4f8700  0 log [INF] : pgmap v2822097: 576 pgs: 576 active+clean; 407 GB data, 835 GB used, 2889 GB / 3724 GB avail
2013-02-14 20:26:20.656162 7fa48dcf9700 -1 mon.b2@0(leader).osd e768 no osd or pg stats from osd.0 since 2013-02-14 20:11:19.720812, 900.935345 seconds ago.  marking down
2013-02-14 20:26:20.780244 7fa48d4f8700  1 mon.b2@0(leader).osd e769 e769: 2 osds: 1 up, 2 in
2013-02-14 20:26:20.837123 7fa48d4f8700  0 log [INF] : osdmap e769: 2 osds: 1 up, 2 in
2013-02-14 20:26:20.947523 7fa48d4f8700  0 log [INF] : pgmap v2822098: 576 pgs: 304 active+clean, 272 stale+active+clean; 407 GB data, 835 GB used, 2889 GB / 3724 GB avail
2013-02-14 20:26:25.709341 7fa48dcf9700 -1 mon.b2@0(leader).osd e769 no osd or pg stats from osd.1 since 2013-02-14 20:11:21.523741, 904.185596 seconds ago.  marking down
2013-02-14 20:26:25.822773 7fa48d4f8700  1 mon.b2@0(leader).osd e770 e770: 2 osds: 0 up, 2 in
2013-02-14 20:26:25.863493 7fa48d4f8700  0 log [INF] : osdmap e770: 2 osds: 0 up, 2 in
2013-02-14 20:26:25.954799 7fa48d4f8700  0 log [INF] : pgmap v2822099: 576 pgs: 576 stale+active+clean; 407 GB data, 835 GB used, 2889 GB / 3724 GB avail
2013-02-14 20:31:30.772360 7fa48dcf9700  0 log [INF] : osd.1 out (down for 304.933403)
2013-02-14 20:31:30.893521 7fa48d4f8700  1 mon.b2@0(leader).osd e771 e771: 2 osds: 0 up, 1 in
2013-02-14 20:31:30.933439 7fa48d4f8700  0 log [INF] : osdmap e771: 2 osds: 0 up, 1 in
2013-02-14 20:31:31.055408 7fa48d4f8700  0 log [INF] : pgmap v2822100: 576 pgs: 576 stale+active+clean; 407 GB data, 417 GB used, 1444 GB / 1862 GB avail
2013-02-14 20:35:05.831221 7fa48dcf9700  0 log [INF] : osd.0 out (down for 525.033581)
2013-02-14 20:35:05.989724 7fa48d4f8700  1 mon.b2@0(leader).osd e772 e772: 2 osds: 0 up, 0 in
2013-02-14 20:35:06.031409 7fa48d4f8700  0 log [INF] : osdmap e772: 2 osds: 0 up, 0 in
2013-02-14 20:35:06.129046 7fa48d4f8700  0 log [INF] : pgmap v2822101: 576 pgs: 576 stale+active+clean; 407 GB data, 0 KB used, 0 KB / 0 KB avail
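(Those "900.xxx seconds ago ... marking down" figures look like the mon's
stats-report timeout firing; if I'm reading the config docs right, the knob
involved is "mon osd report timeout", which defaults to 900 seconds. We
haven't changed it; shown here only to note the apparent match:

```ini
; ceph.conf sketch -- our guess at the setting behind the "marking down"
; messages above; 900s is the default, we have not overridden it
[mon]
    mon osd report timeout = 900
```

So the mon genuinely went a full 15 minutes without stats from either osd.)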

The other 2 mons both have messages like this in their logs, starting at around 20:12:

2013-02-14 20:12:26.534977 7f2092b86700  0 -- 10.200.63.133:6789/0 >> 10.200.63.133:6800/6466 pipe(0xade76500 sd=22 :6789 s=0 pgs=0 cs=0 l=1).accept replacing existing (lossy) channel (new one lossy=1)
2013-02-14 20:13:24.741092 7f2092d88700  0 -- 10.200.63.133:6789/0 >> 10.200.63.132:6800/2456 pipe(0x9f8b7180 sd=28 :6789 s=0 pgs=0 cs=0 l=1).accept replacing existing (lossy) channel (new one lossy=1)
2013-02-14 20:13:56.551908 7f2090560700  0 -- 10.200.63.133:6789/0 >> 10.200.63.133:6800/6466 pipe(0x9f8b6000 sd=41 :6789 s=0 pgs=0 cs=0 l=1).accept replacing existing (lossy) channel (new one lossy=1)
2013-02-14 20:14:24.752356 7f209035e700  0 -- 10.200.63.133:6789/0 >> 10.200.63.132:6800/2456 pipe(0x9f8b6500 sd=42 :6789 s=0 pgs=0 cs=0 l=1).accept replacing existing (lossy) channel (new one lossy=1)

(10.200.63.132 is mon.b4/osd.0, 10.200.63.133 is mon.b5/osd.1)

...although Greg Farnum indicates these messages are "normal":

http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/5989/focus=5993

Osd.0 doesn't show any signs of distress at all:

ceph-osd.0.log:
2013-02-14 20:00:10.280601 7ffceb012700  0 log [INF] : 2.7e scrub ok
2013-02-14 20:14:19.923490 7ffceb012700  0 log [INF] : 2.5b scrub ok
2013-02-14 20:14:50.571980 7ffceb012700  0 log [INF] : 2.7b scrub ok
2013-02-14 20:17:48.475129 7ffceb012700  0 log [INF] : 2.7d scrub ok
2013-02-14 20:28:22.601594 7ffceb012700  0 log [INF] : 2.91 scrub ok
2013-02-14 20:28:32.839278 7ffceb012700  0 log [INF] : 2.92 scrub ok
2013-02-14 20:28:46.992226 7ffceb012700  0 log [INF] : 2.93 scrub ok
2013-02-14 20:29:12.330668 7ffceb012700  0 log [INF] : 2.95 scrub ok

...although osd.1 started seeing problems around this time:

ceph-osd.1.log:
2013-02-14 20:03:11.413352 7fd1d8f0a700  0 log [INF] : 2.23 scrub ok
2013-02-14 20:26:51.601425 7fd1e6f26700  0 log [WRN] : 6 slow requests, 6 included below; oldest blocked for > 30.750063 secs
2013-02-14 20:26:51.601432 7fd1e6f26700  0 log [WRN] : slow request 30.750063 seconds old, received at 2013-02-14 20:26:20.851304: osd_op(client.9983.0:28173 xxx.rbd [watch 1~0] 2.10089424) v4 currently wait for new map
2013-02-14 20:26:51.601437 7fd1e6f26700  0 log [WRN] : slow request 30.749947 seconds old, received at 2013-02-14 20:26:20.851420: osd_op(client.10001.0:618473 yyyyyy.rbd [watch 1~0] 2.3854277a) v4 currently wait for new map
2013-02-14 20:26:51.601440 7fd1e6f26700  0 log [WRN] : slow request 30.749938 seconds old, received at 2013-02-14 20:26:20.851429: osd_op(client.9998.0:39716 zzzzzz.rbd [watch 1~0] 2.71731007) v4 currently wait for new map
2013-02-14 20:26:51.601442 7fd1e6f26700  0 log [WRN] : slow request 30.749907 seconds old, received at 2013-02-14 20:26:20.851460: osd_op(client.10007.0:59572 aaaaaa.rbd [watch 1~0] 2.320eebb8) v4 currently wait for new map
2013-02-14 20:26:51.601445 7fd1e6f26700  0 log [WRN] : slow request 30.749630 seconds old, received at 2013-02-14 20:26:20.851737: osd_op(client.9980.0:86883 bbbbbb.rbd [watch 1~0] 2.ab9b579f) v4 currently wait for new map

Perhaps the mon lost osd.1 because it was too slow, but that hadn't happened in
any of the many previous "slow requests" instances, and the timing doesn't look
quite right: the mon complains it hasn't heard from osd.0 since 20:11:19, but
the osd.0 log shows no problems at all, then the mon complains about not
having heard from osd.1 since 20:11:21, whereas the first indication of trouble
on osd.1 was the request from 20:26:20 not being processed in a timely fashion.

Not knowing enough about how the various pieces of ceph talk to each other
makes it difficult to distinguish cause and effect!

Trying to manually set the osds in (e.g. ceph osd in 0) didn't help, nor did
restarting the osds ('service ceph restart osd' on each osd host).

The immediate issue was resolved by restarting ceph completely on one of the
mon/osd hosts (service ceph restart). Possibly a restart of just the mon would
have been sufficient.

Cheers,

Chris


* Re: Mon losing touch with OSDs
  2013-02-15  3:29 Mon losing touch with OSDs Chris Dunlop
@ 2013-02-15  4:57 ` Sage Weil
  2013-02-15 22:05   ` Chris Dunlop
  0 siblings, 1 reply; 25+ messages in thread
From: Sage Weil @ 2013-02-15  4:57 UTC (permalink / raw)
  To: Chris Dunlop; +Cc: ceph-devel

Hi Chris,

On Fri, 15 Feb 2013, Chris Dunlop wrote:
> G'day,
> 
> In an otherwise seemingly healthy cluster (ceph 0.56.2), what might cause the
> mons to lose touch with the osds?
> 
> I imagine a network glitch could cause it, but I can't see any issues in any
> other system logs on any of the machines on the network.
> 
> Having (mostly?) resolved my previous "slow requests" issue
> (http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/13076) at around
> 13:45, there were no problems until the mon lost osd.0 at 20:26 and lost osd.1
> 5 seconds later:
> 
> ceph-mon.b2.log:
> 2013-02-14 20:11:19.892060 7fa48d4f8700  0 log [INF] : pgmap v2822096: 576 pgs: 576 active+clean; 407 GB data, 835 GB used, 2889 GB / 3724 GB avail
> 2013-02-14 20:11:21.719513 7fa48d4f8700  0 log [INF] : pgmap v2822097: 576 pgs: 576 active+clean; 407 GB data, 835 GB used, 2889 GB / 3724 GB avail
> 2013-02-14 20:26:20.656162 7fa48dcf9700 -1 mon.b2@0(leader).osd e768 no osd or pg stats from osd.0 since 2013-02-14 20:11:19.720812, 900.935345 seconds ago.  marking down

There is a safety check: if the osd doesn't check in for a long period 
of time, we assume it is dead.  But it seems as though that shouldn't 
happen, since osd.0 has some PGs assigned and is scrubbing away.

Can you enable 'debug ms = 1' on the mons and leave them that way, in the 
hopes that this happens again?  It will give us more information to go on.

> ...although osd.1 started seeing problems around this time:
> 
> ceph-osd.1.log:
> 2013-02-14 20:03:11.413352 7fd1d8f0a700  0 log [INF] : 2.23 scrub ok
> 2013-02-14 20:26:51.601425 7fd1e6f26700  0 log [WRN] : 6 slow requests, 6 included below; oldest blocked for > 30.750063 secs
> 2013-02-14 20:26:51.601432 7fd1e6f26700  0 log [WRN] : slow request 30.750063 seconds old, received at 2013-02-14 20:26:20.851304: osd_op(client.9983.0:28173 xxx.rbd [watch 1~0] 2.10089424) v4 currently wait for new map
> 2013-02-14 20:26:51.601437 7fd1e6f26700  0 log [WRN] : slow request 30.749947 seconds old, received at 2013-02-14 20:26:20.851420: osd_op(client.10001.0:618473 yyyyyy.rbd [watch 1~0] 2.3854277a) v4 currently wait for new map
> 2013-02-14 20:26:51.601440 7fd1e6f26700  0 log [WRN] : slow request 30.749938 seconds old, received at 2013-02-14 20:26:20.851429: osd_op(client.9998.0:39716 zzzzzz.rbd [watch 1~0] 2.71731007) v4 currently wait for new map
> 2013-02-14 20:26:51.601442 7fd1e6f26700  0 log [WRN] : slow request 30.749907 seconds old, received at 2013-02-14 20:26:20.851460: osd_op(client.10007.0:59572 aaaaaa.rbd [watch 1~0] 2.320eebb8) v4 currently wait for new map
> 2013-02-14 20:26:51.601445 7fd1e6f26700  0 log [WRN] : slow request 30.749630 seconds old, received at 2013-02-14 20:26:20.851737: osd_op(client.9980.0:86883 bbbbbb.rbd [watch 1~0] 2.ab9b579f) v4 currently wait for new map
> 
> Perhaps the mon lost osd.1 because it was too slow, but that hadn't happened in
> any of the many previous "slow requests" instances, and the timing doesn't look
> quite right: the mon complains it hasn't heard from osd.0 since 20:11:19, but
> the osd.0 log shows no problems at all, then the mon complains about not
> having heard from osd.1 since 20:11:21, whereas the first indication of trouble
> on osd.1 was the request from 20:26:20 not being processed in a timely fashion.

My guess is the above was a side-effect of osd.0 being marked out.   On 
0.56.2 there is some strange peering workqueue laggyness that could 
potentially contribute as well.  I recommend moving to 0.56.3.

> Not knowing enough about how the various pieces of ceph talk to each other
> makes it difficult to distinguish cause and effect!
> 
> Trying to manually set the osds in (e.g. ceph osd in 0) didn't help, nor did
> restarting the osds ('service ceph restart osd' on each osd host).
> 
> The immediate issue was resolved by restarting ceph completely on one of the
> mon/osd hosts (service ceph restart). Possibly a restart of just the mon would
> have been sufficient.

Did you notice that the osds you restarted didn't immediately mark 
themselves in?  Again, it could be explained by the peering wq issue, 
especially if there are pools in your cluster that are not getting any IO.

sage


* Re: Mon losing touch with OSDs
  2013-02-15  4:57 ` Sage Weil
@ 2013-02-15 22:05   ` Chris Dunlop
  2013-02-17 23:41     ` Chris Dunlop
  0 siblings, 1 reply; 25+ messages in thread
From: Chris Dunlop @ 2013-02-15 22:05 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

G'day Sage,

On Thu, Feb 14, 2013 at 08:57:11PM -0800, Sage Weil wrote:
> On Fri, 15 Feb 2013, Chris Dunlop wrote:
>> In an otherwise seemingly healthy cluster (ceph 0.56.2), what might cause the
>> mons to lose touch with the osds?
> 
> Can you enable 'debug ms = 1' on the mons and leave them that way, in the 
> hopes that this happens again?  It will give us more information to go on.

Debug turned on.
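
For the record, done via ceph.conf on each mon host (assuming the usual
section syntax applies here) and a mon restart:

```ini
[mon]
    debug ms = 1
```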

>> Perhaps the mon lost osd.1 because it was too slow, but that hadn't happened in
>> any of the many previous "slow requests" instances, and the timing doesn't look
>> quite right: the mon complains it hasn't heard from osd.0 since 20:11:19, but
>> the osd.0 log shows no problems at all, then the mon complains about not
>> having heard from osd.1 since 20:11:21, whereas the first indication of trouble
>> on osd.1 was the request from 20:26:20 not being processed in a timely fashion.
> 
> My guess is the above was a side-effect of osd.0 being marked out.   On 
> 0.56.2 there is some strange peering workqueue laggyness that could 
> potentially contribute as well.  I recommend moving to 0.56.3.

Upgraded to 0.56.3.

>> Trying to manually set the osds in (e.g. ceph osd in 0) didn't help, nor did
>> restarting the osds ('service ceph restart osd' on each osd host).
>> 
>> The immediate issue was resolved by restarting ceph completely on one of the
>> mon/osd hosts (service ceph restart). Possibly a restart of just the mon would
>> have been sufficient.
> 
> Did you notice that the osds you restarted didn't immediately mark 
> themselves in?  Again, it could be explained by the peering wq issue, 
> especially if there are pools in your cluster that are not getting any IO.

Sorry, no. I was kicking myself later for losing the 'ceph -s' output 
when I killed that terminal session but in the heat of the moment...

I can't see anything about the osds marking themselves in in the logs from the
time (with no debugging), but I'm on my ipad at the moment so I could easily
have missed it. Should that info be in the logs somewhere?

There's certainly unused pools: we're only using the rbd pool and so the
default data and metadata pools are unused.

Thanks for your attention!

Cheers,

Chris


* Re: Mon losing touch with OSDs
  2013-02-15 22:05   ` Chris Dunlop
@ 2013-02-17 23:41     ` Chris Dunlop
  2013-02-18  1:44       ` Sage Weil
  0 siblings, 1 reply; 25+ messages in thread
From: Chris Dunlop @ 2013-02-17 23:41 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

G'day Sage,

On Sat, Feb 16, 2013 at 09:05:21AM +1100, Chris Dunlop wrote:
> On Thu, Feb 14, 2013 at 08:57:11PM -0800, Sage Weil wrote:
>> On Fri, 15 Feb 2013, Chris Dunlop wrote:
>>> In an otherwise seemingly healthy cluster (ceph 0.56.2), what might cause the
>>> mons to lose touch with the osds?
>> 
>> Can you enable 'debug ms = 1' on the mons and leave them that way, in the 
>> hopes that this happens again?  It will give us more information to go on.
> 
> Debug turned on.

We haven't experienced the cluster losing touch with the osds completely
since upgrading from 0.56.2 to 0.56.3, but we did lose touch with osd.1
for a few seconds before it recovered. See below for logs (reminder: 3
boxes, b2 is mon-only, b4 is mon+osd.0, b5 is mon+osd.1).

The osd.1 drop was associated with a bit of a write iops spike on the osd
disks (logs below, "w/s" column), although the logs also show plenty of
other similar spikes that haven't led to a drop.  ...oh, a closer look at
the timestamps shows the spike actually came after the drop, so it wasn't
the spike that caused the drop.
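
Doing the arithmetic on the timestamps from the logs below (osd.0's last
heartbeat reply from osd.1, the mon marking osd.1 failed, and osd.1's boot in
e787), the whole episode is short; a quick sanity check:

```python
from datetime import datetime

fmt = "%Y-%m-%d %H:%M:%S.%f"
# Timestamps copied from the ceph-osd.0 and ceph-mon.b2 logs below.
last_reply  = datetime.strptime("2013-02-17 06:04:30.768700", fmt)  # last heartbeat reply seen by osd.0
marked_down = datetime.strptime("2013-02-17 06:05:02.567987", fmt)  # mon logs "osd.1 ... failed"
back_up     = datetime.strptime("2013-02-17 06:05:05.921303", fmt)  # "osd.1 ... boot" in osdmap e787

print((marked_down - last_reply).total_seconds())  # ~31.8s of missed heartbeats before the mark-down
print((back_up - marked_down).total_seconds())     # ~3.4s actually marked down
```

So osd.1 stopped answering heartbeats for ~32 seconds (just past the 20s
grace) and was back up about 3 seconds after being marked down.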

Cheers,

Chris

----------------------------------------------------------------------
ceph-osd.0.log
----------------------------------------------------------------------
2013-02-17 05:50:58.841310 7f108cf1b700  0 log [INF] : 2.44 scrub ok
2013-02-17 06:03:54.406730 7f108cf1b700  0 log [INF] : 2.51 scrub ok
2013-02-17 06:04:51.560283 7f109af37700 -1 osd.0 784 heartbeat_check: no reply from osd.1 since 2013-02-17 06:04:30.768700 (cutoff 2013-02-17 06:04:31.560283)
2013-02-17 06:04:51.769792 7f108bf19700 -1 osd.0 784 heartbeat_check: no reply from osd.1 since 2013-02-17 06:04:30.768700 (cutoff 2013-02-17 06:04:31.769792)
2013-02-17 06:04:52.565376 7f109af37700 -1 osd.0 784 heartbeat_check: no reply from osd.1 since 2013-02-17 06:04:30.768700 (cutoff 2013-02-17 06:04:32.565376)
2013-02-17 06:04:53.565629 7f109af37700 -1 osd.0 784 heartbeat_check: no reply from osd.1 since 2013-02-17 06:04:30.768700 (cutoff 2013-02-17 06:04:33.565628)
2013-02-17 06:04:54.565813 7f109af37700 -1 osd.0 784 heartbeat_check: no reply from osd.1 since 2013-02-17 06:04:30.768700 (cutoff 2013-02-17 06:04:34.565812)
2013-02-17 06:04:55.565906 7f109af37700 -1 osd.0 784 heartbeat_check: no reply from osd.1 since 2013-02-17 06:04:30.768700 (cutoff 2013-02-17 06:04:35.565905)
2013-02-17 06:04:55.870011 7f108bf19700 -1 osd.0 784 heartbeat_check: no reply from osd.1 since 2013-02-17 06:04:30.768700 (cutoff 2013-02-17 06:04:35.870011)
2013-02-17 06:04:56.566030 7f109af37700 -1 osd.0 784 heartbeat_check: no reply from osd.1 since 2013-02-17 06:04:30.768700 (cutoff 2013-02-17 06:04:36.566029)
2013-02-17 06:04:57.566227 7f109af37700 -1 osd.0 784 heartbeat_check: no reply from osd.1 since 2013-02-17 06:04:30.768700 (cutoff 2013-02-17 06:04:37.566227)
2013-02-17 06:04:57.570184 7f108bf19700 -1 osd.0 784 heartbeat_check: no reply from osd.1 since 2013-02-17 06:04:30.768700 (cutoff 2013-02-17 06:04:37.570184)
2013-02-17 06:04:58.070400 7f108bf19700 -1 osd.0 784 heartbeat_check: no reply from osd.1 since 2013-02-17 06:04:30.768700 (cutoff 2013-02-17 06:04:38.070399)
2013-02-17 06:04:58.566489 7f109af37700 -1 osd.0 784 heartbeat_check: no reply from osd.1 since 2013-02-17 06:04:30.768700 (cutoff 2013-02-17 06:04:38.566489)
2013-02-17 06:04:59.566631 7f109af37700 -1 osd.0 784 heartbeat_check: no reply from osd.1 since 2013-02-17 06:04:30.768700 (cutoff 2013-02-17 06:04:39.566630)
2013-02-17 06:05:00.566728 7f109af37700 -1 osd.0 784 heartbeat_check: no reply from osd.1 since 2013-02-17 06:04:30.768700 (cutoff 2013-02-17 06:04:40.566728)
2013-02-17 06:05:01.566848 7f109af37700 -1 osd.0 784 heartbeat_check: no reply from osd.1 since 2013-02-17 06:04:30.768700 (cutoff 2013-02-17 06:04:41.566847)
2013-02-17 06:05:02.170643 7f108bf19700 -1 osd.0 784 heartbeat_check: no reply from osd.1 since 2013-02-17 06:04:30.768700 (cutoff 2013-02-17 06:04:42.170643)
2013-02-17 06:05:02.566961 7f109af37700 -1 osd.0 784 heartbeat_check: no reply from osd.1 since 2013-02-17 06:04:30.768700 (cutoff 2013-02-17 06:04:42.566960)
2013-02-17 06:05:04.880523 7f108a514700  0 -- 192.168.254.132:6802/18444 >> 192.168.254.133:6800/21178 pipe(0xac42a00 sd=31 :6802 s=2 pgs=19 cs=3 l=0).fault, initiating reconnect
2013-02-17 06:05:04.880977 7f108a615700  0 -- 192.168.254.132:6802/18444 >> 192.168.254.133:6800/21178 pipe(0xac42a00 sd=31 :6802 s=1 pgs=19 cs=4 l=0).fault
2013-02-17 06:18:52.354800 7f108cf1b700  0 log [INF] : 2.4e scrub ok
2013-02-17 06:22:12.410074 7f108cf1b700  0 log [INF] : 2.53 scrub ok

----------------------------------------------------------------------
ceph-osd.1.log
----------------------------------------------------------------------
2013-02-17 06:00:25.752991 7f5647f2c700  0 log [INF] : 2.a6 scrub ok
2013-02-17 06:01:59.282661 7f5647f2c700  0 log [INF] : 2.b0 scrub ok
2013-02-17 06:05:02.873412 7f5645525700  0 -- 192.168.254.133:6800/21178 >> 192.168.254.132:6802/18444 pipe(0x1e50c80 sd=38 :6800 s=2 pgs=1 cs=1 l=0).fault, initiating reconnect
2013-02-17 06:05:02.873463 7f5645323700  0 -- 192.168.254.133:6800/21178 >> 192.168.254.132:6802/18444 pipe(0x1e50c80 sd=38 :6800 s=1 pgs=1 cs=2 l=0).fault
2013-02-17 06:05:04.541062 7f5645525700  0 -- 192.168.254.133:6800/21178 >> 192.168.254.132:6802/18444 pipe(0x1e50c80 sd=31 :45391 s=2 pgs=2 cs=3 l=0).reader got old message 1 <= 2344847 0xa662c00 osd_map(785..786 src has 541..786) v3, discarding
2013-02-17 06:05:04.541113 7f5645525700  0 -- 192.168.254.133:6800/21178 >> 192.168.254.132:6802/18444 pipe(0x1e50c80 sd=31 :45391 s=2 pgs=2 cs=3 l=0).reader got old message 2 <= 2344847 0xa662c00 osd_map(785..786 src has 541..786) v3, discarding
2013-02-17 06:05:04.880116 7f564df38700  0 log [WRN] : map e786 wrongly marked me down
2013-02-17 06:19:13.397843 7f5647f2c700  0 log [INF] : 2.aa scrub ok
2013-02-17 06:21:05.506977 7f5647f2c700  0 log [INF] : 2.ba scrub ok

----------------------------------------------------------------------
ceph.log
----------------------------------------------------------------------
2013-02-17 06:04:45.031719 mon.0 10.200.63.130:6789/0 19956 : [INF] pgmap v2900128: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 407 GB data, 836 GB used, 2888 GB / 3724 GB avail
2013-02-17 06:04:47.732814 mon.0 10.200.63.130:6789/0 19957 : [INF] pgmap v2900129: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 407 GB data, 836 GB used, 2888 GB / 3724 GB avail
2013-02-17 06:04:50.046404 mon.0 10.200.63.130:6789/0 19958 : [INF] pgmap v2900130: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 407 GB data, 836 GB used, 2888 GB / 3724 GB avail
2013-02-17 06:04:52.579862 mon.0 10.200.63.130:6789/0 19959 : [DBG] osd.1 10.200.63.133:6801/21178 reported failed by osd.0 10.200.63.132:6801/18444
2013-02-17 06:04:52.812732 mon.0 10.200.63.130:6789/0 19960 : [INF] pgmap v2900131: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 407 GB data, 836 GB used, 2888 GB / 3724 GB avail
2013-02-17 06:04:55.026841 mon.0 10.200.63.130:6789/0 19961 : [INF] pgmap v2900132: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 407 GB data, 836 GB used, 2888 GB / 3724 GB avail
2013-02-17 06:04:57.567496 mon.0 10.200.63.130:6789/0 19962 : [DBG] osd.1 10.200.63.133:6801/21178 reported failed by osd.0 10.200.63.132:6801/18444
2013-02-17 06:04:57.773216 mon.0 10.200.63.130:6789/0 19963 : [INF] pgmap v2900133: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 407 GB data, 836 GB used, 2888 GB / 3724 GB avail
2013-02-17 06:05:00.043065 mon.0 10.200.63.130:6789/0 19964 : [INF] pgmap v2900134: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 407 GB data, 836 GB used, 2888 GB / 3724 GB avail
2013-02-17 06:05:02.567938 mon.0 10.200.63.130:6789/0 19965 : [DBG] osd.1 10.200.63.133:6801/21178 reported failed by osd.0 10.200.63.132:6801/18444
2013-02-17 06:05:02.567989 mon.0 10.200.63.130:6789/0 19966 : [INF] osd.1 10.200.63.133:6801/21178 failed (3 reports from 1 peers after 2013-02-17 06:05:23.567928 >= grace 20.000021)
2013-02-17 06:05:02.787622 mon.0 10.200.63.130:6789/0 19967 : [INF] osdmap e785: 2 osds: 1 up, 2 in
2013-02-17 06:05:02.891325 mon.0 10.200.63.130:6789/0 19968 : [INF] pgmap v2900135: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 407 GB data, 836 GB used, 2888 GB / 3724 GB avail
2013-02-17 06:05:03.355214 mon.0 10.200.63.130:6789/0 19969 : [INF] pgmap v2900136: 576 pgs: 271 active+clean, 304 stale+active+clean, 1 active+clean+scrubbing; 407 GB data, 836 GB used, 2888 GB / 3724 GB avail
2013-02-17 06:05:03.884400 mon.0 10.200.63.130:6789/0 19970 : [INF] osdmap e786: 2 osds: 1 up, 2 in
2013-02-17 06:05:04.057756 mon.0 10.200.63.130:6789/0 19971 : [INF] pgmap v2900137: 576 pgs: 271 active+clean, 304 stale+active+clean, 1 active+clean+scrubbing; 407 GB data, 836 GB used, 2888 GB / 3724 GB avail
2013-02-17 06:05:05.921247 mon.0 10.200.63.130:6789/0 19972 : [INF] osdmap e787: 2 osds: 2 up, 2 in
2013-02-17 06:05:05.921306 mon.0 10.200.63.130:6789/0 19973 : [INF] osd.1 10.200.63.133:6801/21178 boot
2013-02-17 06:05:06.022361 mon.0 10.200.63.130:6789/0 19974 : [INF] pgmap v2900138: 576 pgs: 271 active+clean, 304 stale+active+clean, 1 active+clean+scrubbing; 407 GB data, 836 GB used, 2888 GB / 3724 GB avail
2013-02-17 06:05:06.983262 mon.0 10.200.63.130:6789/0 19975 : [INF] osdmap e788: 2 osds: 2 up, 2 in
2013-02-17 06:05:07.103855 mon.0 10.200.63.130:6789/0 19976 : [INF] pgmap v2900139: 576 pgs: 271 active+clean, 304 stale+active+clean, 1 active+clean+scrubbing; 407 GB data, 836 GB used, 2888 GB / 3724 GB avail
2013-02-17 06:05:08.085143 mon.0 10.200.63.130:6789/0 19977 : [INF] osdmap e789: 2 osds: 2 up, 2 in
2013-02-17 06:05:08.201700 mon.0 10.200.63.130:6789/0 19978 : [INF] pgmap v2900140: 576 pgs: 271 active+clean, 304 stale+active+clean, 1 active+clean+scrubbing; 407 GB data, 836 GB used, 2888 GB / 3724 GB avail
2013-02-17 06:05:12.100060 mon.0 10.200.63.130:6789/0 19979 : [INF] pgmap v2900141: 576 pgs: 259 active, 271 active+clean, 45 peering, 1 active+clean+scrubbing; 407 GB data, 836 GB used, 2888 GB / 3724 GB avail
2013-02-17 06:05:13.196692 mon.0 10.200.63.130:6789/0 19980 : [INF] pgmap v2900142: 576 pgs: 467 active, 109 peering; 407 GB data, 835 GB used, 2888 GB / 3724 GB avail
2013-02-17 06:05:04.880125 osd.1 10.200.63.133:6801/21178 292 : [WRN] map e786 wrongly marked me down
2013-02-17 06:05:17.088685 mon.0 10.200.63.130:6789/0 19981 : [INF] pgmap v2900143: 576 pgs: 479 active, 32 active+clean, 65 peering; 407 GB data, 835 GB used, 2888 GB / 3724 GB avail
2013-02-17 06:05:18.229214 mon.0 10.200.63.130:6789/0 19982 : [INF] pgmap v2900144: 576 pgs: 469 active, 105 active+clean, 2 peering; 407 GB data, 835 GB used, 2888 GB / 3724 GB avail
2013-02-17 06:05:22.702406 mon.0 10.200.63.130:6789/0 19983 : [INF] pgmap v2900145: 576 pgs: 198 active, 376 active+clean, 1 peering, 1 active+recovering; 407 GB data, 835 GB used, 2888 GB / 3724 GB avail
2013-02-17 06:05:23.795151 mon.0 10.200.63.130:6789/0 19984 : [INF] pgmap v2900146: 576 pgs: 574 active+clean, 1 active+recovery_wait, 1 active+recovering; 407 GB data, 835 GB used, 2889 GB / 3724 GB avail; 1/211684 degraded (0.000%)
2013-02-17 06:05:27.689766 mon.0 10.200.63.130:6789/0 19985 : [INF] pgmap v2900147: 576 pgs: 575 active+clean, 1 active+recovery_wait; 407 GB data, 835 GB used, 2889 GB / 3724 GB avail; 1/211684 degraded (0.000%)
2013-02-17 06:05:28.798006 mon.0 10.200.63.130:6789/0 19986 : [INF] pgmap v2900148: 576 pgs: 576 active+clean; 407 GB data, 835 GB used, 2889 GB / 3724 GB avail
2013-02-17 06:05:32.688719 mon.0 10.200.63.130:6789/0 19987 : [INF] pgmap v2900149: 576 pgs: 576 active+clean; 407 GB data, 835 GB used, 2889 GB / 3724 GB avail
2013-02-17 06:05:33.764091 mon.0 10.200.63.130:6789/0 19988 : [INF] pgmap v2900150: 576 pgs: 576 active+clean; 407 GB data, 835 GB used, 2889 GB / 3724 GB avail

----------------------------------------------------------------------
ceph-mon.b2.log
----------------------------------------------------------------------
2013-02-17 06:04:40.032792 7fb315ca2700  0 log [INF] : pgmap v2900126: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 407 GB data, 836 GB used, 2888 GB / 3724 GB avail
2013-02-17 06:04:42.733647 7fb315ca2700  0 log [INF] : pgmap v2900127: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 407 GB data, 836 GB used, 2888 GB / 3724 GB avail
2013-02-17 06:04:45.031710 7fb315ca2700  0 log [INF] : pgmap v2900128: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 407 GB data, 836 GB used, 2888 GB / 3724 GB avail
2013-02-17 06:04:47.732805 7fb315ca2700  0 log [INF] : pgmap v2900129: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 407 GB data, 836 GB used, 2888 GB / 3724 GB avail
2013-02-17 06:04:50.046400 7fb315ca2700  0 log [INF] : pgmap v2900130: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 407 GB data, 836 GB used, 2888 GB / 3724 GB avail
2013-02-17 06:04:52.577640 7fb315ca2700  1 mon.b2@0(leader).osd e784 prepare_failure osd.1 10.200.63.133:6801/21178 from osd.0 10.200.63.132:6801/18444 is reporting failure:1
2013-02-17 06:04:52.579842 7fb315ca2700  0 log [DBG] : osd.1 10.200.63.133:6801/21178 reported failed by osd.0 10.200.63.132:6801/18444
2013-02-17 06:04:52.812722 7fb315ca2700  0 log [INF] : pgmap v2900131: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 407 GB data, 836 GB used, 2888 GB / 3724 GB avail
2013-02-17 06:04:55.026832 7fb315ca2700  0 log [INF] : pgmap v2900132: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 407 GB data, 836 GB used, 2888 GB / 3724 GB avail
2013-02-17 06:04:57.567460 7fb315ca2700  1 mon.b2@0(leader).osd e784 prepare_failure osd.1 10.200.63.133:6801/21178 from osd.0 10.200.63.132:6801/18444 is reporting failure:1
2013-02-17 06:04:57.567493 7fb315ca2700  0 log [DBG] : osd.1 10.200.63.133:6801/21178 reported failed by osd.0 10.200.63.132:6801/18444
2013-02-17 06:04:57.773210 7fb315ca2700  0 log [INF] : pgmap v2900133: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 407 GB data, 836 GB used, 2888 GB / 3724 GB avail
2013-02-17 06:05:00.043056 7fb315ca2700  0 log [INF] : pgmap v2900134: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 407 GB data, 836 GB used, 2888 GB / 3724 GB avail
2013-02-17 06:05:02.567921 7fb315ca2700  1 mon.b2@0(leader).osd e784 prepare_failure osd.1 10.200.63.133:6801/21178 from osd.0 10.200.63.132:6801/18444 is reporting failure:1
2013-02-17 06:05:02.567937 7fb315ca2700  0 log [DBG] : osd.1 10.200.63.133:6801/21178 reported failed by osd.0 10.200.63.132:6801/18444
2013-02-17 06:05:02.567974 7fb315ca2700  1 mon.b2@0(leader).osd e784  we have enough reports/reporters to mark osd.1 down
2013-02-17 06:05:02.567987 7fb315ca2700  0 log [INF] : osd.1 10.200.63.133:6801/21178 failed (3 reports from 1 peers after 2013-02-17 06:05:23.567928 >= grace 20.000021)
2013-02-17 06:05:02.772787 7fb315ca2700  1 mon.b2@0(leader).osd e785 e785: 2 osds: 1 up, 2 in
2013-02-17 06:05:02.787619 7fb315ca2700  0 log [INF] : osdmap e785: 2 osds: 1 up, 2 in
2013-02-17 06:05:02.891321 7fb315ca2700  0 log [INF] : pgmap v2900135: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 407 GB data, 836 GB used, 2888 GB / 3724 GB avail
2013-02-17 06:05:03.355205 7fb315ca2700  0 log [INF] : pgmap v2900136: 576 pgs: 271 active+clean, 304 stale+active+clean, 1 active+clean+scrubbing; 407 GB data, 836 GB used, 2888 GB / 3724 GB avail
2013-02-17 06:05:03.847394 7fb315ca2700  1 mon.b2@0(leader).osd e786 e786: 2 osds: 1 up, 2 in
2013-02-17 06:05:03.884395 7fb315ca2700  0 log [INF] : osdmap e786: 2 osds: 1 up, 2 in
2013-02-17 06:05:04.057750 7fb315ca2700  0 log [INF] : pgmap v2900137: 576 pgs: 271 active+clean, 304 stale+active+clean, 1 active+clean+scrubbing; 407 GB data, 836 GB used, 2888 GB / 3724 GB avail
2013-02-17 06:05:04.869787 7fb315ca2700  1 mon.b2@0(leader).pg v2900137  ignoring stats from non-active osd.
2013-02-17 06:05:05.884371 7fb315ca2700  1 mon.b2@0(leader).osd e787 e787: 2 osds: 2 up, 2 in
2013-02-17 06:05:05.921244 7fb315ca2700  0 log [INF] : osdmap e787: 2 osds: 2 up, 2 in
2013-02-17 06:05:05.921303 7fb315ca2700  0 log [INF] : osd.1 10.200.63.133:6801/21178 boot
2013-02-17 06:05:06.022350 7fb315ca2700  0 log [INF] : pgmap v2900138: 576 pgs: 271 active+clean, 304 stale+active+clean, 1 active+clean+scrubbing; 407 GB data, 836 GB used, 2888 GB / 3724 GB avail
2013-02-17 06:05:06.946150 7fb315ca2700  1 mon.b2@0(leader).osd e788 e788: 2 osds: 2 up, 2 in
2013-02-17 06:05:06.983256 7fb315ca2700  0 log [INF] : osdmap e788: 2 osds: 2 up, 2 in
2013-02-17 06:05:07.103846 7fb315ca2700  0 log [INF] : pgmap v2900139: 576 pgs: 271 active+clean, 304 stale+active+clean, 1 active+clean+scrubbing; 407 GB data, 836 GB used, 2888 GB / 3724 GB avail
2013-02-17 06:05:08.048069 7fb315ca2700  1 mon.b2@0(leader).osd e789 e789: 2 osds: 2 up, 2 in
2013-02-17 06:05:08.085140 7fb315ca2700  0 log [INF] : osdmap e789: 2 osds: 2 up, 2 in
2013-02-17 06:05:08.201692 7fb315ca2700  0 log [INF] : pgmap v2900140: 576 pgs: 271 active+clean, 304 stale+active+clean, 1 active+clean+scrubbing; 407 GB data, 836 GB used, 2888 GB / 3724 GB avail
2013-02-17 06:05:12.100055 7fb315ca2700  0 log [INF] : pgmap v2900141: 576 pgs: 259 active, 271 active+clean, 45 peering, 1 active+clean+scrubbing; 407 GB data, 836 GB used, 2888 GB / 3724 GB avail
2013-02-17 06:05:13.196685 7fb315ca2700  0 log [INF] : pgmap v2900142: 576 pgs: 467 active, 109 peering; 407 GB data, 835 GB used, 2888 GB / 3724 GB avail
2013-02-17 06:05:17.088677 7fb315ca2700  0 log [INF] : pgmap v2900143: 576 pgs: 479 active, 32 active+clean, 65 peering; 407 GB data, 835 GB used, 2888 GB / 3724 GB avail
2013-02-17 06:05:18.229204 7fb315ca2700  0 log [INF] : pgmap v2900144: 576 pgs: 469 active, 105 active+clean, 2 peering; 407 GB data, 835 GB used, 2888 GB / 3724 GB avail
2013-02-17 06:05:22.702400 7fb315ca2700  0 log [INF] : pgmap v2900145: 576 pgs: 198 active, 376 active+clean, 1 peering, 1 active+recovering; 407 GB data, 835 GB used, 2888 GB / 3724 GB avail
2013-02-17 06:05:23.795142 7fb315ca2700  0 log [INF] : pgmap v2900146: 576 pgs: 574 active+clean, 1 active+recovery_wait, 1 active+recovering; 407 GB data, 835 GB used, 2889 GB / 3724 GB avail; 1/211684 degraded (0.000%)
2013-02-17 06:05:27.689761 7fb315ca2700  0 log [INF] : pgmap v2900147: 576 pgs: 575 active+clean, 1 active+recovery_wait; 407 GB data, 835 GB used, 2889 GB / 3724 GB avail; 1/211684 degraded (0.000%)
2013-02-17 06:05:28.797998 7fb315ca2700  0 log [INF] : pgmap v2900148: 576 pgs: 576 active+clean; 407 GB data, 835 GB used, 2889 GB / 3724 GB avail
2013-02-17 06:05:32.688713 7fb315ca2700  0 log [INF] : pgmap v2900149: 576 pgs: 576 active+clean; 407 GB data, 835 GB used, 2889 GB / 3724 GB avail
2013-02-17 06:05:33.764083 7fb315ca2700  0 log [INF] : pgmap v2900150: 576 pgs: 576 active+clean; 407 GB data, 835 GB used, 2889 GB / 3724 GB avail

----------------------------------------------------------------------
ceph-mon.b4.log
----------------------------------------------------------------------
2013-02-17 06:05:01.197587 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197440 ==== paxos(auth lease lc 3461 fc 3441 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (3474466732 0 0) 0x68998c0 con 0x2d189a0
2013-02-17 06:05:01.197626 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(auth lease_ack lc 3461 fc 3441 pn 0 opn 0 gv {}) v2 -- ?+0 0x1c636000
2013-02-17 06:05:01.560527 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== client.9965 10.200.63.132:0/2024602 550 ==== mon_subscribe({monmap=10+,osdmap=785}) v2 ==== 42+0+0 (601251667 0 0) 0x4fe9dc0 con 0x2ed8580
2013-02-17 06:05:01.560568 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.9965 10.200.63.132:0/2024602 -- mon_subscribe_ack(300s) v1 -- ?+0 0xccd71e0
2013-02-17 06:05:02.130295 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197441 ==== paxos(osdmap lease lc 784 fc 541 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (2329396769 0 0) 0x1c636000 con 0x2d189a0
2013-02-17 06:05:02.130339 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(osdmap lease_ack lc 784 fc 541 pn 0 opn 0 gv {}) v2 -- ?+0 0x68998c0
2013-02-17 06:05:02.130384 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197442 ==== paxos(mdsmap lease lc 1 fc 1 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (3718949626 0 0) 0x6899600 con 0x2d189a0
2013-02-17 06:05:02.130406 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(mdsmap lease_ack lc 1 fc 1 pn 0 opn 0 gv {}) v2 -- ?+0 0x1c636000
2013-02-17 06:05:02.130475 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197443 ==== paxos(monmap lease lc 9 fc 1 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (3086430848 0 0) 0x69698c0 con 0x2d189a0
2013-02-17 06:05:02.130488 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(monmap lease_ack lc 9 fc 1 pn 0 opn 0 gv {}) v2 -- ?+0 0x6899600
2013-02-17 06:05:02.684455 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197444 ==== paxos(osdmap begin lc 784 fc 0 pn 40700 opn 0 gv {785=5833342}) v2 ==== 245+0+0 (1216715648 0 0) 0x6899600 con 0x2d189a0
2013-02-17 06:05:02.750106 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(osdmap accept lc 784 fc 0 pn 40700 opn 0 gv {}) v2 -- ?+0 0x69698c0
2013-02-17 06:05:02.750141 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197445 ==== paxos(logm begin lc 2928952 fc 0 pn 40700 opn 0 gv {2928953=5833343}) v2 ==== 370+0+0 (2110251190 0 0) 0x1c636000 con 0x2d189a0
2013-02-17 06:05:02.778810 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(logm accept lc 2928952 fc 0 pn 40700 opn 0 gv {}) v2 -- ?+0 0x6899600
2013-02-17 06:05:02.778848 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197446 ==== paxos(pgmap begin lc 2900134 fc 0 pn 40700 opn 0 gv {2900135=5833344}) v2 ==== 5451+0+0 (3356992033 0 0) 0x68998c0 con 0x2d189a0
2013-02-17 06:05:02.804185 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(pgmap accept lc 2900134 fc 0 pn 40700 opn 0 gv {}) v2 -- ?+0 0x1c636000
2013-02-17 06:05:02.804242 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197447 ==== paxos(osdmap commit lc 785 fc 0 pn 40700 opn 0 gv {785=5833342}) v2 ==== 245+0+0 (2445046887 0 0) 0x699edc0 con 0x2d189a0
2013-02-17 06:05:02.853386 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197448 ==== paxos(osdmap lease lc 785 fc 541 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (3283911110 0 0) 0x69698c0 con 0x2d189a0
2013-02-17 06:05:02.853414 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(osdmap lease_ack lc 785 fc 541 pn 0 opn 0 gv {}) v2 -- ?+0 0x699edc0
2013-02-17 06:05:02.869698 7f2879ad7700  1 mon.b4@1(peon).osd e785 e785: 2 osds: 1 up, 2 in
2013-02-17 06:05:02.889915 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.9962 10.200.63.132:0/1024602 -- osd_map(785..785 src has 541..785) v3 -- ?+0 0x1c671e00
2013-02-17 06:05:02.889945 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.9793 10.200.63.133:0/1028590 -- osd_map(785..785 src has 541..785) v3 -- ?+0 0x6895200
2013-02-17 06:05:02.889966 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.9998 10.200.63.132:0/1026778 -- osd_map(785..785 src has 541..785) v3 -- ?+0 0x6895400
2013-02-17 06:05:02.889985 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.10007 10.200.63.132:0/1027293 -- osd_map(785..785 src has 541..785) v3 -- ?+0 0x6895600
2013-02-17 06:05:02.890011 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.10202 10.200.63.132:0/1011962 -- osd_map(785..785 src has 541..785) v3 -- ?+0 0x3017000
2013-02-17 06:05:02.890031 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.9980 10.200.63.132:0/1025294 -- osd_map(785..785 src has 541..785) v3 -- ?+0 0x6895a00
2013-02-17 06:05:02.890063 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.9983 10.200.63.132:0/1025964 -- osd_map(785..785 src has 541..785) v3 -- ?+0 0x6895c00
2013-02-17 06:05:02.890104 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.10001 10.200.63.132:0/2026778 -- osd_map(785..785 src has 541..785) v3 -- ?+0 0x2f00400
2013-02-17 06:05:02.890142 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.9814 10.200.63.132:0/2029392 -- osd_map(785..785 src has 541..785) v3 -- ?+0 0x2f0ca00
2013-02-17 06:05:02.890161 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.9811 10.200.63.132:0/1029392 -- osd_map(785..785 src has 541..785) v3 -- ?+0 0x2f02000
2013-02-17 06:05:02.890198 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.8936 10.200.63.132:0/1024890 -- osd_map(785..785 src has 541..785) v3 -- ?+0 0x3f9ea00
2013-02-17 06:05:02.890235 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.9965 10.200.63.132:0/2024602 -- osd_map(785..785 src has 541..785) v3 -- ?+0 0x6892400
2013-02-17 06:05:02.890262 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197449 ==== paxos(logm commit lc 2928953 fc 0 pn 40700 opn 0 gv {2928953=5833343}) v2 ==== 370+0+0 (96512701 0 0) 0x6899600 con 0x2d189a0
2013-02-17 06:05:03.057217 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197450 ==== paxos(logm lease lc 2928953 fc 2928452 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (303443369 0 0) 0x1c591080 con 0x2d189a0
2013-02-17 06:05:03.057244 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(logm lease_ack lc 2928953 fc 2928451 pn 0 opn 0 gv {}) v2 -- ?+0 0x6899600
2013-02-17 06:05:03.109714 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197451 ==== paxos(pgmap commit lc 2900135 fc 0 pn 40700 opn 0 gv {2900135=5833344}) v2 ==== 5451+0+0 (1375112371 0 0) 0x1c636000 con 0x2d189a0
2013-02-17 06:05:03.211911 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197452 ==== paxos(pgmap lease lc 2900135 fc 2899634 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (2609033812 0 0) 0x1c5bc000 con 0x2d189a0
2013-02-17 06:05:03.211933 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(pgmap lease_ack lc 2900135 fc 2899633 pn 0 opn 0 gv {}) v2 -- ?+0 0x1c636000
2013-02-17 06:05:03.267399 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197453 ==== paxos(pgmap begin lc 2900135 fc 0 pn 40700 opn 0 gv {2900136=5833345}) v2 ==== 106258+0+0 (1573278324 0 0) 0x699edc0 con 0x2d189a0
2013-02-17 06:05:03.316969 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(pgmap accept lc 2900135 fc 0 pn 40700 opn 0 gv {}) v2 -- ?+0 0x1c5bc000
2013-02-17 06:05:03.317018 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197454 ==== paxos(pgmap commit lc 2900136 fc 0 pn 40700 opn 0 gv {2900136=5833345}) v2 ==== 106258+0+0 (2619992220 0 0) 0x6899600 con 0x2d189a0
2013-02-17 06:05:03.421783 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197455 ==== paxos(pgmap lease lc 2900136 fc 2899635 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (556628681 0 0) 0x1c5bc000 con 0x2d189a0
2013-02-17 06:05:03.421821 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(pgmap lease_ack lc 2900136 fc 2899634 pn 0 opn 0 gv {}) v2 -- ?+0 0x6899600
2013-02-17 06:05:03.468894 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== client.9983 10.200.63.132:0/1025964 550 ==== mon_subscribe({monmap=10+,osdmap=786}) v2 ==== 42+0+0 (981029754 0 0) 0xdbb2700 con 0x2d19b80
2013-02-17 06:05:03.468930 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.9983 10.200.63.132:0/1025964 -- mon_subscribe_ack(300s) v1 -- ?+0 0xccd7380
2013-02-17 06:05:03.468945 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== client.9793 10.200.63.133:0/1028590 548 ==== mon_subscribe({monmap=10+,osdmap=786}) v2 ==== 42+0+0 (981029754 0 0) 0xdbb2540 con 0x2ed8420
2013-02-17 06:05:03.468956 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.9793 10.200.63.133:0/1028590 -- mon_subscribe_ack(300s) v1 -- ?+0 0x686c340
2013-02-17 06:05:03.715599 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== client.9998 10.200.63.132:0/1026778 549 ==== mon_subscribe({monmap=10+,osdmap=786}) v2 ==== 42+0+0 (981029754 0 0) 0x2fe7c00 con 0x2ed82c0
2013-02-17 06:05:03.715637 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.9998 10.200.63.132:0/1026778 -- mon_subscribe_ack(300s) v1 -- ?+0 0x686c4e0
2013-02-17 06:05:03.771214 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197456 ==== paxos(osdmap begin lc 785 fc 0 pn 40700 opn 0 gv {786=5833346}) v2 ==== 248+0+0 (2752781820 0 0) 0x6899600 con 0x2d189a0
2013-02-17 06:05:03.833918 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(osdmap accept lc 785 fc 0 pn 40700 opn 0 gv {}) v2 -- ?+0 0x1c5bc000
2013-02-17 06:05:03.833969 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197457 ==== paxos(logm begin lc 2928953 fc 0 pn 40700 opn 0 gv {2928954=5833347}) v2 ==== 899+0+0 (2467051332 0 0) 0x1c636000 con 0x2d189a0
2013-02-17 06:05:03.866255 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(logm accept lc 2928953 fc 0 pn 40700 opn 0 gv {}) v2 -- ?+0 0x6899600
2013-02-17 06:05:03.866293 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197458 ==== paxos(osdmap commit lc 786 fc 0 pn 40700 opn 0 gv {786=5833346}) v2 ==== 248+0+0 (3711416935 0 0) 0x1c590000 con 0x2d189a0
2013-02-17 06:05:03.931450 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197459 ==== paxos(osdmap lease lc 786 fc 541 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (1686591545 0 0) 0x1c5bc000 con 0x2d189a0
2013-02-17 06:05:03.931493 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(osdmap lease_ack lc 786 fc 541 pn 0 opn 0 gv {}) v2 -- ?+0 0x1c590000
2013-02-17 06:05:03.959675 7f2879ad7700  1 mon.b4@1(peon).osd e786 e786: 2 osds: 1 up, 2 in
2013-02-17 06:05:03.990545 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.9983 10.200.63.132:0/1025964 -- osd_map(786..786 src has 541..786) v3 -- ?+0 0x6892600
2013-02-17 06:05:03.990586 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.9793 10.200.63.133:0/1028590 -- osd_map(786..786 src has 541..786) v3 -- ?+0 0x6892800
2013-02-17 06:05:03.990633 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.9998 10.200.63.132:0/1026778 -- osd_map(786..786 src has 541..786) v3 -- ?+0 0x6892a00
2013-02-17 06:05:03.990670 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197460 ==== paxos(pgmap begin lc 2900136 fc 0 pn 40700 opn 0 gv {2900137=5833348}) v2 ==== 162+0+0 (2049167727 0 0) 0x6899600 con 0x2d189a0
2013-02-17 06:05:04.032168 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(pgmap accept lc 2900136 fc 0 pn 40700 opn 0 gv {}) v2 -- ?+0 0x1c5bc000
2013-02-17 06:05:04.032203 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197461 ==== paxos(logm commit lc 2928954 fc 0 pn 40700 opn 0 gv {2928954=5833347}) v2 ==== 899+0+0 (4069256124 0 0) 0x1c590840 con 0x2d189a0
2013-02-17 06:05:04.088965 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197462 ==== paxos(logm lease lc 2928954 fc 2928453 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (2473265918 0 0) 0x1c590b00 con 0x2d189a0
2013-02-17 06:05:04.088996 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(logm lease_ack lc 2928954 fc 2928452 pn 0 opn 0 gv {}) v2 -- ?+0 0x1c590840
2013-02-17 06:05:04.129567 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197463 ==== paxos(pgmap commit lc 2900137 fc 0 pn 40700 opn 0 gv {2900137=5833348}) v2 ==== 162+0+0 (2510848205 0 0) 0x1c590000 con 0x2d189a0
2013-02-17 06:05:04.210301 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197464 ==== paxos(pgmap lease lc 2900137 fc 2899636 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (3384776277 0 0) 0x1c5bc000 con 0x2d189a0
2013-02-17 06:05:04.210368 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(pgmap lease_ack lc 2900137 fc 2899635 pn 0 opn 0 gv {}) v2 -- ?+0 0x1c590000
2013-02-17 06:05:04.239384 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197465 ==== paxos(auth lease lc 3461 fc 3441 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (3037158960 0 0) 0x1c590840 con 0x2d189a0
2013-02-17 06:05:04.239405 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(auth lease_ack lc 3461 fc 3441 pn 0 opn 0 gv {}) v2 -- ?+0 0x1c5bc000
2013-02-17 06:05:04.239426 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== client.10007 10.200.63.132:0/1027293 549 ==== mon_subscribe({monmap=10+,osdmap=786}) v2 ==== 42+0+0 (981029754 0 0) 0x30256c0 con 0x2d198c0
2013-02-17 06:05:04.239472 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.10007 10.200.63.132:0/1027293 -- osd_map(786..786 src has 541..786) v3 -- ?+0 0x6892c00
2013-02-17 06:05:04.239484 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.10007 10.200.63.132:0/1027293 -- mon_subscribe_ack(300s) v1 -- ?+0 0x686c680
2013-02-17 06:05:04.239494 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== client.10202 10.200.63.132:0/1011962 549 ==== mon_subscribe({monmap=10+,osdmap=786}) v2 ==== 42+0+0 (981029754 0 0) 0x1b93afc0 con 0x2d19080
2013-02-17 06:05:04.239535 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.10202 10.200.63.132:0/1011962 -- osd_map(786..786 src has 541..786) v3 -- ?+0 0x3283000
2013-02-17 06:05:04.239546 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.10202 10.200.63.132:0/1011962 -- mon_subscribe_ack(300s) v1 -- ?+0 0x686c820
2013-02-17 06:05:04.239554 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== client.10001 10.200.63.132:0/2026778 551 ==== mon_subscribe({monmap=10+,osdmap=786}) v2 ==== 42+0+0 (981029754 0 0) 0x19a4c380 con 0x2d19a20
2013-02-17 06:05:04.239574 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.10001 10.200.63.132:0/2026778 -- osd_map(786..786 src has 541..786) v3 -- ?+0 0x6893000
2013-02-17 06:05:04.239584 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.10001 10.200.63.132:0/2026778 -- mon_subscribe_ack(300s) v1 -- ?+0 0x686c9c0
2013-02-17 06:05:04.840455 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== client.9814 10.200.63.132:0/2029392 550 ==== mon_subscribe({monmap=10+,osdmap=786}) v2 ==== 42+0+0 (981029754 0 0) 0xdbb08c0 con 0x2d18dc0
2013-02-17 06:05:04.840544 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.9814 10.200.63.132:0/2029392 -- osd_map(786..786 src has 541..786) v3 -- ?+0 0x6893200
2013-02-17 06:05:04.840564 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.9814 10.200.63.132:0/2029392 -- mon_subscribe_ack(300s) v1 -- ?+0 0x686cb60
2013-02-17 06:05:04.882186 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== client.9811 10.200.63.132:0/1029392 550 ==== mon_subscribe({monmap=10+,osdmap=786}) v2 ==== 42+0+0 (981029754 0 0) 0xc19e8c0 con 0x2d18f20
2013-02-17 06:05:04.882245 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.9811 10.200.63.132:0/1029392 -- osd_map(786..786 src has 541..786) v3 -- ?+0 0x6893400
2013-02-17 06:05:04.882265 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.9811 10.200.63.132:0/1029392 -- mon_subscribe_ack(300s) v1 -- ?+0 0x686cd00
2013-02-17 06:05:04.920980 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197466 ==== paxos(logm begin lc 2928954 fc 0 pn 40700 opn 0 gv {2928955=5833349}) v2 ==== 636+0+0 (3516914051 0 0) 0x1c5bc000 con 0x2d189a0
2013-02-17 06:05:04.986213 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(logm accept lc 2928954 fc 0 pn 40700 opn 0 gv {}) v2 -- ?+0 0x1c590840
2013-02-17 06:05:04.986265 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197467 ==== paxos(logm commit lc 2928955 fc 0 pn 40700 opn 0 gv {2928955=5833349}) v2 ==== 636+0+0 (2318971175 0 0) 0x1c590000 con 0x2d189a0
2013-02-17 06:05:05.059002 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197468 ==== paxos(logm lease lc 2928955 fc 2928454 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (90852951 0 0) 0x1c590840 con 0x2d189a0
2013-02-17 06:05:05.059028 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(logm lease_ack lc 2928955 fc 2928453 pn 0 opn 0 gv {}) v2 -- ?+0 0x1c590000
2013-02-17 06:05:05.130136 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197469 ==== paxos(mdsmap lease lc 1 fc 1 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (439731423 0 0) 0x1c590000 con 0x2d189a0
2013-02-17 06:05:05.130163 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(mdsmap lease_ack lc 1 fc 1 pn 0 opn 0 gv {}) v2 -- ?+0 0x1c590840
2013-02-17 06:05:05.130218 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197470 ==== paxos(monmap lease lc 9 fc 1 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (286498847 0 0) 0x1c591b80 con 0x2d189a0
2013-02-17 06:05:05.130234 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(monmap lease_ack lc 9 fc 1 pn 0 opn 0 gv {}) v2 -- ?+0 0x1c590000
2013-02-17 06:05:05.159964 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== client.8936 10.200.63.132:0/1024890 550 ==== mon_subscribe({monmap=10+}) v2 ==== 23+0+0 (897212988 0 0) 0x19cb9a40 con 0x2d18b00
2013-02-17 06:05:05.159994 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.8936 10.200.63.132:0/1024890 -- mon_subscribe_ack(300s) v1 -- ?+0 0x686cea0
2013-02-17 06:05:05.301727 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== client.8936 10.200.63.132:0/1024890 551 ==== mon_subscribe({monmap=10+,osdmap=787}) v2 ==== 42+0+0 (3460793650 0 0) 0x30d1340 con 0x2d18b00
2013-02-17 06:05:05.301785 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.8936 10.200.63.132:0/1024890 -- mon_subscribe_ack(300s) v1 -- ?+0 0x686d040
2013-02-17 06:05:05.534281 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== client.9965 10.200.63.132:0/2024602 551 ==== mon_subscribe({monmap=10+,osdmap=786}) v2 ==== 42+0+0 (981029754 0 0) 0x72d0e00 con 0x2ed8580
2013-02-17 06:05:05.534371 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.9965 10.200.63.132:0/2024602 -- osd_map(786..786 src has 541..786) v3 -- ?+0 0x6893600
2013-02-17 06:05:05.534392 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.9965 10.200.63.132:0/2024602 -- mon_subscribe_ack(300s) v1 -- ?+0 0x686d1e0
2013-02-17 06:05:05.812437 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197471 ==== paxos(osdmap begin lc 786 fc 0 pn 40700 opn 0 gv {787=5833350}) v2 ==== 698+0+0 (1439899322 0 0) 0x1c590840 con 0x2d189a0
2013-02-17 06:05:05.870299 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(osdmap accept lc 786 fc 0 pn 40700 opn 0 gv {}) v2 -- ?+0 0x1c591b80
2013-02-17 06:05:05.870363 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197472 ==== paxos(osdmap commit lc 787 fc 0 pn 40700 opn 0 gv {787=5833350}) v2 ==== 698+0+0 (1125289174 0 0) 0x1c5a98c0 con 0x2d189a0
2013-02-17 06:05:05.939275 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197473 ==== paxos(osdmap lease lc 787 fc 541 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (885348033 0 0) 0x1c591b80 con 0x2d189a0
2013-02-17 06:05:05.939307 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(osdmap lease_ack lc 787 fc 541 pn 0 opn 0 gv {}) v2 -- ?+0 0x1c5a98c0
2013-02-17 06:05:05.951734 7f2879ad7700  1 mon.b4@1(peon).osd e787 e787: 2 osds: 2 up, 2 in

----------------------------------------------------------------------
ceph-mon.b5.log
----------------------------------------------------------------------
2013-02-17 06:05:00.003187 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217824 ==== paxos(pgmap commit lc 2900134 fc 0 pn 40700 opn 0 gv {2900134=5833340}) v2 ==== 4055+0+0 (2464340035 0 0) 0x27a70b00 con 0x268a9a0
2013-02-17 06:05:00.051665 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217825 ==== paxos(pgmap lease lc 2900134 fc 2899633 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (2739579357 0 0) 0x3950580 con 0x268a9a0
2013-02-17 06:05:00.051693 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(pgmap lease_ack lc 2900134 fc 2899632 pn 0 opn 0 gv {}) v2 -- ?+0 0x27a70b00
2013-02-17 06:05:00.088234 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217826 ==== route(pg_stats_ack(11 pgs tid 15412) v1 tid 31190) v2 ==== 555+0+0 (4193262135 0 0) 0x27a62480 con 0x268a9a0
2013-02-17 06:05:00.088261 7faaec18d700  1 -- 10.200.63.133:6789/0 --> osd.1 10.200.63.133:6801/21178 -- pg_stats_ack(11 pgs tid 15412) v1 -- ?+0 0x27bf4540
2013-02-17 06:05:00.124146 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217827 ==== paxos(logm begin lc 2928951 fc 0 pn 40700 opn 0 gv {2928952=5833341}) v2 ==== 408+0+0 (2121299380 0 0) 0x27a70b00 con 0x268a9a0
2013-02-17 06:05:00.152577 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(logm accept lc 2928951 fc 0 pn 40700 opn 0 gv {}) v2 -- ?+0 0x3950580
2013-02-17 06:05:00.164381 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217828 ==== paxos(logm commit lc 2928952 fc 0 pn 40700 opn 0 gv {2928952=5833341}) v2 ==== 408+0+0 (2048871512 0 0) 0x3950580 con 0x268a9a0
2013-02-17 06:05:00.221408 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217829 ==== paxos(logm lease lc 2928952 fc 2928451 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (3800722919 0 0) 0x39b22c0 con 0x268a9a0
2013-02-17 06:05:00.221436 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(logm lease_ack lc 2928952 fc 2928450 pn 0 opn 0 gv {}) v2 -- ?+0 0x3950580
2013-02-17 06:05:00.465887 7faaec18d700  1 -- 10.200.63.133:6789/0 <== client.9971 10.200.63.132:0/1024854 548 ==== mon_subscribe({monmap=10+,osdmap=785}) v2 ==== 42+0+0 (601251667 0 0) 0x27a556c0 con 0x268b1e0
2013-02-17 06:05:00.465937 7faaec18d700  1 -- 10.200.63.133:6789/0 --> client.9971 10.200.63.132:0/1024854 -- mon_subscribe_ack(300s) v1 -- ?+0 0x3ed91e0
2013-02-17 06:05:00.543992 7faaec18d700  1 -- 10.200.63.133:6789/0 <== client.9995 10.200.63.132:0/2026519 548 ==== mon_subscribe({monmap=10+,osdmap=785}) v2 ==== 42+0+0 (601251667 0 0) 0x99361c0 con 0x268b760
2013-02-17 06:05:00.544025 7faaec18d700  1 -- 10.200.63.133:6789/0 --> client.9995 10.200.63.132:0/2026519 -- mon_subscribe_ack(300s) v1 -- ?+0 0x3ed9380
2013-02-17 06:05:00.546004 7faaec18d700  1 -- 10.200.63.133:6789/0 <== client.9992 10.200.63.132:0/1026519 548 ==== mon_subscribe({monmap=10+,osdmap=785}) v2 ==== 42+0+0 (601251667 0 0) 0x2745f340 con 0x268ba20
2013-02-17 06:05:00.546038 7faaec18d700  1 -- 10.200.63.133:6789/0 --> client.9992 10.200.63.132:0/1026519 -- mon_subscribe_ack(300s) v1 -- ?+0 0x3ed9520
2013-02-17 06:05:01.197554 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217830 ==== paxos(auth lease lc 3461 fc 3441 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (364209048 0 0) 0x3950580 con 0x268a9a0
2013-02-17 06:05:01.197590 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(auth lease_ack lc 3461 fc 3441 pn 0 opn 0 gv {}) v2 -- ?+0 0x39b22c0
2013-02-17 06:05:01.348870 7faaec18d700  1 -- 10.200.63.133:6789/0 <== client.9953 10.200.63.133:0/1028024 549 ==== mon_subscribe({monmap=10+,osdmap=785}) v2 ==== 42+0+0 (601251667 0 0) 0x246a81c0 con 0x268ac60
2013-02-17 06:05:01.348913 7faaec18d700  1 -- 10.200.63.133:6789/0 --> client.9953 10.200.63.133:0/1028024 -- mon_subscribe_ack(300s) v1 -- ?+0 0x3ed96c0
2013-02-17 06:05:01.445022 7faaec18d700  1 -- 10.200.63.133:6789/0 <== client.9977 10.200.63.132:0/1025185 549 ==== mon_subscribe({monmap=10+,osdmap=785}) v2 ==== 42+0+0 (601251667 0 0) 0x279848c0 con 0x268b080
2013-02-17 06:05:01.445054 7faaec18d700  1 -- 10.200.63.133:6789/0 --> client.9977 10.200.63.132:0/1025185 -- mon_subscribe_ack(300s) v1 -- ?+0 0x3ed9860
2013-02-17 06:05:02.129960 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217831 ==== paxos(osdmap lease lc 784 fc 541 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (2306880830 0 0) 0x39b22c0 con 0x268a9a0
2013-02-17 06:05:02.130015 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(osdmap lease_ack lc 784 fc 541 pn 0 opn 0 gv {}) v2 -- ?+0 0x3950580
2013-02-17 06:05:02.130058 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217832 ==== paxos(mdsmap lease lc 1 fc 1 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (1492732137 0 0) 0x3960dc0 con 0x268a9a0
2013-02-17 06:05:02.130071 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(mdsmap lease_ack lc 1 fc 1 pn 0 opn 0 gv {}) v2 -- ?+0 0x39b22c0
2013-02-17 06:05:02.130177 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217833 ==== paxos(monmap lease lc 9 fc 1 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (445342582 0 0) 0x39b2000 con 0x268a9a0
2013-02-17 06:05:02.130201 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(monmap lease_ack lc 9 fc 1 pn 0 opn 0 gv {}) v2 -- ?+0 0x3960dc0
2013-02-17 06:05:02.567361 7faaec18d700  1 -- 10.200.63.133:6789/0 <== osd.0 10.200.63.132:6801/18444 16103 ==== osd_failure(failed osd.1 10.200.63.133:6801/21178 for 31sec e784 v784) v3 ==== 188+0+0 (2900250966 0 0) 0x2f69180 con 0x2ee6420
2013-02-17 06:05:02.567427 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- forward(osd_failure(failed osd.1 10.200.63.133:6801/21178 for 31sec e784 v784) v3) to leader v1 -- ?+0 0x39b2000
2013-02-17 06:05:02.568250 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217834 ==== route(no-reply tid 31188) v2 ==== 154+0+0 (247117294 0 0) 0x276ca000 con 0x268a9a0
2013-02-17 06:05:02.568572 7faaec18d700  1 -- 10.200.63.133:6789/0 <== osd.0 10.200.63.132:6801/18444 16104 ==== pg_stats(15 pgs tid 15379 v 784) v1 ==== 5695+0+0 (862073156 0 0) 0x27b2e480 con 0x2ee6420
2013-02-17 06:05:02.568630 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- forward(pg_stats(15 pgs tid 15379 v 784) v1) to leader v1 -- ?+0 0x27a70b00
2013-02-17 06:05:02.683997 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217835 ==== paxos(osdmap begin lc 784 fc 0 pn 40700 opn 0 gv {785=5833342}) v2 ==== 245+0+0 (3059524207 0 0) 0x27a70b00 con 0x268a9a0
2013-02-17 06:05:02.733666 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(osdmap accept lc 784 fc 0 pn 40700 opn 0 gv {}) v2 -- ?+0 0x381f8c0
2013-02-17 06:05:02.733705 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217836 ==== paxos(logm begin lc 2928952 fc 0 pn 40700 opn 0 gv {2928953=5833343}) v2 ==== 370+0+0 (1453139115 0 0) 0x39b2000 con 0x268a9a0
2013-02-17 06:05:02.758039 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(logm accept lc 2928952 fc 0 pn 40700 opn 0 gv {}) v2 -- ?+0 0x27a70b00
2013-02-17 06:05:02.758071 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217837 ==== paxos(pgmap begin lc 2900134 fc 0 pn 40700 opn 0 gv {2900135=5833344}) v2 ==== 5451+0+0 (2376171371 0 0) 0x3960dc0 con 0x268a9a0
2013-02-17 06:05:02.789264 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(pgmap accept lc 2900134 fc 0 pn 40700 opn 0 gv {}) v2 -- ?+0 0x39b2000
2013-02-17 06:05:02.789297 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217838 ==== paxos(osdmap commit lc 785 fc 0 pn 40700 opn 0 gv {785=5833342}) v2 ==== 245+0+0 (2456140731 0 0) 0x381f8c0 con 0x268a9a0
2013-02-17 06:05:02.839358 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217839 ==== paxos(osdmap lease lc 785 fc 541 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (655530447 0 0) 0x39b22c0 con 0x268a9a0
2013-02-17 06:05:02.839385 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(osdmap lease_ack lc 785 fc 541 pn 0 opn 0 gv {}) v2 -- ?+0 0x381f8c0
2013-02-17 06:05:02.859571 7faaec18d700  1 mon.b5@2(peon).osd e785 e785: 2 osds: 1 up, 2 in
2013-02-17 06:05:02.871798 7faaec18d700  1 -- 10.200.63.133:6789/0 --> client.9995 10.200.63.132:0/2026519 -- osd_map(785..785 src has 541..785) v3 -- ?+0 0x2ad3000
2013-02-17 06:05:02.871864 7faaec18d700  1 -- 10.200.63.133:6789/0 --> client.9971 10.200.63.132:0/1024854 -- osd_map(785..785 src has 541..785) v3 -- ?+0 0x2afae00
2013-02-17 06:05:02.871939 7faaec18d700  1 -- 10.200.63.133:6789/0 --> client.9992 10.200.63.132:0/1026519 -- osd_map(785..785 src has 541..785) v3 -- ?+0 0x2ba7200
2013-02-17 06:05:02.871977 7faaec18d700  1 -- 10.200.63.133:6789/0 --> client.9977 10.200.63.132:0/1025185 -- osd_map(785..785 src has 541..785) v3 -- ?+0 0x2b63400
2013-02-17 06:05:02.871998 7faaec18d700  1 -- 10.200.63.133:6789/0 --> client.9953 10.200.63.133:0/1028024 -- osd_map(785..785 src has 541..785) v3 -- ?+0 0x2ba6600
2013-02-17 06:05:02.872016 7faaec18d700  1 -- 10.200.63.133:6789/0 --> client.9986 10.200.63.132:0/1026221 -- osd_map(785..785 src has 541..785) v3 -- ?+0 0x2a9ae00
2013-02-17 06:05:02.872113 7faaec18d700  1 -- 10.200.63.133:6789/0 --> client.9989 10.200.63.132:0/2026221 -- osd_map(785..785 src has 541..785) v3 -- ?+0 0x2bce200
2013-02-17 06:05:02.872135 7faaec18d700  1 -- 10.200.63.133:6789/0 --> client.9968 10.200.63.132:0/1024758 -- osd_map(785..785 src has 541..785) v3 -- ?+0 0x285e800
2013-02-17 06:05:02.872197 7faaec18d700  1 -- 10.200.63.133:6789/0 --> client.9974 10.200.63.132:0/1024950 -- osd_map(785..785 src has 541..785) v3 -- ?+0 0x2ba7c00
2013-02-17 06:05:02.872250 7faaec18d700  1 -- 10.200.63.133:6789/0 --> client.9956 10.200.63.132:0/1024424 -- osd_map(785..785 src has 541..785) v3 -- ?+0 0x2a4f000
2013-02-17 06:05:02.872272 7faaec18d700  1 -- 10.200.63.133:6789/0 --> client.9959 10.200.63.132:0/2024424 -- osd_map(785..785 src has 541..785) v3 -- ?+0 0x2b21a00
2013-02-17 06:05:02.872294 7faaec18d700  1 -- 10.200.63.133:6789/0 --> osd.0 10.200.63.132:6801/18444 -- osd_map(784..785 src has 541..785) v3 -- ?+0 0x2b64c00
2013-02-17 06:05:02.872311 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217840 ==== route(osd_map(784..785 src has 541..785) v3 tid 31191) v2 ==== 549+0+0 (1208657806 0 0) 0xd2c6fc0 con 0x268a9a0
2013-02-17 06:05:02.872323 7faaec18d700  1 -- 10.200.63.133:6789/0 --> osd.0 10.200.63.132:6801/18444 -- osd_map(784..785 src has 541..785) v3 -- ?+0 0x2c89c00
2013-02-17 06:05:02.872339 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217841 ==== paxos(logm commit lc 2928953 fc 0 pn 40700 opn 0 gv {2928953=5833343}) v2 ==== 370+0+0 (1947946272 0 0) 0x39b2000 con 0x268a9a0
2013-02-17 06:05:02.931840 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217842 ==== paxos(logm lease lc 2928953 fc 2928452 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (2254758468 0 0) 0x27a70b00 con 0x268a9a0
2013-02-17 06:05:02.931867 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(logm lease_ack lc 2928953 fc 2928451 pn 0 opn 0 gv {}) v2 -- ?+0 0x39b2000
2013-02-17 06:05:02.953651 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217843 ==== paxos(pgmap commit lc 2900135 fc 0 pn 40700 opn 0 gv {2900135=5833344}) v2 ==== 5451+0+0 (3278650520 0 0) 0x3950580 con 0x268a9a0
2013-02-17 06:05:03.013296 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217844 ==== paxos(pgmap lease lc 2900135 fc 2899634 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (4264509609 0 0) 0x381fb80 con 0x268a9a0
2013-02-17 06:05:03.013323 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(pgmap lease_ack lc 2900135 fc 2899633 pn 0 opn 0 gv {}) v2 -- ?+0 0x3950580
2013-02-17 06:05:03.043166 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217845 ==== paxos(pgmap begin lc 2900135 fc 0 pn 40700 opn 0 gv {2900136=5833345}) v2 ==== 106258+0+0 (3288368916 0 0) 0x381f8c0 con 0x268a9a0
2013-02-17 06:05:03.071470 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(pgmap accept lc 2900135 fc 0 pn 40700 opn 0 gv {}) v2 -- ?+0 0x381fb80
2013-02-17 06:05:03.071503 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217846 ==== route(pg_stats_ack(15 pgs tid 15379) v1 tid 31192) v2 ==== 671+0+0 (3596506077 0 0) 0xcecf680 con 0x268a9a0
2013-02-17 06:05:03.071517 7faaec18d700  1 -- 10.200.63.133:6789/0 --> osd.0 10.200.63.132:6801/18444 -- pg_stats_ack(15 pgs tid 15379) v1 -- ?+0 0x26b41c0
2013-02-17 06:05:03.071536 7faaec18d700  1 -- 10.200.63.133:6789/0 <== osd.0 10.200.63.132:6801/18444 16105 ==== mon_subscribe({monmap=10+,osd_pg_creates=0,osdmap=785}) v2 ==== 69+0+0 (1495392616 0 0) 0x27a5c000 con 0x2ee6420
2013-02-17 06:05:03.071578 7faaec18d700  1 -- 10.200.63.133:6789/0 --> osd.0 10.200.63.132:6801/18444 -- osd_map(785..785 src has 541..785) v3 -- ?+0 0x2a9bc00
2013-02-17 06:05:03.071590 7faaec18d700  1 -- 10.200.63.133:6789/0 --> osd.0 10.200.63.132:6801/18444 -- mon_subscribe_ack(300s) v1 -- ?+0 0x3ed9a00
2013-02-17 06:05:03.071640 7faaec18d700  1 -- 10.200.63.133:6789/0 <== osd.0 10.200.63.132:6801/18444 16106 ==== osd_alive(want up_thru 785 have 785) v1 ==== 22+0+0 (1697281490 0 0) 0x26675180 con 0x2ee6420
2013-02-17 06:05:03.071678 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- forward(osd_alive(want up_thru 785 have 785) v1) to leader v1 -- ?+0 0x381edc0
2013-02-17 06:05:03.086178 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217847 ==== paxos(pgmap commit lc 2900136 fc 0 pn 40700 opn 0 gv {2900136=5833345}) v2 ==== 106258+0+0 (3946206785 0 0) 0x381edc0 con 0x268a9a0
2013-02-17 06:05:03.317782 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217848 ==== paxos(pgmap lease lc 2900136 fc 2899635 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (3766713448 0 0) 0x381fb80 con 0x268a9a0
2013-02-17 06:05:03.317811 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(pgmap lease_ack lc 2900136 fc 2899634 pn 0 opn 0 gv {}) v2 -- ?+0 0x381edc0
2013-02-17 06:05:03.583718 7faaec18d700  1 -- 10.200.63.133:6789/0 <== client.9977 10.200.63.132:0/1025185 550 ==== mon_subscribe({monmap=10+,osdmap=786}) v2 ==== 42+0+0 (981029754 0 0) 0xf313a40 con 0x268b080
2013-02-17 06:05:03.583753 7faaec18d700  1 -- 10.200.63.133:6789/0 --> client.9977 10.200.63.132:0/1025185 -- mon_subscribe_ack(300s) v1 -- ?+0 0x3ed9d40
2013-02-17 06:05:03.765409 7faaec18d700  1 -- 10.200.63.133:6789/0 <== client.9986 10.200.63.132:0/1026221 548 ==== mon_subscribe({monmap=10+,osdmap=786}) v2 ==== 42+0+0 (981029754 0 0) 0x254b4700 con 0x268b340
2013-02-17 06:05:03.765443 7faaec18d700  1 -- 10.200.63.133:6789/0 --> client.9986 10.200.63.132:0/1026221 -- mon_subscribe_ack(300s) v1 -- ?+0 0xc11e4e0
2013-02-17 06:05:03.771262 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217849 ==== paxos(osdmap begin lc 785 fc 0 pn 40700 opn 0 gv {786=5833346}) v2 ==== 248+0+0 (1123691466 0 0) 0x381edc0 con 0x268a9a0
2013-02-17 06:05:03.800417 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(osdmap accept lc 785 fc 0 pn 40700 opn 0 gv {}) v2 -- ?+0 0x381fb80
2013-02-17 06:05:03.811504 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217850 ==== paxos(logm begin lc 2928953 fc 0 pn 40700 opn 0 gv {2928954=5833347}) v2 ==== 899+0+0 (976230293 0 0) 0x381fb80 con 0x268a9a0
2013-02-17 06:05:03.836730 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(logm accept lc 2928953 fc 0 pn 40700 opn 0 gv {}) v2 -- ?+0 0x381edc0
2013-02-17 06:05:03.836762 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217851 ==== paxos(osdmap commit lc 786 fc 0 pn 40700 opn 0 gv {786=5833346}) v2 ==== 248+0+0 (1548565610 0 0) 0x3950580 con 0x268a9a0
2013-02-17 06:05:03.885558 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217852 ==== paxos(osdmap lease lc 786 fc 541 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (4021780416 0 0) 0x39b2000 con 0x268a9a0
2013-02-17 06:05:03.885585 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(osdmap lease_ack lc 786 fc 541 pn 0 opn 0 gv {}) v2 -- ?+0 0x3950580
2013-02-17 06:05:03.897747 7faaec18d700  1 mon.b5@2(peon).osd e786 e786: 2 osds: 1 up, 2 in
2013-02-17 06:05:03.910012 7faaec18d700  1 -- 10.200.63.133:6789/0 --> client.9977 10.200.63.132:0/1025185 -- osd_map(786..786 src has 541..786) v3 -- ?+0 0x2ba6a00
2013-02-17 06:05:03.910042 7faaec18d700  1 -- 10.200.63.133:6789/0 --> client.9986 10.200.63.132:0/1026221 -- osd_map(786..786 src has 541..786) v3 -- ?+0 0x2a9aa00
2013-02-17 06:05:03.910112 7faaec18d700  1 -- 10.200.63.133:6789/0 --> osd.0 10.200.63.132:6801/18444 -- osd_map(785..786 src has 541..786) v3 -- ?+0 0x2d00800
2013-02-17 06:05:03.910135 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217853 ==== paxos(pgmap begin lc 2900136 fc 0 pn 40700 opn 0 gv {2900137=5833348}) v2 ==== 162+0+0 (3432763774 0 0) 0x381edc0 con 0x268a9a0
2013-02-17 06:05:03.934366 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(pgmap accept lc 2900136 fc 0 pn 40700 opn 0 gv {}) v2 -- ?+0 0x39b2000
2013-02-17 06:05:03.934397 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217854 ==== route(osd_map(785..786 src has 541..786) v3 tid 31193) v2 ==== 541+0+0 (2399459519 0 0) 0x27b72fc0 con 0x268a9a0
2013-02-17 06:05:03.934411 7faaec18d700  1 -- 10.200.63.133:6789/0 --> osd.0 10.200.63.132:6801/18444 -- osd_map(785..786 src has 541..786) v3 -- ?+0 0x285f600
2013-02-17 06:05:03.934428 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217855 ==== paxos(logm commit lc 2928954 fc 0 pn 40700 opn 0 gv {2928954=5833347}) v2 ==== 899+0+0 (2190316667 0 0) 0x3950580 con 0x268a9a0
2013-02-17 06:05:03.986328 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217856 ==== paxos(logm lease lc 2928954 fc 2928453 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (1074855049 0 0) 0x387f8c0 con 0x268a9a0
2013-02-17 06:05:03.986355 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(logm lease_ack lc 2928954 fc 2928452 pn 0 opn 0 gv {}) v2 -- ?+0 0x3950580
2013-02-17 06:05:04.008094 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217857 ==== paxos(pgmap commit lc 2900137 fc 0 pn 40700 opn 0 gv {2900137=5833348}) v2 ==== 162+0+0 (3385534158 0 0) 0x39b2000 con 0x268a9a0
2013-02-17 06:05:04.061016 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217858 ==== paxos(pgmap lease lc 2900137 fc 2899636 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (1623274848 0 0) 0x3950580 con 0x268a9a0
2013-02-17 06:05:04.061043 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(pgmap lease_ack lc 2900137 fc 2899635 pn 0 opn 0 gv {}) v2 -- ?+0 0x39b2000
2013-02-17 06:05:04.089621 7faaec18d700  1 -- 10.200.63.133:6789/0 <== client.9974 10.200.63.132:0/1024950 550 ==== mon_subscribe({monmap=10+}) v2 ==== 23+0+0 (897212988 0 0) 0x99376c0 con 0x268b8c0
2013-02-17 06:05:04.089641 7faaec18d700  1 -- 10.200.63.133:6789/0 --> client.9974 10.200.63.132:0/1024950 -- mon_subscribe_ack(300s) v1 -- ?+0 0xc11e9c0
2013-02-17 06:05:04.161576 7faaec18d700  1 -- 10.200.63.133:6789/0 <== client.9953 10.200.63.133:0/1028024 550 ==== mon_subscribe({monmap=10+,osdmap=786}) v2 ==== 42+0+0 (981029754 0 0) 0x270f3500 con 0x268ac60
2013-02-17 06:05:04.161635 7faaec18d700  1 -- 10.200.63.133:6789/0 --> client.9953 10.200.63.133:0/1028024 -- osd_map(786..786 src has 541..786) v3 -- ?+0 0x290e400
2013-02-17 06:05:04.161654 7faaec18d700  1 -- 10.200.63.133:6789/0 --> client.9953 10.200.63.133:0/1028024 -- mon_subscribe_ack(300s) v1 -- ?+0 0xc11e340
2013-02-17 06:05:04.197772 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217859 ==== paxos(auth lease lc 3461 fc 3441 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (335508935 0 0) 0x39b2000 con 0x268a9a0
2013-02-17 06:05:04.197801 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(auth lease_ack lc 3461 fc 3441 pn 0 opn 0 gv {}) v2 -- ?+0 0x3950580
2013-02-17 06:05:04.198129 7faaec18d700  1 -- 10.200.63.133:6789/0 <== client.9989 10.200.63.132:0/2026221 549 ==== mon_subscribe({monmap=10+,osdmap=786}) v2 ==== 42+0+0 (981029754 0 0) 0x9937340 con 0x2ee66e0
2013-02-17 06:05:04.198188 7faaec18d700  1 -- 10.200.63.133:6789/0 --> client.9989 10.200.63.132:0/2026221 -- osd_map(786..786 src has 541..786) v3 -- ?+0 0x2a4e000
2013-02-17 06:05:04.198208 7faaec18d700  1 -- 10.200.63.133:6789/0 --> client.9989 10.200.63.132:0/2026221 -- mon_subscribe_ack(300s) v1 -- ?+0 0xc11f1e0
2013-02-17 06:05:04.868960 7faaec18d700  1 -- 10.200.63.133:6789/0 <== osd.1 10.200.63.133:6801/21178 16167 ==== pg_stats(11 pgs tid 15413 v 784) v1 ==== 4203+0+0 (1978113006 0 0) 0xd460fc0 con 0x268b4a0
2013-02-17 06:05:04.869046 7faaec18d700  1 -- 10.200.63.133:6789/0 --> 10.200.63.133:6801/21178 -- osd_map(785..786 src has 541..786) v3 -- ?+0 0x2b20600 con 0x268b4a0
2013-02-17 06:05:04.869070 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- forward(pg_stats(11 pgs tid 15413 v 784) v1) to leader v1 -- ?+0 0x39b2000
2013-02-17 06:05:04.870167 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217860 ==== route(osd_map(785..786 src has 541..786) v3 tid 31194) v2 ==== 541+0+0 (4109387781 0 0) 0xd2946c0 con 0x268a9a0
2013-02-17 06:05:04.870195 7faaec18d700  1 -- 10.200.63.133:6789/0 --> osd.1 10.200.63.133:6801/21178 -- osd_map(785..786 src has 541..786) v3 -- ?+0 0x2ad2600
2013-02-17 06:05:04.920875 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217861 ==== paxos(logm begin lc 2928954 fc 0 pn 40700 opn 0 gv {2928955=5833349}) v2 ==== 636+0+0 (2711767532 0 0) 0x39b2000 con 0x268a9a0
2013-02-17 06:05:04.964213 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(logm accept lc 2928954 fc 0 pn 40700 opn 0 gv {}) v2 -- ?+0 0x387f8c0
2013-02-17 06:05:04.977102 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217862 ==== paxos(logm commit lc 2928955 fc 0 pn 40700 opn 0 gv {2928955=5833349}) v2 ==== 636+0+0 (4237095061 0 0) 0x387f8c0 con 0x268a9a0
2013-02-17 06:05:05.039837 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217863 ==== paxos(logm lease lc 2928955 fc 2928454 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (3942935192 0 0) 0x3950580 con 0x268a9a0
2013-02-17 06:05:05.039864 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(logm lease_ack lc 2928955 fc 2928453 pn 0 opn 0 gv {}) v2 -- ?+0 0x387f8c0
2013-02-17 06:05:05.129923 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217864 ==== paxos(mdsmap lease lc 1 fc 1 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (2330587895 0 0) 0x387f8c0 con 0x268a9a0
2013-02-17 06:05:05.129952 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(mdsmap lease_ack lc 1 fc 1 pn 0 opn 0 gv {}) v2 -- ?+0 0x3950580
2013-02-17 06:05:05.130245 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217865 ==== paxos(monmap lease lc 9 fc 1 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (1303228847 0 0) 0x3950840 con 0x268a9a0
2013-02-17 06:05:05.130263 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(monmap lease_ack lc 9 fc 1 pn 0 opn 0 gv {}) v2 -- ?+0 0x387f8c0
2013-02-17 06:05:05.380504 7faaec18d700  1 -- 10.200.63.133:6789/0 <== client.9968 10.200.63.132:0/1024758 549 ==== mon_subscribe({monmap=10+,osdmap=786}) v2 ==== 42+0+0 (981029754 0 0) 0x27a0c380 con 0x268b600
2013-02-17 06:05:05.380583 7faaec18d700  1 -- 10.200.63.133:6789/0 --> client.9968 10.200.63.132:0/1024758 -- osd_map(786..786 src has 541..786) v3 -- ?+0 0x2c89a00
2013-02-17 06:05:05.380603 7faaec18d700  1 -- 10.200.63.133:6789/0 --> client.9968 10.200.63.132:0/1024758 -- mon_subscribe_ack(300s) v1 -- ?+0 0xc11fd40
2013-02-17 06:05:05.726864 7faaec18d700  1 -- 10.200.63.133:6789/0 <== osd.1 10.200.63.133:6801/21178 16168 ==== mon_get_version(what=osdmap handle=2) v1 ==== 18+0+0 (3896555503 0 0) 0xd3e5860 con 0x268b4a0
2013-02-17 06:05:05.726889 7faaec18d700  1 -- 10.200.63.133:6789/0 --> osd.1 10.200.63.133:6801/21178 -- mon_check_map_ack(handle=2 version=786) v2 -- ?+0 0x4aa4820
2013-02-17 06:05:05.727547 7faaec18d700  1 -- 10.200.63.133:6789/0 <== osd.1 10.200.63.133:6801/21178 16169 ==== osd_boot(osd.1 booted 780 v786) v3 ==== 581+0+0 (2497288980 0 0) 0x8340000 con 0x268b4a0
2013-02-17 06:05:05.727606 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- forward(osd_boot(osd.1 booted 780 v786) v3) to leader v1 -- ?+0 0x3950840
2013-02-17 06:05:05.812260 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217866 ==== paxos(osdmap begin lc 786 fc 0 pn 40700 opn 0 gv {787=5833350}) v2 ==== 698+0+0 (4074253838 0 0) 0x387f8c0 con 0x268a9a0
2013-02-17 06:05:05.841338 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(osdmap accept lc 786 fc 0 pn 40700 opn 0 gv {}) v2 -- ?+0 0x39b2000
2013-02-17 06:05:05.856277 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217867 ==== paxos(osdmap commit lc 787 fc 0 pn 40700 opn 0 gv {787=5833350}) v2 ==== 698+0+0 (62741733 0 0) 0x3950580 con 0x268a9a0
2013-02-17 06:05:05.905956 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217868 ==== paxos(osdmap lease lc 787 fc 541 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (719158437 0 0) 0x27a74b00 con 0x268a9a0
2013-02-17 06:05:05.905976 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(osdmap lease_ack lc 787 fc 541 pn 0 opn 0 gv {}) v2 -- ?+0 0x3950580
2013-02-17 06:05:05.918189 7faaec18d700  1 mon.b5@2(peon).osd e787 e787: 2 osds: 2 up, 2 in

----------------------------------------------------------------------
iostat
----------------------------------------------------------------------

b4:osd.0            load %user %nice  %sys  %iow  %stl %idle  dev rrqm/s wrqm/s    r/s    w/s    rkB/s    wkB/s arq-sz aqu-sz  await  rwait wwait %util
2013-02-17-06:04:04  0.4   1.2   0.0   1.4   0.6   0.0  94.1  sdn    0.0    0.0    9.7   13.7    49.33   126.70  15.00   0.37  15.88   3.77  24.5  10.5
2013-02-17-06:04:19  0.4   1.4   0.0   1.3   0.6   0.0  93.8  sdn    0.0    0.1    2.9   18.2    12.53   171.73  17.49   0.44  20.92   6.98  23.1  11.3
2013-02-17-06:04:34  0.6   1.8   0.0   1.6   0.5   0.0  93.7  sdn    0.0    0.0    2.7   20.3    11.47   169.67  15.80   0.56  14.74   4.00  16.2  11.5
2013-02-17-06:04:49  0.5   1.4   0.0   1.3   0.5   0.0  93.9  sdn    0.0    0.1    2.9   15.0    12.27   157.50  18.93   0.25  26.17   5.45  30.2   7.9
2013-02-17-06:05:04  0.7   2.0   0.0   1.1   1.2   0.0  92.5  sdn    0.0    0.1    7.7   62.7    32.53   498.20  15.09   1.44  20.36  22.43  20.1  22.6
2013-02-17-06:05:19  1.4   1.0   0.0   0.5   8.8   0.0  88.8  sdn    0.0    0.1    2.7  209.8    20.53  3959.80  37.47  89.88 414.63 159.00 417.9  99.9
2013-02-17-06:05:34  1.4   2.0   0.0   1.0   1.3   0.0  93.2  sdn    0.0    1.7    4.8   54.5   544.53   462.63  33.99   4.40 104.45  15.00 112.3  33.8
2013-02-17-06:05:49  1.2   2.2   0.0   1.3   0.6   0.0  93.3  sdn    0.0    0.6    0.0   18.5     0.00   175.90  18.98   0.49  26.44   0.00  26.4  10.1
2013-02-17-06:06:04  1.0   1.2   0.0   1.0   1.1   0.0  93.7  sdn    0.0    1.3    0.6   31.4     2.40   225.17  14.22   1.49  46.48  48.89  46.4  21.9
2013-02-17-06:06:19  0.8   4.0   0.0   1.3   0.7   0.0  90.7  sdn    0.0    0.7    2.1   21.5    16.80   202.60  18.65   0.45  18.95  16.45  19.2  13.1
2013-02-17-06:06:34  1.2   4.4   0.0   1.3   0.9   0.0  90.1  sdn    0.0    0.3    1.5   18.7    45.33   161.67  20.56   0.45  22.45  48.18  20.4  11.1
2013-02-17-06:06:49  1.2   3.4   0.0   1.5   1.0   0.0  90.5  sdn    0.0    0.5    0.7   26.7     3.47   246.93  18.32   1.38  50.61  74.00  50.0  21.0
2013-02-17-06:07:04  1.3   4.1   0.0   1.4   0.7   0.0  90.3  sdn    0.0    0.2    0.9   13.2     4.27   128.00  18.81   0.42  30.19  18.46  31.0  10.9

b4:osd.0-journal    load %user %nice  %sys  %iow  %stl %idle  dev rrqm/s wrqm/s    r/s    w/s    rkB/s    wkB/s arq-sz aqu-sz  await  rwait wwait %util
2013-02-17-06:04:04  0.4   1.2   0.0   1.4   0.6   0.0  94.1  md2    0.0    0.0    0.0    9.3     0.00    96.53  20.69   0.00   0.00   0.00   0.0   0.0
2013-02-17-06:04:19  0.4   1.3   0.0   1.3   0.6   0.0  93.8  md2    0.0    0.0    0.0    7.3     0.00   113.87  31.05   0.00   0.00   0.00   0.0   0.0
2013-02-17-06:04:34  0.6   1.8   0.0   1.6   0.5   0.0  93.7  md2    0.0    0.0    0.0    9.7     0.00   126.67  26.03   0.00   0.00   0.00   0.0   0.0
2013-02-17-06:04:49  0.5   1.4   0.0   1.2   0.5   0.0  93.9  md2    0.0    0.0    0.0    8.7     0.00   125.33  28.92   0.00   0.00   0.00   0.0   0.0
2013-02-17-06:05:04  0.7   2.0   0.0   1.1   1.2   0.0  92.4  md2    0.0    0.0    0.0   19.1     0.00   573.07  59.90   0.00   0.00   0.00   0.0   0.0
2013-02-17-06:05:19  1.4   1.0   0.0   0.5   8.7   0.0  88.8  md2    0.0    0.0    0.0   34.7     0.00  3702.13 213.58   0.00   0.00   0.00   0.0   0.0
2013-02-17-06:05:34  1.4   2.0   0.0   1.0   1.4   0.0  93.1  md2    0.0    0.0    0.0   18.5     0.00   297.07  32.06   0.00   0.00   0.00   0.0   0.0
2013-02-17-06:05:49  1.2   2.1   0.0   1.4   0.6   0.0  93.3  md2    0.0    0.0    0.0   10.5     0.00   124.27  23.59   0.00   0.00   0.00   0.0   0.0
2013-02-17-06:06:04  1.0   1.2   0.0   0.9   1.1   0.0  93.8  md2    0.0    0.0    0.0   15.3     0.00   187.47  24.45   0.00   0.00   0.00   0.0   0.0
2013-02-17-06:06:19  0.8   4.0   0.0   1.3   0.7   0.0  90.8  md2    0.0    0.0    0.0    9.2     0.00   108.80  23.65   0.00   0.00   0.00   0.0   0.0
2013-02-17-06:06:34  1.2   4.4   0.0   1.3   0.9   0.0  90.1  md2    0.0    0.0    0.0   16.0     0.00   170.40  21.30   0.00   0.00   0.00   0.0   0.0
2013-02-17-06:06:49  1.2   3.4   0.0   1.5   1.0   0.0  90.4  md2    0.0    0.0    0.0   11.9     0.00   141.87  23.91   0.00   0.00   0.00   0.0   0.0
2013-02-17-06:07:04  1.3   4.1   0.0   1.4   0.7   0.0  90.4  md2    0.0    0.0    0.0    9.3     0.00   108.53  23.26   0.00   0.00   0.00   0.0   0.0

b5:osd.1            load %user %nice  %sys  %iow  %stl %idle  dev rrqm/s wrqm/s    r/s    w/s    rkB/s    wkB/s arq-sz aqu-sz  await  rwait wwait %util
2013-02-17-06:04:05  0.7   0.1   0.0   0.1   0.7   0.0  98.2  sdg    0.0    0.0    9.4   19.1    48.00   166.17  15.01   1.17  40.89  20.92  50.7  27.2
2013-02-17-06:04:20  0.6   0.1   0.0   0.1   0.6   0.0  98.2  sdg    0.0    0.0    2.7   15.8    10.67   147.00  17.08   0.68  36.64   9.50  41.2  13.5
2013-02-17-06:04:35  0.5   0.2   0.0   0.1   0.7   0.0  98.0  sdg    0.0    0.0    3.2   22.5    13.07   204.63  16.92   1.10  42.77  30.62  44.5  23.1
2013-02-17-06:04:50  0.4   0.1   0.0   0.1   0.4   0.0  98.3  sdg    0.0    0.0    2.7   13.6    10.67   142.43  18.82   0.59  36.39  22.50  39.1  13.1
2013-02-17-06:05:05  0.3   0.1   0.0   0.1   1.1   0.0  97.7  sdg    0.0    0.0    3.0   20.7    11.80   190.57  17.05   0.96  40.48  23.56  42.9  20.1
2013-02-17-06:05:20  1.4   0.7   0.0   0.6   6.2   0.0  91.9  sdg    0.0    0.0    7.1  233.9    30.17  2871.43  24.08  91.60 342.67 132.99 349.1  99.5
2013-02-17-06:05:35  1.1   0.2   0.0   0.1   2.3   0.0  96.8  sdg    0.0    1.5    1.7   73.5    18.93  1111.63  30.09   9.59 247.66  76.00 251.6  49.6
2013-02-17-06:05:50  1.0   0.1   0.0   0.1   0.6   0.0  98.8  sdg    0.0    0.1    0.0   14.6     0.00   144.87  19.84   0.52  35.62   0.00  35.6   9.9
2013-02-17-06:06:05  0.8   0.2   0.0   0.1   1.5   0.0  97.7  sdg    0.0    1.1    2.1   37.9     8.27   297.93  15.31   1.97  49.17  64.84  48.3  30.1
2013-02-17-06:06:20  0.8   0.4   0.0   0.2   0.7   0.0  97.9  sdg    0.0    0.1   11.1   15.1   123.73   137.47  19.89   0.65  24.62  17.07  30.2  22.3
2013-02-17-06:06:35  0.7   0.4   0.0   0.2   1.5   0.0  97.1  sdg    0.0    0.1    9.4   23.7   150.13   223.33  22.54   1.27  38.31  19.79  45.6  26.4
2013-02-17-06:06:50  0.8   0.4   0.0   0.2   1.2   0.0  97.4  sdg    0.0    0.2    8.5   22.5   138.40   188.87  21.11   1.71  55.01  28.35  65.0  30.4
2013-02-17-06:07:05  0.8   0.5   0.0   0.3   1.2   0.0  97.2  sdg    0.0    0.0    7.2   18.4    93.07   185.93  21.80   1.13  44.30  26.02  51.4  30.0

[b5:osd.1-journal not recorded]

----------------------------------------------------------------------


* Re: Mon losing touch with OSDs
  2013-02-17 23:41     ` Chris Dunlop
@ 2013-02-18  1:44       ` Sage Weil
  2013-02-19  3:02         ` Chris Dunlop
  0 siblings, 1 reply; 25+ messages in thread
From: Sage Weil @ 2013-02-18  1:44 UTC (permalink / raw)
  To: Chris Dunlop; +Cc: ceph-devel

On Mon, 18 Feb 2013, Chris Dunlop wrote:
> G'day Sage,
> 
> On Sat, Feb 16, 2013 at 09:05:21AM +1100, Chris Dunlop wrote:
> > On Thu, Feb 14, 2013 at 08:57:11PM -0800, Sage Weil wrote:
> >> On Fri, 15 Feb 2013, Chris Dunlop wrote:
> >>> In an otherwise seemingly healthy cluster (ceph 0.56.2), what might cause the
> >>> mons to lose touch with the osds?
> >> 
> >> Can you enable 'debug ms = 1' on the mons and leave them that way, in the 
> >> hopes that this happens again?  It will give us more information to go on.
> > 
> > Debug turned on.
> 
> We haven't experienced the cluster losing touch with the osds completely
> since upgrading from 0.56.2 to 0.56.3, but we did lose touch with osd.1
> for a few seconds before it recovered. See below for logs (reminder: 3
> boxes, b2 is mon-only, b4 is mon+osd.0, b5 is mon+osd.1).
> 
> The osd.1 drop was associated with a bit of a write iops spike on the osd
> disks (logs below, "w/s" column), although the logs also show plenty of
> other similar spikes that haven't led to a drop.  ...oh, a closer look at
> the timestamps shows the spike actually came after the drop, so it wasn't
> the spike that caused the drop.
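
For reference, the heartbeat_check lines in the osd.0 log below are simple
cutoff arithmetic: a peer is flagged once its last heartbeat reply is older
than now minus the grace period (20 s here, judging by the cutoff timestamps,
which trail each log timestamp by exactly 20 s). A minimal sketch, with
`heartbeat_failed` a hypothetical helper, plugging in values from the first
heartbeat_check line:

```python
from datetime import datetime, timedelta

GRACE = timedelta(seconds=20)  # inferred from the cutoff timestamps below

def heartbeat_failed(now: datetime, last_reply: datetime,
                     grace: timedelta = GRACE) -> bool:
    """A peer is considered failed when its last heartbeat reply
    is older than the cutoff (now - grace)."""
    cutoff = now - grace
    return last_reply < cutoff

# Values from "2013-02-17 06:04:51.560283 ... no reply from osd.1
# since 2013-02-17 06:04:30.768700 (cutoff 2013-02-17 06:04:31.560283)":
now = datetime(2013, 2, 17, 6, 4, 51, 560283)
last = datetime(2013, 2, 17, 6, 4, 30, 768700)
print(heartbeat_failed(now, last))  # -> True: osd.1 is past the cutoff
```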

Hrm, I don't see any obvious clues.  You could enable 'debug ms = 1' on 
the osds as well.  That will give us more to go on if/when it happens 
again, and should not affect performance significantly.
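
The setting can be injected into running daemons without a restart; a sketch,
assuming a bobtail-era (0.56.x) cluster where the `injectargs` form below
applies (newer releases use `ceph tell osd.N injectargs`):

```shell
# Turn on message-level debugging on both running osds:
ceph osd tell 0 injectargs '--debug-ms 1'
ceph osd tell 1 injectargs '--debug-ms 1'

# To make the setting survive a daemon restart, also add it to ceph.conf:
#   [osd]
#       debug ms = 1
```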

sage

> 
> Cheers,
> 
> Chris
> 
> ----------------------------------------------------------------------
> ceph-osd.0.log
> ----------------------------------------------------------------------
> 2013-02-17 05:50:58.841310 7f108cf1b700  0 log [INF] : 2.44 scrub ok
> 2013-02-17 06:03:54.406730 7f108cf1b700  0 log [INF] : 2.51 scrub ok
> 2013-02-17 06:04:51.560283 7f109af37700 -1 osd.0 784 heartbeat_check: no reply from osd.1 since 2013-02-17 06:04:30.768700 (cutoff 2013-02-17 06:04:31.560283)
> 2013-02-17 06:04:51.769792 7f108bf19700 -1 osd.0 784 heartbeat_check: no reply from osd.1 since 2013-02-17 06:04:30.768700 (cutoff 2013-02-17 06:04:31.769792)
> 2013-02-17 06:04:52.565376 7f109af37700 -1 osd.0 784 heartbeat_check: no reply from osd.1 since 2013-02-17 06:04:30.768700 (cutoff 2013-02-17 06:04:32.565376)
> 2013-02-17 06:04:53.565629 7f109af37700 -1 osd.0 784 heartbeat_check: no reply from osd.1 since 2013-02-17 06:04:30.768700 (cutoff 2013-02-17 06:04:33.565628)
> 2013-02-17 06:04:54.565813 7f109af37700 -1 osd.0 784 heartbeat_check: no reply from osd.1 since 2013-02-17 06:04:30.768700 (cutoff 2013-02-17 06:04:34.565812)
> 2013-02-17 06:04:55.565906 7f109af37700 -1 osd.0 784 heartbeat_check: no reply from osd.1 since 2013-02-17 06:04:30.768700 (cutoff 2013-02-17 06:04:35.565905)
> 2013-02-17 06:04:55.870011 7f108bf19700 -1 osd.0 784 heartbeat_check: no reply from osd.1 since 2013-02-17 06:04:30.768700 (cutoff 2013-02-17 06:04:35.870011)
> 2013-02-17 06:04:56.566030 7f109af37700 -1 osd.0 784 heartbeat_check: no reply from osd.1 since 2013-02-17 06:04:30.768700 (cutoff 2013-02-17 06:04:36.566029)
> 2013-02-17 06:04:57.566227 7f109af37700 -1 osd.0 784 heartbeat_check: no reply from osd.1 since 2013-02-17 06:04:30.768700 (cutoff 2013-02-17 06:04:37.566227)
> 2013-02-17 06:04:57.570184 7f108bf19700 -1 osd.0 784 heartbeat_check: no reply from osd.1 since 2013-02-17 06:04:30.768700 (cutoff 2013-02-17 06:04:37.570184)
> 2013-02-17 06:04:58.070400 7f108bf19700 -1 osd.0 784 heartbeat_check: no reply from osd.1 since 2013-02-17 06:04:30.768700 (cutoff 2013-02-17 06:04:38.070399)
> 2013-02-17 06:04:58.566489 7f109af37700 -1 osd.0 784 heartbeat_check: no reply from osd.1 since 2013-02-17 06:04:30.768700 (cutoff 2013-02-17 06:04:38.566489)
> 2013-02-17 06:04:59.566631 7f109af37700 -1 osd.0 784 heartbeat_check: no reply from osd.1 since 2013-02-17 06:04:30.768700 (cutoff 2013-02-17 06:04:39.566630)
> 2013-02-17 06:05:00.566728 7f109af37700 -1 osd.0 784 heartbeat_check: no reply from osd.1 since 2013-02-17 06:04:30.768700 (cutoff 2013-02-17 06:04:40.566728)
> 2013-02-17 06:05:01.566848 7f109af37700 -1 osd.0 784 heartbeat_check: no reply from osd.1 since 2013-02-17 06:04:30.768700 (cutoff 2013-02-17 06:04:41.566847)
> 2013-02-17 06:05:02.170643 7f108bf19700 -1 osd.0 784 heartbeat_check: no reply from osd.1 since 2013-02-17 06:04:30.768700 (cutoff 2013-02-17 06:04:42.170643)
> 2013-02-17 06:05:02.566961 7f109af37700 -1 osd.0 784 heartbeat_check: no reply from osd.1 since 2013-02-17 06:04:30.768700 (cutoff 2013-02-17 06:04:42.566960)
> 2013-02-17 06:05:04.880523 7f108a514700  0 -- 192.168.254.132:6802/18444 >> 192.168.254.133:6800/21178 pipe(0xac42a00 sd=31 :6802 s=2 pgs=19 cs=3 l=0).fault, initiating reconnect
> 2013-02-17 06:05:04.880977 7f108a615700  0 -- 192.168.254.132:6802/18444 >> 192.168.254.133:6800/21178 pipe(0xac42a00 sd=31 :6802 s=1 pgs=19 cs=4 l=0).fault
> 2013-02-17 06:18:52.354800 7f108cf1b700  0 log [INF] : 2.4e scrub ok
> 2013-02-17 06:22:12.410074 7f108cf1b700  0 log [INF] : 2.53 scrub ok
> 
> ----------------------------------------------------------------------
> ceph-osd.1.log
> ----------------------------------------------------------------------
> 2013-02-17 06:00:25.752991 7f5647f2c700  0 log [INF] : 2.a6 scrub ok
> 2013-02-17 06:01:59.282661 7f5647f2c700  0 log [INF] : 2.b0 scrub ok
> 2013-02-17 06:05:02.873412 7f5645525700  0 -- 192.168.254.133:6800/21178 >> 192.168.254.132:6802/18444 pipe(0x1e50c80 sd=38 :6800 s=2 pgs=1 cs=1 l=0).fault, initiating reconnect
> 2013-02-17 06:05:02.873463 7f5645323700  0 -- 192.168.254.133:6800/21178 >> 192.168.254.132:6802/18444 pipe(0x1e50c80 sd=38 :6800 s=1 pgs=1 cs=2 l=0).fault
> 2013-02-17 06:05:04.541062 7f5645525700  0 -- 192.168.254.133:6800/21178 >> 192.168.254.132:6802/18444 pipe(0x1e50c80 sd=31 :45391 s=2 pgs=2 cs=3 l=0).reader got old message 1 <= 2344847 0xa662c00 osd_map(785..786 src has 541..786) v3, discarding
> 2013-02-17 06:05:04.541113 7f5645525700  0 -- 192.168.254.133:6800/21178 >> 192.168.254.132:6802/18444 pipe(0x1e50c80 sd=31 :45391 s=2 pgs=2 cs=3 l=0).reader got old message 2 <= 2344847 0xa662c00 osd_map(785..786 src has 541..786) v3, discarding
> 2013-02-17 06:05:04.880116 7f564df38700  0 log [WRN] : map e786 wrongly marked me down
> 2013-02-17 06:19:13.397843 7f5647f2c700  0 log [INF] : 2.aa scrub ok
> 2013-02-17 06:21:05.506977 7f5647f2c700  0 log [INF] : 2.ba scrub ok
> 
> ----------------------------------------------------------------------
> ceph.log
> ----------------------------------------------------------------------
> 2013-02-17 06:04:45.031719 mon.0 10.200.63.130:6789/0 19956 : [INF] pgmap v2900128: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 407 GB data, 836 GB used, 2888 GB / 3724 GB avail
> 2013-02-17 06:04:47.732814 mon.0 10.200.63.130:6789/0 19957 : [INF] pgmap v2900129: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 407 GB data, 836 GB used, 2888 GB / 3724 GB avail
> 2013-02-17 06:04:50.046404 mon.0 10.200.63.130:6789/0 19958 : [INF] pgmap v2900130: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 407 GB data, 836 GB used, 2888 GB / 3724 GB avail
> 2013-02-17 06:04:52.579862 mon.0 10.200.63.130:6789/0 19959 : [DBG] osd.1 10.200.63.133:6801/21178 reported failed by osd.0 10.200.63.132:6801/18444
> 2013-02-17 06:04:52.812732 mon.0 10.200.63.130:6789/0 19960 : [INF] pgmap v2900131: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 407 GB data, 836 GB used, 2888 GB / 3724 GB avail
> 2013-02-17 06:04:55.026841 mon.0 10.200.63.130:6789/0 19961 : [INF] pgmap v2900132: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 407 GB data, 836 GB used, 2888 GB / 3724 GB avail
> 2013-02-17 06:04:57.567496 mon.0 10.200.63.130:6789/0 19962 : [DBG] osd.1 10.200.63.133:6801/21178 reported failed by osd.0 10.200.63.132:6801/18444
> 2013-02-17 06:04:57.773216 mon.0 10.200.63.130:6789/0 19963 : [INF] pgmap v2900133: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 407 GB data, 836 GB used, 2888 GB / 3724 GB avail
> 2013-02-17 06:05:00.043065 mon.0 10.200.63.130:6789/0 19964 : [INF] pgmap v2900134: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 407 GB data, 836 GB used, 2888 GB / 3724 GB avail
> 2013-02-17 06:05:02.567938 mon.0 10.200.63.130:6789/0 19965 : [DBG] osd.1 10.200.63.133:6801/21178 reported failed by osd.0 10.200.63.132:6801/18444
> 2013-02-17 06:05:02.567989 mon.0 10.200.63.130:6789/0 19966 : [INF] osd.1 10.200.63.133:6801/21178 failed (3 reports from 1 peers after 2013-02-17 06:05:23.567928 >= grace 20.000021)
> 2013-02-17 06:05:02.787622 mon.0 10.200.63.130:6789/0 19967 : [INF] osdmap e785: 2 osds: 1 up, 2 in
> 2013-02-17 06:05:02.891325 mon.0 10.200.63.130:6789/0 19968 : [INF] pgmap v2900135: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 407 GB data, 836 GB used, 2888 GB / 3724 GB avail
> 2013-02-17 06:05:03.355214 mon.0 10.200.63.130:6789/0 19969 : [INF] pgmap v2900136: 576 pgs: 271 active+clean, 304 stale+active+clean, 1 active+clean+scrubbing; 407 GB data, 836 GB used, 2888 GB / 3724 GB avail
> 2013-02-17 06:05:03.884400 mon.0 10.200.63.130:6789/0 19970 : [INF] osdmap e786: 2 osds: 1 up, 2 in
> 2013-02-17 06:05:04.057756 mon.0 10.200.63.130:6789/0 19971 : [INF] pgmap v2900137: 576 pgs: 271 active+clean, 304 stale+active+clean, 1 active+clean+scrubbing; 407 GB data, 836 GB used, 2888 GB / 3724 GB avail
> 2013-02-17 06:05:05.921247 mon.0 10.200.63.130:6789/0 19972 : [INF] osdmap e787: 2 osds: 2 up, 2 in
> 2013-02-17 06:05:05.921306 mon.0 10.200.63.130:6789/0 19973 : [INF] osd.1 10.200.63.133:6801/21178 boot
> 2013-02-17 06:05:06.022361 mon.0 10.200.63.130:6789/0 19974 : [INF] pgmap v2900138: 576 pgs: 271 active+clean, 304 stale+active+clean, 1 active+clean+scrubbing; 407 GB data, 836 GB used, 2888 GB / 3724 GB avail
> 2013-02-17 06:05:06.983262 mon.0 10.200.63.130:6789/0 19975 : [INF] osdmap e788: 2 osds: 2 up, 2 in
> 2013-02-17 06:05:07.103855 mon.0 10.200.63.130:6789/0 19976 : [INF] pgmap v2900139: 576 pgs: 271 active+clean, 304 stale+active+clean, 1 active+clean+scrubbing; 407 GB data, 836 GB used, 2888 GB / 3724 GB avail
> 2013-02-17 06:05:08.085143 mon.0 10.200.63.130:6789/0 19977 : [INF] osdmap e789: 2 osds: 2 up, 2 in
> 2013-02-17 06:05:08.201700 mon.0 10.200.63.130:6789/0 19978 : [INF] pgmap v2900140: 576 pgs: 271 active+clean, 304 stale+active+clean, 1 active+clean+scrubbing; 407 GB data, 836 GB used, 2888 GB / 3724 GB avail
> 2013-02-17 06:05:12.100060 mon.0 10.200.63.130:6789/0 19979 : [INF] pgmap v2900141: 576 pgs: 259 active, 271 active+clean, 45 peering, 1 active+clean+scrubbing; 407 GB data, 836 GB used, 2888 GB / 3724 GB avail
> 2013-02-17 06:05:13.196692 mon.0 10.200.63.130:6789/0 19980 : [INF] pgmap v2900142: 576 pgs: 467 active, 109 peering; 407 GB data, 835 GB used, 2888 GB / 3724 GB avail
> 2013-02-17 06:05:04.880125 osd.1 10.200.63.133:6801/21178 292 : [WRN] map e786 wrongly marked me down
> 2013-02-17 06:05:17.088685 mon.0 10.200.63.130:6789/0 19981 : [INF] pgmap v2900143: 576 pgs: 479 active, 32 active+clean, 65 peering; 407 GB data, 835 GB used, 2888 GB / 3724 GB avail
> 2013-02-17 06:05:18.229214 mon.0 10.200.63.130:6789/0 19982 : [INF] pgmap v2900144: 576 pgs: 469 active, 105 active+clean, 2 peering; 407 GB data, 835 GB used, 2888 GB / 3724 GB avail
> 2013-02-17 06:05:22.702406 mon.0 10.200.63.130:6789/0 19983 : [INF] pgmap v2900145: 576 pgs: 198 active, 376 active+clean, 1 peering, 1 active+recovering; 407 GB data, 835 GB used, 2888 GB / 3724 GB avail
> 2013-02-17 06:05:23.795151 mon.0 10.200.63.130:6789/0 19984 : [INF] pgmap v2900146: 576 pgs: 574 active+clean, 1 active+recovery_wait, 1 active+recovering; 407 GB data, 835 GB used, 2889 GB / 3724 GB avail; 1/211684 degraded (0.000%)
> 2013-02-17 06:05:27.689766 mon.0 10.200.63.130:6789/0 19985 : [INF] pgmap v2900147: 576 pgs: 575 active+clean, 1 active+recovery_wait; 407 GB data, 835 GB used, 2889 GB / 3724 GB avail; 1/211684 degraded (0.000%)
> 2013-02-17 06:05:28.798006 mon.0 10.200.63.130:6789/0 19986 : [INF] pgmap v2900148: 576 pgs: 576 active+clean; 407 GB data, 835 GB used, 2889 GB / 3724 GB avail
> 2013-02-17 06:05:32.688719 mon.0 10.200.63.130:6789/0 19987 : [INF] pgmap v2900149: 576 pgs: 576 active+clean; 407 GB data, 835 GB used, 2889 GB / 3724 GB avail
> 2013-02-17 06:05:33.764091 mon.0 10.200.63.130:6789/0 19988 : [INF] pgmap v2900150: 576 pgs: 576 active+clean; 407 GB data, 835 GB used, 2889 GB / 3724 GB avail
> 
> ----------------------------------------------------------------------
> ceph-mon.b2.log
> ----------------------------------------------------------------------
> 2013-02-17 06:04:40.032792 7fb315ca2700  0 log [INF] : pgmap v2900126: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 407 GB data, 836 GB used, 2888 GB / 3724 GB avail
> 2013-02-17 06:04:42.733647 7fb315ca2700  0 log [INF] : pgmap v2900127: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 407 GB data, 836 GB used, 2888 GB / 3724 GB avail
> 2013-02-17 06:04:45.031710 7fb315ca2700  0 log [INF] : pgmap v2900128: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 407 GB data, 836 GB used, 2888 GB / 3724 GB avail
> 2013-02-17 06:04:47.732805 7fb315ca2700  0 log [INF] : pgmap v2900129: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 407 GB data, 836 GB used, 2888 GB / 3724 GB avail
> 2013-02-17 06:04:50.046400 7fb315ca2700  0 log [INF] : pgmap v2900130: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 407 GB data, 836 GB used, 2888 GB / 3724 GB avail
> 2013-02-17 06:04:52.577640 7fb315ca2700  1 mon.b2@0(leader).osd e784 prepare_failure osd.1 10.200.63.133:6801/21178 from osd.0 10.200.63.132:6801/18444 is reporting failure:1
> 2013-02-17 06:04:52.579842 7fb315ca2700  0 log [DBG] : osd.1 10.200.63.133:6801/21178 reported failed by osd.0 10.200.63.132:6801/18444
> 2013-02-17 06:04:52.812722 7fb315ca2700  0 log [INF] : pgmap v2900131: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 407 GB data, 836 GB used, 2888 GB / 3724 GB avail
> 2013-02-17 06:04:55.026832 7fb315ca2700  0 log [INF] : pgmap v2900132: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 407 GB data, 836 GB used, 2888 GB / 3724 GB avail
> 2013-02-17 06:04:57.567460 7fb315ca2700  1 mon.b2@0(leader).osd e784 prepare_failure osd.1 10.200.63.133:6801/21178 from osd.0 10.200.63.132:6801/18444 is reporting failure:1
> 2013-02-17 06:04:57.567493 7fb315ca2700  0 log [DBG] : osd.1 10.200.63.133:6801/21178 reported failed by osd.0 10.200.63.132:6801/18444
> 2013-02-17 06:04:57.773210 7fb315ca2700  0 log [INF] : pgmap v2900133: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 407 GB data, 836 GB used, 2888 GB / 3724 GB avail
> 2013-02-17 06:05:00.043056 7fb315ca2700  0 log [INF] : pgmap v2900134: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 407 GB data, 836 GB used, 2888 GB / 3724 GB avail
> 2013-02-17 06:05:02.567921 7fb315ca2700  1 mon.b2@0(leader).osd e784 prepare_failure osd.1 10.200.63.133:6801/21178 from osd.0 10.200.63.132:6801/18444 is reporting failure:1
> 2013-02-17 06:05:02.567937 7fb315ca2700  0 log [DBG] : osd.1 10.200.63.133:6801/21178 reported failed by osd.0 10.200.63.132:6801/18444
> 2013-02-17 06:05:02.567974 7fb315ca2700  1 mon.b2@0(leader).osd e784  we have enough reports/reporters to mark osd.1 down
> 2013-02-17 06:05:02.567987 7fb315ca2700  0 log [INF] : osd.1 10.200.63.133:6801/21178 failed (3 reports from 1 peers after 2013-02-17 06:05:23.567928 >= grace 20.000021)
> 2013-02-17 06:05:02.772787 7fb315ca2700  1 mon.b2@0(leader).osd e785 e785: 2 osds: 1 up, 2 in
> 2013-02-17 06:05:02.787619 7fb315ca2700  0 log [INF] : osdmap e785: 2 osds: 1 up, 2 in
> 2013-02-17 06:05:02.891321 7fb315ca2700  0 log [INF] : pgmap v2900135: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 407 GB data, 836 GB used, 2888 GB / 3724 GB avail
> 2013-02-17 06:05:03.355205 7fb315ca2700  0 log [INF] : pgmap v2900136: 576 pgs: 271 active+clean, 304 stale+active+clean, 1 active+clean+scrubbing; 407 GB data, 836 GB used, 2888 GB / 3724 GB avail
> 2013-02-17 06:05:03.847394 7fb315ca2700  1 mon.b2@0(leader).osd e786 e786: 2 osds: 1 up, 2 in
> 2013-02-17 06:05:03.884395 7fb315ca2700  0 log [INF] : osdmap e786: 2 osds: 1 up, 2 in
> 2013-02-17 06:05:04.057750 7fb315ca2700  0 log [INF] : pgmap v2900137: 576 pgs: 271 active+clean, 304 stale+active+clean, 1 active+clean+scrubbing; 407 GB data, 836 GB used, 2888 GB / 3724 GB avail
> 2013-02-17 06:05:04.869787 7fb315ca2700  1 mon.b2@0(leader).pg v2900137  ignoring stats from non-active osd.
> 2013-02-17 06:05:05.884371 7fb315ca2700  1 mon.b2@0(leader).osd e787 e787: 2 osds: 2 up, 2 in
> 2013-02-17 06:05:05.921244 7fb315ca2700  0 log [INF] : osdmap e787: 2 osds: 2 up, 2 in
> 2013-02-17 06:05:05.921303 7fb315ca2700  0 log [INF] : osd.1 10.200.63.133:6801/21178 boot
> 2013-02-17 06:05:06.022350 7fb315ca2700  0 log [INF] : pgmap v2900138: 576 pgs: 271 active+clean, 304 stale+active+clean, 1 active+clean+scrubbing; 407 GB data, 836 GB used, 2888 GB / 3724 GB avail
> 2013-02-17 06:05:06.946150 7fb315ca2700  1 mon.b2@0(leader).osd e788 e788: 2 osds: 2 up, 2 in
> 2013-02-17 06:05:06.983256 7fb315ca2700  0 log [INF] : osdmap e788: 2 osds: 2 up, 2 in
> 2013-02-17 06:05:07.103846 7fb315ca2700  0 log [INF] : pgmap v2900139: 576 pgs: 271 active+clean, 304 stale+active+clean, 1 active+clean+scrubbing; 407 GB data, 836 GB used, 2888 GB / 3724 GB avail
> 2013-02-17 06:05:08.048069 7fb315ca2700  1 mon.b2@0(leader).osd e789 e789: 2 osds: 2 up, 2 in
> 2013-02-17 06:05:08.085140 7fb315ca2700  0 log [INF] : osdmap e789: 2 osds: 2 up, 2 in
> 2013-02-17 06:05:08.201692 7fb315ca2700  0 log [INF] : pgmap v2900140: 576 pgs: 271 active+clean, 304 stale+active+clean, 1 active+clean+scrubbing; 407 GB data, 836 GB used, 2888 GB / 3724 GB avail
> 2013-02-17 06:05:12.100055 7fb315ca2700  0 log [INF] : pgmap v2900141: 576 pgs: 259 active, 271 active+clean, 45 peering, 1 active+clean+scrubbing; 407 GB data, 836 GB used, 2888 GB / 3724 GB avail
> 2013-02-17 06:05:13.196685 7fb315ca2700  0 log [INF] : pgmap v2900142: 576 pgs: 467 active, 109 peering; 407 GB data, 835 GB used, 2888 GB / 3724 GB avail
> 2013-02-17 06:05:17.088677 7fb315ca2700  0 log [INF] : pgmap v2900143: 576 pgs: 479 active, 32 active+clean, 65 peering; 407 GB data, 835 GB used, 2888 GB / 3724 GB avail
> 2013-02-17 06:05:18.229204 7fb315ca2700  0 log [INF] : pgmap v2900144: 576 pgs: 469 active, 105 active+clean, 2 peering; 407 GB data, 835 GB used, 2888 GB / 3724 GB avail
> 2013-02-17 06:05:22.702400 7fb315ca2700  0 log [INF] : pgmap v2900145: 576 pgs: 198 active, 376 active+clean, 1 peering, 1 active+recovering; 407 GB data, 835 GB used, 2888 GB / 3724 GB avail
> 2013-02-17 06:05:23.795142 7fb315ca2700  0 log [INF] : pgmap v2900146: 576 pgs: 574 active+clean, 1 active+recovery_wait, 1 active+recovering; 407 GB data, 835 GB used, 2889 GB / 3724 GB avail; 1/211684 degraded (0.000%)
> 2013-02-17 06:05:27.689761 7fb315ca2700  0 log [INF] : pgmap v2900147: 576 pgs: 575 active+clean, 1 active+recovery_wait; 407 GB data, 835 GB used, 2889 GB / 3724 GB avail; 1/211684 degraded (0.000%)
> 2013-02-17 06:05:28.797998 7fb315ca2700  0 log [INF] : pgmap v2900148: 576 pgs: 576 active+clean; 407 GB data, 835 GB used, 2889 GB / 3724 GB avail
> 2013-02-17 06:05:32.688713 7fb315ca2700  0 log [INF] : pgmap v2900149: 576 pgs: 576 active+clean; 407 GB data, 835 GB used, 2889 GB / 3724 GB avail
> 2013-02-17 06:05:33.764083 7fb315ca2700  0 log [INF] : pgmap v2900150: 576 pgs: 576 active+clean; 407 GB data, 835 GB used, 2889 GB / 3724 GB avail
> 
> ----------------------------------------------------------------------
> ceph-mon.b4.log
> ----------------------------------------------------------------------
> 2013-02-17 06:05:01.197587 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197440 ==== paxos(auth lease lc 3461 fc 3441 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (3474466732 0 0) 0x68998c0 con 0x2d189a0
> 2013-02-17 06:05:01.197626 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(auth lease_ack lc 3461 fc 3441 pn 0 opn 0 gv {}) v2 -- ?+0 0x1c636000
> 2013-02-17 06:05:01.560527 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== client.9965 10.200.63.132:0/2024602 550 ==== mon_subscribe({monmap=10+,osdmap=785}) v2 ==== 42+0+0 (601251667 0 0) 0x4fe9dc0 con 0x2ed8580
> 2013-02-17 06:05:01.560568 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.9965 10.200.63.132:0/2024602 -- mon_subscribe_ack(300s) v1 -- ?+0 0xccd71e0
> 2013-02-17 06:05:02.130295 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197441 ==== paxos(osdmap lease lc 784 fc 541 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (2329396769 0 0) 0x1c636000 con 0x2d189a0
> 2013-02-17 06:05:02.130339 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(osdmap lease_ack lc 784 fc 541 pn 0 opn 0 gv {}) v2 -- ?+0 0x68998c0
> 2013-02-17 06:05:02.130384 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197442 ==== paxos(mdsmap lease lc 1 fc 1 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (3718949626 0 0) 0x6899600 con 0x2d189a0
> 2013-02-17 06:05:02.130406 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(mdsmap lease_ack lc 1 fc 1 pn 0 opn 0 gv {}) v2 -- ?+0 0x1c636000
> 2013-02-17 06:05:02.130475 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197443 ==== paxos(monmap lease lc 9 fc 1 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (3086430848 0 0) 0x69698c0 con 0x2d189a0
> 2013-02-17 06:05:02.130488 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(monmap lease_ack lc 9 fc 1 pn 0 opn 0 gv {}) v2 -- ?+0 0x6899600
> 2013-02-17 06:05:02.684455 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197444 ==== paxos(osdmap begin lc 784 fc 0 pn 40700 opn 0 gv {785=5833342}) v2 ==== 245+0+0 (1216715648 0 0) 0x6899600 con 0x2d189a0
> 2013-02-17 06:05:02.750106 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(osdmap accept lc 784 fc 0 pn 40700 opn 0 gv {}) v2 -- ?+0 0x69698c0
> 2013-02-17 06:05:02.750141 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197445 ==== paxos(logm begin lc 2928952 fc 0 pn 40700 opn 0 gv {2928953=5833343}) v2 ==== 370+0+0 (2110251190 0 0) 0x1c636000 con 0x2d189a0
> 2013-02-17 06:05:02.778810 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(logm accept lc 2928952 fc 0 pn 40700 opn 0 gv {}) v2 -- ?+0 0x6899600
> 2013-02-17 06:05:02.778848 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197446 ==== paxos(pgmap begin lc 2900134 fc 0 pn 40700 opn 0 gv {2900135=5833344}) v2 ==== 5451+0+0 (3356992033 0 0) 0x68998c0 con 0x2d189a0
> 2013-02-17 06:05:02.804185 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(pgmap accept lc 2900134 fc 0 pn 40700 opn 0 gv {}) v2 -- ?+0 0x1c636000
> 2013-02-17 06:05:02.804242 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197447 ==== paxos(osdmap commit lc 785 fc 0 pn 40700 opn 0 gv {785=5833342}) v2 ==== 245+0+0 (2445046887 0 0) 0x699edc0 con 0x2d189a0
> 2013-02-17 06:05:02.853386 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197448 ==== paxos(osdmap lease lc 785 fc 541 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (3283911110 0 0) 0x69698c0 con 0x2d189a0
> 2013-02-17 06:05:02.853414 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(osdmap lease_ack lc 785 fc 541 pn 0 opn 0 gv {}) v2 -- ?+0 0x699edc0
> 2013-02-17 06:05:02.869698 7f2879ad7700  1 mon.b4@1(peon).osd e785 e785: 2 osds: 1 up, 2 in
> 2013-02-17 06:05:02.889915 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.9962 10.200.63.132:0/1024602 -- osd_map(785..785 src has 541..785) v3 -- ?+0 0x1c671e00
> 2013-02-17 06:05:02.889945 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.9793 10.200.63.133:0/1028590 -- osd_map(785..785 src has 541..785) v3 -- ?+0 0x6895200
> 2013-02-17 06:05:02.889966 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.9998 10.200.63.132:0/1026778 -- osd_map(785..785 src has 541..785) v3 -- ?+0 0x6895400
> 2013-02-17 06:05:02.889985 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.10007 10.200.63.132:0/1027293 -- osd_map(785..785 src has 541..785) v3 -- ?+0 0x6895600
> 2013-02-17 06:05:02.890011 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.10202 10.200.63.132:0/1011962 -- osd_map(785..785 src has 541..785) v3 -- ?+0 0x3017000
> 2013-02-17 06:05:02.890031 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.9980 10.200.63.132:0/1025294 -- osd_map(785..785 src has 541..785) v3 -- ?+0 0x6895a00
> 2013-02-17 06:05:02.890063 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.9983 10.200.63.132:0/1025964 -- osd_map(785..785 src has 541..785) v3 -- ?+0 0x6895c00
> 2013-02-17 06:05:02.890104 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.10001 10.200.63.132:0/2026778 -- osd_map(785..785 src has 541..785) v3 -- ?+0 0x2f00400
> 2013-02-17 06:05:02.890142 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.9814 10.200.63.132:0/2029392 -- osd_map(785..785 src has 541..785) v3 -- ?+0 0x2f0ca00
> 2013-02-17 06:05:02.890161 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.9811 10.200.63.132:0/1029392 -- osd_map(785..785 src has 541..785) v3 -- ?+0 0x2f02000
> 2013-02-17 06:05:02.890198 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.8936 10.200.63.132:0/1024890 -- osd_map(785..785 src has 541..785) v3 -- ?+0 0x3f9ea00
> 2013-02-17 06:05:02.890235 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.9965 10.200.63.132:0/2024602 -- osd_map(785..785 src has 541..785) v3 -- ?+0 0x6892400
> 2013-02-17 06:05:02.890262 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197449 ==== paxos(logm commit lc 2928953 fc 0 pn 40700 opn 0 gv {2928953=5833343}) v2 ==== 370+0+0 (96512701 0 0) 0x6899600 con 0x2d189a0
> 2013-02-17 06:05:03.057217 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197450 ==== paxos(logm lease lc 2928953 fc 2928452 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (303443369 0 0) 0x1c591080 con 0x2d189a0
> 2013-02-17 06:05:03.057244 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(logm lease_ack lc 2928953 fc 2928451 pn 0 opn 0 gv {}) v2 -- ?+0 0x6899600
> 2013-02-17 06:05:03.109714 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197451 ==== paxos(pgmap commit lc 2900135 fc 0 pn 40700 opn 0 gv {2900135=5833344}) v2 ==== 5451+0+0 (1375112371 0 0) 0x1c636000 con 0x2d189a0
> 2013-02-17 06:05:03.211911 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197452 ==== paxos(pgmap lease lc 2900135 fc 2899634 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (2609033812 0 0) 0x1c5bc000 con 0x2d189a0
> 2013-02-17 06:05:03.211933 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(pgmap lease_ack lc 2900135 fc 2899633 pn 0 opn 0 gv {}) v2 -- ?+0 0x1c636000
> 2013-02-17 06:05:03.267399 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197453 ==== paxos(pgmap begin lc 2900135 fc 0 pn 40700 opn 0 gv {2900136=5833345}) v2 ==== 106258+0+0 (1573278324 0 0) 0x699edc0 con 0x2d189a0
> 2013-02-17 06:05:03.316969 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(pgmap accept lc 2900135 fc 0 pn 40700 opn 0 gv {}) v2 -- ?+0 0x1c5bc000
> 2013-02-17 06:05:03.317018 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197454 ==== paxos(pgmap commit lc 2900136 fc 0 pn 40700 opn 0 gv {2900136=5833345}) v2 ==== 106258+0+0 (2619992220 0 0) 0x6899600 con 0x2d189a0
> 2013-02-17 06:05:03.421783 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197455 ==== paxos(pgmap lease lc 2900136 fc 2899635 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (556628681 0 0) 0x1c5bc000 con 0x2d189a0
> 2013-02-17 06:05:03.421821 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(pgmap lease_ack lc 2900136 fc 2899634 pn 0 opn 0 gv {}) v2 -- ?+0 0x6899600
> 2013-02-17 06:05:03.468894 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== client.9983 10.200.63.132:0/1025964 550 ==== mon_subscribe({monmap=10+,osdmap=786}) v2 ==== 42+0+0 (981029754 0 0) 0xdbb2700 con 0x2d19b80
> 2013-02-17 06:05:03.468930 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.9983 10.200.63.132:0/1025964 -- mon_subscribe_ack(300s) v1 -- ?+0 0xccd7380
> 2013-02-17 06:05:03.468945 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== client.9793 10.200.63.133:0/1028590 548 ==== mon_subscribe({monmap=10+,osdmap=786}) v2 ==== 42+0+0 (981029754 0 0) 0xdbb2540 con 0x2ed8420
> 2013-02-17 06:05:03.468956 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.9793 10.200.63.133:0/1028590 -- mon_subscribe_ack(300s) v1 -- ?+0 0x686c340
> 2013-02-17 06:05:03.715599 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== client.9998 10.200.63.132:0/1026778 549 ==== mon_subscribe({monmap=10+,osdmap=786}) v2 ==== 42+0+0 (981029754 0 0) 0x2fe7c00 con 0x2ed82c0
> 2013-02-17 06:05:03.715637 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.9998 10.200.63.132:0/1026778 -- mon_subscribe_ack(300s) v1 -- ?+0 0x686c4e0
> 2013-02-17 06:05:03.771214 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197456 ==== paxos(osdmap begin lc 785 fc 0 pn 40700 opn 0 gv {786=5833346}) v2 ==== 248+0+0 (2752781820 0 0) 0x6899600 con 0x2d189a0
> 2013-02-17 06:05:03.833918 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(osdmap accept lc 785 fc 0 pn 40700 opn 0 gv {}) v2 -- ?+0 0x1c5bc000
> 2013-02-17 06:05:03.833969 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197457 ==== paxos(logm begin lc 2928953 fc 0 pn 40700 opn 0 gv {2928954=5833347}) v2 ==== 899+0+0 (2467051332 0 0) 0x1c636000 con 0x2d189a0
> 2013-02-17 06:05:03.866255 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(logm accept lc 2928953 fc 0 pn 40700 opn 0 gv {}) v2 -- ?+0 0x6899600
> 2013-02-17 06:05:03.866293 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197458 ==== paxos(osdmap commit lc 786 fc 0 pn 40700 opn 0 gv {786=5833346}) v2 ==== 248+0+0 (3711416935 0 0) 0x1c590000 con 0x2d189a0
> 2013-02-17 06:05:03.931450 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197459 ==== paxos(osdmap lease lc 786 fc 541 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (1686591545 0 0) 0x1c5bc000 con 0x2d189a0
> 2013-02-17 06:05:03.931493 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(osdmap lease_ack lc 786 fc 541 pn 0 opn 0 gv {}) v2 -- ?+0 0x1c590000
> 2013-02-17 06:05:03.959675 7f2879ad7700  1 mon.b4@1(peon).osd e786 e786: 2 osds: 1 up, 2 in
> 2013-02-17 06:05:03.990545 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.9983 10.200.63.132:0/1025964 -- osd_map(786..786 src has 541..786) v3 -- ?+0 0x6892600
> 2013-02-17 06:05:03.990586 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.9793 10.200.63.133:0/1028590 -- osd_map(786..786 src has 541..786) v3 -- ?+0 0x6892800
> 2013-02-17 06:05:03.990633 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.9998 10.200.63.132:0/1026778 -- osd_map(786..786 src has 541..786) v3 -- ?+0 0x6892a00
> 2013-02-17 06:05:03.990670 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197460 ==== paxos(pgmap begin lc 2900136 fc 0 pn 40700 opn 0 gv {2900137=5833348}) v2 ==== 162+0+0 (2049167727 0 0) 0x6899600 con 0x2d189a0
> 2013-02-17 06:05:04.032168 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(pgmap accept lc 2900136 fc 0 pn 40700 opn 0 gv {}) v2 -- ?+0 0x1c5bc000
> 2013-02-17 06:05:04.032203 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197461 ==== paxos(logm commit lc 2928954 fc 0 pn 40700 opn 0 gv {2928954=5833347}) v2 ==== 899+0+0 (4069256124 0 0) 0x1c590840 con 0x2d189a0
> 2013-02-17 06:05:04.088965 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197462 ==== paxos(logm lease lc 2928954 fc 2928453 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (2473265918 0 0) 0x1c590b00 con 0x2d189a0
> 2013-02-17 06:05:04.088996 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(logm lease_ack lc 2928954 fc 2928452 pn 0 opn 0 gv {}) v2 -- ?+0 0x1c590840
> 2013-02-17 06:05:04.129567 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197463 ==== paxos(pgmap commit lc 2900137 fc 0 pn 40700 opn 0 gv {2900137=5833348}) v2 ==== 162+0+0 (2510848205 0 0) 0x1c590000 con 0x2d189a0
> 2013-02-17 06:05:04.210301 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197464 ==== paxos(pgmap lease lc 2900137 fc 2899636 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (3384776277 0 0) 0x1c5bc000 con 0x2d189a0
> 2013-02-17 06:05:04.210368 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(pgmap lease_ack lc 2900137 fc 2899635 pn 0 opn 0 gv {}) v2 -- ?+0 0x1c590000
> 2013-02-17 06:05:04.239384 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197465 ==== paxos(auth lease lc 3461 fc 3441 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (3037158960 0 0) 0x1c590840 con 0x2d189a0
> 2013-02-17 06:05:04.239405 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(auth lease_ack lc 3461 fc 3441 pn 0 opn 0 gv {}) v2 -- ?+0 0x1c5bc000
> 2013-02-17 06:05:04.239426 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== client.10007 10.200.63.132:0/1027293 549 ==== mon_subscribe({monmap=10+,osdmap=786}) v2 ==== 42+0+0 (981029754 0 0) 0x30256c0 con 0x2d198c0
> 2013-02-17 06:05:04.239472 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.10007 10.200.63.132:0/1027293 -- osd_map(786..786 src has 541..786) v3 -- ?+0 0x6892c00
> 2013-02-17 06:05:04.239484 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.10007 10.200.63.132:0/1027293 -- mon_subscribe_ack(300s) v1 -- ?+0 0x686c680
> 2013-02-17 06:05:04.239494 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== client.10202 10.200.63.132:0/1011962 549 ==== mon_subscribe({monmap=10+,osdmap=786}) v2 ==== 42+0+0 (981029754 0 0) 0x1b93afc0 con 0x2d19080
> 2013-02-17 06:05:04.239535 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.10202 10.200.63.132:0/1011962 -- osd_map(786..786 src has 541..786) v3 -- ?+0 0x3283000
> 2013-02-17 06:05:04.239546 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.10202 10.200.63.132:0/1011962 -- mon_subscribe_ack(300s) v1 -- ?+0 0x686c820
> 2013-02-17 06:05:04.239554 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== client.10001 10.200.63.132:0/2026778 551 ==== mon_subscribe({monmap=10+,osdmap=786}) v2 ==== 42+0+0 (981029754 0 0) 0x19a4c380 con 0x2d19a20
> 2013-02-17 06:05:04.239574 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.10001 10.200.63.132:0/2026778 -- osd_map(786..786 src has 541..786) v3 -- ?+0 0x6893000
> 2013-02-17 06:05:04.239584 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.10001 10.200.63.132:0/2026778 -- mon_subscribe_ack(300s) v1 -- ?+0 0x686c9c0
> 2013-02-17 06:05:04.840455 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== client.9814 10.200.63.132:0/2029392 550 ==== mon_subscribe({monmap=10+,osdmap=786}) v2 ==== 42+0+0 (981029754 0 0) 0xdbb08c0 con 0x2d18dc0
> 2013-02-17 06:05:04.840544 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.9814 10.200.63.132:0/2029392 -- osd_map(786..786 src has 541..786) v3 -- ?+0 0x6893200
> 2013-02-17 06:05:04.840564 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.9814 10.200.63.132:0/2029392 -- mon_subscribe_ack(300s) v1 -- ?+0 0x686cb60
> 2013-02-17 06:05:04.882186 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== client.9811 10.200.63.132:0/1029392 550 ==== mon_subscribe({monmap=10+,osdmap=786}) v2 ==== 42+0+0 (981029754 0 0) 0xc19e8c0 con 0x2d18f20
> 2013-02-17 06:05:04.882245 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.9811 10.200.63.132:0/1029392 -- osd_map(786..786 src has 541..786) v3 -- ?+0 0x6893400
> 2013-02-17 06:05:04.882265 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.9811 10.200.63.132:0/1029392 -- mon_subscribe_ack(300s) v1 -- ?+0 0x686cd00
> 2013-02-17 06:05:04.920980 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197466 ==== paxos(logm begin lc 2928954 fc 0 pn 40700 opn 0 gv {2928955=5833349}) v2 ==== 636+0+0 (3516914051 0 0) 0x1c5bc000 con 0x2d189a0
> 2013-02-17 06:05:04.986213 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(logm accept lc 2928954 fc 0 pn 40700 opn 0 gv {}) v2 -- ?+0 0x1c590840
> 2013-02-17 06:05:04.986265 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197467 ==== paxos(logm commit lc 2928955 fc 0 pn 40700 opn 0 gv {2928955=5833349}) v2 ==== 636+0+0 (2318971175 0 0) 0x1c590000 con 0x2d189a0
> 2013-02-17 06:05:05.059002 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197468 ==== paxos(logm lease lc 2928955 fc 2928454 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (90852951 0 0) 0x1c590840 con 0x2d189a0
> 2013-02-17 06:05:05.059028 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(logm lease_ack lc 2928955 fc 2928453 pn 0 opn 0 gv {}) v2 -- ?+0 0x1c590000
> 2013-02-17 06:05:05.130136 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197469 ==== paxos(mdsmap lease lc 1 fc 1 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (439731423 0 0) 0x1c590000 con 0x2d189a0
> 2013-02-17 06:05:05.130163 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(mdsmap lease_ack lc 1 fc 1 pn 0 opn 0 gv {}) v2 -- ?+0 0x1c590840
> 2013-02-17 06:05:05.130218 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197470 ==== paxos(monmap lease lc 9 fc 1 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (286498847 0 0) 0x1c591b80 con 0x2d189a0
> 2013-02-17 06:05:05.130234 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(monmap lease_ack lc 9 fc 1 pn 0 opn 0 gv {}) v2 -- ?+0 0x1c590000
> 2013-02-17 06:05:05.159964 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== client.8936 10.200.63.132:0/1024890 550 ==== mon_subscribe({monmap=10+}) v2 ==== 23+0+0 (897212988 0 0) 0x19cb9a40 con 0x2d18b00
> 2013-02-17 06:05:05.159994 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.8936 10.200.63.132:0/1024890 -- mon_subscribe_ack(300s) v1 -- ?+0 0x686cea0
> 2013-02-17 06:05:05.301727 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== client.8936 10.200.63.132:0/1024890 551 ==== mon_subscribe({monmap=10+,osdmap=787}) v2 ==== 42+0+0 (3460793650 0 0) 0x30d1340 con 0x2d18b00
> 2013-02-17 06:05:05.301785 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.8936 10.200.63.132:0/1024890 -- mon_subscribe_ack(300s) v1 -- ?+0 0x686d040
> 2013-02-17 06:05:05.534281 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== client.9965 10.200.63.132:0/2024602 551 ==== mon_subscribe({monmap=10+,osdmap=786}) v2 ==== 42+0+0 (981029754 0 0) 0x72d0e00 con 0x2ed8580
> 2013-02-17 06:05:05.534371 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.9965 10.200.63.132:0/2024602 -- osd_map(786..786 src has 541..786) v3 -- ?+0 0x6893600
> 2013-02-17 06:05:05.534392 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> client.9965 10.200.63.132:0/2024602 -- mon_subscribe_ack(300s) v1 -- ?+0 0x686d1e0
> 2013-02-17 06:05:05.812437 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197471 ==== paxos(osdmap begin lc 786 fc 0 pn 40700 opn 0 gv {787=5833350}) v2 ==== 698+0+0 (1439899322 0 0) 0x1c590840 con 0x2d189a0
> 2013-02-17 06:05:05.870299 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(osdmap accept lc 786 fc 0 pn 40700 opn 0 gv {}) v2 -- ?+0 0x1c591b80
> 2013-02-17 06:05:05.870363 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197472 ==== paxos(osdmap commit lc 787 fc 0 pn 40700 opn 0 gv {787=5833350}) v2 ==== 698+0+0 (1125289174 0 0) 0x1c5a98c0 con 0x2d189a0
> 2013-02-17 06:05:05.939275 7f2879ad7700  1 -- 10.200.63.132:6789/0 <== mon.0 10.200.63.130:6789/0 197473 ==== paxos(osdmap lease lc 787 fc 541 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (885348033 0 0) 0x1c591b80 con 0x2d189a0
> 2013-02-17 06:05:05.939307 7f2879ad7700  1 -- 10.200.63.132:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(osdmap lease_ack lc 787 fc 541 pn 0 opn 0 gv {}) v2 -- ?+0 0x1c5a98c0
> 2013-02-17 06:05:05.951734 7f2879ad7700  1 mon.b4@1(peon).osd e787 e787: 2 osds: 2 up, 2 in
> 
> ----------------------------------------------------------------------
> ceph-mon.b5.log
> ----------------------------------------------------------------------
> 2013-02-17 06:05:00.003187 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217824 ==== paxos(pgmap commit lc 2900134 fc 0 pn 40700 opn 0 gv {2900134=5833340}) v2 ==== 4055+0+0 (2464340035 0 0) 0x27a70b00 con 0x268a9a0
> 2013-02-17 06:05:00.051665 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217825 ==== paxos(pgmap lease lc 2900134 fc 2899633 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (2739579357 0 0) 0x3950580 con 0x268a9a0
> 2013-02-17 06:05:00.051693 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(pgmap lease_ack lc 2900134 fc 2899632 pn 0 opn 0 gv {}) v2 -- ?+0 0x27a70b00
> 2013-02-17 06:05:00.088234 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217826 ==== route(pg_stats_ack(11 pgs tid 15412) v1 tid 31190) v2 ==== 555+0+0 (4193262135 0 0) 0x27a62480 con 0x268a9a0
> 2013-02-17 06:05:00.088261 7faaec18d700  1 -- 10.200.63.133:6789/0 --> osd.1 10.200.63.133:6801/21178 -- pg_stats_ack(11 pgs tid 15412) v1 -- ?+0 0x27bf4540
> 2013-02-17 06:05:00.124146 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217827 ==== paxos(logm begin lc 2928951 fc 0 pn 40700 opn 0 gv {2928952=5833341}) v2 ==== 408+0+0 (2121299380 0 0) 0x27a70b00 con 0x268a9a0
> 2013-02-17 06:05:00.152577 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(logm accept lc 2928951 fc 0 pn 40700 opn 0 gv {}) v2 -- ?+0 0x3950580
> 2013-02-17 06:05:00.164381 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217828 ==== paxos(logm commit lc 2928952 fc 0 pn 40700 opn 0 gv {2928952=5833341}) v2 ==== 408+0+0 (2048871512 0 0) 0x3950580 con 0x268a9a0
> 2013-02-17 06:05:00.221408 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217829 ==== paxos(logm lease lc 2928952 fc 2928451 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (3800722919 0 0) 0x39b22c0 con 0x268a9a0
> 2013-02-17 06:05:00.221436 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(logm lease_ack lc 2928952 fc 2928450 pn 0 opn 0 gv {}) v2 -- ?+0 0x3950580
> 2013-02-17 06:05:00.465887 7faaec18d700  1 -- 10.200.63.133:6789/0 <== client.9971 10.200.63.132:0/1024854 548 ==== mon_subscribe({monmap=10+,osdmap=785}) v2 ==== 42+0+0 (601251667 0 0) 0x27a556c0 con 0x268b1e0
> 2013-02-17 06:05:00.465937 7faaec18d700  1 -- 10.200.63.133:6789/0 --> client.9971 10.200.63.132:0/1024854 -- mon_subscribe_ack(300s) v1 -- ?+0 0x3ed91e0
> 2013-02-17 06:05:00.543992 7faaec18d700  1 -- 10.200.63.133:6789/0 <== client.9995 10.200.63.132:0/2026519 548 ==== mon_subscribe({monmap=10+,osdmap=785}) v2 ==== 42+0+0 (601251667 0 0) 0x99361c0 con 0x268b760
> 2013-02-17 06:05:00.544025 7faaec18d700  1 -- 10.200.63.133:6789/0 --> client.9995 10.200.63.132:0/2026519 -- mon_subscribe_ack(300s) v1 -- ?+0 0x3ed9380
> 2013-02-17 06:05:00.546004 7faaec18d700  1 -- 10.200.63.133:6789/0 <== client.9992 10.200.63.132:0/1026519 548 ==== mon_subscribe({monmap=10+,osdmap=785}) v2 ==== 42+0+0 (601251667 0 0) 0x2745f340 con 0x268ba20
> 2013-02-17 06:05:00.546038 7faaec18d700  1 -- 10.200.63.133:6789/0 --> client.9992 10.200.63.132:0/1026519 -- mon_subscribe_ack(300s) v1 -- ?+0 0x3ed9520
> 2013-02-17 06:05:01.197554 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217830 ==== paxos(auth lease lc 3461 fc 3441 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (364209048 0 0) 0x3950580 con 0x268a9a0
> 2013-02-17 06:05:01.197590 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(auth lease_ack lc 3461 fc 3441 pn 0 opn 0 gv {}) v2 -- ?+0 0x39b22c0
> 2013-02-17 06:05:01.348870 7faaec18d700  1 -- 10.200.63.133:6789/0 <== client.9953 10.200.63.133:0/1028024 549 ==== mon_subscribe({monmap=10+,osdmap=785}) v2 ==== 42+0+0 (601251667 0 0) 0x246a81c0 con 0x268ac60
> 2013-02-17 06:05:01.348913 7faaec18d700  1 -- 10.200.63.133:6789/0 --> client.9953 10.200.63.133:0/1028024 -- mon_subscribe_ack(300s) v1 -- ?+0 0x3ed96c0
> 2013-02-17 06:05:01.445022 7faaec18d700  1 -- 10.200.63.133:6789/0 <== client.9977 10.200.63.132:0/1025185 549 ==== mon_subscribe({monmap=10+,osdmap=785}) v2 ==== 42+0+0 (601251667 0 0) 0x279848c0 con 0x268b080
> 2013-02-17 06:05:01.445054 7faaec18d700  1 -- 10.200.63.133:6789/0 --> client.9977 10.200.63.132:0/1025185 -- mon_subscribe_ack(300s) v1 -- ?+0 0x3ed9860
> 2013-02-17 06:05:02.129960 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217831 ==== paxos(osdmap lease lc 784 fc 541 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (2306880830 0 0) 0x39b22c0 con 0x268a9a0
> 2013-02-17 06:05:02.130015 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(osdmap lease_ack lc 784 fc 541 pn 0 opn 0 gv {}) v2 -- ?+0 0x3950580
> 2013-02-17 06:05:02.130058 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217832 ==== paxos(mdsmap lease lc 1 fc 1 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (1492732137 0 0) 0x3960dc0 con 0x268a9a0
> 2013-02-17 06:05:02.130071 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(mdsmap lease_ack lc 1 fc 1 pn 0 opn 0 gv {}) v2 -- ?+0 0x39b22c0
> 2013-02-17 06:05:02.130177 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217833 ==== paxos(monmap lease lc 9 fc 1 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (445342582 0 0) 0x39b2000 con 0x268a9a0
> 2013-02-17 06:05:02.130201 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(monmap lease_ack lc 9 fc 1 pn 0 opn 0 gv {}) v2 -- ?+0 0x3960dc0
> 2013-02-17 06:05:02.567361 7faaec18d700  1 -- 10.200.63.133:6789/0 <== osd.0 10.200.63.132:6801/18444 16103 ==== osd_failure(failed osd.1 10.200.63.133:6801/21178 for 31sec e784 v784) v3 ==== 188+0+0 (2900250966 0 0) 0x2f69180 con 0x2ee6420
> 2013-02-17 06:05:02.567427 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- forward(osd_failure(failed osd.1 10.200.63.133:6801/21178 for 31sec e784 v784) v3) to leader v1 -- ?+0 0x39b2000
> 2013-02-17 06:05:02.568250 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217834 ==== route(no-reply tid 31188) v2 ==== 154+0+0 (247117294 0 0) 0x276ca000 con 0x268a9a0
> 2013-02-17 06:05:02.568572 7faaec18d700  1 -- 10.200.63.133:6789/0 <== osd.0 10.200.63.132:6801/18444 16104 ==== pg_stats(15 pgs tid 15379 v 784) v1 ==== 5695+0+0 (862073156 0 0) 0x27b2e480 con 0x2ee6420
> 2013-02-17 06:05:02.568630 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- forward(pg_stats(15 pgs tid 15379 v 784) v1) to leader v1 -- ?+0 0x27a70b00
> 2013-02-17 06:05:02.683997 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217835 ==== paxos(osdmap begin lc 784 fc 0 pn 40700 opn 0 gv {785=5833342}) v2 ==== 245+0+0 (3059524207 0 0) 0x27a70b00 con 0x268a9a0
> 2013-02-17 06:05:02.733666 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(osdmap accept lc 784 fc 0 pn 40700 opn 0 gv {}) v2 -- ?+0 0x381f8c0
> 2013-02-17 06:05:02.733705 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217836 ==== paxos(logm begin lc 2928952 fc 0 pn 40700 opn 0 gv {2928953=5833343}) v2 ==== 370+0+0 (1453139115 0 0) 0x39b2000 con 0x268a9a0
> 2013-02-17 06:05:02.758039 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(logm accept lc 2928952 fc 0 pn 40700 opn 0 gv {}) v2 -- ?+0 0x27a70b00
> 2013-02-17 06:05:02.758071 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217837 ==== paxos(pgmap begin lc 2900134 fc 0 pn 40700 opn 0 gv {2900135=5833344}) v2 ==== 5451+0+0 (2376171371 0 0) 0x3960dc0 con 0x268a9a0
> 2013-02-17 06:05:02.789264 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(pgmap accept lc 2900134 fc 0 pn 40700 opn 0 gv {}) v2 -- ?+0 0x39b2000
> 2013-02-17 06:05:02.789297 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217838 ==== paxos(osdmap commit lc 785 fc 0 pn 40700 opn 0 gv {785=5833342}) v2 ==== 245+0+0 (2456140731 0 0) 0x381f8c0 con 0x268a9a0
> 2013-02-17 06:05:02.839358 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217839 ==== paxos(osdmap lease lc 785 fc 541 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (655530447 0 0) 0x39b22c0 con 0x268a9a0
> 2013-02-17 06:05:02.839385 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(osdmap lease_ack lc 785 fc 541 pn 0 opn 0 gv {}) v2 -- ?+0 0x381f8c0
> 2013-02-17 06:05:02.859571 7faaec18d700  1 mon.b5@2(peon).osd e785 e785: 2 osds: 1 up, 2 in
> 2013-02-17 06:05:02.871798 7faaec18d700  1 -- 10.200.63.133:6789/0 --> client.9995 10.200.63.132:0/2026519 -- osd_map(785..785 src has 541..785) v3 -- ?+0 0x2ad3000
> 2013-02-17 06:05:02.871864 7faaec18d700  1 -- 10.200.63.133:6789/0 --> client.9971 10.200.63.132:0/1024854 -- osd_map(785..785 src has 541..785) v3 -- ?+0 0x2afae00
> 2013-02-17 06:05:02.871939 7faaec18d700  1 -- 10.200.63.133:6789/0 --> client.9992 10.200.63.132:0/1026519 -- osd_map(785..785 src has 541..785) v3 -- ?+0 0x2ba7200
> 2013-02-17 06:05:02.871977 7faaec18d700  1 -- 10.200.63.133:6789/0 --> client.9977 10.200.63.132:0/1025185 -- osd_map(785..785 src has 541..785) v3 -- ?+0 0x2b63400
> 2013-02-17 06:05:02.871998 7faaec18d700  1 -- 10.200.63.133:6789/0 --> client.9953 10.200.63.133:0/1028024 -- osd_map(785..785 src has 541..785) v3 -- ?+0 0x2ba6600
> 2013-02-17 06:05:02.872016 7faaec18d700  1 -- 10.200.63.133:6789/0 --> client.9986 10.200.63.132:0/1026221 -- osd_map(785..785 src has 541..785) v3 -- ?+0 0x2a9ae00
> 2013-02-17 06:05:02.872113 7faaec18d700  1 -- 10.200.63.133:6789/0 --> client.9989 10.200.63.132:0/2026221 -- osd_map(785..785 src has 541..785) v3 -- ?+0 0x2bce200
> 2013-02-17 06:05:02.872135 7faaec18d700  1 -- 10.200.63.133:6789/0 --> client.9968 10.200.63.132:0/1024758 -- osd_map(785..785 src has 541..785) v3 -- ?+0 0x285e800
> 2013-02-17 06:05:02.872197 7faaec18d700  1 -- 10.200.63.133:6789/0 --> client.9974 10.200.63.132:0/1024950 -- osd_map(785..785 src has 541..785) v3 -- ?+0 0x2ba7c00
> 2013-02-17 06:05:02.872250 7faaec18d700  1 -- 10.200.63.133:6789/0 --> client.9956 10.200.63.132:0/1024424 -- osd_map(785..785 src has 541..785) v3 -- ?+0 0x2a4f000
> 2013-02-17 06:05:02.872272 7faaec18d700  1 -- 10.200.63.133:6789/0 --> client.9959 10.200.63.132:0/2024424 -- osd_map(785..785 src has 541..785) v3 -- ?+0 0x2b21a00
> 2013-02-17 06:05:02.872294 7faaec18d700  1 -- 10.200.63.133:6789/0 --> osd.0 10.200.63.132:6801/18444 -- osd_map(784..785 src has 541..785) v3 -- ?+0 0x2b64c00
> 2013-02-17 06:05:02.872311 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217840 ==== route(osd_map(784..785 src has 541..785) v3 tid 31191) v2 ==== 549+0+0 (1208657806 0 0) 0xd2c6fc0 con 0x268a9a0
> 2013-02-17 06:05:02.872323 7faaec18d700  1 -- 10.200.63.133:6789/0 --> osd.0 10.200.63.132:6801/18444 -- osd_map(784..785 src has 541..785) v3 -- ?+0 0x2c89c00
> 2013-02-17 06:05:02.872339 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217841 ==== paxos(logm commit lc 2928953 fc 0 pn 40700 opn 0 gv {2928953=5833343}) v2 ==== 370+0+0 (1947946272 0 0) 0x39b2000 con 0x268a9a0
> 2013-02-17 06:05:02.931840 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217842 ==== paxos(logm lease lc 2928953 fc 2928452 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (2254758468 0 0) 0x27a70b00 con 0x268a9a0
> 2013-02-17 06:05:02.931867 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(logm lease_ack lc 2928953 fc 2928451 pn 0 opn 0 gv {}) v2 -- ?+0 0x39b2000
> 2013-02-17 06:05:02.953651 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217843 ==== paxos(pgmap commit lc 2900135 fc 0 pn 40700 opn 0 gv {2900135=5833344}) v2 ==== 5451+0+0 (3278650520 0 0) 0x3950580 con 0x268a9a0
> 2013-02-17 06:05:03.013296 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217844 ==== paxos(pgmap lease lc 2900135 fc 2899634 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (4264509609 0 0) 0x381fb80 con 0x268a9a0
> 2013-02-17 06:05:03.013323 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(pgmap lease_ack lc 2900135 fc 2899633 pn 0 opn 0 gv {}) v2 -- ?+0 0x3950580
> 2013-02-17 06:05:03.043166 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217845 ==== paxos(pgmap begin lc 2900135 fc 0 pn 40700 opn 0 gv {2900136=5833345}) v2 ==== 106258+0+0 (3288368916 0 0) 0x381f8c0 con 0x268a9a0
> 2013-02-17 06:05:03.071470 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(pgmap accept lc 2900135 fc 0 pn 40700 opn 0 gv {}) v2 -- ?+0 0x381fb80
> 2013-02-17 06:05:03.071503 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217846 ==== route(pg_stats_ack(15 pgs tid 15379) v1 tid 31192) v2 ==== 671+0+0 (3596506077 0 0) 0xcecf680 con 0x268a9a0
> 2013-02-17 06:05:03.071517 7faaec18d700  1 -- 10.200.63.133:6789/0 --> osd.0 10.200.63.132:6801/18444 -- pg_stats_ack(15 pgs tid 15379) v1 -- ?+0 0x26b41c0
> 2013-02-17 06:05:03.071536 7faaec18d700  1 -- 10.200.63.133:6789/0 <== osd.0 10.200.63.132:6801/18444 16105 ==== mon_subscribe({monmap=10+,osd_pg_creates=0,osdmap=785}) v2 ==== 69+0+0 (1495392616 0 0) 0x27a5c000 con 0x2ee6420
> 2013-02-17 06:05:03.071578 7faaec18d700  1 -- 10.200.63.133:6789/0 --> osd.0 10.200.63.132:6801/18444 -- osd_map(785..785 src has 541..785) v3 -- ?+0 0x2a9bc00
> 2013-02-17 06:05:03.071590 7faaec18d700  1 -- 10.200.63.133:6789/0 --> osd.0 10.200.63.132:6801/18444 -- mon_subscribe_ack(300s) v1 -- ?+0 0x3ed9a00
> 2013-02-17 06:05:03.071640 7faaec18d700  1 -- 10.200.63.133:6789/0 <== osd.0 10.200.63.132:6801/18444 16106 ==== osd_alive(want up_thru 785 have 785) v1 ==== 22+0+0 (1697281490 0 0) 0x26675180 con 0x2ee6420
> 2013-02-17 06:05:03.071678 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- forward(osd_alive(want up_thru 785 have 785) v1) to leader v1 -- ?+0 0x381edc0
> 2013-02-17 06:05:03.086178 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217847 ==== paxos(pgmap commit lc 2900136 fc 0 pn 40700 opn 0 gv {2900136=5833345}) v2 ==== 106258+0+0 (3946206785 0 0) 0x381edc0 con 0x268a9a0
> 2013-02-17 06:05:03.317782 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217848 ==== paxos(pgmap lease lc 2900136 fc 2899635 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (3766713448 0 0) 0x381fb80 con 0x268a9a0
> 2013-02-17 06:05:03.317811 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(pgmap lease_ack lc 2900136 fc 2899634 pn 0 opn 0 gv {}) v2 -- ?+0 0x381edc0
> 2013-02-17 06:05:03.583718 7faaec18d700  1 -- 10.200.63.133:6789/0 <== client.9977 10.200.63.132:0/1025185 550 ==== mon_subscribe({monmap=10+,osdmap=786}) v2 ==== 42+0+0 (981029754 0 0) 0xf313a40 con 0x268b080
> 2013-02-17 06:05:03.583753 7faaec18d700  1 -- 10.200.63.133:6789/0 --> client.9977 10.200.63.132:0/1025185 -- mon_subscribe_ack(300s) v1 -- ?+0 0x3ed9d40
> 2013-02-17 06:05:03.765409 7faaec18d700  1 -- 10.200.63.133:6789/0 <== client.9986 10.200.63.132:0/1026221 548 ==== mon_subscribe({monmap=10+,osdmap=786}) v2 ==== 42+0+0 (981029754 0 0) 0x254b4700 con 0x268b340
> 2013-02-17 06:05:03.765443 7faaec18d700  1 -- 10.200.63.133:6789/0 --> client.9986 10.200.63.132:0/1026221 -- mon_subscribe_ack(300s) v1 -- ?+0 0xc11e4e0
> 2013-02-17 06:05:03.771262 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217849 ==== paxos(osdmap begin lc 785 fc 0 pn 40700 opn 0 gv {786=5833346}) v2 ==== 248+0+0 (1123691466 0 0) 0x381edc0 con 0x268a9a0
> 2013-02-17 06:05:03.800417 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(osdmap accept lc 785 fc 0 pn 40700 opn 0 gv {}) v2 -- ?+0 0x381fb80
> 2013-02-17 06:05:03.811504 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217850 ==== paxos(logm begin lc 2928953 fc 0 pn 40700 opn 0 gv {2928954=5833347}) v2 ==== 899+0+0 (976230293 0 0) 0x381fb80 con 0x268a9a0
> 2013-02-17 06:05:03.836730 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(logm accept lc 2928953 fc 0 pn 40700 opn 0 gv {}) v2 -- ?+0 0x381edc0
> 2013-02-17 06:05:03.836762 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217851 ==== paxos(osdmap commit lc 786 fc 0 pn 40700 opn 0 gv {786=5833346}) v2 ==== 248+0+0 (1548565610 0 0) 0x3950580 con 0x268a9a0
> 2013-02-17 06:05:03.885558 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217852 ==== paxos(osdmap lease lc 786 fc 541 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (4021780416 0 0) 0x39b2000 con 0x268a9a0
> 2013-02-17 06:05:03.885585 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(osdmap lease_ack lc 786 fc 541 pn 0 opn 0 gv {}) v2 -- ?+0 0x3950580
> 2013-02-17 06:05:03.897747 7faaec18d700  1 mon.b5@2(peon).osd e786 e786: 2 osds: 1 up, 2 in
> 2013-02-17 06:05:03.910012 7faaec18d700  1 -- 10.200.63.133:6789/0 --> client.9977 10.200.63.132:0/1025185 -- osd_map(786..786 src has 541..786) v3 -- ?+0 0x2ba6a00
> 2013-02-17 06:05:03.910042 7faaec18d700  1 -- 10.200.63.133:6789/0 --> client.9986 10.200.63.132:0/1026221 -- osd_map(786..786 src has 541..786) v3 -- ?+0 0x2a9aa00
> 2013-02-17 06:05:03.910112 7faaec18d700  1 -- 10.200.63.133:6789/0 --> osd.0 10.200.63.132:6801/18444 -- osd_map(785..786 src has 541..786) v3 -- ?+0 0x2d00800
> 2013-02-17 06:05:03.910135 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217853 ==== paxos(pgmap begin lc 2900136 fc 0 pn 40700 opn 0 gv {2900137=5833348}) v2 ==== 162+0+0 (3432763774 0 0) 0x381edc0 con 0x268a9a0
> 2013-02-17 06:05:03.934366 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(pgmap accept lc 2900136 fc 0 pn 40700 opn 0 gv {}) v2 -- ?+0 0x39b2000
> 2013-02-17 06:05:03.934397 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217854 ==== route(osd_map(785..786 src has 541..786) v3 tid 31193) v2 ==== 541+0+0 (2399459519 0 0) 0x27b72fc0 con 0x268a9a0
> 2013-02-17 06:05:03.934411 7faaec18d700  1 -- 10.200.63.133:6789/0 --> osd.0 10.200.63.132:6801/18444 -- osd_map(785..786 src has 541..786) v3 -- ?+0 0x285f600
> 2013-02-17 06:05:03.934428 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217855 ==== paxos(logm commit lc 2928954 fc 0 pn 40700 opn 0 gv {2928954=5833347}) v2 ==== 899+0+0 (2190316667 0 0) 0x3950580 con 0x268a9a0
> 2013-02-17 06:05:03.986328 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217856 ==== paxos(logm lease lc 2928954 fc 2928453 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (1074855049 0 0) 0x387f8c0 con 0x268a9a0
> 2013-02-17 06:05:03.986355 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(logm lease_ack lc 2928954 fc 2928452 pn 0 opn 0 gv {}) v2 -- ?+0 0x3950580
> 2013-02-17 06:05:04.008094 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217857 ==== paxos(pgmap commit lc 2900137 fc 0 pn 40700 opn 0 gv {2900137=5833348}) v2 ==== 162+0+0 (3385534158 0 0) 0x39b2000 con 0x268a9a0
> 2013-02-17 06:05:04.061016 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217858 ==== paxos(pgmap lease lc 2900137 fc 2899636 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (1623274848 0 0) 0x3950580 con 0x268a9a0
> 2013-02-17 06:05:04.061043 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(pgmap lease_ack lc 2900137 fc 2899635 pn 0 opn 0 gv {}) v2 -- ?+0 0x39b2000
> 2013-02-17 06:05:04.089621 7faaec18d700  1 -- 10.200.63.133:6789/0 <== client.9974 10.200.63.132:0/1024950 550 ==== mon_subscribe({monmap=10+}) v2 ==== 23+0+0 (897212988 0 0) 0x99376c0 con 0x268b8c0
> 2013-02-17 06:05:04.089641 7faaec18d700  1 -- 10.200.63.133:6789/0 --> client.9974 10.200.63.132:0/1024950 -- mon_subscribe_ack(300s) v1 -- ?+0 0xc11e9c0
> 2013-02-17 06:05:04.161576 7faaec18d700  1 -- 10.200.63.133:6789/0 <== client.9953 10.200.63.133:0/1028024 550 ==== mon_subscribe({monmap=10+,osdmap=786}) v2 ==== 42+0+0 (981029754 0 0) 0x270f3500 con 0x268ac60
> 2013-02-17 06:05:04.161635 7faaec18d700  1 -- 10.200.63.133:6789/0 --> client.9953 10.200.63.133:0/1028024 -- osd_map(786..786 src has 541..786) v3 -- ?+0 0x290e400
> 2013-02-17 06:05:04.161654 7faaec18d700  1 -- 10.200.63.133:6789/0 --> client.9953 10.200.63.133:0/1028024 -- mon_subscribe_ack(300s) v1 -- ?+0 0xc11e340
> 2013-02-17 06:05:04.197772 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217859 ==== paxos(auth lease lc 3461 fc 3441 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (335508935 0 0) 0x39b2000 con 0x268a9a0
> 2013-02-17 06:05:04.197801 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(auth lease_ack lc 3461 fc 3441 pn 0 opn 0 gv {}) v2 -- ?+0 0x3950580
> 2013-02-17 06:05:04.198129 7faaec18d700  1 -- 10.200.63.133:6789/0 <== client.9989 10.200.63.132:0/2026221 549 ==== mon_subscribe({monmap=10+,osdmap=786}) v2 ==== 42+0+0 (981029754 0 0) 0x9937340 con 0x2ee66e0
> 2013-02-17 06:05:04.198188 7faaec18d700  1 -- 10.200.63.133:6789/0 --> client.9989 10.200.63.132:0/2026221 -- osd_map(786..786 src has 541..786) v3 -- ?+0 0x2a4e000
> 2013-02-17 06:05:04.198208 7faaec18d700  1 -- 10.200.63.133:6789/0 --> client.9989 10.200.63.132:0/2026221 -- mon_subscribe_ack(300s) v1 -- ?+0 0xc11f1e0
> 2013-02-17 06:05:04.868960 7faaec18d700  1 -- 10.200.63.133:6789/0 <== osd.1 10.200.63.133:6801/21178 16167 ==== pg_stats(11 pgs tid 15413 v 784) v1 ==== 4203+0+0 (1978113006 0 0) 0xd460fc0 con 0x268b4a0
> 2013-02-17 06:05:04.869046 7faaec18d700  1 -- 10.200.63.133:6789/0 --> 10.200.63.133:6801/21178 -- osd_map(785..786 src has 541..786) v3 -- ?+0 0x2b20600 con 0x268b4a0
> 2013-02-17 06:05:04.869070 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- forward(pg_stats(11 pgs tid 15413 v 784) v1) to leader v1 -- ?+0 0x39b2000
> 2013-02-17 06:05:04.870167 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217860 ==== route(osd_map(785..786 src has 541..786) v3 tid 31194) v2 ==== 541+0+0 (4109387781 0 0) 0xd2946c0 con 0x268a9a0
> 2013-02-17 06:05:04.870195 7faaec18d700  1 -- 10.200.63.133:6789/0 --> osd.1 10.200.63.133:6801/21178 -- osd_map(785..786 src has 541..786) v3 -- ?+0 0x2ad2600
> 2013-02-17 06:05:04.920875 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217861 ==== paxos(logm begin lc 2928954 fc 0 pn 40700 opn 0 gv {2928955=5833349}) v2 ==== 636+0+0 (2711767532 0 0) 0x39b2000 con 0x268a9a0
> 2013-02-17 06:05:04.964213 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(logm accept lc 2928954 fc 0 pn 40700 opn 0 gv {}) v2 -- ?+0 0x387f8c0
> 2013-02-17 06:05:04.977102 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217862 ==== paxos(logm commit lc 2928955 fc 0 pn 40700 opn 0 gv {2928955=5833349}) v2 ==== 636+0+0 (4237095061 0 0) 0x387f8c0 con 0x268a9a0
> 2013-02-17 06:05:05.039837 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217863 ==== paxos(logm lease lc 2928955 fc 2928454 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (3942935192 0 0) 0x3950580 con 0x268a9a0
> 2013-02-17 06:05:05.039864 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(logm lease_ack lc 2928955 fc 2928453 pn 0 opn 0 gv {}) v2 -- ?+0 0x387f8c0
> 2013-02-17 06:05:05.129923 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217864 ==== paxos(mdsmap lease lc 1 fc 1 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (2330587895 0 0) 0x387f8c0 con 0x268a9a0
> 2013-02-17 06:05:05.129952 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(mdsmap lease_ack lc 1 fc 1 pn 0 opn 0 gv {}) v2 -- ?+0 0x3950580
> 2013-02-17 06:05:05.130245 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217865 ==== paxos(monmap lease lc 9 fc 1 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (1303228847 0 0) 0x3950840 con 0x268a9a0
> 2013-02-17 06:05:05.130263 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(monmap lease_ack lc 9 fc 1 pn 0 opn 0 gv {}) v2 -- ?+0 0x387f8c0
> 2013-02-17 06:05:05.380504 7faaec18d700  1 -- 10.200.63.133:6789/0 <== client.9968 10.200.63.132:0/1024758 549 ==== mon_subscribe({monmap=10+,osdmap=786}) v2 ==== 42+0+0 (981029754 0 0) 0x27a0c380 con 0x268b600
> 2013-02-17 06:05:05.380583 7faaec18d700  1 -- 10.200.63.133:6789/0 --> client.9968 10.200.63.132:0/1024758 -- osd_map(786..786 src has 541..786) v3 -- ?+0 0x2c89a00
> 2013-02-17 06:05:05.380603 7faaec18d700  1 -- 10.200.63.133:6789/0 --> client.9968 10.200.63.132:0/1024758 -- mon_subscribe_ack(300s) v1 -- ?+0 0xc11fd40
> 2013-02-17 06:05:05.726864 7faaec18d700  1 -- 10.200.63.133:6789/0 <== osd.1 10.200.63.133:6801/21178 16168 ==== mon_get_version(what=osdmap handle=2) v1 ==== 18+0+0 (3896555503 0 0) 0xd3e5860 con 0x268b4a0
> 2013-02-17 06:05:05.726889 7faaec18d700  1 -- 10.200.63.133:6789/0 --> osd.1 10.200.63.133:6801/21178 -- mon_check_map_ack(handle=2 version=786) v2 -- ?+0 0x4aa4820
> 2013-02-17 06:05:05.727547 7faaec18d700  1 -- 10.200.63.133:6789/0 <== osd.1 10.200.63.133:6801/21178 16169 ==== osd_boot(osd.1 booted 780 v786) v3 ==== 581+0+0 (2497288980 0 0) 0x8340000 con 0x268b4a0
> 2013-02-17 06:05:05.727606 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- forward(osd_boot(osd.1 booted 780 v786) v3) to leader v1 -- ?+0 0x3950840
> 2013-02-17 06:05:05.812260 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217866 ==== paxos(osdmap begin lc 786 fc 0 pn 40700 opn 0 gv {787=5833350}) v2 ==== 698+0+0 (4074253838 0 0) 0x387f8c0 con 0x268a9a0
> 2013-02-17 06:05:05.841338 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(osdmap accept lc 786 fc 0 pn 40700 opn 0 gv {}) v2 -- ?+0 0x39b2000
> 2013-02-17 06:05:05.856277 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217867 ==== paxos(osdmap commit lc 787 fc 0 pn 40700 opn 0 gv {787=5833350}) v2 ==== 698+0+0 (62741733 0 0) 0x3950580 con 0x268a9a0
> 2013-02-17 06:05:05.905956 7faaec18d700  1 -- 10.200.63.133:6789/0 <== mon.0 10.200.63.130:6789/0 217868 ==== paxos(osdmap lease lc 787 fc 541 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (719158437 0 0) 0x27a74b00 con 0x268a9a0
> 2013-02-17 06:05:05.905976 7faaec18d700  1 -- 10.200.63.133:6789/0 --> mon.0 10.200.63.130:6789/0 -- paxos(osdmap lease_ack lc 787 fc 541 pn 0 opn 0 gv {}) v2 -- ?+0 0x3950580
> 2013-02-17 06:05:05.918189 7faaec18d700  1 mon.b5@2(peon).osd e787 e787: 2 osds: 2 up, 2 in
> 
> ----------------------------------------------------------------------
> iostat
> ----------------------------------------------------------------------
> 
> b4:osd.0            load %user %nice  %sys  %iow  %stl %idle  dev rrqm/s wrqm/s    r/s    w/s    rkB/s    wkB/s arq-sz aqu-sz  await  rwait wwait %util
> 2013-02-17-06:04:04  0.4   1.2   0.0   1.4   0.6   0.0  94.1  sdn    0.0    0.0    9.7   13.7    49.33   126.70  15.00   0.37  15.88   3.77  24.5  10.5
> 2013-02-17-06:04:19  0.4   1.4   0.0   1.3   0.6   0.0  93.8  sdn    0.0    0.1    2.9   18.2    12.53   171.73  17.49   0.44  20.92   6.98  23.1  11.3
> 2013-02-17-06:04:34  0.6   1.8   0.0   1.6   0.5   0.0  93.7  sdn    0.0    0.0    2.7   20.3    11.47   169.67  15.80   0.56  14.74   4.00  16.2  11.5
> 2013-02-17-06:04:49  0.5   1.4   0.0   1.3   0.5   0.0  93.9  sdn    0.0    0.1    2.9   15.0    12.27   157.50  18.93   0.25  26.17   5.45  30.2   7.9
> 2013-02-17-06:05:04  0.7   2.0   0.0   1.1   1.2   0.0  92.5  sdn    0.0    0.1    7.7   62.7    32.53   498.20  15.09   1.44  20.36  22.43  20.1  22.6
> 2013-02-17-06:05:19  1.4   1.0   0.0   0.5   8.8   0.0  88.8  sdn    0.0    0.1    2.7  209.8    20.53  3959.80  37.47  89.88 414.63 159.00 417.9  99.9
> 2013-02-17-06:05:34  1.4   2.0   0.0   1.0   1.3   0.0  93.2  sdn    0.0    1.7    4.8   54.5   544.53   462.63  33.99   4.40 104.45  15.00 112.3  33.8
> 2013-02-17-06:05:49  1.2   2.2   0.0   1.3   0.6   0.0  93.3  sdn    0.0    0.6    0.0   18.5     0.00   175.90  18.98   0.49  26.44   0.00  26.4  10.1
> 2013-02-17-06:06:04  1.0   1.2   0.0   1.0   1.1   0.0  93.7  sdn    0.0    1.3    0.6   31.4     2.40   225.17  14.22   1.49  46.48  48.89  46.4  21.9
> 2013-02-17-06:06:19  0.8   4.0   0.0   1.3   0.7   0.0  90.7  sdn    0.0    0.7    2.1   21.5    16.80   202.60  18.65   0.45  18.95  16.45  19.2  13.1
> 2013-02-17-06:06:34  1.2   4.4   0.0   1.3   0.9   0.0  90.1  sdn    0.0    0.3    1.5   18.7    45.33   161.67  20.56   0.45  22.45  48.18  20.4  11.1
> 2013-02-17-06:06:49  1.2   3.4   0.0   1.5   1.0   0.0  90.5  sdn    0.0    0.5    0.7   26.7     3.47   246.93  18.32   1.38  50.61  74.00  50.0  21.0
> 2013-02-17-06:07:04  1.3   4.1   0.0   1.4   0.7   0.0  90.3  sdn    0.0    0.2    0.9   13.2     4.27   128.00  18.81   0.42  30.19  18.46  31.0  10.9
> 
> b4:osd.0-journal    load %user %nice  %sys  %iow  %stl %idle  dev rrqm/s wrqm/s    r/s    w/s    rkB/s    wkB/s arq-sz aqu-sz  await  rwait wwait %util
> 2013-02-17-06:04:04  0.4   1.2   0.0   1.4   0.6   0.0  94.1  md2    0.0    0.0    0.0    9.3     0.00    96.53  20.69   0.00   0.00   0.00   0.0   0.0
> 2013-02-17-06:04:19  0.4   1.3   0.0   1.3   0.6   0.0  93.8  md2    0.0    0.0    0.0    7.3     0.00   113.87  31.05   0.00   0.00   0.00   0.0   0.0
> 2013-02-17-06:04:34  0.6   1.8   0.0   1.6   0.5   0.0  93.7  md2    0.0    0.0    0.0    9.7     0.00   126.67  26.03   0.00   0.00   0.00   0.0   0.0
> 2013-02-17-06:04:49  0.5   1.4   0.0   1.2   0.5   0.0  93.9  md2    0.0    0.0    0.0    8.7     0.00   125.33  28.92   0.00   0.00   0.00   0.0   0.0
> 2013-02-17-06:05:04  0.7   2.0   0.0   1.1   1.2   0.0  92.4  md2    0.0    0.0    0.0   19.1     0.00   573.07  59.90   0.00   0.00   0.00   0.0   0.0
> 2013-02-17-06:05:19  1.4   1.0   0.0   0.5   8.7   0.0  88.8  md2    0.0    0.0    0.0   34.7     0.00  3702.13 213.58   0.00   0.00   0.00   0.0   0.0
> 2013-02-17-06:05:34  1.4   2.0   0.0   1.0   1.4   0.0  93.1  md2    0.0    0.0    0.0   18.5     0.00   297.07  32.06   0.00   0.00   0.00   0.0   0.0
> 2013-02-17-06:05:49  1.2   2.1   0.0   1.4   0.6   0.0  93.3  md2    0.0    0.0    0.0   10.5     0.00   124.27  23.59   0.00   0.00   0.00   0.0   0.0
> 2013-02-17-06:06:04  1.0   1.2   0.0   0.9   1.1   0.0  93.8  md2    0.0    0.0    0.0   15.3     0.00   187.47  24.45   0.00   0.00   0.00   0.0   0.0
> 2013-02-17-06:06:19  0.8   4.0   0.0   1.3   0.7   0.0  90.8  md2    0.0    0.0    0.0    9.2     0.00   108.80  23.65   0.00   0.00   0.00   0.0   0.0
> 2013-02-17-06:06:34  1.2   4.4   0.0   1.3   0.9   0.0  90.1  md2    0.0    0.0    0.0   16.0     0.00   170.40  21.30   0.00   0.00   0.00   0.0   0.0
> 2013-02-17-06:06:49  1.2   3.4   0.0   1.5   1.0   0.0  90.4  md2    0.0    0.0    0.0   11.9     0.00   141.87  23.91   0.00   0.00   0.00   0.0   0.0
> 2013-02-17-06:07:04  1.3   4.1   0.0   1.4   0.7   0.0  90.4  md2    0.0    0.0    0.0    9.3     0.00   108.53  23.26   0.00   0.00   0.00   0.0   0.0
> 
> b5:osd.1            load %user %nice  %sys  %iow  %stl %idle  dev rrqm/s wrqm/s    r/s    w/s    rkB/s    wkB/s arq-sz aqu-sz  await  rwait wwait %util
> 2013-02-17-06:04:05  0.7   0.1   0.0   0.1   0.7   0.0  98.2  sdg    0.0    0.0    9.4   19.1    48.00   166.17  15.01   1.17  40.89  20.92  50.7  27.2
> 2013-02-17-06:04:20  0.6   0.1   0.0   0.1   0.6   0.0  98.2  sdg    0.0    0.0    2.7   15.8    10.67   147.00  17.08   0.68  36.64   9.50  41.2  13.5
> 2013-02-17-06:04:35  0.5   0.2   0.0   0.1   0.7   0.0  98.0  sdg    0.0    0.0    3.2   22.5    13.07   204.63  16.92   1.10  42.77  30.62  44.5  23.1
> 2013-02-17-06:04:50  0.4   0.1   0.0   0.1   0.4   0.0  98.3  sdg    0.0    0.0    2.7   13.6    10.67   142.43  18.82   0.59  36.39  22.50  39.1  13.1
> 2013-02-17-06:05:05  0.3   0.1   0.0   0.1   1.1   0.0  97.7  sdg    0.0    0.0    3.0   20.7    11.80   190.57  17.05   0.96  40.48  23.56  42.9  20.1
> 2013-02-17-06:05:20  1.4   0.7   0.0   0.6   6.2   0.0  91.9  sdg    0.0    0.0    7.1  233.9    30.17  2871.43  24.08  91.60 342.67 132.99 349.1  99.5
> 2013-02-17-06:05:35  1.1   0.2   0.0   0.1   2.3   0.0  96.8  sdg    0.0    1.5    1.7   73.5    18.93  1111.63  30.09   9.59 247.66  76.00 251.6  49.6
> 2013-02-17-06:05:50  1.0   0.1   0.0   0.1   0.6   0.0  98.8  sdg    0.0    0.1    0.0   14.6     0.00   144.87  19.84   0.52  35.62   0.00  35.6   9.9
> 2013-02-17-06:06:05  0.8   0.2   0.0   0.1   1.5   0.0  97.7  sdg    0.0    1.1    2.1   37.9     8.27   297.93  15.31   1.97  49.17  64.84  48.3  30.1
> 2013-02-17-06:06:20  0.8   0.4   0.0   0.2   0.7   0.0  97.9  sdg    0.0    0.1   11.1   15.1   123.73   137.47  19.89   0.65  24.62  17.07  30.2  22.3
> 2013-02-17-06:06:35  0.7   0.4   0.0   0.2   1.5   0.0  97.1  sdg    0.0    0.1    9.4   23.7   150.13   223.33  22.54   1.27  38.31  19.79  45.6  26.4
> 2013-02-17-06:06:50  0.8   0.4   0.0   0.2   1.2   0.0  97.4  sdg    0.0    0.2    8.5   22.5   138.40   188.87  21.11   1.71  55.01  28.35  65.0  30.4
> 2013-02-17-06:07:05  0.8   0.5   0.0   0.3   1.2   0.0  97.2  sdg    0.0    0.0    7.2   18.4    93.07   185.93  21.80   1.13  44.30  26.02  51.4  30.0
> 
> [b5:osd.1-journal not recorded]
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Mon losing touch with OSDs
  2013-02-18  1:44       ` Sage Weil
@ 2013-02-19  3:02         ` Chris Dunlop
  2013-02-20  2:07           ` Chris Dunlop
  0 siblings, 1 reply; 25+ messages in thread
From: Chris Dunlop @ 2013-02-19  3:02 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On Sun, Feb 17, 2013 at 05:44:29PM -0800, Sage Weil wrote:
> On Mon, 18 Feb 2013, Chris Dunlop wrote:
>> On Sat, Feb 16, 2013 at 09:05:21AM +1100, Chris Dunlop wrote:
>>> On Thu, Feb 14, 2013 at 08:57:11PM -0800, Sage Weil wrote:
>>>> On Fri, 15 Feb 2013, Chris Dunlop wrote:
>>>>> In an otherwise seemingly healthy cluster (ceph 0.56.2), what might cause the
>>>>> mons to lose touch with the osds?
>>>> 
>>>> Can you enable 'debug ms = 1' on the mons and leave them that way, in the 
>>>> hopes that this happens again?  It will give us more information to go on.
>>> 
>>> Debug turned on.
>> 
>> We haven't experienced the cluster losing touch with the osds completely
>> since upgrading from 0.56.2 to 0.56.3, but we did lose touch with osd.1
>> for a few seconds before it recovered. See below for logs (reminder: 3
>> boxes, b2 is mon-only, b4 is mon+osd.0, b5 is mon+osd.1).
> 
> Hrm, I don't see any obvious clues.  You could enable 'debug ms = 1' on 
> the osds as well.  That will give us more to go on if/when it happens 
> again, and should not affect performance significantly.

Done: ceph osd tell '*' injectargs '--debug-ms 1'
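(Note: injectargs only lasts until the daemon restarts; to keep the setting across restarts it can also go in ceph.conf. A sketch, assuming the stock section layout:

```ini
[osd]
    debug ms = 1
```

)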

Now to wait for it to happen again.

Chris

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Mon losing touch with OSDs
  2013-02-19  3:02         ` Chris Dunlop
@ 2013-02-20  2:07           ` Chris Dunlop
  2013-02-22  3:06             ` Chris Dunlop
  0 siblings, 1 reply; 25+ messages in thread
From: Chris Dunlop @ 2013-02-20  2:07 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On Tue, Feb 19, 2013 at 02:02:03PM +1100, Chris Dunlop wrote:
> On Sun, Feb 17, 2013 at 05:44:29PM -0800, Sage Weil wrote:
>> On Mon, 18 Feb 2013, Chris Dunlop wrote:
>>> On Sat, Feb 16, 2013 at 09:05:21AM +1100, Chris Dunlop wrote:
>>>> On Thu, Feb 14, 2013 at 08:57:11PM -0800, Sage Weil wrote:
>>>>> On Fri, 15 Feb 2013, Chris Dunlop wrote:
>>>>>> In an otherwise seemingly healthy cluster (ceph 0.56.2), what might cause the
>>>>>> mons to lose touch with the osds?
>>>>> 
>>>>> Can you enable 'debug ms = 1' on the mons and leave them that way, in the 
>>>>> hopes that this happens again?  It will give us more information to go on.
>>>> 
>>>> Debug turned on.
>>> 
>>> We haven't experienced the cluster losing touch with the osds completely
>>> since upgrading from 0.56.2 to 0.56.3, but we did lose touch with osd.1
>>> for a few seconds before it recovered. See below for logs (reminder: 3
>>> boxes, b2 is mon-only, b4 is mon+osd.0, b5 is mon+osd.1).
>> 
>> Hrm, I don't see any obvious clues.  You could enable 'debug ms = 1' on 
>> the osds as well.  That will give us more to go on if/when it happens 
>> again, and should not affect performance significantly.
> 
> Done: ceph osd tell '*' injectargs '--debug-ms 1'
> 
> Now to wait for it to happen again.

OK, we got it again. Full logs covering the incident available at:

https://www.dropbox.com/s/kguzwyjfglv3ypl/ceph-logs.zip

Archive:  /tmp/ceph-logs.zip
 Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
--------  ------  ------- ---- ---------- ----- --------  ----
   11492  Defl:X     1186  90% 2013-02-20 12:04 c0cba4ae  ceph-mon.b2.log
 1270789  Defl:X    89278  93% 2013-02-20 12:00 2208d035  ceph-mon.b4.log
 1375858  Defl:X   104025  92% 2013-02-20 12:05 c64c1dad  ceph-mon.b5.log
 2020042  Defl:X   215000  89% 2013-02-20 10:40 f74ae4ca  ceph-osd.0.log
 2075512  Defl:X   224098  89% 2013-02-20 12:05 b454d2ec  ceph-osd.1.log
  154938  Defl:X    12989  92% 2013-02-20 12:04 d2729b05  ceph.log
--------          -------  ---                            -------
 6908631           646576  91%                            6 files

My naive analysis, based on the log extracts below (best viewed on a wide
screen!)...

Osd.0 starts hearing much-delayed ping_replies from osd.1 and tells the mon,
which marks osd.1 down.

However, the whole time, the osd.1 log indicates that it's receiving and
responding to each ping from osd.0 in a timely fashion. In contrast, the osd.0
log indicates it isn't seeing the osd.1 replies for a while, then sees them all
arrive in a flurry, until they're "delayed" enough to cause osd.0 to tell the
mon.

During the time osd.0 is not seeing the osd.1 ping_replies, there's other traffic
(osd_op, osd_sub_op, osd_sub_op_reply etc.) between osd.0 and osd.1, indicating
that it's not a network problem.

The load on both osds during this period was >90% idle and <1% iow.

Is this pointing to osd.0 experiencing some kind of scheduling or priority
starvation on the ping thread (assuming the ping runs in its own thread)?
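For what it's worth, the heartbeat_check lines in the osd.0 log below ("no reply
from osd.1 since 2013-02-20 04:38:15.149826 (cutoff 2013-02-20 04:38:15.369719)")
suggest the check is simply: report the peer failed when its last ping_reply
stamp is older than now minus the 20s grace. A minimal sketch of that
comparison (my reconstruction from the log output, not the actual ceph code):

```python
from datetime import datetime, timedelta

def heartbeat_check(last_reply: datetime, now: datetime,
                    grace: float = 20.0) -> bool:
    """Report the peer failed if its last ping_reply is older than
    the cutoff, i.e. now minus the grace period."""
    cutoff = now - timedelta(seconds=grace)
    return last_reply < cutoff

# Stamps taken from the osd.0 log below: last reply 04:38:15.149826,
# checked at 04:38:35.369720, giving cutoff 04:38:15.369720.
last = datetime(2013, 2, 20, 4, 38, 15, 149826)
now = datetime(2013, 2, 20, 4, 38, 35, 369720)
print(heartbeat_check(last, now))  # True: osd.1 gets reported failed
```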

The next odd thing is that, although the osds are both back by 04:38:50 ("2
osds: 2 up, 2 in"), the system still wasn't working (see the disk stats for
both osd.0 and osd.1) and didn't recover until ceph (mon + osd) was restarted
on one of the boxes at around 05:50 (not shown in the logs, but full logs
available if needed).

Prior to the restart:

# ceph health
HEALTH_WARN 281 pgs peering; 281 pgs stuck inactive; 576 pgs stuck unclean

(Sorry, once again didn't get a 'ceph -s' prior to the restart.)

Chris.

----------------------------------------------------------------------
ceph.log
----------------------------------------------------------------------
2013-02-20 04:37:51.074128 mon.0 10.200.63.130:6789/0 120771 : [INF] pgmap v3000932: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
2013-02-20 04:37:53.541471 mon.0 10.200.63.130:6789/0 120772 : [INF] pgmap v3000933: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
2013-02-20 04:37:56.063059 mon.0 10.200.63.130:6789/0 120773 : [INF] pgmap v3000934: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
2013-02-20 04:37:58.532763 mon.0 10.200.63.130:6789/0 120774 : [INF] pgmap v3000935: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
<<<< osd.0 sees delayed ping_replies from here >>>>
2013-02-20 04:38:01.057939 mon.0 10.200.63.130:6789/0 120775 : [INF] pgmap v3000936: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
2013-02-20 04:38:03.541404 mon.0 10.200.63.130:6789/0 120776 : [INF] pgmap v3000937: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
2013-02-20 04:38:06.133004 mon.0 10.200.63.130:6789/0 120777 : [INF] pgmap v3000938: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
2013-02-20 04:38:08.540471 mon.0 10.200.63.130:6789/0 120778 : [INF] pgmap v3000939: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
2013-02-20 04:38:11.064003 mon.0 10.200.63.130:6789/0 120779 : [INF] pgmap v3000940: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
2013-02-20 04:38:13.547845 mon.0 10.200.63.130:6789/0 120780 : [INF] pgmap v3000941: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
2013-02-20 04:38:16.062892 mon.0 10.200.63.130:6789/0 120781 : [INF] pgmap v3000942: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
2013-02-20 04:38:18.530804 mon.0 10.200.63.130:6789/0 120782 : [INF] pgmap v3000943: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
2013-02-20 04:38:21.080347 mon.0 10.200.63.130:6789/0 120783 : [INF] pgmap v3000944: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
2013-02-20 04:38:23.555523 mon.0 10.200.63.130:6789/0 120784 : [INF] pgmap v3000945: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
2013-02-20 04:38:26.071449 mon.0 10.200.63.130:6789/0 120785 : [INF] pgmap v3000946: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
2013-02-20 04:38:28.561133 mon.0 10.200.63.130:6789/0 120786 : [INF] pgmap v3000947: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
2013-02-20 04:38:31.068101 mon.0 10.200.63.130:6789/0 120787 : [INF] pgmap v3000948: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
2013-02-20 04:38:33.536022 mon.0 10.200.63.130:6789/0 120788 : [INF] pgmap v3000949: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
2013-02-20 04:38:36.081591 mon.0 10.200.63.130:6789/0 120789 : [INF] pgmap v3000950: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
2013-02-20 04:38:38.380909 mon.0 10.200.63.130:6789/0 120790 : [DBG] osd.1 10.200.63.133:6801/21178 reported failed by osd.0 10.200.63.132:6801/18444
2013-02-20 04:38:43.372798 mon.0 10.200.63.130:6789/0 120793 : [DBG] osd.1 10.200.63.133:6801/21178 reported failed by osd.0 10.200.63.132:6801/18444
2013-02-20 04:38:48.373930 mon.0 10.200.63.130:6789/0 120796 : [DBG] osd.1 10.200.63.133:6801/21178 reported failed by osd.0 10.200.63.132:6801/18444
2013-02-20 04:38:48.373990 mon.0 10.200.63.130:6789/0 120797 : [INF] osd.1 10.200.63.133:6801/21178 failed (3 reports from 1 peers after 2013-02-20 04:39:11.373918 >= grace 20.000000)
2013-02-20 04:38:48.565717 mon.0 10.200.63.130:6789/0 120798 : [INF] osdmap e791: 2 osds: 1 up, 2 in
2013-02-20 04:38:48.670726 mon.0 10.200.63.130:6789/0 120799 : [INF] pgmap v3000955: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
2013-02-20 04:38:49.073328 mon.0 10.200.63.130:6789/0 120800 : [INF] pgmap v3000956: 576 pgs: 271 active+clean, 304 stale+active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
2013-02-20 04:38:49.654554 mon.0 10.200.63.130:6789/0 120801 : [INF] osdmap e792: 2 osds: 1 up, 2 in
2013-02-20 04:38:49.857067 mon.0 10.200.63.130:6789/0 120802 : [INF] pgmap v3000957: 576 pgs: 271 active+clean, 304 stale+active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
2013-02-20 04:38:50.749644 mon.0 10.200.63.130:6789/0 120803 : [INF] osdmap e793: 2 osds: 2 up, 2 in
2013-02-20 04:38:50.749710 mon.0 10.200.63.130:6789/0 120804 : [INF] osd.1 10.200.63.133:6801/21178 boot
2013-02-20 04:38:50.850887 mon.0 10.200.63.130:6789/0 120805 : [INF] pgmap v3000958: 576 pgs: 271 active+clean, 304 stale+active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
2013-02-20 04:38:51.834189 mon.0 10.200.63.130:6789/0 120806 : [INF] osdmap e794: 2 osds: 2 up, 2 in
2013-02-20 04:38:51.956560 mon.0 10.200.63.130:6789/0 120807 : [INF] pgmap v3000959: 576 pgs: 271 active+clean, 304 stale+active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
2013-02-20 04:38:56.162743 mon.0 10.200.63.130:6789/0 120808 : [INF] pgmap v3000960: 576 pgs: 295 active, 271 active+clean, 9 peering, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
2013-02-20 04:38:57.235082 mon.0 10.200.63.130:6789/0 120809 : [INF] pgmap v3000961: 576 pgs: 295 active, 281 peering; 410 GB data, 841 GB used, 2882 GB / 3724 GB avail
2013-02-20 04:38:48.660979 osd.1 10.200.63.133:6801/21178 997 : [WRN] map e791 wrongly marked me down
2013-02-20 04:39:01.158928 mon.0 10.200.63.130:6789/0 120810 : [INF] pgmap v3000962: 576 pgs: 295 active, 281 peering; 410 GB data, 841 GB used, 2882 GB / 3724 GB avail
2013-02-20 04:39:19.111723 osd.0 10.200.63.132:6801/18444 800 : [WRN] 6 slow requests, 6 included below; oldest blocked for > 30.770732 secs
2013-02-20 04:39:19.111729 osd.0 10.200.63.132:6801/18444 801 : [WRN] slow request 30.770732 seconds old, received at 2013-02-20 04:38:48.340908: osd_op(client.9971.0:685981 rb.0.1c62.2ae8944a.0000000003aa [write 3878912~4096] 2.c82ee285) v4 currently reached pg
2013-02-20 04:39:19.111735 osd.0 10.200.63.132:6801/18444 802 : [WRN] slow request 30.770225 seconds old, received at 2013-02-20 04:38:48.341415: osd_op(client.9971.0:685984 rb.0.1c62.2ae8944a.000000000439 [write 364544~20480] 2.b16f5ace) v4 currently reached pg
2013-02-20 04:39:19.111738 osd.0 10.200.63.132:6801/18444 803 : [WRN] slow request 30.456112 seconds old, received at 2013-02-20 04:38:48.655528: osd_op(client.9986.0:178417 broot.rbd [watch 1~0] 2.d30a2f40) v4 currently reached pg
2013-02-20 04:39:19.111743 osd.0 10.200.63.132:6801/18444 804 : [WRN] slow request 30.456106 seconds old, received at 2013-02-20 04:38:48.655534: osd_op(client.9989.0:215170 broot-nfs2.rbd [watch 1~0] 2.7802d31e) v4 currently reached pg
2013-02-20 04:39:19.111747 osd.0 10.200.63.132:6801/18444 805 : [WRN] slow request 30.455860 seconds old, received at 2013-02-20 04:38:48.655780: osd_op(client.9968.0:302450 dns1.rbd [watch 1~0] 2.383712c1) v4 currently reached pg

----------------------------------------------------------------------
grep osd_ping ceph-osd.0.log
----------------------------------------------------------------------
2013-02-20 04:37:57.347387 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:37:57.347384) v2 -- ?+0 0xbd248c0 con 0xa2dcdc0
2013-02-20 04:37:57.349406 7f1090f23700  1 -- 192.168.254.132:0/18444 <== osd.1 192.168.254.133:6803/21178 79153 ==== osd_ping(ping_reply e790 stamp 2013-02-20 04:37:57.347384) v2 ==== 47+0+0 (2837779695 0 0) 0xbc28a80 con 0xa2dcdc0
2013-02-20 04:37:57.847588 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:37:57.847586) v2 -- ?+0 0xa1ea540 con 0xa2dcdc0
2013-02-20 04:37:58.050400 7f1090f23700  1 -- 192.168.254.132:0/18444 <== osd.1 192.168.254.133:6803/21178 79154 ==== osd_ping(ping_reply e790 stamp 2013-02-20 04:37:57.847586) v2 ==== 47+0+0 (3920125339 0 0) 0xdb34000 con 0xa2dcdc0
<<<< osd.0 sees delayed ping_replies from here >>>>
2013-02-20 04:37:59.547719 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:37:59.547716) v2 -- ?+0 0x73e5340 con 0xa2dcdc0
2013-02-20 04:38:00.047911 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:38:00.047909) v2 -- ?+0 0x99bbc00 con 0xa2dcdc0
2013-02-20 04:38:01.748080 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:38:01.748077) v2 -- ?+0 0xa1eb6c0 con 0xa2dcdc0
2013-02-20 04:38:03.448223 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:38:03.448220) v2 -- ?+0 0x99bb180 con 0xa2dcdc0
2013-02-20 04:38:03.948413 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:38:03.948411) v2 -- ?+0 0x99ba700 con 0xa2dcdc0
2013-02-20 04:38:04.448601 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:38:04.448598) v2 -- ?+0 0x99ba540 con 0xa2dcdc0
2013-02-20 04:38:04.948724 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:38:04.948720) v2 -- ?+0 0xa1ebdc0 con 0xa2dcdc0
2013-02-20 04:38:08.448860 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:38:08.448856) v2 -- ?+0 0xb676380 con 0xa2dcdc0
2013-02-20 04:38:08.949028 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:38:08.949025) v2 -- ?+0 0xbc28380 con 0xa2dcdc0
2013-02-20 04:38:10.649263 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:38:10.649260) v2 -- ?+0 0x965d340 con 0xa2dcdc0
2013-02-20 04:38:11.749458 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:38:11.749455) v2 -- ?+0 0xa1eafc0 con 0xa2dcdc0
2013-02-20 04:38:12.799154 7f1090f23700  1 -- 192.168.254.132:0/18444 <== osd.1 192.168.254.133:6803/21178 79155 ==== osd_ping(ping_reply e790 stamp 2013-02-20 04:37:59.547716) v2 ==== 47+0+0 (1242454262 0 0) 0xbc29880 con 0xa2dcdc0
2013-02-20 04:38:12.799459 7f1090f23700  1 -- 192.168.254.132:0/18444 <== osd.1 192.168.254.133:6803/21178 79156 ==== osd_ping(ping_reply e790 stamp 2013-02-20 04:38:00.047909) v2 ==== 47+0+0 (3852750933 0 0) 0xcaff180 con 0xa2dcdc0
2013-02-20 04:38:12.799496 7f1090f23700  1 -- 192.168.254.132:0/18444 <== osd.1 192.168.254.133:6803/21178 79157 ==== osd_ping(ping_reply e790 stamp 2013-02-20 04:38:01.748077) v2 ==== 47+0+0 (3672189647 0 0) 0xb677340 con 0xa2dcdc0
2013-02-20 04:38:12.799542 7f1090f23700  1 -- 192.168.254.132:0/18444 <== osd.1 192.168.254.133:6803/21178 79158 ==== osd_ping(ping_reply e790 stamp 2013-02-20 04:38:03.448220) v2 ==== 47+0+0 (38366945 0 0) 0xbc28c40 con 0xa2dcdc0
2013-02-20 04:38:12.799554 7f1090f23700  1 -- 192.168.254.132:0/18444 <== osd.1 192.168.254.133:6803/21178 79159 ==== osd_ping(ping_reply e790 stamp 2013-02-20 04:38:03.948411) v2 ==== 47+0+0 (83904766 0 0) 0x884ee00 con 0xa2dcdc0
2013-02-20 04:38:12.799573 7f1090f23700  1 -- 192.168.254.132:0/18444 <== osd.1 192.168.254.133:6803/21178 79160 ==== osd_ping(ping_reply e790 stamp 2013-02-20 04:38:04.448598) v2 ==== 47+0+0 (2688468082 0 0) 0x10c5c1c0 con 0xa2dcdc0
2013-02-20 04:38:12.799667 7f1090f23700  1 -- 192.168.254.132:0/18444 <== osd.1 192.168.254.133:6803/21178 79161 ==== osd_ping(ping_reply e790 stamp 2013-02-20 04:38:04.948720) v2 ==== 47+0+0 (4187258751 0 0) 0xb21a540 con 0xa2dcdc0
2013-02-20 04:38:12.799689 7f1090f23700  1 -- 192.168.254.132:0/18444 <== osd.1 192.168.254.133:6803/21178 79162 ==== osd_ping(ping_reply e790 stamp 2013-02-20 04:38:08.448856) v2 ==== 47+0+0 (4176431512 0 0) 0xb21b180 con 0xa2dcdc0
2013-02-20 04:38:12.799710 7f1090f23700  1 -- 192.168.254.132:0/18444 <== osd.1 192.168.254.133:6803/21178 79163 ==== osd_ping(ping_reply e790 stamp 2013-02-20 04:38:08.949025) v2 ==== 47+0+0 (2888471344 0 0) 0xb21b340 con 0xa2dcdc0
2013-02-20 04:38:12.799728 7f1090f23700  1 -- 192.168.254.132:0/18444 <== osd.1 192.168.254.133:6803/21178 79164 ==== osd_ping(ping_reply e790 stamp 2013-02-20 04:38:10.649260) v2 ==== 47+0+0 (3060931781 0 0) 0xb21aa80 con 0xa2dcdc0
2013-02-20 04:38:12.799745 7f1090f23700  1 -- 192.168.254.132:0/18444 <== osd.1 192.168.254.133:6803/21178 79165 ==== osd_ping(ping_reply e790 stamp 2013-02-20 04:38:11.749455) v2 ==== 47+0+0 (2767620502 0 0) 0x8d4e380 con 0xa2dcdc0
2013-02-20 04:38:14.049649 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:38:14.049645) v2 -- ?+0 0xa1ea000 con 0xa2dcdc0
2013-02-20 04:38:14.260608 7f1090f23700  1 -- 192.168.254.132:0/18444 <== osd.1 192.168.254.133:6803/21178 79166 ==== osd_ping(ping_reply e790 stamp 2013-02-20 04:38:14.049645) v2 ==== 47+0+0 (462572634 0 0) 0xbc29a40 con 0xa2dcdc0
2013-02-20 04:38:15.149828 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:38:15.149826) v2 -- ?+0 0xac85340 con 0xa2dcdc0
2013-02-20 04:38:15.151892 7f1090f23700  1 -- 192.168.254.132:0/18444 <== osd.1 192.168.254.133:6803/21178 79167 ==== osd_ping(ping_reply e790 stamp 2013-02-20 04:38:15.149826) v2 ==== 47+0+0 (2092320694 0 0) 0xdb34380 con 0xa2dcdc0
2013-02-20 04:38:21.050059 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:38:21.050056) v2 -- ?+0 0xb677c00 con 0xa2dcdc0
2013-02-20 04:38:25.750198 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:38:25.750195) v2 -- ?+0 0xbd24000 con 0xa2dcdc0
2013-02-20 04:38:28.650370 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:38:28.650367) v2 -- ?+0 0x7e94a80 con 0xa2dcdc0
2013-02-20 04:38:32.150553 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:38:32.150550) v2 -- ?+0 0xb677500 con 0xa2dcdc0
2013-02-20 04:38:34.450740 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:38:34.450737) v2 -- ?+0 0x9bc7500 con 0xa2dcdc0
2013-02-20 04:38:35.369720 7f109af37700 -1 osd.0 790 heartbeat_check: no reply from osd.1 since 2013-02-20 04:38:15.149826 (cutoff 2013-02-20 04:38:15.369719)
2013-02-20 04:38:36.369895 7f109af37700 -1 osd.0 790 heartbeat_check: no reply from osd.1 since 2013-02-20 04:38:15.149826 (cutoff 2013-02-20 04:38:16.369894)

----------------------------------------------------------------------
grep osd_ping ceph-osd.1.log
----------------------------------------------------------------------
2013-02-20 04:37:57.847878 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79154 ==== osd_ping(ping e790 stamp 2013-02-20 04:37:57.847586) v2 ==== 47+0+0 (2625351075 0 0) 0xb441880 con 0xb9a89a0
2013-02-20 04:37:57.847957 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:37:57.847586) v2 -- ?+0 0xe5921c0 con 0xb9a89a0
<<<< osd.0 sees delayed ping_replies from here >>>>
2013-02-20 04:37:59.547994 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79155 ==== osd_ping(ping e790 stamp 2013-02-20 04:37:59.547716) v2 ==== 47+0+0 (1071491278 0 0) 0xb440700 con 0xb9a89a0
2013-02-20 04:37:59.548066 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:37:59.547716) v2 -- ?+0 0xb441880 con 0xb9a89a0
2013-02-20 04:38:00.048174 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79156 ==== osd_ping(ping e790 stamp 2013-02-20 04:38:00.047909) v2 ==== 47+0+0 (2423758957 0 0) 0xc987a40 con 0xb9a89a0
2013-02-20 04:38:00.048262 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:38:00.047909) v2 -- ?+0 0xb440700 con 0xb9a89a0
2013-02-20 04:38:01.748248 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79157 ==== osd_ping(ping e790 stamp 2013-02-20 04:38:01.748077) v2 ==== 47+0+0 (2939345655 0 0) 0xb6d96c0 con 0xb9a89a0
2013-02-20 04:38:01.748330 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:38:01.748077) v2 -- ?+0 0xc987a40 con 0xb9a89a0
2013-02-20 04:38:03.448435 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79158 ==== osd_ping(ping e790 stamp 2013-02-20 04:38:03.448220) v2 ==== 47+0+0 (2006621913 0 0) 0xb71c540 con 0xb9a89a0
2013-02-20 04:38:03.448531 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:38:03.448220) v2 -- ?+0 0xb6d96c0 con 0xb9a89a0
2013-02-20 04:38:04.163566 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79159 ==== osd_ping(ping e790 stamp 2013-02-20 04:38:03.948411) v2 ==== 47+0+0 (1892923590 0 0) 0xc02b6c0 con 0xb9a89a0
2013-02-20 04:38:04.163648 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:38:03.948411) v2 -- ?+0 0xb71c540 con 0xb9a89a0
2013-02-20 04:38:04.448837 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79160 ==== osd_ping(ping e790 stamp 2013-02-20 04:38:04.448598) v2 ==== 47+0+0 (3589092426 0 0) 0xc02a8c0 con 0xb9a89a0
2013-02-20 04:38:04.448876 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:38:04.448598) v2 -- ?+0 0xc02b6c0 con 0xb9a89a0
2013-02-20 04:38:04.949019 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79161 ==== osd_ping(ping e790 stamp 2013-02-20 04:38:04.948720) v2 ==== 47+0+0 (2353499975 0 0) 0x6fae700 con 0xb9a89a0
2013-02-20 04:38:04.949106 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:38:04.948720) v2 -- ?+0 0xc02a8c0 con 0xb9a89a0
2013-02-20 04:38:08.449126 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79162 ==== osd_ping(ping e790 stamp 2013-02-20 04:38:08.448856) v2 ==== 47+0+0 (2369567136 0 0) 0xc02ac40 con 0xb9a89a0
2013-02-20 04:38:08.449210 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:38:08.448856) v2 -- ?+0 0x6fae700 con 0xb9a89a0
2013-02-20 04:38:08.949215 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79163 ==== osd_ping(ping e790 stamp 2013-02-20 04:38:08.949025) v2 ==== 47+0+0 (3656999688 0 0) 0xc02ba40 con 0xb9a89a0
2013-02-20 04:38:08.949277 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:38:08.949025) v2 -- ?+0 0xc02ac40 con 0xb9a89a0
2013-02-20 04:38:10.649580 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79164 ==== osd_ping(ping e790 stamp 2013-02-20 04:38:10.649260) v2 ==== 47+0+0 (3282169085 0 0) 0xc02a000 con 0xb9a89a0
2013-02-20 04:38:10.649647 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:38:10.649260) v2 -- ?+0 0xc02ba40 con 0xb9a89a0
2013-02-20 04:38:11.749750 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79165 ==== osd_ping(ping e790 stamp 2013-02-20 04:38:11.749455) v2 ==== 47+0+0 (3508894126 0 0) 0xe593180 con 0xb9a89a0
2013-02-20 04:38:11.749835 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:38:11.749455) v2 -- ?+0 0xc02a000 con 0xb9a89a0
2013-02-20 04:38:14.049868 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79166 ==== osd_ping(ping e790 stamp 2013-02-20 04:38:14.049645) v2 ==== 47+0+0 (1849801826 0 0) 0xc02bdc0 con 0xb9a89a0
2013-02-20 04:38:14.049943 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:38:14.049645) v2 -- ?+0 0xe593180 con 0xb9a89a0
2013-02-20 04:38:15.150155 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79167 ==== osd_ping(ping e790 stamp 2013-02-20 04:38:15.149826) v2 ==== 47+0+0 (157661070 0 0) 0xb6d9340 con 0xb9a89a0
2013-02-20 04:38:15.150242 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:38:15.149826) v2 -- ?+0 0xc02bdc0 con 0xb9a89a0
2013-02-20 04:38:21.050348 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79168 ==== osd_ping(ping e790 stamp 2013-02-20 04:38:21.050056) v2 ==== 47+0+0 (2307149705 0 0) 0xbf4bc00 con 0xb9a89a0
2013-02-20 04:38:21.050433 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:38:21.050056) v2 -- ?+0 0xb6d9340 con 0xb9a89a0
2013-02-20 04:38:25.750415 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79169 ==== osd_ping(ping e790 stamp 2013-02-20 04:38:25.750195) v2 ==== 47+0+0 (3873827452 0 0) 0xe592000 con 0xb9a89a0
2013-02-20 04:38:25.750548 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:38:25.750195) v2 -- ?+0 0xbf4bc00 con 0xb9a89a0
2013-02-20 04:38:28.650634 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79170 ==== osd_ping(ping e790 stamp 2013-02-20 04:38:28.650367) v2 ==== 47+0+0 (718534970 0 0) 0x78ff880 con 0xb9a89a0
2013-02-20 04:38:28.650713 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:38:28.650367) v2 -- ?+0 0xe592000 con 0xb9a89a0
2013-02-20 04:38:32.150855 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79171 ==== osd_ping(ping e790 stamp 2013-02-20 04:38:32.150550) v2 ==== 47+0+0 (1909552382 0 0) 0x6fae000 con 0xb9a89a0
2013-02-20 04:38:32.150939 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:38:32.150550) v2 -- ?+0 0x78ff880 con 0xb9a89a0
2013-02-20 04:38:34.450994 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79172 ==== osd_ping(ping e790 stamp 2013-02-20 04:38:34.450737) v2 ==== 47+0+0 (886776134 0 0) 0xd304700 con 0xb9a89a0
2013-02-20 04:38:34.451033 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:38:34.450737) v2 -- ?+0 0x6fae000 con 0xb9a89a0
2013-02-20 04:38:37.351175 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79173 ==== osd_ping(ping e790 stamp 2013-02-20 04:38:37.350908) v2 ==== 47+0+0 (2444468724 0 0) 0xe593340 con 0xb9a89a0
2013-02-20 04:38:37.351215 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:38:37.350908) v2 -- ?+0 0xd304700 con 0xb9a89a0
2013-02-20 04:38:41.451417 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79174 ==== osd_ping(ping e790 stamp 2013-02-20 04:38:41.451094) v2 ==== 47+0+0 (4159941099 0 0) 0xb6d8380 con 0xb9a89a0
2013-02-20 04:38:41.451477 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:38:41.451094) v2 -- ?+0 0xe593340 con 0xb9a89a0
2013-02-20 04:38:43.751635 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79175 ==== osd_ping(ping e790 stamp 2013-02-20 04:38:43.751289) v2 ==== 47+0+0 (800627449 0 0) 0xbb0b180 con 0xb9a89a0
2013-02-20 04:38:43.751675 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:38:43.751289) v2 -- ?+0 0xb6d8380 con 0xb9a89a0
2013-02-20 04:38:53.151872 7f564b733700  1 -- 192.168.254.133:6802/21178 <== osd.0 192.168.254.132:0/18444 1 ==== osd_ping(ping e794 stamp 2013-02-20 04:38:53.151589) v2 ==== 47+0+0 (3010557341 0 0) 0x78ff880 con 0x10d39340
2013-02-20 04:38:53.151908 7f564b733700  1 -- 192.168.254.133:6802/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e794 stamp 2013-02-20 04:38:53.151589) v2 -- ?+0 0xbb0b180 con 0x10d39340
2013-02-20 04:38:56.052292 7f564b733700  1 -- 192.168.254.133:6802/21178 <== osd.0 192.168.254.132:0/18444 2 ==== osd_ping(ping e794 stamp 2013-02-20 04:38:56.051788) v2 ==== 47+0+0 (4254185894 0 0) 0xb6d9340 con 0x10d39340
2013-02-20 04:38:56.052330 7f564b733700  1 -- 192.168.254.133:6802/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e794 stamp 2013-02-20 04:38:56.051788) v2 -- ?+0 0x78ff880 con 0x10d39340
2013-02-20 04:38:57.152438 7f564b733700  1 -- 192.168.254.133:6802/21178 <== osd.0 192.168.254.132:0/18444 3 ==== osd_ping(ping e794 stamp 2013-02-20 04:38:57.152006) v2 ==== 47+0+0 (1503720898 0 0) 0xb6d96c0 con 0x10d39340
2013-02-20 04:38:57.152472 7f564b733700  1 -- 192.168.254.133:6802/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e794 stamp 2013-02-20 04:38:57.152006) v2 -- ?+0 0xb6d9340 con 0x10d39340

----------------------------------------------------------------------
ceph-osd.0.log, showing osd.0 <=> osd.1 activity whilst osd.1 ping_replies aren't being seen
----------------------------------------------------------------------
<<<< osd.0 sees delayed ping_replies from here >>>>
2013-02-20 04:37:59.547719 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:37:59.547716) v2 -- ?+0 0x73e5340 con 0xa2dcdc0
2013-02-20 04:38:00.047911 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:38:00.047909) v2 -- ?+0 0x99bbc00 con 0xa2dcdc0
2013-02-20 04:38:00.342856 7f108e71e700  1 -- 192.168.254.132:6802/18444 --> osd.1 192.168.254.133:6801/21178 -- osd_sub_op(client.9811.0:498225 2.51 4d55cd51/rb.0.2150.2ae8944a.000000000628/head//2 [] v 790'271511 snapset=0=[]:[] snapc=0=[]) v7 -- ?+8799 0x8f91400
2013-02-20 04:38:01.046192 7f1091f25700  1 -- 192.168.254.132:6802/18444 <== osd.1 192.168.254.133:6801/21178 4989786 ==== osd_sub_op(unknown.0.0:0 2.4e 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[] snapc=0=[]) v7 ==== 955+0+21617430 (245718815 0 180673924) 0xd19d200 con 0x9571080
2013-02-20 04:38:01.046297 7f1091f25700  1 -- 192.168.254.132:6802/18444 <== osd.1 192.168.254.133:6801/21178 4989787 ==== osd_sub_op_reply(client.9971.0:685924 2.4e b16f5ace/rb.0.1c62.2ae8944a.000000000439/head//2 [] ondisk, result = 0) v1 ==== 157+0+0 (3427135234 0 0) 0xa9c4f00 con 0x9571080
2013-02-20 04:38:01.046352 7f1091f25700  1 -- 192.168.254.132:6802/18444 <== osd.1 192.168.254.133:6801/21178 4989788 ==== osd_sub_op_reply(client.9953.0:367558 2.5a 6d3052da/rbd_data.242f2ae8944a.000000000000002c/head//2 [] ondisk, result = 0) v1 ==== 164+0+0 (2641157471 0 0) 0x86f2280 con 0x9571080
2013-02-20 04:38:01.046390 7f1091f25700  1 -- 192.168.254.132:6802/18444 <== osd.1 192.168.254.133:6801/21178 4989789 ==== osd_sub_op_reply(client.9953.0:367559 2.28 8340fa28/rbd_data.242f2ae8944a.0000000000000096/head//2 [] ondisk, result = 0) v1 ==== 164+0+0 (2759328511 0 0) 0x95d2780 con 0x9571080
2013-02-20 04:38:01.046743 7f1091f25700  1 -- 192.168.254.132:6802/18444 <== osd.1 192.168.254.133:6801/21178 4989790 ==== osd_sub_op_reply(client.9953.0:367560 2.5a 6678d7da/rbd_data.242f2ae8944a.0000000000000b8b/head//2 [] ondisk, result = 0) v1 ==== 164+0+0 (602117107 0 0) 0xda1fb80 con 0x9571080
2013-02-20 04:38:01.047020 7f1091f25700  1 -- 192.168.254.132:6802/18444 <== osd.1 192.168.254.133:6801/21178 4989791 ==== osd_sub_op_reply(client.9811.0:498225 2.51 4d55cd51/rb.0.2150.2ae8944a.000000000628/head//2 [] ondisk, result = 0) v1 ==== 157+0+0 (435514219 0 0) 0xac43400 con 0x9571080
2013-02-20 04:38:01.047240 7f1091f25700  1 -- 192.168.254.132:6802/18444 <== osd.1 192.168.254.133:6801/21178 4989792 ==== osd_sub_op(client.9995.0:4402170 2.77 3b454af7/rb.0.1cf8.2ae8944a.000000000e35/head//2 [] v 790'353333 snapset=0=[]:[] snapc=0=[]) v7 ==== 1107+0+8799 (240510986 0 1818934932) 0x8f91400 con 0x9571080
2013-02-20 04:38:01.047600 7f108cf1b700  1 -- 192.168.254.132:6802/18444 --> osd.1 192.168.254.133:6801/21178 -- replica scrub(pg: 2.4e,from:0'0,to:0'0,epoch:790,start:6388b3ce//0//-1,end:cff8e3ce//0//-1,chunky:1,deep:0,version:4) v4 -- ?+0 0xb0b3b00
2013-02-20 04:38:01.050467 7f108df1d700  1 -- 192.168.254.132:6802/18444 --> osd.1 192.168.254.133:6801/21178 -- osd_sub_op(client.9811.0:498226 2.51 4d55cd51/rb.0.2150.2ae8944a.000000000628/head//2 [] v 790'271512 snapset=0=[]:[] snapc=0=[]) v7 -- ?+4703 0xd19d200
2013-02-20 04:38:01.060518 7f1091f25700  1 -- 192.168.254.132:6802/18444 <== osd.1 192.168.254.133:6801/21178 4989793 ==== osd_sub_op(client.9968.0:302420 2.74 2c5fc7f4/rb.0.1c64.74b0dc51.00000000043b/head//2 [] v 790'498230 snapset=0=[]:[] snapc=0=[]) v7 ==== 1107+0+16991 (2821939739 0 1047147434) 0xd19d200 con 0x9571080
2013-02-20 04:38:01.072310 7f1091f25700  1 -- 192.168.254.132:6802/18444 <== osd.1 192.168.254.133:6801/21178 4989794 ==== osd_sub_op_reply(client.9811.0:498226 2.51 4d55cd51/rb.0.2150.2ae8944a.000000000628/head//2 [] ondisk, result = 0) v1 ==== 157+0+0 (3589886011 0 0) 0xaf1ef00 con 0x9571080
2013-02-20 04:38:01.086276 7f1093f29700  1 -- 192.168.254.132:6802/18444 --> osd.1 192.168.254.133:6801/21178 -- osd_sub_op_reply(client.9995.0:4402170 2.77 3b454af7/rb.0.1cf8.2ae8944a.000000000e35/head//2 [] ondisk, result = 0) v1 -- ?+0 0xc477400
2013-02-20 04:38:01.090522 7f1091f25700  1 -- 192.168.254.132:6802/18444 <== osd.1 192.168.254.133:6801/21178 4989795 ==== osd_sub_op(client.9995.0:4402171 2.77 3b454af7/rb.0.1cf8.2ae8944a.000000000e35/head//2 [] v 790'353334 snapset=0=[]:[] snapc=0=[]) v7 ==== 1107+0+4703 (4105904555 0 2838991357) 0xbddea00 con 0x9571080
2013-02-20 04:38:01.097246 7f1093f29700  1 -- 192.168.254.132:6802/18444 --> osd.1 192.168.254.133:6801/21178 -- osd_sub_op_reply(client.9968.0:302420 2.74 2c5fc7f4/rb.0.1c64.74b0dc51.00000000043b/head//2 [] ondisk, result = 0) v1 -- ?+0 0x9055400
2013-02-20 04:38:01.100259 7f1091f25700  1 -- 192.168.254.132:6802/18444 <== osd.1 192.168.254.133:6801/21178 4989796 ==== osd_sub_op(client.9968.0:302421 2.74 2c5fc7f4/rb.0.1c64.74b0dc51.00000000043b/head//2 [] v 790'498231 snapset=0=[]:[] snapc=0=[]) v7 ==== 1107+0+4703 (3340239323 0 2873835958) 0x810f400 con 0x9571080
2013-02-20 04:38:01.110432 7f1093f29700  1 -- 192.168.254.132:6802/18444 --> osd.1 192.168.254.133:6801/21178 -- osd_sub_op_reply(client.9995.0:4402171 2.77 3b454af7/rb.0.1cf8.2ae8944a.000000000e35/head//2 [] ondisk, result = 0) v1 -- ?+0 0xc477680
2013-02-20 04:38:01.121128 7f1093f29700  1 -- 192.168.254.132:6802/18444 --> osd.1 192.168.254.133:6801/21178 -- osd_sub_op_reply(client.9968.0:302421 2.74 2c5fc7f4/rb.0.1c64.74b0dc51.00000000043b/head//2 [] ondisk, result = 0) v1 -- ?+0 0x296d680
2013-02-20 04:38:01.366189 7f108e71e700  1 -- 192.168.254.132:6802/18444 --> osd.1 192.168.254.133:6801/21178 -- osd_sub_op(client.10001.0:3423766 2.7d f598cffd/rb.0.209f.74b0dc51.000000000a00/head//2 [] v 790'768516 snapset=0=[]:[] snapc=0=[]) v7 -- ?+4703 0x9fe1400
2013-02-20 04:38:01.366377 7f108e71e700  1 -- 192.168.254.132:6802/18444 --> osd.1 192.168.254.133:6801/21178 -- osd_sub_op(client.10001.0:3423767 2.7d f598cffd/rb.0.209f.74b0dc51.000000000a00/head//2 [] v 790'768517 snapset=0=[]:[] snapc=0=[]) v7 -- ?+4703 0xb978a00
2013-02-20 04:38:01.366519 7f108e71e700  1 -- 192.168.254.132:6802/18444 --> osd.1 192.168.254.133:6801/21178 -- osd_sub_op(client.10001.0:3423768 2.7d f598cffd/rb.0.209f.74b0dc51.000000000a00/head//2 [] v 790'768518 snapset=0=[]:[] snapc=0=[]) v7 -- ?+4703 0x815d200
2013-02-20 04:38:01.366728 7f108df1d700  1 -- 192.168.254.132:6802/18444 --> osd.1 192.168.254.133:6801/21178 -- osd_sub_op(client.10001.0:3423770 2.7d f598cffd/rb.0.209f.74b0dc51.000000000a00/head//2 [] v 790'768519 snapset=0=[]:[] snapc=0=[]) v7 -- ?+4703 0xc557e00
2013-02-20 04:38:01.367681 7f1091f25700  1 -- 192.168.254.132:6802/18444 <== osd.1 192.168.254.133:6801/21178 4989797 ==== osd_sub_op(client.10001.0:3423769 2.17 37564117/rb.0.209f.74b0dc51.000000001200/head//2 [] v 790'353128 snapset=0=[]:[] snapc=0=[]) v7 ==== 1107+0+4703 (1225650648 0 1162937880) 0xa052800 con 0x9571080
2013-02-20 04:38:01.378938 7f1091f25700  1 -- 192.168.254.132:6802/18444 <== osd.1 192.168.254.133:6801/21178 4989798 ==== osd_sub_op_reply(client.10001.0:3423766 2.7d f598cffd/rb.0.209f.74b0dc51.000000000a00/head//2 [] ondisk, result = 0) v1 ==== 157+0+0 (2250676101 0 0) 0xc477680 con 0x9571080
2013-02-20 04:38:01.379109 7f1091f25700  1 -- 192.168.254.132:6802/18444 <== osd.1 192.168.254.133:6801/21178 4989799 ==== osd_sub_op_reply(client.10001.0:3423767 2.7d f598cffd/rb.0.209f.74b0dc51.000000000a00/head//2 [] ondisk, result = 0) v1 ==== 157+0+0 (2794751629 0 0) 0x7b1aa00 con 0x9571080
2013-02-20 04:38:01.379162 7f1091f25700  1 -- 192.168.254.132:6802/18444 <== osd.1 192.168.254.133:6801/21178 4989800 ==== osd_sub_op_reply(client.10001.0:3423768 2.7d f598cffd/rb.0.209f.74b0dc51.000000000a00/head//2 [] ondisk, result = 0) v1 ==== 157+0+0 (41013162 0 0) 0x8086a00 con 0x9571080
2013-02-20 04:38:01.379396 7f1091f25700  1 -- 192.168.254.132:6802/18444 <== osd.1 192.168.254.133:6801/21178 4989801 ==== osd_sub_op_reply(client.10001.0:3423770 2.7d f598cffd/rb.0.209f.74b0dc51.000000000a00/head//2 [] ondisk, result = 0) v1 ==== 157+0+0 (2912526559 0 0) 0xb054500 con 0x9571080

----------------------------------------------------------------------
osd.0 load and disk
----------------------------------------------------------------------
                    load %user %nice  %sys  %iow  %stl %idle  dev rrqm/s wrqm/s    r/s    w/s    rkB/s    wkB/s arq-sz aqu-sz  await  rwait wwait %util
2013-02-20-04:37:56  0.6   1.4   0.0   1.4   0.5   0.0  94.1  sdn    0.0    0.0    0.0   19.1     0.00   182.00  19.09   0.59  30.87   0.00  30.9  14.5
2013-02-20-04:38:11  0.6   1.2   0.0   1.1   1.0   0.0  93.4  sdn    0.0    0.0    0.0   20.8     0.00   152.00  14.62   0.79  38.14   0.00  38.1  12.5
2013-02-20-04:38:26  0.6   1.3   0.0   1.1   0.5   0.0  94.0  sdn    0.0    0.0    0.0   17.4     0.00   170.53  19.60   0.69  39.85   0.00  39.8  14.7
2013-02-20-04:38:41  0.6   1.1   0.0   1.2   0.5   0.0  94.3  sdn    0.0    0.2    0.0   21.8     0.00   179.83  16.50   0.61  28.17   0.00  28.2  12.1
2013-02-20-04:38:56  0.8   1.7   0.0   1.2   4.3   0.0  90.6  sdn    0.0    0.1    0.2  141.1     1.60  1838.33  26.04  41.68 294.67 283.33 294.7  53.9
2013-02-20-04:39:11  0.7   0.3   0.0   0.3   0.3   0.0  98.3  sdn    0.0    0.0    0.0    2.3     0.00    45.20  39.88   0.16  85.00   0.00  85.0   4.8
2013-02-20-04:39:26  0.6   0.1   0.0   0.1   0.0   0.0  99.2  sdn    0.0    0.0    0.0    0.0     0.00     0.00   0.00   0.00   0.00   0.00   0.0   0.0
2013-02-20-04:39:41  0.5   0.4   0.0   0.3   0.1   0.0  99.4  sdn    0.0    0.0    0.0    0.0     0.00     0.00   0.00   0.00   0.00   0.00   0.0   0.0
2013-02-20-04:39:56  0.4   0.2   0.0   0.2   0.1   0.0  99.5  sdn    0.0    0.0    0.0    0.0     0.00     0.00   0.00   0.00   0.00   0.00   0.0   0.0
2013-02-20-04:40:11  0.3   0.3   0.0   0.3   0.0   0.0  99.5  sdn    0.0    0.0    0.0    0.0     0.00     0.00   0.00   0.00   0.00   0.00   0.0   0.0

----------------------------------------------------------------------
osd.1 load and disk
----------------------------------------------------------------------
                    load %user %nice  %sys  %iow  %stl %idle  dev rrqm/s wrqm/s    r/s    w/s    rkB/s    wkB/s arq-sz aqu-sz  await  rwait wwait %util
2013-02-20-04:37:57  0.8   0.3   0.0   0.2   0.6   0.0  97.7  sdg    0.0    0.1    2.5   19.7    10.40   179.93  17.15   0.97  43.63  38.16  44.3  19.3
2013-02-20-04:38:12  1.0   0.3   0.0   0.3   0.7   0.0  97.4  sdg    0.0    0.1    2.8   28.0    11.47   232.20  15.82   0.71  23.20  25.48  23.0  16.1
2013-02-20-04:38:27  0.8   0.4   0.0   0.2   0.7   0.0  97.4  sdg    0.0    0.0    2.7   11.3    10.67   102.87  16.30   0.31  22.30  14.25  24.2  11.1
2013-02-20-04:38:42  0.7   0.2   0.0   0.2   0.8   0.0  97.6  sdg    0.0    0.0    2.3   22.3     9.07   198.07  16.84   1.15  46.72  31.18  48.3  20.9
2013-02-20-04:38:57  0.6   0.6   0.0   0.5   5.2   0.0  92.9  sdg    0.0    0.0   15.8  162.7    63.73  1613.37  18.79  32.51 182.07  24.94 197.3  63.1
2013-02-20-04:39:12  0.5   0.0   0.0   0.0   0.1   0.0  99.8  sdg    0.0    0.0    0.0    5.8     0.00    76.87  26.51   0.32  57.82   0.00  57.8   9.2
2013-02-20-04:39:27  0.4   0.0   0.0   0.0   0.1   0.0  99.9  sdg    0.0    0.0    0.0    0.0     0.00     0.00   0.00   0.00   0.00   0.00   0.0   0.0
2013-02-20-04:39:42  0.6   0.0   0.0   0.0   0.1   0.0  99.8  sdg    0.0    0.0    0.0    0.0     0.00     0.00   0.00   0.00   0.00   0.00   0.0   0.0
2013-02-20-04:39:57  0.5   0.0   0.0   0.0   0.1   0.0  99.8  sdg    0.0    0.0    0.0    0.0     0.00     0.00   0.00   0.00   0.00   0.00   0.0   0.0
2013-02-20-04:40:12  0.4   0.0   0.0   0.0   0.1   0.0  99.8  sdg    0.0    0.0    0.0    0.0     0.00     0.00   0.00   0.00   0.00   0.00   0.0   0.0


----------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Mon losing touch with OSDs
  2013-02-20  2:07           ` Chris Dunlop
@ 2013-02-22  3:06             ` Chris Dunlop
  2013-02-22 21:57               ` Sage Weil
  0 siblings, 1 reply; 25+ messages in thread
From: Chris Dunlop @ 2013-02-22  3:06 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

G'day,

It seems there might be two issues here: the first is the delayed
receipt of ping replies causing a seemingly otherwise healthy osd to be
marked down; the second is the lack of recovery once the downed osd is
recognised as up again.
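For the first issue, my reading of the osd.0 "heartbeat_check: no reply
from osd.1 since ... (cutoff ...)" lines is roughly the following (a toy
Python sketch of the cutoff logic as I understand it, not the actual
ceph code; all names here are mine):

```python
# Toy model of the heartbeat cutoff suggested by the heartbeat_check
# log lines: a peer is reported to the mon as failed once the stamp of
# its newest ping_reply falls behind (now - grace).
from dataclasses import dataclass

@dataclass
class Peer:
    name: str
    last_reply_stamp: float  # stamp echoed back in the newest ping_reply

def peers_to_report(peers, now, grace=20.0):
    """Return names of peers whose last ping_reply predates the cutoff."""
    cutoff = now - grace
    return [p.name for p in peers if p.last_reply_stamp < cutoff]

# Mirroring the 04:38:35 check: last reply stamped 04:38:15.149826,
# checked ~20s later, so osd.1 crosses the cutoff and gets reported.
peers = [Peer("osd.1", last_reply_stamp=15.149826)]
print(peers_to_report(peers, now=35.369720))  # prints ['osd.1']
```

On that model, replies arriving in a delayed flurry (as in the osd.0 log
below) would keep resetting last_reply_stamp just late enough to trip
the cutoff repeatedly.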

Is it worth my opening tracker reports for these, just so they don't get
lost?

Cheers,

Chris

On Wed, Feb 20, 2013 at 01:07:03PM +1100, Chris Dunlop wrote:
> On Tue, Feb 19, 2013 at 02:02:03PM +1100, Chris Dunlop wrote:
>> On Sun, Feb 17, 2013 at 05:44:29PM -0800, Sage Weil wrote:
>>> On Mon, 18 Feb 2013, Chris Dunlop wrote:
>>>> On Sat, Feb 16, 2013 at 09:05:21AM +1100, Chris Dunlop wrote:
>>>>> On Thu, Feb 14, 2013 at 08:57:11PM -0800, Sage Weil wrote:
>>>>>> On Fri, 15 Feb 2013, Chris Dunlop wrote:
>>>>>>> In an otherwise seemingly healthy cluster (ceph 0.56.2), what might cause the
>>>>>>> mons to lose touch with the osds?
>>>>>> 
>>>>>> Can you enable 'debug ms = 1' on the mons and leave them that way, in the 
>>>>>> hopes that this happens again?  It will give us more information to go on.
>>>>> 
>>>>> Debug turned on.
>>>> 
>>>> We haven't experienced the cluster losing touch with the osds completely
>>>> since upgrading from 0.56.2 to 0.56.3, but we did lose touch with osd.1
>>>> for a few seconds before it recovered. See below for logs (reminder: 3
>>>> boxes, b2 is mon-only, b4 is mon+osd.0, b5 is mon+osd.1).
>>> 
>>> Hrm, I don't see any obvious clues.  You could enable 'debug ms = 1' on 
>>> the osds as well.  That will give us more to go on if/when it happens 
>>> again, and should not affect performance significantly.
>> 
>> Done: ceph osd tell '*' injectargs '--debug-ms 1'
>> 
>> Now to wait for it to happen again.
> 
> OK, we got it again. Full logs covering the incident available at:
> 
> https://www.dropbox.com/s/kguzwyjfglv3ypl/ceph-logs.zip
> 
> Archive:  /tmp/ceph-logs.zip
>  Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
> --------  ------  ------- ---- ---------- ----- --------  ----
>    11492  Defl:X     1186  90% 2013-02-20 12:04 c0cba4ae  ceph-mon.b2.log
>  1270789  Defl:X    89278  93% 2013-02-20 12:00 2208d035  ceph-mon.b4.log
>  1375858  Defl:X   104025  92% 2013-02-20 12:05 c64c1dad  ceph-mon.b5.log
>  2020042  Defl:X   215000  89% 2013-02-20 10:40 f74ae4ca  ceph-osd.0.log
>  2075512  Defl:X   224098  89% 2013-02-20 12:05 b454d2ec  ceph-osd.1.log
>   154938  Defl:X    12989  92% 2013-02-20 12:04 d2729b05  ceph.log
> --------          -------  ---                            -------
>  6908631           646576  91%                            6 files
> 
> My naive analysis, based on the log extracts below (best viewed on a wide
> screen!)...
> 
> Osd.0 stops seeing timely ping_replies from osd.1 and reports it as failed to
> the mon, which marks osd.1 down.
> 
> However the whole time, the osd.1 log indicates that it's receiving and
> responding to each ping from osd.0 in a timely fashion. In contrast, the osd.0
> log indicates it isn't seeing the osd.1 replies for a while, then sees them all
> arrive in a flurry, until they're "delayed" enough to cause osd.0 to tell the
> mon.
> 
> During the time osd.0 is not seeing the osd.1 ping_replies, there's other traffic
> (osd_op, osd_sub_op, osd_sub_op_reply etc.) between osd.0 and osd.1, indicating
> that it's not a network problem.
> 
> The load on both osds during this period was >90% idle and <1% iow.
> 
> Is this pointing to osd.0 experiencing some kind of scheduling or priority
> starvation on the ping thread (assuming the ping is in its own thread)?
> 
> The next odd thing is that, although the osds are both back by 04:38:50 ("2
> osds: 2 up, 2 in"), the system still wasn't working (see the disk stats for
> both osd.0 and osd.1) and didn't recover until ceph (mon + osd) was restarted
> on one of the boxes at around 05:50 (not shown in the logs, but full logs
> available if needed).
> 
> Prior to the restart:
> 
> # ceph health
> HEALTH_WARN 281 pgs peering; 281 pgs stuck inactive; 576 pgs stuck unclean
> 
> (Sorry, once again didn't get a 'ceph -s' prior to the restart.)
> 
> Chris.
> 
> ----------------------------------------------------------------------
> ceph.log
> ----------------------------------------------------------------------
> 2013-02-20 04:37:51.074128 mon.0 10.200.63.130:6789/0 120771 : [INF] pgmap v3000932: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
> 2013-02-20 04:37:53.541471 mon.0 10.200.63.130:6789/0 120772 : [INF] pgmap v3000933: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
> 2013-02-20 04:37:56.063059 mon.0 10.200.63.130:6789/0 120773 : [INF] pgmap v3000934: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
> 2013-02-20 04:37:58.532763 mon.0 10.200.63.130:6789/0 120774 : [INF] pgmap v3000935: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
> <<<< osd.0 sees delayed ping_replies from here >>>>
> 2013-02-20 04:38:01.057939 mon.0 10.200.63.130:6789/0 120775 : [INF] pgmap v3000936: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
> 2013-02-20 04:38:03.541404 mon.0 10.200.63.130:6789/0 120776 : [INF] pgmap v3000937: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
> 2013-02-20 04:38:06.133004 mon.0 10.200.63.130:6789/0 120777 : [INF] pgmap v3000938: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
> 2013-02-20 04:38:08.540471 mon.0 10.200.63.130:6789/0 120778 : [INF] pgmap v3000939: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
> 2013-02-20 04:38:11.064003 mon.0 10.200.63.130:6789/0 120779 : [INF] pgmap v3000940: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
> 2013-02-20 04:38:13.547845 mon.0 10.200.63.130:6789/0 120780 : [INF] pgmap v3000941: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
> 2013-02-20 04:38:16.062892 mon.0 10.200.63.130:6789/0 120781 : [INF] pgmap v3000942: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
> 2013-02-20 04:38:18.530804 mon.0 10.200.63.130:6789/0 120782 : [INF] pgmap v3000943: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
> 2013-02-20 04:38:21.080347 mon.0 10.200.63.130:6789/0 120783 : [INF] pgmap v3000944: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
> 2013-02-20 04:38:23.555523 mon.0 10.200.63.130:6789/0 120784 : [INF] pgmap v3000945: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
> 2013-02-20 04:38:26.071449 mon.0 10.200.63.130:6789/0 120785 : [INF] pgmap v3000946: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
> 2013-02-20 04:38:28.561133 mon.0 10.200.63.130:6789/0 120786 : [INF] pgmap v3000947: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
> 2013-02-20 04:38:31.068101 mon.0 10.200.63.130:6789/0 120787 : [INF] pgmap v3000948: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
> 2013-02-20 04:38:33.536022 mon.0 10.200.63.130:6789/0 120788 : [INF] pgmap v3000949: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
> 2013-02-20 04:38:36.081591 mon.0 10.200.63.130:6789/0 120789 : [INF] pgmap v3000950: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
> 2013-02-20 04:38:38.380909 mon.0 10.200.63.130:6789/0 120790 : [DBG] osd.1 10.200.63.133:6801/21178 reported failed by osd.0 10.200.63.132:6801/18444
> 2013-02-20 04:38:43.372798 mon.0 10.200.63.130:6789/0 120793 : [DBG] osd.1 10.200.63.133:6801/21178 reported failed by osd.0 10.200.63.132:6801/18444
> 2013-02-20 04:38:48.373930 mon.0 10.200.63.130:6789/0 120796 : [DBG] osd.1 10.200.63.133:6801/21178 reported failed by osd.0 10.200.63.132:6801/18444
> 2013-02-20 04:38:48.373990 mon.0 10.200.63.130:6789/0 120797 : [INF] osd.1 10.200.63.133:6801/21178 failed (3 reports from 1 peers after 2013-02-20 04:39:11.373918 >= grace 20.000000)
> 2013-02-20 04:38:48.565717 mon.0 10.200.63.130:6789/0 120798 : [INF] osdmap e791: 2 osds: 1 up, 2 in
> 2013-02-20 04:38:48.670726 mon.0 10.200.63.130:6789/0 120799 : [INF] pgmap v3000955: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
> 2013-02-20 04:38:49.073328 mon.0 10.200.63.130:6789/0 120800 : [INF] pgmap v3000956: 576 pgs: 271 active+clean, 304 stale+active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
> 2013-02-20 04:38:49.654554 mon.0 10.200.63.130:6789/0 120801 : [INF] osdmap e792: 2 osds: 1 up, 2 in
> 2013-02-20 04:38:49.857067 mon.0 10.200.63.130:6789/0 120802 : [INF] pgmap v3000957: 576 pgs: 271 active+clean, 304 stale+active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
> 2013-02-20 04:38:50.749644 mon.0 10.200.63.130:6789/0 120803 : [INF] osdmap e793: 2 osds: 2 up, 2 in
> 2013-02-20 04:38:50.749710 mon.0 10.200.63.130:6789/0 120804 : [INF] osd.1 10.200.63.133:6801/21178 boot
> 2013-02-20 04:38:50.850887 mon.0 10.200.63.130:6789/0 120805 : [INF] pgmap v3000958: 576 pgs: 271 active+clean, 304 stale+active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
> 2013-02-20 04:38:51.834189 mon.0 10.200.63.130:6789/0 120806 : [INF] osdmap e794: 2 osds: 2 up, 2 in
> 2013-02-20 04:38:51.956560 mon.0 10.200.63.130:6789/0 120807 : [INF] pgmap v3000959: 576 pgs: 271 active+clean, 304 stale+active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
> 2013-02-20 04:38:56.162743 mon.0 10.200.63.130:6789/0 120808 : [INF] pgmap v3000960: 576 pgs: 295 active, 271 active+clean, 9 peering, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
> 2013-02-20 04:38:57.235082 mon.0 10.200.63.130:6789/0 120809 : [INF] pgmap v3000961: 576 pgs: 295 active, 281 peering; 410 GB data, 841 GB used, 2882 GB / 3724 GB avail
> 2013-02-20 04:38:48.660979 osd.1 10.200.63.133:6801/21178 997 : [WRN] map e791 wrongly marked me down
> 2013-02-20 04:39:01.158928 mon.0 10.200.63.130:6789/0 120810 : [INF] pgmap v3000962: 576 pgs: 295 active, 281 peering; 410 GB data, 841 GB used, 2882 GB / 3724 GB avail
> 2013-02-20 04:39:19.111723 osd.0 10.200.63.132:6801/18444 800 : [WRN] 6 slow requests, 6 included below; oldest blocked for > 30.770732 secs
> 2013-02-20 04:39:19.111729 osd.0 10.200.63.132:6801/18444 801 : [WRN] slow request 30.770732 seconds old, received at 2013-02-20 04:38:48.340908: osd_op(client.9971.0:685981 rb.0.1c62.2ae8944a.0000000003aa [write 3878912~4096] 2.c82ee285) v4 currently reached pg
> 2013-02-20 04:39:19.111735 osd.0 10.200.63.132:6801/18444 802 : [WRN] slow request 30.770225 seconds old, received at 2013-02-20 04:38:48.341415: osd_op(client.9971.0:685984 rb.0.1c62.2ae8944a.000000000439 [write 364544~20480] 2.b16f5ace) v4 currently reached pg
> 2013-02-20 04:39:19.111738 osd.0 10.200.63.132:6801/18444 803 : [WRN] slow request 30.456112 seconds old, received at 2013-02-20 04:38:48.655528: osd_op(client.9986.0:178417 broot.rbd [watch 1~0] 2.d30a2f40) v4 currently reached pg
> 2013-02-20 04:39:19.111743 osd.0 10.200.63.132:6801/18444 804 : [WRN] slow request 30.456106 seconds old, received at 2013-02-20 04:38:48.655534: osd_op(client.9989.0:215170 broot-nfs2.rbd [watch 1~0] 2.7802d31e) v4 currently reached pg
> 2013-02-20 04:39:19.111747 osd.0 10.200.63.132:6801/18444 805 : [WRN] slow request 30.455860 seconds old, received at 2013-02-20 04:38:48.655780: osd_op(client.9968.0:302450 dns1.rbd [watch 1~0] 2.383712c1) v4 currently reached pg
> 
> ----------------------------------------------------------------------
> grep osd_ping ceph-osd.0.log
> ----------------------------------------------------------------------
> 2013-02-20 04:37:57.347387 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:37:57.347384) v2 -- ?+0 0xbd248c0 con 0xa2dcdc0
> 2013-02-20 04:37:57.349406 7f1090f23700  1 -- 192.168.254.132:0/18444 <== osd.1 192.168.254.133:6803/21178 79153 ==== osd_ping(ping_reply e790 stamp 2013-02-20 04:37:57.347384) v2 ==== 47+0+0 (2837779695 0 0) 0xbc28a80 con 0xa2dcdc0
> 2013-02-20 04:37:57.847588 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:37:57.847586) v2 -- ?+0 0xa1ea540 con 0xa2dcdc0
> 2013-02-20 04:37:58.050400 7f1090f23700  1 -- 192.168.254.132:0/18444 <== osd.1 192.168.254.133:6803/21178 79154 ==== osd_ping(ping_reply e790 stamp 2013-02-20 04:37:57.847586) v2 ==== 47+0+0 (3920125339 0 0) 0xdb34000 con 0xa2dcdc0
> <<<< osd.0 sees delayed ping_replies from here >>>>
> 2013-02-20 04:37:59.547719 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:37:59.547716) v2 -- ?+0 0x73e5340 con 0xa2dcdc0
> 2013-02-20 04:38:00.047911 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:38:00.047909) v2 -- ?+0 0x99bbc00 con 0xa2dcdc0
> 2013-02-20 04:38:01.748080 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:38:01.748077) v2 -- ?+0 0xa1eb6c0 con 0xa2dcdc0
> 2013-02-20 04:38:03.448223 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:38:03.448220) v2 -- ?+0 0x99bb180 con 0xa2dcdc0
> 2013-02-20 04:38:03.948413 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:38:03.948411) v2 -- ?+0 0x99ba700 con 0xa2dcdc0
> 2013-02-20 04:38:04.448601 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:38:04.448598) v2 -- ?+0 0x99ba540 con 0xa2dcdc0
> 2013-02-20 04:38:04.948724 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:38:04.948720) v2 -- ?+0 0xa1ebdc0 con 0xa2dcdc0
> 2013-02-20 04:38:08.448860 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:38:08.448856) v2 -- ?+0 0xb676380 con 0xa2dcdc0
> 2013-02-20 04:38:08.949028 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:38:08.949025) v2 -- ?+0 0xbc28380 con 0xa2dcdc0
> 2013-02-20 04:38:10.649263 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:38:10.649260) v2 -- ?+0 0x965d340 con 0xa2dcdc0
> 2013-02-20 04:38:11.749458 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:38:11.749455) v2 -- ?+0 0xa1eafc0 con 0xa2dcdc0
> 2013-02-20 04:38:12.799154 7f1090f23700  1 -- 192.168.254.132:0/18444 <== osd.1 192.168.254.133:6803/21178 79155 ==== osd_ping(ping_reply e790 stamp 2013-02-20 04:37:59.547716) v2 ==== 47+0+0 (1242454262 0 0) 0xbc29880 con 0xa2dcdc0
> 2013-02-20 04:38:12.799459 7f1090f23700  1 -- 192.168.254.132:0/18444 <== osd.1 192.168.254.133:6803/21178 79156 ==== osd_ping(ping_reply e790 stamp 2013-02-20 04:38:00.047909) v2 ==== 47+0+0 (3852750933 0 0) 0xcaff180 con 0xa2dcdc0
> 2013-02-20 04:38:12.799496 7f1090f23700  1 -- 192.168.254.132:0/18444 <== osd.1 192.168.254.133:6803/21178 79157 ==== osd_ping(ping_reply e790 stamp 2013-02-20 04:38:01.748077) v2 ==== 47+0+0 (3672189647 0 0) 0xb677340 con 0xa2dcdc0
> 2013-02-20 04:38:12.799542 7f1090f23700  1 -- 192.168.254.132:0/18444 <== osd.1 192.168.254.133:6803/21178 79158 ==== osd_ping(ping_reply e790 stamp 2013-02-20 04:38:03.448220) v2 ==== 47+0+0 (38366945 0 0) 0xbc28c40 con 0xa2dcdc0
> 2013-02-20 04:38:12.799554 7f1090f23700  1 -- 192.168.254.132:0/18444 <== osd.1 192.168.254.133:6803/21178 79159 ==== osd_ping(ping_reply e790 stamp 2013-02-20 04:38:03.948411) v2 ==== 47+0+0 (83904766 0 0) 0x884ee00 con 0xa2dcdc0
> 2013-02-20 04:38:12.799573 7f1090f23700  1 -- 192.168.254.132:0/18444 <== osd.1 192.168.254.133:6803/21178 79160 ==== osd_ping(ping_reply e790 stamp 2013-02-20 04:38:04.448598) v2 ==== 47+0+0 (2688468082 0 0) 0x10c5c1c0 con 0xa2dcdc0
> 2013-02-20 04:38:12.799667 7f1090f23700  1 -- 192.168.254.132:0/18444 <== osd.1 192.168.254.133:6803/21178 79161 ==== osd_ping(ping_reply e790 stamp 2013-02-20 04:38:04.948720) v2 ==== 47+0+0 (4187258751 0 0) 0xb21a540 con 0xa2dcdc0
> 2013-02-20 04:38:12.799689 7f1090f23700  1 -- 192.168.254.132:0/18444 <== osd.1 192.168.254.133:6803/21178 79162 ==== osd_ping(ping_reply e790 stamp 2013-02-20 04:38:08.448856) v2 ==== 47+0+0 (4176431512 0 0) 0xb21b180 con 0xa2dcdc0
> 2013-02-20 04:38:12.799710 7f1090f23700  1 -- 192.168.254.132:0/18444 <== osd.1 192.168.254.133:6803/21178 79163 ==== osd_ping(ping_reply e790 stamp 2013-02-20 04:38:08.949025) v2 ==== 47+0+0 (2888471344 0 0) 0xb21b340 con 0xa2dcdc0
> 2013-02-20 04:38:12.799728 7f1090f23700  1 -- 192.168.254.132:0/18444 <== osd.1 192.168.254.133:6803/21178 79164 ==== osd_ping(ping_reply e790 stamp 2013-02-20 04:38:10.649260) v2 ==== 47+0+0 (3060931781 0 0) 0xb21aa80 con 0xa2dcdc0
> 2013-02-20 04:38:12.799745 7f1090f23700  1 -- 192.168.254.132:0/18444 <== osd.1 192.168.254.133:6803/21178 79165 ==== osd_ping(ping_reply e790 stamp 2013-02-20 04:38:11.749455) v2 ==== 47+0+0 (2767620502 0 0) 0x8d4e380 con 0xa2dcdc0
> 2013-02-20 04:38:14.049649 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:38:14.049645) v2 -- ?+0 0xa1ea000 con 0xa2dcdc0
> 2013-02-20 04:38:14.260608 7f1090f23700  1 -- 192.168.254.132:0/18444 <== osd.1 192.168.254.133:6803/21178 79166 ==== osd_ping(ping_reply e790 stamp 2013-02-20 04:38:14.049645) v2 ==== 47+0+0 (462572634 0 0) 0xbc29a40 con 0xa2dcdc0
> 2013-02-20 04:38:15.149828 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:38:15.149826) v2 -- ?+0 0xac85340 con 0xa2dcdc0
> 2013-02-20 04:38:15.151892 7f1090f23700  1 -- 192.168.254.132:0/18444 <== osd.1 192.168.254.133:6803/21178 79167 ==== osd_ping(ping_reply e790 stamp 2013-02-20 04:38:15.149826) v2 ==== 47+0+0 (2092320694 0 0) 0xdb34380 con 0xa2dcdc0
> 2013-02-20 04:38:21.050059 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:38:21.050056) v2 -- ?+0 0xb677c00 con 0xa2dcdc0
> 2013-02-20 04:38:25.750198 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:38:25.750195) v2 -- ?+0 0xbd24000 con 0xa2dcdc0
> 2013-02-20 04:38:28.650370 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:38:28.650367) v2 -- ?+0 0x7e94a80 con 0xa2dcdc0
> 2013-02-20 04:38:32.150553 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:38:32.150550) v2 -- ?+0 0xb677500 con 0xa2dcdc0
> 2013-02-20 04:38:34.450740 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:38:34.450737) v2 -- ?+0 0x9bc7500 con 0xa2dcdc0
> 2013-02-20 04:38:35.369720 7f109af37700 -1 osd.0 790 heartbeat_check: no reply from osd.1 since 2013-02-20 04:38:15.149826 (cutoff 2013-02-20 04:38:15.369719)
> 2013-02-20 04:38:36.369895 7f109af37700 -1 osd.0 790 heartbeat_check: no reply from osd.1 since 2013-02-20 04:38:15.149826 (cutoff 2013-02-20 04:38:16.369894)
> 
> ----------------------------------------------------------------------
> grep osd_ping ceph-osd.1.log
> ----------------------------------------------------------------------
> 2013-02-20 04:37:57.847878 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79154 ==== osd_ping(ping e790 stamp 2013-02-20 04:37:57.847586) v2 ==== 47+0+0 (2625351075 0 0) 0xb441880 con 0xb9a89a0
> 2013-02-20 04:37:57.847957 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:37:57.847586) v2 -- ?+0 0xe5921c0 con 0xb9a89a0
> <<<< osd.0 sees delayed ping_replies from here >>>>
> 2013-02-20 04:37:59.547994 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79155 ==== osd_ping(ping e790 stamp 2013-02-20 04:37:59.547716) v2 ==== 47+0+0 (1071491278 0 0) 0xb440700 con 0xb9a89a0
> 2013-02-20 04:37:59.548066 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:37:59.547716) v2 -- ?+0 0xb441880 con 0xb9a89a0
> 2013-02-20 04:38:00.048174 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79156 ==== osd_ping(ping e790 stamp 2013-02-20 04:38:00.047909) v2 ==== 47+0+0 (2423758957 0 0) 0xc987a40 con 0xb9a89a0
> 2013-02-20 04:38:00.048262 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:38:00.047909) v2 -- ?+0 0xb440700 con 0xb9a89a0
> 2013-02-20 04:38:01.748248 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79157 ==== osd_ping(ping e790 stamp 2013-02-20 04:38:01.748077) v2 ==== 47+0+0 (2939345655 0 0) 0xb6d96c0 con 0xb9a89a0
> 2013-02-20 04:38:01.748330 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:38:01.748077) v2 -- ?+0 0xc987a40 con 0xb9a89a0
> 2013-02-20 04:38:03.448435 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79158 ==== osd_ping(ping e790 stamp 2013-02-20 04:38:03.448220) v2 ==== 47+0+0 (2006621913 0 0) 0xb71c540 con 0xb9a89a0
> 2013-02-20 04:38:03.448531 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:38:03.448220) v2 -- ?+0 0xb6d96c0 con 0xb9a89a0
> 2013-02-20 04:38:04.163566 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79159 ==== osd_ping(ping e790 stamp 2013-02-20 04:38:03.948411) v2 ==== 47+0+0 (1892923590 0 0) 0xc02b6c0 con 0xb9a89a0
> 2013-02-20 04:38:04.163648 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:38:03.948411) v2 -- ?+0 0xb71c540 con 0xb9a89a0
> 2013-02-20 04:38:04.448837 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79160 ==== osd_ping(ping e790 stamp 2013-02-20 04:38:04.448598) v2 ==== 47+0+0 (3589092426 0 0) 0xc02a8c0 con 0xb9a89a0
> 2013-02-20 04:38:04.448876 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:38:04.448598) v2 -- ?+0 0xc02b6c0 con 0xb9a89a0
> 2013-02-20 04:38:04.949019 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79161 ==== osd_ping(ping e790 stamp 2013-02-20 04:38:04.948720) v2 ==== 47+0+0 (2353499975 0 0) 0x6fae700 con 0xb9a89a0
> 2013-02-20 04:38:04.949106 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:38:04.948720) v2 -- ?+0 0xc02a8c0 con 0xb9a89a0
> 2013-02-20 04:38:08.449126 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79162 ==== osd_ping(ping e790 stamp 2013-02-20 04:38:08.448856) v2 ==== 47+0+0 (2369567136 0 0) 0xc02ac40 con 0xb9a89a0
> 2013-02-20 04:38:08.449210 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:38:08.448856) v2 -- ?+0 0x6fae700 con 0xb9a89a0
> 2013-02-20 04:38:08.949215 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79163 ==== osd_ping(ping e790 stamp 2013-02-20 04:38:08.949025) v2 ==== 47+0+0 (3656999688 0 0) 0xc02ba40 con 0xb9a89a0
> 2013-02-20 04:38:08.949277 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:38:08.949025) v2 -- ?+0 0xc02ac40 con 0xb9a89a0
> 2013-02-20 04:38:10.649580 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79164 ==== osd_ping(ping e790 stamp 2013-02-20 04:38:10.649260) v2 ==== 47+0+0 (3282169085 0 0) 0xc02a000 con 0xb9a89a0
> 2013-02-20 04:38:10.649647 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:38:10.649260) v2 -- ?+0 0xc02ba40 con 0xb9a89a0
> 2013-02-20 04:38:11.749750 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79165 ==== osd_ping(ping e790 stamp 2013-02-20 04:38:11.749455) v2 ==== 47+0+0 (3508894126 0 0) 0xe593180 con 0xb9a89a0
> 2013-02-20 04:38:11.749835 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:38:11.749455) v2 -- ?+0 0xc02a000 con 0xb9a89a0
> 2013-02-20 04:38:14.049868 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79166 ==== osd_ping(ping e790 stamp 2013-02-20 04:38:14.049645) v2 ==== 47+0+0 (1849801826 0 0) 0xc02bdc0 con 0xb9a89a0
> 2013-02-20 04:38:14.049943 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:38:14.049645) v2 -- ?+0 0xe593180 con 0xb9a89a0
> 2013-02-20 04:38:15.150155 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79167 ==== osd_ping(ping e790 stamp 2013-02-20 04:38:15.149826) v2 ==== 47+0+0 (157661070 0 0) 0xb6d9340 con 0xb9a89a0
> 2013-02-20 04:38:15.150242 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:38:15.149826) v2 -- ?+0 0xc02bdc0 con 0xb9a89a0
> 2013-02-20 04:38:21.050348 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79168 ==== osd_ping(ping e790 stamp 2013-02-20 04:38:21.050056) v2 ==== 47+0+0 (2307149705 0 0) 0xbf4bc00 con 0xb9a89a0
> 2013-02-20 04:38:21.050433 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:38:21.050056) v2 -- ?+0 0xb6d9340 con 0xb9a89a0
> 2013-02-20 04:38:25.750415 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79169 ==== osd_ping(ping e790 stamp 2013-02-20 04:38:25.750195) v2 ==== 47+0+0 (3873827452 0 0) 0xe592000 con 0xb9a89a0
> 2013-02-20 04:38:25.750548 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:38:25.750195) v2 -- ?+0 0xbf4bc00 con 0xb9a89a0
> 2013-02-20 04:38:28.650634 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79170 ==== osd_ping(ping e790 stamp 2013-02-20 04:38:28.650367) v2 ==== 47+0+0 (718534970 0 0) 0x78ff880 con 0xb9a89a0
> 2013-02-20 04:38:28.650713 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:38:28.650367) v2 -- ?+0 0xe592000 con 0xb9a89a0
> 2013-02-20 04:38:32.150855 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79171 ==== osd_ping(ping e790 stamp 2013-02-20 04:38:32.150550) v2 ==== 47+0+0 (1909552382 0 0) 0x6fae000 con 0xb9a89a0
> 2013-02-20 04:38:32.150939 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:38:32.150550) v2 -- ?+0 0x78ff880 con 0xb9a89a0
> 2013-02-20 04:38:34.450994 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79172 ==== osd_ping(ping e790 stamp 2013-02-20 04:38:34.450737) v2 ==== 47+0+0 (886776134 0 0) 0xd304700 con 0xb9a89a0
> 2013-02-20 04:38:34.451033 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:38:34.450737) v2 -- ?+0 0x6fae000 con 0xb9a89a0
> 2013-02-20 04:38:37.351175 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79173 ==== osd_ping(ping e790 stamp 2013-02-20 04:38:37.350908) v2 ==== 47+0+0 (2444468724 0 0) 0xe593340 con 0xb9a89a0
> 2013-02-20 04:38:37.351215 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:38:37.350908) v2 -- ?+0 0xd304700 con 0xb9a89a0
> 2013-02-20 04:38:41.451417 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79174 ==== osd_ping(ping e790 stamp 2013-02-20 04:38:41.451094) v2 ==== 47+0+0 (4159941099 0 0) 0xb6d8380 con 0xb9a89a0
> 2013-02-20 04:38:41.451477 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:38:41.451094) v2 -- ?+0 0xe593340 con 0xb9a89a0
> 2013-02-20 04:38:43.751635 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79175 ==== osd_ping(ping e790 stamp 2013-02-20 04:38:43.751289) v2 ==== 47+0+0 (800627449 0 0) 0xbb0b180 con 0xb9a89a0
> 2013-02-20 04:38:43.751675 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:38:43.751289) v2 -- ?+0 0xb6d8380 con 0xb9a89a0
> 2013-02-20 04:38:53.151872 7f564b733700  1 -- 192.168.254.133:6802/21178 <== osd.0 192.168.254.132:0/18444 1 ==== osd_ping(ping e794 stamp 2013-02-20 04:38:53.151589) v2 ==== 47+0+0 (3010557341 0 0) 0x78ff880 con 0x10d39340
> 2013-02-20 04:38:53.151908 7f564b733700  1 -- 192.168.254.133:6802/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e794 stamp 2013-02-20 04:38:53.151589) v2 -- ?+0 0xbb0b180 con 0x10d39340
> 2013-02-20 04:38:56.052292 7f564b733700  1 -- 192.168.254.133:6802/21178 <== osd.0 192.168.254.132:0/18444 2 ==== osd_ping(ping e794 stamp 2013-02-20 04:38:56.051788) v2 ==== 47+0+0 (4254185894 0 0) 0xb6d9340 con 0x10d39340
> 2013-02-20 04:38:56.052330 7f564b733700  1 -- 192.168.254.133:6802/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e794 stamp 2013-02-20 04:38:56.051788) v2 -- ?+0 0x78ff880 con 0x10d39340
> 2013-02-20 04:38:57.152438 7f564b733700  1 -- 192.168.254.133:6802/21178 <== osd.0 192.168.254.132:0/18444 3 ==== osd_ping(ping e794 stamp 2013-02-20 04:38:57.152006) v2 ==== 47+0+0 (1503720898 0 0) 0xb6d96c0 con 0x10d39340
> 2013-02-20 04:38:57.152472 7f564b733700  1 -- 192.168.254.133:6802/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e794 stamp 2013-02-20 04:38:57.152006) v2 -- ?+0 0xb6d9340 con 0x10d39340
> 
> ----------------------------------------------------------------------
> ceph-osd.0.log, showing osd.0 <=> osd.1 activity whilst osd.1 ping_replies aren't being seen
> ----------------------------------------------------------------------
> <<<< osd.0 sees delayed ping_replies from here >>>>
> 2013-02-20 04:37:59.547719 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:37:59.547716) v2 -- ?+0 0x73e5340 con 0xa2dcdc0
> 2013-02-20 04:38:00.047911 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:38:00.047909) v2 -- ?+0 0x99bbc00 con 0xa2dcdc0
> 2013-02-20 04:38:00.342856 7f108e71e700  1 -- 192.168.254.132:6802/18444 --> osd.1 192.168.254.133:6801/21178 -- osd_sub_op(client.9811.0:498225 2.51 4d55cd51/rb.0.2150.2ae8944a.000000000628/head//2 [] v 790'271511 snapset=0=[]:[] snapc=0=[]) v7 -- ?+8799 0x8f91400
> 2013-02-20 04:38:01.046192 7f1091f25700  1 -- 192.168.254.132:6802/18444 <== osd.1 192.168.254.133:6801/21178 4989786 ==== osd_sub_op(unknown.0.0:0 2.4e 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[] snapc=0=[]) v7 ==== 955+0+21617430 (245718815 0 180673924) 0xd19d200 con 0x9571080
> 2013-02-20 04:38:01.046297 7f1091f25700  1 -- 192.168.254.132:6802/18444 <== osd.1 192.168.254.133:6801/21178 4989787 ==== osd_sub_op_reply(client.9971.0:685924 2.4e b16f5ace/rb.0.1c62.2ae8944a.000000000439/head//2 [] ondisk, result = 0) v1 ==== 157+0+0 (3427135234 0 0) 0xa9c4f00 con 0x9571080
> 2013-02-20 04:38:01.046352 7f1091f25700  1 -- 192.168.254.132:6802/18444 <== osd.1 192.168.254.133:6801/21178 4989788 ==== osd_sub_op_reply(client.9953.0:367558 2.5a 6d3052da/rbd_data.242f2ae8944a.000000000000002c/head//2 [] ondisk, result = 0) v1 ==== 164+0+0 (2641157471 0 0) 0x86f2280 con 0x9571080
> 2013-02-20 04:38:01.046390 7f1091f25700  1 -- 192.168.254.132:6802/18444 <== osd.1 192.168.254.133:6801/21178 4989789 ==== osd_sub_op_reply(client.9953.0:367559 2.28 8340fa28/rbd_data.242f2ae8944a.0000000000000096/head//2 [] ondisk, result = 0) v1 ==== 164+0+0 (2759328511 0 0) 0x95d2780 con 0x9571080
> 2013-02-20 04:38:01.046743 7f1091f25700  1 -- 192.168.254.132:6802/18444 <== osd.1 192.168.254.133:6801/21178 4989790 ==== osd_sub_op_reply(client.9953.0:367560 2.5a 6678d7da/rbd_data.242f2ae8944a.0000000000000b8b/head//2 [] ondisk, result = 0) v1 ==== 164+0+0 (602117107 0 0) 0xda1fb80 con 0x9571080
> 2013-02-20 04:38:01.047020 7f1091f25700  1 -- 192.168.254.132:6802/18444 <== osd.1 192.168.254.133:6801/21178 4989791 ==== osd_sub_op_reply(client.9811.0:498225 2.51 4d55cd51/rb.0.2150.2ae8944a.000000000628/head//2 [] ondisk, result = 0) v1 ==== 157+0+0 (435514219 0 0) 0xac43400 con 0x9571080
> 2013-02-20 04:38:01.047240 7f1091f25700  1 -- 192.168.254.132:6802/18444 <== osd.1 192.168.254.133:6801/21178 4989792 ==== osd_sub_op(client.9995.0:4402170 2.77 3b454af7/rb.0.1cf8.2ae8944a.000000000e35/head//2 [] v 790'353333 snapset=0=[]:[] snapc=0=[]) v7 ==== 1107+0+8799 (240510986 0 1818934932) 0x8f91400 con 0x9571080
> 2013-02-20 04:38:01.047600 7f108cf1b700  1 -- 192.168.254.132:6802/18444 --> osd.1 192.168.254.133:6801/21178 -- replica scrub(pg: 2.4e,from:0'0,to:0'0,epoch:790,start:6388b3ce//0//-1,end:cff8e3ce//0//-1,chunky:1,deep:0,version:4) v4 -- ?+0 0xb0b3b00
> 2013-02-20 04:38:01.050467 7f108df1d700  1 -- 192.168.254.132:6802/18444 --> osd.1 192.168.254.133:6801/21178 -- osd_sub_op(client.9811.0:498226 2.51 4d55cd51/rb.0.2150.2ae8944a.000000000628/head//2 [] v 790'271512 snapset=0=[]:[] snapc=0=[]) v7 -- ?+4703 0xd19d200
> 2013-02-20 04:38:01.060518 7f1091f25700  1 -- 192.168.254.132:6802/18444 <== osd.1 192.168.254.133:6801/21178 4989793 ==== osd_sub_op(client.9968.0:302420 2.74 2c5fc7f4/rb.0.1c64.74b0dc51.00000000043b/head//2 [] v 790'498230 snapset=0=[]:[] snapc=0=[]) v7 ==== 1107+0+16991 (2821939739 0 1047147434) 0xd19d200 con 0x9571080
> 2013-02-20 04:38:01.072310 7f1091f25700  1 -- 192.168.254.132:6802/18444 <== osd.1 192.168.254.133:6801/21178 4989794 ==== osd_sub_op_reply(client.9811.0:498226 2.51 4d55cd51/rb.0.2150.2ae8944a.000000000628/head//2 [] ondisk, result = 0) v1 ==== 157+0+0 (3589886011 0 0) 0xaf1ef00 con 0x9571080
> 2013-02-20 04:38:01.086276 7f1093f29700  1 -- 192.168.254.132:6802/18444 --> osd.1 192.168.254.133:6801/21178 -- osd_sub_op_reply(client.9995.0:4402170 2.77 3b454af7/rb.0.1cf8.2ae8944a.000000000e35/head//2 [] ondisk, result = 0) v1 -- ?+0 0xc477400
> 2013-02-20 04:38:01.090522 7f1091f25700  1 -- 192.168.254.132:6802/18444 <== osd.1 192.168.254.133:6801/21178 4989795 ==== osd_sub_op(client.9995.0:4402171 2.77 3b454af7/rb.0.1cf8.2ae8944a.000000000e35/head//2 [] v 790'353334 snapset=0=[]:[] snapc=0=[]) v7 ==== 1107+0+4703 (4105904555 0 2838991357) 0xbddea00 con 0x9571080
> 2013-02-20 04:38:01.097246 7f1093f29700  1 -- 192.168.254.132:6802/18444 --> osd.1 192.168.254.133:6801/21178 -- osd_sub_op_reply(client.9968.0:302420 2.74 2c5fc7f4/rb.0.1c64.74b0dc51.00000000043b/head//2 [] ondisk, result = 0) v1 -- ?+0 0x9055400
> 2013-02-20 04:38:01.100259 7f1091f25700  1 -- 192.168.254.132:6802/18444 <== osd.1 192.168.254.133:6801/21178 4989796 ==== osd_sub_op(client.9968.0:302421 2.74 2c5fc7f4/rb.0.1c64.74b0dc51.00000000043b/head//2 [] v 790'498231 snapset=0=[]:[] snapc=0=[]) v7 ==== 1107+0+4703 (3340239323 0 2873835958) 0x810f400 con 0x9571080
> 2013-02-20 04:38:01.110432 7f1093f29700  1 -- 192.168.254.132:6802/18444 --> osd.1 192.168.254.133:6801/21178 -- osd_sub_op_reply(client.9995.0:4402171 2.77 3b454af7/rb.0.1cf8.2ae8944a.000000000e35/head//2 [] ondisk, result = 0) v1 -- ?+0 0xc477680
> 2013-02-20 04:38:01.121128 7f1093f29700  1 -- 192.168.254.132:6802/18444 --> osd.1 192.168.254.133:6801/21178 -- osd_sub_op_reply(client.9968.0:302421 2.74 2c5fc7f4/rb.0.1c64.74b0dc51.00000000043b/head//2 [] ondisk, result = 0) v1 -- ?+0 0x296d680
> 2013-02-20 04:38:01.366189 7f108e71e700  1 -- 192.168.254.132:6802/18444 --> osd.1 192.168.254.133:6801/21178 -- osd_sub_op(client.10001.0:3423766 2.7d f598cffd/rb.0.209f.74b0dc51.000000000a00/head//2 [] v 790'768516 snapset=0=[]:[] snapc=0=[]) v7 -- ?+4703 0x9fe1400
> 2013-02-20 04:38:01.366377 7f108e71e700  1 -- 192.168.254.132:6802/18444 --> osd.1 192.168.254.133:6801/21178 -- osd_sub_op(client.10001.0:3423767 2.7d f598cffd/rb.0.209f.74b0dc51.000000000a00/head//2 [] v 790'768517 snapset=0=[]:[] snapc=0=[]) v7 -- ?+4703 0xb978a00
> 2013-02-20 04:38:01.366519 7f108e71e700  1 -- 192.168.254.132:6802/18444 --> osd.1 192.168.254.133:6801/21178 -- osd_sub_op(client.10001.0:3423768 2.7d f598cffd/rb.0.209f.74b0dc51.000000000a00/head//2 [] v 790'768518 snapset=0=[]:[] snapc=0=[]) v7 -- ?+4703 0x815d200
> 2013-02-20 04:38:01.366728 7f108df1d700  1 -- 192.168.254.132:6802/18444 --> osd.1 192.168.254.133:6801/21178 -- osd_sub_op(client.10001.0:3423770 2.7d f598cffd/rb.0.209f.74b0dc51.000000000a00/head//2 [] v 790'768519 snapset=0=[]:[] snapc=0=[]) v7 -- ?+4703 0xc557e00
> 2013-02-20 04:38:01.367681 7f1091f25700  1 -- 192.168.254.132:6802/18444 <== osd.1 192.168.254.133:6801/21178 4989797 ==== osd_sub_op(client.10001.0:3423769 2.17 37564117/rb.0.209f.74b0dc51.000000001200/head//2 [] v 790'353128 snapset=0=[]:[] snapc=0=[]) v7 ==== 1107+0+4703 (1225650648 0 1162937880) 0xa052800 con 0x9571080
> 2013-02-20 04:38:01.378938 7f1091f25700  1 -- 192.168.254.132:6802/18444 <== osd.1 192.168.254.133:6801/21178 4989798 ==== osd_sub_op_reply(client.10001.0:3423766 2.7d f598cffd/rb.0.209f.74b0dc51.000000000a00/head//2 [] ondisk, result = 0) v1 ==== 157+0+0 (2250676101 0 0) 0xc476800 con 0x9571080
> 2013-02-20 04:38:01.379109 7f1091f25700  1 -- 192.168.254.132:6802/18444 <== osd.1 192.168.254.133:6801/21178 4989799 ==== osd_sub_op_reply(client.10001.0:3423767 2.7d f598cffd/rb.0.209f.74b0dc51.000000000a00/head//2 [] ondisk, result = 0) v1 ==== 157+0+0 (2794751629 0 0) 0x7b1aa00 con 0x9571080
> 2013-02-20 04:38:01.379162 7f1091f25700  1 -- 192.168.254.132:6802/18444 <== osd.1 192.168.254.133:6801/21178 4989800 ==== osd_sub_op_reply(client.10001.0:3423768 2.7d f598cffd/rb.0.209f.74b0dc51.000000000a00/head//2 [] ondisk, result = 0) v1 ==== 157+0+0 (41013162 0 0) 0x8086a00 con 0x9571080
> 2013-02-20 04:38:01.379396 7f1091f25700  1 -- 192.168.254.132:6802/18444 <== osd.1 192.168.254.133:6801/21178 4989801 ==== osd_sub_op_reply(client.10001.0:3423770 2.7d f598cffd/rb.0.209f.74b0dc51.000000000a00/head//2 [] ondisk, result = 0) v1 ==== 157+0+0 (2912526559 0 0) 0xb054500 con 0x9571080
> 
> ----------------------------------------------------------------------
> osd.0 load and disk
> ----------------------------------------------------------------------
>                     load %user %nice  %sys  %iow  %stl %idle  dev rrqm/s wrqm/s    r/s    w/s    rkB/s    wkB/s arq-sz aqu-sz  await  rwait wwait %util
> 2013-02-20-04:37:56  0.6   1.4   0.0   1.4   0.5   0.0  94.1  sdn    0.0    0.0    0.0   19.1     0.00   182.00  19.09   0.59  30.87   0.00  30.9  14.5
> 2013-02-20-04:38:11  0.6   1.2   0.0   1.1   1.0   0.0  93.4  sdn    0.0    0.0    0.0   20.8     0.00   152.00  14.62   0.79  38.14   0.00  38.1  12.5
> 2013-02-20-04:38:26  0.6   1.3   0.0   1.1   0.5   0.0  94.0  sdn    0.0    0.0    0.0   17.4     0.00   170.53  19.60   0.69  39.85   0.00  39.8  14.7
> 2013-02-20-04:38:41  0.6   1.1   0.0   1.2   0.5   0.0  94.3  sdn    0.0    0.2    0.0   21.8     0.00   179.83  16.50   0.61  28.17   0.00  28.2  12.1
> 2013-02-20-04:38:56  0.8   1.7   0.0   1.2   4.3   0.0  90.6  sdn    0.0    0.1    0.2  141.1     1.60  1838.33  26.04  41.68 294.67 283.33 294.7  53.9
> 2013-02-20-04:39:11  0.7   0.3   0.0   0.3   0.3   0.0  98.3  sdn    0.0    0.0    0.0    2.3     0.00    45.20  39.88   0.16  85.00   0.00  85.0   4.8
> 2013-02-20-04:39:26  0.6   0.1   0.0   0.1   0.0   0.0  99.2  sdn    0.0    0.0    0.0    0.0     0.00     0.00   0.00   0.00   0.00   0.00   0.0   0.0
> 2013-02-20-04:39:41  0.5   0.4   0.0   0.3   0.1   0.0  99.4  sdn    0.0    0.0    0.0    0.0     0.00     0.00   0.00   0.00   0.00   0.00   0.0   0.0
> 2013-02-20-04:39:56  0.4   0.2   0.0   0.2   0.1   0.0  99.5  sdn    0.0    0.0    0.0    0.0     0.00     0.00   0.00   0.00   0.00   0.00   0.0   0.0
> 2013-02-20-04:40:11  0.3   0.3   0.0   0.3   0.0   0.0  99.5  sdn    0.0    0.0    0.0    0.0     0.00     0.00   0.00   0.00   0.00   0.00   0.0   0.0
> 
> ----------------------------------------------------------------------
> osd.1 load and disk
> ----------------------------------------------------------------------
>                     load %user %nice  %sys  %iow  %stl %idle  dev rrqm/s wrqm/s    r/s    w/s    rkB/s    wkB/s arq-sz aqu-sz  await  rwait wwait %util
> 2013-02-20-04:37:57  0.8   0.3   0.0   0.2   0.6   0.0  97.7  sdg    0.0    0.1    2.5   19.7    10.40   179.93  17.15   0.97  43.63  38.16  44.3  19.3
> 2013-02-20-04:38:12  1.0   0.3   0.0   0.3   0.7   0.0  97.4  sdg    0.0    0.1    2.8   28.0    11.47   232.20  15.82   0.71  23.20  25.48  23.0  16.1
> 2013-02-20-04:38:27  0.8   0.4   0.0   0.2   0.7   0.0  97.4  sdg    0.0    0.0    2.7   11.3    10.67   102.87  16.30   0.31  22.30  14.25  24.2  11.1
> 2013-02-20-04:38:42  0.7   0.2   0.0   0.2   0.8   0.0  97.6  sdg    0.0    0.0    2.3   22.3     9.07   198.07  16.84   1.15  46.72  31.18  48.3  20.9
> 2013-02-20-04:38:57  0.6   0.6   0.0   0.5   5.2   0.0  92.9  sdg    0.0    0.0   15.8  162.7    63.73  1613.37  18.79  32.51 182.07  24.94 197.3  63.1
> 2013-02-20-04:39:12  0.5   0.0   0.0   0.0   0.1   0.0  99.8  sdg    0.0    0.0    0.0    5.8     0.00    76.87  26.51   0.32  57.82   0.00  57.8   9.2
> 2013-02-20-04:39:27  0.4   0.0   0.0   0.0   0.1   0.0  99.9  sdg    0.0    0.0    0.0    0.0     0.00     0.00   0.00   0.00   0.00   0.00   0.0   0.0
> 2013-02-20-04:39:42  0.6   0.0   0.0   0.0   0.1   0.0  99.8  sdg    0.0    0.0    0.0    0.0     0.00     0.00   0.00   0.00   0.00   0.00   0.0   0.0
> 2013-02-20-04:39:57  0.5   0.0   0.0   0.0   0.1   0.0  99.8  sdg    0.0    0.0    0.0    0.0     0.00     0.00   0.00   0.00   0.00   0.00   0.0   0.0
> 2013-02-20-04:40:12  0.4   0.0   0.0   0.0   0.1   0.0  99.8  sdg    0.0    0.0    0.0    0.0     0.00     0.00   0.00   0.00   0.00   0.00   0.0   0.0
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Mon losing touch with OSDs
  2013-02-22  3:06             ` Chris Dunlop
@ 2013-02-22 21:57               ` Sage Weil
  2013-02-22 23:35                 ` Chris Dunlop
  0 siblings, 1 reply; 25+ messages in thread
From: Sage Weil @ 2013-02-22 21:57 UTC (permalink / raw)
  To: Chris Dunlop; +Cc: ceph-devel

On Fri, 22 Feb 2013, Chris Dunlop wrote:
> G'day,
> 
> It seems there might be two issues here: the first being the delayed
> receipt of ping replies causing a seemingly otherwise healthy osd to be
> marked down, and the second being the lack of recovery once the downed
> osd is recognised as up again.
> 
> Is it worth my opening tracker reports for this, just so it doesn't get
> lost?

I just looked at the logs.  I can't tell what happened to cause that 10 
second delay.  Strangely, messages were passing from 0 -> 1, but nothing 
came back from 1 -> 0 (although 1 was queuing, if not sending, them).

The strange bit is that after this, you get those indefinite hangs.  From 
the logs it looks like the OSD re-bound to an old port that was previously 
open from osd.0, probably from way back.  Do you have logs going further 
back than what you posted?  Also, do you have osdmaps, say, 750 and 
onward?  It looks like there is a bug in the connection handling code 
(that is unrelated to the delay above).
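
For reference, the heartbeat_check lines in the osd.0 log follow a simple
rule: a peer is reported failed once its newest ping_reply stamp falls
behind "now" minus the grace period (20 s here, per "grace 20.000000" in
ceph.log).  A minimal sketch of that check, using the timestamps from the
log above -- this is an illustration, not Ceph's actual C++ code:

```python
from datetime import datetime, timedelta

# Grace period assumed from "grace 20.000000" in the posted ceph.log.
HEARTBEAT_GRACE = timedelta(seconds=20)

def heartbeat_check(last_reply_stamp, now):
    """Return True when the peer should be reported failed: its last
    ping_reply stamp is older than the cutoff (now - grace)."""
    cutoff = now - HEARTBEAT_GRACE
    return last_reply_stamp < cutoff

# Values from the osd.0 log line at 04:38:35.369720:
# "no reply from osd.1 since 2013-02-20 04:38:15.149826
#  (cutoff 2013-02-20 04:38:15.369719)"
now = datetime(2013, 2, 20, 4, 38, 35, 369720)
last_reply = datetime(2013, 2, 20, 4, 38, 15, 149826)
print(heartbeat_check(last_reply, now))  # True: osd.1 gets reported failed
```

This matches the log: the cutoff printed there is exactly 20 seconds
before the check time, and the last reply stamp is about 0.2 s older
than the cutoff, so osd.0 starts reporting osd.1 to the mon.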

Thanks!
sage


> 
> Cheers,
> 
> Chris
> 
> On Wed, Feb 20, 2013 at 01:07:03PM +1100, Chris Dunlop wrote:
> > On Tue, Feb 19, 2013 at 02:02:03PM +1100, Chris Dunlop wrote:
> >> On Sun, Feb 17, 2013 at 05:44:29PM -0800, Sage Weil wrote:
> >>> On Mon, 18 Feb 2013, Chris Dunlop wrote:
> >>>> On Sat, Feb 16, 2013 at 09:05:21AM +1100, Chris Dunlop wrote:
> >>>>> On Thu, Feb 14, 2013 at 08:57:11PM -0800, Sage Weil wrote:
> >>>>>> On Fri, 15 Feb 2013, Chris Dunlop wrote:
> >>>>>>> In an otherwise seemingly healthy cluster (ceph 0.56.2), what might cause the
> >>>>>>> mons to lose touch with the osds?
> >>>>>> 
> >>>>>> Can you enable 'debug ms = 1' on the mons and leave them that way, in the 
> >>>>>> hopes that this happens again?  It will give us more information to go on.
> >>>>> 
> >>>>> Debug turned on.
> >>>> 
> >>>> We haven't experienced the cluster losing touch with the osds completely
> >>>> since upgrading from 0.56.2 to 0.56.3, but we did lose touch with osd.1
> >>>> for a few seconds before it recovered. See below for logs (reminder: 3
> >>>> boxes, b2 is mon-only, b4 is mon+osd.0, b5 is mon+osd.1).
> >>> 
> >>> Hrm, I don't see any obvious clues.  You could enable 'debug ms = 1' on 
> >>> the osds as well.  That will give us more to go on if/when it happens 
> >>> again, and should not affect performance significantly.
> >> 
> >> Done: ceph osd tell '*' injectargs '--debug-ms 1'
> >> 
> >> Now to wait for it to happen again.
> > 
> > OK, we got it again. Full logs covering the incident available at:
> > 
> > https://www.dropbox.com/s/kguzwyjfglv3ypl/ceph-logs.zip
> > 
> > Archive:  /tmp/ceph-logs.zip
> >  Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
> > --------  ------  ------- ---- ---------- ----- --------  ----
> >    11492  Defl:X     1186  90% 2013-02-20 12:04 c0cba4ae  ceph-mon.b2.log
> >  1270789  Defl:X    89278  93% 2013-02-20 12:00 2208d035  ceph-mon.b4.log
> >  1375858  Defl:X   104025  92% 2013-02-20 12:05 c64c1dad  ceph-mon.b5.log
> >  2020042  Defl:X   215000  89% 2013-02-20 10:40 f74ae4ca  ceph-osd.0.log
> >  2075512  Defl:X   224098  89% 2013-02-20 12:05 b454d2ec  ceph-osd.1.log
> >   154938  Defl:X    12989  92% 2013-02-20 12:04 d2729b05  ceph.log
> > --------          -------  ---                            -------
> >  6908631           646576  91%                            6 files
> > 
> > My naive analysis, based on the log extracts below (best viewed on a wide
> > screen!)...
> > 
> > Osd.0 starts hearing much-delayed ping_replies from osd.1 and tells the mon,
> > which marks osd.1 down.
> > 
> > However the whole time, the osd.1 log indicates that it's receiving and
> > responding to each ping from osd.0 in a timely fashion. In contrast, the osd.0
> > log indicates it isn't seeing the osd.1 replies for a while, then sees them all
> > arrive in a flurry, until they're "delayed" enough to cause osd.0 to tell the
> > mon.
> > 
> > During the time osd.0 is not seeing the osd.1 ping_replies, there's other traffic
> > (osd_op, osd_sub_op, osd_sub_op_reply etc.) between osd.0 and osd.1, indicating
> > that it's not a network problem.
> > 
> > The load on both osds during this period was >90% idle and <1% iow.
> > 
> > Is this pointing to osd.0 experiencing some kind of scheduling or priority
> > starvation on the ping thread (assuming the ping is in its own thread)?
> > 
> > The next odd thing is that, although the osds are both back by 04:38:50 ("2
> > osds: 2 up, 2 in"), the system still wasn't working (see the disk stats for
> > both osd.0 and osd.1) and didn't recover until ceph (mon + osd) was restarted
> > on one of the boxes at around 05:50 (not shown in the logs, but full logs
> > available if needed).
> > 
> > Prior to the restart:
> > 
> > # ceph health
> > HEALTH_WARN 281 pgs peering; 281 pgs stuck inactive; 576 pgs stuck unclean
> > 
> > (Sorry, once again didn't get a 'ceph -s' prior to the restart.)
> > 
> > Chris.
> > 
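To quantify the delay described above, each ping_reply's embedded stamp can
be paired with the local receive timestamp at the head of the log line.  A
rough scraping sketch for the osd.0 log, assuming only the line layout
visible in these excerpts (not any documented Ceph log format):

```python
import re
from datetime import datetime

# Hypothetical helper: measure how long each ping_reply took to arrive by
# comparing the stamp inside osd_ping(...) with the receive timestamp at
# the start of the line.  Field positions are assumed from the sample log
# lines in this thread, nothing more.
STAMP_RE = re.compile(
    r'(?P<rcv>\d{4}-\d\d-\d\d \d\d:\d\d:\d\d\.\d+).*'
    r'osd_ping\((?P<kind>ping_reply|ping) e\d+ stamp (?P<stamp>[-\d :.]+)\)')

TS = '%Y-%m-%d %H:%M:%S.%f'

def reply_delays(lines):
    """Map each ping stamp to seconds between send stamp and local receipt."""
    delays = {}
    for line in lines:
        m = STAMP_RE.search(line)
        if m and m.group('kind') == 'ping_reply':
            rcv = datetime.strptime(m.group('rcv'), TS)
            sent = datetime.strptime(m.group('stamp').strip(), TS)
            delays[m.group('stamp').strip()] = (rcv - sent).total_seconds()
    return delays
```

Applied to the osd.0 excerpt, the ping_replies all received at
04:38:12.799xxx (for stamps 04:38:08 through 04:38:11) show delays of
roughly 1 to 4.4 seconds -- the "flurry" of delayed replies described
in the analysis.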
> > ----------------------------------------------------------------------
> > ceph.log
> > ----------------------------------------------------------------------
> > 2013-02-20 04:37:51.074128 mon.0 10.200.63.130:6789/0 120771 : [INF] pgmap v3000932: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
> > 2013-02-20 04:37:53.541471 mon.0 10.200.63.130:6789/0 120772 : [INF] pgmap v3000933: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
> > 2013-02-20 04:37:56.063059 mon.0 10.200.63.130:6789/0 120773 : [INF] pgmap v3000934: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
> > 2013-02-20 04:37:58.532763 mon.0 10.200.63.130:6789/0 120774 : [INF] pgmap v3000935: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
> > <<<< osd.0 sees delayed ping_replies from here >>>>
> > 2013-02-20 04:38:01.057939 mon.0 10.200.63.130:6789/0 120775 : [INF] pgmap v3000936: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
> > 2013-02-20 04:38:03.541404 mon.0 10.200.63.130:6789/0 120776 : [INF] pgmap v3000937: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
> > 2013-02-20 04:38:06.133004 mon.0 10.200.63.130:6789/0 120777 : [INF] pgmap v3000938: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
> > 2013-02-20 04:38:08.540471 mon.0 10.200.63.130:6789/0 120778 : [INF] pgmap v3000939: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
> > 2013-02-20 04:38:11.064003 mon.0 10.200.63.130:6789/0 120779 : [INF] pgmap v3000940: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
> > 2013-02-20 04:38:13.547845 mon.0 10.200.63.130:6789/0 120780 : [INF] pgmap v3000941: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
> > 2013-02-20 04:38:16.062892 mon.0 10.200.63.130:6789/0 120781 : [INF] pgmap v3000942: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
> > 2013-02-20 04:38:18.530804 mon.0 10.200.63.130:6789/0 120782 : [INF] pgmap v3000943: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
> > 2013-02-20 04:38:21.080347 mon.0 10.200.63.130:6789/0 120783 : [INF] pgmap v3000944: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
> > 2013-02-20 04:38:23.555523 mon.0 10.200.63.130:6789/0 120784 : [INF] pgmap v3000945: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
> > 2013-02-20 04:38:26.071449 mon.0 10.200.63.130:6789/0 120785 : [INF] pgmap v3000946: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
> > 2013-02-20 04:38:28.561133 mon.0 10.200.63.130:6789/0 120786 : [INF] pgmap v3000947: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
> > 2013-02-20 04:38:31.068101 mon.0 10.200.63.130:6789/0 120787 : [INF] pgmap v3000948: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
> > 2013-02-20 04:38:33.536022 mon.0 10.200.63.130:6789/0 120788 : [INF] pgmap v3000949: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
> > 2013-02-20 04:38:36.081591 mon.0 10.200.63.130:6789/0 120789 : [INF] pgmap v3000950: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
> > 2013-02-20 04:38:38.380909 mon.0 10.200.63.130:6789/0 120790 : [DBG] osd.1 10.200.63.133:6801/21178 reported failed by osd.0 10.200.63.132:6801/18444
> > 2013-02-20 04:38:43.372798 mon.0 10.200.63.130:6789/0 120793 : [DBG] osd.1 10.200.63.133:6801/21178 reported failed by osd.0 10.200.63.132:6801/18444
> > 2013-02-20 04:38:48.373930 mon.0 10.200.63.130:6789/0 120796 : [DBG] osd.1 10.200.63.133:6801/21178 reported failed by osd.0 10.200.63.132:6801/18444
> > 2013-02-20 04:38:48.373990 mon.0 10.200.63.130:6789/0 120797 : [INF] osd.1 10.200.63.133:6801/21178 failed (3 reports from 1 peers after 2013-02-20 04:39:11.373918 >= grace 20.000000)
> > 2013-02-20 04:38:48.565717 mon.0 10.200.63.130:6789/0 120798 : [INF] osdmap e791: 2 osds: 1 up, 2 in
> > 2013-02-20 04:38:48.670726 mon.0 10.200.63.130:6789/0 120799 : [INF] pgmap v3000955: 576 pgs: 575 active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
> > 2013-02-20 04:38:49.073328 mon.0 10.200.63.130:6789/0 120800 : [INF] pgmap v3000956: 576 pgs: 271 active+clean, 304 stale+active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
> > 2013-02-20 04:38:49.654554 mon.0 10.200.63.130:6789/0 120801 : [INF] osdmap e792: 2 osds: 1 up, 2 in
> > 2013-02-20 04:38:49.857067 mon.0 10.200.63.130:6789/0 120802 : [INF] pgmap v3000957: 576 pgs: 271 active+clean, 304 stale+active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
> > 2013-02-20 04:38:50.749644 mon.0 10.200.63.130:6789/0 120803 : [INF] osdmap e793: 2 osds: 2 up, 2 in
> > 2013-02-20 04:38:50.749710 mon.0 10.200.63.130:6789/0 120804 : [INF] osd.1 10.200.63.133:6801/21178 boot
> > 2013-02-20 04:38:50.850887 mon.0 10.200.63.130:6789/0 120805 : [INF] pgmap v3000958: 576 pgs: 271 active+clean, 304 stale+active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
> > 2013-02-20 04:38:51.834189 mon.0 10.200.63.130:6789/0 120806 : [INF] osdmap e794: 2 osds: 2 up, 2 in
> > 2013-02-20 04:38:51.956560 mon.0 10.200.63.130:6789/0 120807 : [INF] pgmap v3000959: 576 pgs: 271 active+clean, 304 stale+active+clean, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
> > 2013-02-20 04:38:56.162743 mon.0 10.200.63.130:6789/0 120808 : [INF] pgmap v3000960: 576 pgs: 295 active, 271 active+clean, 9 peering, 1 active+clean+scrubbing; 410 GB data, 842 GB used, 2881 GB / 3724 GB avail
> > 2013-02-20 04:38:57.235082 mon.0 10.200.63.130:6789/0 120809 : [INF] pgmap v3000961: 576 pgs: 295 active, 281 peering; 410 GB data, 841 GB used, 2882 GB / 3724 GB avail
> > 2013-02-20 04:38:48.660979 osd.1 10.200.63.133:6801/21178 997 : [WRN] map e791 wrongly marked me down
> > 2013-02-20 04:39:01.158928 mon.0 10.200.63.130:6789/0 120810 : [INF] pgmap v3000962: 576 pgs: 295 active, 281 peering; 410 GB data, 841 GB used, 2882 GB / 3724 GB avail
> > 2013-02-20 04:39:19.111723 osd.0 10.200.63.132:6801/18444 800 : [WRN] 6 slow requests, 6 included below; oldest blocked for > 30.770732 secs
> > 2013-02-20 04:39:19.111729 osd.0 10.200.63.132:6801/18444 801 : [WRN] slow request 30.770732 seconds old, received at 2013-02-20 04:38:48.340908: osd_op(client.9971.0:685981 rb.0.1c62.2ae8944a.0000000003aa [write 3878912~4096] 2.c82ee285) v4 currently reached pg
> > 2013-02-20 04:39:19.111735 osd.0 10.200.63.132:6801/18444 802 : [WRN] slow request 30.770225 seconds old, received at 2013-02-20 04:38:48.341415: osd_op(client.9971.0:685984 rb.0.1c62.2ae8944a.000000000439 [write 364544~20480] 2.b16f5ace) v4 currently reached pg
> > 2013-02-20 04:39:19.111738 osd.0 10.200.63.132:6801/18444 803 : [WRN] slow request 30.456112 seconds old, received at 2013-02-20 04:38:48.655528: osd_op(client.9986.0:178417 broot.rbd [watch 1~0] 2.d30a2f40) v4 currently reached pg
> > 2013-02-20 04:39:19.111743 osd.0 10.200.63.132:6801/18444 804 : [WRN] slow request 30.456106 seconds old, received at 2013-02-20 04:38:48.655534: osd_op(client.9989.0:215170 broot-nfs2.rbd [watch 1~0] 2.7802d31e) v4 currently reached pg
> > 2013-02-20 04:39:19.111747 osd.0 10.200.63.132:6801/18444 805 : [WRN] slow request 30.455860 seconds old, received at 2013-02-20 04:38:48.655780: osd_op(client.9968.0:302450 dns1.rbd [watch 1~0] 2.383712c1) v4 currently reached pg
> > 
> > ----------------------------------------------------------------------
> > grep osd_ping ceph-osd.0.log
> > ----------------------------------------------------------------------
> > 2013-02-20 04:37:57.347387 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:37:57.347384) v2 -- ?+0 0xbd248c0 con 0xa2dcdc0
> > 2013-02-20 04:37:57.349406 7f1090f23700  1 -- 192.168.254.132:0/18444 <== osd.1 192.168.254.133:6803/21178 79153 ==== osd_ping(ping_reply e790 stamp 2013-02-20 04:37:57.347384) v2 ==== 47+0+0 (2837779695 0 0) 0xbc28a80 con 0xa2dcdc0
> > 2013-02-20 04:37:57.847588 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:37:57.847586) v2 -- ?+0 0xa1ea540 con 0xa2dcdc0
> > 2013-02-20 04:37:58.050400 7f1090f23700  1 -- 192.168.254.132:0/18444 <== osd.1 192.168.254.133:6803/21178 79154 ==== osd_ping(ping_reply e790 stamp 2013-02-20 04:37:57.847586) v2 ==== 47+0+0 (3920125339 0 0) 0xdb34000 con 0xa2dcdc0
> > <<<< osd.0 sees delayed ping_replies from here >>>>
> > 2013-02-20 04:37:59.547719 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:37:59.547716) v2 -- ?+0 0x73e5340 con 0xa2dcdc0
> > 2013-02-20 04:38:00.047911 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:38:00.047909) v2 -- ?+0 0x99bbc00 con 0xa2dcdc0
> > 2013-02-20 04:38:01.748080 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:38:01.748077) v2 -- ?+0 0xa1eb6c0 con 0xa2dcdc0
> > 2013-02-20 04:38:03.448223 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:38:03.448220) v2 -- ?+0 0x99bb180 con 0xa2dcdc0
> > 2013-02-20 04:38:03.948413 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:38:03.948411) v2 -- ?+0 0x99ba700 con 0xa2dcdc0
> > 2013-02-20 04:38:04.448601 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:38:04.448598) v2 -- ?+0 0x99ba540 con 0xa2dcdc0
> > 2013-02-20 04:38:04.948724 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:38:04.948720) v2 -- ?+0 0xa1ebdc0 con 0xa2dcdc0
> > 2013-02-20 04:38:08.448860 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:38:08.448856) v2 -- ?+0 0xb676380 con 0xa2dcdc0
> > 2013-02-20 04:38:08.949028 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:38:08.949025) v2 -- ?+0 0xbc28380 con 0xa2dcdc0
> > 2013-02-20 04:38:10.649263 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:38:10.649260) v2 -- ?+0 0x965d340 con 0xa2dcdc0
> > 2013-02-20 04:38:11.749458 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:38:11.749455) v2 -- ?+0 0xa1eafc0 con 0xa2dcdc0
> > 2013-02-20 04:38:12.799154 7f1090f23700  1 -- 192.168.254.132:0/18444 <== osd.1 192.168.254.133:6803/21178 79155 ==== osd_ping(ping_reply e790 stamp 2013-02-20 04:37:59.547716) v2 ==== 47+0+0 (1242454262 0 0) 0xbc29880 con 0xa2dcdc0
> > 2013-02-20 04:38:12.799459 7f1090f23700  1 -- 192.168.254.132:0/18444 <== osd.1 192.168.254.133:6803/21178 79156 ==== osd_ping(ping_reply e790 stamp 2013-02-20 04:38:00.047909) v2 ==== 47+0+0 (3852750933 0 0) 0xcaff180 con 0xa2dcdc0
> > 2013-02-20 04:38:12.799496 7f1090f23700  1 -- 192.168.254.132:0/18444 <== osd.1 192.168.254.133:6803/21178 79157 ==== osd_ping(ping_reply e790 stamp 2013-02-20 04:38:01.748077) v2 ==== 47+0+0 (3672189647 0 0) 0xb677340 con 0xa2dcdc0
> > 2013-02-20 04:38:12.799542 7f1090f23700  1 -- 192.168.254.132:0/18444 <== osd.1 192.168.254.133:6803/21178 79158 ==== osd_ping(ping_reply e790 stamp 2013-02-20 04:38:03.448220) v2 ==== 47+0+0 (38366945 0 0) 0xbc28c40 con 0xa2dcdc0
> > 2013-02-20 04:38:12.799554 7f1090f23700  1 -- 192.168.254.132:0/18444 <== osd.1 192.168.254.133:6803/21178 79159 ==== osd_ping(ping_reply e790 stamp 2013-02-20 04:38:03.948411) v2 ==== 47+0+0 (83904766 0 0) 0x884ee00 con 0xa2dcdc0
> > 2013-02-20 04:38:12.799573 7f1090f23700  1 -- 192.168.254.132:0/18444 <== osd.1 192.168.254.133:6803/21178 79160 ==== osd_ping(ping_reply e790 stamp 2013-02-20 04:38:04.448598) v2 ==== 47+0+0 (2688468082 0 0) 0x10c5c1c0 con 0xa2dcdc0
> > 2013-02-20 04:38:12.799667 7f1090f23700  1 -- 192.168.254.132:0/18444 <== osd.1 192.168.254.133:6803/21178 79161 ==== osd_ping(ping_reply e790 stamp 2013-02-20 04:38:04.948720) v2 ==== 47+0+0 (4187258751 0 0) 0xb21a540 con 0xa2dcdc0
> > 2013-02-20 04:38:12.799689 7f1090f23700  1 -- 192.168.254.132:0/18444 <== osd.1 192.168.254.133:6803/21178 79162 ==== osd_ping(ping_reply e790 stamp 2013-02-20 04:38:08.448856) v2 ==== 47+0+0 (4176431512 0 0) 0xb21b180 con 0xa2dcdc0
> > 2013-02-20 04:38:12.799710 7f1090f23700  1 -- 192.168.254.132:0/18444 <== osd.1 192.168.254.133:6803/21178 79163 ==== osd_ping(ping_reply e790 stamp 2013-02-20 04:38:08.949025) v2 ==== 47+0+0 (2888471344 0 0) 0xb21b340 con 0xa2dcdc0
> > 2013-02-20 04:38:12.799728 7f1090f23700  1 -- 192.168.254.132:0/18444 <== osd.1 192.168.254.133:6803/21178 79164 ==== osd_ping(ping_reply e790 stamp 2013-02-20 04:38:10.649260) v2 ==== 47+0+0 (3060931781 0 0) 0xb21aa80 con 0xa2dcdc0
> > 2013-02-20 04:38:12.799745 7f1090f23700  1 -- 192.168.254.132:0/18444 <== osd.1 192.168.254.133:6803/21178 79165 ==== osd_ping(ping_reply e790 stamp 2013-02-20 04:38:11.749455) v2 ==== 47+0+0 (2767620502 0 0) 0x8d4e380 con 0xa2dcdc0
> > 2013-02-20 04:38:14.049649 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:38:14.049645) v2 -- ?+0 0xa1ea000 con 0xa2dcdc0
> > 2013-02-20 04:38:14.260608 7f1090f23700  1 -- 192.168.254.132:0/18444 <== osd.1 192.168.254.133:6803/21178 79166 ==== osd_ping(ping_reply e790 stamp 2013-02-20 04:38:14.049645) v2 ==== 47+0+0 (462572634 0 0) 0xbc29a40 con 0xa2dcdc0
> > 2013-02-20 04:38:15.149828 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:38:15.149826) v2 -- ?+0 0xac85340 con 0xa2dcdc0
> > 2013-02-20 04:38:15.151892 7f1090f23700  1 -- 192.168.254.132:0/18444 <== osd.1 192.168.254.133:6803/21178 79167 ==== osd_ping(ping_reply e790 stamp 2013-02-20 04:38:15.149826) v2 ==== 47+0+0 (2092320694 0 0) 0xdb34380 con 0xa2dcdc0
> > 2013-02-20 04:38:21.050059 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:38:21.050056) v2 -- ?+0 0xb677c00 con 0xa2dcdc0
> > 2013-02-20 04:38:25.750198 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:38:25.750195) v2 -- ?+0 0xbd24000 con 0xa2dcdc0
> > 2013-02-20 04:38:28.650370 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:38:28.650367) v2 -- ?+0 0x7e94a80 con 0xa2dcdc0
> > 2013-02-20 04:38:32.150553 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:38:32.150550) v2 -- ?+0 0xb677500 con 0xa2dcdc0
> > 2013-02-20 04:38:34.450740 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:38:34.450737) v2 -- ?+0 0x9bc7500 con 0xa2dcdc0
> > 2013-02-20 04:38:35.369720 7f109af37700 -1 osd.0 790 heartbeat_check: no reply from osd.1 since 2013-02-20 04:38:15.149826 (cutoff 2013-02-20 04:38:15.369719)
> > 2013-02-20 04:38:36.369895 7f109af37700 -1 osd.0 790 heartbeat_check: no reply from osd.1 since 2013-02-20 04:38:15.149826 (cutoff 2013-02-20 04:38:16.369894)
> > 
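To put numbers on the gap, the ping/ping_reply pairs in the grep output above can be matched on their stamp field, since the reply echoes the ping's stamp back verbatim. For the excerpt above, the 04:37:59.547716 ping isn't answered until 04:38:12.799154, i.e. about 13.25 s later, and the newest reply osd.0 has at heartbeat_check time (stamp 04:38:15.149826) falls just short of the cutoff, which sits exactly 20 s (the grace) behind "now". A throwaway sketch, not part of ceph, that does the pairing:

```python
# Throwaway sketch (not a ceph tool): pair each osd_ping "ping" with its
# "ping_reply" in grep output like the above, keyed on the echoed stamp,
# and report the send-to-reply delay in seconds.
import re
from datetime import datetime

STAMP = re.compile(r"osd_ping\((ping|ping_reply) e\d+ stamp "
                   r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\)")

def reply_delays(lines):
    """Map ping stamp -> seconds between sending the ping and its reply."""
    sent, delays = {}, {}
    for line in lines:
        line = line.lstrip("> ")                 # drop any mail quoting
        m = STAMP.search(line)
        if not m:
            continue
        logged = datetime.strptime(line[:26], "%Y-%m-%d %H:%M:%S.%f")
        kind, stamp = m.groups()
        if kind == "ping":
            sent[stamp] = logged
        elif stamp in sent:
            delays[stamp] = (logged - sent[stamp]).total_seconds()
    return delays

# Shortened copies of two lines from the ceph-osd.0.log excerpt above.
sample = [
    "2013-02-20 04:37:59.547719 x -- osd_ping(ping e790 stamp 2013-02-20 04:37:59.547716) v2",
    "2013-02-20 04:38:12.799154 x == osd_ping(ping_reply e790 stamp 2013-02-20 04:37:59.547716) v2",
]
for stamp, d in sorted(reply_delays(sample).items()):
    print(f"{stamp}: reply after {d:.3f}s")      # 13.251s for this pair
```

Run over the full grep output it would show the burst of replies at 04:38:12 all answering pings up to ~13 s old, and then nothing after the 04:38:15 stamp until the connection is rebuilt.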
> > ----------------------------------------------------------------------
> > grep osd_ping ceph-osd.1.log
> > ----------------------------------------------------------------------
> > 2013-02-20 04:37:57.847878 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79154 ==== osd_ping(ping e790 stamp 2013-02-20 04:37:57.847586) v2 ==== 47+0+0 (2625351075 0 0) 0xb441880 con 0xb9a89a0
> > 2013-02-20 04:37:57.847957 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:37:57.847586) v2 -- ?+0 0xe5921c0 con 0xb9a89a0
> > <<<< osd.0 sees delayed ping_replies from here >>>>
> > 2013-02-20 04:37:59.547994 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79155 ==== osd_ping(ping e790 stamp 2013-02-20 04:37:59.547716) v2 ==== 47+0+0 (1071491278 0 0) 0xb440700 con 0xb9a89a0
> > 2013-02-20 04:37:59.548066 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:37:59.547716) v2 -- ?+0 0xb441880 con 0xb9a89a0
> > 2013-02-20 04:38:00.048174 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79156 ==== osd_ping(ping e790 stamp 2013-02-20 04:38:00.047909) v2 ==== 47+0+0 (2423758957 0 0) 0xc987a40 con 0xb9a89a0
> > 2013-02-20 04:38:00.048262 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:38:00.047909) v2 -- ?+0 0xb440700 con 0xb9a89a0
> > 2013-02-20 04:38:01.748248 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79157 ==== osd_ping(ping e790 stamp 2013-02-20 04:38:01.748077) v2 ==== 47+0+0 (2939345655 0 0) 0xb6d96c0 con 0xb9a89a0
> > 2013-02-20 04:38:01.748330 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:38:01.748077) v2 -- ?+0 0xc987a40 con 0xb9a89a0
> > 2013-02-20 04:38:03.448435 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79158 ==== osd_ping(ping e790 stamp 2013-02-20 04:38:03.448220) v2 ==== 47+0+0 (2006621913 0 0) 0xb71c540 con 0xb9a89a0
> > 2013-02-20 04:38:03.448531 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:38:03.448220) v2 -- ?+0 0xb6d96c0 con 0xb9a89a0
> > 2013-02-20 04:38:04.163566 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79159 ==== osd_ping(ping e790 stamp 2013-02-20 04:38:03.948411) v2 ==== 47+0+0 (1892923590 0 0) 0xc02b6c0 con 0xb9a89a0
> > 2013-02-20 04:38:04.163648 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:38:03.948411) v2 -- ?+0 0xb71c540 con 0xb9a89a0
> > 2013-02-20 04:38:04.448837 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79160 ==== osd_ping(ping e790 stamp 2013-02-20 04:38:04.448598) v2 ==== 47+0+0 (3589092426 0 0) 0xc02a8c0 con 0xb9a89a0
> > 2013-02-20 04:38:04.448876 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:38:04.448598) v2 -- ?+0 0xc02b6c0 con 0xb9a89a0
> > 2013-02-20 04:38:04.949019 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79161 ==== osd_ping(ping e790 stamp 2013-02-20 04:38:04.948720) v2 ==== 47+0+0 (2353499975 0 0) 0x6fae700 con 0xb9a89a0
> > 2013-02-20 04:38:04.949106 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:38:04.948720) v2 -- ?+0 0xc02a8c0 con 0xb9a89a0
> > 2013-02-20 04:38:08.449126 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79162 ==== osd_ping(ping e790 stamp 2013-02-20 04:38:08.448856) v2 ==== 47+0+0 (2369567136 0 0) 0xc02ac40 con 0xb9a89a0
> > 2013-02-20 04:38:08.449210 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:38:08.448856) v2 -- ?+0 0x6fae700 con 0xb9a89a0
> > 2013-02-20 04:38:08.949215 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79163 ==== osd_ping(ping e790 stamp 2013-02-20 04:38:08.949025) v2 ==== 47+0+0 (3656999688 0 0) 0xc02ba40 con 0xb9a89a0
> > 2013-02-20 04:38:08.949277 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:38:08.949025) v2 -- ?+0 0xc02ac40 con 0xb9a89a0
> > 2013-02-20 04:38:10.649580 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79164 ==== osd_ping(ping e790 stamp 2013-02-20 04:38:10.649260) v2 ==== 47+0+0 (3282169085 0 0) 0xc02a000 con 0xb9a89a0
> > 2013-02-20 04:38:10.649647 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:38:10.649260) v2 -- ?+0 0xc02ba40 con 0xb9a89a0
> > 2013-02-20 04:38:11.749750 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79165 ==== osd_ping(ping e790 stamp 2013-02-20 04:38:11.749455) v2 ==== 47+0+0 (3508894126 0 0) 0xe593180 con 0xb9a89a0
> > 2013-02-20 04:38:11.749835 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:38:11.749455) v2 -- ?+0 0xc02a000 con 0xb9a89a0
> > 2013-02-20 04:38:14.049868 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79166 ==== osd_ping(ping e790 stamp 2013-02-20 04:38:14.049645) v2 ==== 47+0+0 (1849801826 0 0) 0xc02bdc0 con 0xb9a89a0
> > 2013-02-20 04:38:14.049943 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:38:14.049645) v2 -- ?+0 0xe593180 con 0xb9a89a0
> > 2013-02-20 04:38:15.150155 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79167 ==== osd_ping(ping e790 stamp 2013-02-20 04:38:15.149826) v2 ==== 47+0+0 (157661070 0 0) 0xb6d9340 con 0xb9a89a0
> > 2013-02-20 04:38:15.150242 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:38:15.149826) v2 -- ?+0 0xc02bdc0 con 0xb9a89a0
> > 2013-02-20 04:38:21.050348 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79168 ==== osd_ping(ping e790 stamp 2013-02-20 04:38:21.050056) v2 ==== 47+0+0 (2307149705 0 0) 0xbf4bc00 con 0xb9a89a0
> > 2013-02-20 04:38:21.050433 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:38:21.050056) v2 -- ?+0 0xb6d9340 con 0xb9a89a0
> > 2013-02-20 04:38:25.750415 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79169 ==== osd_ping(ping e790 stamp 2013-02-20 04:38:25.750195) v2 ==== 47+0+0 (3873827452 0 0) 0xe592000 con 0xb9a89a0
> > 2013-02-20 04:38:25.750548 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:38:25.750195) v2 -- ?+0 0xbf4bc00 con 0xb9a89a0
> > 2013-02-20 04:38:28.650634 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79170 ==== osd_ping(ping e790 stamp 2013-02-20 04:38:28.650367) v2 ==== 47+0+0 (718534970 0 0) 0x78ff880 con 0xb9a89a0
> > 2013-02-20 04:38:28.650713 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:38:28.650367) v2 -- ?+0 0xe592000 con 0xb9a89a0
> > 2013-02-20 04:38:32.150855 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79171 ==== osd_ping(ping e790 stamp 2013-02-20 04:38:32.150550) v2 ==== 47+0+0 (1909552382 0 0) 0x6fae000 con 0xb9a89a0
> > 2013-02-20 04:38:32.150939 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:38:32.150550) v2 -- ?+0 0x78ff880 con 0xb9a89a0
> > 2013-02-20 04:38:34.450994 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79172 ==== osd_ping(ping e790 stamp 2013-02-20 04:38:34.450737) v2 ==== 47+0+0 (886776134 0 0) 0xd304700 con 0xb9a89a0
> > 2013-02-20 04:38:34.451033 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:38:34.450737) v2 -- ?+0 0x6fae000 con 0xb9a89a0
> > 2013-02-20 04:38:37.351175 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79173 ==== osd_ping(ping e790 stamp 2013-02-20 04:38:37.350908) v2 ==== 47+0+0 (2444468724 0 0) 0xe593340 con 0xb9a89a0
> > 2013-02-20 04:38:37.351215 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:38:37.350908) v2 -- ?+0 0xd304700 con 0xb9a89a0
> > 2013-02-20 04:38:41.451417 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79174 ==== osd_ping(ping e790 stamp 2013-02-20 04:38:41.451094) v2 ==== 47+0+0 (4159941099 0 0) 0xb6d8380 con 0xb9a89a0
> > 2013-02-20 04:38:41.451477 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:38:41.451094) v2 -- ?+0 0xe593340 con 0xb9a89a0
> > 2013-02-20 04:38:43.751635 7f564b733700  1 -- 192.168.254.133:6803/21178 <== osd.0 192.168.254.132:0/18444 79175 ==== osd_ping(ping e790 stamp 2013-02-20 04:38:43.751289) v2 ==== 47+0+0 (800627449 0 0) 0xbb0b180 con 0xb9a89a0
> > 2013-02-20 04:38:43.751675 7f564b733700  1 -- 192.168.254.133:6803/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e790 stamp 2013-02-20 04:38:43.751289) v2 -- ?+0 0xb6d8380 con 0xb9a89a0
> > 2013-02-20 04:38:53.151872 7f564b733700  1 -- 192.168.254.133:6802/21178 <== osd.0 192.168.254.132:0/18444 1 ==== osd_ping(ping e794 stamp 2013-02-20 04:38:53.151589) v2 ==== 47+0+0 (3010557341 0 0) 0x78ff880 con 0x10d39340
> > 2013-02-20 04:38:53.151908 7f564b733700  1 -- 192.168.254.133:6802/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e794 stamp 2013-02-20 04:38:53.151589) v2 -- ?+0 0xbb0b180 con 0x10d39340
> > 2013-02-20 04:38:56.052292 7f564b733700  1 -- 192.168.254.133:6802/21178 <== osd.0 192.168.254.132:0/18444 2 ==== osd_ping(ping e794 stamp 2013-02-20 04:38:56.051788) v2 ==== 47+0+0 (4254185894 0 0) 0xb6d9340 con 0x10d39340
> > 2013-02-20 04:38:56.052330 7f564b733700  1 -- 192.168.254.133:6802/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e794 stamp 2013-02-20 04:38:56.051788) v2 -- ?+0 0x78ff880 con 0x10d39340
> > 2013-02-20 04:38:57.152438 7f564b733700  1 -- 192.168.254.133:6802/21178 <== osd.0 192.168.254.132:0/18444 3 ==== osd_ping(ping e794 stamp 2013-02-20 04:38:57.152006) v2 ==== 47+0+0 (1503720898 0 0) 0xb6d96c0 con 0x10d39340
> > 2013-02-20 04:38:57.152472 7f564b733700  1 -- 192.168.254.133:6802/21178 --> 192.168.254.132:0/18444 -- osd_ping(ping_reply e794 stamp 2013-02-20 04:38:57.152006) v2 -- ?+0 0xb6d9340 con 0x10d39340
> > 
> > ----------------------------------------------------------------------
> > ceph-osd.0.log, showing osd.0 <=> osd.1 activity whilst osd.1 ping_replies aren't being seen
> > ----------------------------------------------------------------------
> > <<<< osd.0 sees delayed ping_replies from here >>>>
> > 2013-02-20 04:37:59.547719 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:37:59.547716) v2 -- ?+0 0x73e5340 con 0xa2dcdc0
> > 2013-02-20 04:38:00.047911 7f108bf19700  1 -- 192.168.254.132:0/18444 --> 192.168.254.133:6803/21178 -- osd_ping(ping e790 stamp 2013-02-20 04:38:00.047909) v2 -- ?+0 0x99bbc00 con 0xa2dcdc0
> > 2013-02-20 04:38:00.342856 7f108e71e700  1 -- 192.168.254.132:6802/18444 --> osd.1 192.168.254.133:6801/21178 -- osd_sub_op(client.9811.0:498225 2.51 4d55cd51/rb.0.2150.2ae8944a.000000000628/head//2 [] v 790'271511 snapset=0=[]:[] snapc=0=[]) v7 -- ?+8799 0x8f91400
> > 2013-02-20 04:38:01.046192 7f1091f25700  1 -- 192.168.254.132:6802/18444 <== osd.1 192.168.254.133:6801/21178 4989786 ==== osd_sub_op(unknown.0.0:0 2.4e 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[] snapc=0=[]) v7 ==== 955+0+21617430 (245718815 0 180673924) 0xd19d200 con 0x9571080
> > 2013-02-20 04:38:01.046297 7f1091f25700  1 -- 192.168.254.132:6802/18444 <== osd.1 192.168.254.133:6801/21178 4989787 ==== osd_sub_op_reply(client.9971.0:685924 2.4e b16f5ace/rb.0.1c62.2ae8944a.000000000439/head//2 [] ondisk, result = 0) v1 ==== 157+0+0 (3427135234 0 0) 0xa9c4f00 con 0x9571080
> > 2013-02-20 04:38:01.046352 7f1091f25700  1 -- 192.168.254.132:6802/18444 <== osd.1 192.168.254.133:6801/21178 4989788 ==== osd_sub_op_reply(client.9953.0:367558 2.5a 6d3052da/rbd_data.242f2ae8944a.000000000000002c/head//2 [] ondisk, result = 0) v1 ==== 164+0+0 (2641157471 0 0) 0x86f2280 con 0x9571080
> > 2013-02-20 04:38:01.046390 7f1091f25700  1 -- 192.168.254.132:6802/18444 <== osd.1 192.168.254.133:6801/21178 4989789 ==== osd_sub_op_reply(client.9953.0:367559 2.28 8340fa28/rbd_data.242f2ae8944a.0000000000000096/head//2 [] ondisk, result = 0) v1 ==== 164+0+0 (2759328511 0 0) 0x95d2780 con 0x9571080
> > 2013-02-20 04:38:01.046743 7f1091f25700  1 -- 192.168.254.132:6802/18444 <== osd.1 192.168.254.133:6801/21178 4989790 ==== osd_sub_op_reply(client.9953.0:367560 2.5a 6678d7da/rbd_data.242f2ae8944a.0000000000000b8b/head//2 [] ondisk, result = 0) v1 ==== 164+0+0 (602117107 0 0) 0xda1fb80 con 0x9571080
> > 2013-02-20 04:38:01.047020 7f1091f25700  1 -- 192.168.254.132:6802/18444 <== osd.1 192.168.254.133:6801/21178 4989791 ==== osd_sub_op_reply(client.9811.0:498225 2.51 4d55cd51/rb.0.2150.2ae8944a.000000000628/head//2 [] ondisk, result = 0) v1 ==== 157+0+0 (435514219 0 0) 0xac43400 con 0x9571080
> > 2013-02-20 04:38:01.047240 7f1091f25700  1 -- 192.168.254.132:6802/18444 <== osd.1 192.168.254.133:6801/21178 4989792 ==== osd_sub_op(client.9995.0:4402170 2.77 3b454af7/rb.0.1cf8.2ae8944a.000000000e35/head//2 [] v 790'353333 snapset=0=[]:[] snapc=0=[]) v7 ==== 1107+0+8799 (240510986 0 1818934932) 0x8f91400 con 0x9571080
> > 2013-02-20 04:38:01.047600 7f108cf1b700  1 -- 192.168.254.132:6802/18444 --> osd.1 192.168.254.133:6801/21178 -- replica scrub(pg: 2.4e,from:0'0,to:0'0,epoch:790,start:6388b3ce//0//-1,end:cff8e3ce//0//-1,chunky:1,deep:0,version:4) v4 -- ?+0 0xb0b3b00
> > 2013-02-20 04:38:01.050467 7f108df1d700  1 -- 192.168.254.132:6802/18444 --> osd.1 192.168.254.133:6801/21178 -- osd_sub_op(client.9811.0:498226 2.51 4d55cd51/rb.0.2150.2ae8944a.000000000628/head//2 [] v 790'271512 snapset=0=[]:[] snapc=0=[]) v7 -- ?+4703 0xd19d200
> > 2013-02-20 04:38:01.060518 7f1091f25700  1 -- 192.168.254.132:6802/18444 <== osd.1 192.168.254.133:6801/21178 4989793 ==== osd_sub_op(client.9968.0:302420 2.74 2c5fc7f4/rb.0.1c64.74b0dc51.00000000043b/head//2 [] v 790'498230 snapset=0=[]:[] snapc=0=[]) v7 ==== 1107+0+16991 (2821939739 0 1047147434) 0xd19d200 con 0x9571080
> > 2013-02-20 04:38:01.072310 7f1091f25700  1 -- 192.168.254.132:6802/18444 <== osd.1 192.168.254.133:6801/21178 4989794 ==== osd_sub_op_reply(client.9811.0:498226 2.51 4d55cd51/rb.0.2150.2ae8944a.000000000628/head//2 [] ondisk, result = 0) v1 ==== 157+0+0 (3589886011 0 0) 0xaf1ef00 con 0x9571080
> > 2013-02-20 04:38:01.086276 7f1093f29700  1 -- 192.168.254.132:6802/18444 --> osd.1 192.168.254.133:6801/21178 -- osd_sub_op_reply(client.9995.0:4402170 2.77 3b454af7/rb.0.1cf8.2ae8944a.000000000e35/head//2 [] ondisk, result = 0) v1 -- ?+0 0xc477400
> > 2013-02-20 04:38:01.090522 7f1091f25700  1 -- 192.168.254.132:6802/18444 <== osd.1 192.168.254.133:6801/21178 4989795 ==== osd_sub_op(client.9995.0:4402171 2.77 3b454af7/rb.0.1cf8.2ae8944a.000000000e35/head//2 [] v 790'353334 snapset=0=[]:[] snapc=0=[]) v7 ==== 1107+0+4703 (4105904555 0 2838991357) 0xbddea00 con 0x9571080
> > 2013-02-20 04:38:01.097246 7f1093f29700  1 -- 192.168.254.132:6802/18444 --> osd.1 192.168.254.133:6801/21178 -- osd_sub_op_reply(client.9968.0:302420 2.74 2c5fc7f4/rb.0.1c64.74b0dc51.00000000043b/head//2 [] ondisk, result = 0) v1 -- ?+0 0x9055400
> > 2013-02-20 04:38:01.100259 7f1091f25700  1 -- 192.168.254.132:6802/18444 <== osd.1 192.168.254.133:6801/21178 4989796 ==== osd_sub_op(client.9968.0:302421 2.74 2c5fc7f4/rb.0.1c64.74b0dc51.00000000043b/head//2 [] v 790'498231 snapset=0=[]:[] snapc=0=[]) v7 ==== 1107+0+4703 (3340239323 0 2873835958) 0x810f400 con 0x9571080
> > 2013-02-20 04:38:01.110432 7f1093f29700  1 -- 192.168.254.132:6802/18444 --> osd.1 192.168.254.133:6801/21178 -- osd_sub_op_reply(client.9995.0:4402171 2.77 3b454af7/rb.0.1cf8.2ae8944a.000000000e35/head//2 [] ondisk, result = 0) v1 -- ?+0 0xc477680
> > 2013-02-20 04:38:01.121128 7f1093f29700  1 -- 192.168.254.132:6802/18444 --> osd.1 192.168.254.133:6801/21178 -- osd_sub_op_reply(client.9968.0:302421 2.74 2c5fc7f4/rb.0.1c64.74b0dc51.00000000043b/head//2 [] ondisk, result = 0) v1 -- ?+0 0x296d680
> > 2013-02-20 04:38:01.366189 7f108e71e700  1 -- 192.168.254.132:6802/18444 --> osd.1 192.168.254.133:6801/21178 -- osd_sub_op(client.10001.0:3423766 2.7d f598cffd/rb.0.209f.74b0dc51.000000000a00/head//2 [] v 790'768516 snapset=0=[]:[] snapc=0=[]) v7 -- ?+4703 0x9fe1400
> > 2013-02-20 04:38:01.366377 7f108e71e700  1 -- 192.168.254.132:6802/18444 --> osd.1 192.168.254.133:6801/21178 -- osd_sub_op(client.10001.0:3423767 2.7d f598cffd/rb.0.209f.74b0dc51.000000000a00/head//2 [] v 790'768517 snapset=0=[]:[] snapc=0=[]) v7 -- ?+4703 0xb978a00
> > 2013-02-20 04:38:01.366519 7f108e71e700  1 -- 192.168.254.132:6802/18444 --> osd.1 192.168.254.133:6801/21178 -- osd_sub_op(client.10001.0:3423768 2.7d f598cffd/rb.0.209f.74b0dc51.000000000a00/head//2 [] v 790'768518 snapset=0=[]:[] snapc=0=[]) v7 -- ?+4703 0x815d200
> > 2013-02-20 04:38:01.366728 7f108df1d700  1 -- 192.168.254.132:6802/18444 --> osd.1 192.168.254.133:6801/21178 -- osd_sub_op(client.10001.0:3423770 2.7d f598cffd/rb.0.209f.74b0dc51.000000000a00/head//2 [] v 790'768519 snapset=0=[]:[] snapc=0=[]) v7 -- ?+4703 0xc557e00
> > 2013-02-20 04:38:01.367681 7f1091f25700  1 -- 192.168.254.132:6802/18444 <== osd.1 192.168.254.133:6801/21178 4989797 ==== osd_sub_op(client.10001.0:3423769 2.17 37564117/rb.0.209f.74b0dc51.000000001200/head//2 [] v 790'353128 snapset=0=[]:[] snapc=0=[]) v7 ==== 1107+0+4703 (1225650648 0 1162937880) 0xa052800 con 0x9571080
> > 2013-02-20 04:38:01.378938 7f1091f25700  1 -- 192.168.254.132:6802/18444 <== osd.1 192.168.254.133:6801/21178 4989798 ==== osd_sub_op_reply(client.10001.0:3423766 2.7d f598cffd/rb.0.209f.74b0dc51.000000000a00/head//2 [] ondisk, result = 0) v1 ==== 157+0+0 (2250676101 0 0) 0xc477680 con 0x9571080
> > 2013-02-20 04:38:01.379109 7f1091f25700  1 -- 192.168.254.132:6802/18444 <== osd.1 192.168.254.133:6801/21178 4989799 ==== osd_sub_op_reply(client.10001.0:3423767 2.7d f598cffd/rb.0.209f.74b0dc51.000000000a00/head//2 [] ondisk, result = 0) v1 ==== 157+0+0 (2794751629 0 0) 0x7b1aa00 con 0x9571080
> > 2013-02-20 04:38:01.379162 7f1091f25700  1 -- 192.168.254.132:6802/18444 <== osd.1 192.168.254.133:6801/21178 4989800 ==== osd_sub_op_reply(client.10001.0:3423768 2.7d f598cffd/rb.0.209f.74b0dc51.000000000a00/head//2 [] ondisk, result = 0) v1 ==== 157+0+0 (41013162 0 0) 0x8086a00 con 0x9571080
> > 2013-02-20 04:38:01.379396 7f1091f25700  1 -- 192.168.254.132:6802/18444 <== osd.1 192.168.254.133:6801/21178 4989801 ==== osd_sub_op_reply(client.10001.0:3423770 2.7d f598cffd/rb.0.209f.74b0dc51.000000000a00/head//2 [] ondisk, result = 0) v1 ==== 157+0+0 (2912526559 0 0) 0xb054500 con 0x9571080
> > 
> > ----------------------------------------------------------------------
> > osd.0 load and disk
> > ----------------------------------------------------------------------
> >                     load %user %nice  %sys  %iow  %stl %idle  dev rrqm/s wrqm/s    r/s    w/s    rkB/s    wkB/s arq-sz aqu-sz  await  rwait wwait %util
> > 2013-02-20-04:37:56  0.6   1.4   0.0   1.4   0.5   0.0  94.1  sdn    0.0    0.0    0.0   19.1     0.00   182.00  19.09   0.59  30.87   0.00  30.9  14.5
> > 2013-02-20-04:38:11  0.6   1.2   0.0   1.1   1.0   0.0  93.4  sdn    0.0    0.0    0.0   20.8     0.00   152.00  14.62   0.79  38.14   0.00  38.1  12.5
> > 2013-02-20-04:38:26  0.6   1.3   0.0   1.1   0.5   0.0  94.0  sdn    0.0    0.0    0.0   17.4     0.00   170.53  19.60   0.69  39.85   0.00  39.8  14.7
> > 2013-02-20-04:38:41  0.6   1.1   0.0   1.2   0.5   0.0  94.3  sdn    0.0    0.2    0.0   21.8     0.00   179.83  16.50   0.61  28.17   0.00  28.2  12.1
> > 2013-02-20-04:38:56  0.8   1.7   0.0   1.2   4.3   0.0  90.6  sdn    0.0    0.1    0.2  141.1     1.60  1838.33  26.04  41.68 294.67 283.33 294.7  53.9
> > 2013-02-20-04:39:11  0.7   0.3   0.0   0.3   0.3   0.0  98.3  sdn    0.0    0.0    0.0    2.3     0.00    45.20  39.88   0.16  85.00   0.00  85.0   4.8
> > 2013-02-20-04:39:26  0.6   0.1   0.0   0.1   0.0   0.0  99.2  sdn    0.0    0.0    0.0    0.0     0.00     0.00   0.00   0.00   0.00   0.00   0.0   0.0
> > 2013-02-20-04:39:41  0.5   0.4   0.0   0.3   0.1   0.0  99.4  sdn    0.0    0.0    0.0    0.0     0.00     0.00   0.00   0.00   0.00   0.00   0.0   0.0
> > 2013-02-20-04:39:56  0.4   0.2   0.0   0.2   0.1   0.0  99.5  sdn    0.0    0.0    0.0    0.0     0.00     0.00   0.00   0.00   0.00   0.00   0.0   0.0
> > 2013-02-20-04:40:11  0.3   0.3   0.0   0.3   0.0   0.0  99.5  sdn    0.0    0.0    0.0    0.0     0.00     0.00   0.00   0.00   0.00   0.00   0.0   0.0
> > 
> > ----------------------------------------------------------------------
> > osd.1 load and disk
> > ----------------------------------------------------------------------
> >                     load %user %nice  %sys  %iow  %stl %idle  dev rrqm/s wrqm/s    r/s    w/s    rkB/s    wkB/s arq-sz aqu-sz  await  rwait wwait %util
> > 2013-02-20-04:37:57  0.8   0.3   0.0   0.2   0.6   0.0  97.7  sdg    0.0    0.1    2.5   19.7    10.40   179.93  17.15   0.97  43.63  38.16  44.3  19.3
> > 2013-02-20-04:38:12  1.0   0.3   0.0   0.3   0.7   0.0  97.4  sdg    0.0    0.1    2.8   28.0    11.47   232.20  15.82   0.71  23.20  25.48  23.0  16.1
> > 2013-02-20-04:38:27  0.8   0.4   0.0   0.2   0.7   0.0  97.4  sdg    0.0    0.0    2.7   11.3    10.67   102.87  16.30   0.31  22.30  14.25  24.2  11.1
> > 2013-02-20-04:38:42  0.7   0.2   0.0   0.2   0.8   0.0  97.6  sdg    0.0    0.0    2.3   22.3     9.07   198.07  16.84   1.15  46.72  31.18  48.3  20.9
> > 2013-02-20-04:38:57  0.6   0.6   0.0   0.5   5.2   0.0  92.9  sdg    0.0    0.0   15.8  162.7    63.73  1613.37  18.79  32.51 182.07  24.94 197.3  63.1
> > 2013-02-20-04:39:12  0.5   0.0   0.0   0.0   0.1   0.0  99.8  sdg    0.0    0.0    0.0    5.8     0.00    76.87  26.51   0.32  57.82   0.00  57.8   9.2
> > 2013-02-20-04:39:27  0.4   0.0   0.0   0.0   0.1   0.0  99.9  sdg    0.0    0.0    0.0    0.0     0.00     0.00   0.00   0.00   0.00   0.00   0.0   0.0
> > 2013-02-20-04:39:42  0.6   0.0   0.0   0.0   0.1   0.0  99.8  sdg    0.0    0.0    0.0    0.0     0.00     0.00   0.00   0.00   0.00   0.00   0.0   0.0
> > 2013-02-20-04:39:57  0.5   0.0   0.0   0.0   0.1   0.0  99.8  sdg    0.0    0.0    0.0    0.0     0.00     0.00   0.00   0.00   0.00   0.00   0.0   0.0
> > 2013-02-20-04:40:12  0.4   0.0   0.0   0.0   0.1   0.0  99.8  sdg    0.0    0.0    0.0    0.0     0.00     0.00   0.00   0.00   0.00   0.00   0.0   0.0
> > 
> 
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Mon losing touch with OSDs
  2013-02-22 21:57               ` Sage Weil
@ 2013-02-22 23:35                 ` Chris Dunlop
  2013-02-22 23:43                   ` Sage Weil
  0 siblings, 1 reply; 25+ messages in thread
From: Chris Dunlop @ 2013-02-22 23:35 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On Fri, Feb 22, 2013 at 01:57:32PM -0800, Sage Weil wrote:
> On Fri, 22 Feb 2013, Chris Dunlop wrote:
>> G'day,
>> 
>> It seems there might be two issues here: the first being the delayed
>> receipt of echo replies causing a seemingly otherwise healthy osd to be
>> marked down, the second being the lack of recovery once the downed osd is
>> recognised as up again.
>> 
>> Is it worth my opening tracker reports for this, just so it doesn't get
>> lost?
> 
> I just looked at the logs.  I can't tell what happened to cause that 10 
> second delay.. strangely, messages were passing from 0 -> 1, but nothing 
> came back from 1 -> 0 (although 1 was queuing, if not sending, them).
> 
> The strange bit is that after this, you get those indefinite hangs.  From 
> the logs it looks like the OSD rebound to an old port that was previously 
> open from osd.0.. probably from way back.  Do you have logs going further 
> back than what you posted?  Also, do you have osdmaps, say, 750 and 
> onward?  It looks like there is a bug in the connection handling code 
> (that is unrelated to the delay above).

Currently uploading logs starting midnight to dropbox, will send
links when they're up.

How would I retrieve the interesting osdmaps?

Chris.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Mon losing touch with OSDs
  2013-02-22 23:35                 ` Chris Dunlop
@ 2013-02-22 23:43                   ` Sage Weil
  2013-02-23  0:08                     ` Chris Dunlop
  0 siblings, 1 reply; 25+ messages in thread
From: Sage Weil @ 2013-02-22 23:43 UTC (permalink / raw)
  To: Chris Dunlop; +Cc: ceph-devel

On Sat, 23 Feb 2013, Chris Dunlop wrote:
> On Fri, Feb 22, 2013 at 01:57:32PM -0800, Sage Weil wrote:
> > On Fri, 22 Feb 2013, Chris Dunlop wrote:
> >> G'day,
> >> 
> >> It seems there might be two issues here: the first being the delayed
> >> receipt of echo replies causing a seemingly otherwise healthy osd to be
> >> marked down, the second being the lack of recovery once the downed osd is
> >> recognised as up again.
> >> 
> >> Is it worth my opening tracker reports for this, just so it doesn't get
> >> lost?
> > 
> > I just looked at the logs.  I can't tell what happened to cause that 10 
> > second delay.. strangely, messages were passing from 0 -> 1, but nothing 
> > came back from 1 -> 0 (although 1 was queuing, if not sending, them).
> > 
> > The strange bit is that after this, you get those indefinite hangs.  From 
> > the logs it looks like the OSD rebound to an old port that was previously 
> > open from osd.0.. probably from way back.  Do you have logs going further 
> > back than what you posted?  Also, do you have osdmaps, say, 750 and 
> > onward?  It looks like there is a bug in the connection handling code 
> > (that is unrelated to the delay above).
> 
> Currently uploading logs starting midnight to dropbox, will send
> links when they're up.
> 
> How would I retrieve the interesting osdmaps?

They are in the monitor data directory, in the osdmap_full dir.
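To make the collection step concrete, here is a minimal sketch of pulling a range of those per-epoch files out of a monitor's osdmap_full directory for sharing. The mon data path (e.g. /var/lib/ceph/mon/ceph-b2/osdmap_full) and the epoch range are assumptions to adjust for your cluster; the demo below runs against a throwaway directory instead of a live mon.

```python
import os
import shutil
import tempfile

def collect_osdmaps(mon_dir, first, last, dest):
    """Copy the full osdmap files for epochs first..last (inclusive)
    out of a monitor's osdmap_full directory, skipping absent epochs."""
    os.makedirs(dest, exist_ok=True)
    copied = []
    for epoch in range(first, last + 1):
        src = os.path.join(mon_dir, str(epoch))
        if os.path.exists(src):
            shutil.copy(src, dest)
            copied.append(epoch)
    return copied

# Demo against a throwaway directory standing in for a real mon data
# path such as /var/lib/ceph/mon/ceph-b2/osdmap_full (path assumed).
mon_dir = tempfile.mkdtemp()
for e in (750, 751, 753):
    open(os.path.join(mon_dir, str(e)), "w").close()

dest = tempfile.mkdtemp()
copied = collect_osdmaps(mon_dir, 750, 755, dest)
print(copied)  # -> [750, 751, 753]
```

The resulting directory can then be zipped and uploaded alongside the daemon logs.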

Thanks!
sage

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Mon losing touch with OSDs
  2013-02-22 23:43                   ` Sage Weil
@ 2013-02-23  0:08                     ` Chris Dunlop
  2013-02-23  0:13                       ` Sage Weil
  0 siblings, 1 reply; 25+ messages in thread
From: Chris Dunlop @ 2013-02-23  0:08 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On Fri, Feb 22, 2013 at 03:43:22PM -0800, Sage Weil wrote:
> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>> On Fri, Feb 22, 2013 at 01:57:32PM -0800, Sage Weil wrote:
>>> On Fri, 22 Feb 2013, Chris Dunlop wrote:
>>>> G'day,
>>>> 
>>>> It seems there might be two issues here: the first being the delayed
>>>> receipt of echo replies causing a seemingly otherwise healthy osd to be
>>>> marked down, the second being the lack of recovery once the downed osd is
>>>> recognised as up again.
>>>> 
>>>> Is it worth my opening tracker reports for this, just so it doesn't get
>>>> lost?
>>> 
>>> I just looked at the logs.  I can't tell what happened to cause that 10 
>>> second delay.. strangely, messages were passing from 0 -> 1, but nothing 
>>> came back from 1 -> 0 (although 1 was queuing, if not sending, them).

Is there any way of telling where they were delayed, i.e. in the 1's output
queue or 0's input queue?

>>> The strange bit is that after this, you get those indefinite hangs.  From 
>>> the logs it looks like the OSD rebound to an old port that was previously 
>>> open from osd.0.. probably from way back.  Do you have logs going further 
>>> back than what you posted?  Also, do you have osdmaps, say, 750 and 
>>> onward?  It looks like there is a bug in the connection handling code 
>>> (that is unrelated to the delay above).
>> 
>> Currently uploading logs starting midnight to dropbox, will send
>> links when they're up.
>> 
>> How would I retrieve the interesting osdmaps?
> 
> They are in the monitor data directory, in the osdmap_full dir.

Logs from midnight onwards and osdmaps are in this folder:

https://www.dropbox.com/sh/7nq7gr2u2deorcu/Nvw3FFGiy2

  ceph-mon.b2.log.bz2
  ceph-mon.b4.log.bz2
  ceph-mon.b5.log.bz2
  ceph-osd.0.log.bz2
  ceph-osd.1.log.bz2 (still uploading as I type)
  osdmaps.zip

Cheers,

Chris

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Mon losing touch with OSDs
  2013-02-23  0:08                     ` Chris Dunlop
@ 2013-02-23  0:13                       ` Sage Weil
  2013-02-23  0:25                         ` Sage Weil
  2013-02-23  0:57                         ` Chris Dunlop
  0 siblings, 2 replies; 25+ messages in thread
From: Sage Weil @ 2013-02-23  0:13 UTC (permalink / raw)
  To: Chris Dunlop; +Cc: ceph-devel

On Sat, 23 Feb 2013, Chris Dunlop wrote:
> On Fri, Feb 22, 2013 at 03:43:22PM -0800, Sage Weil wrote:
> > On Sat, 23 Feb 2013, Chris Dunlop wrote:
> >> On Fri, Feb 22, 2013 at 01:57:32PM -0800, Sage Weil wrote:
> >>> On Fri, 22 Feb 2013, Chris Dunlop wrote:
> >>>> G'day,
> >>>> 
> >>>> It seems there might be two issues here: the first being the delayed
> >>>> receipt of echo replies causing a seemingly otherwise healthy osd to be
> >>>> marked down, the second being the lack of recovery once the downed osd is
> >>>> recognised as up again.
> >>>> 
> >>>> Is it worth my opening tracker reports for this, just so it doesn't get
> >>>> lost?
> >>> 
> >>> I just looked at the logs.  I can't tell what happened to cause that 10 
> >>> second delay.. strangely, messages were passing from 0 -> 1, but nothing 
> >>> came back from 1 -> 0 (although 1 was queuing, if not sending, them).
> 
> Is there any way of telling where they were delayed, i.e. in the 1's output
> queue or 0's input queue?

Yeah, if you bump it up to 'debug ms = 20'.  Be aware that that will 
generate a lot of logging, though.
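For reference, a sketch of where such a setting could live; the section placement is an assumption to check against your setup:

```ini
; ceph.conf fragment: raise messenger (Pipe reader/writer) debugging
[osd]
        debug ms = 20
```

It may also be possible to apply this at runtime without a restart, e.g. something along the lines of `ceph osd tell \* injectargs '--debug-ms 20'`, though the exact spelling varies between Ceph versions, so verify against the release in use.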

> >>> The strange bit is that after this, you get those indefinite hangs.  From 
> >>> the logs it looks like the OSD rebound to an old port that was previously 
> >>> open from osd.0.. probably from way back.  Do you have logs going further 
> >>> back than what you posted?  Also, do you have osdmaps, say, 750 and 
> >>> onward?  It looks like there is a bug in the connection handling code 
> >>> (that is unrelated to the delay above).
> >> 
> >> Currently uploading logs starting midnight to dropbox, will send
> > links when they're up.
> >> 
> >> How would I retrieve the interesting osdmaps?
> > 
> > They are in the monitor data directory, in the osdmap_full dir.
> 
> Logs from midnight onwards and osdmaps are in this folder:
> 
> https://www.dropbox.com/sh/7nq7gr2u2deorcu/Nvw3FFGiy2
> 
>   ceph-mon.b2.log.bz2
>   ceph-mon.b4.log.bz2
>   ceph-mon.b5.log.bz2
>   ceph-osd.0.log.bz2
>   ceph-osd.1.log.bz2 (still uploading as I type)
>   osdmaps.zip

I'll take a look...

sage

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Mon losing touch with OSDs
  2013-02-23  0:13                       ` Sage Weil
@ 2013-02-23  0:25                         ` Sage Weil
  2013-02-23  0:50                           ` Chris Dunlop
  2013-02-23  0:57                         ` Chris Dunlop
  1 sibling, 1 reply; 25+ messages in thread
From: Sage Weil @ 2013-02-23  0:25 UTC (permalink / raw)
  To: Chris Dunlop; +Cc: ceph-devel

Hi Chris-

Can you confirm that both ceph-osd daemons are running v0.56.3 (i.e., 
they were restarted after the upgrade)?

sage

On Fri, 22 Feb 2013, Sage Weil wrote:
> On Sat, 23 Feb 2013, Chris Dunlop wrote:
> > On Fri, Feb 22, 2013 at 03:43:22PM -0800, Sage Weil wrote:
> > > On Sat, 23 Feb 2013, Chris Dunlop wrote:
> > >> On Fri, Feb 22, 2013 at 01:57:32PM -0800, Sage Weil wrote:
> > >>> On Fri, 22 Feb 2013, Chris Dunlop wrote:
> > >>>> G'day,
> > >>>> 
> > >>>> It seems there might be two issues here: the first being the delayed
> > >>>> receipt of echo replies causing a seemingly otherwise healthy osd to be
> > >>>> marked down, the second being the lack of recovery once the downed osd is
> > >>>> recognised as up again.
> > >>>> 
> > >>>> Is it worth my opening tracker reports for this, just so it doesn't get
> > >>>> lost?
> > >>> 
> > >>> I just looked at the logs.  I can't tell what happened to cause that 10 
> > >>> second delay.. strangely, messages were passing from 0 -> 1, but nothing 
> > >>> came back from 1 -> 0 (although 1 was queuing, if not sending, them).
> > 
> > Is there any way of telling where they were delayed, i.e. in the 1's output
> > queue or 0's input queue?
> 
> Yeah, if you bump it up to 'debug ms = 20'.  Be aware that that will 
> generate a lot of logging, though.
> 
> > >>> The strange bit is that after this, you get those indefinite hangs.  From 
> > >>> the logs it looks like the OSD rebound to an old port that was previously 
> > >>> open from osd.0.. probably from way back.  Do you have logs going further 
> > >>> back than what you posted?  Also, do you have osdmaps, say, 750 and 
> > >>> onward?  It looks like there is a bug in the connection handling code 
> > >>> (that is unrelated to the delay above).
> > >> 
> > >> Currently uploading logs starting midnight to dropbox, will send
> > >> links when they're up.
> > >> 
> > >> How would I retrieve the interesting osdmaps?
> > > 
> > > They are in the monitor data directory, in the osdmap_full dir.
> > 
> > Logs from midnight onwards and osdmaps are in this folder:
> > 
> > https://www.dropbox.com/sh/7nq7gr2u2deorcu/Nvw3FFGiy2
> > 
> >   ceph-mon.b2.log.bz2
> >   ceph-mon.b4.log.bz2
> >   ceph-mon.b5.log.bz2
> >   ceph-osd.0.log.bz2
> >   ceph-osd.1.log.bz2 (still uploading as I type)
> >   osdmaps.zip
> 
> I'll take a look...

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Mon losing touch with OSDs
  2013-02-23  0:25                         ` Sage Weil
@ 2013-02-23  0:50                           ` Chris Dunlop
  2013-02-23  1:10                             ` Chris Dunlop
  0 siblings, 1 reply; 25+ messages in thread
From: Chris Dunlop @ 2013-02-23  0:50 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On Fri, Feb 22, 2013 at 04:25:39PM -0800, Sage Weil wrote:
> Hi Chris-
> 
> Can you confirm that both ceph-osd daemons are running v0.56.3 (i.e., 
> they were restarted after the upgrade)?

Not absolutely, but the indications are good: the osd.1 process
was started Feb 16 08:38:43 2013, 20 minutes before I sent my
email saying it had been upgraded. The osd.0 process was
restarted more recently to kick things along after the most
recent problem; however, my command line history shows an "apt-get
upgrade" followed by a "service ceph restart".

Chris

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Mon losing touch with OSDs
  2013-02-23  0:13                       ` Sage Weil
  2013-02-23  0:25                         ` Sage Weil
@ 2013-02-23  0:57                         ` Chris Dunlop
  2013-02-23  1:30                           ` Sage Weil
  1 sibling, 1 reply; 25+ messages in thread
From: Chris Dunlop @ 2013-02-23  0:57 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On Fri, Feb 22, 2013 at 04:13:21PM -0800, Sage Weil wrote:
> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>> On Fri, Feb 22, 2013 at 03:43:22PM -0800, Sage Weil wrote:
>>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>>>> On Fri, Feb 22, 2013 at 01:57:32PM -0800, Sage Weil wrote:
>>>>> On Fri, 22 Feb 2013, Chris Dunlop wrote:
>>>>>> G'day,
>>>>>> 
>>>>>> It seems there might be two issues here: the first being the delayed
>>>>>> receipt of echo replies causing a seemingly otherwise healthy osd to be
>>>>>> marked down, the second being the lack of recovery once the downed osd is
>>>>>> recognised as up again.
>>>>>> 
>>>>>> Is it worth my opening tracker reports for this, just so it doesn't get
>>>>>> lost?
>>>>> 
>>>>> I just looked at the logs.  I can't tell what happened to cause that 10 
>>>>> second delay.. strangely, messages were passing from 0 -> 1, but nothing 
>>>>> came back from 1 -> 0 (although 1 was queuing, if not sending, them).
>> 
>> Is there any way of telling where they were delayed, i.e. in the 1's output
>> queue or 0's input queue?
> 
> Yeah, if you bump it up to 'debug ms = 20'.  Be aware that that will 
> generate a lot of logging, though.

I really don't want to load the system with too much logging, but I'm happy
modifying code...  Are there specific interesting debug outputs which I can
modify so they're output under "ms = 1"?

Chris

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Mon losing touch with OSDs
  2013-02-23  0:50                           ` Chris Dunlop
@ 2013-02-23  1:10                             ` Chris Dunlop
  0 siblings, 0 replies; 25+ messages in thread
From: Chris Dunlop @ 2013-02-23  1:10 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On Sat, Feb 23, 2013 at 11:50:26AM +1100, Chris Dunlop wrote:
> On Fri, Feb 22, 2013 at 04:25:39PM -0800, Sage Weil wrote:
>> Hi Chris-
>> 
>> Can you confirm that both ceph-osd daemons are running v0.56.3 (i.e., 
>> they were restarted after the upgrade)?
> 
> Not absolutely, but the indications are good: the osd.1 process
> was started Feb 16 08:38:43 2013, 20 minutes before I sent my
> email saying it had been upgraded. The osd.0 process was
> restarted more recently to kick things along after the most
> recent problem; however, my command line history shows an "apt-get
> upgrade" followed by a "service ceph restart".


I can't see anything in the logs indicating the various daemon version
numbers.

Unless I'm blind and it's already there, perhaps it would be a good idea
for every daemon to log its version number when it starts?

Chris

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Mon losing touch with OSDs
  2013-02-23  0:57                         ` Chris Dunlop
@ 2013-02-23  1:30                           ` Sage Weil
  2013-02-23  1:49                             ` Chris Dunlop
  0 siblings, 1 reply; 25+ messages in thread
From: Sage Weil @ 2013-02-23  1:30 UTC (permalink / raw)
  To: Chris Dunlop; +Cc: ceph-devel

On Sat, 23 Feb 2013, Chris Dunlop wrote:
> On Fri, Feb 22, 2013 at 04:13:21PM -0800, Sage Weil wrote:
> > On Sat, 23 Feb 2013, Chris Dunlop wrote:
> >> On Fri, Feb 22, 2013 at 03:43:22PM -0800, Sage Weil wrote:
> >>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
> >>>> On Fri, Feb 22, 2013 at 01:57:32PM -0800, Sage Weil wrote:
> >>>>> On Fri, 22 Feb 2013, Chris Dunlop wrote:
> >>>>>> G'day,
> >>>>>> 
> >>>>>> It seems there might be two issues here: the first being the delayed
> >>>>>> receipt of echo replies causing a seemingly otherwise healthy osd to be
> >>>>>> marked down, the second being the lack of recovery once the downed osd is
> >>>>>> recognised as up again.
> >>>>>> 
> >>>>>> Is it worth my opening tracker reports for this, just so it doesn't get
> >>>>>> lost?
> >>>>> 
> >>>>> I just looked at the logs.  I can't tell what happened to cause that 10 
> >>>>> second delay.. strangely, messages were passing from 0 -> 1, but nothing 
> >>>>> came back from 1 -> 0 (although 1 was queuing, if not sending, them).
> >> 
> >> Is there any way of telling where they were delayed, i.e. in the 1's output
> >> queue or 0's input queue?
> > 
> > Yeah, if you bump it up to 'debug ms = 20'.  Be aware that that will 
> > generate a lot of logging, though.
> 
> I really don't want to load the system with too much logging, but I'm happy
> modifying code...  Are there specific interesting debug outputs which I can
> modify so they're output under "ms = 1"?

I'm basically interested in everything in writer() and write_message(), 
and reader() and read_message()...

sage

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Mon losing touch with OSDs
  2013-02-23  1:30                           ` Sage Weil
@ 2013-02-23  1:49                             ` Chris Dunlop
  2013-02-23  1:52                               ` Sage Weil
  0 siblings, 1 reply; 25+ messages in thread
From: Chris Dunlop @ 2013-02-23  1:49 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On Fri, Feb 22, 2013 at 05:30:04PM -0800, Sage Weil wrote:
> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>> On Fri, Feb 22, 2013 at 04:13:21PM -0800, Sage Weil wrote:
>>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>>>> On Fri, Feb 22, 2013 at 03:43:22PM -0800, Sage Weil wrote:
>>>>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>>>>>> On Fri, Feb 22, 2013 at 01:57:32PM -0800, Sage Weil wrote:
>>>>>>> I just looked at the logs.  I can't tell what happened to cause that 10 
>>>>>>> second delay.. strangely, messages were passing from 0 -> 1, but nothing 
>>>>>>> came back from 1 -> 0 (although 1 was queuing, if not sending, them).
>>>> 
>>>> Is there any way of telling where they were delayed, i.e. in the 1's output
>>>> queue or 0's input queue?
>>> 
>>> Yeah, if you bump it up to 'debug ms = 20'.  Be aware that that will 
>>> generate a lot of logging, though.
>> 
>> I really don't want to load the system with too much logging, but I'm happy
>> modifying code...  Are there specific interesting debug outputs which I can
>> modify so they're output under "ms = 1"?
> 
> I'm basically interested in everything in writer() and write_message(), 
> and reader() and read_message()...

Like this?

----------------------------------------------------------------------
diff --git a/src/msg/Pipe.cc b/src/msg/Pipe.cc
index 37b1eeb..db4774f 100644
--- a/src/msg/Pipe.cc
+++ b/src/msg/Pipe.cc
@@ -1263,7 +1263,7 @@ void Pipe::reader()
 
     // sleep if (re)connecting
     if (state == STATE_STANDBY) {
-      ldout(msgr->cct,20) << "reader sleeping during reconnect|standby" << dendl;
+      ldout(msgr->cct, 1) << "reader sleeping during reconnect|standby" << dendl;
       cond.Wait(pipe_lock);
       continue;
     }
@@ -1272,28 +1272,28 @@ void Pipe::reader()
 
     char buf[80];
     char tag = -1;
-    ldout(msgr->cct,20) << "reader reading tag..." << dendl;
+    ldout(msgr->cct, 1) << "reader reading tag..." << dendl;
     if (tcp_read((char*)&tag, 1) < 0) {
       pipe_lock.Lock();
-      ldout(msgr->cct,2) << "reader couldn't read tag, " << strerror_r(errno, buf, sizeof(buf)) << dendl;
+      ldout(msgr->cct, 1) << "reader couldn't read tag, " << strerror_r(errno, buf, sizeof(buf)) << dendl;
       fault(true);
       continue;
     }
 
     if (tag == CEPH_MSGR_TAG_KEEPALIVE) {
-      ldout(msgr->cct,20) << "reader got KEEPALIVE" << dendl;
+      ldout(msgr->cct, 1) << "reader got KEEPALIVE" << dendl;
       pipe_lock.Lock();
       continue;
     }
 
     // open ...
     if (tag == CEPH_MSGR_TAG_ACK) {
-      ldout(msgr->cct,20) << "reader got ACK" << dendl;
+      ldout(msgr->cct, 1) << "reader got ACK" << dendl;
       ceph_le64 seq;
       int rc = tcp_read((char*)&seq, sizeof(seq));
       pipe_lock.Lock();
       if (rc < 0) {
-	ldout(msgr->cct,2) << "reader couldn't read ack seq, " << strerror_r(errno, buf, sizeof(buf)) << dendl;
+	ldout(msgr->cct, 1) << "reader couldn't read ack seq, " << strerror_r(errno, buf, sizeof(buf)) << dendl;
 	fault(true);
       } else if (state != STATE_CLOSED) {
         handle_ack(seq);
@@ -1302,7 +1302,7 @@ void Pipe::reader()
     }
 
     else if (tag == CEPH_MSGR_TAG_MSG) {
-      ldout(msgr->cct,20) << "reader got MSG" << dendl;
+      ldout(msgr->cct, 1) << "reader got MSG" << dendl;
       Message *m = 0;
       int r = read_message(&m);
 
@@ -1342,7 +1342,7 @@ void Pipe::reader()
 
       cond.Signal();  // wake up writer, to ack this
       
-      ldout(msgr->cct,10) << "reader got message "
+      ldout(msgr->cct, 1) << "reader got message "
 	       << m->get_seq() << " " << m << " " << *m
 	       << dendl;
 
@@ -1360,7 +1360,7 @@ void Pipe::reader()
     } 
     
     else if (tag == CEPH_MSGR_TAG_CLOSE) {
-      ldout(msgr->cct,20) << "reader got CLOSE" << dendl;
+      ldout(msgr->cct, 1) << "reader got CLOSE" << dendl;
       pipe_lock.Lock();
       if (state == STATE_CLOSING) {
 	state = STATE_CLOSED;
@@ -1383,7 +1383,7 @@ void Pipe::reader()
   reader_running = false;
   reader_needs_join = true;
   unlock_maybe_reap();
-  ldout(msgr->cct,10) << "reader done" << dendl;
+  ldout(msgr->cct, 1) << "reader done" << dendl;
 }
 
 /* write msgs to socket.
@@ -1395,7 +1395,7 @@ void Pipe::writer()
 
   pipe_lock.Lock();
   while (state != STATE_CLOSED) {// && state != STATE_WAIT) {
-    ldout(msgr->cct,10) << "writer: state = " << get_state_name()
+    ldout(msgr->cct, 1) << "writer: state = " << get_state_name()
 			<< " policy.server=" << policy.server << dendl;
 
     // standby?
@@ -1413,7 +1413,7 @@ void Pipe::writer()
     
     if (state == STATE_CLOSING) {
       // write close tag
-      ldout(msgr->cct,20) << "writer writing CLOSE tag" << dendl;
+      ldout(msgr->cct, 1) << "writer writing CLOSE tag" << dendl;
       char tag = CEPH_MSGR_TAG_CLOSE;
       state = STATE_CLOSED;
       state_closed.set(1);
@@ -1436,7 +1436,7 @@ void Pipe::writer()
 	int rc = write_keepalive();
 	pipe_lock.Lock();
 	if (rc < 0) {
-	  ldout(msgr->cct,2) << "writer couldn't write keepalive, " << strerror_r(errno, buf, sizeof(buf)) << dendl;
+	  ldout(msgr->cct, 1) << "writer couldn't write keepalive, " << strerror_r(errno, buf, sizeof(buf)) << dendl;
 	  fault();
  	  continue;
 	}
@@ -1450,7 +1450,7 @@ void Pipe::writer()
 	int rc = write_ack(send_seq);
 	pipe_lock.Lock();
 	if (rc < 0) {
-	  ldout(msgr->cct,2) << "writer couldn't write ack, " << strerror_r(errno, buf, sizeof(buf)) << dendl;
+	  ldout(msgr->cct, 1) << "writer couldn't write ack, " << strerror_r(errno, buf, sizeof(buf)) << dendl;
 	  fault();
  	  continue;
 	}
@@ -1470,7 +1470,7 @@ void Pipe::writer()
 	// associate message with Connection (for benefit of encode_payload)
 	m->set_connection(connection_state->get());
 
-        ldout(msgr->cct,20) << "writer encoding " << m->get_seq() << " " << m << " " << *m << dendl;
+        ldout(msgr->cct, 1) << "writer encoding " << m->get_seq() << " " << m << " " << *m << dendl;
 
 	// encode and copy out of *m
 	m->encode(connection_state->get_features(), !msgr->cct->_conf->ms_nocrc);
@@ -1485,13 +1485,13 @@ void Pipe::writer()
 	// actually calculate and check the signature, but they should
 	// handle the calls to sign_message and check_signature.  PLR
 	if (session_security == NULL) {
-	  ldout(msgr->cct, 20) << "writer no session security" << dendl;
+	  ldout(msgr->cct, 1) << "writer no session security" << dendl;
 	} else {
 	  if (session_security->sign_message(m)) {
-	    ldout(msgr->cct, 20) << "writer failed to sign seq # " << header.seq
+	    ldout(msgr->cct, 1) << "writer failed to sign seq # " << header.seq
 				 << "): sig = " << footer.sig << dendl;
 	  } else {
-	    ldout(msgr->cct, 20) << "writer signed seq # " << header.seq
+	    ldout(msgr->cct, 1) << "writer signed seq # " << header.seq
 				 << "): sig = " << footer.sig << dendl;
 	  }
 	}
@@ -1502,7 +1502,7 @@ void Pipe::writer()
 
 	pipe_lock.Unlock();
 
-        ldout(msgr->cct,20) << "writer sending " << m->get_seq() << " " << m << dendl;
+        ldout(msgr->cct, 1) << "writer sending " << m->get_seq() << " " << m << dendl;
 	int rc = write_message(header, footer, blist);
 
 	pipe_lock.Lock();
@@ -1517,22 +1517,22 @@ void Pipe::writer()
     }
     
     if (sent.empty() && close_on_empty) {
-      ldout(msgr->cct,10) << "writer out and sent queues empty, closing" << dendl;
+      ldout(msgr->cct, 1) << "writer out and sent queues empty, closing" << dendl;
       stop();
       continue;
     }
 
     // wait
-    ldout(msgr->cct,20) << "writer sleeping" << dendl;
+    ldout(msgr->cct, 1) << "writer sleeping" << dendl;
     cond.Wait(pipe_lock);
   }
   
-  ldout(msgr->cct,20) << "writer finishing" << dendl;
+  ldout(msgr->cct, 1) << "writer finishing" << dendl;
 
   // reap?
   writer_running = false;
   unlock_maybe_reap();
-  ldout(msgr->cct,10) << "writer done" << dendl;
+  ldout(msgr->cct, 1) << "writer done" << dendl;
 }
 
 void Pipe::unlock_maybe_reap()
@@ -1596,7 +1596,7 @@ int Pipe::read_message(Message **pm)
     header_crc = ceph_crc32c_le(0, (unsigned char *)&oldheader, sizeof(oldheader) - sizeof(oldheader.crc));
   }
 
-  ldout(msgr->cct,20) << "reader got envelope type=" << header.type
+  ldout(msgr->cct, 1) << "reader got envelope type=" << header.type
            << " src " << entity_name_t(header.src)
            << " front=" << header.front_len
 	   << " data=" << header.data_len
@@ -1620,7 +1620,7 @@ int Pipe::read_message(Message **pm)
   uint64_t message_size = header.front_len + header.middle_len + header.data_len;
   if (message_size) {
     if (policy.throttler) {
-      ldout(msgr->cct,10) << "reader wants " << message_size << " from policy throttler "
+      ldout(msgr->cct, 1) << "reader wants " << message_size << " from policy throttler "
 	       << policy.throttler->get_current() << "/"
 	       << policy.throttler->get_max() << dendl;
       waited_on_throttle = policy.throttler->get(message_size);
@@ -1630,7 +1630,7 @@ int Pipe::read_message(Message **pm)
     // policy throttle, as this one does not deadlock (unless dispatch
     // blocks indefinitely, which it shouldn't).  in contrast, the
     // policy throttle carries for the lifetime of the message.
-    ldout(msgr->cct,10) << "reader wants " << message_size << " from dispatch throttler "
+    ldout(msgr->cct, 1) << "reader wants " << message_size << " from dispatch throttler "
 	     << msgr->dispatch_throttler.get_current() << "/"
 	     << msgr->dispatch_throttler.get_max() << dendl;
     waited_on_throttle |= msgr->dispatch_throttler.get(message_size);
@@ -1645,7 +1645,7 @@ int Pipe::read_message(Message **pm)
     if (tcp_read(bp.c_str(), front_len) < 0)
       goto out_dethrottle;
     front.push_back(bp);
-    ldout(msgr->cct,20) << "reader got front " << front.length() << dendl;
+    ldout(msgr->cct, 1) << "reader got front " << front.length() << dendl;
   }
 
   // read middle
@@ -1655,7 +1655,7 @@ int Pipe::read_message(Message **pm)
     if (tcp_read(bp.c_str(), middle_len) < 0)
       goto out_dethrottle;
     middle.push_back(bp);
-    ldout(msgr->cct,20) << "reader got middle " << middle.length() << dendl;
+    ldout(msgr->cct, 1) << "reader got middle " << middle.length() << dendl;
   }
 
 
@@ -1680,7 +1680,7 @@ int Pipe::read_message(Message **pm)
       map<tid_t,pair<bufferlist,int> >::iterator p = connection_state->rx_buffers.find(header.tid);
       if (p != connection_state->rx_buffers.end()) {
 	if (rxbuf.length() == 0 || p->second.second != rxbuf_version) {
-	  ldout(msgr->cct,10) << "reader seleting rx buffer v " << p->second.second
+	  ldout(msgr->cct, 1) << "reader seleting rx buffer v " << p->second.second
 		   << " at offset " << offset
 		   << " len " << p->second.first.length() << dendl;
 	  rxbuf = p->second.first;
@@ -1693,7 +1693,7 @@ int Pipe::read_message(Message **pm)
 	}
       } else {
 	if (!newbuf.length()) {
-	  ldout(msgr->cct,20) << "reader allocating new rx buffer at offset " << offset << dendl;
+	  ldout(msgr->cct, 1) << "reader allocating new rx buffer at offset " << offset << dendl;
 	  alloc_aligned_buffer(newbuf, data_len, data_off);
 	  blp = newbuf.begin();
 	  blp.advance(offset);
@@ -1701,7 +1701,7 @@ int Pipe::read_message(Message **pm)
       }
       bufferptr bp = blp.get_current_ptr();
       int read = MIN(bp.length(), left);
-      ldout(msgr->cct,20) << "reader reading nonblocking into " << (void*)bp.c_str() << " len " << bp.length() << dendl;
+      ldout(msgr->cct, 1) << "reader reading nonblocking into " << (void*)bp.c_str() << " len " << bp.length() << dendl;
       int got = tcp_read_nonblocking(bp.c_str(), read);
       ldout(msgr->cct,30) << "reader read " << got << " of " << read << dendl;
       connection_state->lock.Unlock();
@@ -1732,7 +1732,7 @@ int Pipe::read_message(Message **pm)
   }
   
   aborted = (footer.flags & CEPH_MSG_FOOTER_COMPLETE) == 0;
-  ldout(msgr->cct,10) << "aborted = " << aborted << dendl;
+  ldout(msgr->cct, 1) << "aborted = " << aborted << dendl;
   if (aborted) {
     ldout(msgr->cct,0) << "reader got " << front.length() << " + " << middle.length() << " + " << data.length()
 	    << " byte message.. ABORTED" << dendl;
@@ -1740,7 +1740,7 @@ int Pipe::read_message(Message **pm)
     goto out_dethrottle;
   }
 
-  ldout(msgr->cct,20) << "reader got " << front.length() << " + " << middle.length() << " + " << data.length()
+  ldout(msgr->cct, 1) << "reader got " << front.length() << " + " << middle.length() << " + " << data.length()
 	   << " byte message" << dendl;
   message = decode_message(msgr->cct, header, footer, front, middle, data);
   if (!message) {
@@ -1753,7 +1753,7 @@ int Pipe::read_message(Message **pm)
   //
 
   if (session_security == NULL) {
-    ldout(msgr->cct, 10) << "No session security set" << dendl;
+    ldout(msgr->cct, 1) << "No session security set" << dendl;
   } else {
     if (session_security->check_message_signature(message)) {
       ldout(msgr->cct, 0) << "Signature check failed" << dendl;
@@ -1779,7 +1779,7 @@ int Pipe::read_message(Message **pm)
   // release bytes reserved from the throttlers on failure
   if (message_size) {
     if (policy.throttler) {
-      ldout(msgr->cct,10) << "reader releasing " << message_size << " to policy throttler "
+      ldout(msgr->cct, 1) << "reader releasing " << message_size << " to policy throttler "
 	       << policy.throttler->get_current() << "/"
 	       << policy.throttler->get_max() << dendl;
       policy.throttler->put(message_size);
----------------------------------------------------------------------
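The net effect of every hunk is the same: the message text is unchanged, only the gate level drops, so the lines show up at a modest 'debug ms' setting. A minimal sketch (hypothetical, not Ceph code) of how ldout-style level gating behaves:

```python
# Hypothetical sketch of ldout-style gating (not Ceph code): a message
# is emitted only when its level is <= the configured "debug ms" value,
# so changing ldout(..., 20) to ldout(..., 1) makes it visible at a
# low-noise setting.

class Logger:
    def __init__(self, debug_ms):
        self.debug_ms = debug_ms   # configured subsystem debug level
        self.lines = []            # captured output, for illustration

    def ldout(self, level, msg):
        # gate: emit only if the configured level is at least `level`
        if level <= self.debug_ms:
            self.lines.append(msg)

log = Logger(debug_ms=1)          # typical low-noise production setting
log.ldout(20, "reader got ACK")   # suppressed before the patch
log.ldout(1, "reader got ACK")    # emitted after the patch
print(log.lines)                  # prints ['reader got ACK']
```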

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: Mon losing touch with OSDs
  2013-02-23  1:49                             ` Chris Dunlop
@ 2013-02-23  1:52                               ` Sage Weil
  2013-02-23  2:02                                 ` Chris Dunlop
  0 siblings, 1 reply; 25+ messages in thread
From: Sage Weil @ 2013-02-23  1:52 UTC (permalink / raw)
  To: Chris Dunlop; +Cc: ceph-devel

On Sat, 23 Feb 2013, Chris Dunlop wrote:
> On Fri, Feb 22, 2013 at 05:30:04PM -0800, Sage Weil wrote:
> > On Sat, 23 Feb 2013, Chris Dunlop wrote:
> >> On Fri, Feb 22, 2013 at 04:13:21PM -0800, Sage Weil wrote:
> >>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
> >>>> On Fri, Feb 22, 2013 at 03:43:22PM -0800, Sage Weil wrote:
> >>>>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
> >>>>>> On Fri, Feb 22, 2013 at 01:57:32PM -0800, Sage Weil wrote:
> >>>>>>> I just looked at the logs.  I can't tell what happened to cause that 10 
> >>>>>>> second delay.. strangely, messages were passing from 0 -> 1, but nothing 
> >>>>>>> came back from 1 -> 0 (although 1 was queuing, if not sending, them).
> >>>> 
> >>>> Is there any way of telling where they were delayed, i.e. in the 1's output
> >>>> queue or 0's input queue?
> >>> 
> >>> Yeah, if you bump it up to 'debug ms = 20'.  Be aware that that will 
> >>> generate a lot of logging, though.
> >> 
> >> I really don't want to load the system with too much logging, but I'm happy
> >> modifying code...  Are there specific interesting debug outputs which I can
> >> modify so they're output under "ms = 1"?
> > 
> > I'm basically interested in everything in writer() and write_message(), 
> > and reader() and read_message()...
> 
> Like this?

Yeah.  You could do 2 instead of 1 so you can turn it down.  I suspect 
that this is the lion's share of what debug 20 will spam to the log, but 
hopefully the load is manageable!

sage



> ----------------------------------------------------------------------
> diff --git a/src/msg/Pipe.cc b/src/msg/Pipe.cc
> index 37b1eeb..db4774f 100644
> --- a/src/msg/Pipe.cc
> +++ b/src/msg/Pipe.cc
> @@ -1263,7 +1263,7 @@ void Pipe::reader()
>  
>      // sleep if (re)connecting
>      if (state == STATE_STANDBY) {
> -      ldout(msgr->cct,20) << "reader sleeping during reconnect|standby" << dendl;
> +      ldout(msgr->cct, 1) << "reader sleeping during reconnect|standby" << dendl;
>        cond.Wait(pipe_lock);
>        continue;
>      }
> @@ -1272,28 +1272,28 @@ void Pipe::reader()
>  
>      char buf[80];
>      char tag = -1;
> -    ldout(msgr->cct,20) << "reader reading tag..." << dendl;
> +    ldout(msgr->cct, 1) << "reader reading tag..." << dendl;
>      if (tcp_read((char*)&tag, 1) < 0) {
>        pipe_lock.Lock();
> -      ldout(msgr->cct,2) << "reader couldn't read tag, " << strerror_r(errno, buf, sizeof(buf)) << dendl;
> +      ldout(msgr->cct, 1) << "reader couldn't read tag, " << strerror_r(errno, buf, sizeof(buf)) << dendl;
>        fault(true);
>        continue;
>      }
>  
>      if (tag == CEPH_MSGR_TAG_KEEPALIVE) {
> -      ldout(msgr->cct,20) << "reader got KEEPALIVE" << dendl;
> +      ldout(msgr->cct, 1) << "reader got KEEPALIVE" << dendl;
>        pipe_lock.Lock();
>        continue;
>      }
>  
>      // open ...
>      if (tag == CEPH_MSGR_TAG_ACK) {
> -      ldout(msgr->cct,20) << "reader got ACK" << dendl;
> +      ldout(msgr->cct, 1) << "reader got ACK" << dendl;
>        ceph_le64 seq;
>        int rc = tcp_read((char*)&seq, sizeof(seq));
>        pipe_lock.Lock();
>        if (rc < 0) {
> -	ldout(msgr->cct,2) << "reader couldn't read ack seq, " << strerror_r(errno, buf, sizeof(buf)) << dendl;
> +	ldout(msgr->cct, 1) << "reader couldn't read ack seq, " << strerror_r(errno, buf, sizeof(buf)) << dendl;
>  	fault(true);
>        } else if (state != STATE_CLOSED) {
>          handle_ack(seq);
> @@ -1302,7 +1302,7 @@ void Pipe::reader()
>      }
>  
>      else if (tag == CEPH_MSGR_TAG_MSG) {
> -      ldout(msgr->cct,20) << "reader got MSG" << dendl;
> +      ldout(msgr->cct, 1) << "reader got MSG" << dendl;
>        Message *m = 0;
>        int r = read_message(&m);
>  
> @@ -1342,7 +1342,7 @@ void Pipe::reader()
>  
>        cond.Signal();  // wake up writer, to ack this
>        
> -      ldout(msgr->cct,10) << "reader got message "
> +      ldout(msgr->cct, 1) << "reader got message "
>  	       << m->get_seq() << " " << m << " " << *m
>  	       << dendl;
>  
> @@ -1360,7 +1360,7 @@ void Pipe::reader()
>      } 
>      
>      else if (tag == CEPH_MSGR_TAG_CLOSE) {
> -      ldout(msgr->cct,20) << "reader got CLOSE" << dendl;
> +      ldout(msgr->cct, 1) << "reader got CLOSE" << dendl;
>        pipe_lock.Lock();
>        if (state == STATE_CLOSING) {
>  	state = STATE_CLOSED;
> @@ -1383,7 +1383,7 @@ void Pipe::reader()
>    reader_running = false;
>    reader_needs_join = true;
>    unlock_maybe_reap();
> -  ldout(msgr->cct,10) << "reader done" << dendl;
> +  ldout(msgr->cct, 1) << "reader done" << dendl;
>  }
>  
>  /* write msgs to socket.
> @@ -1395,7 +1395,7 @@ void Pipe::writer()
>  
>    pipe_lock.Lock();
>    while (state != STATE_CLOSED) {// && state != STATE_WAIT) {
> -    ldout(msgr->cct,10) << "writer: state = " << get_state_name()
> +    ldout(msgr->cct, 1) << "writer: state = " << get_state_name()
>  			<< " policy.server=" << policy.server << dendl;
>  
>      // standby?
> @@ -1413,7 +1413,7 @@ void Pipe::writer()
>      
>      if (state == STATE_CLOSING) {
>        // write close tag
> -      ldout(msgr->cct,20) << "writer writing CLOSE tag" << dendl;
> +      ldout(msgr->cct, 1) << "writer writing CLOSE tag" << dendl;
>        char tag = CEPH_MSGR_TAG_CLOSE;
>        state = STATE_CLOSED;
>        state_closed.set(1);
> @@ -1436,7 +1436,7 @@ void Pipe::writer()
>  	int rc = write_keepalive();
>  	pipe_lock.Lock();
>  	if (rc < 0) {
> -	  ldout(msgr->cct,2) << "writer couldn't write keepalive, " << strerror_r(errno, buf, sizeof(buf)) << dendl;
> +	  ldout(msgr->cct, 1) << "writer couldn't write keepalive, " << strerror_r(errno, buf, sizeof(buf)) << dendl;
>  	  fault();
>   	  continue;
>  	}
> @@ -1450,7 +1450,7 @@ void Pipe::writer()
>  	int rc = write_ack(send_seq);
>  	pipe_lock.Lock();
>  	if (rc < 0) {
> -	  ldout(msgr->cct,2) << "writer couldn't write ack, " << strerror_r(errno, buf, sizeof(buf)) << dendl;
> +	  ldout(msgr->cct, 1) << "writer couldn't write ack, " << strerror_r(errno, buf, sizeof(buf)) << dendl;
>  	  fault();
>   	  continue;
>  	}
> @@ -1470,7 +1470,7 @@ void Pipe::writer()
>  	// associate message with Connection (for benefit of encode_payload)
>  	m->set_connection(connection_state->get());
>  
> -        ldout(msgr->cct,20) << "writer encoding " << m->get_seq() << " " << m << " " << *m << dendl;
> +        ldout(msgr->cct, 1) << "writer encoding " << m->get_seq() << " " << m << " " << *m << dendl;
>  
>  	// encode and copy out of *m
>  	m->encode(connection_state->get_features(), !msgr->cct->_conf->ms_nocrc);
> @@ -1485,13 +1485,13 @@ void Pipe::writer()
>  	// actually calculate and check the signature, but they should
>  	// handle the calls to sign_message and check_signature.  PLR
>  	if (session_security == NULL) {
> -	  ldout(msgr->cct, 20) << "writer no session security" << dendl;
> +	  ldout(msgr->cct, 1) << "writer no session security" << dendl;
>  	} else {
>  	  if (session_security->sign_message(m)) {
> -	    ldout(msgr->cct, 20) << "writer failed to sign seq # " << header.seq
> +	    ldout(msgr->cct, 1) << "writer failed to sign seq # " << header.seq
>  				 << "): sig = " << footer.sig << dendl;
>  	  } else {
> -	    ldout(msgr->cct, 20) << "writer signed seq # " << header.seq
> +	    ldout(msgr->cct, 1) << "writer signed seq # " << header.seq
>  				 << "): sig = " << footer.sig << dendl;
>  	  }
>  	}
> @@ -1502,7 +1502,7 @@ void Pipe::writer()
>  
>  	pipe_lock.Unlock();
>  
> -        ldout(msgr->cct,20) << "writer sending " << m->get_seq() << " " << m << dendl;
> +        ldout(msgr->cct, 1) << "writer sending " << m->get_seq() << " " << m << dendl;
>  	int rc = write_message(header, footer, blist);
>  
>  	pipe_lock.Lock();
> @@ -1517,22 +1517,22 @@ void Pipe::writer()
>      }
>      
>      if (sent.empty() && close_on_empty) {
> -      ldout(msgr->cct,10) << "writer out and sent queues empty, closing" << dendl;
> +      ldout(msgr->cct, 1) << "writer out and sent queues empty, closing" << dendl;
>        stop();
>        continue;
>      }
>  
>      // wait
> -    ldout(msgr->cct,20) << "writer sleeping" << dendl;
> +    ldout(msgr->cct, 1) << "writer sleeping" << dendl;
>      cond.Wait(pipe_lock);
>    }
>    
> -  ldout(msgr->cct,20) << "writer finishing" << dendl;
> +  ldout(msgr->cct, 1) << "writer finishing" << dendl;
>  
>    // reap?
>    writer_running = false;
>    unlock_maybe_reap();
> -  ldout(msgr->cct,10) << "writer done" << dendl;
> +  ldout(msgr->cct, 1) << "writer done" << dendl;
>  }
>  
>  void Pipe::unlock_maybe_reap()
> @@ -1596,7 +1596,7 @@ int Pipe::read_message(Message **pm)
>      header_crc = ceph_crc32c_le(0, (unsigned char *)&oldheader, sizeof(oldheader) - sizeof(oldheader.crc));
>    }
>  
> -  ldout(msgr->cct,20) << "reader got envelope type=" << header.type
> +  ldout(msgr->cct, 1) << "reader got envelope type=" << header.type
>             << " src " << entity_name_t(header.src)
>             << " front=" << header.front_len
>  	   << " data=" << header.data_len
> @@ -1620,7 +1620,7 @@ int Pipe::read_message(Message **pm)
>    uint64_t message_size = header.front_len + header.middle_len + header.data_len;
>    if (message_size) {
>      if (policy.throttler) {
> -      ldout(msgr->cct,10) << "reader wants " << message_size << " from policy throttler "
> +      ldout(msgr->cct, 1) << "reader wants " << message_size << " from policy throttler "
>  	       << policy.throttler->get_current() << "/"
>  	       << policy.throttler->get_max() << dendl;
>        waited_on_throttle = policy.throttler->get(message_size);
> @@ -1630,7 +1630,7 @@ int Pipe::read_message(Message **pm)
>      // policy throttle, as this one does not deadlock (unless dispatch
>      // blocks indefinitely, which it shouldn't).  in contrast, the
>      // policy throttle carries for the lifetime of the message.
> -    ldout(msgr->cct,10) << "reader wants " << message_size << " from dispatch throttler "
> +    ldout(msgr->cct, 1) << "reader wants " << message_size << " from dispatch throttler "
>  	     << msgr->dispatch_throttler.get_current() << "/"
>  	     << msgr->dispatch_throttler.get_max() << dendl;
>      waited_on_throttle |= msgr->dispatch_throttler.get(message_size);
> @@ -1645,7 +1645,7 @@ int Pipe::read_message(Message **pm)
>      if (tcp_read(bp.c_str(), front_len) < 0)
>        goto out_dethrottle;
>      front.push_back(bp);
> -    ldout(msgr->cct,20) << "reader got front " << front.length() << dendl;
> +    ldout(msgr->cct, 1) << "reader got front " << front.length() << dendl;
>    }
>  
>    // read middle
> @@ -1655,7 +1655,7 @@ int Pipe::read_message(Message **pm)
>      if (tcp_read(bp.c_str(), middle_len) < 0)
>        goto out_dethrottle;
>      middle.push_back(bp);
> -    ldout(msgr->cct,20) << "reader got middle " << middle.length() << dendl;
> +    ldout(msgr->cct, 1) << "reader got middle " << middle.length() << dendl;
>    }
>  
>  
> @@ -1680,7 +1680,7 @@ int Pipe::read_message(Message **pm)
>        map<tid_t,pair<bufferlist,int> >::iterator p = connection_state->rx_buffers.find(header.tid);
>        if (p != connection_state->rx_buffers.end()) {
>  	if (rxbuf.length() == 0 || p->second.second != rxbuf_version) {
> -	  ldout(msgr->cct,10) << "reader seleting rx buffer v " << p->second.second
> +	  ldout(msgr->cct, 1) << "reader seleting rx buffer v " << p->second.second
>  		   << " at offset " << offset
>  		   << " len " << p->second.first.length() << dendl;
>  	  rxbuf = p->second.first;
> @@ -1693,7 +1693,7 @@ int Pipe::read_message(Message **pm)
>  	}
>        } else {
>  	if (!newbuf.length()) {
> -	  ldout(msgr->cct,20) << "reader allocating new rx buffer at offset " << offset << dendl;
> +	  ldout(msgr->cct, 1) << "reader allocating new rx buffer at offset " << offset << dendl;
>  	  alloc_aligned_buffer(newbuf, data_len, data_off);
>  	  blp = newbuf.begin();
>  	  blp.advance(offset);
> @@ -1701,7 +1701,7 @@ int Pipe::read_message(Message **pm)
>        }
>        bufferptr bp = blp.get_current_ptr();
>        int read = MIN(bp.length(), left);
> -      ldout(msgr->cct,20) << "reader reading nonblocking into " << (void*)bp.c_str() << " len " << bp.length() << dendl;
> +      ldout(msgr->cct, 1) << "reader reading nonblocking into " << (void*)bp.c_str() << " len " << bp.length() << dendl;
>        int got = tcp_read_nonblocking(bp.c_str(), read);
>        ldout(msgr->cct,30) << "reader read " << got << " of " << read << dendl;
>        connection_state->lock.Unlock();
> @@ -1732,7 +1732,7 @@ int Pipe::read_message(Message **pm)
>    }
>    
>    aborted = (footer.flags & CEPH_MSG_FOOTER_COMPLETE) == 0;
> -  ldout(msgr->cct,10) << "aborted = " << aborted << dendl;
> +  ldout(msgr->cct, 1) << "aborted = " << aborted << dendl;
>    if (aborted) {
>      ldout(msgr->cct,0) << "reader got " << front.length() << " + " << middle.length() << " + " << data.length()
>  	    << " byte message.. ABORTED" << dendl;
> @@ -1740,7 +1740,7 @@ int Pipe::read_message(Message **pm)
>      goto out_dethrottle;
>    }
>  
> -  ldout(msgr->cct,20) << "reader got " << front.length() << " + " << middle.length() << " + " << data.length()
> +  ldout(msgr->cct, 1) << "reader got " << front.length() << " + " << middle.length() << " + " << data.length()
>  	   << " byte message" << dendl;
>    message = decode_message(msgr->cct, header, footer, front, middle, data);
>    if (!message) {
> @@ -1753,7 +1753,7 @@ int Pipe::read_message(Message **pm)
>    //
>  
>    if (session_security == NULL) {
> -    ldout(msgr->cct, 10) << "No session security set" << dendl;
> +    ldout(msgr->cct, 1) << "No session security set" << dendl;
>    } else {
>      if (session_security->check_message_signature(message)) {
>        ldout(msgr->cct, 0) << "Signature check failed" << dendl;
> @@ -1779,7 +1779,7 @@ int Pipe::read_message(Message **pm)
>    // release bytes reserved from the throttlers on failure
>    if (message_size) {
>      if (policy.throttler) {
> -      ldout(msgr->cct,10) << "reader releasing " << message_size << " to policy throttler "
> +      ldout(msgr->cct, 1) << "reader releasing " << message_size << " to policy throttler "
>  	       << policy.throttler->get_current() << "/"
>  	       << policy.throttler->get_max() << dendl;
>        policy.throttler->put(message_size);
> ----------------------------------------------------------------------
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Mon losing touch with OSDs
  2013-02-23  1:52                               ` Sage Weil
@ 2013-02-23  2:02                                 ` Chris Dunlop
  2013-03-01  2:02                                   ` Chris Dunlop
  0 siblings, 1 reply; 25+ messages in thread
From: Chris Dunlop @ 2013-02-23  2:02 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On Fri, Feb 22, 2013 at 05:52:11PM -0800, Sage Weil wrote:
> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>> On Fri, Feb 22, 2013 at 05:30:04PM -0800, Sage Weil wrote:
>>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>>>> On Fri, Feb 22, 2013 at 04:13:21PM -0800, Sage Weil wrote:
>>>>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>>>>>> On Fri, Feb 22, 2013 at 03:43:22PM -0800, Sage Weil wrote:
>>>>>>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>>>>>>>> On Fri, Feb 22, 2013 at 01:57:32PM -0800, Sage Weil wrote:
>>>>>>>>> I just looked at the logs.  I can't tell what happened to cause that 10 
>>>>>>>>> second delay.. strangely, messages were passing from 0 -> 1, but nothing 
>>>>>>>>> came back from 1 -> 0 (although 1 was queuing, if not sending, them).
>>>>>> 
>>>>>> Is there any way of telling where they were delayed, i.e. in the 1's output
>>>>>> queue or 0's input queue?
>>>>> 
>>>>> Yeah, if you bump it up to 'debug ms = 20'.  Be aware that that will 
>>>>> generate a lot of logging, though.
>>>> 
>>>> I really don't want to load the system with too much logging, but I'm happy
>>>> modifying code...  Are there specific interesting debug outputs which I can
>>>> modify so they're output under "ms = 1"?
>>> 
>>> I'm basically interested in everything in writer() and write_message(), 
>>> and reader() and read_message()...
>> 
>> Like this?
> 
> Yeah.  You could do 2 instead of 1 so you can turn it down.  I suspect 
> that this is the lion's share of what debug 20 will spam to the log, but 
> hopefully the load is manageable!

Good idea on the '2'. I'll get that installed and wait for it to happen again.
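
For the record, with the patched messages emitting at level 2, a ceph.conf fragment along these lines would capture them and can be dialled back down later (the [osd] section choice is an assumption; it could equally go under [global]):

```
; sketch: gate the patched messenger output at level 2
[osd]
    debug ms = 2
```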

Chris

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Mon losing touch with OSDs
  2013-02-23  2:02                                 ` Chris Dunlop
@ 2013-03-01  2:02                                   ` Chris Dunlop
  2013-03-01  5:00                                     ` Sage Weil
  0 siblings, 1 reply; 25+ messages in thread
From: Chris Dunlop @ 2013-03-01  2:02 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On Sat, Feb 23, 2013 at 01:02:53PM +1100, Chris Dunlop wrote:
> On Fri, Feb 22, 2013 at 05:52:11PM -0800, Sage Weil wrote:
>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>>> On Fri, Feb 22, 2013 at 05:30:04PM -0800, Sage Weil wrote:
>>>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>>>>> On Fri, Feb 22, 2013 at 04:13:21PM -0800, Sage Weil wrote:
>>>>>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>>>>>>> On Fri, Feb 22, 2013 at 03:43:22PM -0800, Sage Weil wrote:
>>>>>>>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>>>>>>>>> On Fri, Feb 22, 2013 at 01:57:32PM -0800, Sage Weil wrote:
>>>>>>>>>> I just looked at the logs.  I can't tell what happened to cause that 10 
>>>>>>>>>> second delay.. strangely, messages were passing from 0 -> 1, but nothing 
>>>>>>>>>> came back from 1 -> 0 (although 1 was queuing, if not sending, them).
>>>>>>> 
>>>>>>> Is there any way of telling where they were delayed, i.e. in the 1's output
>>>>>>> queue or 0's input queue?
>>>>>> 
>>>>>> Yeah, if you bump it up to 'debug ms = 20'.  Be aware that that will 
>>>>>> generate a lot of logging, though.
>>>>> 
>>>>> I really don't want to load the system with too much logging, but I'm happy
>>>>> modifying code...  Are there specific interesting debug outputs which I can
>>>>> modify so they're output under "ms = 1"?
>>>> 
>>>> I'm basically interested in everything in writer() and write_message(), 
>>>> and reader() and read_message()...
>>> 
>>> Like this?
>> 
>> Yeah.  You could do 2 instead of 1 so you can turn it down.  I suspect 
>> that this is the lion's share of what debug 20 will spam to the log, but 
>> hopefully the load is manageable!
> 
> Good idea on the '2'. I'll get that installed and wait for it to happen again.

FYI...

To avoid running out of disk space for the massive logs, I
started using logrotate on the ceph logs every two hours, which
does a 'service ceph reload' to re-open the log files.
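
The setup is roughly this (a sketch: the two-hourly cadence comes from a cron entry invoking logrotate against this file, and the paths and reload command follow the usual packaging, so adjust to taste):

```
# /etc/logrotate.d/ceph (sketch)
/var/log/ceph/*.log {
    rotate 12            # a day's worth at two-hourly rotation
    compress
    missingok
    notifempty
    sharedscripts
    postrotate
        service ceph reload >/dev/null 2>&1 || true
    endscript
}
```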

In the week since making that change I haven't seen any 'slow
requests' at all (the load has stayed the same as before), which
means the issue of the osds dropping out and the system then
failing to recover also hasn't happened.

That's a bit suspicious, no?

I've now put the log dirs on each machine on their own 2TB
partition and reverted back to the default daily rotates.

And once more we're waiting... Godot, is that you?


Chris

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Mon losing touch with OSDs
  2013-03-01  2:02                                   ` Chris Dunlop
@ 2013-03-01  5:00                                     ` Sage Weil
  2013-03-08  3:12                                       ` Chris Dunlop
  0 siblings, 1 reply; 25+ messages in thread
From: Sage Weil @ 2013-03-01  5:00 UTC (permalink / raw)
  To: Chris Dunlop; +Cc: ceph-devel

On Fri, 1 Mar 2013, Chris Dunlop wrote:
> On Sat, Feb 23, 2013 at 01:02:53PM +1100, Chris Dunlop wrote:
> > On Fri, Feb 22, 2013 at 05:52:11PM -0800, Sage Weil wrote:
> >> On Sat, 23 Feb 2013, Chris Dunlop wrote:
> >>> On Fri, Feb 22, 2013 at 05:30:04PM -0800, Sage Weil wrote:
> >>>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
> >>>>> On Fri, Feb 22, 2013 at 04:13:21PM -0800, Sage Weil wrote:
> >>>>>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
> >>>>>>> On Fri, Feb 22, 2013 at 03:43:22PM -0800, Sage Weil wrote:
> >>>>>>>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
> >>>>>>>>> On Fri, Feb 22, 2013 at 01:57:32PM -0800, Sage Weil wrote:
> >>>>>>>>>> I just looked at the logs.  I can't tell what happened to cause that 10 
> >>>>>>>>>> second delay.. strangely, messages were passing from 0 -> 1, but nothing 
> >>>>>>>>>> came back from 1 -> 0 (although 1 was queuing, if not sending, them).
> >>>>>>> 
> >>>>>>> Is there any way of telling where they were delayed, i.e. in the 1's output
> >>>>>>> queue or 0's input queue?
> >>>>>> 
> >>>>>> Yeah, if you bump it up to 'debug ms = 20'.  Be aware that that will 
> >>>>>> generate a lot of logging, though.
> >>>>> 
> >>>>> I really don't want to load the system with too much logging, but I'm happy
> >>>>> modifying code...  Are there specific interesting debug outputs which I can
> >>>>> modify so they're output under "ms = 1"?
> >>>> 
> >>>> I'm basically interested in everything in writer() and write_message(), 
> >>>> and reader() and read_message()...
> >>> 
> >>> Like this?
> >> 
> >> Yeah.  You could do 2 instead of 1 so you can turn it down.  I suspect 
> >> that this is the lion's share of what debug 20 will spam to the log, but 
> >> hopefully the load is manageable!
> > 
> > Good idea on the '2'. I'll get that installed and wait for it to happen again.
> 
> FYI...
> 
> To avoid running out of disk space for the massive logs, I
> started using logrotate on the ceph logs every two hours, which
> does a 'service ceph reload' to re-open the log files.
> 
> In the week since doing that I haven't seen any 'slow requests'
> at all (the load has stayed the same as before the change),
> which means the issue with the osds dropping out, then the
> system not recovering properly, also hasn't happened.
> 
> That's a bit suspicious, no?

I suspect the logging itself is changing the timing.  Let's wait and see 
if we get lucky... 

sage

> 
> I've now put the log dirs on each machine on their own 2TB
> partition and reverted back to the default daily rotates.
> 
> And once more we're waiting... Godot, is that you?
> 
> 
> Chris
> 
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Mon losing touch with OSDs
  2013-03-01  5:00                                     ` Sage Weil
@ 2013-03-08  3:12                                       ` Chris Dunlop
  2013-03-08 22:47                                         ` Chris Dunlop
  0 siblings, 1 reply; 25+ messages in thread
From: Chris Dunlop @ 2013-03-08  3:12 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On Thu, Feb 28, 2013 at 09:00:24PM -0800, Sage Weil wrote:
> On Fri, 1 Mar 2013, Chris Dunlop wrote:
>> On Sat, Feb 23, 2013 at 01:02:53PM +1100, Chris Dunlop wrote:
>>> On Fri, Feb 22, 2013 at 05:52:11PM -0800, Sage Weil wrote:
>>>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>>>>> On Fri, Feb 22, 2013 at 05:30:04PM -0800, Sage Weil wrote:
>>>>>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>>>>>>> On Fri, Feb 22, 2013 at 04:13:21PM -0800, Sage Weil wrote:
>>>>>>>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>>>>>>>>> On Fri, Feb 22, 2013 at 03:43:22PM -0800, Sage Weil wrote:
>>>>>>>>>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>>>>>>>>>>> On Fri, Feb 22, 2013 at 01:57:32PM -0800, Sage Weil wrote:
>>>>>>>>>>>> I just looked at the logs.  I can't tell what happened to cause that 10 
>>>>>>>>>>>> second delay.. strangely, messages were passing from 0 -> 1, but nothing 
>>>>>>>>>>>> came back from 1 -> 0 (although 1 was queuing, if not sending, them).
>>>>>>>>> 
>>>>>>>>> Is there any way of telling where they were delayed, i.e. in the 1's output
>>>>>>>>> queue or 0's input queue?
>>>>>>>> 
>>>>>>>> Yeah, if you bump it up to 'debug ms = 20'.  Be aware that that will 
>>>>>>>> generate a lot of logging, though.
>>>>>>> 
>>>>>>> I really don't want to load the system with too much logging, but I'm happy
>>>>>>> modifying code...  Are there specific interesting debug outputs which I can
>>>>>>> modify so they're output under "ms = 1"?
>>>>>> 
>>>>>> I'm basically interested in everything in writer() and write_message(), 
>>>>>> and reader() and read_message()...
>>>>> 
>>>>> Like this?
>>>> 
>>>> Yeah.  You could do 2 instead of 1 so you can turn it down.  I suspect 
>>>> that this is the lion's share of what debug 20 will spam to the log, but 
>>>> hopefully the load is manageable!
>>> 
>>> Good idea on the '2'. I'll get that installed and wait for it to happen again.
>> 
>> FYI...
>> 
>> To avoid running out of disk space for the massive logs, I
>> started using logrotate on the ceph logs every two hours, which
>> does a 'service ceph reload' to re-open the log files.
>> 
>> In the week since doing that I haven't seen any 'slow requests'
>> at all (the load has stayed the same as before the change),
>> which means the issue with the osds dropping out, then the
>> system not recovering properly, also hasn't happened.
>> 
>> That's a bit suspicious, no?
> 
> I suspect the logging itself is changing the timing.  Let's wait and see 
> if we get lucky... 

We got "lucky"...

ceph-mon.0.log:
2013-03-08 03:46:44.786682 7fcc62172700  1 -- 192.168.254.132:0/20298 --> 192.168.254.133:6801/23939 -- osd_ping(ping e815 stamp 2013-03-08 03:46:44.786679) v2 -- ?+0 0x765b180 con 0x6ab6160
  [no ping_reply logged, then later...]
2013-03-08 03:46:56.211993 7fcc71190700 -1 osd.0 815 heartbeat_check: no reply from osd.1 since 2013-03-08 03:46:35.986327 (cutoff 2013-03-08 03:46:36.211992)

ceph-mon.1.log:
2013-03-08 03:46:44.786848 7fe6f47a4700  1 -- 192.168.254.133:6801/23939 <== osd.0 192.168.254.132:0/20298 178549 ==== osd_ping(ping e815 stamp 2013-03-08 03:46:44.786679) v2 ==== 47+0+0 (1298645350 0 0) 0x98256c0 con 0x7bd2160
2013-03-08 03:46:44.786880 7fe6f47a4700  1 -- 192.168.254.133:6801/23939 --> 192.168.254.132:0/20298 -- osd_ping(ping_reply e815 stamp 2013-03-08 03:46:44.786679) v2 -- ?+0 0x29876c0 con 0x7bd2160

Interestingly, the matching ping_reply from osd.1 never appears in the
osd.0 log, in contrast to the previous incident upthread where the
"missing" ping replies were all seen in a rush (but after osd.1 had been
marked down).
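
For what it's worth, pairing each sent ping with a reply carrying the same stamp makes gaps like this easy to spot mechanically; a rough sketch (the regexes are guesses based on the excerpts above, and would need adjusting to the real log format):

```python
import re

# Rough sketch: collect the stamps of outgoing osd_ping messages and the
# stamps echoed back in ping_reply messages, then report pings that were
# never answered.  Regexes are assumptions based on the log excerpts.
PING  = re.compile(r'osd_ping\(ping e\d+ stamp ([0-9: .\-]+)\)')
REPLY = re.compile(r'osd_ping\(ping_reply e\d+ stamp ([0-9: .\-]+)\)')

def unanswered(log_lines):
    sent, answered = [], set()
    for line in log_lines:
        m = REPLY.search(line)
        if m:
            answered.add(m.group(1))
            continue
        m = PING.search(line)
        if m and '-->' in line:      # outgoing ping from this daemon
            sent.append(m.group(1))
    return [s for s in sent if s not in answered]

demo = [
    "... --> 192.168.254.133:6801/23939 -- osd_ping(ping e815 stamp 2013-03-08 03:46:44.786679) v2 ...",
    "... --> 192.168.254.133:6801/23939 -- osd_ping(ping e815 stamp 2013-03-08 03:46:40.000000) v2 ...",
    "... <== osd.1 ... osd_ping(ping_reply e815 stamp 2013-03-08 03:46:40.000000) v2 ...",
]
print(unanswered(demo))   # prints ['2013-03-08 03:46:44.786679']
```

Run over the osd.0 log for the window above, it should flag the 03:46:44.786679 stamp as unanswered.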

The missing ping_reply caused osd.1 to be marked down; it then marked
itself up again a bit later ("map e818 wrongly marked me down"). However,
the system still hadn't recovered by 07:46:29, when a 'service ceph
restart' was done on the machine holding mon.b5 and osd.1, bringing things
back to life.

Before the restart:

# ceph -s
   health HEALTH_WARN 273 pgs peering; 2 pgs recovery_wait; 273 pgs stuck inactive; 576 pgs stuck unclean; recovery 43/293224 degraded (0.015%)
   monmap e9: 3 mons at {b2=10.200.63.130:6789/0,b4=10.200.63.132:6789/0,b5=10.200.63.133:6789/0}, election epoch 898, quorum 0,1,2 b2,b4,b5
   osdmap e825: 2 osds: 2 up, 2 in
    pgmap v3545580: 576 pgs: 301 active, 2 active+recovery_wait, 273 peering; 560 GB data, 1348 GB used, 2375 GB / 3724 GB avail; 43/293224 degraded (0.015%)
   mdsmap e1: 0/0/1 up

After the restart:

# ceph -s
   health HEALTH_WARN 19 pgs recovering; 24 pgs recovery_wait; 43 pgs stuck unclean; recovery 66/293226 degraded (0.023%)
   monmap e9: 3 mons at {b2=10.200.63.130:6789/0,b4=10.200.63.132:6789/0,b5=10.200.63.133:6789/0}, election epoch 902, quorum 0,1,2 b2,b4,b5
   osdmap e828: 2 osds: 2 up, 2 in
    pgmap v3545603: 576 pgs: 533 active+clean, 24 active+recovery_wait, 19 active+recovering; 560 GB data, 1348 GB used, 2375 GB / 3724 GB avail; 0B/s rd, 8135KB/s wr, 224op/s; 66/293226 degraded (0.023%)
   mdsmap e1: 0/0/1 up
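For comparing the two snapshots above, the pg-state breakdown can be pulled out of the pgmap line mechanically. A quick sketch (the helper is my own; it assumes the pgmap line format as printed by 0.56.x, as in the outputs above):

```python
import re

def pg_states(pgmap_line):
    """Parse the 'N state, M state, ...' section of a `ceph -s` pgmap line."""
    section = pgmap_line.split('pgs:', 1)[1].split(';', 1)[0]
    return {state: int(n) for n, state in re.findall(r'(\d+) ([a-z_+]+)', section)}

before = pg_states("pgmap v3545580: 576 pgs: 301 active, 2 active+recovery_wait, 273 peering; 560 GB data")
after  = pg_states("pgmap v3545603: 576 pgs: 533 active+clean, 24 active+recovery_wait, 19 active+recovering; 560 GB data")

# Show which states changed between the two snapshots
for state in sorted(set(before) | set(after)):
    print(state, before.get(state, 0), '->', after.get(state, 0))
```

Both snapshots account for all 576 pgs; the restart moved the 273 peering pgs into active+clean/recovery states.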

Logs covering 00:00 to 09:00 are in:

https://www.dropbox.com/sh/7nq7gr2u2deorcu/Nvw3FFGiy2

Cheers,

Chris


* Re: Mon losing touch with OSDs
  2013-03-08  3:12                                       ` Chris Dunlop
@ 2013-03-08 22:47                                         ` Chris Dunlop
  0 siblings, 0 replies; 25+ messages in thread
From: Chris Dunlop @ 2013-03-08 22:47 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On Fri, Mar 08, 2013 at 02:12:40PM +1100, Chris Dunlop wrote:
> On Thu, Feb 28, 2013 at 09:00:24PM -0800, Sage Weil wrote:
>> On Fri, 1 Mar 2013, Chris Dunlop wrote:
>>> On Sat, Feb 23, 2013 at 01:02:53PM +1100, Chris Dunlop wrote:
>>>> On Fri, Feb 22, 2013 at 05:52:11PM -0800, Sage Weil wrote:
>>>>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>>>>>> On Fri, Feb 22, 2013 at 05:30:04PM -0800, Sage Weil wrote:
>>>>>>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>>>>>>>> On Fri, Feb 22, 2013 at 04:13:21PM -0800, Sage Weil wrote:
>>>>>>>>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>>>>>>>>>> On Fri, Feb 22, 2013 at 03:43:22PM -0800, Sage Weil wrote:
>>>>>>>>>>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>>>>>>>>>>>> On Fri, Feb 22, 2013 at 01:57:32PM -0800, Sage Weil wrote:
>>>>>>>>>>>>> I just looked at the logs.  I can't tell what happend to cause that 10 
>>>>>>>>>>>>> second delay.. strangely, messages were passing from 0 -> 1, but nothing 
>>>>>>>>>>>>> came back from 1 -> 0 (although 1 was queuing, if not sending, them).
>>>>>>>>>> 
>>>>>>>>>> Is there any way of telling where they were delayed, i.e. in the 1's output
>>>>>>>>>> queue or 0's input queue?
>>>>>>>>> 
>>>>>>>>> Yeah, if you bump it up to 'debug ms = 20'.  Be aware that that will 
>>>>>>>>> generate a lot of logging, though.
>>>>>>>> 
>>>>>>>> I really don't want to load the system with too much logging, but I'm happy
>>>>>>>> modifying code...  Are there specific interesting debug outputs which I can
>>>>>>>> modify so they're output under "ms = 1"?
>>>>>>> 
>>>>>>> I'm basically interested in everything in writer() and write_message(), 
>>>>>>> and reader() and read_message()...
>>>>>> 
>>>>>> Like this?
>>>>> 
>>>>> Yeah.  You could do 2 instead of 1 so you can turn it down.  I suspect 
>>>>> that this is the lions share of what debug 20 will spam to the log, but 
>>>>> hopefully the load is manageable!
>>>> 
>>>> Good idea on the '2'. I'll get that installed and wait for it to happen again.
>>> 
>>> FYI...
>>> 
>>> To avoid running out of disk space for the massive logs, I
>>> started using logrotate on the ceph logs every two hours, which
>>> does a 'service ceph reload' to re-open the log files.
>>> 
>>> In the week since doing that I haven't seen any 'slow requests'
>>> at all (the load has stayed the same as before the change),
>>> which means the issue with the osds dropping out, then the
>>> system not recovering properly, also hasn't happened.
>>> 
>>> That's a bit suspicious, no?
>> 
>> I suspect the logging itself is changing the timing.  Let's wait and see 
>> if we get lucky... 
> 
> We got "lucky"...
> 
> ceph-mon.0.log:
> 2013-03-08 03:46:44.786682 7fcc62172700  1 -- 192.168.254.132:0/20298 --> 192.168.254.133:6801/23939 -- osd_ping(ping e815 stamp 2013-03-08 03:46:44.786679) v2 -- ?+0 0x765b180 con 0x6ab6160
>   [no ping_reply logged, then later...]
> 2013-03-08 03:46:56.211993 7fcc71190700 -1 osd.0 815 heartbeat_check: no reply from osd.1 since 2013-03-08 03:46:35.986327 (cutoff 2013-03-08 03:46:36.211992)

Bugger. I just realised that the cluster had come up without
the "ms = 2" logging enabled, making these logs no more useful
than the previous ones for working out who dropped the missing
ping_reply.
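For reference, the timing in the quoted heartbeat_check line is consistent with the default grace period. A quick sanity check (assuming a 20 s osd heartbeat grace, which is my assumption, not stated in the thread; the logged cutoff is simply "now minus grace"):

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S.%f"
now        = datetime.strptime("2013-03-08 03:46:56.211993", FMT)  # heartbeat_check time
last_reply = datetime.strptime("2013-03-08 03:46:35.986327", FMT)  # last reply from osd.1

# Assumed 20 s grace; this lands within a microsecond of the logged
# cutoff (2013-03-08 03:46:36.211992), the difference being when each
# timestamp was sampled.
cutoff = now - timedelta(seconds=20)

print(last_reply < cutoff)  # last reply is older than the cutoff -> marked down
```

So osd.1's last reply missed the cutoff by roughly 0.23 seconds.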

Injected "ms = 2" into the osds and mons, and added it to the
config files.

Sigh. And we're waiting again...

Chris.



Thread overview: 25+ messages
2013-02-15  3:29 Mon losing touch with OSDs Chris Dunlop
2013-02-15  4:57 ` Sage Weil
2013-02-15 22:05   ` Chris Dunlop
2013-02-17 23:41     ` Chris Dunlop
2013-02-18  1:44       ` Sage Weil
2013-02-19  3:02         ` Chris Dunlop
2013-02-20  2:07           ` Chris Dunlop
2013-02-22  3:06             ` Chris Dunlop
2013-02-22 21:57               ` Sage Weil
2013-02-22 23:35                 ` Chris Dunlop
2013-02-22 23:43                   ` Sage Weil
2013-02-23  0:08                     ` Chris Dunlop
2013-02-23  0:13                       ` Sage Weil
2013-02-23  0:25                         ` Sage Weil
2013-02-23  0:50                           ` Chris Dunlop
2013-02-23  1:10                             ` Chris Dunlop
2013-02-23  0:57                         ` Chris Dunlop
2013-02-23  1:30                           ` Sage Weil
2013-02-23  1:49                             ` Chris Dunlop
2013-02-23  1:52                               ` Sage Weil
2013-02-23  2:02                                 ` Chris Dunlop
2013-03-01  2:02                                   ` Chris Dunlop
2013-03-01  5:00                                     ` Sage Weil
2013-03-08  3:12                                       ` Chris Dunlop
2013-03-08 22:47                                         ` Chris Dunlop
