* Client reconnect failing: reader gets bad tag
@ 2011-05-04 22:27 Jim Schutt
  2011-05-05 19:19 ` Sage Weil
  0 siblings, 1 reply; 17+ messages in thread
From: Jim Schutt @ 2011-05-04 22:27 UTC (permalink / raw)
  To: ceph-devel

Hi,

I'm seeing clients having trouble reconnecting after timed-out
requests.  When they get in this state, sometimes they manage
to reconnect after several attempts; sometimes they never seem
to be able to reconnect.

Here's an example from a client's dmesg:

[ 2423.312190] libceph:  tid 8536 timed out on osd34, will reset osd
[ 2452.449206] libceph: osd34 172.17.40.30:6806 connection failed

Here's the corresponding log of the attempt on the OSD, with
debug ms = 20:

2011-05-04 16:00:59.710605 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa000 sd=91 pgs=0 cs=0 l=0).accept peer addr is really 172.17.40.49:0/302440129 (socket is 172.17.40.49:42695/0)
2011-05-04 16:00:59.710630 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa000 sd=91 pgs=0 cs=0 l=0).accept got peer connect_seq 0 global_seq 430
2011-05-04 16:00:59.710646 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa000 sd=91 pgs=0 cs=0 l=1).accept of host_type 8, policy.lossy=1
2011-05-04 16:00:59.710667 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa000 sd=91 pgs=0 cs=0 l=1).accept my proto 24, their proto 24
2011-05-04 16:00:59.710703 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa000 sd=91 pgs=0 cs=0 l=1).accept new session
2011-05-04 16:00:59.710712 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa000 sd=91 pgs=430 cs=1 l=1).accept success, connect_seq = 1, sending READY
2011-05-04 16:00:59.710721 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa000 sd=91 pgs=430 cs=1 l=1).accept features 138
2011-05-04 16:00:59.710730 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa000 sd=91 pgs=430 cs=1 l=1).register_pipe
2011-05-04 16:00:59.710764 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa000 sd=91 pgs=430 cs=1 l=1).accept starting writer, state=2
2011-05-04 16:00:59.710813 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa000 sd=91 pgs=430 cs=1 l=1).accept done
2011-05-04 16:00:59.710832 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa000 sd=91 pgs=430 cs=1 l=1).reader reading tag...
2011-05-04 16:00:59.710870 7f15d5837940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa000 sd=91 pgs=430 cs=1 l=1).writer: state = 2 policy.server=1
2011-05-04 16:00:59.710917 7f15d5837940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa000 sd=91 pgs=430 cs=1 l=1).writer sleeping
2011-05-04 16:00:59.710971 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa000 sd=91 pgs=430 cs=1 l=1).reader bad tag 0
2011-05-04 16:00:59.711013 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa000 sd=91 pgs=430 cs=1 l=1).fault 0: Success
2011-05-04 16:00:59.711075 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa000 sd=91 pgs=430 cs=1 l=1).fault on lossy channel, failing
2011-05-04 16:00:59.711089 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa000 sd=91 pgs=430 cs=1 l=1).fail
2011-05-04 16:00:59.711102 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa000 sd=91 pgs=430 cs=1 l=1).stop
2011-05-04 16:00:59.711116 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa000 sd=91 pgs=430 cs=1 l=1).discard_queue
2011-05-04 16:00:59.711136 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa000 sd=91 pgs=430 cs=1 l=1). dequeued pipe
2011-05-04 16:00:59.711184 7f15d5837940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa000 sd=91 pgs=430 cs=1 l=1).writer finishing
2011-05-04 16:00:59.711238 7f15d5837940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa000 sd=91 pgs=430 cs=1 l=1).writer done
2011-05-04 16:00:59.711397 7f15f9dbe940 -- 172.17.40.30:6806/12583 reaper reaping pipe 0x213fa000 172.17.40.49:0/302440129
2011-05-04 16:00:59.711422 7f15f9dbe940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa000 sd=91 pgs=430 cs=1 l=1).discard_queue
2011-05-04 16:00:59.711443 7f15f9dbe940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa000 sd=91 pgs=430 cs=1 l=1). dequeued pipe
2011-05-04 16:00:59.711465 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa000 sd=91 pgs=430 cs=1 l=1).reader done
2011-05-04 16:00:59.711488 7f15f9dbe940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa000 sd=91 pgs=430 cs=1 l=1).unregister_pipe
2011-05-04 16:00:59.711639 7f15f9dbe940 -- 172.17.40.30:6806/12583 reaper reaped pipe 0x213fa000 172.17.40.49:0/302440129

A little later, a connect from that client succeeded:

2011-05-04 16:18:07.816016 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x2771c80 sd=91 pgs=0 cs=0 l=0).accept peer addr is really 172.17.40.49:0/302440129 (socket is 172.17.40.49:53237/0)
2011-05-04 16:18:07.816032 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x2771c80 sd=91 pgs=0 cs=0 l=0).accept got peer connect_seq 0 global_seq 435
2011-05-04 16:18:07.816041 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x2771c80 sd=91 pgs=0 cs=0 l=1).accept of host_type 8, policy.lossy=1
2011-05-04 16:18:07.816050 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x2771c80 sd=91 pgs=0 cs=0 l=1).accept my proto 24, their proto 24
2011-05-04 16:18:07.816067 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x2771c80 sd=91 pgs=0 cs=0 l=1).accept new session
2011-05-04 16:18:07.816076 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x2771c80 sd=91 pgs=435 cs=1 l=1).accept success, connect_seq = 1, sending READY
2011-05-04 16:18:07.816085 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x2771c80 sd=91 pgs=435 cs=1 l=1).accept features 138
2011-05-04 16:18:07.816093 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x2771c80 sd=91 pgs=435 cs=1 l=1).register_pipe
2011-05-04 16:18:07.816116 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x2771c80 sd=91 pgs=435 cs=1 l=1).accept starting writer, state=2
2011-05-04 16:18:07.816147 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x2771c80 sd=91 pgs=435 cs=1 l=1).accept done
2011-05-04 16:18:07.816164 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x2771c80 sd=91 pgs=435 cs=1 l=1).reader reading tag...
2011-05-04 16:18:07.816197 7f15d5837940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x2771c80 sd=91 pgs=435 cs=1 l=1).writer: state = 2 policy.server=1
2011-05-04 16:18:07.816289 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x2771c80 sd=91 pgs=435 cs=1 l=1).reader got MSG
2011-05-04 16:18:07.816343 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x2771c80 sd=91 pgs=435 cs=1 l=1).reader got envelope type=42 src client4153 front=128 data=524288 off 0
2011-05-04 16:18:07.816407 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x2771c80 sd=91 pgs=435 cs=1 l=1).reader wants 524416 from policy throttler 0/15000000
2011-05-04 16:18:07.816422 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x2771c80 sd=91 pgs=435 cs=1 l=1).reader wants 524416 from dispatch throttler 0/28000000
2011-05-04 16:18:07.816500 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x2771c80 sd=91 pgs=435 cs=1 l=1).reader got front 128
2011-05-04 16:18:07.816533 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x2771c80 sd=91 pgs=435 cs=1 l=1).reader allocating new rx buffer at offset 0
2011-05-04 16:18:07.816547 7f15d5837940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x2771c80 sd=91 pgs=435 cs=1 l=1).writer sleeping
2011-05-04 16:18:07.816574 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x2771c80 sd=91 pgs=435 cs=1 l=1).reader reading nonblocking into 0x2fcd8000 len 524288
2011-05-04 16:18:07.816637 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x2771c80 sd=91 pgs=435 cs=1 l=1).reader reading nonblocking into 0x2fcde000 len 499712
2011-05-04 16:18:07.816675 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x2771c80 sd=91 pgs=435 cs=1 l=1).reader reading nonblocking into 0x2fcdf7c8 len 493624
2011-05-04 16:18:07.816696 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x2771c80 sd=91 pgs=435 cs=1 l=1).reader reading nonblocking into 0x2fcdff90 len 491632
2011-05-04 16:18:07.816737 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x2771c80 sd=91 pgs=435 cs=1 l=1).reader reading nonblocking into 0x2fce0000 len 491520

This is with current server-side master branch (d417fb0bad2a),
and 2.6.39-rc6 + current for-linus (fca65b4ad72d) + current
master (0ee5623f9a6e) on the client side.

Let me know if there is some client-side debugging I can enable
to help sort this out.

Thanks -- Jim




* Re: Client reconnect failing: reader gets bad tag
  2011-05-04 22:27 Client reconnect failing: reader gets bad tag Jim Schutt
@ 2011-05-05 19:19 ` Sage Weil
  2011-05-05 20:23   ` Jim Schutt
                     ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: Sage Weil @ 2011-05-05 19:19 UTC (permalink / raw)
  To: Jim Schutt; +Cc: ceph-devel

On Wed, 4 May 2011, Jim Schutt wrote:
> Hi,
> 
> I'm seeing clients having trouble reconnecting after timed-out
> requests.  When they get in this state, sometimes they manage
> to reconnect after several attempts; sometimes they never seem
> to be able to reconnect.

Hmm, the interesting line is

> 2011-05-04 16:00:59.710971 7f15d6948940 -- 172.17.40.30:6806/12583 >>
> 172.17.40.49:0/302440129 pipe(0x213fa000 sd=91 pgs=430 cs=1 l=1).reader bad
> tag 0

That _should_ mean the server side (osd) closes out the connection 
immediately, which should generate a disconnect error on the client and an 
immediate reconnect.  So it's strange that you're also seeing timeouts.
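
For context, here is a minimal user-space sketch (not the OSD's actual
reader code) of the dispatch on the one-byte tag that precedes each frame.
The tag names and numeric values below are illustrative placeholders, not
the real wire-protocol constants; the point is just that an unrecognized
byte is reported as a bad tag, and that on a lossy (client) connection the
pipe is then simply failed and reaped, as in the log above:

/* Sketch only: models the reader's tag dispatch, not actual Ceph code. */
#include <stdio.h>

enum tag { TAG_READY = 1, TAG_CLOSE = 6, TAG_MSG = 7, TAG_ACK = 8 };

static void reader_dispatch(int tag, int lossy)
{
	switch (tag) {
	case TAG_MSG:
		printf("reader got MSG\n");	/* then header, front, data */
		break;
	case TAG_ACK:
		printf("reader got ACK\n");
		break;
	case TAG_CLOSE:
		printf("reader got CLOSE\n");
		break;
	default:
		printf("reader bad tag %d\n", tag);
		if (lossy)
			printf("fault on lossy channel, failing\n");
		/* pipe is stopped, its queue discarded, and it is reaped */
		break;
	}
}

int main(void)
{
	reader_dispatch(0, 1);	/* a stray 0 byte where a tag is expected */
	return 0;
}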

Of course, we shouldn't be getting bad tags in the first place, so something 
else is clearly wrong and may be contributing to both problems.

How easy is this to reproduce?  It's right after a fresh connection, so 
the number of possibly offending code paths is pretty small, at least!

There is client side debugging to turn on, but it's very chatty.  Maybe 
you can just enable a few key lines, like the connect handshake ones, and 
any point where we queue/send a tag.  It's a bit tedious to enable 
the individual dout lines in messenger.c, sadly, but unless you have a 
very fast netconsole or something that's probably the only way to go...
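
As a purely hypothetical illustration (not a line from any patch in this
thread), the kind of one-off debug line being suggested might look like
the following, dropped in wherever the messenger queues the tag/header
for an outgoing message.  con, con->out_seq and dout() are the existing
names in net/ceph/messenger.c, but the placement and wording here are
assumptions:

	/* hypothetical one-off debug line, for illustration only */
	dout("queued MSG tag on con %p seq %lld\n", con, con->out_seq);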

sage


* Re: Client reconnect failing: reader gets bad tag
  2011-05-05 19:19 ` Sage Weil
@ 2011-05-05 20:23   ` Jim Schutt
  2011-05-06 21:56   ` Jim Schutt
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 17+ messages in thread
From: Jim Schutt @ 2011-05-05 20:23 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Sage Weil wrote:
> On Wed, 4 May 2011, Jim Schutt wrote:
>> Hi,
>>
>> I'm seeing clients having trouble reconnecting after timed-out
>> requests.  When they get in this state, sometimes they manage
>> to reconnect after several attempts; sometimes they never seem
>> to be able to reconnect.
> 
> Hmm, the interesting line is
> 
>> 2011-05-04 16:00:59.710971 7f15d6948940 -- 172.17.40.30:6806/12583 >>
>> 172.17.40.49:0/302440129 pipe(0x213fa000 sd=91 pgs=430 cs=1 l=1).reader bad
>> tag 0
> 
> That _should_ mean the server side (osd) closes out the connection 
> immediately, which should generate a disconnect error on the client and an 
> immediate reconnect.  So it's strange that you're also seeing timeouts.

Here's the server-side log for the connection
attempts, i.e. everything in the logs for that client IP
from the time I noticed the timeouts and reconnect failures:

# tail -f osd.34.log | grep 172\.17\.40\.49
2011-05-04 16:00:59.710605 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa000 sd=91 pgs=0 cs=0 l=0).accept peer addr is really 172.17.40.49:0/302440129 (socket is 172.17.40.49:42695/0)
2011-05-04 16:00:59.710630 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa000 sd=91 pgs=0 cs=0 l=0).accept got peer connect_seq 0 global_seq 430
2011-05-04 16:00:59.710646 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa000 sd=91 pgs=0 cs=0 l=1).accept of host_type 8, policy.lossy=1
2011-05-04 16:00:59.710667 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa000 sd=91 pgs=0 cs=0 l=1).accept my proto 24, their proto 24
2011-05-04 16:00:59.710703 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa000 sd=91 pgs=0 cs=0 l=1).accept new session
2011-05-04 16:00:59.710712 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa000 sd=91 pgs=430 cs=1 l=1).accept success, connect_seq = 1, sending READY
2011-05-04 16:00:59.710721 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa000 sd=91 pgs=430 cs=1 l=1).accept features 138
2011-05-04 16:00:59.710730 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa000 sd=91 pgs=430 cs=1 l=1).register_pipe
2011-05-04 16:00:59.710764 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa000 sd=91 pgs=430 cs=1 l=1).accept starting writer, state=2
2011-05-04 16:00:59.710813 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa000 sd=91 pgs=430 cs=1 l=1).accept done
2011-05-04 16:00:59.710832 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa000 sd=91 pgs=430 cs=1 l=1).reader reading tag...
2011-05-04 16:00:59.710870 7f15d5837940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa000 sd=91 pgs=430 cs=1 l=1).writer: state = 2 policy.server=1
2011-05-04 16:00:59.710917 7f15d5837940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa000 sd=91 pgs=430 cs=1 l=1).writer sleeping
2011-05-04 16:00:59.710971 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa000 sd=91 pgs=430 cs=1 l=1).reader bad tag 0
2011-05-04 16:00:59.711013 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa000 sd=91 pgs=430 cs=1 l=1).fault 0: Success
2011-05-04 16:00:59.711075 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa000 sd=91 pgs=430 cs=1 l=1).fault on lossy channel, failing
2011-05-04 16:00:59.711089 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa000 sd=91 pgs=430 cs=1 l=1).fail
2011-05-04 16:00:59.711102 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa000 sd=91 pgs=430 cs=1 l=1).stop
2011-05-04 16:00:59.711116 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa000 sd=91 pgs=430 cs=1 l=1).discard_queue
2011-05-04 16:00:59.711136 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa000 sd=91 pgs=430 cs=1 l=1). dequeued pipe
2011-05-04 16:00:59.711184 7f15d5837940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa000 sd=91 pgs=430 cs=1 l=1).writer finishing
2011-05-04 16:00:59.711238 7f15d5837940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa000 sd=91 pgs=430 cs=1 l=1).writer done
2011-05-04 16:00:59.711397 7f15f9dbe940 -- 172.17.40.30:6806/12583 reaper reaping pipe 0x213fa000 172.17.40.49:0/302440129
2011-05-04 16:00:59.711422 7f15f9dbe940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa000 sd=91 pgs=430 cs=1 l=1).discard_queue
2011-05-04 16:00:59.711443 7f15f9dbe940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa000 sd=91 pgs=430 cs=1 l=1). dequeued pipe
2011-05-04 16:00:59.711465 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa000 sd=91 pgs=430 cs=1 l=1).reader done
2011-05-04 16:00:59.711488 7f15f9dbe940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa000 sd=91 pgs=430 cs=1 l=1).unregister_pipe
2011-05-04 16:00:59.711639 7f15f9dbe940 -- 172.17.40.30:6806/12583 reaper reaped pipe 0x213fa000 172.17.40.49:0/302440129
2011-05-04 16:09:33.761370 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fac80 sd=91 pgs=0 cs=0 l=0).accept peer addr is really 172.17.40.49:0/302440129 (socket is 172.17.40.49:37202/0)
2011-05-04 16:09:33.761408 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fac80 sd=91 pgs=0 cs=0 l=0).accept got peer connect_seq 0 global_seq 432
2011-05-04 16:09:33.761421 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fac80 sd=91 pgs=0 cs=0 l=1).accept of host_type 8, policy.lossy=1
2011-05-04 16:09:33.761430 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fac80 sd=91 pgs=0 cs=0 l=1).accept my proto 24, their proto 24
2011-05-04 16:09:33.761465 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fac80 sd=91 pgs=0 cs=0 l=1).accept new session
2011-05-04 16:09:33.761475 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fac80 sd=91 pgs=432 cs=1 l=1).accept success, connect_seq = 1, sending READY
2011-05-04 16:09:33.761484 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fac80 sd=91 pgs=432 cs=1 l=1).accept features 138
2011-05-04 16:09:33.761493 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fac80 sd=91 pgs=432 cs=1 l=1).register_pipe
2011-05-04 16:09:33.761518 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fac80 sd=91 pgs=432 cs=1 l=1).accept starting writer, state=2
2011-05-04 16:09:33.761572 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fac80 sd=91 pgs=432 cs=1 l=1).accept done
2011-05-04 16:09:33.761586 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fac80 sd=91 pgs=432 cs=1 l=1).reader reading tag...
2011-05-04 16:09:33.761607 7f15d5837940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fac80 sd=91 pgs=432 cs=1 l=1).writer: state = 2 policy.server=1
2011-05-04 16:09:33.761650 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fac80 sd=91 pgs=432 cs=1 l=1).reader bad tag 0
2011-05-04 16:09:33.761675 7f15d5837940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fac80 sd=91 pgs=432 cs=1 l=1).writer sleeping
2011-05-04 16:09:33.761730 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fac80 sd=91 pgs=432 cs=1 l=1).fault 0: Success
2011-05-04 16:09:33.761824 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fac80 sd=91 pgs=432 cs=1 l=1).fault on lossy channel, failing
2011-05-04 16:09:33.761839 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fac80 sd=91 pgs=432 cs=1 l=1).fail
2011-05-04 16:09:33.761852 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fac80 sd=91 pgs=432 cs=1 l=1).stop
2011-05-04 16:09:33.761866 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fac80 sd=91 pgs=432 cs=1 l=1).discard_queue
2011-05-04 16:09:33.761886 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fac80 sd=91 pgs=432 cs=1 l=1). dequeued pipe
2011-05-04 16:09:33.761944 7f15d5837940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fac80 sd=91 pgs=432 cs=1 l=1).writer finishing
2011-05-04 16:09:33.761997 7f15d5837940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fac80 sd=91 pgs=432 cs=1 l=1).writer done
2011-05-04 16:09:33.762133 7f15f9dbe940 -- 172.17.40.30:6806/12583 reaper reaping pipe 0x213fac80 172.17.40.49:0/302440129
2011-05-04 16:09:33.762154 7f15f9dbe940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fac80 sd=91 pgs=432 cs=1 l=1).discard_queue
2011-05-04 16:09:33.762173 7f15f9dbe940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fac80 sd=91 pgs=432 cs=1 l=1). dequeued pipe
2011-05-04 16:09:33.762197 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fac80 sd=91 pgs=432 cs=1 l=1).reader done
2011-05-04 16:09:33.762234 7f15f9dbe940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fac80 sd=91 pgs=432 cs=1 l=1).unregister_pipe
2011-05-04 16:09:33.762305 7f15f9dbe940 -- 172.17.40.30:6806/12583 reaper reaped pipe 0x213fac80 172.17.40.49:0/302440129
2011-05-04 16:18:07.809562 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa280 sd=91 pgs=0 cs=0 l=0).accept peer addr is really 172.17.40.49:0/302440129 (socket is 172.17.40.49:53236/0)
2011-05-04 16:18:07.809606 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa280 sd=91 pgs=0 cs=0 l=0).accept got peer connect_seq 0 global_seq 434
2011-05-04 16:18:07.809631 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa280 sd=91 pgs=0 cs=0 l=1).accept of host_type 8, policy.lossy=1
2011-05-04 16:18:07.809640 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa280 sd=91 pgs=0 cs=0 l=1).accept my proto 24, their proto 24
2011-05-04 16:18:07.809674 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa280 sd=91 pgs=0 cs=0 l=1).accept new session
2011-05-04 16:18:07.809683 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa280 sd=91 pgs=434 cs=1 l=1).accept success, connect_seq = 1, sending READY
2011-05-04 16:18:07.809692 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa280 sd=91 pgs=434 cs=1 l=1).accept features 138
2011-05-04 16:18:07.809700 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa280 sd=91 pgs=434 cs=1 l=1).register_pipe
2011-05-04 16:18:07.809728 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa280 sd=91 pgs=434 cs=1 l=1).accept starting writer, state=2
2011-05-04 16:18:07.809782 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa280 sd=91 pgs=434 cs=1 l=1).accept done
2011-05-04 16:18:07.809796 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa280 sd=91 pgs=434 cs=1 l=1).reader reading tag...
2011-05-04 16:18:07.809816 7f15d5837940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa280 sd=91 pgs=434 cs=1 l=1).writer: state = 2 policy.server=1
2011-05-04 16:18:07.809859 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa280 sd=91 pgs=434 cs=1 l=1).reader bad tag 0
2011-05-04 16:18:07.809894 7f15d5837940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa280 sd=91 pgs=434 cs=1 l=1).writer sleeping
2011-05-04 16:18:07.809940 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa280 sd=91 pgs=434 cs=1 l=1).fault 0: Success
2011-05-04 16:18:07.810021 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa280 sd=91 pgs=434 cs=1 l=1).fault on lossy channel, failing
2011-05-04 16:18:07.810035 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa280 sd=91 pgs=434 cs=1 l=1).fail
2011-05-04 16:18:07.810048 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa280 sd=91 pgs=434 cs=1 l=1).stop
2011-05-04 16:18:07.810062 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa280 sd=91 pgs=434 cs=1 l=1).discard_queue
2011-05-04 16:18:07.810082 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa280 sd=91 pgs=434 cs=1 l=1). dequeued pipe
2011-05-04 16:18:07.810130 7f15d5837940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa280 sd=91 pgs=434 cs=1 l=1).writer finishing
2011-05-04 16:18:07.810176 7f15d5837940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa280 sd=91 pgs=434 cs=1 l=1).writer done
2011-05-04 16:18:07.810312 7f15f9dbe940 -- 172.17.40.30:6806/12583 reaper reaping pipe 0x213fa280 172.17.40.49:0/302440129
2011-05-04 16:18:07.810332 7f15f9dbe940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa280 sd=91 pgs=434 cs=1 l=1).discard_queue
2011-05-04 16:18:07.810352 7f15f9dbe940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa280 sd=91 pgs=434 cs=1 l=1). dequeued pipe
2011-05-04 16:18:07.810375 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa280 sd=91 pgs=434 cs=1 l=1).reader done
2011-05-04 16:18:07.810405 7f15f9dbe940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x213fa280 sd=91 pgs=434 cs=1 l=1).unregister_pipe
2011-05-04 16:18:07.810488 7f15f9dbe940 -- 172.17.40.30:6806/12583 reaper reaped pipe 0x213fa280 172.17.40.49:0/302440129
2011-05-04 16:18:07.816016 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x2771c80 sd=91 pgs=0 cs=0 l=0).accept peer addr is really 172.17.40.49:0/302440129 (socket is 172.17.40.49:53237/0)
2011-05-04 16:18:07.816032 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x2771c80 sd=91 pgs=0 cs=0 l=0).accept got peer connect_seq 0 global_seq 435
2011-05-04 16:18:07.816041 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x2771c80 sd=91 pgs=0 cs=0 l=1).accept of host_type 8, policy.lossy=1
2011-05-04 16:18:07.816050 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x2771c80 sd=91 pgs=0 cs=0 l=1).accept my proto 24, their proto 24
2011-05-04 16:18:07.816067 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x2771c80 sd=91 pgs=0 cs=0 l=1).accept new session
2011-05-04 16:18:07.816076 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x2771c80 sd=91 pgs=435 cs=1 l=1).accept success, connect_seq = 1, sending READY
2011-05-04 16:18:07.816085 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x2771c80 sd=91 pgs=435 cs=1 l=1).accept features 138
2011-05-04 16:18:07.816093 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x2771c80 sd=91 pgs=435 cs=1 l=1).register_pipe
2011-05-04 16:18:07.816116 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x2771c80 sd=91 pgs=435 cs=1 l=1).accept starting writer, state=2
2011-05-04 16:18:07.816147 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x2771c80 sd=91 pgs=435 cs=1 l=1).accept done
2011-05-04 16:18:07.816164 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x2771c80 sd=91 pgs=435 cs=1 l=1).reader reading tag...
2011-05-04 16:18:07.816197 7f15d5837940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x2771c80 sd=91 pgs=435 cs=1 l=1).writer: state = 2 policy.server=1
2011-05-04 16:18:07.816289 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x2771c80 sd=91 pgs=435 cs=1 l=1).reader got MSG
2011-05-04 16:18:07.816343 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x2771c80 sd=91 pgs=435 cs=1 l=1).reader got envelope type=42 src client4153 front=128 data=524288 off 0
2011-05-04 16:18:07.816407 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x2771c80 sd=91 pgs=435 cs=1 l=1).reader wants 524416 from policy throttler 0/15000000
2011-05-04 16:18:07.816422 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x2771c80 sd=91 pgs=435 cs=1 l=1).reader wants 524416 from dispatch throttler 0/28000000
2011-05-04 16:18:07.816500 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x2771c80 sd=91 pgs=435 cs=1 l=1).reader got front 128
2011-05-04 16:18:07.816533 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x2771c80 sd=91 pgs=435 cs=1 l=1).reader allocating new rx buffer at offset 0
2011-05-04 16:18:07.816547 7f15d5837940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x2771c80 sd=91 pgs=435 cs=1 l=1).writer sleeping
2011-05-04 16:18:07.816574 7f15d6948940 -- 172.17.40.30:6806/12583 >> 172.17.40.49:0/302440129 pipe(0x2771c80 sd=91 pgs=435 cs=1 l=1).reader reading nonblocking into 0x2fcd8000 len 524288

Here are the client-side logs for the same time window:

May  4 16:00:30 an361 [ 2423.312190] libceph:  tid 8536 timed out on osd34, will reset osd
May  4 16:00:59 an361 [ 2452.449206] libceph: osd34 172.17.40.30:6806 connection failed
May  4 16:00:59 an361 [ 2452.455980] libceph: osd9 172.17.40.23:6809 connection failed
May  4 16:02:00 an361 [ 2513.456232] libceph:  tid 8536 timed out on osd34, will reset osd
May  4 16:02:09 an361 [ 2521.938104] INFO: task dd:3680 blocked for more than 120 seconds.
May  4 16:02:09 an361 [ 2521.944206] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May  4 16:02:09 an361 [ 2521.952039] dd              D ffff88019fc51f40     0  3680   3677 0x00000000
May  4 16:02:09 an361 [ 2521.952046]  ffff88018f859cc8 0000000000000082 ffffffff81033836 ffff880198931690
May  4 16:02:09 an361 [ 2521.959556]  ffff88018dbead20 ffff88018dbead20 0000000000011f40 ffff880198931690
May  4 16:02:09 an361 [ 2521.967087]  ffff88018dbeb0d0 0000000000000002 ffff88018f859cf8 ffffffff813aa5a7
May  4 16:02:09 an361 [ 2521.974616] Call Trace:
May  4 16:02:09 an361 [ 2521.977078]  [<ffffffff81033836>] ? calc_load_account_idle+0xe/0x1d
May  4 16:02:09 an361 [ 2521.983358]  [<ffffffff813aa5a7>] schedule+0x159/0x193
May  4 16:02:09 an361 [ 2521.988503]  [<ffffffff813aa628>] io_schedule+0x47/0x61
May  4 16:02:09 an361 [ 2521.993738]  [<ffffffff810c5afc>] sleep_on_page+0xe/0x12
May  4 16:02:09 an361 [ 2521.999060]  [<ffffffff813aac0f>] __wait_on_bit+0x4a/0x7c
May  4 16:02:09 an361 [ 2522.004471]  [<ffffffff810c5aee>] ? lock_page+0x2d/0x2d
May  4 16:02:09 an361 [ 2522.009704]  [<ffffffff810c5d10>] wait_on_page_bit+0x74/0x7b
May  4 16:02:09 an361 [ 2522.015373]  [<ffffffff810660df>] ? autoremove_wake_function+0x2b/0x2b
May  4 16:02:09 an361 [ 2522.021905]  [<ffffffff810c5d3e>] wait_on_page_writeback+0x27/0x2b
May  4 16:02:09 an361 [ 2522.028090]  [<ffffffff810c67d8>] filemap_fdatawait_range+0x68/0x147
May  4 16:02:09 an361 [ 2522.034450]  [<ffffffff810c6988>] filemap_write_and_wait_range+0x43/0x56
May  4 16:02:09 an361 [ 2522.041155]  [<ffffffff811357b8>] vfs_fsync_range+0x35/0x76
May  4 16:02:09 an361 [ 2522.046735]  [<ffffffff81135858>] vfs_fsync+0x1c/0x1e
May  4 16:02:09 an361 [ 2522.051794]  [<ffffffff8113588d>] do_fsync+0x33/0x49
May  4 16:02:09 an361 [ 2522.056767]  [<ffffffff811358b6>] sys_fdatasync+0x13/0x17
May  4 16:02:09 an361 [ 2522.062175]  [<ffffffff813b23ab>] system_call_fastpath+0x16/0x1b
May  4 16:03:00 an361 [ 2573.552054] libceph:  tid 8536 timed out on osd34, will reset osd
May  4 16:04:00 an361 [ 2633.648092] libceph:  tid 8536 timed out on osd34, will reset osd
May  4 16:04:09 an361 [ 2642.068130] INFO: task dd:3680 blocked for more than 120 seconds.
May  4 16:04:09 an361 [ 2642.074231] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May  4 16:04:09 an361 [ 2642.082065] dd              D ffff88019fc51f40     0  3680   3677 0x00000000
May  4 16:04:09 an361 [ 2642.082073]  ffff88018f859cc8 0000000000000082 ffffffff81033836 ffff880198931690
May  4 16:04:09 an361 [ 2642.089587]  ffff88018dbead20 ffff88018dbead20 0000000000011f40 ffff880198931690
May  4 16:04:09 an361 [ 2642.097117]  ffff88018dbeb0d0 0000000000000002 ffff88018f859cf8 ffffffff813aa5a7
May  4 16:04:09 an361 [ 2642.104628] Call Trace:
May  4 16:04:09 an361 [ 2642.107091]  [<ffffffff81033836>] ? calc_load_account_idle+0xe/0x1d
May  4 16:04:09 an361 [ 2642.113368]  [<ffffffff813aa5a7>] schedule+0x159/0x193
May  4 16:04:09 an361 [ 2642.118514]  [<ffffffff813aa628>] io_schedule+0x47/0x61
May  4 16:04:09 an361 [ 2642.123748]  [<ffffffff810c5afc>] sleep_on_page+0xe/0x12
May  4 16:04:09 an361 [ 2642.129066]  [<ffffffff813aac0f>] __wait_on_bit+0x4a/0x7c
May  4 16:04:09 an361 [ 2642.134473]  [<ffffffff810c5aee>] ? lock_page+0x2d/0x2d
May  4 16:04:09 an361 [ 2642.139705]  [<ffffffff810c5d10>] wait_on_page_bit+0x74/0x7b
May  4 16:04:09 an361 [ 2642.145375]  [<ffffffff810660df>] ? autoremove_wake_function+0x2b/0x2b
May  4 16:04:09 an361 [ 2642.151904]  [<ffffffff810c5d3e>] wait_on_page_writeback+0x27/0x2b
May  4 16:04:09 an361 [ 2642.158090]  [<ffffffff810c67d8>] filemap_fdatawait_range+0x68/0x147
May  4 16:04:09 an361 [ 2642.164451]  [<ffffffff810c6988>] filemap_write_and_wait_range+0x43/0x56
May  4 16:04:09 an361 [ 2642.171158]  [<ffffffff811357b8>] vfs_fsync_range+0x35/0x76
May  4 16:04:09 an361 [ 2642.176737]  [<ffffffff81135858>] vfs_fsync+0x1c/0x1e
May  4 16:04:09 an361 [ 2642.181796]  [<ffffffff8113588d>] do_fsync+0x33/0x49
May  4 16:04:09 an361 [ 2642.186768]  [<ffffffff811358b6>] sys_fdatasync+0x13/0x17
May  4 16:04:09 an361 [ 2642.192176]  [<ffffffff813b23ab>] system_call_fastpath+0x16/0x1b
May  4 16:05:01 an361 [ 2693.744036] libceph:  tid 8536 timed out on osd34, will reset osd
May  4 16:06:01 an361 [ 2753.840079] libceph:  tid 8536 timed out on osd34, will reset osd
May  4 16:06:09 an361 [ 2762.198150] INFO: task dd:3680 blocked for more than 120 seconds.
May  4 16:06:09 an361 [ 2762.204251] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May  4 16:06:09 an361 [ 2762.212084] dd              D ffff88019fc51f40     0  3680   3677 0x00000000
May  4 16:06:09 an361 [ 2762.212090]  ffff88018f859cc8 0000000000000082 ffffffff81033836 ffff880198931690
May  4 16:06:09 an361 [ 2762.219608]  ffff88018dbead20 ffff88018dbead20 0000000000011f40 ffff880198931690
May  4 16:06:09 an361 [ 2762.227104]  ffff88018dbeb0d0 0000000000000002 ffff88018f859cf8 ffffffff813aa5a7
May  4 16:06:09 an361 [ 2762.234623] Call Trace:
May  4 16:06:09 an361 [ 2762.237082]  [<ffffffff81033836>] ? calc_load_account_idle+0xe/0x1d
May  4 16:06:09 an361 [ 2762.243354]  [<ffffffff813aa5a7>] schedule+0x159/0x193
May  4 16:06:09 an361 [ 2762.248500]  [<ffffffff813aa628>] io_schedule+0x47/0x61
May  4 16:06:09 an361 [ 2762.253734]  [<ffffffff810c5afc>] sleep_on_page+0xe/0x12
May  4 16:06:09 an361 [ 2762.259054]  [<ffffffff813aac0f>] __wait_on_bit+0x4a/0x7c
May  4 16:06:09 an361 [ 2762.264461]  [<ffffffff810c5aee>] ? lock_page+0x2d/0x2d
May  4 16:06:09 an361 [ 2762.269695]  [<ffffffff810c5d10>] wait_on_page_bit+0x74/0x7b
May  4 16:06:09 an361 [ 2762.275360]  [<ffffffff810660df>] ? autoremove_wake_function+0x2b/0x2b
May  4 16:06:09 an361 [ 2762.281894]  [<ffffffff810c5d3e>] wait_on_page_writeback+0x27/0x2b
May  4 16:06:09 an361 [ 2762.288080]  [<ffffffff810c67d8>] filemap_fdatawait_range+0x68/0x147
May  4 16:06:09 an361 [ 2762.294438]  [<ffffffff810c6988>] filemap_write_and_wait_range+0x43/0x56
May  4 16:06:09 an361 [ 2762.301143]  [<ffffffff811357b8>] vfs_fsync_range+0x35/0x76
May  4 16:06:09 an361 [ 2762.306724]  [<ffffffff81135858>] vfs_fsync+0x1c/0x1e
May  4 16:06:09 an361 [ 2762.311784]  [<ffffffff8113588d>] do_fsync+0x33/0x49
May  4 16:06:09 an361 [ 2762.316757]  [<ffffffff811358b6>] sys_fdatasync+0x13/0x17
May  4 16:06:09 an361 [ 2762.322163]  [<ffffffff813b23ab>] system_call_fastpath+0x16/0x1b
May  4 16:07:01 an361 [ 2813.936115] libceph:  tid 8536 timed out on osd34, will reset osd
May  4 16:08:01 an361 [ 2874.032097] libceph:  tid 8536 timed out on osd34, will reset osd
May  4 16:09:01 an361 [ 2934.128133] libceph:  tid 8536 timed out on osd34, will reset osd
May  4 16:09:33 an361 [ 2966.497102] libceph: osd34 172.17.40.30:6806 connection failed
May  4 16:09:33 an361 [ 2966.571124] libceph: skipping osd9 172.17.40.23:6809 seq 1 expected 2
May  4 16:10:36 an361 [ 3029.280179] libceph:  tid 8536 timed out on osd34, will reset osd
May  4 16:11:36 an361 [ 3089.392225] libceph:  tid 8536 timed out on osd34, will reset osd
May  4 16:12:36 an361 [ 3149.488096] libceph:  tid 8536 timed out on osd34, will reset osd
May  4 16:13:36 an361 [ 3209.584074] libceph:  tid 8536 timed out on osd34, will reset osd
May  4 16:14:36 an361 [ 3269.680098] libceph:  tid 8536 timed out on osd34, will reset osd
May  4 16:15:37 an361 [ 3329.776047] libceph:  tid 8536 timed out on osd34, will reset osd
May  4 16:16:37 an361 [ 3389.872029] libceph:  tid 8536 timed out on osd34, will reset osd
May  4 16:17:37 an361 [ 3449.968039] libceph:  tid 8536 timed out on osd34, will reset osd
May  4 16:18:07 an361 [ 3480.545137] libceph: osd34 172.17.40.30:6806 socket closed
May  4 16:18:12 an361 [ 3485.024036] libceph:  tid 8545 timed out on osd9, will reset osd


> 
> Of course, we shouldn't be getting bad tags in the first place, so something 
> else is clearly wrong and may be contributing to both problems.
> 
> How easy is this to reproduce?  It's right after a fresh connection, so 
> the number of possibly offending code paths is pretty small, at least!

I can reproduce this a few times in an hour.  It feels racy, i.e.
I can run the exact same test several times in a row, and sometimes
I see this and other times I don't.

> 
> There is client side debugging to turn on, but it's very chatty.  Maybe 
> you can just enable a few key lines, like the connect handshake ones, and 
> any point where we queue/send a tag.  It's a bit tedious to enable 
> the individual dout lines in messenger.c, sadly, but unless you have a 
> very fast netconsole or something that's probably the only way to go...

I'm happy to give this a try if you think it will help - can you
point me at some instructions for what you'd like to see?

Thanks -- Jim

> 
> sage
> 
> 




* Re: Client reconnect failing: reader gets bad tag
  2011-05-05 19:19 ` Sage Weil
  2011-05-05 20:23   ` Jim Schutt
@ 2011-05-06 21:56   ` Jim Schutt
  2011-05-12 21:32   ` [PATCH 1/2] libceph: add debugging to understand how bad msg tag is getting sent Jim Schutt
  2011-05-12 21:32   ` [PATCH 2/2] libceph: fix handle_timeout() racing with con_work()/try_write() Jim Schutt
  3 siblings, 0 replies; 17+ messages in thread
From: Jim Schutt @ 2011-05-06 21:56 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

[-- Attachment #1: Type: text/plain, Size: 2176 bytes --]


Hi Sage,

Sage Weil wrote:
> On Wed, 4 May 2011, Jim Schutt wrote:
>> Hi,
>>
>> I'm seeing clients having trouble reconnecting after timed-out
>> requests.  When they get in this state, sometimes they manage
>> to reconnect after several attempts; sometimes they never seem
>> to be able to reconnect.
> 
> Hmm, the interesting line is
> 
>> 2011-05-04 16:00:59.710971 7f15d6948940 -- 172.17.40.30:6806/12583 >>
>> 172.17.40.49:0/302440129 pipe(0x213fa000 sd=91 pgs=430 cs=1 l=1).reader bad
>> tag 0
> 
> That _should_ mean the server side (osd) closes out the connection 
> immediately, which should generate a disconnect error on the client and an 
> immediate reconnect.  So it's strange that you're also seeing timeouts.
> 
> Of course, we shouldn't be getting bad tags in the first place, so something 
> else is clearly wrong and may be contributing to both problems.
> 
> How easy is this to reproduce?  It's right after a fresh connection, so 
> the number of possibly offending code paths is pretty small, at least!
> 
> There is client side debugging to turn on, but it's very chatty.  Maybe 
> you can just enable a few key lines, like the connect handshake ones, and 
> any point where we queue/send a tag.  It's a bit tedious to enable 
> the individual dout lines in messenger.c, sadly, but unless you have a 
> very fast netconsole or something that's probably the only way to go...

Here are some logs of a hanging client-server interaction.

My dd started on the client at 14:38:22.

The first bad tag can be seen in the osd6 log at 14:39:40.655544.

AFAICS, the client had written a stripe into its socket,
and the OSD got as far as reading the msg tag and header
when the client gave up on the message, closed the socket,
and reconnected.  The OSD got a bad tag on the new pipe.
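
(A rough sketch of the framing, to make the tag/header point concrete.
The struct below is illustrative only; its field names and sizes are
assumptions, not the exact libceph header layout.)

/*
 * Illustrative framing only.  Each message on the wire is preceded by a
 * one-byte tag and a header that declares the payload lengths, e.g. the
 * "front=128 data=524288" envelope lines in the OSD log:
 *
 *   [ 1-byte tag = MSG ][ header ][ front ][ middle ][ data ]
 *
 * If a sender resumes a half-written message on a fresh connection
 * without re-sending the tag/header, the receiver reads header or
 * payload bytes where it expects a tag, and a "bad tag" shows up on a
 * brand-new pipe.
 */
#include <stdint.h>

struct example_msg_header {		/* names/sizes are assumptions */
	uint64_t seq;			/* per-connection sequence number */
	uint16_t type;			/* e.g. 42 = osd_op in these logs */
	uint32_t front_len;		/* 128 in these logs */
	uint32_t middle_len;
	uint32_t data_len;		/* 524288 in these logs */
};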

After that the client continued to retry the send, but
for many retries it always sent a bad tag.  It seems
to do this without closing/reopening the socket.

Then the client does close and reopen the socket, a valid
msg tag is sent, and things work fine.

FWIW, I think the client-side messenger isn't doing a
good job distinguishing a busy OSD from a dead OSD.

-- Jim

> 
> sage
> 
> 


[-- Attachment #2: client.full.log.bz2 --]
[-- Type: application/x-bzip, Size: 15714 bytes --]

[-- Attachment #3: client.log.bz2 --]
[-- Type: application/x-bzip, Size: 1844 bytes --]

[-- Attachment #4: server.log.bz2 --]
[-- Type: application/x-bzip, Size: 20897 bytes --]


* [PATCH 1/2] libceph: add debugging to understand how bad msg tag is getting sent
  2011-05-05 19:19 ` Sage Weil
  2011-05-05 20:23   ` Jim Schutt
  2011-05-06 21:56   ` Jim Schutt
@ 2011-05-12 21:32   ` Jim Schutt
  2011-05-12 21:32   ` [PATCH 2/2] libceph: fix handle_timeout() racing with con_work()/try_write() Jim Schutt
  3 siblings, 0 replies; 17+ messages in thread
From: Jim Schutt @ 2011-05-12 21:32 UTC (permalink / raw)
  To: sage; +Cc: ceph-devel, Jim Schutt

---
 net/ceph/messenger.c |   13 +++++++------
 1 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index e15a82c..db12abc 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -502,8 +502,8 @@ static void prepare_write_message(struct ceph_connection *con)
 		m->needs_out_seq = false;
 	}
 
-	dout("prepare_write_message %p seq %lld type %d len %d+%d+%d %d pgs\n",
-	     m, con->out_seq, le16_to_cpu(m->hdr.type),
+	dout("prepare_write_message %p msg %p seq %lld type %d len %d+%d+%d %d pgs\n",
+	     con, m, con->out_seq, le16_to_cpu(m->hdr.type),
 	     le32_to_cpu(m->hdr.front_len), le32_to_cpu(m->hdr.middle_len),
 	     le32_to_cpu(m->hdr.data_len),
 	     m->nr_pages);
@@ -1708,7 +1708,8 @@ static int try_write(struct ceph_connection *con)
 	     atomic_read(&con->nref));
 
 more:
-	dout("try_write out_kvec_bytes %d\n", con->out_kvec_bytes);
+	dout("try_write %p out_msg %p out_kvec_bytes %d\n",
+	     con, con->out_msg, con->out_kvec_bytes);
 
 	/* open the socket first? */
 	if (con->sock == NULL) {
@@ -1783,7 +1784,7 @@ do_next:
 
 	/* Nothing to do! */
 	clear_bit(WRITE_PENDING, &con->state);
-	dout("try_write nothing else to write.\n");
+	dout("try_write nothing else to write on %p.\n", con);
 	ret = 0;
 out:
 	dout("try_write done on %p ret %d\n", con, ret);
@@ -1959,13 +1960,13 @@ static void con_work(struct work_struct *work)
 		goto done;
 	}
 	if (test_bit(CLOSED, &con->state)) { /* e.g. if we are replaced */
-		dout("con_work CLOSED\n");
+		dout("con_work %p CLOSED\n", con);
 		con_close_socket(con);
 		goto done;
 	}
 	if (test_and_clear_bit(OPENING, &con->state)) {
 		/* reopen w/ new peer */
-		dout("con_work OPENING\n");
+		dout("con_work %p OPENING\n", con);
 		con_close_socket(con);
 	}
 
-- 
1.6.6




* [PATCH 2/2] libceph: fix handle_timeout() racing with con_work()/try_write()
  2011-05-05 19:19 ` Sage Weil
                     ` (2 preceding siblings ...)
  2011-05-12 21:32   ` [PATCH 1/2] libceph: add debugging to understand how bad msg tag is getting sent Jim Schutt
@ 2011-05-12 21:32   ` Jim Schutt
  2011-05-16 16:57     ` [PATCH v2 0/1] " Jim Schutt
  2011-05-16 17:57     ` [PATCH 2/2] " Sage Weil
  3 siblings, 2 replies; 17+ messages in thread
From: Jim Schutt @ 2011-05-12 21:32 UTC (permalink / raw)
  To: sage; +Cc: ceph-devel, Jim Schutt

Under heavy write load, ceph OSDs get messages from clients with bad
message tags.  The result is that the client/OSD connection stalls
for a while until the connection state gets sorted out.

The client-side sequence of events that has this result is as follows:

Due to the heavy write load, message processing is backed up, and
a client begins timing out its messages to an OSD.  This timeout
mechanism is intended to recover lost connections, so it causes the
connection to be closed and reopened in an attempt to recover contact
with the OSD.

All ceph socket operations are performed on a workqueue via ceph's
con_work() function.  handle_timeout(), via __reset_osd(), signals
via ceph_connection state that the connection should be closed
and reopened.  However, if there is a message send in progress
when __reset_osd() is called, the ceph_connection state is not
correctly updated, with the result that upon reconnect, information
about the in-progress message is lost, and data that is interpreted
as a message tag is incorrectly sent.

Fix this by causing try_write() to notice that it needs to do
nothing until the connection socket has been closed.  This gives
the appropriate con_work() calls that close/reopen the socket
time to occur correctly.
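
As a rough sketch of that idea (an illustration of the approach described
above, not the actual patch hunk), the guard amounts to try_write()
bailing out while the connection is still marked to be closed/reopened,
so that con_work() can tear the socket down and rebuild the write state
first.  CLOSED, OPENING, test_bit() and dout() are the existing messenger
names; their use here is an assumption about how such a check could look:

	/* sketch only, not the posted patch */
	if (test_bit(CLOSED, &con->state) || test_bit(OPENING, &con->state)) {
		dout("try_write %p ignoring, connection is being reset\n", con);
		return 0;	/* let con_work() close/reopen the socket first */
	}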

Here's an annotated debug log of the erroneous sequence:

[ 2381.368370] libceph:   messenger.c:1708 : try_write start ffff8801955c0030 state 29 nref 1
[ 2381.368374] libceph:   messenger.c:1712 : try_write ffff8801955c0030 out_msg ffff88018f91c600 out_kvec_bytes 0
[ 2381.368959] libceph:   messenger.c:1790 : try_write done on ffff8801955c0030 ret 0

>> con_work()/try_write() made progress on a message

[ 2381.716677] libceph:   messenger.c:1708 : try_write start ffff8801955c0030 state 29 nref 1
[ 2381.716681] libceph:   messenger.c:1712 : try_write ffff8801955c0030 out_msg ffff88018f91c600 out_kvec_bytes 0
[ 2381.717434] libceph:   messenger.c:509  : prepare_write_message ffff8801955c0030 msg ffff88018f91d6c0 seq 5 type 42 len 128+0+524288 128 pgs
[ 2381.717522] libceph:   messenger.c:1712 : try_write ffff8801955c0030 out_msg ffff88018f91d6c0 out_kvec_bytes 191
[ 2381.718982] libceph:   messenger.c:1790 : try_write done on ffff8801955c0030 ret 0

>> con_work()/try_write() finished previous message and started on next message

[ 2388.650373] libceph:   messenger.c:2159 : ----- ffff88015410a300 to osd20 42=osd_op len 128+0+524288 -----
[ 2388.664280] libceph:   messenger.c:2159 : ----- ffff880154116d80 to osd20 42=osd_op len 128+0+524288 -----
[ 2388.664905] libceph:   messenger.c:2159 : ----- ffff8801541176c0 to osd20 42=osd_op len 128+0+524288 -----
[ 2388.669632] libceph:   messenger.c:2159 : ----- ffff88015411bb40 to osd20 42=osd_op len 128+0+524288 -----

>> ceph_con_send() queues more work to connection

[ 2392.366919] libceph:  tid 447 timed out on osd20, will reset osd
[ 2392.366922] libceph:  osd_client.c:725  : __reset_osd ffff8801955c0000 osd20
[ 2392.366926] libceph:   messenger.c:349  : con_close ffff8801955c0030 peer 172.17.40.27:6800

>> A message times out, __reset_osd signals that connection should be closed; work
>> queued to connection is discarded.

[ 2392.366934] libceph:   messenger.c:369  : con_open ffff8801955c0030 172.17.40.27:6800

>> __reset_osd signals that connection should be opened.

[ 2392.367237] libceph:   messenger.c:2159 : ----- ffff88018f91c600 to osd20 42=osd_op len 128+0+524288 -----
[ 2392.367242] libceph:   messenger.c:2159 : ----- ffff88018f91d6c0 to osd20 42=osd_op len 128+0+524288 -----
[ 2392.367247] libceph:   messenger.c:2159 : ----- ffff88018f920a80 to osd20 42=osd_op len 128+0+524288 -----
[ 2392.367251] libceph:   messenger.c:2159 : ----- ffff88018f924d80 to osd20 42=osd_op len 128+0+524288 -----
[ 2392.367256] libceph:   messenger.c:2159 : ----- ffff88018f926480 to osd20 42=osd_op len 128+0+524288 -----
[ 2392.367260] libceph:   messenger.c:2159 : ----- ffff88018f92f240 to osd20 42=osd_op len 128+0+524288 -----
[ 2392.367265] libceph:   messenger.c:2159 : ----- ffff88018f930d80 to osd20 42=osd_op len 128+0+524288 -----
[ 2392.367269] libceph:   messenger.c:2159 : ----- ffff88018be14f00 to osd20 42=osd_op len 128+0+524288 -----
[ 2392.367274] libceph:   messenger.c:2159 : ----- ffff88018be1b3c0 to osd20 42=osd_op len 128+0+524288 -----
[ 2392.367279] libceph:   messenger.c:2159 : ----- ffff88018be210c0 to osd20 42=osd_op len 128+0+524288 -----
[ 2392.367283] libceph:   messenger.c:2159 : ----- ffff88018d9c69c0 to osd20 42=osd_op len 128+0+524288 -----
[ 2392.367288] libceph:   messenger.c:2159 : ----- ffff88018d85f9c0 to osd20 42=osd_op len 128+0+524288 -----
[ 2392.367293] libceph:   messenger.c:2159 : ----- ffff88018daf2300 to osd20 42=osd_op len 128+0+524288 -----
[ 2392.367297] libceph:   messenger.c:2159 : ----- ffff88018f8b6600 to osd20 42=osd_op len 128+0+524288 -----
[ 2392.367302] libceph:   messenger.c:2159 : ----- ffff88018be28840 to osd20 42=osd_op len 128+0+524288 -----
[ 2392.367307] libceph:   messenger.c:2159 : ----- ffff880154105240 to osd20 42=osd_op len 128+0+524288 -----
[ 2392.367311] libceph:   messenger.c:2159 : ----- ffff88015410a300 to osd20 42=osd_op len 128+0+524288 -----
[ 2392.367316] libceph:   messenger.c:2159 : ----- ffff88015410f0c0 to osd20 42=osd_op len 128+0+524288 -----
[ 2392.367320] libceph:   messenger.c:2159 : ----- ffff880154116d80 to osd20 42=osd_op len 128+0+524288 -----
[ 2392.367325] libceph:   messenger.c:2159 : ----- ffff8801541176c0 to osd20 42=osd_op len 128+0+524288 -----
[ 2392.367329] libceph:   messenger.c:2159 : ----- ffff88015411bb40 to osd20 42=osd_op len 128+0+524288 -----

>> Outstanding work is requeued to connection.

[ 2397.376206] libceph:   messenger.c:2230 : con_keepalive ffff8801955c0030
[ 2402.384140] libceph:   messenger.c:2230 : con_keepalive ffff8801955c0030
[ 2407.395283] libceph:   messenger.c:2230 : con_keepalive ffff8801955c0030
[ 2411.207412] libceph:   messenger.c:2177 : con_revoke ffff8801955c0030 msg ffff88018f91c600 - was on queue

[ 2411.217092] libceph:   messenger.c:1708 : try_write start ffff8801955c0030 state 8220 nref 1
[ 2411.225536] libceph:   messenger.c:1712 : try_write ffff8801955c0030 out_msg           (null) out_kvec_bytes 0
[ 2411.235555] libceph:   messenger.c:509  : prepare_write_message ffff8801955c0030 msg ffff88018f91d6c0 seq 1 type 42 len 128+0+524288 128 pgs
[ 2411.256874] libceph:   messenger.c:1712 : try_write ffff8801955c0030 out_msg ffff88018f91d6c0 out_kvec_bytes 191
[ 2411.267280] libceph:   messenger.c:1790 : try_write done on ffff8801955c0030 ret 0

>> con_work()/try_write() start again on new message.
>> Note that con_work() did not notice that the connection was marked to
>> be closed/reopened!  This is where things went bad.
>> Note the connection state: 8220 = 0x201c, and 0x2000 is OPENING.

[ 2412.643178] libceph:   messenger.c:1969 : con_work ffff8801955c0030 OPENING
[ 2412.643182] libceph:   messenger.c:292  : con_close_socket on ffff8801955c0030 sock ffff8801825a8c00
[ 2412.643197] libceph:   messenger.c:162  : ceph_state_change ffff8801955c0030 state = 2076 sk_state = 4

>> con_work() notices that connection should be closed/reopened

[ 2412.643206] libceph:   messenger.c:1708 : try_write start ffff8801955c0030 state 28 nref 1
[ 2412.643210] libceph:   messenger.c:1712 : try_write ffff8801955c0030 out_msg ffff88018f91d6c0 out_kvec_bytes 0
[ 2412.643214] libceph:   messenger.c:665  : prepare_write_connect ffff8801955c0030 cseq=0 gseq=84 proto=24
[ 2412.643218] libceph:   messenger.c:1725 : try_write initiating connect on ffff8801955c0030 new state 26
[ 2412.643281] libceph:   messenger.c:1790 : try_write done on ffff8801955c0030 ret 0
[ 2412.643300] libceph:   messenger.c:162  : ceph_state_change ffff8801955c0030 state = 26 sk_state = 1

>> con_work()/try_write() attempts to reopen connection, since con->sock is NULL.
>> Note that prepare_write_connect() just overwrote the part of the kvec that
>> had the message tag that prepare_write_message() set up above.

[ 2415.582538] libceph:   messenger.c:2159 : ----- ffff88018d892480 to osd20 42=osd_op len 128+0+524288 -----
[ 2417.921405] libceph:   messenger.c:2230 : con_keepalive ffff8801955c0030
[ 2418.046863] libceph:   messenger.c:2159 : ----- ffff8801542d8480 to osd20 42=osd_op len 128+0+524288 -----
[ 2418.129434] libceph:   messenger.c:2159 : ----- ffff8801542d8d80 to osd20 42=osd_op len 128+0+524288 -----

>> more work is queued

[ 2418.219923] libceph:   messenger.c:1708 : try_write start ffff8801955c0030 state 30 nref 1
[ 2418.219955] libceph:   messenger.c:1712 : try_write ffff8801955c0030 out_msg ffff88018f91d6c0 out_kvec_bytes 200
[ 2418.220388] libceph:   messenger.c:1790 : try_write done on ffff8801955c0030 ret 0

>> try_write() tries again to send the current message after connection was reopened.
>> since connection was reopened, we need to call prepare_write_message() again to
>> send correct message tag, but that didn't happen.

[ 2418.220517] libceph:   messenger.c:162  : ceph_state_change ffff8801955c0030 state = 30 sk_state = 8
[ 2418.223341] libceph:   messenger.c:162  : ceph_state_change ffff8801955c0030 state = 2078 sk_state = 7
[ 2419.611777] libceph:   messenger.c:2159 : ----- ffff8801542d96c0 to osd20 42=osd_op len 128+0+524288 -----
[ 2419.620907] libceph:   messenger.c:2159 : ----- ffff8801542e1840 to osd20 42=osd_op len 128+0+524288 -----
[ 2419.638112] libceph:   messenger.c:2159 : ----- ffff880154355240 to osd20 42=osd_op len 128+0+524288 -----
[ 2423.040109] libceph:   messenger.c:2230 : con_keepalive ffff8801955c0030
[ 2424.572798] libceph: osd20 172.17.40.27:6800 connection failed

>> client closed connection since it had bad tag

[ 2424.578852] libceph:   messenger.c:1997 : fault ffff8801955c0030 state 30 to peer 172.17.40.27:6800
[ 2424.587903] libceph:   messenger.c:292  : con_close_socket on ffff8801955c0030 sock ffff880195fbb300
[ 2424.597059] libceph:   messenger.c:162  : ceph_state_change ffff8801955c0030 state = 2078 sk_state = 7
[ 2425.380053] libceph:   messenger.c:2034 : fault queued ffff8801955c0030 delay 500
[ 2425.387544] libceph:  osd_client.c:725  : __reset_osd ffff8801955c0000 osd20
[ 2428.814245] libceph:   messenger.c:1708 : try_write start ffff8801955c0030 state 30 nref 1
[ 2428.822518] libceph:   messenger.c:1712 : try_write ffff8801955c0030 out_msg ffff88018f91d6c0 out_kvec_bytes 0
[ 2428.832532] libceph:   messenger.c:665  : prepare_write_connect ffff8801955c0030 cseq=0 gseq=128 proto=24
[ 2428.842125] libceph:   messenger.c:1725 : try_write initiating connect on ffff8801955c0030 new state 26
[ 2428.851583] libceph:   messenger.c:1790 : try_write done on ffff8801955c0030 ret 0
[ 2428.859178] libceph:   messenger.c:162  : ceph_state_change ffff8801955c0030 state = 26 sk_state = 1
[ 2432.083976] libceph:   messenger.c:1708 : try_write start ffff8801955c0030 state 30 nref 1
[ 2432.092250] libceph:   messenger.c:1712 : try_write ffff8801955c0030 out_msg ffff88018f91d6c0 out_kvec_bytes 200
[ 2432.103263] libceph:   messenger.c:162  : ceph_state_change ffff8801955c0030 state = 30 sk_state = 8
[ 2432.119307] libceph:   messenger.c:162  : ceph_state_change ffff8801955c0030 state = 2078 sk_state = 7
[ 2432.149668] libceph:   messenger.c:1790 : try_write done on ffff8801955c0030 ret -32
[ 2432.157419] libceph: osd20 172.17.40.27:6800 connection failed
[ 2432.163266] libceph:   messenger.c:1997 : fault ffff8801955c0030 state 2078 to peer 172.17.40.27:6800
[ 2432.172487] libceph:   messenger.c:292  : con_close_socket on ffff8801955c0030 sock ffff8801825ecc00
[ 2432.181636] libceph:   messenger.c:162  : ceph_state_change ffff8801955c0030 state = 2078 sk_state = 7
[ 2432.204310] libceph:   messenger.c:2038 : fault failed to queue ffff8801955c0030 delay 1000, backoff
[ 2432.213462] libceph:  osd_client.c:725  : __reset_osd ffff8801955c0000 osd20
[ 2435.234571] libceph:   messenger.c:1945 : con_work ffff8801955c0030 backing off
[ 2435.241907] libceph:   messenger.c:1948 : con_work ffff8801955c0030 backoff 1000
[ 2439.455722] libceph:   messenger.c:1708 : try_write start ffff8801955c0030 state 30 nref 1
[ 2439.464052] libceph:   messenger.c:1712 : try_write ffff8801955c0030 out_msg ffff88018f91d6c0 out_kvec_bytes 0
[ 2439.474054] libceph:   messenger.c:665  : prepare_write_connect ffff8801955c0030 cseq=0 gseq=168 proto=24
[ 2439.483636] libceph:   messenger.c:1725 : try_write initiating connect on ffff8801955c0030 new state 26
[ 2439.493061] libceph:   messenger.c:1790 : try_write done on ffff8801955c0030 ret 0
[ 2439.500653] libceph:   messenger.c:162  : ceph_state_change ffff8801955c0030 state = 26 sk_state = 1
[ 2442.678598] libceph:   messenger.c:1708 : try_write start ffff8801955c0030 state 30 nref 1
[ 2442.686871] libceph:   messenger.c:1712 : try_write ffff8801955c0030 out_msg ffff88018f91d6c0 out_kvec_bytes 200
[ 2442.697121] libceph:   messenger.c:1790 : try_write done on ffff8801955c0030 ret 0
[ 2442.704825] libceph:   messenger.c:162  : ceph_state_change ffff8801955c0030 state = 30 sk_state = 8
[ 2442.720849] libceph:   messenger.c:162  : ceph_state_change ffff8801955c0030 state = 2078 sk_state = 7
[ 2444.582650] libceph: osd20 172.17.40.27:6800 connection failed
[ 2444.588700] libceph:   messenger.c:1997 : fault ffff8801955c0030 state 30 to peer 172.17.40.27:6800
[ 2444.597750] libceph:   messenger.c:292  : con_close_socket on ffff8801955c0030 sock ffff8801825a8900
[ 2444.606904] libceph:   messenger.c:162  : ceph_state_change ffff8801955c0030 state = 2078 sk_state = 7
[ 2444.629552] libceph:   messenger.c:2034 : fault queued ffff8801955c0030 delay 2000
[ 2444.637143] libceph:  osd_client.c:725  : __reset_osd ffff8801955c0000 osd20
[ 2449.541286] libceph:   messenger.c:1708 : try_write start ffff8801955c0030 state 30 nref 1
[ 2449.549601] libceph:   messenger.c:1712 : try_write ffff8801955c0030 out_msg ffff88018f91d6c0 out_kvec_bytes 0
[ 2449.569187] libceph:   messenger.c:1725 : try_write initiating connect on ffff8801955c0030 new state 26
[ 2449.578620] libceph:   messenger.c:1790 : try_write done on ffff8801955c0030 ret 0
[ 2449.586212] libceph:   messenger.c:162  : ceph_state_change ffff8801955c0030 state = 26 sk_state = 1
[ 2452.647790] libceph:   messenger.c:1708 : try_write start ffff8801955c0030 state 30 nref 1
[ 2452.656069] libceph:   messenger.c:1712 : try_write ffff8801955c0030 out_msg ffff88018f91d6c0 out_kvec_bytes 200
[ 2452.667104] libceph:   messenger.c:162  : ceph_state_change ffff8801955c0030 state = 30 sk_state = 8
[ 2452.683145] libceph:   messenger.c:162  : ceph_state_change ffff8801955c0030 state = 2078 sk_state = 7
[ 2452.713468] libceph:   messenger.c:1790 : try_write done on ffff8801955c0030 ret -32
[ 2452.721233] libceph: osd20 172.17.40.27:6800 connection failed
[ 2452.727079] libceph:   messenger.c:1997 : fault ffff8801955c0030 state 2078 to peer 172.17.40.27:6800
[ 2452.736314] libceph:   messenger.c:292  : con_close_socket on ffff8801955c0030 sock ffff8801825ec900
[ 2452.745447] libceph:   messenger.c:162  : ceph_state_change ffff8801955c0030 state = 2078 sk_state = 7
[ 2452.768128] libceph:   messenger.c:2038 : fault failed to queue ffff8801955c0030 delay 4000, backoff
[ 2452.777264] libceph:  osd_client.c:725  : __reset_osd ffff8801955c0000 osd20
[ 2455.493705] libceph:   messenger.c:1945 : con_work ffff8801955c0030 backing off
[ 2455.501031] libceph:   messenger.c:1948 : con_work ffff8801955c0030 backoff 4000
[ 2458.132179] libceph:   messenger.c:2230 : con_keepalive ffff8801955c0030
[ 2462.134769] libceph:   messenger.c:1708 : try_write start ffff8801955c0030 state 30 nref 1
[ 2462.143086] libceph:   messenger.c:1712 : try_write ffff8801955c0030 out_msg ffff88018f91d6c0 out_kvec_bytes 0
[ 2462.153099] libceph:   messenger.c:665  : prepare_write_connect ffff8801955c0030 cseq=0 gseq=242 proto=24
[ 2462.162670] libceph:   messenger.c:1725 : try_write initiating connect on ffff8801955c0030 new state 26
[ 2462.172102] libceph:   messenger.c:1790 : try_write done on ffff8801955c0030 ret 0
[ 2462.179695] libceph:   messenger.c:162  : ceph_state_change ffff8801955c0030 state = 26 sk_state = 1
[ 2464.284217] libceph:   messenger.c:1708 : try_write start ffff8801955c0030 state 30 nref 1
[ 2464.292489] libceph:   messenger.c:1712 : try_write ffff8801955c0030 out_msg ffff88018f91d6c0 out_kvec_bytes 200
[ 2464.303138] libceph:   messenger.c:1790 : try_write done on ffff8801955c0030 ret 0
[ 2464.310771] libceph:   messenger.c:162  : ceph_state_change ffff8801955c0030 state = 30 sk_state = 8
[ 2464.326802] libceph:   messenger.c:162  : ceph_state_change ffff8801955c0030 state = 2078 sk_state = 7
[ 2466.903694] libceph: osd20 172.17.40.27:6800 connection failed
[ 2466.909735] libceph:   messenger.c:1997 : fault ffff8801955c0030 state 30 to peer 172.17.40.27:6800

>> four more iterations of reconnecting, then sending bad message tag.
---
 net/ceph/messenger.c |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index db12abc..657a2de 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -1707,6 +1707,12 @@ static int try_write(struct ceph_connection *con)
 	dout("try_write start %p state %lu nref %d\n", con, con->state,
 	     atomic_read(&con->nref));
 
+	if (test_bit(OPENING, &con->state)) {
+		dout("try_write %p raced ceph_con_open, won - retry\n", con);
+		ret = 0;
+		goto out;
+	}
+
 more:
 	dout("try_write %p out_msg %p out_kvec_bytes %d\n",
 	     con, con->out_msg, con->out_kvec_bytes);
-- 
1.6.6




* [PATCH v2 0/1] libceph: fix handle_timeout() racing with con_work()/try_write()
  2011-05-12 21:32   ` [PATCH 2/2] libceph: fix handle_timeout() racing with con_work()/try_write() Jim Schutt
@ 2011-05-16 16:57     ` Jim Schutt
  2011-05-16 16:57       ` [PATCH v2 1/1] " Jim Schutt
  2011-05-16 17:57     ` [PATCH 2/2] " Sage Weil
  1 sibling, 1 reply; 17+ messages in thread
From: Jim Schutt @ 2011-05-16 16:57 UTC (permalink / raw)
  To: sage; +Cc: ceph-devel, Jim Schutt

Changes since v1:

If try_write() needs to re-open a connection when there is
a message send in progress, just requeue it.  This is more
robust, since it handles all possible ways the connection
could have gotten to that state.
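
To make that concrete, the core of the requeue (a sketch only; the
authoritative hunk is in patch 1/1 below) looks like:

	/* try_write() found con->sock == NULL with a send in progress:
	 * put the current message back at the head of out_queue so it
	 * is fully re-prepared, correct tag included, after reconnect. */
	if (con->out_msg) {
		struct ceph_msg *m = con->out_msg;

		if (test_bit(LOSSYTX, &con->state)) {
			list_add(&m->list_head, &con->out_queue);
		} else {
			list_move(&m->list_head, &con->out_queue);
			ceph_msg_put(m);
		}
		con->out_msg = NULL;
	}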


Jim Schutt (1):
  libceph: fix handle_timeout() racing with con_work()/try_write()

 net/ceph/messenger.c |   11 +++++++++++
 1 files changed, 11 insertions(+), 0 deletions(-)




* [PATCH v2 1/1] libceph: fix handle_timeout() racing with con_work()/try_write()
  2011-05-16 16:57     ` [PATCH v2 0/1] " Jim Schutt
@ 2011-05-16 16:57       ` Jim Schutt
  0 siblings, 0 replies; 17+ messages in thread
From: Jim Schutt @ 2011-05-16 16:57 UTC (permalink / raw)
  To: sage; +Cc: ceph-devel, Jim Schutt

Under heavy write load, ceph OSDs get messages from clients with bad
message tags.  The result is that the client/OSD connection stalls
for a while until the connection state gets sorted out.

The client-side sequence of events that has this result is as follows:

Due to the heavy write load, message processing is backed up, and
a client begins timing out its messages to an OSD.  This timeout
mechanism is intended to recover lost connections, so it causes the
connection to be closed and reopened in an attempt to recover contact
with the OSD.

All ceph socket operations are performed on a workqueue via ceph's
con_work() function.  handle_timeout(), via __reset_osd(), signals
via ceph_connection state that the connection should be closed
and reopened.  However, if there is a message send in progress
when __reset_osd() is called, the ceph_connection state is not
correctly updated, with the result that upon reconnect, information
about the in-progress message is lost, and data that is interpreted
as a message tag is incorrectly sent.

Fix this by having try_write() notice that it has a message in
progress (con->out_msg is non-NULL) when it needs to reopen the
connection socket.  Put the current message back on the head of
con->out_queue, in case ceph_con_send() has already requeued
outstanding messages after ceph_con_close() had dropped them all.

Here's an annotated debug log of the erroneous sequence:

[ 2381.368370] libceph:   messenger.c:1708 : try_write start ffff8801955c0030 state 29 nref 1
[ 2381.368374] libceph:   messenger.c:1712 : try_write ffff8801955c0030 out_msg ffff88018f91c600 out_kvec_bytes 0
[ 2381.368959] libceph:   messenger.c:1790 : try_write done on ffff8801955c0030 ret 0

>> con_work()/try_write() made progress on a message

[ 2381.716677] libceph:   messenger.c:1708 : try_write start ffff8801955c0030 state 29 nref 1
[ 2381.716681] libceph:   messenger.c:1712 : try_write ffff8801955c0030 out_msg ffff88018f91c600 out_kvec_bytes 0
[ 2381.717434] libceph:   messenger.c:509  : prepare_write_message ffff8801955c0030 msg ffff88018f91d6c0 seq 5 type 42 len 128+0+524288 128 pgs
[ 2381.717522] libceph:   messenger.c:1712 : try_write ffff8801955c0030 out_msg ffff88018f91d6c0 out_kvec_bytes 191
[ 2381.718982] libceph:   messenger.c:1790 : try_write done on ffff8801955c0030 ret 0

>> con_work()/try_write() finished previous message and started on next message

[ 2388.650373] libceph:   messenger.c:2159 : ----- ffff88015410a300 to osd20 42=osd_op len 128+0+524288 -----
[ 2388.664280] libceph:   messenger.c:2159 : ----- ffff880154116d80 to osd20 42=osd_op len 128+0+524288 -----
[ 2388.664905] libceph:   messenger.c:2159 : ----- ffff8801541176c0 to osd20 42=osd_op len 128+0+524288 -----
[ 2388.669632] libceph:   messenger.c:2159 : ----- ffff88015411bb40 to osd20 42=osd_op len 128+0+524288 -----

>> ceph_con_send() queues more work to connection

[ 2392.366919] libceph:  tid 447 timed out on osd20, will reset osd
[ 2392.366922] libceph:  osd_client.c:725  : __reset_osd ffff8801955c0000 osd20
[ 2392.366926] libceph:   messenger.c:349  : con_close ffff8801955c0030 peer 172.17.40.27:6800

>> A message times out, __reset_osd signals that connection should be closed; work
>> queued to connection is discarded.

[ 2392.366934] libceph:   messenger.c:369  : con_open ffff8801955c0030 172.17.40.27:6800

>> __reset_osd signals that connection should be opened.

[ 2392.367237] libceph:   messenger.c:2159 : ----- ffff88018f91c600 to osd20 42=osd_op len 128+0+524288 -----
[ 2392.367242] libceph:   messenger.c:2159 : ----- ffff88018f91d6c0 to osd20 42=osd_op len 128+0+524288 -----
[ 2392.367247] libceph:   messenger.c:2159 : ----- ffff88018f920a80 to osd20 42=osd_op len 128+0+524288 -----
[ 2392.367251] libceph:   messenger.c:2159 : ----- ffff88018f924d80 to osd20 42=osd_op len 128+0+524288 -----
[ 2392.367256] libceph:   messenger.c:2159 : ----- ffff88018f926480 to osd20 42=osd_op len 128+0+524288 -----
[ 2392.367260] libceph:   messenger.c:2159 : ----- ffff88018f92f240 to osd20 42=osd_op len 128+0+524288 -----
[ 2392.367265] libceph:   messenger.c:2159 : ----- ffff88018f930d80 to osd20 42=osd_op len 128+0+524288 -----
[ 2392.367269] libceph:   messenger.c:2159 : ----- ffff88018be14f00 to osd20 42=osd_op len 128+0+524288 -----
[ 2392.367274] libceph:   messenger.c:2159 : ----- ffff88018be1b3c0 to osd20 42=osd_op len 128+0+524288 -----
[ 2392.367279] libceph:   messenger.c:2159 : ----- ffff88018be210c0 to osd20 42=osd_op len 128+0+524288 -----
[ 2392.367283] libceph:   messenger.c:2159 : ----- ffff88018d9c69c0 to osd20 42=osd_op len 128+0+524288 -----
[ 2392.367288] libceph:   messenger.c:2159 : ----- ffff88018d85f9c0 to osd20 42=osd_op len 128+0+524288 -----
[ 2392.367293] libceph:   messenger.c:2159 : ----- ffff88018daf2300 to osd20 42=osd_op len 128+0+524288 -----
[ 2392.367297] libceph:   messenger.c:2159 : ----- ffff88018f8b6600 to osd20 42=osd_op len 128+0+524288 -----
[ 2392.367302] libceph:   messenger.c:2159 : ----- ffff88018be28840 to osd20 42=osd_op len 128+0+524288 -----
[ 2392.367307] libceph:   messenger.c:2159 : ----- ffff880154105240 to osd20 42=osd_op len 128+0+524288 -----
[ 2392.367311] libceph:   messenger.c:2159 : ----- ffff88015410a300 to osd20 42=osd_op len 128+0+524288 -----
[ 2392.367316] libceph:   messenger.c:2159 : ----- ffff88015410f0c0 to osd20 42=osd_op len 128+0+524288 -----
[ 2392.367320] libceph:   messenger.c:2159 : ----- ffff880154116d80 to osd20 42=osd_op len 128+0+524288 -----
[ 2392.367325] libceph:   messenger.c:2159 : ----- ffff8801541176c0 to osd20 42=osd_op len 128+0+524288 -----
[ 2392.367329] libceph:   messenger.c:2159 : ----- ffff88015411bb40 to osd20 42=osd_op len 128+0+524288 -----

>> Outstanding work is requeued to connection.

[ 2397.376206] libceph:   messenger.c:2230 : con_keepalive ffff8801955c0030
[ 2402.384140] libceph:   messenger.c:2230 : con_keepalive ffff8801955c0030
[ 2407.395283] libceph:   messenger.c:2230 : con_keepalive ffff8801955c0030
[ 2411.207412] libceph:   messenger.c:2177 : con_revoke ffff8801955c0030 msg ffff88018f91c600 - was on queue

[ 2411.217092] libceph:   messenger.c:1708 : try_write start ffff8801955c0030 state 8220 nref 1
[ 2411.225536] libceph:   messenger.c:1712 : try_write ffff8801955c0030 out_msg           (null) out_kvec_bytes 0
[ 2411.235555] libceph:   messenger.c:509  : prepare_write_message ffff8801955c0030 msg ffff88018f91d6c0 seq 1 type 42 len 128+0+524288 128 pgs
[ 2411.256874] libceph:   messenger.c:1712 : try_write ffff8801955c0030 out_msg ffff88018f91d6c0 out_kvec_bytes 191
[ 2411.267280] libceph:   messenger.c:1790 : try_write done on ffff8801955c0030 ret 0

>> con_work()/try_write() start again on new message.
>> Note that con_work() did not notice that the connection was marked to
>> be closed/reopened!  This is where things went bad.
>> Note the connection state: 8220 = 0x201c, and 0x2000 is OPENING.

[ 2412.643178] libceph:   messenger.c:1969 : con_work ffff8801955c0030 OPENING
[ 2412.643182] libceph:   messenger.c:292  : con_close_socket on ffff8801955c0030 sock ffff8801825a8c00
[ 2412.643197] libceph:   messenger.c:162  : ceph_state_change ffff8801955c0030 state = 2076 sk_state = 4

>> con_work() notices that connection should be closed/reopened

[ 2412.643206] libceph:   messenger.c:1708 : try_write start ffff8801955c0030 state 28 nref 1
[ 2412.643210] libceph:   messenger.c:1712 : try_write ffff8801955c0030 out_msg ffff88018f91d6c0 out_kvec_bytes 0
[ 2412.643214] libceph:   messenger.c:665  : prepare_write_connect ffff8801955c0030 cseq=0 gseq=84 proto=24
[ 2412.643218] libceph:   messenger.c:1725 : try_write initiating connect on ffff8801955c0030 new state 26
[ 2412.643281] libceph:   messenger.c:1790 : try_write done on ffff8801955c0030 ret 0
[ 2412.643300] libceph:   messenger.c:162  : ceph_state_change ffff8801955c0030 state = 26 sk_state = 1

>> con_work()/try_write() attempts to reopen connection, since con->sock is NULL.
>> Note that prepare_write_connect() just overwrote the part of the kvec that
>> had the message tag that prepare_write_message() set up above.

[ 2415.582538] libceph:   messenger.c:2159 : ----- ffff88018d892480 to osd20 42=osd_op len 128+0+524288 -----
[ 2417.921405] libceph:   messenger.c:2230 : con_keepalive ffff8801955c0030
[ 2418.046863] libceph:   messenger.c:2159 : ----- ffff8801542d8480 to osd20 42=osd_op len 128+0+524288 -----
[ 2418.129434] libceph:   messenger.c:2159 : ----- ffff8801542d8d80 to osd20 42=osd_op len 128+0+524288 -----

>> more work is queued

[ 2418.219923] libceph:   messenger.c:1708 : try_write start ffff8801955c0030 state 30 nref 1
[ 2418.219955] libceph:   messenger.c:1712 : try_write ffff8801955c0030 out_msg ffff88018f91d6c0 out_kvec_bytes 200
[ 2418.220388] libceph:   messenger.c:1790 : try_write done on ffff8801955c0030 ret 0

>> try_write() tries again to send the current message after connection was reopened.
>> since connection was reopened, we need to call prepare_write_message() again to
>> send correct message tag, but that didn't happen.

[ 2418.220517] libceph:   messenger.c:162  : ceph_state_change ffff8801955c0030 state = 30 sk_state = 8
[ 2418.223341] libceph:   messenger.c:162  : ceph_state_change ffff8801955c0030 state = 2078 sk_state = 7
[ 2419.611777] libceph:   messenger.c:2159 : ----- ffff8801542d96c0 to osd20 42=osd_op len 128+0+524288 -----
[ 2419.620907] libceph:   messenger.c:2159 : ----- ffff8801542e1840 to osd20 42=osd_op len 128+0+524288 -----
[ 2419.638112] libceph:   messenger.c:2159 : ----- ffff880154355240 to osd20 42=osd_op len 128+0+524288 -----
[ 2423.040109] libceph:   messenger.c:2230 : con_keepalive ffff8801955c0030
[ 2424.572798] libceph: osd20 172.17.40.27:6800 connection failed

>> client closed connection since it had bad tag

[ 2424.578852] libceph:   messenger.c:1997 : fault ffff8801955c0030 state 30 to peer 172.17.40.27:6800
[ 2424.587903] libceph:   messenger.c:292  : con_close_socket on ffff8801955c0030 sock ffff880195fbb300
[ 2424.597059] libceph:   messenger.c:162  : ceph_state_change ffff8801955c0030 state = 2078 sk_state = 7
[ 2425.380053] libceph:   messenger.c:2034 : fault queued ffff8801955c0030 delay 500
[ 2425.387544] libceph:  osd_client.c:725  : __reset_osd ffff8801955c0000 osd20
[ 2428.814245] libceph:   messenger.c:1708 : try_write start ffff8801955c0030 state 30 nref 1
[ 2428.822518] libceph:   messenger.c:1712 : try_write ffff8801955c0030 out_msg ffff88018f91d6c0 out_kvec_bytes 0
[ 2428.832532] libceph:   messenger.c:665  : prepare_write_connect ffff8801955c0030 cseq=0 gseq=128 proto=24
[ 2428.842125] libceph:   messenger.c:1725 : try_write initiating connect on ffff8801955c0030 new state 26
[ 2428.851583] libceph:   messenger.c:1790 : try_write done on ffff8801955c0030 ret 0
[ 2428.859178] libceph:   messenger.c:162  : ceph_state_change ffff8801955c0030 state = 26 sk_state = 1
[ 2432.083976] libceph:   messenger.c:1708 : try_write start ffff8801955c0030 state 30 nref 1
[ 2432.092250] libceph:   messenger.c:1712 : try_write ffff8801955c0030 out_msg ffff88018f91d6c0 out_kvec_bytes 200
[ 2432.103263] libceph:   messenger.c:162  : ceph_state_change ffff8801955c0030 state = 30 sk_state = 8
[ 2432.119307] libceph:   messenger.c:162  : ceph_state_change ffff8801955c0030 state = 2078 sk_state = 7
[ 2432.149668] libceph:   messenger.c:1790 : try_write done on ffff8801955c0030 ret -32
[ 2432.157419] libceph: osd20 172.17.40.27:6800 connection failed
[ 2432.163266] libceph:   messenger.c:1997 : fault ffff8801955c0030 state 2078 to peer 172.17.40.27:6800
[ 2432.172487] libceph:   messenger.c:292  : con_close_socket on ffff8801955c0030 sock ffff8801825ecc00
[ 2432.181636] libceph:   messenger.c:162  : ceph_state_change ffff8801955c0030 state = 2078 sk_state = 7
[ 2432.204310] libceph:   messenger.c:2038 : fault failed to queue ffff8801955c0030 delay 1000, backoff
[ 2432.213462] libceph:  osd_client.c:725  : __reset_osd ffff8801955c0000 osd20
[ 2435.234571] libceph:   messenger.c:1945 : con_work ffff8801955c0030 backing off
[ 2435.241907] libceph:   messenger.c:1948 : con_work ffff8801955c0030 backoff 1000
[ 2439.455722] libceph:   messenger.c:1708 : try_write start ffff8801955c0030 state 30 nref 1
[ 2439.464052] libceph:   messenger.c:1712 : try_write ffff8801955c0030 out_msg ffff88018f91d6c0 out_kvec_bytes 0
[ 2439.474054] libceph:   messenger.c:665  : prepare_write_connect ffff8801955c0030 cseq=0 gseq=168 proto=24
[ 2439.483636] libceph:   messenger.c:1725 : try_write initiating connect on ffff8801955c0030 new state 26
[ 2439.493061] libceph:   messenger.c:1790 : try_write done on ffff8801955c0030 ret 0
[ 2439.500653] libceph:   messenger.c:162  : ceph_state_change ffff8801955c0030 state = 26 sk_state = 1
[ 2442.678598] libceph:   messenger.c:1708 : try_write start ffff8801955c0030 state 30 nref 1
[ 2442.686871] libceph:   messenger.c:1712 : try_write ffff8801955c0030 out_msg ffff88018f91d6c0 out_kvec_bytes 200
[ 2442.697121] libceph:   messenger.c:1790 : try_write done on ffff8801955c0030 ret 0
[ 2442.704825] libceph:   messenger.c:162  : ceph_state_change ffff8801955c0030 state = 30 sk_state = 8
[ 2442.720849] libceph:   messenger.c:162  : ceph_state_change ffff8801955c0030 state = 2078 sk_state = 7
[ 2444.582650] libceph: osd20 172.17.40.27:6800 connection failed
[ 2444.588700] libceph:   messenger.c:1997 : fault ffff8801955c0030 state 30 to peer 172.17.40.27:6800
[ 2444.597750] libceph:   messenger.c:292  : con_close_socket on ffff8801955c0030 sock ffff8801825a8900
[ 2444.606904] libceph:   messenger.c:162  : ceph_state_change ffff8801955c0030 state = 2078 sk_state = 7
[ 2444.629552] libceph:   messenger.c:2034 : fault queued ffff8801955c0030 delay 2000
[ 2444.637143] libceph:  osd_client.c:725  : __reset_osd ffff8801955c0000 osd20
[ 2449.541286] libceph:   messenger.c:1708 : try_write start ffff8801955c0030 state 30 nref 1
[ 2449.549601] libceph:   messenger.c:1712 : try_write ffff8801955c0030 out_msg ffff88018f91d6c0 out_kvec_bytes 0
[ 2449.569187] libceph:   messenger.c:1725 : try_write initiating connect on ffff8801955c0030 new state 26
[ 2449.578620] libceph:   messenger.c:1790 : try_write done on ffff8801955c0030 ret 0
[ 2449.586212] libceph:   messenger.c:162  : ceph_state_change ffff8801955c0030 state = 26 sk_state = 1
[ 2452.647790] libceph:   messenger.c:1708 : try_write start ffff8801955c0030 state 30 nref 1
[ 2452.656069] libceph:   messenger.c:1712 : try_write ffff8801955c0030 out_msg ffff88018f91d6c0 out_kvec_bytes 200
[ 2452.667104] libceph:   messenger.c:162  : ceph_state_change ffff8801955c0030 state = 30 sk_state = 8
[ 2452.683145] libceph:   messenger.c:162  : ceph_state_change ffff8801955c0030 state = 2078 sk_state = 7
[ 2452.713468] libceph:   messenger.c:1790 : try_write done on ffff8801955c0030 ret -32
[ 2452.721233] libceph: osd20 172.17.40.27:6800 connection failed
[ 2452.727079] libceph:   messenger.c:1997 : fault ffff8801955c0030 state 2078 to peer 172.17.40.27:6800
[ 2452.736314] libceph:   messenger.c:292  : con_close_socket on ffff8801955c0030 sock ffff8801825ec900
[ 2452.745447] libceph:   messenger.c:162  : ceph_state_change ffff8801955c0030 state = 2078 sk_state = 7
[ 2452.768128] libceph:   messenger.c:2038 : fault failed to queue ffff8801955c0030 delay 4000, backoff
[ 2452.777264] libceph:  osd_client.c:725  : __reset_osd ffff8801955c0000 osd20
[ 2455.493705] libceph:   messenger.c:1945 : con_work ffff8801955c0030 backing off
[ 2455.501031] libceph:   messenger.c:1948 : con_work ffff8801955c0030 backoff 4000
[ 2458.132179] libceph:   messenger.c:2230 : con_keepalive ffff8801955c0030
[ 2462.134769] libceph:   messenger.c:1708 : try_write start ffff8801955c0030 state 30 nref 1
[ 2462.143086] libceph:   messenger.c:1712 : try_write ffff8801955c0030 out_msg ffff88018f91d6c0 out_kvec_bytes 0
[ 2462.153099] libceph:   messenger.c:665  : prepare_write_connect ffff8801955c0030 cseq=0 gseq=242 proto=24
[ 2462.162670] libceph:   messenger.c:1725 : try_write initiating connect on ffff8801955c0030 new state 26
[ 2462.172102] libceph:   messenger.c:1790 : try_write done on ffff8801955c0030 ret 0
[ 2462.179695] libceph:   messenger.c:162  : ceph_state_change ffff8801955c0030 state = 26 sk_state = 1
[ 2464.284217] libceph:   messenger.c:1708 : try_write start ffff8801955c0030 state 30 nref 1
[ 2464.292489] libceph:   messenger.c:1712 : try_write ffff8801955c0030 out_msg ffff88018f91d6c0 out_kvec_bytes 200
[ 2464.303138] libceph:   messenger.c:1790 : try_write done on ffff8801955c0030 ret 0
[ 2464.310771] libceph:   messenger.c:162  : ceph_state_change ffff8801955c0030 state = 30 sk_state = 8
[ 2464.326802] libceph:   messenger.c:162  : ceph_state_change ffff8801955c0030 state = 2078 sk_state = 7
[ 2466.903694] libceph: osd20 172.17.40.27:6800 connection failed
[ 2466.909735] libceph:   messenger.c:1997 : fault ffff8801955c0030 state 30 to peer 172.17.40.27:6800

>> four more iterations of reconnecting, then sending bad message tag.

Signed-off-by: Jim Schutt <jaschut@sandia.gov>
---
 net/ceph/messenger.c |   11 +++++++++++
 1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index db12abc..3a88470 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -1713,6 +1713,17 @@ more:
 
 	/* open the socket first? */
 	if (con->sock == NULL) {
+		/* working on a message? put it back on out_queue. */
+		if (con->out_msg) {
+			struct ceph_msg *m = con->out_msg;
+			if (test_bit(LOSSYTX, &con->state)) {
+				list_add(&m->list_head, &con->out_queue);
+			} else {
+				list_move(&m->list_head, &con->out_queue);
+				ceph_msg_put(m);
+			}
+			con->out_msg = NULL;
+		}
 		prepare_write_banner(msgr, con);
 		prepare_write_connect(msgr, con, 1);
 		prepare_read_banner(con);
-- 
1.6.6




* Re: [PATCH 2/2] libceph: fix handle_timeout() racing with con_work()/try_write()
  2011-05-12 21:32   ` [PATCH 2/2] libceph: fix handle_timeout() racing with con_work()/try_write() Jim Schutt
  2011-05-16 16:57     ` [PATCH v2 0/1] " Jim Schutt
@ 2011-05-16 17:57     ` Sage Weil
  2011-05-16 19:06       ` Jim Schutt
  2011-05-17 22:32       ` Jim Schutt
  1 sibling, 2 replies; 17+ messages in thread
From: Sage Weil @ 2011-05-16 17:57 UTC (permalink / raw)
  To: Jim Schutt; +Cc: ceph-devel

Hi Jim,

Finally had some time to look at this closely.  This is great work 
nailing down the misbehavior.  Here's what confuses me:  

- con_work takes con->mutex and does (almost) everything under the 
protection of that lock.
- con_close and con_open both take it as well when they twiddle with state
- therefore, after the close/open are called by __reset_osd(), the next 
call into con_work should see the OPENING bit is set.

Of course, it doesn't.  See below...

> [ 2392.366919] libceph:  tid 447 timed out on osd20, will reset osd
> [ 2392.366922] libceph:  osd_client.c:725  : __reset_osd ffff8801955c0000 osd20
> [ 2392.366926] libceph:   messenger.c:349  : con_close ffff8801955c0030 peer 172.17.40.27:6800
> 
> >> A message times out, __reset_osd signals that connection should be closed; work
> >> queued to connection is discarded.
> 
> [ 2392.366934] libceph:   messenger.c:369  : con_open ffff8801955c0030 172.17.40.27:6800
> 
> >> __reset_osd signals that connection should be opened.
> 
> [ 2392.367237] libceph:   messenger.c:2159 : ----- ffff88018f91c600 to osd20 42=osd_op len 128+0+524288 -----
> [ 2392.367242] libceph:   messenger.c:2159 : ----- ffff88018f91d6c0 to osd20 42=osd_op len 128+0+524288 -----
> [ 2392.367247] libceph:   messenger.c:2159 : ----- ffff88018f920a80 to osd20 42=osd_op len 128+0+524288 -----
> [ 2392.367251] libceph:   messenger.c:2159 : ----- ffff88018f924d80 to osd20 42=osd_op len 128+0+524288 -----
> [ 2392.367256] libceph:   messenger.c:2159 : ----- ffff88018f926480 to osd20 42=osd_op len 128+0+524288 -----
> [ 2392.367260] libceph:   messenger.c:2159 : ----- ffff88018f92f240 to osd20 42=osd_op len 128+0+524288 -----
> [ 2392.367265] libceph:   messenger.c:2159 : ----- ffff88018f930d80 to osd20 42=osd_op len 128+0+524288 -----
> [ 2392.367269] libceph:   messenger.c:2159 : ----- ffff88018be14f00 to osd20 42=osd_op len 128+0+524288 -----
> [ 2392.367274] libceph:   messenger.c:2159 : ----- ffff88018be1b3c0 to osd20 42=osd_op len 128+0+524288 -----
> [ 2392.367279] libceph:   messenger.c:2159 : ----- ffff88018be210c0 to osd20 42=osd_op len 128+0+524288 -----
> [ 2392.367283] libceph:   messenger.c:2159 : ----- ffff88018d9c69c0 to osd20 42=osd_op len 128+0+524288 -----
> [ 2392.367288] libceph:   messenger.c:2159 : ----- ffff88018d85f9c0 to osd20 42=osd_op len 128+0+524288 -----
> [ 2392.367293] libceph:   messenger.c:2159 : ----- ffff88018daf2300 to osd20 42=osd_op len 128+0+524288 -----
> [ 2392.367297] libceph:   messenger.c:2159 : ----- ffff88018f8b6600 to osd20 42=osd_op len 128+0+524288 -----
> [ 2392.367302] libceph:   messenger.c:2159 : ----- ffff88018be28840 to osd20 42=osd_op len 128+0+524288 -----
> [ 2392.367307] libceph:   messenger.c:2159 : ----- ffff880154105240 to osd20 42=osd_op len 128+0+524288 -----
> [ 2392.367311] libceph:   messenger.c:2159 : ----- ffff88015410a300 to osd20 42=osd_op len 128+0+524288 -----
> [ 2392.367316] libceph:   messenger.c:2159 : ----- ffff88015410f0c0 to osd20 42=osd_op len 128+0+524288 -----
> [ 2392.367320] libceph:   messenger.c:2159 : ----- ffff880154116d80 to osd20 42=osd_op len 128+0+524288 -----
> [ 2392.367325] libceph:   messenger.c:2159 : ----- ffff8801541176c0 to osd20 42=osd_op len 128+0+524288 -----
> [ 2392.367329] libceph:   messenger.c:2159 : ----- ffff88015411bb40 to osd20 42=osd_op len 128+0+524288 -----
> 
> >> Outstanding work is requeued to connection.
> 
> [ 2397.376206] libceph:   messenger.c:2230 : con_keepalive ffff8801955c0030
> [ 2402.384140] libceph:   messenger.c:2230 : con_keepalive ffff8801955c0030
> [ 2407.395283] libceph:   messenger.c:2230 : con_keepalive ffff8801955c0030
> [ 2411.207412] libceph:   messenger.c:2177 : con_revoke ffff8801955c0030 msg ffff88018f91c600 - was on queue

Not sure why the first message was revoked... but that shouldn't matter...
 
> [ 2411.217092] libceph:   messenger.c:1708 : try_write start ffff8801955c0030 state 8220 nref 1
> [ 2411.225536] libceph:   messenger.c:1712 : try_write ffff8801955c0030 out_msg           (null) out_kvec_bytes 0
> [ 2411.235555] libceph:   messenger.c:509  : prepare_write_message ffff8801955c0030 msg ffff88018f91d6c0 seq 1 type 42 len 128+0+524288 128 pgs
> [ 2411.256874] libceph:   messenger.c:1712 : try_write ffff8801955c0030 out_msg ffff88018f91d6c0 out_kvec_bytes 191
> [ 2411.267280] libceph:   messenger.c:1790 : try_write done on ffff8801955c0030 ret 0

This picks up the second resent message.  But OPENING is set... so how
is it that we were in try_write without passing through the OPENING
test in con_work?

> >> con_work()/try_write() start again on new message.
> >> Note that con_work() did not notice that the connection was marked to
> >> be closed/reopened!  This is where things went bad.
> >> Note the connection state: 8220 = 0x201c, and 0x2000 is OPENING.
> 
> [ 2412.643178] libceph:   messenger.c:1969 : con_work ffff8801955c0030 OPENING
> [ 2412.643182] libceph:   messenger.c:292  : con_close_socket on ffff8801955c0030 sock ffff8801825a8c00
> [ 2412.643197] libceph:   messenger.c:162  : ceph_state_change ffff8801955c0030 state = 2076 sk_state = 4

Ok, now we see it, but this should have executed before we ever got to 
try_write!

There are a handful of places that drop and re-take con->mutex... I think 
it has to be one of those.  They are:

- prepare_connect_authorizer (before calling ->prepare_authorizer 
callback)
- fail_protocol (before calling ->bad_proto callback)
- process_connect RESETSESSION case (before calling ->reset callback)
- process_message (before calling ->dispatch callback)

I think after retaking the lock in each of these cases we need to test for 
the CLOSED or OPENING bits and, if set, bail out so that con_work can 
restart.

Does that make sense?  Can you confirm which case of lock droppage inside 
con_work is at fault?  (Maybe adding a printk at the top of con_work after 
taking the lock too would make it more clear what's going on; currently it 
only prints in all the random test_bit cases).

I'm thinking something like the below... does that seem reasonable?  I 
only did the prepare_authorizer case to start, but the others should be 
fixed up too.  Compile tested only.

BTW, I looked at the second patch too.  Messages should only be requeued 
for non-lossy connections, and that's done in ceph_fault by the 
list_splice call.  The current out_msg is placed on the out_sent list when 
we first start, so it is already included in that set.  I don't think any 
special casing is needed there.
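
(For reference, the requeue in question is roughly -- from memory, the
actual ceph_fault() code may differ slightly:

	/* requeue anything that hasn't been acked */
	list_splice_init(&con->out_sent, &con->out_queue);

i.e. the in-progress out_msg, having been moved to out_sent when we
started writing it, rides along with that splice.)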

Thanks for hunting this down!
sage


diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index e15a82c..544d52c 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -598,7 +598,7 @@ static void prepare_write_keepalive(struct ceph_connection *con)
  * Connection negotiation.
  */
 
-static void prepare_connect_authorizer(struct ceph_connection *con)
+static int prepare_connect_authorizer(struct ceph_connection *con)
 {
 	void *auth_buf;
 	int auth_len = 0;
@@ -612,6 +612,10 @@ static void prepare_connect_authorizer(struct ceph_connection *con)
 					 con->auth_retry);
 	mutex_lock(&con->mutex);
 
+	if (test_bit(CLOSED, &con->state) ||
+	    test_bit(OPENING, &con->state))
+		return -EAGAIN;
+
 	con->out_connect.authorizer_protocol = cpu_to_le32(auth_protocol);
 	con->out_connect.authorizer_len = cpu_to_le32(auth_len);
 
@@ -619,6 +623,7 @@ static void prepare_connect_authorizer(struct ceph_connection *con)
 	con->out_kvec[con->out_kvec_left].iov_len = auth_len;
 	con->out_kvec_left++;
 	con->out_kvec_bytes += auth_len;
+	return 0;
 }
 
 /*
@@ -640,9 +645,9 @@ static void prepare_write_banner(struct ceph_messenger *msgr,
 	set_bit(WRITE_PENDING, &con->state);
 }
 
-static void prepare_write_connect(struct ceph_messenger *msgr,
-				  struct ceph_connection *con,
-				  int after_banner)
+static int prepare_write_connect(struct ceph_messenger *msgr,
+				 struct ceph_connection *con,
+				 int after_banner)
 {
 	unsigned global_seq = get_global_seq(con->msgr, 0);
 	int proto;
@@ -683,7 +688,7 @@ static void prepare_write_connect(struct ceph_messenger *msgr,
 	con->out_more = 0;
 	set_bit(WRITE_PENDING, &con->state);
 
-	prepare_connect_authorizer(con);
+	return prepare_connect_authorizer(con);
 }
 
 
@@ -1216,6 +1221,7 @@ static int process_connect(struct ceph_connection *con)
 	u64 sup_feat = con->msgr->supported_features;
 	u64 req_feat = con->msgr->required_features;
 	u64 server_feat = le64_to_cpu(con->in_reply.features);
+	int r;
 
 	dout("process_connect on %p tag %d\n", con, (int)con->in_tag);
 
@@ -1250,7 +1256,9 @@ static int process_connect(struct ceph_connection *con)
 			return -1;
 		}
 		con->auth_retry = 1;
-		prepare_write_connect(con->msgr, con, 0);
+		r = prepare_write_connect(con->msgr, con, 0);
+		if (r < 0)
+			return r;
 		prepare_read_connect(con);
 		break;
 
@@ -1938,6 +1946,7 @@ static void con_work(struct work_struct *work)
 {
 	struct ceph_connection *con = container_of(work, struct ceph_connection,
 						   work.work);
+	int ret;
 
 	mutex_lock(&con->mutex);
 	if (test_and_clear_bit(BACKOFF, &con->state)) {
@@ -1958,6 +1967,7 @@ static void con_work(struct work_struct *work)
 		dout("con_work %p STANDBY\n", con);
 		goto done;
 	}
+restart:
 	if (test_bit(CLOSED, &con->state)) { /* e.g. if we are replaced */
 		dout("con_work CLOSED\n");
 		con_close_socket(con);
@@ -1969,18 +1979,32 @@ static void con_work(struct work_struct *work)
 		con_close_socket(con);
 	}
 
-	if (test_and_clear_bit(SOCK_CLOSED, &con->state) ||
-	    try_read(con) < 0 ||
-	    try_write(con) < 0) {
-		mutex_unlock(&con->mutex);
-		ceph_fault(con);     /* error/fault path */
-		goto done_unlocked;
-	}
+	if (test_and_clear_bit(SOCK_CLOSED, &con->state))
+		goto fault;
+
+	ret = try_read(con);
+	if (ret == -EAGAIN)
+		goto restart;
+	if (ret < 0)
+		goto fault;
+
+	ret = try_write(con);
+	if (ret == -EAGAIN)
+		goto restart;
+	if (ret < 0)
+		goto fault;
 
 done:
 	mutex_unlock(&con->mutex);
 done_unlocked:
 	con->ops->put(con);
+	return;
+
+
+fault:
+	mutex_unlock(&con->mutex);
+	ceph_fault(con);
+	goto done_unlocked;	
 }
 
 


* Re: [PATCH 2/2] libceph: fix handle_timeout() racing with con_work()/try_write()
  2011-05-16 17:57     ` [PATCH 2/2] " Sage Weil
@ 2011-05-16 19:06       ` Jim Schutt
  2011-05-17 22:32       ` Jim Schutt
  1 sibling, 0 replies; 17+ messages in thread
From: Jim Schutt @ 2011-05-16 19:06 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Hi Sage,

Sage Weil wrote:
> Hi Jim,
> 
> Finally had some time to look at this closely.  This is great work 
> nailing down the misbehavior.  Here's what confuses me:  
> 
> - con_work takes con->mutex and does (almost) everything under the 
> protection of that lock.
> - con_close and con_open both take it as well when they twiddle with state
> - therefore, after the close/open are called by __reset_osd(), the next 
> call into con_work should see the OPENING bit is set.
> 
> Of course, it doesn't.  See below...
> 
>> [ 2392.366919] libceph:  tid 447 timed out on osd20, will reset osd
>> [ 2392.366922] libceph:  osd_client.c:725  : __reset_osd ffff8801955c0000 osd20
>> [ 2392.366926] libceph:   messenger.c:349  : con_close ffff8801955c0030 peer 172.17.40.27:6800
>>
>>>> A message times out, __reset_osd signals that connection should be closed; work
>>>> queued to connection is discarded.
>> [ 2392.366934] libceph:   messenger.c:369  : con_open ffff8801955c0030 172.17.40.27:6800
>>
>>>> __reset_osd signals that connection should be opened.
>> [ 2392.367237] libceph:   messenger.c:2159 : ----- ffff88018f91c600 to osd20 42=osd_op len 128+0+524288 -----
>> [ 2392.367242] libceph:   messenger.c:2159 : ----- ffff88018f91d6c0 to osd20 42=osd_op len 128+0+524288 -----
>> [ 2392.367247] libceph:   messenger.c:2159 : ----- ffff88018f920a80 to osd20 42=osd_op len 128+0+524288 -----
>> [ 2392.367251] libceph:   messenger.c:2159 : ----- ffff88018f924d80 to osd20 42=osd_op len 128+0+524288 -----
>> [ 2392.367256] libceph:   messenger.c:2159 : ----- ffff88018f926480 to osd20 42=osd_op len 128+0+524288 -----
>> [ 2392.367260] libceph:   messenger.c:2159 : ----- ffff88018f92f240 to osd20 42=osd_op len 128+0+524288 -----
>> [ 2392.367265] libceph:   messenger.c:2159 : ----- ffff88018f930d80 to osd20 42=osd_op len 128+0+524288 -----
>> [ 2392.367269] libceph:   messenger.c:2159 : ----- ffff88018be14f00 to osd20 42=osd_op len 128+0+524288 -----
>> [ 2392.367274] libceph:   messenger.c:2159 : ----- ffff88018be1b3c0 to osd20 42=osd_op len 128+0+524288 -----
>> [ 2392.367279] libceph:   messenger.c:2159 : ----- ffff88018be210c0 to osd20 42=osd_op len 128+0+524288 -----
>> [ 2392.367283] libceph:   messenger.c:2159 : ----- ffff88018d9c69c0 to osd20 42=osd_op len 128+0+524288 -----
>> [ 2392.367288] libceph:   messenger.c:2159 : ----- ffff88018d85f9c0 to osd20 42=osd_op len 128+0+524288 -----
>> [ 2392.367293] libceph:   messenger.c:2159 : ----- ffff88018daf2300 to osd20 42=osd_op len 128+0+524288 -----
>> [ 2392.367297] libceph:   messenger.c:2159 : ----- ffff88018f8b6600 to osd20 42=osd_op len 128+0+524288 -----
>> [ 2392.367302] libceph:   messenger.c:2159 : ----- ffff88018be28840 to osd20 42=osd_op len 128+0+524288 -----
>> [ 2392.367307] libceph:   messenger.c:2159 : ----- ffff880154105240 to osd20 42=osd_op len 128+0+524288 -----
>> [ 2392.367311] libceph:   messenger.c:2159 : ----- ffff88015410a300 to osd20 42=osd_op len 128+0+524288 -----
>> [ 2392.367316] libceph:   messenger.c:2159 : ----- ffff88015410f0c0 to osd20 42=osd_op len 128+0+524288 -----
>> [ 2392.367320] libceph:   messenger.c:2159 : ----- ffff880154116d80 to osd20 42=osd_op len 128+0+524288 -----
>> [ 2392.367325] libceph:   messenger.c:2159 : ----- ffff8801541176c0 to osd20 42=osd_op len 128+0+524288 -----
>> [ 2392.367329] libceph:   messenger.c:2159 : ----- ffff88015411bb40 to osd20 42=osd_op len 128+0+524288 -----
>>
>>>> Outstanding work is requeued to connection.
>> [ 2397.376206] libceph:   messenger.c:2230 : con_keepalive ffff8801955c0030
>> [ 2402.384140] libceph:   messenger.c:2230 : con_keepalive ffff8801955c0030
>> [ 2407.395283] libceph:   messenger.c:2230 : con_keepalive ffff8801955c0030
>> [ 2411.207412] libceph:   messenger.c:2177 : con_revoke ffff8801955c0030 msg ffff88018f91c600 - was on queue
> 
> Not sure why the first message was revoked... but that shouldn't matter...
>  
>> [ 2411.217092] libceph:   messenger.c:1708 : try_write start ffff8801955c0030 state 8220 nref 1
>> [ 2411.225536] libceph:   messenger.c:1712 : try_write ffff8801955c0030 out_msg           (null) out_kvec_bytes 0
>> [ 2411.235555] libceph:   messenger.c:509  : prepare_write_message ffff8801955c0030 msg ffff88018f91d6c0 seq 1 type 42 len 128+0+524288 128 pgs
>> [ 2411.256874] libceph:   messenger.c:1712 : try_write ffff8801955c0030 out_msg ffff88018f91d6c0 out_kvec_bytes 191
>> [ 2411.267280] libceph:   messenger.c:1790 : try_write done on ffff8801955c0030 ret 0
> 
> This picks up the second resent message.  But and OPENING is set.. so how 
> is it that we were in try_write but we didn't pass through the OPENING 
> test in con_work?

The only way I can see this happening is if this instance
of con_work passed through that test before OPENING was set.
That's what made me think "race".

But look at the timestamps: 18.85 seconds pass between
con_open and try_write!!

I don't see how that could happen unless the instance of
con_work that called try_write above was stuck in try_read
for most of that 18.85 seconds.  I'll check that out.

Also, AFAICS neither ceph_con_open nor ceph_con_close takes
con->mutex while twiddling bits in con->state.  But here's a
fun fact I didn't mention earlier since it didn't help: putting
everything in both ceph_con_open and ceph_con_close, except the
call to queue_con(con), under con->mutex didn't help at all.
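
Roughly, what I tried was this for ceph_con_open() (a sketch -- the
existing body is elided and the signature is quoted from memory;
ceph_con_close() got the same treatment):

	void ceph_con_open(struct ceph_connection *con, struct ceph_entity_addr *addr)
	{
		mutex_lock(&con->mutex);
		/* existing body: set OPENING, record the peer address, etc. */
		mutex_unlock(&con->mutex);
		queue_con(con);		/* deliberately left outside the mutex */
	}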

So it must be something else.

> 
>>>> con_work()/try_write() start again on new message.
>>>> Note that con_work() did not notice that the connection was marked to
>>>> be closed/reopened!  This is where things went bad.
>>>> Note the connection state: 8220 = 0x201c, and 0x2000 is OPENING.
>> [ 2412.643178] libceph:   messenger.c:1969 : con_work ffff8801955c0030 OPENING
>> [ 2412.643182] libceph:   messenger.c:292  : con_close_socket on ffff8801955c0030 sock ffff8801825a8c00
>> [ 2412.643197] libceph:   messenger.c:162  : ceph_state_change ffff8801955c0030 state = 2076 sk_state = 4
> 
> Ok, now we see it, but this should have executed before we ever got to 
> try_write!
> 
> There are a handful of places that drop and re-take con->mutex... I think 
> it has to be one of those. 

OK, I will look at them.

> They are:
> 
> - prepare_connect_authorizer (before calling ->prepare_authorizer 
> callback)
> - fail_protocol (before calling ->bad_proto callback)
> - process_connect RESETSESSION case (before calling ->reset callback)
> - process_message (before calling ->dispatch callback)
> 
> I think after retaking the lock in each of these cases we need to test for 
> the CLOSED or OPENING bits and, if set, bail out so that con_work can 
> restart.
> 
> Does that make sense?  Can you confirm which case of lock droppage inside 
> con_work is at fault?  (Maybe adding a printk at the top of con_work after 
> taking the lock too would make it more clear what's going on; currently it 
> only prints in all the random test_bit cases).
> 
> I'm thinking something like the below... does that seem reasonable?  I 
> only did the prepare_authorizer case to start, but the others should be 
> fixed up too.  Compile tested only.

OK, I'll fix them all up and let you know what I learn.

> 
> BTW, I looked at the second patch too.  Messages should only be requeued 
> for non-lossy connections, and that's done in ceph_fault by the 
> list_splice call.  The current out_msg is placed on the out_sent list when 
> we first start, so it is already included in that set.  I don't think any 
> special casing is needed there.

OK, thanks; there are lots of details I haven't fully understood yet.

-- Jim


> 
> Thanks for hunting this down!
> sage
> 



* Re: [PATCH 2/2] libceph: fix handle_timeout() racing with con_work()/try_write()
  2011-05-16 17:57     ` [PATCH 2/2] " Sage Weil
  2011-05-16 19:06       ` Jim Schutt
@ 2011-05-17 22:32       ` Jim Schutt
  2011-05-17 23:27         ` Sage Weil
  1 sibling, 1 reply; 17+ messages in thread
From: Jim Schutt @ 2011-05-17 22:32 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Hi Sage,

Sage Weil wrote:
> Hi Jim,

> This picks up the second resent message.  But and OPENING is set.. so how 
> is it that we were in try_write but we didn't pass through the OPENING 
> test in con_work?
> 
>>>> con_work()/try_write() start again on new message.
>>>> Note that con_work() did not notice that the connection was marked to
>>>> be closed/reopened!  This is where things went bad.
>>>> Note the connection state: 8220 = 0x201c, and 0x2000 is OPENING.
>> [ 2412.643178] libceph:   messenger.c:1969 : con_work ffff8801955c0030 OPENING
>> [ 2412.643182] libceph:   messenger.c:292  : con_close_socket on ffff8801955c0030 sock ffff8801825a8c00
>> [ 2412.643197] libceph:   messenger.c:162  : ceph_state_change ffff8801955c0030 state = 2076 sk_state = 4
> 
> Ok, now we see it, but this should have executed before we ever got to 
> try_write!
> 
> There are a handful of places that drop and re-take con->mutex... I think 
> it has to be one of those.  They are:
> 
> - prepare_connect_authorizer (before calling ->prepare_authorizer 
> callback)
> - fail_protocol (before calling ->bad_proto callback)
> - process_connect RESETSESSION case (before calling ->reset callback)
> - process_message (before calling ->dispatch callback)
> 
> I think after retaking the lock in each of these cases we need to test for 
> the CLOSED or OPENING bits and, if set, bail out so that con_work can 
> restart.
> 
> Does that make sense?  Can you confirm which case of lock droppage inside 
> con_work is at fault?  (Maybe adding a printk at the top of con_work after 
> taking the lock too would make it more clear what's going on; currently it 
> only prints in all the random test_bit cases).

So I've got logs now that show the connection state changing
across a call to process_message.  That seems to prove
the concept.

> 
> I'm thinking something like the below... does that seem reasonable?  I 
> only did the prepare_authorizer case to start, but the others should be 
> fixed up too.  Compile tested only.
> 

Is there any reason not to fix all the cases at once, in try_read?
It's less code, but far away from where the mutex dropping occurs,
so we add a comment:

diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index a9bd424..11d6a61 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -1816,10 +1820,19 @@ static int try_read(struct ceph_connection *con)
  	dout("try_read start on %p state %lu\n", con, con->state);

  more:
  	dout("try_read %p state %lu tag %d in_base_pos %d\n", con,
  	     con->state, (int)con->in_tag, con->in_base_pos);
+
+	/* process_connect, process_message need to drop con->mutex,
+	 * so connection state might have changed on us.  Check it,
+	 * leave if necessary so new state can be processed.
+	 */
+	if (test_bit(CLOSED, &con->state) ||
+	    test_bit(OPENING, &con->state))
+		goto out;
+
  	if (test_bit(CONNECTING, &con->state)) {
  		if (!test_bit(NEGOTIATING, &con->state)) {
  			dout("try_read %p connecting\n", con);
  			ret = read_partial_banner(con);
  			if (ret <= 0)


What am I missing?

-- Jim




* Re: [PATCH 2/2] libceph: fix handle_timeout() racing with con_work()/try_write()
  2011-05-17 22:32       ` Jim Schutt
@ 2011-05-17 23:27         ` Sage Weil
  2011-05-17 23:38           ` Sage Weil
  0 siblings, 1 reply; 17+ messages in thread
From: Sage Weil @ 2011-05-17 23:27 UTC (permalink / raw)
  To: Jim Schutt; +Cc: ceph-devel

On Tue, 17 May 2011, Jim Schutt wrote:
> Hi Sage,
> 
> Sage Weil wrote:
> > Hi Jim,
> 
> > This picks up the second resent message.  But and OPENING is set.. so how is
> > it that we were in try_write but we didn't pass through the OPENING test in
> > con_work?
> > 
> > > > > con_work()/try_write() start again on new message.
> > > > > Note that con_work() did not notice that the connection was marked to
> > > > > be closed/reopened!  This is where things went bad.
> > > > > Note the connection state: 8220 = 0x201c, and 0x2000 is OPENING.
> > > [ 2412.643178] libceph:   messenger.c:1969 : con_work ffff8801955c0030
> > > OPENING
> > > [ 2412.643182] libceph:   messenger.c:292  : con_close_socket on
> > > ffff8801955c0030 sock ffff8801825a8c00
> > > [ 2412.643197] libceph:   messenger.c:162  : ceph_state_change
> > > ffff8801955c0030 state = 2076 sk_state = 4
> > 
> > Ok, now we see it, but this should have executed before we ever got to
> > try_write!
> > 
> > There are a handful of places that drop and re-take con->mutex... I think it
> > has to be one of those.  They are:
> > 
> > - prepare_connect_authorizer (before calling ->prepare_authorizer callback)
> > - fail_protocol (before calling ->bad_proto callback)
> > - process_connect RESETSESSION case (before calling ->reset callback)
> > - process_message (before calling ->dispatch callback)
> > 
> > I think after retaking the lock in each of these cases we need to test for
> > the CLOSED or OPENING bits and, if set, bail out so that con_work can
> > restart.
> > 
> > Does that make sense?  Can you confirm which case of lock droppage inside
> > con_work is at fault?  (Maybe adding a printk at the top of con_work after
> > taking the lock too would make it more clear what's going on; currently it
> > only prints in all the random test_bit cases).
> 
> So I've got logs now that show that connection state changes
> across a call to process_message.  That seems like it proves
> the concept.
> 
> > 
> > I'm thinking something like the below... does that seem reasonable?  I only
> > did the prepare_authorizer case to start, but the others should be fixed up
> > too.  Compile tested only.
> > 
> 
> Is there any reason not to fix all the cases at once, in try_read?
> It's less code, but far away from where the mutex dropping occurs,
> so we add a comment:

That covers the process_message and process_connect cases, provided you 
also set ret properly, and do the con_close_socket work that the con_work 
checks would have done.  e.g.,

diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index e15a82c..7ef9eb0 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -1810,6 +1810,21 @@ static int try_read(struct ceph_connection *con)
 more:
        dout("try_read tag %d in_base_pos %d\n", (int)con->in_tag,
             con->in_base_pos);
+
+       /*
+        * process_connect and process_message drop and re-take
+        * con->mutex.  make sure we handle a racing close or reopen.
+        */
+       if (test_bit(CLOSED, &con->state)) {
+               ret = -EAGAIN;
+               goto out;
+       }
+       if (test_bit(OPENING, &con->state)) {
+               con_close_socket(con);
+               ret = 0;  /* so we continue to try_write */
+               goto out;
+       }
+
        if (test_bit(CONNECTING, &con->state)) {
                if (!test_bit(NEGOTIATING, &con->state)) {
                        dout("try_read connecting\n");

It's a little bit unwieldy because we repeat the checks at the top of 
con_work, though.  It is probably cleaner to always fall back to that case 
by doing CLOSED|OPENING bit checks and bailing with -EAGAIN any time we 
retake con->mutex, and then checking try_read/write return values like the 
patch from yesterday does.  That way we only react to the bits in one 
place.
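
I.e. something like this at each spot where con->mutex is re-taken
(sketch only):

	/* after re-taking con->mutex around a callback */
	if (test_bit(CLOSED, &con->state) ||
	    test_bit(OPENING, &con->state))
		return -EAGAIN;	/* let con_work restart and sort it out */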

Have you verified that this fixes the problem for you?  It makes sense the 
race is over ->dispatch, since that is a somewhat long-running call.
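
(The window in question is the mutex drop around ->dispatch in
process_message(), roughly:

	mutex_unlock(&con->mutex);
	con->ops->dispatch(con, msg);
	mutex_lock(&con->mutex);

which is exactly where handle_timeout()/__reset_osd() can close and
reopen the connection underneath con_work.)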

sage



> 
> diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
> index a9bd424..11d6a61 100644
> --- a/net/ceph/messenger.c
> +++ b/net/ceph/messenger.c
> @@ -1816,10 +1820,19 @@ static int try_read(struct ceph_connection *con)
>  	dout("try_read start on %p state %lu\n", con, con->state);
> 
>  more:
>  	dout("try_read %p state %lu tag %d in_base_pos %d\n", con,
>  	     con->state, (int)con->in_tag, con->in_base_pos);
> +
> +	/* process_connect, process_message need to drop con->mutex,
> +	 * so connection state might have changed on us.  Check it,
> +	 * leave if necessary so new state can be processed.
> +	 */
> +	if (test_bit(CLOSED, &con->state) ||
> +	    test_bit(OPENING, &con->state))
> +		goto out;
> +
>  	if (test_bit(CONNECTING, &con->state)) {
>  		if (!test_bit(NEGOTIATING, &con->state)) {
>  			dout("try_read %p connecting\n", con);
>  			ret = read_partial_banner(con);
>  			if (ret <= 0)
> 
> 
> What am I missing?
> 
> -- Jim
> 
> 


* Re: [PATCH 2/2] libceph: fix handle_timeout() racing with con_work()/try_write()
  2011-05-17 23:27         ` Sage Weil
@ 2011-05-17 23:38           ` Sage Weil
  2011-05-18 14:34             ` Jim Schutt
  2011-05-18 20:27             ` Jim Schutt
  0 siblings, 2 replies; 17+ messages in thread
From: Sage Weil @ 2011-05-17 23:38 UTC (permalink / raw)
  To: Jim Schutt; +Cc: ceph-devel

On Tue, 17 May 2011, Sage Weil wrote:
> On Tue, 17 May 2011, Jim Schutt wrote:
> > Hi Sage,
> > 
> > Sage Weil wrote:
> > > Hi Jim,
> > 
> > > This picks up the second resent message.  But and OPENING is set.. so how is
> > > it that we were in try_write but we didn't pass through the OPENING test in
> > > con_work?
> > > 
> > > > > > con_work()/try_write() start again on new message.
> > > > > > Note that con_work() did not notice that the connection was marked to
> > > > > > be closed/reopened!  This is where things went bad.
> > > > > > Note the connection state: 8220 = 0x201c, and 0x2000 is OPENING.
> > > > [ 2412.643178] libceph:   messenger.c:1969 : con_work ffff8801955c0030
> > > > OPENING
> > > > [ 2412.643182] libceph:   messenger.c:292  : con_close_socket on
> > > > ffff8801955c0030 sock ffff8801825a8c00
> > > > [ 2412.643197] libceph:   messenger.c:162  : ceph_state_change
> > > > ffff8801955c0030 state = 2076 sk_state = 4
> > > 
> > > Ok, now we see it, but this should have executed before we ever got to
> > > try_write!
> > > 
> > > There are a handful of places that drop and re-take con->mutex... I think it
> > > has to be one of those.  They are:
> > > 
> > > - prepare_connect_authorizer (before calling ->prepare_authorizer callback)
> > > - fail_protocol (before calling ->bad_proto callback)
> > > - process_connect RESETSESSION case (before calling ->reset callback)
> > > - process_message (before calling ->dispatch callback)
> > > 
> > > I think after retaking the lock in each of these cases we need to test for
> > > the CLOSED or OPENING bits and, if set, bail out so that con_work can
> > > restart.
> > > 
> > > Does that make sense?  Can you confirm which case of lock droppage inside
> > > con_work is at fault?  (Maybe adding a printk at the top of con_work after
> > > taking the lock too would make it more clear what's going on; currently it
> > > only prints in all the random test_bit cases).
> > 
> > So I've got logs now that show that connection state changes
> > across a call to process_message.  That seems like it proves
> > the concept.
> > 
> > > 
> > > I'm thinking something like the below... does that seem reasonable?  I only
> > > did the prepare_authorizer case to start, but the others should be fixed up
> > > too.  Compile tested only.
> > > 
> > 
> > Is there any reason not to fix all the cases at once, in try_read?
> > It's less code, but far away from where the mutex dropping occurs,
> > so we add a comment:
> 
> That covers the process_message and process_connect cases, provided you 
> also set ret properly, and do the con_close_socket work that the con_work 
> checks would have done.  e.g.,

I pushed a patch to the msgr_race branch that catches all four cases (I 
think).  Does the fix make sense given what you saw?

Thanks!
sage



> 
> diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
> index e15a82c..7ef9eb0 100644
> --- a/net/ceph/messenger.c
> +++ b/net/ceph/messenger.c
> @@ -1810,6 +1810,21 @@ static int try_read(struct ceph_connection *con)
>  more:
>         dout("try_read tag %d in_base_pos %d\n", (int)con->in_tag,
>              con->in_base_pos);
> +
> +       /*
> +        * process_connect and process_message drop and re-take
> +        * con->mutex.  make sure we handle a racing close or reopen.
> +        */
> +       if (test_bit(CLOSED, &con->state)) {
> +               ret = -EAGAIN;
> +               goto out;
> +       }
> +       if (test_bit(OPENING, &con->state)) {
> +               con_close_socket(con);
> +               ret = 0;  /* so we continue to try_write */
> +               goto out;
> +       }
> +
>         if (test_bit(CONNECTING, &con->state)) {
>                 if (!test_bit(NEGOTIATING, &con->state)) {
>                         dout("try_read connecting\n");
> 
> It's a little bit unweildly because we repeat the checks at the top of 
> con_work, though.  It is probably cleaner to always fall back to that case 
> by doing CLOSED|OPENING bit checks and bailing with -EAGAIN any time we 
> retake con->mutex, and then checking try_read/write return values like the 
> patch from yesterday does.  That way we only react to the bits in one 
> place.
> 
> Have you verified that this fixes the problem for you?  It makes sense the 
> race is over ->dispatch, since that is a somewhat long-running call.
> 
> sage
> 
> 
> 
> > 
> > diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
> > index a9bd424..11d6a61 100644
> > --- a/net/ceph/messenger.c
> > +++ b/net/ceph/messenger.c
> > @@ -1816,10 +1820,19 @@ static int try_read(struct ceph_connection *con)
> >  	dout("try_read start on %p state %lu\n", con, con->state);
> > 
> >  more:
> >  	dout("try_read %p state %lu tag %d in_base_pos %d\n", con,
> >  	     con->state, (int)con->in_tag, con->in_base_pos);
> > +
> > +	/* process_connect, process_message need to drop con->mutex,
> > +	 * so connection state might have changed on us.  Check it,
> > +	 * leave if necessary so new state can be processed.
> > +	 */
> > +	if (test_bit(CLOSED, &con->state) ||
> > +	    test_bit(OPENING, &con->state))
> > +		goto out;
> > +
> >  	if (test_bit(CONNECTING, &con->state)) {
> >  		if (!test_bit(NEGOTIATING, &con->state)) {
> >  			dout("try_read %p connecting\n", con);
> >  			ret = read_partial_banner(con);
> >  			if (ret <= 0)
> > 
> > 
> > What am I missing?
> > 
> > -- Jim


* Re: [PATCH 2/2] libceph: fix handle_timeout() racing with con_work()/try_write()
  2011-05-17 23:38           ` Sage Weil
@ 2011-05-18 14:34             ` Jim Schutt
  2011-05-18 20:27             ` Jim Schutt
  1 sibling, 0 replies; 17+ messages in thread
From: Jim Schutt @ 2011-05-18 14:34 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Sage Weil wrote:
> On Tue, 17 May 2011, Sage Weil wrote:
>> On Tue, 17 May 2011, Jim Schutt wrote:
>>> Hi Sage,
>>>
>>> Sage Weil wrote:
>>>> Hi Jim,
>>>> This picks up the second resent message.  But OPENING is set... so how is
>>>> it that we were in try_write but we didn't pass through the OPENING test in
>>>> con_work?
>>>>
>>>>>>> con_work()/try_write() start again on new message.
>>>>>>> Note that con_work() did not notice that the connection was marked to
>>>>>>> be closed/reopened!  This is where things went bad.
>>>>>>> Note the connection state: 8220 = 0x201c, and 0x2000 is OPENING.
>>>>> [ 2412.643178] libceph:   messenger.c:1969 : con_work ffff8801955c0030
>>>>> OPENING
>>>>> [ 2412.643182] libceph:   messenger.c:292  : con_close_socket on
>>>>> ffff8801955c0030 sock ffff8801825a8c00
>>>>> [ 2412.643197] libceph:   messenger.c:162  : ceph_state_change
>>>>> ffff8801955c0030 state = 2076 sk_state = 4
>>>> Ok, now we see it, but this should have executed before we ever got to
>>>> try_write!
>>>>
>>>> There are a handful of places that drop and re-take con->mutex... I think it
>>>> has to be one of those.  They are:
>>>>
>>>> - prepare_connect_authorizer (before calling ->prepare_authorizer callback)
>>>> - fail_protocol (before calling ->bad_proto callback)
>>>> - process_connect RESETSESSION case (before calling ->reset callback)
>>>> - process_message (before calling ->dispatch callback)
>>>>
>>>> I think after retaking the lock in each of these cases we need to test for
>>>> the CLOSED or OPENING bits and, if set, bail out so that con_work can
>>>> restart.
>>>>
>>>> Does that make sense?  Can you confirm which case of lock droppage inside
>>>> con_work is at fault?  (Maybe adding a printk at the top of con_work after
>>>> taking the lock too would make it more clear what's going on; currently it
>>>> only prints in all the random test_bit cases).
>>> So I've got logs now that show that connection state changes
>>> across a call to process_message.  That seems like it proves
>>> the concept.
>>>
>>>> I'm thinking something like the below... does that seem reasonable?  I only
>>>> did the prepare_authorizer case to start, but the others should be fixed up
>>>> too.  Compile tested only.
>>>>
>>> Is there any reason not to fix all the cases at once, in try_read?
>>> It's less code, but far away from where the mutex dropping occurs,
>>> so we add a comment:
>> That covers the process_message and process_connect cases, provided you 
>> also set ret properly, and do the con_close_socket work that the con_work 
>> checks would have done.  e.g.,
> 
> I pushed a patch to the msgr_race branch that catches all four cases (I 
> think).  Does the fix make sense given what you saw?

Yes, I think so.

In every case I logged I saw prepare_write_message getting
called at the wrong time - I finally understand this happened
because reset_connection was able to run while con->mutex was
dropped, resetting con->out_msg at the wrong time.
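
To make the window concrete, here's a tiny userspace model of that
kind of race (toy code only, nothing libceph about it beyond the
names): one thread drops its lock around a long-running callback,
roughly the way process_message drops con->mutex around ->dispatch,
another thread resets the shared "out_msg" in that window, roughly
the way reset_connection can, and the first thread then carries on
with state that changed underneath it:

/* toy model; build with: cc -pthread race_model.c */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

struct toy_con {
	pthread_mutex_t mutex;
	const char *out_msg;		/* stands in for con->out_msg */
};

static struct toy_con con = {
	.mutex = PTHREAD_MUTEX_INITIALIZER,
	.out_msg = "msg A, half sent",
};

static void *dispatch_thread(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&con.mutex);
	pthread_mutex_unlock(&con.mutex);	/* drop the lock for the callback */
	sleep(2);				/* long-running "dispatch" */
	pthread_mutex_lock(&con.mutex);
	/* no re-check of the connection state here, so we keep going
	 * as if nothing changed while the lock was dropped */
	printf("writer continues with out_msg = \"%s\"\n", con.out_msg);
	pthread_mutex_unlock(&con.mutex);
	return NULL;
}

static void *reset_thread(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&con.mutex);		/* sneaks in while the lock is dropped */
	con.out_msg = "msg A, requeued from the start";
	printf("reset ran while the other thread had dropped the lock\n");
	pthread_mutex_unlock(&con.mutex);
	return NULL;
}

int main(void)
{
	pthread_t d, r;

	pthread_create(&d, NULL, dispatch_thread, NULL);
	sleep(1);	/* let the first thread drop the lock */
	pthread_create(&r, NULL, reset_thread, NULL);
	pthread_join(d, NULL);
	pthread_join(r, NULL);
	return 0;
}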

I'll test your fix out today and let you know if I see
any more of this.

-- Jim

> 
> Thanks!
> sage
> 



* Re: [PATCH 2/2] libceph: fix handle_timeout() racing with con_work()/try_write()
  2011-05-17 23:38           ` Sage Weil
  2011-05-18 14:34             ` Jim Schutt
@ 2011-05-18 20:27             ` Jim Schutt
  2011-05-18 23:36               ` Sage Weil
  1 sibling, 1 reply; 17+ messages in thread
From: Jim Schutt @ 2011-05-18 20:27 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Sage Weil wrote:

> 
> I pushed a patch to the msgr_race branch that catches all four cases (I 
> think).  Does the fix make sense given what you saw?

Sorry, I haven't completed much testing; it took me a while
to figure out the fix needs this, so con_work() doesn't fall
through from the done_unlocked path into the fault handling:

diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 9c0a9bd..b140dd3 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -2013,6 +2013,7 @@ done:
  	mutex_unlock(&con->mutex);
  done_unlocked:
  	con->ops->put(con);
+	return;

  fault:
  	mutex_unlock(&con->mutex);


Still testing....

Thanks -- Jim

> 
> Thanks!
> sage
> 



* Re: [PATCH 2/2] libceph: fix handle_timeout() racing with con_work()/try_write()
  2011-05-18 20:27             ` Jim Schutt
@ 2011-05-18 23:36               ` Sage Weil
  2011-05-19 17:31                 ` Jim Schutt
  0 siblings, 1 reply; 17+ messages in thread
From: Sage Weil @ 2011-05-18 23:36 UTC (permalink / raw)
  To: Jim Schutt; +Cc: ceph-devel

On Wed, 18 May 2011, Jim Schutt wrote:
> Sage Weil wrote:
> 
> > 
> > I pushed a patch to the msgr_race branch that catches all four cases (I
> > think).  Does the fix make sense given what you saw?
> 
> Sorry, I haven't completed much testing; it took me a while
> to figure out the fix needs this:
> 
> diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
> index 9c0a9bd..b140dd3 100644
> --- a/net/ceph/messenger.c
> +++ b/net/ceph/messenger.c
> @@ -2013,6 +2013,7 @@ done:
>  	mutex_unlock(&con->mutex);
>  done_unlocked:
>  	con->ops->put(con);
> +	return;
> 
>  fault:
>  	mutex_unlock(&con->mutex);
> 
> 
> Still testing....

Good catch.  Let us know how it goes!

sage


* Re: [PATCH 2/2] libceph: fix handle_timeout() racing with con_work()/try_write()
  2011-05-18 23:36               ` Sage Weil
@ 2011-05-19 17:31                 ` Jim Schutt
  0 siblings, 0 replies; 17+ messages in thread
From: Jim Schutt @ 2011-05-19 17:31 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Hi Sage,

Sage Weil wrote:
> On Wed, 18 May 2011, Jim Schutt wrote:
>> Sage Weil wrote:
>>
>>> I pushed a patch to the msgr_race branch that catches all four cases (I
>>> think).  Does the fix make sense given what you saw?
>> Sorry, I haven't completed much testing; it took me a while
>> to figure out the fix needs this:
>>
>> diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
>> index 9c0a9bd..b140dd3 100644
>> --- a/net/ceph/messenger.c
>> +++ b/net/ceph/messenger.c
>> @@ -2013,6 +2013,7 @@ done:
>>  	mutex_unlock(&con->mutex);
>>  done_unlocked:
>>  	con->ops->put(con);
>> +	return;
>>
>>  fault:
>>  	mutex_unlock(&con->mutex);
>>
>>
>> Still testing....
> 
> Good catch.  Let us know how it goes!

I've been testing commit a30413af363 from your msgr_race
branch, and I think it is ready.  In my testing I've seen none
of the signs I recognize as indicators of the race we were
trying to fix.

However, with that issue fixed I'm now running into a
different type of stall under a heavy write load.

If the load is heavy enough to trigger that issue where
heartbeat processing is delayed, and an osd is wrongly
marked down, then I see a flurry of messages with bad
tags.

So far I can't generate a high enough load with much client
debugging enabled, so instead I turn on client debugging after
I see the "wrongly marked me down" message from ceph -w and
the bad message tags in the osd logs.

When I do that I see some clients with a few messages
queued up to send.  Sending of the first message
starts as expected, but before it completes it stalls,
as though the socket buffer fills up and isn't being
emptied.  Then the message times out, the socket is
closed/reopened, and the partially-sent message is
resent from the beginning.  And then sending stalls again.

I'll let you know when I've got some logs that suggest
what the problem might be...

FWIW, on the server side I'm running the stable branch
(commit b6cccc741a3).

-- Jim

> 
> sage
> 
> 




end of thread

Thread overview: 17+ messages
2011-05-04 22:27 Client reconnect failing: reader gets bad tag Jim Schutt
2011-05-05 19:19 ` Sage Weil
2011-05-05 20:23   ` Jim Schutt
2011-05-06 21:56   ` Jim Schutt
2011-05-12 21:32   ` [PATCH 1/2] libceph: add debugging to understand how bad msg tag is getting sent Jim Schutt
2011-05-12 21:32   ` [PATCH 2/2] libceph: fix handle_timeout() racing with con_work()/try_write() Jim Schutt
2011-05-16 16:57     ` [PATCH v2 0/1] " Jim Schutt
2011-05-16 16:57       ` [PATCH v2 1/1] " Jim Schutt
2011-05-16 17:57     ` [PATCH 2/2] " Sage Weil
2011-05-16 19:06       ` Jim Schutt
2011-05-17 22:32       ` Jim Schutt
2011-05-17 23:27         ` Sage Weil
2011-05-17 23:38           ` Sage Weil
2011-05-18 14:34             ` Jim Schutt
2011-05-18 20:27             ` Jim Schutt
2011-05-18 23:36               ` Sage Weil
2011-05-19 17:31                 ` Jim Schutt
