* Assert in Pipe.cc in Hammer 0.94.7
From: Padmanabh Ratnakar @ 2017-01-16 13:23 UTC (permalink / raw)
  To: ceph-devel

Hi,
           We are seeing the following assert in Pipe.cc when we hit a
network glitch in our setup.

msg/simple/Pipe.cc: In function 'int Pipe::connect()' thread
7f0124800700 time 2016-12-28 20:43:00.057696
msg/simple/Pipe.cc: 1156: FAILED assert(m)
 ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x8b) [0xbb1fab]
 2: (Pipe::connect()+0x380a) [0xc8985a]
 3: (Pipe::writer()+0x4ca) [0xc8acca]
 4: (Pipe::Writer::entry()+0xd) [0xc95b1d]
 5: (()+0x8182) [0x7f01e1b8f182]
 6: (clone()+0x6d) [0x7f01e00fa47d]

(gdb) bt
#0  0x00007f01e1b9720b in raise (sig=6) at
../nptl/sysdeps/unix/sysv/linux/pt-raise.c:37
#1  0x0000000000ab70dd in reraise_fatal (signum=6) at
global/signal_handler.cc:59
#2  handle_fatal_signal (signum=6) at global/signal_handler.cc:109
#3  <signal handler called>
#4  0x00007f01e0036cc9 in __GI_raise (sig=sig@entry=6) at
../nptl/sysdeps/unix/sysv/linux/raise.c:56
#5  0x00007f01e003a0d8 in __GI_abort () at abort.c:89
#6  0x00007f01e0941535 in __gnu_cxx::__verbose_terminate_handler() ()
from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007f01e093f6d6 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007f01e093f703 in std::terminate() () from
/usr/lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00007f01e093f922 in __cxa_throw () from
/usr/lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x0000000000bb2198 in ceph::__ceph_assert_fail
(assertion=assertion@entry=0xd8bc4f "m", file=file@entry=0xd928b8
"msg/simple/Pipe.cc",
    line=line@entry=1156, func=func@entry=0xd94210
<Pipe::connect()::__PRETTY_FUNCTION__> "int Pipe::connect()") at
common/assert.cc:77
#11 0x0000000000c8985a in Pipe::connect (this=this@entry=0x32030000)
at msg/simple/Pipe.cc:1156
#12 0x0000000000c8acca in Pipe::writer (this=0x32030000) at
msg/simple/Pipe.cc:1703
#13 0x0000000000c95b1d in Pipe::Writer::entry (this=<optimized out>)
at msg/simple/Pipe.h:62
#14 0x00007f01e1b8f182 in start_thread (arg=0x7f0124800700) at
pthread_create.c:312
#15 0x00007f01e00fa47d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Is this a known issue? I searched for it and could not find anyone hitting this.

Looking at the 0.94.7 code, it looks like pipe_lock is released at line
886, near the beginning of the connect() routine, and taken again later.
But the code below updates the state member variable without checking
the current state.
If the pipe is moved to STATE_CLOSED in the interval when the lock was
released, there is a chance that STATE_CLOSED can be overwritten, either
with STATE_WAIT when CEPH_MSGR_TAG_WAIT comes as the reply, or directly
with STATE_OPEN at line 1172.
I suspect this may cause the assert seen above, but only if several
other things also happen.
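
To make the interleaving I have in mind concrete, here is a minimal
standalone sketch of the pattern (not the actual Ceph code; the names
mirror Pipe.cc, but the structure is invented for illustration):

#include <mutex>
#include <thread>

enum State { STATE_CONNECTING, STATE_CLOSED, STATE_WAIT, STATE_OPEN };

std::mutex pipe_lock;            // stands in for Pipe::pipe_lock
State state = STATE_CONNECTING;  // stands in for Pipe::state

void connect_thread() {          // models the Pipe::connect() path
  pipe_lock.lock();
  // ... join_reader(), read cseq/gseq ...
  pipe_lock.unlock();
  // tcp_read()/do_sendmsg() run here WITHOUT the lock; the accept path
  // may close or replace this pipe in the meantime.
  pipe_lock.lock();
  state = STATE_WAIT;            // unguarded write: clobbers STATE_CLOSED
  pipe_lock.unlock();
}

void accept_thread() {           // models the accept()/replacement path
  std::lock_guard<std::mutex> g(pipe_lock);
  state = STATE_CLOSED;          // pipe torn down after a network error
}

int main() {
  std::thread a(connect_thread), b(accept_thread);
  a.join();
  b.join();
  // If accept_thread won the race during connect_thread's unlocked
  // window, its STATE_CLOSED has now been silently overwritten.
}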

I am new to the Ceph code and may be missing something.
Please check whether this can cause any issues.

We did not have many logs enabled when we hit this issue.
Meanwhile, we are trying to reproduce it in our tests with logging enabled.

Let me know what information you need to help debug this.

Thanks,
Padmanabh


* Re: Assert in Pipe.cc in Hammer 0.94.7
From: Gregory Farnum @ 2017-01-16 23:27 UTC (permalink / raw)
  To: Padmanabh Ratnakar; +Cc: ceph-devel

On Mon, Jan 16, 2017 at 5:23 AM, Padmanabh Ratnakar
<padmanabh.ratnakar@snapdeal.com> wrote:
> [assert message and backtrace snipped]
>
> Is this a known issue? I searched for it and could not find anyone hitting this.

I don't think so.

>
> Looking at the 0.94.7 code, it looks like pipe_lock is released at line
> 886, near the beginning of the connect() routine, and taken again later.
> But the code below updates the state member variable without checking
> the current state.

Where exactly?

> If the pipe is moved to STATE_CLOSED in the interval when the lock was
> released, there is a chance that STATE_CLOSED can be overwritten, either
> with STATE_WAIT when CEPH_MSGR_TAG_WAIT comes as the reply, or directly
> with STATE_OPEN at line 1172.

I'm not following your referents here. The out_seq can get
overwritten? Something else?

> I suspect this may cause the assert seen above, but only if several
> other things also happen.

We had some vaguely similar issues in the time after CEPH_MSGR_TAG_SEQ
was introduced, but I think it's been a while. You might have spotted
another rare one.
-Greg


* Re: Assert in Pipe.cc in Hammer 0.94.7
From: Padmanabh Ratnakar @ 2017-01-18 23:10 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel

>>
>> Is this a known issue? I searched for it and could not find anyone hitting this.
>
> I don't think so.
>
>>
>> Looking at the 0.94.7 code, it looks like pipe_lock is released at line
>> 886, near the beginning of the connect() routine, and taken again later.
>> But the code below updates the state member variable without checking
>> the current state.
>
> Where exactly?

Sorry for the late response.

In function int Pipe::connect(), pipe_lock is released first.

int Pipe::connect()
{
  bool got_bad_auth = false;

  ldout(msgr->cct,10) << "connect " << connect_seq << dendl;
  assert(pipe_lock.is_locked());

  __u32 cseq = connect_seq;
  __u32 gseq = msgr->get_global_seq();

  // stop reader thread
  join_reader();

  pipe_lock.Unlock();

.....

Some tcp_read() and do_sendmsg() calls follow for protocol processing.
Between the unlock and the re-lock below, the pipe may be replaced due
to network errors and moved to STATE_CLOSED in the accept() routine
while this pipe is waiting for the lock.
......

pipe_lock.Lock();

....
After this there is code like the following, which overwrites the pipe
state. Is there a danger of overwriting the STATE_CLOSED state?

    if (reply.tag == CEPH_MSGR_TAG_WAIT) {
      ldout(msgr->cct,3) << "connect got WAIT (connection race)" << dendl;
      state = STATE_WAIT;
      goto stop_locked;
    }

Please see if there is any issue in the above code.
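
In timeline form, the interleaving I am worried about looks like this
(a hypothetical schedule, not a captured trace; the thread roles are my
reading of how the writer and accept paths interact):

  writer thread, in connect()       accept path (other thread)
  ---------------------------       --------------------------
  pipe_lock.Unlock()
  tcp_read()/do_sendmsg() ...       pipe_lock.Lock()
                                    state = STATE_CLOSED  (pipe replaced)
                                    pipe_lock.Unlock()
  pipe_lock.Lock()
  reply.tag == CEPH_MSGR_TAG_WAIT
  state = STATE_WAIT                <-- would clobber STATE_CLOSED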

>
>> If the pipe is moved to STATE_CLOSED in the interval when the lock was
>> released, there is a chance that STATE_CLOSED can be overwritten, either
>> with STATE_WAIT when CEPH_MSGR_TAG_WAIT comes as the reply, or directly
>> with STATE_OPEN at line 1172.
>
> I'm not following your referents here. The out_seq can get
> overwritten? Something else?

We got an acked out_seq from the other side that did not match the
messages present in this pipe's queue.
This leads me to think there is a race that is not handled correctly.
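
For reference, the assert that fired (Pipe.cc:1156) sits in connect()'s
CEPH_MSGR_TAG_SEQ handling, where the peer's acked sequence is replayed
against our out queue. Below is a self-contained model of that loop, as
I read the 0.94.7 source (simplified: messages reduced to bare sequence
numbers, and _get_next_outgoing() replaced by a plain deque):

#include <cassert>
#include <cstdint>
#include <deque>

// Discard everything the peer already acked. If the peer acks further
// than what this pipe has queued -- e.g. because a raced/stale pipe is
// holding the wrong queue -- the deque runs dry and the assert fires,
// matching the FAILED assert(m) above.
void handle_tag_seq(std::deque<uint64_t>& out_q, uint64_t& out_seq,
                    uint64_t newly_acked_seq) {
  while (newly_acked_seq > out_seq) {
    assert(!out_q.empty());          // analogue of the failing assert(m)
    uint64_t seq = out_q.front();    // _get_next_outgoing() analogue
    out_q.pop_front();
    assert(seq <= newly_acked_seq);  // same sanity check as Pipe.cc
    out_seq = seq;
  }
}

int main() {
  std::deque<uint64_t> out_q = {5, 6};  // messages still queued
  uint64_t out_seq = 4;                 // highest seq acked so far
  handle_tag_seq(out_q, out_seq, 6);    // fine: discards 5 and 6
  handle_tag_seq(out_q, out_seq, 9);    // aborts: queue already empty
}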

>
>> I suspect this may cause the assert seen above, but only if several
>> other things also happen.
>
> We had some vaguely similar issues in the time after CEPH_MSGR_TAG_SEQ
> was introduced, but I think it's been a while. You might have spotted
> another rare one.
> -Greg

-Padmanabh


* Re: Assert in Pipe.cc in Hammer 0.94.7
From: Padmanabh Ratnakar @ 2017-01-24 14:10 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel

> [full analysis from my previous mail snipped]

Hi Greg, could you take a look and see whether there is a bug here?
We have not been able to reproduce the issue in our tests.

Thanks,
Padmanabh


* Re: Assert in Pipe.cc in Hammer 0.94.7
From: Gregory Farnum @ 2017-01-24 18:51 UTC (permalink / raw)
  To: Padmanabh Ratnakar; +Cc: ceph-devel

On Wed, Jan 18, 2017 at 3:10 PM, Padmanabh Ratnakar
<padmanabh.ratnakar@snapdeal.com> wrote:
> Sorry for the late response.
>
> In function int Pipe::connect(), pipe_lock is released first.
>
> int Pipe::connect()
> {
>   bool got_bad_auth = false;
>
>   ldout(msgr->cct,10) << "connect " << connect_seq << dendl;
>   assert(pipe_lock.is_locked());
>
>   __u32 cseq = connect_seq;
>   __u32 gseq = msgr->get_global_seq();
>
>   // stop reader thread
>   join_reader();
>
>   pipe_lock.Unlock();
>
> .....
>
> Some tcp_read() and do_sendmsg() calls follow for protocol processing.
> Between the unlock and the re-lock below, the pipe may be replaced due
> to network errors and moved to STATE_CLOSED in the accept() routine
> while this pipe is waiting for the lock.
> ......
>
> pipe_lock.Lock();
>
> ....
> After this there is code like the following, which overwrites the pipe
> state. Is there a danger of overwriting the STATE_CLOSED state?
>
>     if (reply.tag == CEPH_MSGR_TAG_WAIT) {
>       ldout(msgr->cct,3) << "connect got WAIT (connection race)" << dendl;
>       state = STATE_WAIT;
>       goto stop_locked;
>     }

You're skipping over the part where, immediately after taking the
pipe_lock back, we check for this case (and similar ones):

>    pipe_lock.Lock();
>    if (state != STATE_CONNECTING) {
>      ldout(msgr->cct,0) << "connect got RESETSESSION but no longer connecting" << dendl;
>      goto stop_locked;
>    }

https://github.com/ceph/ceph/blob/v0.94.1.7/src/msg/simple/Pipe.cc#L1078

:)
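
Put together, the WAIT branch is only reachable while the state is
still STATE_CONNECTING. A compressed sketch of the control flow (not
verbatim Hammer code):

  pipe_lock.Lock();
  if (state != STATE_CONNECTING)
    goto stop_locked;       // accept() closed/replaced the pipe while
                            // we did I/O unlocked; 'state' is untouched
  ...
  if (reply.tag == CEPH_MSGR_TAG_WAIT) {
    state = STATE_WAIT;     // safe: still CONNECTING under the lock
    goto stop_locked;
  }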

Obviously there's an issue somewhere or you wouldn't have hit an
assert, but I'm not sure where and I'm pretty sure if you ever see it
again, it's something fixed in later source revisions (...which we may
not have backported once hammer reached EoL).
-Greg



* Re: Assert in Pipe.cc in Hammer 0.94.7
From: Padmanabh Ratnakar @ 2017-01-25 16:12 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel

>>     if (reply.tag == CEPH_MSGR_TAG_WAIT) {
>>       ldout(msgr->cct,3) << "connect got WAIT (connection race)" << dendl;
>>       state = STATE_WAIT;
>>       goto stop_locked;
>>     }
>
> You're skipping over the part where, immediately after taking the
> pipe_lock back, we check for this case (and similar ones):
>
>>    pipe_lock.Lock();
>>    if (state != STATE_CONNECTING) {
>>      ldout(msgr->cct,0) << "connect got RESETSESSION but no longer connecting" << dendl;
>>      goto stop_locked;
>>    }
>
> https://github.com/ceph/ceph/blob/v0.94.1.7/src/msg/simple/Pipe.cc#L1078
>
> :)
>
> Obviously there's an issue somewhere or you wouldn't have hit an
> assert, but I'm not sure where and I'm pretty sure if you ever see it
> again, it's something fixed in later source revisions (...which we may
> not have backported once hammer reached EoL).
> -Greg

Sorry, my bad. I missed this somehow. :)
Thanks for clearing this up.

Thanks,
Padmanabh

