ceph-devel.vger.kernel.org archive mirror
* mds crash on snaptest-2
@ 2010-07-19 14:57 Thomas Mueller
  2010-07-19 21:16 ` Gregory Farnum
  0 siblings, 1 reply; 8+ messages in thread
From: Thomas Mueller @ 2010-07-19 14:57 UTC (permalink / raw)
  To: ceph-devel

hi

the ceph.git/unstable cmds gets killed by my snaptest-2 script
(http://github.com/vinzent/ceph-testsuite/blob/master/tests/snaptest-2),
running ceph-client-standalone/unstable-backport on kernel 2.6.34.1.
I can reproduce the behaviour.

it happens somewhere in the "Delete the snapshots..." phase.
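
For reference, the script drives snapshots through the hidden .snap
directory on the ceph mount, so the failing phase is roughly doing the
following (paths and snapshot names here are only an illustration - the
real commands are in the linked script):

cd /mnt/ceph/snaptest-2      # some directory on the ceph mount
mkdir .snap/snap-1           # taking a snapshot = mkdir under .snap
rmdir .snap/snap-1           # "Delete the snapshots..." = rmdir the entries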


kernel log:
[ 2024.315441] ceph: client4102 fsid ab2d5e45-9f53-7764-c958-c099f5be6e33
[ 2024.316111] ceph: mon0 127.0.0.1:6789 session established
[ 3753.964099] ceph:  tid 11109 timed out on osd0, will reset osd
[ 4054.056374] ceph:  tid 15029 timed out on osd0, will reset osd
[ 4098.646013] ceph: mds0 127.0.0.1:6802 socket closed
[ 4099.804937] ceph: mds0 127.0.0.1:6802 connection failed
[ 4100.804629] ceph: mds0 127.0.0.1:6802 connection failed
[ 4101.804638] ceph: mds0 127.0.0.1:6802 connection failed
[ 4103.804636] ceph: mds0 127.0.0.1:6802 connection failed
[ 4107.804381] ceph: mds0 127.0.0.1:6802 connection failed
[ 4115.804644] ceph: mds0 127.0.0.1:6802 connection failed
[ 4131.804387] ceph: mds0 127.0.0.1:6802 connection failed
[ 4144.804343] ceph: mds0 caps stale
[ 4159.806936] ceph: mds0 caps stale
[ 4163.804149] ceph: mds0 127.0.0.1:6802 connection failed

there is no cmds segfault message, but the cmds process is gone.

- Thomas



* Re: mds crash on snaptest-2
  2010-07-19 14:57 mds crash on snaptest-2 Thomas Mueller
@ 2010-07-19 21:16 ` Gregory Farnum
  2010-07-20 14:50   ` Thomas Mueller
  2010-07-27  9:22   ` Thomas Mueller
  0 siblings, 2 replies; 8+ messages in thread
From: Gregory Farnum @ 2010-07-19 21:16 UTC (permalink / raw)
  To: Thomas Mueller; +Cc: ceph-devel

Can you turn on debugging and verify for me that it's crashing on
"assert(p->second.first <= snapid && snapid <= p->first);" at
CInode::encode_inodestat:1617?
I've hit this assert trying to reproduce your issue using cfuse and I
think this is it, but I'm hitting some ext3 bugs in my kernel on a
fairly regular basis while trying to reproduce, so a fix will need to
wait until I've upgraded (tomorrow). :)
Thanks!
-Greg

On Mon, Jul 19, 2010 at 7:57 AM, Thomas Mueller <thomas@chaschperli.ch> wrote:
> hi
>
> the ceph.git/unstable cmds gets killed by my snaptest-2 (http://
> github.com/vinzent/ceph-testsuite/blob/master/tests/snaptest-2) with ceph-
> client-standalone/unstable-backport on kernel 2.6.34.1. I can reproduce
> the behaviour.
>
> it somewhere happens on the "Delete the snapshots..." phase.
>
>
> kernel log:
> [ 2024.315441] ceph: client4102 fsid ab2d5e45-9f53-7764-c958-c099f5be6e33
> [ 2024.316111] ceph: mon0 127.0.0.1:6789 session established
> [ 3753.964099] ceph:  tid 11109 timed out on osd0, will reset osd
> [ 4054.056374] ceph:  tid 15029 timed out on osd0, will reset osd
> [ 4098.646013] ceph: mds0 127.0.0.1:6802 socket closed
> [ 4099.804937] ceph: mds0 127.0.0.1:6802 connection failed
> [ 4100.804629] ceph: mds0 127.0.0.1:6802 connection failed
> [ 4101.804638] ceph: mds0 127.0.0.1:6802 connection failed
> [ 4103.804636] ceph: mds0 127.0.0.1:6802 connection failed
> [ 4107.804381] ceph: mds0 127.0.0.1:6802 connection failed
> [ 4115.804644] ceph: mds0 127.0.0.1:6802 connection failed
> [ 4131.804387] ceph: mds0 127.0.0.1:6802 connection failed
> [ 4144.804343] ceph: mds0 caps stale
> [ 4159.806936] ceph: mds0 caps stale
> [ 4163.804149] ceph: mds0 127.0.0.1:6802 connection failed
>
> there is no cmds segfault message, but cmds process has gone.
>
> - Thomas
>


* Re: mds crash on snaptest-2
  2010-07-19 21:16 ` Gregory Farnum
@ 2010-07-20 14:50   ` Thomas Mueller
  2010-07-27  9:22   ` Thomas Mueller
  1 sibling, 0 replies; 8+ messages in thread
From: Thomas Mueller @ 2010-07-20 14:50 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel


On Mon, 19 Jul 2010 14:16:14 -0700, Gregory Farnum wrote:

> Can you turn on debugging and verify for me that it's crashing on
> "assert(p->second.first <= snapid && snapid <= p->first); "
> CInode::encode_inodestat:1617?
> I've hit this assert trying to reproduce your issue using cfuse and I
> think this is it, but I'm hitting some ext3 bugs in my kernel on a
> fairly regular basis while trying to reproduce, so a fix will need to
> wait until I've upgraded (tomorrow). :) Thanks!
> -Greg


I start the daemons with vstart.sh - is that enough to "turn on debugging"?

last 10 lines of all log files in src/log:

http://pastebin.com/VE05VXRF
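
If vstart.sh alone doesn't give enough output, I guess I could also bump
the mds log levels by hand - something like this, appended to the
ceph.conf that vstart.sh generates (I'm guessing at the option names and
at where the conf lives, so correct me if these aren't the right knobs):

# appended to the generated ceph.conf (assuming it sits in the current dir)
cat >> ceph.conf <<'EOF'
[mds]
        debug mds = 20
        debug ms = 1
EOF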


* Re: mds crash on snaptest-2
  2010-07-19 21:16 ` Gregory Farnum
  2010-07-20 14:50   ` Thomas Mueller
@ 2010-07-27  9:22   ` Thomas Mueller
  2010-07-27 18:59     ` Gregory Farnum
  1 sibling, 1 reply; 8+ messages in thread
From: Thomas Mueller @ 2010-07-27  9:22 UTC (permalink / raw)
  To: ceph-devel

On Mon, 19 Jul 2010 14:16:14 -0700, Gregory Farnum wrote:

> Can you turn on debugging and verify for me that it's crashing on
> "assert(p->second.first <= snapid && snapid <= p->first); "
> CInode::encode_inodestat:1617?
> I've hit this assert trying to reproduce your issue using cfuse and I
> think this is it, but I'm hitting some ext3 bugs in my kernel on a
> fairly regular basis while trying to reproduce, so a fix will need to
> wait until I've upgraded (tomorrow). :) Thanks!
> -Greg

hi greg

the test still fails with ceph.git/unstable from today. Now cmds no longer
exits, but after half an hour the test kills itself because of a timeout
(normal running time is about 10 minutes).

- Thomas

PS: I found out that vstart.sh also places logs in the "out" subdir, so
tell me if you need any of them.



* Re: mds crash on snaptest-2
  2010-07-27  9:22   ` Thomas Mueller
@ 2010-07-27 18:59     ` Gregory Farnum
  2010-07-27 19:50       ` Thomas Mueller
  0 siblings, 1 reply; 8+ messages in thread
From: Gregory Farnum @ 2010-07-27 18:59 UTC (permalink / raw)
  To: Thomas Mueller; +Cc: ceph-devel

On Tue, Jul 27, 2010 at 2:22 AM, Thomas Mueller <thomas@chaschperli.ch> wrote:
> On Mon, 19 Jul 2010 14:16:14 -0700, Gregory Farnum wrote:
>
>> Can you turn on debugging and verify for me that it's crashing on
>> "assert(p->second.first <= snapid && snapid <= p->first); "
>> CInode::encode_inodestat:1617?
>> I've hit this assert trying to reproduce your issue using cfuse and I
>> think this is it, but I'm hitting some ext3 bugs in my kernel on a
>> fairly regular basis while trying to reproduce, so a fix will need to
>> wait until I've upgraded (tomorrow). :) Thanks!
>> -Greg
>
> hi greg
>
> the test still fails with ceph.git/unstable from today. now cmds doesn't
> exit anymore. But after a half an hour the test kills itself because of a
> timeout (normal running time is about 10 minutes).
>
> - Thomas
>
> PS: found out that vstart.sh places logs in subdir "out" too. so tell me
> if you need some of them.

Yes, I've been working on this for some time now. If you try the test
on a single MDS it should work fine with the latest git, but there are
some deeper issues going on with an MDS cluster that we're having a
hard time isolating in a way that lets us fix it. It appears we might
need to rework our snapshot inode handling a bit, and Sage has asked me
to move on.

I'd recommend doing your testing on a single-MDS system (if using vstart:
CEPH_NUM_MDS=1 ./vstart -- this also works for _OSD and _MON) until we say
that we expect the MDS cluster to work under more circumstances.


* Re: mds crash on snaptest-2
  2010-07-27 18:59     ` Gregory Farnum
@ 2010-07-27 19:50       ` Thomas Mueller
  2010-07-27 19:54         ` Gregory Farnum
  0 siblings, 1 reply; 8+ messages in thread
From: Thomas Mueller @ 2010-07-27 19:50 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel


>> the test still fails with ceph.git/unstable from today. now cmds doesn't
>> exit anymore. But after a half an hour the test kills itself because of a
>> timeout (normal running time is about 10 minutes).
>>
>> - Thomas
>>
>> PS: found out that vstart.sh places logs in subdir "out" too. so tell me
>> if you need some of them.
> Yes, I've been working on this for some time now. If you try the test
> on a single MDS it should work fine with the latest git, but there are
> some deeper issues going on with an MDS cluster that we're having a
> hard time isolating in a way that lets us fix it. It appears we might
> need to rework our snapshot inode handling a bit and Sage has asked me
> to move on.
>
> I'd recommend doing your testing on a single MDS (if using vstart:
> CEPH_NUM_MDS=1 ./vstart -- this also works for _OSD and _MON) system
> until we say that we expect the MDS cluster to work under more
> circumstances.

I'm always starting just one of each daemon. My test script sets these
vars before calling "vstart.sh":

export CEPH_NUM_MON=1
export CEPH_NUM_OSD=1
export CEPH_NUM_MDS=1

The last known good rev was ae82dd5a5c964bb310a5512d10d1e062cbb0c1a5 on
July 8 - with that rev the test was working fine.

I've also tried to compile with "-O0" to run it under gdb (not that I'm
a gdb expert..) - but the binaries failed to start (ok, it was a bit late
back then ...)
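
For the record, this is roughly what I attempted (build steps written
down from memory, so the configure invocation may not be exactly right):

./autogen.sh
./configure CXXFLAGS="-g -O0"
make
# plan was to then attach gdb to the cmds started by vstart.sh rather
# than guess at cmds' command line flags:
gdb -p "$(pgrep cmds)"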

- Thomas


* Re: mds crash on snaptest-2
  2010-07-27 19:50       ` Thomas Mueller
@ 2010-07-27 19:54         ` Gregory Farnum
  2010-07-28  5:03           ` Thomas Mueller
  0 siblings, 1 reply; 8+ messages in thread
From: Gregory Farnum @ 2010-07-27 19:54 UTC (permalink / raw)
  To: Thomas Mueller; +Cc: ceph-devel

On Tue, Jul 27, 2010 at 12:50 PM, Thomas Mueller <thomas@chaschperli.ch> wrote:
> i'm  always starting just one daemon. my test script sets these vars before
> calling "vstart.sh":
>
> export CEPH_NUM_MON=1
> export CEPH_NUM_OSD=1
> export CEPH_NUM_MDS=1
>
> last known good rev was  ae82dd5a5c964bb310a5512d10d1e062cbb0c1a5 on July 8
> - with this rev the test was working fine.
>
> i've also tried to compile with "-O0" to run it with gdb (not that i'm a gdb
> expert..) - but the binaries failed to start (ok back then it was bit late
> ...)

Huh. Can you double check that you have the latest code? Specifically,
it needs to include commit e2b1a4ee119a68b403582ae3bc15b54e9458b9b5.
I've run your test a number of times under cfuse and haven't gotten
any single-MDS crashes or hangs with that.
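
Something like this (plain git, nothing ceph-specific) will tell you
whether your checkout contains it:

git branch --contains e2b1a4ee119a68b403582ae3bc15b54e9458b9b5
# or simply:
git log --oneline | grep e2b1a4ee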

Are you running it under the kclient or cfuse?


* Re: mds crash on snaptest-2
  2010-07-27 19:54         ` Gregory Farnum
@ 2010-07-28  5:03           ` Thomas Mueller
  0 siblings, 0 replies; 8+ messages in thread
From: Thomas Mueller @ 2010-07-28  5:03 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel

On 27.07.2010 21:54, Gregory Farnum wrote:
> On Tue, Jul 27, 2010 at 12:50 PM, Thomas Mueller<thomas@chaschperli.ch>  wrote:
>> i'm  always starting just one daemon. my test script sets these vars before
>> calling "vstart.sh":
>>
>> export CEPH_NUM_MON=1
>> export CEPH_NUM_OSD=1
>> export CEPH_NUM_MDS=1
>>
>> last known good rev was  ae82dd5a5c964bb310a5512d10d1e062cbb0c1a5 on July 8
>> - with this rev the test was working fine.
>>
>> i've also tried to compile with "-O0" to run it with gdb (not that i'm a gdb
>> expert..) - but the binaries failed to start (ok back then it was bit late
>> ...)
> Huh. Can you double check that you have the latest code? Specifically,
> it needs to include commit e2b1a4ee119a68b403582ae3bc15b54e9458b9b5.
> I've run your test a number of times under cfuse and haven't gotten
> any single-MDS crashes or hangs with that.
>
> Are you running it under the kclient or cfuse?


today the test passed - thank you!

Before I updated for today's test run, the mentioned ref was the last
commit in yesterday's tests:
$ git log -1
commit e2b1a4ee119a68b403582ae3bc15b54e9458b9b5
Author: Greg Farnum <gregf@hq.newdream.net>
Date:   Mon Jul 26 16:43:16 2010 -0700

     mds: Use get_oldest_snap() (not first) in handle_client_lssnap.


I'm running the kclient unstable-backport branch with "merge origin/master"
(ref 0938669c180056f517db836f05697f8a2c41ec61) on vanilla kernel 2.6.34.1.



- Thomas


Thread overview: 8+ messages
2010-07-19 14:57 mds crash on snaptest-2 Thomas Mueller
2010-07-19 21:16 ` Gregory Farnum
2010-07-20 14:50   ` Thomas Mueller
2010-07-27  9:22   ` Thomas Mueller
2010-07-27 18:59     ` Gregory Farnum
2010-07-27 19:50       ` Thomas Mueller
2010-07-27 19:54         ` Gregory Farnum
2010-07-28  5:03           ` Thomas Mueller
