From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path:
Received: from mail-lf0-f51.google.com ([209.85.215.51]:33674 "EHLO
	mail-lf0-f51.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1751560AbcBNWbM (ORCPT );
	Sun, 14 Feb 2016 17:31:12 -0500
Received: by mail-lf0-f51.google.com with SMTP id m1so79357508lfg.0
	for ; Sun, 14 Feb 2016 14:31:11 -0800 (PST)
MIME-Version: 1.0
In-Reply-To: <20160214025615.GU17997@ZenIV.linux.org.uk>
References: <20160209221623.GI17997@ZenIV.linux.org.uk>
	<20160209224050.GJ17997@ZenIV.linux.org.uk>
	<20160209231328.GK17997@ZenIV.linux.org.uk>
	<20160211004432.GM17997@ZenIV.linux.org.uk>
	<20160212042757.GP17997@ZenIV.linux.org.uk>
	<20160213174738.GR17997@ZenIV.linux.org.uk>
	<20160214025615.GU17997@ZenIV.linux.org.uk>
Date: Sun, 14 Feb 2016 17:31:10 -0500
Message-ID:
Subject: Re: Orangefs ABI documentation
From: Mike Marshall
To: Al Viro
Cc: Martin Brandenburg, Linus Torvalds, linux-fsdevel, Stephen Rothwell
Content-Type: text/plain; charset=UTF-8
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID:

I added the list_del...

Everything is very resilient. I killed the client-core over and over
while dbench was running at the same time as ls -R was running, and
the client-core always restarted... until finally, it didn't. I guess
that's related to the state of just what was going on at the time...
Hit the WARN_ON in service_operation, and then oopsed on the
orangefs_bufmap_put down at the end of wait_for_direct_io...

http://myweb.clemson.edu/~hubcap/after.list_del

-Mike

On Sat, Feb 13, 2016 at 9:56 PM, Al Viro wrote:
> On Sat, Feb 13, 2016 at 05:47:38PM +0000, Al Viro wrote:
>> On Sat, Feb 13, 2016 at 12:18:12PM -0500, Mike Marshall wrote:
>> > I added the patches, and ran a bunch of tests.
>> >
>> > Stuff works fine when left unbothered, and also
>> > when wrenches are thrown into the works.
>> >
>> > I had multiple userspace things going on at the
>> > same time, dbench, ls -R, find... kill -9 or control-C on
>> > any of them is handled well. When I killed both
>> > the client-core and its restarter, the kernel
>> > dealt with the swarm of ops that had nowhere
>> > to go... the WARN_ON in service_operation
>> > was hit.
>> >
>> > Feb 12 16:19:12 be1 kernel: [ 3658.167544] orangefs: please confirm
>> > that pvfs2-client daemon is running.
>> > Feb 12 16:19:12 be1 kernel: [ 3658.167547] fs/orangefs/dir.c line 264:
>> > orangefs_readdir: orangefs_readdir_index_get() failure (-5)
>>
>> I.e. bufmap is gone.
>>
>> > Feb 12 16:19:12 be1 kernel: [ 3658.170741] ------------[ cut here ]------------
>> > Feb 12 16:19:12 be1 kernel: [ 3658.170746] WARNING: CPU: 0 PID: 1667
>> > at fs/orangefs/waitqueue.c:203 service_operation+0x4f6/0x7f0()
>>
>> ... and we are in wait_for_direct_io(), holding an r/w slot and finding
>> ourselves with bufmap already gone, despite not having freed that slot
>> yet. Bloody wonderful - we still have bufmap refcounting buggered somewhere.
>>
>> Which tree had that been? Could you push that tree (having checked that
>> you don't have any uncommitted changes) in some branch?
>
> OK, at the very least there's this; should be folded into "orangefs: delay
> freeing slot until cancel completes"
>
> diff --git a/fs/orangefs/orangefs-kernel.h b/fs/orangefs/orangefs-kernel.h
> index 41f8bb1f..1e28555 100644
> --- a/fs/orangefs/orangefs-kernel.h
> +++ b/fs/orangefs/orangefs-kernel.h
> @@ -261,6 +261,7 @@ static inline void set_op_state_purged(struct orangefs_kernel_op_s *op)
>  {
>  	spin_lock(&op->lock);
>  	if (unlikely(op_is_cancel(op))) {
> +		list_del(&op->list);
>  		spin_unlock(&op->lock);
>  		put_cancel(op);
>  	} else {
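
For reference, here is a minimal userspace sketch of the lifetime bug
that hunk closes: if the final put can run while the op is still linked,
the next walk of the list touches freed memory, so the op has to be
unlinked before the reference is dropped. The names below (op_list,
put_op, purge_cancel) only loosely mirror the kernel ones and are
assumptions for illustration, not the actual orangefs code; locking is
elided.

#include <assert.h>
#include <stdlib.h>

struct list_head { struct list_head *next, *prev; };

static void INIT_LIST_HEAD(struct list_head *h) { h->next = h->prev = h; }

static void list_add(struct list_head *n, struct list_head *h)
{
	n->next = h->next;
	n->prev = h;
	h->next->prev = n;
	h->next = n;
}

static void list_del(struct list_head *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
}

static int list_empty(const struct list_head *h) { return h->next == h; }

struct op {
	struct list_head list;	/* linkage on op_list */
	int refcount;
};

static struct list_head op_list;

/* Drop a reference; the op is freed on the final put. */
static void put_op(struct op *op)
{
	if (--op->refcount == 0)
		free(op);
}

/*
 * Purging a cancel op: without the list_del, the final put_op()
 * could free the op while op_list still links to it, and the next
 * list walk would read freed memory -- the same shape as the hunk
 * above, which unlinks under op->lock before calling put_cancel().
 */
static void purge_cancel(struct op *op)
{
	list_del(&op->list);	/* unlink first, then drop the ref */
	put_op(op);
}

int main(void)
{
	struct op *op = calloc(1, sizeof(*op));

	INIT_LIST_HEAD(&op_list);
	op->refcount = 1;
	list_add(&op->list, &op_list);

	purge_cancel(op);
	assert(list_empty(&op_list));	/* no dangling linkage left */
	return 0;
}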