Subject: Re: Orangefs ABI documentation
From: Mike Marshall
To: Al Viro
Cc: Martin Brandenburg, Linus Torvalds, linux-fsdevel, Stephen Rothwell
Date: Tue, 16 Feb 2016 18:15:56 -0500

This thing is invulnerable now! Nothing hangs when I kill the
client-core, and the client-core always restarts.

Sometimes, if you hit it just right with a kill while dbench is
running, a file create will fail. I've spent all day trying to track
down why, in case there's something that can be done... Here's what
I see:

  orangefs_create
    service_operation
      wait_for_matching_downcall purges op and returns -EAGAIN
      orangefs_clean_up_interrupted_operation
      if (EAGAIN) ... goto retry_servicing
      wait_for_matching_downcall returns 0
    service_operation returns 0
  orangefs_create gets a good return value from service_operation, but:
    op->khandle: 00000000-0000-0000-0000-000000000000
    op->fs_id: 0
  a subsequent getattr on the bogus object fails orangefs_create
  with EINVAL.

Seems like the second time around, wait_for_matching_downcall must
have seen op_state_serviced, but I don't see how yet...

I pushed the new patches out to
gitolite.kernel.org:pub/scm/linux/kernel/git/hubcap/linux for-next

I made a couple of additional patches that make the flow of gossip
statements easier to read, and also remove a few lines of vestigial
ASYNC code.

-Mike

On Mon, Feb 15, 2016 at 6:04 PM, Al Viro wrote:
> On Mon, Feb 15, 2016 at 05:32:54PM -0500, Martin Brandenburg wrote:
>
>> Something that used a slot, such as a reader, would call
>> service_operation while holding a bufmap slot. Then the client-core
>> would crash, and the kernel would get into run_down waiting on the
>> slots to be given up. But the slots are not given up until someone
>> wakes up all the processes waiting in service_operation, which in
>> turn happens only after all the slots are given up, so the two
>> sides deadlock. The client-core then hangs until someone sends a
>> deadly signal to all the processes waiting in service_operation,
>> or presumably until the timeout expires.
>>
>> This splits finalize and run_down so that orangefs_devreq_release
>> can mark the slot map as killed, then purge waiting ops, then wait
>> for all the slots to be released. Meanwhile, processes which were
>> waiting will get into orangefs_bufmap_get, which will see that the
>> slot map is shutting down and wait for the client-core to come back.
>
> D'oh. Yes, that was exactly the point of separating mark_dead and
> run_down - the latter should've been done after purging all requests.
> Fixes folded, branch force-pushed.
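
For context, the retry path Mike traces above looks roughly like this.
This is a condensed sketch, not the verbatim kernel source: it assumes
the OrangeFS client's struct orangefs_kernel_op_s and its helpers
(wait_for_matching_downcall, orangefs_clean_up_interrupted_operation,
op_state_serviced), and it elides the step that (re)queues the op for
the client-core before each wait.

/*
 * Condensed sketch of the retry_servicing flow in service_operation();
 * control flow simplified, queueing and locking omitted.
 */
static int service_operation_sketch(struct orangefs_kernel_op_s *op)
{
	int ret;

retry_servicing:
	/* sleep until the client-core answers, or until the op is purged */
	ret = wait_for_matching_downcall(op);
	if (ret == 0) {
		/*
		 * Per the trace above: on the second pass this returns 0,
		 * i.e. the op already looked serviced (op_state_serviced),
		 * yet op->downcall was never filled in, so khandle and
		 * fs_id are still zero when orangefs_create reads them.
		 */
		return 0;
	}

	/* the op was interrupted or purged; reset it for resubmission */
	orangefs_clean_up_interrupted_operation(op);
	if (ret == -EAGAIN)
		goto retry_servicing;	/* client-core restarted, try again */

	return ret;
}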
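
And the teardown ordering Martin and Al settle on reads roughly like
this. Again a sketch under assumptions: orangefs_bufmap_finalize() is
taken to be the mark-dead half of the old finalize, purge_waiting_ops()
the wakeup, and orangefs_bufmap_run_down() the wait for slots; bodies,
locking, and error handling are omitted.

/*
 * Sketch of the fixed shutdown order in orangefs_devreq_release()
 * when the client-core exits.
 */
static int orangefs_devreq_release_sketch(struct inode *inode,
					  struct file *file)
{
	/*
	 * 1. Mark the slot map as killed.  From here on,
	 *    orangefs_bufmap_get() sees the map shutting down and
	 *    waits for the client-core to come back rather than
	 *    taking a slot.
	 */
	orangefs_bufmap_finalize();

	/*
	 * 2. Purge waiting ops.  This wakes everything sleeping in
	 *    service_operation(), so slot holders can bail out with
	 *    -EAGAIN and release their slots.
	 */
	purge_waiting_ops();

	/*
	 * 3. Only now wait for the slots to drain.  Doing this before
	 *    the purge is the deadlock described above: run_down waits
	 *    on slots whose holders are still asleep.
	 */
	orangefs_bufmap_run_down();

	return 0;
}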