From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from mail-lf0-f42.google.com ([209.85.215.42]:34052 "EHLO
	mail-lf0-f42.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1948081AbcBRUi2 (ORCPT);
	Thu, 18 Feb 2016 15:38:28 -0500
Received: by mail-lf0-f42.google.com with SMTP id j78so40777154lfb.1;
	Thu, 18 Feb 2016 12:38:27 -0800 (PST)
MIME-Version: 1.0
References: <20160215230434.GZ17997@ZenIV.linux.org.uk>
	<20160216233609.GE17997@ZenIV.linux.org.uk>
	<20160216235441.GF17997@ZenIV.linux.org.uk>
	<20160217230900.GP17997@ZenIV.linux.org.uk>
	<20160217231524.GQ17997@ZenIV.linux.org.uk>
	<20160218000439.GR17997@ZenIV.linux.org.uk>
	<20160218111122.GS17997@ZenIV.linux.org.uk>
Date: Thu, 18 Feb 2016 15:38:26 -0500
Subject: Re: Orangefs ABI documentation
From: Mike Marshall
To: Martin Brandenburg
Cc: Al Viro, Linus Torvalds, linux-fsdevel, Stephen Rothwell
Content-Type: text/plain; charset=UTF-8
Sender: linux-fsdevel-owner@vger.kernel.org

Yeah, it looks like the fault is entirely with the client-core...

orangefs-kernel.h: OP_VFS_STATE_UNKNOWN = 0,
orangefs-kernel.h: OP_VFS_STATE_WAITING = 1,
orangefs-kernel.h: OP_VFS_STATE_INPROGR = 2,
orangefs-kernel.h: OP_VFS_STATE_SERVICED = 4,
orangefs-kernel.h: OP_VFS_STATE_PURGED = 8,
orangefs-kernel.h: OP_VFS_STATE_GIVEN_UP = 16,

Alloced OP (ffff880011078000: 20210 OP_CREATE)
service_operation: orangefs_create op:ffff880011078000:
service_op: orangefs_create op:ffff880011078000: process:dbench state -> 1
orangefs_devreq_read: op:ffff880011078000: process:pvfs2-client-co state -> 2
set_op_state_purged: op:ffff880011078000: process:pvfs2-client-co state -> 10
wait_for_matching_downcall: operation purged (tag 20210, ffff880011078000, att 0
service_operation: wait_for_matching_downcall returned -11 for ffff880011078000
Interrupted: Removed op ffff880011078000 from htable_ops_in_progress
tag 20210 (orangefs_create) -- operation to be retried (1 attempt)
service_operation: orangefs_create op:ffff880011078000: process:dbench: pid:1171
service_op: orangefs_create op:ffff880011078000: process:dbench state -> 1
service_operation:client core is NOT in service, ffff880011078000
orangefs_devreq_read: op:ffff880011078000: process:pvfs2-client-co state -> 2
WARNING: CPU: 0 PID: 1216 at fs/orangefs/devorangefs-req.c:423
set_op_state_serviced: op:ffff880011078000: process:pvfs2-client-co state -> 4
service_operation: wait_for_matching_downcall returned 0 for ffff880011078000
service_operation orangefs_create returning: 0 for ffff880011078000
orangefs_create: BENCHS.LWP: handle:00000000-0000-0000-0000-000000000000: fsid:0:
new_op:ffff880011078000: ret:0:
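
For anyone mapping the state numbers in that trace back to the flags,
here's a throwaway userspace sketch. Only the bit values come from
orangefs-kernel.h above; the op struct and the two helpers are boiled
down for illustration, not copied from the tree. It shows why the
purged op reads back as state 10 (INPROGR 2 with PURGED 8 OR'd on top)
and why it still isn't SERVICED:

#include <stdio.h>

/* Bit values quoted from orangefs-kernel.h above. */
enum {
        OP_VFS_STATE_UNKNOWN  = 0,
        OP_VFS_STATE_WAITING  = 1,
        OP_VFS_STATE_INPROGR  = 2,
        OP_VFS_STATE_SERVICED = 4,
        OP_VFS_STATE_PURGED   = 8,
        OP_VFS_STATE_GIVEN_UP = 16,
};

/* Toy stand-in for the real op struct; only the state word matters here. */
struct toy_op {
        int op_state;
};

#define op_state_purged(op)   ((op)->op_state & OP_VFS_STATE_PURGED)
#define op_state_serviced(op) ((op)->op_state & OP_VFS_STATE_SERVICED)

int main(void)
{
        struct toy_op op = { .op_state = OP_VFS_STATE_WAITING }; /* state -> 1 */

        op.op_state = OP_VFS_STATE_INPROGR;  /* client-core reads it: state -> 2 */
        op.op_state |= OP_VFS_STATE_PURGED;  /* client-core goes away: state -> 10 */

        printf("state %d purged %d serviced %d\n",
               op.op_state,
               !!op_state_purged(&op),
               !!op_state_serviced(&op));
        return 0;
}

That 2 -> 10 transition is what the first attempt logs right before
wait_for_matching_downcall gives up with -11 and the op gets retried.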
-Mike

On Thu, Feb 18, 2016 at 3:22 PM, Mike Marshall wrote:
> I haven't edited up a list of how the debug output looked,
> but most importantly: the WARN_ON is hit... it appears that
> the client-core is sending over fsid:0:
>
> -Mike
>
> On Thu, Feb 18, 2016 at 3:08 PM, Mike Marshall wrote:
>> I haven't been trussing it... it reports EINVAL to stderr... I find
>> the ops to look at in the debug output by looking for the -22...
>>
>> (373) open ./clients/client8/~dmtmp/PARADOX/STUDENTS.DB failed for
>> handle 9981 (Invalid argument)
>>
>> I just got the whacky code from Al's last message to compile, I'll
>> have results from that soon...
>>
>> -Mike
>>
>> On Thu, Feb 18, 2016 at 2:49 PM, Martin Brandenburg wrote:
>>> On Thu, 18 Feb 2016, Mike Marshall wrote:
>>>
>>>> Still busted, exactly the same, I think.  The doomed op gets a good
>>>> return code from is_daemon_in_service in service_operation but
>>>> gets EAGAIN from wait_for_matching_downcall... an edge case kind of
>>>> problem.
>>>>
>>>> Here's the raw (well, slightly edited for readability) logs showing
>>>> the doomed op and subsequent failed op that uses the bogus handle
>>>> and fsid from the doomed op.
>>>>
>>>>
>>>> Alloced OP (ffff880012898000: 10889 OP_CREATE)
>>>> service_operation: orangefs_create op:ffff880012898000:
>>>>
>>>>
>>>> wait_for_matching_downcall: operation purged (tag 10889, ffff880012898000, att 0
>>>> service_operation: wait_for_matching_downcall returned -11 for ffff880012898000
>>>> Interrupted: Removed op ffff880012898000 from htable_ops_in_progress
>>>> tag 10889 (orangefs_create) -- operation to be retried (1 attempt)
>>>> service_operation: orangefs_create op:ffff880012898000:
>>>> service_operation:client core is NOT in service, ffff880012898000
>>>>
>>>>
>>>> service_operation: wait_for_matching_downcall returned 0 for ffff880012898000
>>>> service_operation orangefs_create returning: 0 for ffff880012898000
>>>> orangefs_create: PPTOOLS1.PPA:
>>>> handle:00000000-0000-0000-0000-000000000000: fsid:0:
>>>> new_op:ffff880012898000: ret:0:
>>>>
>>>>
>>>> Alloced OP (ffff880012888000: 10958 OP_GETATTR)
>>>> service_operation: orangefs_inode_getattr op:ffff880012888000:
>>>> service_operation: wait_for_matching_downcall returned 0 for ffff880012888000
>>>> service_operation orangefs_inode_getattr returning: -22 for ffff880012888000
>>>> Releasing OP (ffff880012888000: 10958
>>>> orangefs_create: Failed to allocate inode for file :PPTOOLS1.PPA:
>>>> Releasing OP (ffff880012898000: 10889
>>>>
>>>>
>>>> What I'm testing with differs from what is at kernel.org#for-next by
>>>> - diffs from Al's most recent email
>>>> - 1 souped up gossip message
>>>> - changed 0 to OP_VFS_STATE_UNKNOWN one place in service_operation
>>>> - reinit_completion(&op->waitq) in orangefs_clean_up_interrupted_operation
>>>>
>>>
>>> Mike,
>>>
>>> what error do you get from userspace (i.e. from dbench)?
>>>
>>> open("./clients/client0/~dmtmp/EXCEL/5D7C0000", O_RDWR|O_CREAT, 0600) = -1 ENODEV (No such device)
>>>
>>> An interesting note is that I can't reproduce at all
>>> with only one dbench process. It seems there's not
>>> enough load.
>>>
>>> I don't see how the kernel could return ENODEV at all.
>>> This may be coming from our client-core.
>>>
>>> -- Martin
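
P.S. To make the failure mode in the traces concrete: the retried
create comes back with ret:0 but carries an all-zero handle and fsid:0,
and it's the getattr that follows which then fails with -22. Below is a
purely illustrative check, with made-up types and names rather than the
in-tree orangefs structs, of the kind of sanity test being talked
about:

#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Made-up stand-ins for the handle/fsid pair carried back by a create
 * downcall; NOT the in-tree orangefs types. */
struct toy_khandle {
        uint8_t u[16];
};

struct toy_object_ref {
        struct toy_khandle khandle;
        int32_t fs_id;
};

/* Reject a "successful" create whose object reference is all zeroes,
 * which is what the restarted client-core handed back in the traces. */
static int check_create_downcall(const struct toy_object_ref *ref)
{
        static const struct toy_khandle zero;

        if (ref->fs_id == 0 &&
            memcmp(&ref->khandle, &zero, sizeof(zero)) == 0)
                return -EIO;
        return 0;
}

int main(void)
{
        /* handle:00000000-0000-0000-0000-000000000000: fsid:0:, as logged */
        struct toy_object_ref from_log = { .fs_id = 0 };

        printf("create downcall %s\n",
               check_create_downcall(&from_log) ? "rejected" : "accepted");
        return 0;
}

Something along those lines in the kernel would only turn the bogus
create into a clean error, though... the root cause still looks like
the client-core handing back a zeroed create response after it
restarts.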