From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from mail-lf0-f42.google.com ([209.85.215.42]:34052 "EHLO
	mail-lf0-f42.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1948081AbcBRUi2 (ORCPT);
	Thu, 18 Feb 2016 15:38:28 -0500
Received: by mail-lf0-f42.google.com with SMTP id j78so40777154lfb.1;
	Thu, 18 Feb 2016 12:38:27 -0800 (PST)
MIME-Version: 1.0
References: <20160215230434.GZ17997@ZenIV.linux.org.uk>
	<20160216233609.GE17997@ZenIV.linux.org.uk>
	<20160216235441.GF17997@ZenIV.linux.org.uk>
	<20160217230900.GP17997@ZenIV.linux.org.uk>
	<20160217231524.GQ17997@ZenIV.linux.org.uk>
	<20160218000439.GR17997@ZenIV.linux.org.uk>
	<20160218111122.GS17997@ZenIV.linux.org.uk>
Date: Thu, 18 Feb 2016 15:38:26 -0500
Subject: Re: Orangefs ABI documentation
From: Mike Marshall
To: Martin Brandenburg
Cc: Al Viro, Linus Torvalds, linux-fsdevel, Stephen Rothwell
Content-Type: text/plain; charset=UTF-8
Sender: linux-fsdevel-owner@vger.kernel.org

Yeah, it looks like the fault is entirely with the client-core...

orangefs-kernel.h: OP_VFS_STATE_UNKNOWN = 0,
orangefs-kernel.h: OP_VFS_STATE_WAITING = 1,
orangefs-kernel.h: OP_VFS_STATE_INPROGR = 2,
orangefs-kernel.h: OP_VFS_STATE_SERVICED = 4,
orangefs-kernel.h: OP_VFS_STATE_PURGED = 8,
orangefs-kernel.h: OP_VFS_STATE_GIVEN_UP = 16,

Alloced OP (ffff880011078000: 20210 OP_CREATE)
service_operation: orangefs_create op:ffff880011078000:
service_op: orangefs_create op:ffff880011078000: process:dbench state -> 1
orangefs_devreq_read: op:ffff880011078000: process:pvfs2-client-co state -> 2
set_op_state_purged: op:ffff880011078000: process:pvfs2-client-co state -> 10
wait_for_matching_downcall: operation purged (tag 20210, ffff880011078000, att 0
service_operation: wait_for_matching_downcall returned -11 for ffff880011078000
Interrupted: Removed op ffff880011078000 from htable_ops_in_progress
tag 20210 (orangefs_create) -- operation to be retried (1 attempt)
service_operation: orangefs_create op:ffff880011078000: process:dbench: pid:1171
service_op: orangefs_create op:ffff880011078000: process:dbench state -> 1
service_operation:client core is NOT in service, ffff880011078000
orangefs_devreq_read: op:ffff880011078000: process:pvfs2-client-co state -> 2
WARNING: CPU: 0 PID: 1216 at fs/orangefs/devorangefs-req.c:423
set_op_state_serviced: op:ffff880011078000: process:pvfs2-client-co state -> 4
service_operation: wait_for_matching_downcall returned 0 for ffff880011078000
service_operation orangefs_create returning: 0 for ffff880011078000
orangefs_create: BENCHS.LWP: handle:00000000-0000-0000-0000-000000000000: fsid:0:
new_op:ffff880011078000: ret:0:
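
For anyone mapping the state numbers in that trace back to the flags,
here's a throwaway userspace sketch. Only the bit values come from
orangefs-kernel.h above; the op struct and the two helpers are boiled
down for illustration, not copied from the tree. It shows why the
purged op reads back as state 10 (INPROGR 2 with PURGED 8 OR'd on top)
and why it still isn't SERVICED:

#include <stdio.h>

/* Bit values quoted from orangefs-kernel.h above. */
enum {
        OP_VFS_STATE_UNKNOWN  = 0,
        OP_VFS_STATE_WAITING  = 1,
        OP_VFS_STATE_INPROGR  = 2,
        OP_VFS_STATE_SERVICED = 4,
        OP_VFS_STATE_PURGED   = 8,
        OP_VFS_STATE_GIVEN_UP = 16,
};

/* Toy stand-in for the real op struct; only the state word matters here. */
struct toy_op {
        int op_state;
};

#define op_state_purged(op)   ((op)->op_state & OP_VFS_STATE_PURGED)
#define op_state_serviced(op) ((op)->op_state & OP_VFS_STATE_SERVICED)

int main(void)
{
        struct toy_op op = { .op_state = OP_VFS_STATE_WAITING }; /* state -> 1 */

        op.op_state = OP_VFS_STATE_INPROGR;  /* client-core reads it: state -> 2 */
        op.op_state |= OP_VFS_STATE_PURGED;  /* client-core goes away: state -> 10 */

        printf("state %d purged %d serviced %d\n",
               op.op_state,
               !!op_state_purged(&op),
               !!op_state_serviced(&op));
        return 0;
}

That 2 -> 10 transition is what the first attempt logs right before
wait_for_matching_downcall gives up with -11 and the op gets retried.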
-Mike

On Thu, Feb 18, 2016 at 3:22 PM, Mike Marshall wrote:
> I haven't edited up a list of how the debug output looked,
> but most importantly: the WARN_ON is hit... it appears that
> the client-core is sending over fsid:0:
>
> -Mike
>
> On Thu, Feb 18, 2016 at 3:08 PM, Mike Marshall wrote:
>> I haven't been trussing it... it reports EINVAL to stderr... I find
>> the ops to look at in the debug output by looking for the -22...
>>
>> (373) open ./clients/client8/~dmtmp/PARADOX/STUDENTS.DB failed for
>> handle 9981 (Invalid argument)
>>
>> I just got the whacky code from Al's last message to compile, I'll
>> have results from that soon...
>>
>> -Mike
>>
>> On Thu, Feb 18, 2016 at 2:49 PM, Martin Brandenburg wrote:
>>> On Thu, 18 Feb 2016, Mike Marshall wrote:
>>>
>>>> Still busted, exactly the same, I think.  The doomed op gets a good
>>>> return code from is_daemon_in_service in service_operation but
>>>> gets EAGAIN from wait_for_matching_downcall... an edge case kind of
>>>> problem.
>>>>
>>>> Here's the raw (well, slightly edited for readability) logs showing
>>>> the doomed op and subsequent failed op that uses the bogus handle
>>>> and fsid from the doomed op.
>>>>
>>>>
>>>> Alloced OP (ffff880012898000: 10889 OP_CREATE)
>>>> service_operation: orangefs_create op:ffff880012898000:
>>>>
>>>>
>>>> wait_for_matching_downcall: operation purged (tag 10889, ffff880012898000, att 0
>>>> service_operation: wait_for_matching_downcall returned -11 for ffff880012898000
>>>> Interrupted: Removed op ffff880012898000 from htable_ops_in_progress
>>>> tag 10889 (orangefs_create) -- operation to be retried (1 attempt)
>>>> service_operation: orangefs_create op:ffff880012898000:
>>>> service_operation:client core is NOT in service, ffff880012898000
>>>>
>>>>
>>>> service_operation: wait_for_matching_downcall returned 0 for ffff880012898000
>>>> service_operation orangefs_create returning: 0 for ffff880012898000
>>>> orangefs_create: PPTOOLS1.PPA:
>>>> handle:00000000-0000-0000-0000-000000000000: fsid:0:
>>>> new_op:ffff880012898000: ret:0:
>>>>
>>>>
>>>> Alloced OP (ffff880012888000: 10958 OP_GETATTR)
>>>> service_operation: orangefs_inode_getattr op:ffff880012888000:
>>>> service_operation: wait_for_matching_downcall returned 0 for ffff880012888000
>>>> service_operation orangefs_inode_getattr returning: -22 for ffff880012888000
>>>> Releasing OP (ffff880012888000: 10958
>>>> orangefs_create: Failed to allocate inode for file :PPTOOLS1.PPA:
>>>> Releasing OP (ffff880012898000: 10889
>>>>
>>>>
>>>> What I'm testing with differs from what is at kernel.org#for-next by
>>>> - diffs from Al's most recent email
>>>> - 1 souped up gossip message
>>>> - changed 0 to OP_VFS_STATE_UNKNOWN one place in service_operation
>>>> - reinit_completion(&op->waitq) in orangefs_clean_up_interrupted_operation
>>>>
>>>
>>> Mike,
>>>
>>> what error do you get from userspace (i.e. from dbench)?
>>>
>>> open("./clients/client0/~dmtmp/EXCEL/5D7C0000", O_RDWR|O_CREAT, 0600) = -1 ENODEV (No such device)
>>>
>>> An interesting note is that I can't reproduce at all
>>> with only one dbench process. It seems there's not
>>> enough load.
>>>
>>> I don't see how the kernel could return ENODEV at all.
>>> This may be coming from our client-core.
>>>
>>> -- Martin
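
P.S. To make the failure mode in the traces concrete: the retried
create comes back with ret:0 but carries an all-zero handle and fsid:0,
and it's the getattr that follows which then fails with -22. Below is a
purely illustrative check, with made-up types and names rather than the
in-tree orangefs structs, of the kind of sanity test being talked
about:

#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Made-up stand-ins for the handle/fsid pair carried back by a create
 * downcall; NOT the in-tree orangefs types. */
struct toy_khandle {
        uint8_t u[16];
};

struct toy_object_ref {
        struct toy_khandle khandle;
        int32_t fs_id;
};

/* Reject a "successful" create whose object reference is all zeroes,
 * which is what the restarted client-core handed back in the traces. */
static int check_create_downcall(const struct toy_object_ref *ref)
{
        static const struct toy_khandle zero;

        if (ref->fs_id == 0 &&
            memcmp(&ref->khandle, &zero, sizeof(zero)) == 0)
                return -EIO;
        return 0;
}

int main(void)
{
        /* handle:00000000-0000-0000-0000-000000000000: fsid:0:, as logged */
        struct toy_object_ref from_log = { .fs_id = 0 };

        printf("create downcall %s\n",
               check_create_downcall(&from_log) ? "rejected" : "accepted");
        return 0;
}

Something along those lines in the kernel would only turn the bogus
create into a clean error, though... the root cause still looks like
the client-core handing back a zeroed create response after it
restarts.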