From: Mike Marshall
To: Martin Brandenburg
Cc: Al Viro, Linus Torvalds, linux-fsdevel, Stephen Rothwell
Date: Thu, 18 Feb 2016 15:22:33 -0500
Subject: Re: Orangefs ABI documentation
References: <20160215230434.GZ17997@ZenIV.linux.org.uk>
 <20160216233609.GE17997@ZenIV.linux.org.uk>
 <20160216235441.GF17997@ZenIV.linux.org.uk>
 <20160217230900.GP17997@ZenIV.linux.org.uk>
 <20160217231524.GQ17997@ZenIV.linux.org.uk>
 <20160218000439.GR17997@ZenIV.linux.org.uk>
 <20160218111122.GS17997@ZenIV.linux.org.uk>

I haven't edited up a list of how the debug output looked, but most
importantly: the WARN_ON is hit... it appears that the client-core
is sending over fsid:0:

-Mike

On Thu, Feb 18, 2016 at 3:08 PM, Mike Marshall wrote:
> I haven't been trussing it... it reports EINVAL to stderr... I find
> the ops to look at in the debug output by looking for the -22...
>
> (373) open ./clients/client8/~dmtmp/PARADOX/STUDENTS.DB failed for
> handle 9981 (Invalid argument)
>
> I just got the whacky code from Al's last message to compile, I'll
> have results from that soon...
>
> -Mike
>
> On Thu, Feb 18, 2016 at 2:49 PM, Martin Brandenburg wrote:
>> On Thu, 18 Feb 2016, Mike Marshall wrote:
>>
>>> Still busted, exactly the same, I think. The doomed op gets a good
>>> return code from is_daemon_in_service in service_operation but
>>> gets EAGAIN from wait_for_matching_downcall... an edge case kind of
>>> problem.
>>>
>>> Here's the raw (well, slightly edited for readability) logs showing
>>> the doomed op and subsequent failed op that uses the bogus handle
>>> and fsid from the doomed op.
>>>
>>> Alloced OP (ffff880012898000: 10889 OP_CREATE)
>>> service_operation: orangefs_create op:ffff880012898000:
>>>
>>> wait_for_matching_downcall: operation purged (tag 10889, ffff880012898000, att 0
>>> service_operation: wait_for_matching_downcall returned -11 for ffff880012898000
>>> Interrupted: Removed op ffff880012898000 from htable_ops_in_progress
>>> tag 10889 (orangefs_create) -- operation to be retried (1 attempt)
>>> service_operation: orangefs_create op:ffff880012898000:
>>> service_operation:client core is NOT in service, ffff880012898000
>>>
>>> service_operation: wait_for_matching_downcall returned 0 for ffff880012898000
>>> service_operation orangefs_create returning: 0 for ffff880012898000
>>> orangefs_create: PPTOOLS1.PPA:
>>> handle:00000000-0000-0000-0000-000000000000: fsid:0:
>>> new_op:ffff880012898000: ret:0:
>>>
>>> Alloced OP (ffff880012888000: 10958 OP_GETATTR)
>>> service_operation: orangefs_inode_getattr op:ffff880012888000:
>>> service_operation: wait_for_matching_downcall returned 0 for ffff880012888000
>>> service_operation orangefs_inode_getattr returning: -22 for ffff880012888000
>>> Releasing OP (ffff880012888000: 10958
>>> orangefs_create: Failed to allocate inode for file :PPTOOLS1.PPA:
>>> Releasing OP (ffff880012898000: 10889
>>>
>>> What I'm testing with differs from what is at kernel.org#for-next by
>>> - diffs from Al's most recent email
>>> - 1 souped up gossip message
>>> - changed 0 to OP_VFS_STATE_UNKNOWN one place in service_operation
>>> - reinit_completion(&op->waitq) in orangefs_clean_up_interrupted_operation
>>>
>>
>> Mike,
>>
>> what error do you get from userspace (i.e. from dbench)?
>>
>> open("./clients/client0/~dmtmp/EXCEL/5D7C0000", O_RDWR|O_CREAT, 0600) = -1 ENODEV (No such device)
>>
>> An interesting note is that I can't reproduce at all
>> with only one dbench process. It seems there's not
>> enough load.
>>
>> I don't see how the kernel could return ENODEV at all.
>> This may be coming from our client-core.
>>
>> -- Martin
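
The triage step mentioned in this thread, finding the ops of interest
by looking for the -22 in the debug output, could be scripted roughly
as follows. This is only a sketch in Python: the helper names
(failed_ops, lines_for_op) and the regular expression are assumptions
based on the log excerpts quoted in this message, not part of OrangeFS
or its tooling, and SAMPLE simply repeats a few of those quoted lines.

```python
import re

# Assumed line shape, taken from the quoted debug output, e.g.:
#   service_operation orangefs_inode_getattr returning: -22 for ffff880012888000
RETURN_RE = re.compile(
    r"service_operation (\w+) returning: (-?\d+) for ([0-9a-f]+)"
)


def failed_ops(log_text, code=-22):
    """Return (op_name, op_pointer) for every op that returned `code`."""
    hits = []
    for line in log_text.splitlines():
        match = RETURN_RE.search(line)
        if match and int(match.group(2)) == code:
            hits.append((match.group(1), match.group(3)))
    return hits


def lines_for_op(log_text, op_pointer):
    """Return every log line that mentions the given op pointer."""
    return [line for line in log_text.splitlines() if op_pointer in line]


# A few lines copied from the log excerpt quoted in this message.
SAMPLE = """\
Alloced OP (ffff880012888000: 10958 OP_GETATTR)
service_operation: orangefs_inode_getattr op:ffff880012888000:
service_operation: wait_for_matching_downcall returned 0 for ffff880012888000
service_operation orangefs_inode_getattr returning: -22 for ffff880012888000
service_operation orangefs_create returning: 0 for ffff880012898000
"""

if __name__ == "__main__":
    # Print each op that returned -22, followed by all lines mentioning it.
    for name, pointer in failed_ops(SAMPLE):
        print(name, pointer)
        for line in lines_for_op(SAMPLE, pointer):
            print("    " + line)
```

Passing code=0 to failed_ops lists the ops that reported success, which
is one way to pick out a create that returned 0 and then check by hand
whether its handle and fsid were all zeros.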