From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path:
Received: from mail-pf0-f173.google.com ([209.85.192.173]:35832 "EHLO
	mail-pf0-f173.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1947194AbcBRTto (ORCPT ); Thu, 18 Feb 2016 14:49:44 -0500
Received: by mail-pf0-f173.google.com with SMTP id c10so37934632pfc.2
	for ; Thu, 18 Feb 2016 11:49:44 -0800 (PST)
Date: Thu, 18 Feb 2016 14:49:41 -0500 (EST)
From: Martin Brandenburg
To: Mike Marshall
cc: Al Viro, Martin Brandenburg, Linus Torvalds, linux-fsdevel,
	Stephen Rothwell
Subject: Re: Orangefs ABI documentation
In-Reply-To:
Message-ID:
References: <20160215230434.GZ17997@ZenIV.linux.org.uk>
	<20160216233609.GE17997@ZenIV.linux.org.uk>
	<20160216235441.GF17997@ZenIV.linux.org.uk>
	<20160217230900.GP17997@ZenIV.linux.org.uk>
	<20160217231524.GQ17997@ZenIV.linux.org.uk>
	<20160218000439.GR17997@ZenIV.linux.org.uk>
	<20160218111122.GS17997@ZenIV.linux.org.uk>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID:

On Thu, 18 Feb 2016, Mike Marshall wrote:

> Still busted, exactly the same, I think. The doomed op gets a good
> return code from is_daemon_in_service in service_operation but
> gets EAGAIN from wait_for_matching_downcall... an edge case kind of
> problem.
>
> Here's the raw (well, slightly edited for readability) logs showing
> the doomed op and subsequent failed op that uses the bogus handle
> and fsid from the doomed op.
>
>
>
> Alloced OP (ffff880012898000: 10889 OP_CREATE)
> service_operation: orangefs_create op:ffff880012898000:
>
>
>
> wait_for_matching_downcall: operation purged (tag 10889, ffff880012898000, att 0
> service_operation: wait_for_matching_downcall returned -11 for ffff880012898000
> Interrupted: Removed op ffff880012898000 from htable_ops_in_progress
> tag 10889 (orangefs_create) -- operation to be retried (1 attempt)
> service_operation: orangefs_create op:ffff880012898000:
> service_operation:client core is NOT in service, ffff880012898000
>
>
>
> service_operation: wait_for_matching_downcall returned 0 for ffff880012898000
> service_operation orangefs_create returning: 0 for ffff880012898000
> orangefs_create: PPTOOLS1.PPA:
> handle:00000000-0000-0000-0000-000000000000: fsid:0:
> new_op:ffff880012898000: ret:0:
>
>
>
> Alloced OP (ffff880012888000: 10958 OP_GETATTR)
> service_operation: orangefs_inode_getattr op:ffff880012888000:
> service_operation: wait_for_matching_downcall returned 0 for ffff880012888000
> service_operation orangefs_inode_getattr returning: -22 for ffff880012888000
> Releasing OP (ffff880012888000: 10958
> orangefs_create: Failed to allocate inode for file :PPTOOLS1.PPA:
> Releasing OP (ffff880012898000: 10889
>
>
>
>
> What I'm testing with differs from what is at kernel.org#for-next by
> - diffs from Al's most recent email
> - 1 souped up gossip message
> - changed 0 to OP_VFS_STATE_UNKNOWN one place in service_operation
> - reinit_completion(&op->waitq) in orangefs_clean_up_interrupted_operation
>
>
>

Mike, what error do you get from userspace (i.e. from dbench)?

open("./clients/client0/~dmtmp/EXCEL/5D7C0000", O_RDWR|O_CREAT, 0600) = -1 ENODEV (No such device)

An interesting note is that I can't reproduce at all with only one
dbench process. It seems there's not enough load.

I don't see how the kernel could return ENODEV at all. This may be
coming from our client-core.

-- Martin