From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path:
Received: from mail-lf0-f51.google.com ([209.85.215.51]:33674 "EHLO
	mail-lf0-f51.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1751560AbcBNWbM (ORCPT );
	Sun, 14 Feb 2016 17:31:12 -0500
Received: by mail-lf0-f51.google.com with SMTP id m1so79357508lfg.0
	for ; Sun, 14 Feb 2016 14:31:11 -0800 (PST)
MIME-Version: 1.0
In-Reply-To: <20160214025615.GU17997@ZenIV.linux.org.uk>
References: <20160209221623.GI17997@ZenIV.linux.org.uk>
	<20160209224050.GJ17997@ZenIV.linux.org.uk>
	<20160209231328.GK17997@ZenIV.linux.org.uk>
	<20160211004432.GM17997@ZenIV.linux.org.uk>
	<20160212042757.GP17997@ZenIV.linux.org.uk>
	<20160213174738.GR17997@ZenIV.linux.org.uk>
	<20160214025615.GU17997@ZenIV.linux.org.uk>
Date: Sun, 14 Feb 2016 17:31:10 -0500
Message-ID:
Subject: Re: Orangefs ABI documentation
From: Mike Marshall
To: Al Viro
Cc: Martin Brandenburg, Linus Torvalds, linux-fsdevel, Stephen Rothwell
Content-Type: text/plain; charset=UTF-8
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID:

I added the list_del...

Everything is very resilient. I killed the client-core over and over
while dbench was running at the same time as ls -R was running, and
the client-core always restarted... until finally, it didn't. I guess
that's related to the state of just what was going on at the time...
Hit the WARN_ON in service_operation, and then oopsed on the
orangefs_bufmap_put down at the end of wait_for_direct_io...

http://myweb.clemson.edu/~hubcap/after.list_del

-Mike

On Sat, Feb 13, 2016 at 9:56 PM, Al Viro wrote:
> On Sat, Feb 13, 2016 at 05:47:38PM +0000, Al Viro wrote:
>> On Sat, Feb 13, 2016 at 12:18:12PM -0500, Mike Marshall wrote:
>> > I added the patches, and ran a bunch of tests.
>> >
>> > Stuff works fine when left unbothered, and also
>> > when wrenches are thrown into the works.
>> >
>> > I had multiple userspace things going on at the
>> > same time, dbench, ls -R, find... kill -9 or control-C on
>> > any of them is handled well. When I killed both
>> > the client-core and its restarter, the kernel
>> > dealt with the swarm of ops that had nowhere
>> > to go... the WARN_ON in service_operation
>> > was hit.
>> >
>> > Feb 12 16:19:12 be1 kernel: [ 3658.167544] orangefs: please confirm
>> > that pvfs2-client daemon is running.
>> > Feb 12 16:19:12 be1 kernel: [ 3658.167547] fs/orangefs/dir.c line 264:
>> > orangefs_readdir: orangefs_readdir_index_get() failure (-5)
>>
>> I.e. bufmap is gone.
>>
>> > Feb 12 16:19:12 be1 kernel: [ 3658.170741] ------------[ cut here ]------------
>> > Feb 12 16:19:12 be1 kernel: [ 3658.170746] WARNING: CPU: 0 PID: 1667
>> > at fs/orangefs/waitqueue.c:203 service_operation+0x4f6/0x7f0()
>>
>> ... and we are in wait_for_direct_io(), holding an r/w slot and finding
>> ourselves with bufmap already gone, despite not having freed that slot
>> yet. Bloody wonderful - we still have bufmap refcounting buggered somewhere.
>>
>> Which tree had that been? Could you push that tree (having checked that
>> you don't have any uncommitted changes) in some branch?
>
> OK, at the very least there's this; should be folded into "orangefs: delay
> freeing slot until cancel completes"
>
> diff --git a/fs/orangefs/orangefs-kernel.h b/fs/orangefs/orangefs-kernel.h
> index 41f8bb1f..1e28555 100644
> --- a/fs/orangefs/orangefs-kernel.h
> +++ b/fs/orangefs/orangefs-kernel.h
> @@ -261,6 +261,7 @@ static inline void set_op_state_purged(struct orangefs_kernel_op_s *op)
>  {
>  	spin_lock(&op->lock);
>  	if (unlikely(op_is_cancel(op))) {
> +		list_del(&op->list);
>  		spin_unlock(&op->lock);
>  		put_cancel(op);
>  	} else {
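
For reference, here is a minimal userspace sketch of the lifetime bug
that hunk closes: if the final put can run while the op is still linked,
the next walk of the list touches freed memory, so the op has to be
unlinked before the reference is dropped. The names below (op_list,
put_op, purge_cancel) only loosely mirror the kernel ones and are
assumptions for illustration, not the actual orangefs code; locking is
elided.

#include <assert.h>
#include <stdlib.h>

struct list_head { struct list_head *next, *prev; };

static void INIT_LIST_HEAD(struct list_head *h) { h->next = h->prev = h; }

static void list_add(struct list_head *n, struct list_head *h)
{
	n->next = h->next;
	n->prev = h;
	h->next->prev = n;
	h->next = n;
}

static void list_del(struct list_head *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
}

static int list_empty(const struct list_head *h) { return h->next == h; }

struct op {
	struct list_head list;	/* linkage on op_list */
	int refcount;
};

static struct list_head op_list;

/* Drop a reference; the op is freed on the final put. */
static void put_op(struct op *op)
{
	if (--op->refcount == 0)
		free(op);
}

/*
 * Purging a cancel op: without the list_del, the final put_op()
 * could free the op while op_list still links to it, and the next
 * list walk would read freed memory -- the same shape as the hunk
 * above, which unlinks under op->lock before calling put_cancel().
 */
static void purge_cancel(struct op *op)
{
	list_del(&op->list);	/* unlink first, then drop the ref */
	put_op(op);
}

int main(void)
{
	struct op *op = calloc(1, sizeof(*op));

	INIT_LIST_HEAD(&op_list);
	op->refcount = 1;
	list_add(&op->list, &op_list);

	purge_cancel(op);
	assert(list_empty(&op_list));	/* no dangling linkage left */
	return 0;
}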