Date: Mon, 15 Feb 2016 17:47:23 -0500
Subject: Re: Orangefs ABI documentation
From: Mike Marshall
To: Al Viro
Cc: Martin Brandenburg, Linus Torvalds, linux-fsdevel, Stephen Rothwell

> The bufmap rewrite is really completely untested - it's done pretty
> much blindly and I'd be surprised as hell if it has no brainos at
> the first try.

You did pretty good, it takes me two tries to get hello world right...

Right off the bat, the kernel crashed, because:

static struct slot_map rw_map = {
        .c = -1,
        .q = __WAIT_QUEUE_HEAD_INITIALIZER(rw_map.q)
};
static struct slot_map readdir_map = {
        .c = -1,
        .q = __WAIT_QUEUE_HEAD_INITIALIZER(rw_map.q)
};
                                           ^
                                           |
                                         D'OH!

But with the obvious one-character fix (sketched at the end of this
mail), stuff almost worked...

It can still "sort of" wedge up. We think that when dbench is running
and the client-core is killed, you can hit

    orangefs_bufmap_finalize -> mark_killed -> run_down -> schedule()

while the extant ops' wait_for_completion_* calls in
wait_for_matching_downcall have also given up the processor...
Then... when you interrupt dbench, stuff starts flowing again...

I added a couple of gossip statements inside of mark_killed and
run_down (my reading of their shape is sketched at the end of this
mail, too)...

Feb 15 16:40:15 be1 kernel: [  349.981597] orangefs_bufmap_finalize: called
Feb 15 16:40:15 be1 kernel: [  349.981600] mark_killed enter
Feb 15 16:40:15 be1 kernel: [  349.981602] mark_killed: leave
Feb 15 16:40:15 be1 kernel: [  349.981603] mark_killed enter
Feb 15 16:40:15 be1 kernel: [  349.981605] mark_killed: leave
Feb 15 16:40:15 be1 kernel: [  349.981606] run_down: enter:-1:
Feb 15 16:40:15 be1 kernel: [  349.981608] run_down: leave
Feb 15 16:40:15 be1 kernel: [  349.981609] run_down: enter:-2:
Feb 15 16:40:15 be1 kernel: [  349.981610] run_down: before schedule:-2:

Stuff just sits here while dbench is still running. Then Ctrl-C on
dbench, and off to the races again:

Feb 15 16:42:28 be1 kernel: [  483.049927] *** wait_for_matching_downcall: operation interrupted by a signal (tag 16523, op ffff880013418000)
Feb 15 16:42:28 be1 kernel: [  483.049930] Interrupted: Removed op ffff880013418000 from htable_ops_in_progress
Feb 15 16:42:28 be1 kernel: [  483.049932] orangefs: service_operation orangefs_inode_getattr returning: -4 for ffff880013418000.
Feb 15 16:42:28 be1 kernel: [  483.050116] *** wait_for_matching_downcall: operation interrupted by a signal (tag 16518, op ffff8800001a8000)
Feb 15 16:42:28 be1 kernel: [  483.050118] Interrupted: Removed op ffff8800001a8000 from htable_ops_in_progress
Feb 15 16:42:28 be1 kernel: [  483.050120] orangefs: service_operation orangefs_inode_getattr returning: -4 for ffff8800001a8000.

Martin already has a patch... What do you think?
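For the record, getting past the crash took nothing more than pointing
readdir_map's waitqueue initializer at its own queue - assuming I read
the intent right, that's the whole fix:

static struct slot_map readdir_map = {
        .c = -1,
        .q = __WAIT_QUEUE_HEAD_INITIALIZER(readdir_map.q)  /* was rw_map.q */
};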
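And here's my rough reconstruction of the mark_killed/run_down pair
from the rewrite - paraphrased from my reading of the branch, so the
details may be off, but it shows why the gossip output looks the way
it does: one map had nothing outstanding (its count went straight to
-1 and run_down left immediately), the other still had a slot out
(-2), so run_down parked in schedule() waiting for a wakeup that can't
come while the slot holders are themselves asleep in
wait_for_completion...

struct slot_map {
        int c;                  /* free slots; -1 means "down" */
        wait_queue_head_t q;    /* waiters for a free slot */
        int count;              /* total slots in the map */
        unsigned long *map;     /* in-use bitmap */
};

static void mark_killed(struct slot_map *m)
{
        spin_lock(&m->q.lock);
        m->c -= m->count + 1;   /* drives c below -1 if any slot is out */
        spin_unlock(&m->q.lock);
}

static void run_down(struct slot_map *m)
{
        DEFINE_WAIT(wait);

        spin_lock(&m->q.lock);
        if (m->c != -1) {       /* slots still outstanding */
                for (;;) {
                        if (likely(list_empty(&wait.task_list)))
                                __add_wait_queue_tail(&m->q, &wait);
                        set_current_state(TASK_UNINTERRUPTIBLE);

                        if (m->c == -1) /* last slot came home */
                                break;

                        spin_unlock(&m->q.lock);
                        schedule();     /* <- where we sat wedged */
                        spin_lock(&m->q.lock);
                }
                __remove_wait_queue(&m->q, &wait);
                __set_current_state(TASK_RUNNING);
        }
        m->map = NULL;
        spin_unlock(&m->q.lock);
}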
I'm headed home for supper...

-Mike

On Mon, Feb 15, 2016 at 1:45 PM, Al Viro wrote:
> On Mon, Feb 15, 2016 at 12:46:51PM -0500, Mike Marshall wrote:
>> I pushed the list_del up to the kernel.org for-next branch...
>>
>> And I've been running tests with the CRUDE bandaid... weird
>> results...
>>
>> No oopses, no WARN_ONs... I was running dbench and ls -R
>> or find, and kill-minus-nining different ones of them with no
>> perceived resulting problems, so I moved on to signalling
>> the client-core to abort... it restarted numerous times,
>> and then stuff wedged up differently than I've seen before.
>
> There are other problems with that thing (starting with the fact that
> retrying readdir/wait_for_direct_io can try to grab a slot despite the
> bufmap winding down). OK, at that point I think we should try to see
> if the bufmap rewrite works - I've rebased on top of your branch and
> pushed (head at 8c3bc9a). The bufmap rewrite is really completely
> untested - it's done pretty much blindly and I'd be surprised as hell
> if it has no brainos at the first try.