Subject: Re: Orangefs ABI documentation
From: Mike Marshall
To: Al Viro
Cc: Martin Brandenburg, Linus Torvalds, linux-fsdevel, Stephen Rothwell
Date: Tue, 16 Feb 2016 18:15:56 -0500

This thing is invulnerable now! Nothing hangs when I kill the
client-core, and the client-core always restarts.

Sometimes, if you hit it just right with a kill while dbench is
running, a file create will fail. I've spent all day trying to track
down why, in case there's something that can be done... Here's what
I see:

  orangefs_create
    service_operation
      wait_for_matching_downcall purges op and returns -EAGAIN
      orangefs_clean_up_interrupted_operation
      if (EAGAIN) ... goto retry_servicing
      wait_for_matching_downcall returns 0
    service_operation returns 0
  orangefs_create gets a good return value from service_operation, but:
    op->khandle: 00000000-0000-0000-0000-000000000000
    op->fs_id: 0
  a subsequent getattr on the bogus object fails orangefs_create
  with EINVAL.

Seems like the second time around, wait_for_matching_downcall must
have seen op_state_serviced, but I don't see how yet...

I pushed the new patches out to
gitolite.kernel.org:pub/scm/linux/kernel/git/hubcap/linux for-next

I made a couple of additional patches that make the flow of gossip
statements easier to read, and also remove a few lines of vestigial
ASYNC code.

-Mike

On Mon, Feb 15, 2016 at 6:04 PM, Al Viro wrote:
> On Mon, Feb 15, 2016 at 05:32:54PM -0500, Martin Brandenburg wrote:
>
>> Something that used a slot, such as a reader, would call
>> service_operation while holding a bufmap slot. Then the client-core
>> would crash, and the kernel would get into run_down waiting on the
>> slots to be given up. But the slots are not given up until someone
>> wakes up all the processes waiting in service_operation, which in
>> turn happens only after all the slots are given up, so the two
>> sides deadlock. The client-core then hangs until someone sends a
>> deadly signal to all the processes waiting in service_operation,
>> or presumably until the timeout expires.
>>
>> This splits finalize and run_down so that orangefs_devreq_release
>> can mark the slot map as killed, then purge waiting ops, then wait
>> for all the slots to be released. Meanwhile, processes which were
>> waiting will get into orangefs_bufmap_get, which will see that the
>> slot map is shutting down and wait for the client-core to come back.
>
> D'oh. Yes, that was exactly the point of separating mark_dead and
> run_down - the latter should've been done after purging all requests.
> Fixes folded, branch force-pushed.
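
For context, the retry path Mike traces above looks roughly like this.
This is a condensed sketch, not the verbatim kernel source: it assumes
the OrangeFS client's struct orangefs_kernel_op_s and its helpers
(wait_for_matching_downcall, orangefs_clean_up_interrupted_operation,
op_state_serviced), and it elides the step that (re)queues the op for
the client-core before each wait.

/*
 * Condensed sketch of the retry_servicing flow in service_operation();
 * control flow simplified, queueing and locking omitted.
 */
static int service_operation_sketch(struct orangefs_kernel_op_s *op)
{
	int ret;

retry_servicing:
	/* sleep until the client-core answers, or until the op is purged */
	ret = wait_for_matching_downcall(op);
	if (ret == 0) {
		/*
		 * Per the trace above: on the second pass this returns 0,
		 * i.e. the op already looked serviced (op_state_serviced),
		 * yet op->downcall was never filled in, so khandle and
		 * fs_id are still zero when orangefs_create reads them.
		 */
		return 0;
	}

	/* the op was interrupted or purged; reset it for resubmission */
	orangefs_clean_up_interrupted_operation(op);
	if (ret == -EAGAIN)
		goto retry_servicing;	/* client-core restarted, try again */

	return ret;
}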
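
And the teardown ordering Martin and Al settle on reads roughly like
this. Again a sketch under assumptions: orangefs_bufmap_finalize() is
taken to be the mark-dead half of the old finalize, purge_waiting_ops()
the wakeup, and orangefs_bufmap_run_down() the wait for slots; bodies,
locking, and error handling are omitted.

/*
 * Sketch of the fixed shutdown order in orangefs_devreq_release()
 * when the client-core exits.
 */
static int orangefs_devreq_release_sketch(struct inode *inode,
					  struct file *file)
{
	/*
	 * 1. Mark the slot map as killed.  From here on,
	 *    orangefs_bufmap_get() sees the map shutting down and
	 *    waits for the client-core to come back rather than
	 *    taking a slot.
	 */
	orangefs_bufmap_finalize();

	/*
	 * 2. Purge waiting ops.  This wakes everything sleeping in
	 *    service_operation(), so slot holders can bail out with
	 *    -EAGAIN and release their slots.
	 */
	purge_waiting_ops();

	/*
	 * 3. Only now wait for the slots to drain.  Doing this before
	 *    the purge is the deadlock described above: run_down waits
	 *    on slots whose holders are still asleep.
	 */
	orangefs_bufmap_run_down();

	return 0;
}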