From: Brian Foster
Date: Wed, 2 Jun 2021 12:31:54 -0400
To: "Darrick J. Wong"
Wong" Cc: Dave Chinner , linux-xfs@vger.kernel.org Subject: Re: [PATCH RFC] xfs: hold buffer across unpin and potential shutdown processing Message-ID: References: <20210503121816.561340-1-bfoster@redhat.com> <20210503232539.GI63242@dread.disaster.area> <20210506025611.GM63242@dread.disaster.area> <20210602160238.GJ26380@locust> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20210602160238.GJ26380@locust> Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org On Wed, Jun 02, 2021 at 09:02:38AM -0700, Darrick J. Wong wrote: > On Wed, Jun 02, 2021 at 09:32:34AM -0400, Brian Foster wrote: > > On Thu, May 06, 2021 at 03:29:50PM -0400, Brian Foster wrote: > > > On Thu, May 06, 2021 at 12:56:11PM +1000, Dave Chinner wrote: > > > > On Wed, May 05, 2021 at 07:50:17AM -0400, Brian Foster wrote: > > > > > On Tue, May 04, 2021 at 09:25:39AM +1000, Dave Chinner wrote: > > > > > > On Mon, May 03, 2021 at 08:18:16AM -0400, Brian Foster wrote: > > > > > > i.e. the problem here is that we've dropped the bip->bli_refcount > > > > > > before we've locked the buffer and taken a reference to it for > > > > > > the fail path? > > > > > > > > > > > > OK, I see that xfs_buf_item_done() (called from ioend processing) > > > > > > simply frees the buf log item and doesn't care about the bli > > > > > > refcount at all. So the first ioend caller will free the buf log > > > > > > item regardless of whether there are other references to it at all. > > > > > > > > > > > > IOWs, once we unpin the buffer, the bli attached to the buffer and > > > > > > being tracked in the AIL has -zero- references to the bli and so it > > > > > > gets freed unconditionally on IO completion. > > > > > > > > > > > > That seems to the be the problem here - the bli is not reference > > > > > > counted while it is the AIL.... > > > > > > > > > > > > > > > > I think it depends on how you look at it. As you point out, we've had > > > > > this odd bli reference count pattern for as long as I can remember where > > > > > > > > Yes, and it's been a constant source of use-after free bugs in > > > > shutdown processing for as long as I can remember. I want to fix it > > > > so we don't have to keep band-aiding this code every time we change > > > > how some part of log item or stale inode/buffer processing works... > > > > > > > > > > IME, most of the bugs in this area tend to be shutdown/error related and > > > generally relate to the complexity of all the various states and > > > contexts for the different callback contexts. That isn't a direct result > > > of this bli refcount behavior, as odd as it is, though that certainly > > > contributes to the overall complexity. > > > > > > > > > > > > > > > > > > > - freed = atomic_dec_and_test(&bip->bli_refcount); > > > > > > - > > > > > > + /* > > > > > > + * We can wake pin waiters safely now because we still hold the > > > > > > + * bli_refcount that was taken when the pin was gained. 
> > > > > > + */ > > > > > > if (atomic_dec_and_test(&bp->b_pin_count)) > > > > > > wake_up_all(&bp->b_waiters); > > > > > > > > > > > > - if (freed && stale) { > > > > > > - ASSERT(bip->bli_flags & XFS_BLI_STALE); > > > > > > - ASSERT(xfs_buf_islocked(bp)); > > > > > > - ASSERT(bp->b_flags & XBF_STALE); > > > > > > - ASSERT(bip->__bli_format.blf_flags & XFS_BLF_CANCEL); > > > > > > - > > > > > > - trace_xfs_buf_item_unpin_stale(bip); > > > > > > - > > > > > > - if (remove) { > > > > > > - /* > > > > > > - * If we are in a transaction context, we have to > > > > > > - * remove the log item from the transaction as we are > > > > > > - * about to release our reference to the buffer. If we > > > > > > - * don't, the unlock that occurs later in > > > > > > - * xfs_trans_uncommit() will try to reference the > > > > > > - * buffer which we no longer have a hold on. > > > > > > - */ > > > > > > - if (!list_empty(&lip->li_trans)) > > > > > > - xfs_trans_del_item(lip); > > > > > > - > > > > > > - /* > > > > > > - * Since the transaction no longer refers to the buffer, > > > > > > - * the buffer should no longer refer to the transaction. > > > > > > - */ > > > > > > - bp->b_transp = NULL; > > > > > > + if (!stale) { > > > > > > + if (!remove) { > > > > > > + /* Nothing to do but drop the refcount the pin owned. */ > > > > > > + atomic_dec(&bip->bli_refcount); > > > > > > + return; > > > > > > } > > > > > > > > > > Hmm.. this seems a bit wonky to me. This code historically acts on the > > > > > drop of the final reference to the bli. > > > > > > > > Yes, and that's the problem that needs fixing. The AIL needs a > > > > reference so that we aren't racing with writeback from the AIL to > > > > free the object because both sets of code run without actually > > > > holding an active reference to the BLI... > > > > > > > > > This is not critical for the > > > > > common (!stale && !remove) case because that's basically a no-op here > > > > > outside of dropping the reference, and it looks like the stale buffer > > > > > handling code further down continues to follow that model, but in this > > > > > branch it seems we're trying to be clever in how the reference is > > > > > managed and as a result can act on a bli that might actually have > > > > > additional references. > > > > > > > > Who cares? If something else has active references, then we must not > > > > free the bli or buffer here, anyway. The lack of active references > > > > by active BLI usres is why we keep getting use-after-free bugs in > > > > this code..... > > > > > > > > > > Well, this impacts more than just whether we free the buffer or not in > > > the abort case. This potentially runs the I/O failure sequence on a > > > pinned buffer, or blocks log I/O completion on a buffer lock that might > > > be held by a transaction. > > > > > > I don't know if these are immediate problems or not and this is all > > > abort/shutdown related, but re: my point above around shutdown issues, > > > I'd prefer to try and avoid these kind of oddball quirks if we can vs. > > > just replace the historical quirks with new ones. > > > > > > > > If so, I don't think it's appropriate to run > > > > > through the error sequence that follows. > > > > > > > > > > > > > > > > > /* > > > > > > - * If we get called here because of an IO error, we may or may > > > > > > - * not have the item on the AIL. xfs_trans_ail_delete() will > > > > > > - * take care of that situation. xfs_trans_ail_delete() drops > > > > > > - * the AIL lock. 
> > > > > > - */ > > > > > > - if (bip->bli_flags & XFS_BLI_STALE_INODE) { > > > > > > - xfs_buf_item_done(bp); > > > > > > - xfs_buf_inode_iodone(bp); > > > > > > - ASSERT(list_empty(&bp->b_li_list)); > > > > > > - } else { > > > > > > - xfs_trans_ail_delete(lip, SHUTDOWN_LOG_IO_ERROR); > > > > > > - xfs_buf_item_relse(bp); > > > > > > - ASSERT(bp->b_log_item == NULL); > > > > > > - } > > > > > > - xfs_buf_relse(bp); > > > > > > - } else if (freed && remove) { > > > > > > - /* > > > > > > + * Fail the IO before we drop the bli refcount. This guarantees > > > > > > + * that a racing writeback completion also failing the buffer > > > > > > + * and running completion will not remove the last reference to > > > > > > + * the bli and free it from under us. > > > > > > + * > > > > > > * The buffer must be locked and held by the caller to simulate > > > > > > * an async I/O failure. > > > > > > */ > > > > > > @@ -559,7 +555,62 @@ xfs_buf_item_unpin( > > > > > > xfs_buf_hold(bp); > > > > > > bp->b_flags |= XBF_ASYNC; > > > > > > xfs_buf_ioend_fail(bp); > > > > > > + xfs_buf_item_relse(bp); > > > > > > > > > > Did you mean for this to be xfs_buf_item_put() instead of _relse()? The > > > > > > > > Yes. I did say "untested" which implies the patch isn't complete or will > > > > work. It's just a demonstration of how this reference counting might > > > > be done, not a complete, working solution. Patches are a much faster > > > > way of explaining the overall solution that plain text... > > > > > > > > > > Sure, I'm just trying to clarify intent. It's a refcount patch so > > > whether we drop a refcount or explicitly free an objects is particularly > > > relevant. ;) > > > > > > > > > } > > > > > > + > > > > > > + /* > > > > > > + * Stale buffer - only process it if this is the last reference to the > > > > > > + * BLI. If this is the last BLI reference, then the buffer will be > > > > > > + * locked and have two references - once from the transaction commit and > > > > > > + * one from the BLI - and we do not unlock and release transaction > > > > > > + * reference until we've finished cleaning up the BLI. > > > > > > + */ > > > > > > + if (!atomic_dec_and_test(&bip->bli_refcount)) > > > > > > + return; > > > > > > + > > > > > > > > > > If the buffer is stale, will this ever be the last reference now that > > > > > _item_committed() bumps the refcount? > > > > > > > > If the AIL has a reference, then no. > > > > > > > > > > Ok, but how would we get here without an AIL reference? That seems > > > impossible to me based on your patch. > > > > > > > > This change also seems to have > > > > > ramifications for the code that follows, such as if a staled buffer is > > > > > already in the AIL (with a bli ref), would this code ever get to the > > > > > point of removing it? > > > > > > > > That's easy enough to handle - if the buffer is stale and we are in > > > > this code, we hold the buffer locked. Hence we can remove from the > > > > AIL if it is in the AIL and drop that reference, too. Indeed, this > > > > code already does the AIL removal, so all this requires is a simple > > > > rearrangement of the logic in this function.... > > > > > > > > > > Hmm, I don't think that's a correct assertion. If the buffer is stale > > > and we're in this code and we have the last reference to the bli, then > > > we know we hold the buffer locked. Otherwise, ISTM we can get here while > > > the transaction that staled the buffer might still own the lock. 
> > > > > > > The only difference is that we have to do this before we drop the > > > > current reference we have on the BLI, which is not what we do now > > > > and that's where all the problems lie. > > > > > > > > > All in all, I'll reiterate that I think it would be nice to fix up the > > > > > bli reference count handling in general, but I think the scope and > > > > > complexity of that work is significantly beyond what is reasonably > > > > > necessary to fix this bug. > > > > > > > > And so leaving the underlying problem in the code for the next set > > > > of changes someone does to trigger the problem in a differen way. > > > > We've indentified what the root cause is, so can we please spend > > > > the time to fix it properly? > > > > > > > > > > That is not what I'm saying at all. I'm saying that I'd prefer to fix > > > the bug first and then step back and evaluate the overall refcount > > > design independently because the latter is quite complex and there are > > > all kinds of subtle interactions that the RFC patch just glazes over (by > > > design). For example, how log recovery processes bli's slightly > > > differently looks like yet another impedence mismatch from when the fs > > > is fully active. > > > > > > > Just an update on this particular bit... > > > > I've probably spent about a week and a half now working through some > > attempts to rework the bli refcount handling (going back to the drawing > > board at least a couple times) and while I can get to something mostly > > functional (surviving an fstests run), the last steps to test/prove a > > solid/reliable implementation end up falling short. On extended testing > > I end up sorting through the same kind of shutdown/unmount hang, use > > after free, extremely subtle interactions that I've spent many hours > > trying to stamp out over the past several years. > > > > Because of that, I don't think this is something worth pushing for in > > the short term. While the proposed idea sounds nice in theory, my take > > away in practice is that the current design is in place for a reason > > (i.e. the refcount historically looks like a transaction reference count > > used to provide unlocked/isolated access through the log subsystem as > > opposed to a traditional memory object lifecycle reference count) and > > has quite a lot of incremental stability that has been baked in over > > time. > > > > I do still think this is (or should be :P) ultimately fixable, but I'm > > wondering if we're probably better served by exploring some of the > > historical warts and subtle rules/dependencies/hurdles of the current > > model to see if they can be removed or simplified to the point where the > > reference counting incrementally and naturally becomes a bit more > > straightforward. I probably need to reset my brain and think a little > > more about that... > > What do you want to do in the meantime? > > How about: Elevate this patch from RFC to regular patch status (perhaps > with an XXX comment outlining the gap as a breadcrumb to remind future > us?) and merge that so that we at least fix the immediate UAF problem? > Well I had already sent a v1 [1] of the original UAF fix independent from this work because I didn't want to bury the bug fix behind a broader rework. I just wanted to follow up here because I've been looking into the latter and this is where most of the discussion around the design rework took place. 
That said, I am currently looking at (yet another) alternative approach
that might be more in the direction of broadly cleaning things up vs.
stamping out an isolated problem. I just don't know how it will turn out
quite yet, so I could go either way between waiting on that and possibly
sending a v2, or going with v1 for now (assuming it's reviewed) and
basing subsequent cleanups on that. My instinct is that leaving a record
of an isolated fix is probably the proper thing to do regardless, but I
don't feel that strongly about it.

Brian

[1] https://lore.kernel.org/linux-xfs/20210511135257.878743-1-bfoster@redhat.com/

> I suspected that figuring out all the subtleties of the bli lifetimes
> would be an intense effort.
>
> --D
>
> >
> > Brian
> >
> > > So I propose to rework my patch a bit into something that reflects the
> > > intended direction (see below for an untested diff) and proceed from
> > > there...
> > >
> > > Brian
> > >
> > > --- 8< ---
> > >
> > > diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> > > index fb69879e4b2b..7ff31788512b 100644
> > > --- a/fs/xfs/xfs_buf_item.c
> > > +++ b/fs/xfs/xfs_buf_item.c
> > > @@ -475,17 +475,8 @@ xfs_buf_item_pin(
> > >  }
> > >
> > >  /*
> > > - * This is called to unpin the buffer associated with the buf log
> > > - * item which was previously pinned with a call to xfs_buf_item_pin().
> > > - *
> > > - * Also drop the reference to the buf item for the current transaction.
> > > - * If the XFS_BLI_STALE flag is set and we are the last reference,
> > > - * then free up the buf log item and unlock the buffer.
> > > - *
> > > - * If the remove flag is set we are called from uncommit in the
> > > - * forced-shutdown path. If that is true and the reference count on
> > > - * the log item is going to drop to zero we need to free the item's
> > > - * descriptor in the transaction.
> > > + * This is called to unpin the buffer associated with the buf log item which
> > > + * was previously pinned with a call to xfs_buf_item_pin().
> > >   */
> > >  STATIC void
> > >  xfs_buf_item_unpin(
> > > @@ -502,12 +493,26 @@ xfs_buf_item_unpin(
> > >
> > >          trace_xfs_buf_item_unpin(bip);
> > >
> > > +        /*
> > > +         * Drop the bli ref associated with the pin and grab the hold required
> > > +         * for the I/O simulation failure in the abort case. We have to do this
> > > +         * before the pin count drops because the AIL doesn't acquire a bli
> > > +         * reference. Therefore if the refcount drops to zero, the bli could
> > > +         * still be AIL resident and the buffer submitted for I/O (and freed on
> > > +         * completion) at any point before we return. This can be removed once
> > > +         * the AIL properly holds a reference on the bli.
> > > +         */
> > >          freed = atomic_dec_and_test(&bip->bli_refcount);
> > > -
> > > +        if (freed && !stale && remove)
> > > +                xfs_buf_hold(bp);
> > >          if (atomic_dec_and_test(&bp->b_pin_count))
> > >                  wake_up_all(&bp->b_waiters);
> > >
> > > -        if (freed && stale) {
> > > +        /* nothing to do but drop the pin count if the bli is active */
> > > +        if (!freed)
> > > +                return;
> > > +
> > > +        if (stale) {
> > >                  ASSERT(bip->bli_flags & XFS_BLI_STALE);
> > >                  ASSERT(xfs_buf_islocked(bp));
> > >                  ASSERT(bp->b_flags & XBF_STALE);
> > > @@ -550,13 +555,13 @@ xfs_buf_item_unpin(
> > >                  ASSERT(bp->b_log_item == NULL);
> > >          }
> > >          xfs_buf_relse(bp);
> > > -        } else if (freed && remove) {
> > > +        } else if (remove) {
> > >                  /*
> > >                   * The buffer must be locked and held by the caller to simulate
> > > -                 * an async I/O failure.
> > > +                 * an async I/O failure. We acquired the hold for this case
> > > +                 * before the buffer was unpinned.
> > >                   */
> > >                  xfs_buf_lock(bp);
> > > -                xfs_buf_hold(bp);
> > >                  bp->b_flags |= XBF_ASYNC;
> > >                  xfs_buf_ioend_fail(bp);
> > >          }
> > >
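
(And to make the "AIL properly holds a reference on the bli" direction
mentioned in the comment above concrete, here is a hypothetical sketch
of what the AIL side might look like. The example_* helpers are
invented and not part of any posted patch.)

/* AIL insertion would take its own bli reference... */
static void
example_bli_ail_insert(
        struct xfs_log_item     *lip)
{
        struct xfs_buf_log_item *bip = BUF_ITEM(lip);

        atomic_inc(&bip->bli_refcount); /* the AIL now owns a reference */
        /* ... followed by the usual xfs_trans_ail_update() ... */
}

/* ...and AIL removal would drop it, freeing the bli on the final ref. */
static void
example_bli_ail_remove(
        struct xfs_log_item     *lip)
{
        struct xfs_buf_log_item *bip = BUF_ITEM(lip);

        /* ... after the usual xfs_trans_ail_delete() ... */
        if (atomic_dec_and_test(&bip->bli_refcount))
                xfs_buf_item_relse(bip->bli_buf);       /* frees bp->b_log_item */
}

With something like that in place, the unpin path would no longer need
the pre-emptive xfs_buf_hold() dance in the diff above, because unpin
and AIL writeback could never race over the final bli reference.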