From: Dave Chinner <david@fromorbit.com>
To: linux-xfs@vger.kernel.org
Subject: [RFC] [PATCH 00/36] xfs: more work towards shrinking.
Date: Fri,  3 Dec 2021 11:00:35 +1100	[thread overview]
Message-ID: <20211203000111.2800982-1-david@fromorbit.com> (raw)

Note: this is a "heads up" at this point so that people can see what
is coming down the line and make early comments, not a request to
consider these for merging soon.

This series continues the work towards making shrinking a filesystem
possible.  We need to be able to stop operations from taking place
on AGs that need to be removed by a shrink, so before shrink can be
implemented we need to have the infrastructure in place to prevent
incursion into AGs that are going to be, or are in the process of
being, removed from active duty.

The focus of this is making operations that depend on access to AGs
use the perag to access and pin the AG in active use, thereby
creating a barrier we can use to delay shrink until all active uses
have been drained and new uses are prevented.

This series starts by driving the perag down into the AGI, AGF and
AGFL access routines and unifies the perag structure initialisation
with the high level AG header read functions. This largely replaces
the xfs_mount/agno pair that is passed to all these functions with a
perag, and in most places we already have a perag ready to pass in.
There are a few places where perags need to be grabbed before
reading the AG header buffers - some of these will need to be driven
to higher layers to ensure we can run operations on AGs without
getting stuck part way through waiting on a perag reference.

The next section of this patchset moves some of the AG geometry
information from the xfs_mount to the xfs_perag, and starts
converting code that requires geometry validation to use a perag
instead of a mount and having to extract the AGNO from the object
location. This also allows us to store the AG size in the perag and
then we can stop having to compare the agno against sb_agcount to
determine if the AG is the last AG and so has a runt size.  This
greatly simplifies some of the type validity checking we do and
substantially reduces the CPU overhead of type validity checking. It
also cuts over 1.2kB out of the binary size.
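
As a rough userspace illustration of the geometry idea above (not the
code in the patches): the AG size is computed once, at perag
initialisation, and validation then only needs the perag. The names
demo_geom_perag, demo_perag_init_geometry() and demo_verify_agbno()
are hypothetical, and the sketch ignores the AG header blocks that a
real check would also have to exclude.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the perag; only cached geometry is modelled. */
struct demo_geom_perag {
	uint32_t	agno;
	uint32_t	block_count;	/* size of this AG in blocks */
};

/*
 * Compute the AG size once at init time. Only the last AG can be a
 * runt: it holds whatever is left over after the full-sized AGs.
 */
static void demo_perag_init_geometry(struct demo_geom_perag *pag,
				     uint64_t dblocks, uint32_t agblocks,
				     uint32_t agcount, uint32_t agno)
{
	assert(agno < agcount);
	pag->agno = agno;
	if (agno < agcount - 1)
		pag->block_count = agblocks;
	else
		pag->block_count =
			(uint32_t)(dblocks - (uint64_t)agno * agblocks);
}

/* Validate an AG-relative block number without looking at agcount. */
static bool demo_verify_agbno(const struct demo_geom_perag *pag,
			      uint32_t agbno)
{
	return agbno < pag->block_count;
}
```

With 1000 filesystem blocks and 300 blocks per AG, the fourth AG gets
a cached size of 100 blocks, and the verifier never has to compare
the agno against the AG count again.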

The series then starts converting the code to use active references.
Active reference counts are used by high level code that needs to
prevent the AG from being taken out from under it by a shrink
operation. The high level code needs to be able to handle not
getting an active reference gracefully, and the shrink code will
need to wait for active references to drain before continuing.

Active references are implemented just as reference counts right now
- an active reference is taken at perag init during mount, and all
other active references are dependent on the active reference count
being greater than zero. This gives us an initial method of stopping
new active references without needing other infrastructure; just
drop the reference taken at filesystem mount time and when the
refcount then falls to zero no new references can be taken.
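
A minimal userspace sketch of that scheme, using C11 atomics rather
than the kernel's atomic_t API. The names demo_perag,
demo_perag_grab() and demo_perag_rele() are hypothetical and are not
taken from the patches; the point is only that a grab succeeds solely
while the count is still above zero.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the perag; only the active count is modelled. */
struct demo_perag {
	atomic_int	active_ref;
};

/* Mount time: the perag starts with one active reference. */
static void demo_perag_init(struct demo_perag *pag)
{
	atomic_init(&pag->active_ref, 1);
}

/* Take an active reference only if the count is still above zero. */
static bool demo_perag_grab(struct demo_perag *pag)
{
	int old = atomic_load(&pag->active_ref);

	while (old > 0) {
		/* On failure, old is reloaded and the loop re-checks it. */
		if (atomic_compare_exchange_weak(&pag->active_ref,
						 &old, old + 1))
			return true;
	}
	return false;
}

/* Drop an active reference; returns true if this was the last one. */
static bool demo_perag_rele(struct demo_perag *pag)
{
	int old = atomic_fetch_sub(&pag->active_ref, 1);

	assert(old > 0);
	return old == 1;
}
```

In this model, shrink would drop the mount-time reference with
demo_perag_rele() and wait for the count to reach zero; from then on
every demo_perag_grab() fails and callers have to cope with that.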

In future, this will need to take into account AG control state
(e.g. offline, no alloc, etc) as well as the reference count, but
right now we can implement a basic barrier for shrink with just
reference count manipulations. There are patches to convert the
perag state to atomic opstate fields similar to the xfs_mount and
xlog opstate fields in preparation for this.

The first target for active reference conversion is the
for_each_perag*() iterators. This captures a lot of high level code
that should skip offline AGs, and introduces the ability to
differentiate between a lookup that didn't have an online AG and the
end of the AG iteration range.

From there, the inode allocation AG selection is converted to active
references, and the perag is driven deeper into the inode allocation
and btree code to replace the xfs_mount. Most of the inode
allocation code operates on a single AG once it is selected, hence
it should pass the perag as the primary referenced object around for
allocation, not the xfs_mount. There is a bit of churn here, but it
emphasises that inode allocation is inherently an allocation group
based operation.

Next the bmap/alloc interface undergoes a major untangling,
reworking xfs_bmap_btalloc() into separate allocation operations for
different contexts and failure handling behaviours. This then allows
us to completely remove the xfs_alloc_vextent() layer via
restructuring the xfs_alloc_vextent/xfs_alloc_ag_vextent() into a
set of relatively simple helper functions that describe the
allocation that they are doing, e.g. xfs_alloc_vextent_exact_bno().

This allows the requirements for accessing AGs to be allocation
context dependent. The allocations that require operation on a
single AG generally can't tolerate failure after the allocation
method and AG have been decided on, and hence the caller needs to
manage the active references to ensure the allocation does not race
with shrink removing the selected AG for the duration of the
operation that requires access to that allocation group.

Other allocations iterate AGs and so the first AG is just a hint -
these do not need to pin a perag first as they can tolerate not
being able to access an AG by simply skipping over it. These require
new perag iteration functions that can start at arbitrary AGs and
wrap around at arbitrary AGs, hence a new set of
for_each_perag_wrap*() helpers to do this.
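
The wrapping walk can be sketched in userspace roughly as follows.
This is an illustration only: demo_walk_wrap() and the online[] array
are hypothetical, the walk always wraps at the AG count rather than
at an arbitrary AG, and "online" stands in for "an active reference
could be taken".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Visit every online AG once, starting at start_agno and wrapping
 * past the last AG back to AG 0. Offline AGs are skipped rather than
 * treated as errors. The visit order is recorded in visited[] and
 * the number of AGs visited is returned.
 */
static size_t demo_walk_wrap(const bool *online, unsigned int agcount,
			     unsigned int start_agno,
			     unsigned int *visited)
{
	size_t nr = 0;

	assert(start_agno < agcount);
	for (unsigned int i = 0; i < agcount; i++) {
		unsigned int agno = (start_agno + i) % agcount;

		if (!online[agno])
			continue;	/* tolerate an inaccessible AG */
		visited[nr++] = agno;
	}
	return nr;
}
```

With four AGs where AG 1 is offline, a walk starting at AG 2 visits
AGs 2, 3 and 0 in that order; the starting AG is only a hint and the
offline AG is simply passed over.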

So far this smoke tests OK - there's a problem with AGF locking
deadlocks as a result of converting xfs_alloc_vextent_iterate_ags()
to use for_each_perag_wrap_range() that shows in stress tests, but
it passes everything in the quick group.

There's more to come:
- the bmapi layer needs to handle active AG references for exact and
  near allocation
- filestreams allocator AG selection needs a significant rework to
  simplify and use active references
- converting the allocation "firstblock" restrictions to hold an
  actively referenced perag, not a filesystem block address.
- inode cache lookups need to be converted to active references
- audits needed to find and convert all the places that we use
  bp->b_pag instead of active references passed from high level
  code.
- addition of a "going offline" opstate and state machine to use for
  rejecting new active references as well as blocking shrink from
  making progress until all active references are gone
- ioctls for changing AG state from userspace
- audit of the freeing code to determine whether it can use passive
  references to allow freeing of blocks (which may require
  allocation!) whilst new allocations are prevented from being run
  on "going offline" AGs. This will allow userspace to stop new
  allocations in AGs to be shrunk before it starts emptying them and
  freeing the space that they have in use.
- the physical shrink code.

This current patchset is based on 5.16-rc3.

-Dave.




Thread overview: 45+ messages
2021-12-03  0:00 Dave Chinner [this message]
2021-12-03  0:00 ` [PATCH 01/36] xfs: make last AG grow/shrink perag centric Dave Chinner
2021-12-03  0:00 ` [PATCH 02/36] xfs: kill xfs_ialloc_pagi_init() Dave Chinner
2021-12-03  0:00 ` [PATCH 03/36] xfs: pass perag to xfs_ialloc_read_agi() Dave Chinner
2021-12-03  0:00 ` [PATCH 04/36] xfs: kill xfs_alloc_pagf_init() Dave Chinner
2021-12-03  0:00 ` [PATCH 05/36] xfs: pass perag to xfs_alloc_read_agf() Dave Chinner
2021-12-03  5:37   ` kernel test robot
2021-12-03  5:37     ` kernel test robot
2021-12-03  7:10   ` [RFC PATCH] xfs: xfs_reflink_find_shared can be static kernel test robot
2021-12-03  7:10     ` kernel test robot
2021-12-03  7:19   ` [PATCH 05/36] xfs: pass perag to xfs_alloc_read_agf() kernel test robot
2021-12-03  7:19     ` kernel test robot
2021-12-03  7:19   ` kernel test robot
2021-12-03  7:19     ` kernel test robot
2021-12-03  0:00 ` [PATCH 06/36] xfs: pass perag to xfs_read_agi Dave Chinner
2021-12-03  0:00 ` [PATCH 07/36] xfs: pass perag to xfs_read_agf Dave Chinner
2021-12-03  0:00 ` [PATCH 08/36] xfs: pass perag to xfs_alloc_get_freelist Dave Chinner
2021-12-03  0:00 ` [PATCH 09/36] xfs: pass perag to xfs_alloc_put_freelist Dave Chinner
2021-12-03  0:00 ` [PATCH 10/36] xfs: pass perag to xfs_alloc_read_agfl Dave Chinner
2021-12-03  0:00 ` [PATCH 11/36] xfs: Pre-calculate per-AG agbno geometry Dave Chinner
2021-12-03  0:00 ` [PATCH 12/36] xfs: Pre-calculate per-AG agino geometry Dave Chinner
2021-12-03  0:00 ` [PATCH 13/36] xfs: replace xfs_ag_block_count() with perag accesses Dave Chinner
2021-12-03  0:00 ` [PATCH 14/36] xfs: make is_log_ag() a first class helper Dave Chinner
2021-12-03  0:00 ` [PATCH 15/36] xfs: active perag reference counting Dave Chinner
2021-12-03  0:00 ` [PATCH 16/36] xfs: rework the perag trace points to be perag centric Dave Chinner
2021-12-03  0:00 ` [PATCH 17/36] xfs: convert xfs_imap() to take a perag Dave Chinner
2021-12-03  0:00 ` [PATCH 18/36] xfs: use active perag references for inode allocation Dave Chinner
2021-12-03  0:00 ` [PATCH 19/36] xfs: inobt can use perags in many more places than it does Dave Chinner
2021-12-03  0:00 ` [PATCH 20/36] xfs: convert xfs_ialloc_next_ag() to an atomic Dave Chinner
2021-12-03  0:00 ` [PATCH 21/36] xfs: perags need atomic operational state Dave Chinner
2021-12-03  0:00 ` [PATCH 22/36] xfs: introduce xfs_for_each_perag_wrap() Dave Chinner
2021-12-03  0:00 ` [PATCH 23/36] xfs: rework xfs_alloc_vextent() Dave Chinner
2021-12-03  0:00 ` [PATCH 24/36] xfs: use xfs_alloc_vextent_this_ag() in _iterate_ags() Dave Chinner
2021-12-03  0:01 ` [PATCH 25/36] xfs: combine __xfs_alloc_vextent_this_ag and xfs_alloc_ag_vextent Dave Chinner
2021-12-03  0:01 ` [PATCH 26/36] xfs: use xfs_alloc_vextent_this_ag() where appropriate Dave Chinner
2021-12-03  0:01 ` [PATCH 27/36] xfs: factor xfs_bmap_btalloc() Dave Chinner
2021-12-03  0:01 ` [PATCH 28/36] xfs: use xfs_alloc_vextent_first_ag() where appropriate Dave Chinner
2021-12-03  0:01 ` [PATCH 29/36] xfs: use xfs_alloc_vextent_start_bno() " Dave Chinner
2021-12-03  0:01 ` [PATCH 30/36] xfs: introduce xfs_alloc_vextent_near_bno() Dave Chinner
2021-12-03  0:01 ` [PATCH 31/36] xfs: introduce xfs_alloc_vextent_exact_bno() Dave Chinner
2021-12-03  0:01 ` [PATCH 32/36] xfs: introduce xfs_alloc_vextent_prepare() Dave Chinner
2021-12-03  0:01 ` [PATCH 33/36] xfs: move allocation accounting to xfs_alloc_vextent_set_fsbno() Dave Chinner
2021-12-03  0:01 ` [PATCH 34/36] xfs: fold xfs_alloc_ag_vextent() into callers Dave Chinner
2021-12-03  0:01 ` [PATCH 35/36] xfs: convert xfs_alloc_vextent_iterate_ags() to use perag walker Dave Chinner
2021-12-03  0:01 ` [PATCH 36/36] xfs: convert trim to use for_each_perag_range Dave Chinner
