From: Dave Chinner <david@fromorbit.com>
To: linux-xfs@vger.kernel.org
Subject: [PATCH v3 00/42] xfs: per-ag centric allocation alogrithms
Date: Fri, 10 Feb 2023 09:17:43 +1100
Message-ID: <20230209221825.3722244-1-david@fromorbit.com>

This series continues the work towards making shrinking a filesystem possible.
We need to be able to stop operations from taking place on AGs that need to be
removed by a shrink, so before shrink can be implemented we need to have the
infrastructure in place to prevent incursion into AGs that are going to be, or
are in the process of being, removed from active duty.

The focus of this is making operations that depend on access to AGs use the
perag to access and pin the AG in active use, thereby creating a barrier we can
use to delay shrink until all active uses of an AG have been drained and new
uses are prevented.

This series starts by fixing some existing issues that are exposed by changes
later in the series. They stand alone, so can be picked up independently of the
rest of this patchset.

The most complex of these fixes is cleaning up the mess that is the AGF deadlock
avoidance algorithm. This algorithm stores the first block that is allocated in
a transaction in tp->t_firstblock, then uses this to try to limit future
allocations within the transaction to AGs at or higher than the filesystem block
stored in tp->t_firstblock. This depends on one of the initial bug fixes in the
series to move the deadlock avoidance checks to xfs_alloc_vextent(), and then
builds on it to relax the constraints of the avoidance algorithm to only be
active when a deadlock is possible.

We also update the algorithm to record the higher AGs that we allocate from,
because when we need to lock more than two AGs we still have to ensure lock
order is correct. Therefore we can't lock AGs in the order 1, 3, 2, even though
tp->t_firstblock indicates that we've allocated from AG 1 and so AG 2 is valid
to lock. It's not valid, because we already hold AG 3 locked, and so
tp->t_firstblock should actually point at AG 3, not AG 1 in this situation.

It should now be obvious that the deadlock avoidance algorithm should record
AGs, not filesystem blocks. So the series then changes the transaction to store
the highest AG we've allocated in rather than a filesystem block we allocated.
This makes it obvious what the constraints are, and trivial to update as we
lock and allocate from various AGs.
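
As a rough illustration of the ordering rule described above, here is a minimal
userspace sketch. It is not the kernel code: struct demo_trans, highest_agno and
the demo_*() helpers are hypothetical names made up for this example, and it
models only the strict "never lock a lower AG" rule, not the relaxation to cases
where a deadlock is actually possible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; not the kernel definitions. */
typedef uint32_t agnumber_t;
#define DEMO_NULLAGNUMBER	((agnumber_t)-1)

struct demo_trans {
	/* Highest AG allocated from so far, or DEMO_NULLAGNUMBER if none. */
	agnumber_t	highest_agno;
};

static void demo_trans_init(struct demo_trans *tp)
{
	tp->highest_agno = DEMO_NULLAGNUMBER;
}

/* An AG may only be locked if it is not below the highest AG already held. */
static bool demo_can_lock_ag(const struct demo_trans *tp, agnumber_t agno)
{
	return tp->highest_agno == DEMO_NULLAGNUMBER ||
	       agno >= tp->highest_agno;
}

/* Record an allocation from agno, keeping track of the highest AG used. */
static void demo_record_ag(struct demo_trans *tp, agnumber_t agno)
{
	assert(demo_can_lock_ag(tp, agno));
	if (tp->highest_agno == DEMO_NULLAGNUMBER || agno > tp->highest_agno)
		tp->highest_agno = agno;
}
```

With this model, the 1, 3, 2 sequence from above is rejected at the third step:
after recording AG 1 and then AG 3, demo_can_lock_ag(tp, 2) returns false
because the tracked value is 3, not 1.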

With all the bug fixes out of the way, the series then starts converting the
code to use active references. Active reference counts are used by high level
code that needs to prevent the AG from being taken out from under it by a shrink
operation. The high level code needs to be able to handle not getting an active
reference gracefully, and the shrink code will need to wait for active
references to drain before continuing.

Active references are implemented just as reference counts right now - an active
reference is taken at perag init during mount, and all other active references
are dependent on the active reference count being greater than zero. This gives
us an initial method of stopping new active references without needing other
infrastructure; just drop the reference taken at filesystem mount time and when
the refcount then falls to zero no new references can be taken.
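
The "no new references once the count hits zero" behaviour can be sketched in
userspace with C11 atomics. This is only an illustration under assumed names
(struct demo_perag, active_ref and the demo_perag_*() helpers are hypothetical);
the patches themselves use the kernel's own atomic primitives.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the perag; only the active count is modelled. */
struct demo_perag {
	atomic_int	active_ref;
};

/* Mount-time init takes the initial active reference. */
static void demo_perag_init(struct demo_perag *pag)
{
	atomic_init(&pag->active_ref, 1);
}

/*
 * Take an active reference only if the count is still above zero.
 * Returns false once the count has fallen to zero, so callers must be
 * able to handle failure.
 */
static bool demo_perag_grab(struct demo_perag *pag)
{
	int old = atomic_load(&pag->active_ref);

	while (old > 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&pag->active_ref, &old,
						 old + 1))
			return true;
	}
	return false;
}

/* Drop an active reference; returns true if this was the last one. */
static bool demo_perag_rele(struct demo_perag *pag)
{
	int old = atomic_fetch_sub(&pag->active_ref, 1);

	assert(old > 0);
	return old == 1;
}
```

In this model a shrink would call demo_perag_rele() to drop the mount-time
reference and then wait for the remaining holders to drain. Once the count
reaches zero, every later demo_perag_grab() fails.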

In future, this will need to take into account AG control state (e.g. offline,
no alloc, etc) as well as the reference count, but right now we can implement
a basic barrier for shrink with just reference count manipulations. As such,
patches to convert the perag state to atomic opstate fields similar to the
xfs_mount and xlog opstate fields follow the initial active perag reference
counting patches.

The first target for active reference conversion is the for_each_perag*()
iterators. This captures a lot of high level code that should skip offline AGs,
and introduces the ability to differentiate between a lookup that didn't have an
online AG and the end of the AG iteration range.

From there, the inode allocation AG selection is converted to active references,
and the perag is driven deeper into the inode allocation and btree code to
replace the xfs_mount. Most of the inode allocation code operates on a single AG
once it is selected, hence it should pass the perag as the primary referenced
object around for allocation, not the xfs_mount. There is a bit of churn here,
but it emphasises that inode allocation is inherently an allocation group based
operation.

Next the bmap/alloc interface undergoes a major untangling, reworking
xfs_bmap_btalloc() into separate allocation operations for different contexts
and failure handling behaviours. This then allows us to completely remove
the xfs_alloc_vextent() layer by restructuring
xfs_alloc_vextent()/xfs_alloc_ag_vextent() into a set of relatively simple
helper functions that describe the allocation that they are doing, e.g.
xfs_alloc_vextent_exact_bno().

This allows the requirements for accessing AGs to be allocation context
dependent. The allocations that require operation on a single AG generally can't
tolerate failure after the allocation method and AG has been decided on, and
hence the caller needs to manage the active references to ensure the allocation
does not race with shrink removing the selected AG for the duration of the
operation that requires access to that allocation group.

Other allocations iterate AGs and so the first AG is just a hint - these do
not need to pin a perag first as they can tolerate not being able to access an
AG by simply skipping over it. These require new perag iteration functions that
can start at arbitrary AGs and wrap around at arbitrary AGs, hence a new set of
for_each_perag_wrap*() helpers to do this.
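
The iteration order such a walk produces can be sketched as below. This is a
simplified userspace model, not the for_each_perag_wrap*() macros:
demo_walk_wrap() and the online[] array are hypothetical, and the sketch always
wraps at agcount rather than at an arbitrary AG.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t agnumber_t;	/* hypothetical stand-in type */

/*
 * Visit every AG exactly once, starting at 'start' and wrapping from
 * agcount - 1 back to 0. AGs that are not online are skipped rather
 * than treated as an error. The visited AG numbers are stored in out[],
 * which must have room for agcount entries. Returns the number visited.
 */
static size_t demo_walk_wrap(const bool *online, agnumber_t agcount,
			     agnumber_t start, agnumber_t *out)
{
	size_t n = 0;

	assert(agcount > 0 && start < agcount);
	for (agnumber_t i = 0; i < agcount; i++) {
		/* Same as (start + i) % agcount, without overflow. */
		agnumber_t agno = (i < agcount - start) ?
				  start + i : i - (agcount - start);

		if (!online[agno])
			continue;	/* the start AG is only a hint */
		out[n++] = agno;
	}
	return n;
}
```

For a four AG filesystem with AG 1 offline and a start hint of AG 2, the walk
visits AGs 2, 3 and 0 in that order.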

Next is the rework of the filestreams allocator. This doesn't change any
functionality, but gets rid of the unnecessary multi-pass selection algorithm
when the selected AG is not available. It currently does a lookup pass which might
iterate all AGs to select an AG, then checks if the AG is acceptable and if not
does a "new AG" pass that is essentially identical to the lookup pass. Both of
these scans also do the same "longest extent in AG" check before selecting an AG
as is done after the AG is selected.

IOWs, the filestreams algorithm can be greatly simplified into a single new AG
selection pass if there is no current association or the currently
associated AG doesn't have enough contiguous free space for the allocation to
proceed.  With this simplification of the filestreams allocator, it's then
trivial to convert it to use for_each_perag_wrap() for the AG scan algorithm.

This series passes auto group fstests with rmapbt=1 on both 1kB and 4kB block
size configurations without functional or performance regressions. In some cases
ENOSPC behaviour is improved, but fstests does not capture those improvements as
it only tests for regressions in behaviour.

Version 3:
- rebased on current linux-xfs/for-next
- various whitespace and typo cleanups.
- fixed missing error return from xfs_bmap_btalloc_select_lengths().
- changed git diff algorithm to "patience" for better readability.
- replaced xfs_rfsblock_t with xfs_fsblock_t.
- removed stray trace_printk() debugging code.
- Added assert to ensure we don't leak perag references out of the
  xfs_alloc_vextent_start_ag() iterator.
- changed trylock flag in xfs_filestream_pick_ag() to a boolean to reflect the way
  it is used now.

Version 2:
- https://lore.kernel.org/linux-xfs/20230118224505.1964941-1-david@fromorbit.com/
- AGI, AGF and AGFL access conversion patches removed due to being merged.
- AG geometry conversion patches removed due to being merged.
- Rebase on 6.2-rc4
- fixed "firstblock" AGF deadlock avoidance algorithm
- lots of cleanups and bug fixes.

Version 1 [RFC]:
- https://lore.kernel.org/linux-xfs/20220611012659.3418072-1-david@fromorbit.com/



Thread overview: 46+ messages
2023-02-09 22:17 Dave Chinner [this message]
2023-02-09 22:17 ` [PATCH 01/42] xfs: fix low space alloc deadlock Dave Chinner
2023-02-09 22:17 ` [PATCH 02/42] xfs: prefer free inodes at ENOSPC over chunk allocation Dave Chinner
2023-02-09 22:17 ` [PATCH 03/42] xfs: block reservation too large for minleft allocation Dave Chinner
2023-02-09 22:17 ` [PATCH 04/42] xfs: drop firstblock constraints from allocation setup Dave Chinner
2023-02-09 22:17 ` [PATCH 05/42] xfs: t_firstblock is tracking AGs not blocks Dave Chinner
2023-02-09 22:17 ` [PATCH 06/42] xfs: don't assert fail on transaction cancel with deferred ops Dave Chinner
2023-02-09 22:17 ` [PATCH 07/42] xfs: active perag reference counting Dave Chinner
2023-02-11  4:06   ` Darrick J. Wong
2023-02-09 22:17 ` [PATCH 08/42] xfs: rework the perag trace points to be perag centric Dave Chinner
2023-02-09 22:17 ` [PATCH 09/42] xfs: convert xfs_imap() to take a perag Dave Chinner
2023-02-09 22:17 ` [PATCH 10/42] xfs: use active perag references for inode allocation Dave Chinner
2023-02-09 22:17 ` [PATCH 11/42] xfs: inobt can use perags in many more places than it does Dave Chinner
2023-02-09 22:17 ` [PATCH 12/42] xfs: convert xfs_ialloc_next_ag() to an atomic Dave Chinner
2023-02-09 22:17 ` [PATCH 13/42] xfs: perags need atomic operational state Dave Chinner
2023-02-09 22:17 ` [PATCH 14/42] xfs: introduce xfs_for_each_perag_wrap() Dave Chinner
2023-02-09 22:17 ` [PATCH 15/42] xfs: rework xfs_alloc_vextent() Dave Chinner
2023-02-09 22:17 ` [PATCH 16/42] xfs: factor xfs_alloc_vextent_this_ag() for _iterate_ags() Dave Chinner
2023-02-09 22:18 ` [PATCH 17/42] xfs: combine __xfs_alloc_vextent_this_ag and xfs_alloc_ag_vextent Dave Chinner
2023-02-09 22:18 ` [PATCH 18/42] xfs: use xfs_alloc_vextent_this_ag() where appropriate Dave Chinner
2023-02-09 22:18 ` [PATCH 19/42] xfs: factor xfs_bmap_btalloc() Dave Chinner
2023-02-09 22:18 ` [PATCH 20/42] xfs: use xfs_alloc_vextent_first_ag() where appropriate Dave Chinner
2023-02-09 22:18 ` [PATCH 21/42] xfs: use xfs_alloc_vextent_start_bno() " Dave Chinner
2023-02-09 22:18 ` [PATCH 22/42] xfs: introduce xfs_alloc_vextent_near_bno() Dave Chinner
2023-02-09 22:18 ` [PATCH 23/42] xfs: introduce xfs_alloc_vextent_exact_bno() Dave Chinner
2023-02-09 22:18 ` [PATCH 24/42] xfs: introduce xfs_alloc_vextent_prepare() Dave Chinner
2023-02-09 22:18 ` [PATCH 25/42] xfs: move allocation accounting to xfs_alloc_vextent_set_fsbno() Dave Chinner
2023-02-09 22:18 ` [PATCH 26/42] xfs: fold xfs_alloc_ag_vextent() into callers Dave Chinner
2023-02-09 22:18 ` [PATCH 27/42] xfs: move the minimum agno checks into xfs_alloc_vextent_check_args Dave Chinner
2023-02-09 22:18 ` [PATCH 28/42] xfs: convert xfs_alloc_vextent_iterate_ags() to use perag walker Dave Chinner
2023-02-09 22:18 ` [PATCH 29/42] xfs: convert trim to use for_each_perag_range Dave Chinner
2023-02-09 22:18 ` [PATCH 30/42] xfs: factor out filestreams from xfs_bmap_btalloc_nullfb Dave Chinner
2023-02-09 22:18 ` [PATCH 31/42] xfs: get rid of notinit from xfs_bmap_longest_free_extent Dave Chinner
2023-02-09 22:18 ` [PATCH 32/42] xfs: use xfs_bmap_longest_free_extent() in filestreams Dave Chinner
2023-02-09 22:18 ` [PATCH 33/42] xfs: move xfs_bmap_btalloc_filestreams() to xfs_filestreams.c Dave Chinner
2023-02-09 22:18 ` [PATCH 34/42] xfs: merge filestream AG lookup into xfs_filestream_select_ag() Dave Chinner
2023-02-09 22:18 ` [PATCH 35/42] xfs: merge new filestream AG selection " Dave Chinner
2023-02-09 22:18 ` [PATCH 36/42] xfs: remove xfs_filestream_select_ag() longest extent check Dave Chinner
2023-02-09 22:18 ` [PATCH 37/42] xfs: factor out MRU hit case in xfs_filestream_select_ag Dave Chinner
2023-02-09 22:18 ` [PATCH 38/42] xfs: track an active perag reference in filestreams Dave Chinner
2023-02-09 22:18 ` [PATCH 39/42] xfs: use for_each_perag_wrap in xfs_filestream_pick_ag Dave Chinner
2023-02-09 22:18 ` [PATCH 40/42] xfs: pass perag to filestreams tracing Dave Chinner
2023-02-09 22:18 ` [PATCH 41/42] xfs: return a referenced perag from filestreams allocator Dave Chinner
2023-02-09 22:18 ` [PATCH 42/42] xfs: refactor the filestreams allocator pick functions Dave Chinner
2023-02-10  3:09 ` [PATCH v3 00/42] xfs: per-ag centric allocation alogrithms Darrick J. Wong
2023-02-10  5:00   ` Dave Chinner
