From: Dave Chinner <david@fromorbit.com>
To: linux-xfs@vger.kernel.org
Subject: [PATCH 0/7] xfs_repair: scale to 150,000 iops
Date: Tue, 30 Oct 2018 22:20:36 +1100	[thread overview]
Message-ID: <20181030112043.6034-1-david@fromorbit.com> (raw)

Hi folks,

This patchset enables me to successfully repair a rather large
metadump image (~500GB of metadata) that was provided to us because
it crashed xfs_repair. Darrick and Eric have already posted patches
to fix the crash bugs, and this series is built on top of them.
Those patches are:

	libxfs: add missing agfl free deferred op type
	xfs_repair: initialize realloced bplist in longform_dir2_entry_check
	xfs_repair: continue after xfs_bunmapi deadlock avoidance

This series starts with another couple of regression fixes: the
revert is for a change in 4.18, and the unlinked list issue exists
only in the 4.19 dev tree.

The third patch prevents a problem I hit during development that
blew the buffer cache size out to more than 100GB of RAM and caused
xfs_repair to be OOM-killed on my 128GB RAM machine. If there was a
sudden prefetch demand, or a set of queues was allowed to grow very
deep (e.g. lots of AGs all starting prefetch at the same time), they
would all race to expand the cache, causing multiple expansions
within a few milliseconds when only one was needed, so I rate
limited expansion.
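
As a rough illustration of the race (a minimal sketch with
hypothetical names, not the actual xfsprogs cache code): each caller
records the cache size it observed when it decided an expansion was
needed, and the expansion only proceeds if nobody else got there
first, so a burst of prefetch threads produces a single doubling
instead of several back-to-back ones.

#include <pthread.h>

struct cache {
	pthread_mutex_t	c_mutex;
	unsigned int	c_maxcount;	/* current cache size limit */
};

static void
expand_cache(struct cache *c, unsigned int observed_maxcount)
{
	pthread_mutex_lock(&c->c_mutex);
	if (c->c_maxcount != observed_maxcount) {
		/* another thread already expanded since we sampled */
		pthread_mutex_unlock(&c->c_mutex);
		return;
	}
	c->c_maxcount *= 2;		/* only one expansion wins */
	pthread_mutex_unlock(&c->c_mutex);
}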

The fourth patch is what actually solved the runaway queueing
problems I was having, but I figured it was still a good idea to
prevent unnecessary cache growth. It allows me to bound how much
work is queued internally to an AG in phase 6, so the queue doesn't
suck up the entire AG's readahead in one go....
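
A hedged sketch of what bounding the queue depth looks like,
assuming a simple condition variable scheme (the bounded_queue names
are illustrative, not the xfsprogs workqueue API): producers block
once the queue reaches its maximum depth and are woken again as
workers retire items, which throttles how much readahead a single AG
can have outstanding.

#include <pthread.h>

struct bounded_queue {
	pthread_mutex_t	bq_lock;
	pthread_cond_t	bq_not_full;
	unsigned int	bq_depth;	/* items currently queued */
	unsigned int	bq_max_depth;	/* producer throttle point */
};

static void
bq_queue_work(struct bounded_queue *bq /* , work item */)
{
	pthread_mutex_lock(&bq->bq_lock);
	while (bq->bq_depth >= bq->bq_max_depth)
		pthread_cond_wait(&bq->bq_not_full, &bq->bq_lock);
	bq->bq_depth++;
	/* link the work item onto the queue here */
	pthread_mutex_unlock(&bq->bq_lock);
}

static void
bq_work_done(struct bounded_queue *bq)
{
	pthread_mutex_lock(&bq->bq_lock);
	bq->bq_depth--;
	pthread_cond_signal(&bq->bq_not_full);
	pthread_mutex_unlock(&bq->bq_lock);
}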

Patches 5 and 6 protect objects/structures that have concurrent
access in phase 6 - the bad inode list and the inode chunk records
in the per-AG AVL trees. The trees themselves aren't modified in
phase 6, so they don't need any additional concurrency protection.
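
For the bad inode list, the shape of the change is roughly this
(illustrative names only, not the actual xfs_repair functions): a
single mutex taken around lookups and insertions so concurrent
phase 6 workers can't corrupt the list.

#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

struct bad_ino {
	struct bad_ino	*next;
	uint64_t	ino;
};

static struct bad_ino	*bad_ino_list;
static pthread_mutex_t	bad_ino_lock = PTHREAD_MUTEX_INITIALIZER;

static void
add_bad_ino(uint64_t ino)
{
	struct bad_ino	*e = malloc(sizeof(*e));

	if (!e)
		return;		/* real code would abort on ENOMEM */
	e->ino = ino;
	pthread_mutex_lock(&bad_ino_lock);
	e->next = bad_ino_list;
	bad_ino_list = e;
	pthread_mutex_unlock(&bad_ino_lock);
}

static int
is_bad_ino(uint64_t ino)
{
	struct bad_ino	*e;
	int		found = 0;

	pthread_mutex_lock(&bad_ino_lock);
	for (e = bad_ino_list; e; e = e->next) {
		if (e->ino == ino) {
			found = 1;
			break;
		}
	}
	pthread_mutex_unlock(&bad_ino_lock);
	return found;
}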

Patch 7 enables concurrency in phase 6. Firstly, it parallelises
across AGs like phases 3 and 4, but because phase 6 is largely CPU
bound processing directories one at a time, it also uses a workqueue
to parallelise processing of individual inode chunk records. This
is convenient and easy to do, and is very effective. If you have
the IO capability, phase 6 will now run as a CPU-bound workload - I
watched it use 30 of 32 CPUs for 15 minutes before the long tail of
large directories slowly burnt down.
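
To make the fan-out concrete, here's a rough, self-contained sketch
of the two levels of parallelism. Everything in it is an
illustrative stand-in, not the xfs_repair workqueue interface:
wq_add() runs the item inline so the example compiles on its own
(the real code hands items to worker threads), and a linked list
stands in for the per-AG AVL tree of inode chunk records.

#include <stdint.h>
#include <stddef.h>

struct ino_chunk {
	struct ino_chunk	*next;	/* next chunk record in this AG */
	uint64_t		first_ino;
};

struct workqueue;
typedef void (*wq_func_t)(struct workqueue *wq, uint32_t agno, void *arg);

/* Stand-in: run the item inline; a real version queues to worker threads. */
static void
wq_add(struct workqueue *wq, wq_func_t fn, uint32_t agno, void *arg)
{
	fn(wq, agno, arg);
}

/* Check every directory inode in one chunk record. */
static void
process_inode_chunk(struct workqueue *wq, uint32_t agno, void *arg)
{
	struct ino_chunk	*chunk = arg;

	(void)wq; (void)agno; (void)chunk;
	/* directory entry checking for this chunk would go here */
}

/* Per-AG handler: fan directory checking out across the AG's chunk records. */
static void
traverse_ag(struct workqueue *wq, uint32_t agno, struct ino_chunk *records)
{
	struct ino_chunk	*chunk;

	for (chunk = records; chunk != NULL; chunk = chunk->next)
		wq_add(wq, process_inode_chunk, agno, chunk);
}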

While burning all that CPU, it also sustained about 160k IOPS from
the SSDs. Phases 3 and 4 also ran at about 130-150k IOPS, but that
is about the current limit of the prefetching and IO infrastructure
we have in xfsprogs.

Comments, thoughts, ideas, testing all welcome!

Cheers,

Dave.


Thread overview: 35+ messages
2018-10-30 11:20 Dave Chinner [this message]
2018-10-30 11:20 ` [PATCH 1/7] Revert "xfs_repair: treat zero da btree pointers as corruption" Dave Chinner
2018-10-30 17:20   ` Darrick J. Wong
2018-10-30 19:35     ` Eric Sandeen
2018-10-30 20:11       ` Dave Chinner
2018-10-30 11:20 ` [PATCH 2/7] repair: don't dirty inodes which are not unlinked Dave Chinner
2018-10-30 17:26   ` Darrick J. Wong
2018-10-30 20:03   ` Eric Sandeen
2018-10-30 20:09     ` Eric Sandeen
2018-10-30 20:34       ` Dave Chinner
2018-10-30 20:40         ` Eric Sandeen
2018-10-30 20:58           ` Dave Chinner
2018-10-30 11:20 ` [PATCH 3/7] cache: prevent expansion races Dave Chinner
2018-10-30 17:39   ` Darrick J. Wong
2018-10-30 20:35     ` Dave Chinner
2018-10-31 17:13   ` Brian Foster
2018-11-01  1:27     ` Dave Chinner
2018-11-01 13:17       ` Brian Foster
2018-11-01 21:23         ` Dave Chinner
2018-11-02 11:31           ` Brian Foster
2018-11-02 23:26             ` Dave Chinner
2018-10-30 11:20 ` [PATCH 4/7] workqueue: bound maximum queue depth Dave Chinner
2018-10-30 17:58   ` Darrick J. Wong
2018-10-30 20:53     ` Dave Chinner
2018-10-31 17:14       ` Brian Foster
2018-10-30 11:20 ` [PATCH 5/7] repair: Protect bad inode list with mutex Dave Chinner
2018-10-30 17:44   ` Darrick J. Wong
2018-10-30 20:54     ` Dave Chinner
2018-10-30 11:20 ` [PATCH 6/7] repair: protect inode chunk tree records with a mutex Dave Chinner
2018-10-30 17:46   ` Darrick J. Wong
2018-10-30 11:20 ` [PATCH 7/7] repair: parallelise phase 6 Dave Chinner
2018-10-30 17:51   ` Darrick J. Wong
2018-10-30 20:55     ` Dave Chinner
2018-11-07  5:44 ` [PATCH 0/7] xfs_repair: scale to 150,000 iops Arkadiusz Miśkiewicz
2018-11-07  6:48   ` Dave Chinner
