All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Darrick J. Wong" <djwong@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: sandeen@sandeen.net, Christoph Hellwig <hch@lst.de>,
	Dave Chinner <dchinner@redhat.com>, Theodore Ts'o <tytso@mit.edu>,
	linux-xfs@vger.kernel.org, allison.henderson@oracle.com
Subject: Re: [PATCH 19/17] mkfs: increase default log size for new (aka bigtime) filesystems
Date: Mon, 28 Feb 2022 18:38:22 -0800	[thread overview]
Message-ID: <20220301023822.GD117732@magnolia> (raw)
In-Reply-To: <20220301004249.GT59715@dread.disaster.area>

On Tue, Mar 01, 2022 at 11:42:49AM +1100, Dave Chinner wrote:
> On Mon, Feb 28, 2022 at 03:22:11PM -0800, Darrick J. Wong wrote:
> > On Sun, Feb 27, 2022 at 08:37:20AM +1100, Dave Chinner wrote:
> > > On Fri, Feb 25, 2022 at 06:54:50PM -0800, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > 
> > > > Recently, the upstream kernel maintainer has been taking a lot of heat on
> > > > account of writer threads encountering high latency when asking for log
> > > > grant space when the log is small.  The reported use case is a heavily
> > > > threaded indexing product logging trace information to a filesystem
> > > > ranging in size between 20 and 250GB.  The meetings that result from the
> > > > complaints about latency and stall warnings in dmesg both from this use
> > > > case and also a large well known cloud product are now consuming 25% of
> > > > the maintainer's weekly time and have been for months.
> > > 
> > > Is the transaction reservation space exhaustion caused by, as I
> > > pointed out in another thread yesterday, the unbound concurrency in
> > > IO completion?
> > 
> > No.  They're using synchronous directio writes to write trace data in 4k
> 
> synchronous as in O_(D)SYNC or as in "not using AIO"? Is is also
> append data, and is it one file per logging thread or multiple
> threads writing to a single file?

One file per <cough> tenant, and multiple threads spread across all the
tenants on a system.  This means that all the threads can end up
pounding on a single file, or each thread can write to a single file.
The scenario that generated all those numbers (I think) is opening the
file O_DIRECT|O_SYNC and pwrite()ing blocks at EOF.

> > chunks.  The number of files does not exceed the number of writer
> > threads, and the number of writer threads can be up to 10*NR_CPUS (~400
> > on the test system).  If I'm reading the iomap directio code correctly,
> > the writer threads block and do not issue more IO until the first IO
> > completes...
> 
> So, up to 400 threads concurrently issuing IO that does block
> allocation and performing unwritten extent conversion, so up to ~800
> concurrent running allocation related transactions at a time?

Assuming you got 800 from the 400 writers + 400 bmbt allocations, yes.
Though I wouldn't count /quite/ that high for bmbt expansions.

> > > i.e. we have hundreds of active concurrent
> > > transactions that then block on common objects between them (e.g.
> > > inode locks) and serialise?
> > 
> > ...so yes, there are hundreds of active transactions, but (AFAICT) they
> > mostly don't share objects, other than the log itself.  Once we made the
> > log bigger, the hotspot moved to the AGF buffers.  I'm not sure what to
> > do about /that/, since a 5GB AG is pretty small.  That aside...
> 
> No surprise, AG selection is based on the is based on trying to get
> an adjacent extent for file extension. Hence assuming random
> distribution because of contention and skipping done by the search
> algorithm, then if we have ~50 AGs and 400 writers trying to
> allocate at the same time then you've got, on average, 8 allocations
> per AG being attempted roughly concurrently.
> 
> Of course, append write workloads tend to respond really well to
> extent size hints - make sure you allocate a large chunk that
> extents beyond EOF on the first write, then subsequent extending
> writes only need unwritten extent conversion which shouldn't need
> AGF access because it won't require BMBT block allocation during
> conversion because it's just taking away from the unwritten extent
> and putting the space into the adjacent written extent.

Yes, I know, I've been battling $internalgroups for over a year now to
get extent size hints turned on for things like append logs and VM
images.  If the IO sizes get larger (or they turn on extent size hints)
then the heat goes down...

> > > Hence only handful of completions can
> > > actually run concurrently, depsite every completion holding a full
> > > reservation of log space to allow them to run concurrently?
> > 
> > ...this is still an issue for different scenarios.  I would still be
> > interested in experimenting with constraining the number of writeback
> > completion workers that get started, even though that isn't at play
> > here.
> 
> Well, the "running out of log space" problem is still going to
> largely be caused by having up to 400 concurrent unwritten extent
> conversion transactions running at any given point in time...

I know!  But right now the default log size you get with a 220G volume
is 110MB.  An 800MB log just pushes the scaling problems around the
system, but on the plus side the latency is no longer so high that the
heartbeat timers trip, which causes node evictions, which leads to even
/worse/ escalations from customers whose systems experience high rates
of failure.  Every single customer who escalates this causes another
round of bug reports and another round of meetings for me, even though
the causes and the bandaids are the same.

> > > I also wonder if the right thing to do here is just set a minimum
> > > log size of 32MB? The worst of the long tail latencies are mitigated
> > > by this point, and so even small filesystems grown out to 200GB will
> > > have a log size that results in decent performance for this sort of
> > > workload.
> > 
> > Are you asking for a second patch where mkfs refuses to format a log
> > smaller than 32MB (e.g. 8GB with the x86 defaults)?  Or a second patch
> > that cranks the minimum log size up to 32MB, even if that leads to
> > absurd results (e.g. 66MB filesystems with 2 AGs and a 32MB log)?
> 
> I'm suggesting the latter.
> 
> Along with a refusal to make an XFS filesystem smaller than, say,
> 256MB, because allowing users to make tiny XFS filesystems seems to
> always just lead to future troubles.

I'm going to drop this patch, because I don't have the strength to push
back against you /and/ Eric.  I'm once again in a crunch because I've
spent so much of the past few weeks in meetings about bugs that I didn't
have time to try out Allison's LARP patches so I basically have nothing
to push for 5.18 because we're past -rc6 and it's too late, too late for
anything.  Hell, I can't even get that capable() thing reviewed, and
(right now) that's causing me even more meetingsgalore pain.

Is it so dangerous to turn this on so that real users can experiment
with the new setting?

Frankly I'm also thinking about throwing away six years of online repair
work because I just don't see myself being able to summon the mental
energy to run the review process when simple things are this hard.
Every week I sit down and ask myself if I really have what it takes to
keep going, and I've reached the point where the answer is NO.  I was
really hoping that some of our discussions about process improvements
would have made things better, but instead I realize that I have failed
as maintainer.

--D

> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

  reply	other threads:[~2022-03-01  2:38 UTC|newest]

Thread overview: 60+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-01-20  0:21 [PATCHSET 00/17] xfsprogs: various 5.15 fixes Darrick J. Wong
2022-01-20  0:21 ` [PATCH 01/17] libxcmd: use emacs mode for command history editing Darrick J. Wong
2022-01-20  0:21 ` [PATCH 02/17] libxfs: shut down filesystem if we xfs_trans_cancel with deferred work items Darrick J. Wong
2022-02-04 21:36   ` Eric Sandeen
2022-02-04 21:47     ` Darrick J. Wong
2022-01-20  0:21 ` [PATCH 03/17] libxfs: don't leave dangling perag references from xfs_buf Darrick J. Wong
2022-02-04 22:05   ` Eric Sandeen
2022-01-20  0:21 ` [PATCH 04/17] libfrog: move the GETFSMAP definitions into libfrog Darrick J. Wong
2022-02-04 23:18   ` Eric Sandeen
2022-02-05  0:36     ` Darrick J. Wong
2022-02-07  1:05       ` Dave Chinner
2022-02-07 17:09         ` Darrick J. Wong
2022-02-07 21:32           ` Eric Sandeen
2022-02-10  3:33             ` Dave Chinner
2022-02-08 16:46   ` [PATCH v1.1 04/17] libfrog: always use the kernel GETFSMAP definitions Darrick J. Wong
2022-02-25 22:35     ` Eric Sandeen
2022-01-20  0:22 ` [PATCH 05/17] misc: add a crc32c self test to mkfs and repair Darrick J. Wong
2022-02-04 23:23   ` Eric Sandeen
2022-01-20  0:22 ` [PATCH 06/17] libxfs-apply: support filterdiff >= 0.4.2 only Darrick J. Wong
2022-01-20  0:22 ` [PATCH 07/17] xfs_db: fix nbits parameter in fa_ino[48] functions Darrick J. Wong
2022-02-25 21:45   ` Eric Sandeen
2022-01-20  0:22 ` [PATCH 08/17] xfs_repair: explicitly cast resource usage counts in do_warn Darrick J. Wong
2022-02-25 21:46   ` Eric Sandeen
2022-01-20  0:22 ` [PATCH 09/17] xfs_repair: explicitly cast directory inode numbers " Darrick J. Wong
2022-02-25 21:48   ` Eric Sandeen
2022-01-20  0:22 ` [PATCH 10/17] xfs_repair: fix indentation problems in upgrade_filesystem Darrick J. Wong
2022-02-25 21:53   ` Eric Sandeen
2022-01-20  0:22 ` [PATCH 11/17] xfs_repair: update secondary superblocks after changing features Darrick J. Wong
2022-02-25 21:57   ` Eric Sandeen
2022-01-20  0:22 ` [PATCH 12/17] xfs_scrub: report optional features in version string Darrick J. Wong
2022-01-20  1:16   ` Theodore Ts'o
2022-01-20  1:28     ` Darrick J. Wong
2022-01-20  1:32   ` [PATCH v2 " Darrick J. Wong
2022-02-25 22:14     ` Eric Sandeen
2022-02-26  0:04       ` Darrick J. Wong
2022-02-26  2:48         ` Darrick J. Wong
2022-02-26  2:53   ` [PATCH v3 " Darrick J. Wong
2022-02-28 21:38     ` Eric Sandeen
2022-01-20  0:22 ` [PATCH 13/17] mkfs: prevent corruption of passed-in suboption string values Darrick J. Wong
2022-01-20  0:22 ` [PATCH 14/17] mkfs: add configuration files for the last few LTS kernels Darrick J. Wong
2022-01-20  0:22 ` [PATCH 15/17] mkfs: document sample configuration file location Darrick J. Wong
2022-01-20  0:23 ` [PATCH 16/17] mkfs: add a config file for x86_64 pmem filesystems Darrick J. Wong
2022-02-25 22:21   ` Eric Sandeen
2022-02-26  2:38     ` Darrick J. Wong
2022-02-26  2:52   ` [PATCH v2 " Darrick J. Wong
2022-02-28 21:37     ` Eric Sandeen
2022-01-20  0:23 ` [PATCH 17/17] mkfs: enable inobtcount and bigtime by default Darrick J. Wong
2022-02-25 22:22   ` Eric Sandeen
2022-01-28 22:44 ` [PATCH 18/17] xfs_scrub: fix reporting if we can't open raw block devices Darrick J. Wong
2022-01-31 12:28   ` Christoph Hellwig
2022-02-26  2:54 ` [PATCH 19/17] mkfs: increase default log size for new (aka bigtime) filesystems Darrick J. Wong
2022-02-26 21:37   ` Dave Chinner
2022-02-28 23:22     ` Darrick J. Wong
2022-03-01  0:42       ` Dave Chinner
2022-03-01  2:38         ` Darrick J. Wong [this message]
2022-03-01 15:55           ` Brian Foster
2022-03-01  3:10         ` Dave Chinner
2022-02-28 21:44   ` Eric Sandeen
2022-03-01  2:21     ` Darrick J. Wong
2022-03-01  2:44       ` Eric Sandeen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220301023822.GD117732@magnolia \
    --to=djwong@kernel.org \
    --cc=allison.henderson@oracle.com \
    --cc=david@fromorbit.com \
    --cc=dchinner@redhat.com \
    --cc=hch@lst.de \
    --cc=linux-xfs@vger.kernel.org \
    --cc=sandeen@sandeen.net \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.