All of lore.kernel.org
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: "Darrick J. Wong" <djwong@kernel.org>
Cc: sandeen@sandeen.net, Christoph Hellwig <hch@lst.de>,
	Dave Chinner <dchinner@redhat.com>, Theodore Ts'o <tytso@mit.edu>,
	linux-xfs@vger.kernel.org, allison.henderson@oracle.com
Subject: Re: [PATCH 19/17] mkfs: increase default log size for new (aka bigtime) filesystems
Date: Tue, 1 Mar 2022 14:10:54 +1100	[thread overview]
Message-ID: <20220301031054.GU59715@dread.disaster.area> (raw)
In-Reply-To: <20220301004249.GT59715@dread.disaster.area>

On Tue, Mar 01, 2022 at 11:42:49AM +1100, Dave Chinner wrote:
> On Mon, Feb 28, 2022 at 03:22:11PM -0800, Darrick J. Wong wrote:
> > On Sun, Feb 27, 2022 at 08:37:20AM +1100, Dave Chinner wrote:
> > > On Fri, Feb 25, 2022 at 06:54:50PM -0800, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > 
> > > > Recently, the upstream kernel maintainer has been taking a lot of heat on
> > > > account of writer threads encountering high latency when asking for log
> > > > grant space when the log is small.  The reported use case is a heavily
> > > > threaded indexing product logging trace information to a filesystem
> > > > ranging in size between 20 and 250GB.  The meetings that result from the
> > > > complaints about latency and stall warnings in dmesg both from this use
> > > > case and also a large well known cloud product are now consuming 25% of
> > > > the maintainer's weekly time and have been for months.
> > > 
> > > Is the transaction reservation space exhaustion caused by, as I
> > > pointed out in another thread yesterday, the unbound concurrency in
> > > IO completion?
> > 
> > No.  They're using synchronous directio writes to write trace data in 4k
> 
> synchronous as in O_(D)SYNC or as in "not using AIO"? Is is also
> append data, and is it one file per logging thread or multiple
> threads writing to a single file?
> 
> > chunks.  The number of files does not exceed the number of writer
> > threads, and the number of writer threads can be up to 10*NR_CPUS (~400
> > on the test system).  If I'm reading the iomap directio code correctly,
> > the writer threads block and do not issue more IO until the first IO
> > completes...
> 
> So, up to 400 threads concurrently issuing IO that does block
> allocation and performing unwritten extent conversion, so up to ~800
> concurrent running allocation related transactions at a time?

Let me put this a different way: optimal log size is determined by
workload transaction concurrency, not filesystem capacity.

I just did a quick check of the write reservation that unwritten
extent conversion uses, and it was 150kB on my test system. If there
are going to be 400 of these unwritten completions running at once,
then for them to not block on each other, there must be more than
60MB of free log space available at all times.

If we assume the AIL tail pushing is keeping up with incoming
changes, that means we are keeping 25% of the log space is free for
in-memory transaction reservations. That means for this IO
completion transaction workload, we need a log size of at least
240MB to ensure free log space keeps ahead of transaction
concurrency. So even for really small filesysetms, this workload
needs a large log size.

For default sizing, we use filesystem size as a proxy for
concurrency - the larger the filesystem, the more workload
concurrency it can support, so the more AGs and log space is
requires. We can't get this right for every single type of storage
or workload that is out there, and we've always known this.  This is
one of the reasons behind developing mkfs config file support.

That is, if you have an application that doesn't fit the
general case and has specific log size requirements, you can write
a mkfs recipe for it and ship that so that users don't need to know
the details, just mkfs using the specific config file for that
application. In this case, the workload specific mkfs config file
simply needs to have "logsize = 500MB" and it will work for all
installations regardless of the filesystem capacity it is running
on.

Hence I'm not sure that tying the default log size to "we have a
problematic workload where small filesystems need X concurrency" is
exactly the right way to proceed here. It's the messenger, but it
should not dictate a specific solution to a much more general
problem.

Instead, I think we should be looking at things like the rotational
flag on the storage device. If that's not set, we can substantially
increase the AG count and log size scaling rate because those
devices will support much more IO and transaction concurrency.

We should be looking at CPU count, and scaling log size and AG count
based on the number of CPUs. The more CPUs we have, the more likely
we are going to see large amounts of concurrency, and so AG count
and log size should be adjusted to suit.

These are generic scaling factors that will improve the situation
for the workload being described, as well as improve the situation
for any other workload that runs on hardware that has high
concurrency capability. We still use capacity as a proxy for
scalability, we just adjust the ramp up ratio based on the
capabilities of the system presented to us.

I know, this doesn't solve the crazy "start tiny, grow 1000x before
deploy" issues, but that's a separate problem that really needs to
be addressed with independent changes such as updating minimum log
and AG sizes to mitigate those sorts of growth issues.

> > > I also wonder if the right thing to do here is just set a minimum
> > > log size of 32MB? The worst of the long tail latencies are mitigated
> > > by this point, and so even small filesystems grown out to 200GB will
> > > have a log size that results in decent performance for this sort of
> > > workload.
> > 
> > Are you asking for a second patch where mkfs refuses to format a log
> > smaller than 32MB (e.g. 8GB with the x86 defaults)?  Or a second patch
> > that cranks the minimum log size up to 32MB, even if that leads to
> > absurd results (e.g. 66MB filesystems with 2 AGs and a 32MB log)?
> 
> I'm suggesting the latter.
> 
> Along with a refusal to make an XFS filesystem smaller than, say,
> 256MB, because allowing users to make tiny XFS filesystems seems to
> always just lead to future troubles.

When I say stuff like this, keep in mind that I'm trying to suggest
things that will stand for years, not just alleviate the immediate
concerns. Really small XFS filesystems are at the end of the scale
we really don't care about, so let's just stop with the make-beleive
that we'll support and optimise for them. Let's just set some new
limits that are appropriate for the sort of storage and minimum
sizes we actually see XFS used for.

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

  parent reply	other threads:[~2022-03-01  3:10 UTC|newest]

Thread overview: 60+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-01-20  0:21 [PATCHSET 00/17] xfsprogs: various 5.15 fixes Darrick J. Wong
2022-01-20  0:21 ` [PATCH 01/17] libxcmd: use emacs mode for command history editing Darrick J. Wong
2022-01-20  0:21 ` [PATCH 02/17] libxfs: shut down filesystem if we xfs_trans_cancel with deferred work items Darrick J. Wong
2022-02-04 21:36   ` Eric Sandeen
2022-02-04 21:47     ` Darrick J. Wong
2022-01-20  0:21 ` [PATCH 03/17] libxfs: don't leave dangling perag references from xfs_buf Darrick J. Wong
2022-02-04 22:05   ` Eric Sandeen
2022-01-20  0:21 ` [PATCH 04/17] libfrog: move the GETFSMAP definitions into libfrog Darrick J. Wong
2022-02-04 23:18   ` Eric Sandeen
2022-02-05  0:36     ` Darrick J. Wong
2022-02-07  1:05       ` Dave Chinner
2022-02-07 17:09         ` Darrick J. Wong
2022-02-07 21:32           ` Eric Sandeen
2022-02-10  3:33             ` Dave Chinner
2022-02-08 16:46   ` [PATCH v1.1 04/17] libfrog: always use the kernel GETFSMAP definitions Darrick J. Wong
2022-02-25 22:35     ` Eric Sandeen
2022-01-20  0:22 ` [PATCH 05/17] misc: add a crc32c self test to mkfs and repair Darrick J. Wong
2022-02-04 23:23   ` Eric Sandeen
2022-01-20  0:22 ` [PATCH 06/17] libxfs-apply: support filterdiff >= 0.4.2 only Darrick J. Wong
2022-01-20  0:22 ` [PATCH 07/17] xfs_db: fix nbits parameter in fa_ino[48] functions Darrick J. Wong
2022-02-25 21:45   ` Eric Sandeen
2022-01-20  0:22 ` [PATCH 08/17] xfs_repair: explicitly cast resource usage counts in do_warn Darrick J. Wong
2022-02-25 21:46   ` Eric Sandeen
2022-01-20  0:22 ` [PATCH 09/17] xfs_repair: explicitly cast directory inode numbers " Darrick J. Wong
2022-02-25 21:48   ` Eric Sandeen
2022-01-20  0:22 ` [PATCH 10/17] xfs_repair: fix indentation problems in upgrade_filesystem Darrick J. Wong
2022-02-25 21:53   ` Eric Sandeen
2022-01-20  0:22 ` [PATCH 11/17] xfs_repair: update secondary superblocks after changing features Darrick J. Wong
2022-02-25 21:57   ` Eric Sandeen
2022-01-20  0:22 ` [PATCH 12/17] xfs_scrub: report optional features in version string Darrick J. Wong
2022-01-20  1:16   ` Theodore Ts'o
2022-01-20  1:28     ` Darrick J. Wong
2022-01-20  1:32   ` [PATCH v2 " Darrick J. Wong
2022-02-25 22:14     ` Eric Sandeen
2022-02-26  0:04       ` Darrick J. Wong
2022-02-26  2:48         ` Darrick J. Wong
2022-02-26  2:53   ` [PATCH v3 " Darrick J. Wong
2022-02-28 21:38     ` Eric Sandeen
2022-01-20  0:22 ` [PATCH 13/17] mkfs: prevent corruption of passed-in suboption string values Darrick J. Wong
2022-01-20  0:22 ` [PATCH 14/17] mkfs: add configuration files for the last few LTS kernels Darrick J. Wong
2022-01-20  0:22 ` [PATCH 15/17] mkfs: document sample configuration file location Darrick J. Wong
2022-01-20  0:23 ` [PATCH 16/17] mkfs: add a config file for x86_64 pmem filesystems Darrick J. Wong
2022-02-25 22:21   ` Eric Sandeen
2022-02-26  2:38     ` Darrick J. Wong
2022-02-26  2:52   ` [PATCH v2 " Darrick J. Wong
2022-02-28 21:37     ` Eric Sandeen
2022-01-20  0:23 ` [PATCH 17/17] mkfs: enable inobtcount and bigtime by default Darrick J. Wong
2022-02-25 22:22   ` Eric Sandeen
2022-01-28 22:44 ` [PATCH 18/17] xfs_scrub: fix reporting if we can't open raw block devices Darrick J. Wong
2022-01-31 12:28   ` Christoph Hellwig
2022-02-26  2:54 ` [PATCH 19/17] mkfs: increase default log size for new (aka bigtime) filesystems Darrick J. Wong
2022-02-26 21:37   ` Dave Chinner
2022-02-28 23:22     ` Darrick J. Wong
2022-03-01  0:42       ` Dave Chinner
2022-03-01  2:38         ` Darrick J. Wong
2022-03-01 15:55           ` Brian Foster
2022-03-01  3:10         ` Dave Chinner [this message]
2022-02-28 21:44   ` Eric Sandeen
2022-03-01  2:21     ` Darrick J. Wong
2022-03-01  2:44       ` Eric Sandeen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220301031054.GU59715@dread.disaster.area \
    --to=david@fromorbit.com \
    --cc=allison.henderson@oracle.com \
    --cc=dchinner@redhat.com \
    --cc=djwong@kernel.org \
    --cc=hch@lst.de \
    --cc=linux-xfs@vger.kernel.org \
    --cc=sandeen@sandeen.net \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.