All of lore.kernel.org
 help / color / mirror / Atom feed
From: Brian Foster <bfoster@redhat.com>
To: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>, linux-xfs@vger.kernel.org
Subject: Re: [PATCH 01/10] xfs: create simplified inode walk function
Date: Wed, 12 Jun 2019 08:13:10 -0400	[thread overview]
Message-ID: <20190612121310.GD12395@bfoster> (raw)
In-Reply-To: <20190611230514.GU1871505@magnolia>

On Tue, Jun 11, 2019 at 04:05:14PM -0700, Darrick J. Wong wrote:
> On Wed, Jun 12, 2019 at 08:33:41AM +1000, Dave Chinner wrote:
> > On Mon, Jun 10, 2019 at 04:11:34PM -0700, Darrick J. Wong wrote:
> > > On Mon, Jun 10, 2019 at 01:55:10PM -0400, Brian Foster wrote:
> > > > > I could extend the comment to explain why we don't use PAGE_SIZE...
> > > > > 
> > > > 
> > > > Sounds good, though what I think would be better is to define a
> > > > IWALK_DEFAULT_RECS or some such somewhere and put the calculation
> > > > details with that.
> > > > 
> > > > Though now that you point out the readahead thing, aren't we at risk of
> > > > a similar problem for users who happen to pass a really large userspace
> > > > buffer? Should we cap the kernel allocation/readahead window in all
> > > > cases and not just the default case?
> > > 
> > > Hmm, that's right, we don't want to let userspace arbitrarily determine
> > > the size of the buffer, and I think the current implementation caps it
> > > the readahaead at ... oh, PAGE_SIZE / sizeof(xfs_inogrp_t).
> > > 
> > > Oh, right, and in the V1 patchset Dave said that we should constrain
> > > readahead even further.
> > 
> > Right, I should explain a bit further why, too - it's about
> > performance.  I've found that a user buffer size of ~1024 inodes is
> > generally enough to max out performance of bulkstat. i.e. somewhere
> > around 1000 inodes per syscall is enough to mostly amortise all of
> > the cost of syscall, setup, readahead, etc vs the CPU overhead of
> > copying all the inodes into the user buffer.
> > 
> > Once the user buffer goes over a few thousand inodes, performance
> > then starts to tail back off - we don't get any gains from trying to
> > bulkstat tens of thousands of inodes at a time, especially under
> > memory pressure because that can push us into readahead and buffer
> > cache thrashing.
> 
> <nod> I don't mind setting the max inobt record cache buffer size to a
> smaller value (1024 bytes == 4096 inodes readahead?) so we can get a
> little farther into future hardware scalability (or decreases in syscall
> performance :P).
> 

The 1k baseline presumably applies to the current code. Taking a closer
look at the current code, we unconditionally allocate a 4 page record
buffer and start to fill it. For every record we grab, we issue
readahead on the underlying clusters.

Hmm, that seems like generally what this patchset is doing aside from
the more variable record buffer allocation. I'm fine with changing
things like record buffer allocation, readahead semantics, etc. given
Dave's practical analysis above, but TBH I don't think that should all
be part of the same patch. IMO, this rework patch should maintain as
close as possible to current behavior and a subsequent patches in the
series can tweak record buffer size and whatnot to improve readahead
logic. That makes this all easier to review, discuss and maintain in the
event of regression.

> I guess the question here is how to relate the number of inodes the user
> asked for to how many inobt records we have to read to find that many
> allocated inodes?  Or in other words, what's the average ir_freecount
> across all the inobt records?
> 

The current code basically looks like it allocates an oversized buffer
and hopes for the best with regard to readahead. If we took a similar
approach in terms of overestimating the buffer size (assuming not all
inode records are fully allocated), I suppose we could also track the
number of cluster readaheads issued and govern the collect/drain
sequences of the record buffer based on that..? But again, I think we
should have this as a separate "xfs: make iwalk readahead smarter ..."
patch that documents Dave's analysis above, perhaps includes some
numbers, etc..

> Note that this is technically a decrease since the old code would
> reserve 16K for this purpose...
> 

Indeed.

Brian

> > > > > /*
> > > > >  * Note: We hardcode 4096 here (instead of, say, PAGE_SIZE) because we want to
> > > > >  * constrain the amount of inode readahead to 16k inodes regardless of CPU:
> > > > >  *
> > > > >  * 4096 bytes / 16 bytes per inobt record = 256 inobt records
> > > > >  * 256 inobt records * 64 inodes per record = 16384 inodes
> > > > >  * 16384 inodes * 512 bytes per inode(?) = 8MB of inode readahead
> > > > >  */
> > 
> > Hence I suspect that even this is overkill - it makes no sense to
> > have a huge readahead window when there has been no measurable
> > performance benefit to doing large inode count bulkstat syscalls.
> > 
> > And, FWIW, readahead probably should also be capped at what the user
> > buffer can hold - no point in reading 16k inodes when the output
> > buffer can only fit 1000 inodes...
> 
> It already is -- the icount parameter from userspace is (eventually) fed
> to xfs_iwalk-set_prefetch.
> 
> --D
> 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com

  reply	other threads:[~2019-06-12 12:13 UTC|newest]

Thread overview: 46+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-06-04 21:49 [PATCH v2 00/10] xfs: refactor and improve inode iteration Darrick J. Wong
2019-06-04 21:49 ` [PATCH 01/10] xfs: create simplified inode walk function Darrick J. Wong
2019-06-10 13:58   ` Brian Foster
2019-06-10 16:59     ` Darrick J. Wong
2019-06-10 17:55       ` Brian Foster
2019-06-10 23:11         ` Darrick J. Wong
2019-06-11 22:33           ` Dave Chinner
2019-06-11 23:05             ` Darrick J. Wong
2019-06-12 12:13               ` Brian Foster [this message]
2019-06-12 16:53                 ` Darrick J. Wong
2019-06-12 17:54               ` Darrick J. Wong
2019-06-04 21:49 ` [PATCH 02/10] xfs: convert quotacheck to use the new iwalk functions Darrick J. Wong
2019-06-10 13:58   ` Brian Foster
2019-06-10 17:10     ` Darrick J. Wong
2019-06-11 23:23     ` Dave Chinner
2019-06-12  0:32       ` Darrick J. Wong
2019-06-12 12:55         ` Brian Foster
2019-06-12 23:33           ` Dave Chinner
2019-06-13 18:34             ` Brian Foster
2019-06-04 21:49 ` [PATCH 03/10] xfs: bulkstat should copy lastip whenever userspace supplies one Darrick J. Wong
2019-06-10 13:59   ` Brian Foster
2019-06-04 21:49 ` [PATCH 04/10] xfs: convert bulkstat to new iwalk infrastructure Darrick J. Wong
2019-06-10 14:02   ` Brian Foster
2019-06-10 17:38     ` Darrick J. Wong
2019-06-10 18:29       ` Brian Foster
2019-06-10 23:42         ` Darrick J. Wong
2019-06-04 21:49 ` [PATCH 05/10] xfs: move bulkstat ichunk helpers to iwalk code Darrick J. Wong
2019-06-10 14:02   ` Brian Foster
2019-06-04 21:50 ` [PATCH 06/10] xfs: change xfs_iwalk_grab_ichunk to use startino, not lastino Darrick J. Wong
2019-06-10 19:32   ` Brian Foster
2019-06-04 21:50 ` [PATCH 07/10] xfs: clean up long conditionals in xfs_iwalk_ichunk_ra Darrick J. Wong
2019-06-10 19:32   ` Brian Foster
2019-06-04 21:50 ` [PATCH 08/10] xfs: multithreaded iwalk implementation Darrick J. Wong
2019-06-10 19:40   ` Brian Foster
2019-06-11  1:10     ` Darrick J. Wong
2019-06-11 13:13       ` Brian Foster
2019-06-11 15:29         ` Darrick J. Wong
2019-06-11 17:00           ` Brian Foster
2019-06-04 21:50 ` [PATCH 09/10] xfs: poll waiting for quotacheck Darrick J. Wong
2019-06-11 15:07   ` Brian Foster
2019-06-11 16:06     ` Darrick J. Wong
2019-06-11 17:01       ` Brian Foster
2019-06-04 21:50 ` [PATCH 10/10] xfs: refactor INUMBERS to use iwalk functions Darrick J. Wong
2019-06-11 15:08   ` Brian Foster
2019-06-11 16:21     ` Darrick J. Wong
2019-06-11 17:01       ` Brian Foster

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190612121310.GD12395@bfoster \
    --to=bfoster@redhat.com \
    --cc=darrick.wong@oracle.com \
    --cc=david@fromorbit.com \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.