All of lore.kernel.org
 help / color / mirror / Atom feed
From: Brian Foster <bfoster@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 3/7] cache: prevent expansion races
Date: Thu, 1 Nov 2018 09:17:32 -0400	[thread overview]
Message-ID: <20181101131732.GB21654@bfoster> (raw)
In-Reply-To: <20181101012739.GY19305@dastard>

On Thu, Nov 01, 2018 at 12:27:39PM +1100, Dave Chinner wrote:
> On Wed, Oct 31, 2018 at 01:13:02PM -0400, Brian Foster wrote:
> > On Tue, Oct 30, 2018 at 10:20:39PM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > When a sudden buffer cache growth demand occurs from multiple
> > > concurrent sources, they can all fail shaking the cache at the same
> > > time and expand the cache. This results in the cache doubling in
> > > size multiple times when only a single expansion was really
> > > necessary. The resultant massive cache can cause repair to OOM as
> > > the cache will grow to much larger than memory can actually hold.
> > > 
> > > Prevent this sudden and unnecessary expansion by rate limiting cache
> > > growth to one size increase every tens seconds. This is sufficient
> > > to prevent racing expansion calls from doing the wrong thing, but
> > > not too long to prevent necesary cache growth when it is undersized.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > 
> > This seems fairly harmless, but I'm not really a fan of this kind of
> > time-based algorithm in general. Could we do something similarly
> > effective that doesn't require a magic time value?
> > 
> > For example, suppose we sample the current cache size at the top of the
> > function, pass that value along to cache_expand() and have the latter
> > only bump ->c_maxcount if it still matches the parameter value. Then
> > return-assign the local var with the latest value:
> > 
> > 	max_sample = cache_expand(cache, max_sample);
> > 
> > The end result is that an allocator must shake every mru under a given
> > cache size before it's allowed to grow the cache. Hm?
> 
> I tried that and it still causes excessive cache expansion on my
> really fast IO subsystem.  I'm seeing peaks of 300,000 IOPS when cache
> expansion bursts occur. They are lasting for up to 20s on my machine
> and it's enough to cause the cache to double in size multiple times.
> Once the initial bursts have run their course, demand drops to a
> steady state of 130-150kiops which the original cache size is quite
> capable of sustaining.
> 

Interesting.

> These bursts are driven by the initial read-ahead queues being
> filled and stop when the queues fill up.  If the IO is slow, then
> there isn't cache pressure because processing keeps up with
> readahead. If IO is fast, it fills the RA queues and then throttles
> back to processing speed. It's filling the RA queues that causes the
> buffer cache growth pressure, and it's really not necessary - the RA
> queues are already full enough to maximise performance in phase 6,
> and so growing memory footprint doesn't gain us anything.
> 

Makes sense..

> i.e. the buffer cache on large filesystems like this is used as a
> prefetch buffer. It is not a cache that stores anything for repeated
> use - there's 500GB of metadata (>450 million inodes) in this
> filesystem and we're only running w/ 128GB RAM, so we have no hope
> of caching the entire working set in RAM. Hence we want to keep the
> cache size down to just what is necessary to sustain effective
> prefetch.
> 
> The problem with throttling on "scanned the whole cache" is that
> this happens really quickly when you have a buffer demand of ~300k/s
> When I start with a cache size of 800k buffers (-o bhash=101031)
> doubling the cache size from 800k objects to 1.6m objects occurs
> within a couple of seconds of phase 6 starting. A couple of seconds
> later it doubles again, and by the time the initial 20s burst has
> finished, it's doubled 3 or 4 times.
> 

Ok, but I'm curious how much of this boils down to lock contention
exacerbated by the fast I/O. I can see how reduced I/O latency could
contribute to the rapid growth rate. At the same time, it looks like all
the shaker does in this pattern is flush buffers if dirty (which seems
unlikely if we're in some kind of prefetch overload?) and move them to
another list (for potential reuse of memory). 

We clearly have multiple lookup/fail/expand races going on as evidenced
by the need for this patch in the first place. The cache growth is
driven by cache misses (i.e. prefetch). The buffer lookups acquire hash
locks and the cache lock (for the node allocation attempt). If the alloc
is allowed, we read the buf and add it to the hash. If not, we shake the
mru and retry the same mru or next. The shake trylocks each node (buf)
and hash lock in the mru. If either lock fails, we skip to the next
buffer.

I'm not sure how likely this is, but what are the chances the shakes are
contending with eachother such that lock contention prevents real work
from being done in the shaker? That seems like a reasonable possibility
to me given that you're clearly reproducing enough parallel shakes to
explode the cache rather quickly. Each shake is essentially iterating
through the same dataset, right? Of course, I don't have the complete
picture so perhaps there are other factors at play. I think it's at
least worth looking into if you haven't already because if something
like that is going on, it could be more likely that the time-based
implementation basically trades one kind of resource (memory) wastage
for another (cpu).

Brian

> After the last expansion, the buffer cache can now grow way beyond
> what we can fit in memory and reapir eventually gets OOM killed on a
> 128GB RAM machine even though it really only needs ~70GB RAM to
> complete successfully.
> 
> IOWs, throttling expansion based on "cache full" demand doesn't
> throttle hard enough to prevent OOM kill in these cases. The time
> based throttling is based around preventing bursts for driving
> unnecessary expansion. If the sudden burst turns out ot be sustained
> demand and hence the cache really does need to be expanded multiple
> times, then the cache will grow quickly enough to meet that demand.
> We just don't want bursts to trigger that.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

  reply	other threads:[~2018-11-01 22:20 UTC|newest]

Thread overview: 35+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-10-30 11:20 [PATCH 0/7] xfs_repair: scale to 150,000 iops Dave Chinner
2018-10-30 11:20 ` [PATCH 1/7] Revert "xfs_repair: treat zero da btree pointers as corruption" Dave Chinner
2018-10-30 17:20   ` Darrick J. Wong
2018-10-30 19:35     ` Eric Sandeen
2018-10-30 20:11       ` Dave Chinner
2018-10-30 11:20 ` [PATCH 2/7] repair: don't dirty inodes which are not unlinked Dave Chinner
2018-10-30 17:26   ` Darrick J. Wong
2018-10-30 20:03   ` Eric Sandeen
2018-10-30 20:09     ` Eric Sandeen
2018-10-30 20:34       ` Dave Chinner
2018-10-30 20:40         ` Eric Sandeen
2018-10-30 20:58           ` Dave Chinner
2018-10-30 11:20 ` [PATCH 3/7] cache: prevent expansion races Dave Chinner
2018-10-30 17:39   ` Darrick J. Wong
2018-10-30 20:35     ` Dave Chinner
2018-10-31 17:13   ` Brian Foster
2018-11-01  1:27     ` Dave Chinner
2018-11-01 13:17       ` Brian Foster [this message]
2018-11-01 21:23         ` Dave Chinner
2018-11-02 11:31           ` Brian Foster
2018-11-02 23:26             ` Dave Chinner
2018-10-30 11:20 ` [PATCH 4/7] workqueue: bound maximum queue depth Dave Chinner
2018-10-30 17:58   ` Darrick J. Wong
2018-10-30 20:53     ` Dave Chinner
2018-10-31 17:14       ` Brian Foster
2018-10-30 11:20 ` [PATCH 5/7] repair: Protect bad inode list with mutex Dave Chinner
2018-10-30 17:44   ` Darrick J. Wong
2018-10-30 20:54     ` Dave Chinner
2018-10-30 11:20 ` [PATCH 6/7] repair: protect inode chunk tree records with a mutex Dave Chinner
2018-10-30 17:46   ` Darrick J. Wong
2018-10-30 11:20 ` [PATCH 7/7] repair: parallelise phase 6 Dave Chinner
2018-10-30 17:51   ` Darrick J. Wong
2018-10-30 20:55     ` Dave Chinner
2018-11-07  5:44 ` [PATCH 0/7] xfs_repair: scale to 150,000 iops Arkadiusz Miśkiewicz
2018-11-07  6:48   ` Dave Chinner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20181101131732.GB21654@bfoster \
    --to=bfoster@redhat.com \
    --cc=david@fromorbit.com \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.