* [PATCH 0/5][RESEND] Support for metadata specific accounting
@ 2016-10-25 18:41 Josef Bacik
  2016-10-25 18:41 ` [PATCH 1/5] remove mapping from balance_dirty_pages*() Josef Bacik
                   ` (4 more replies)
  0 siblings, 5 replies; 26+ messages in thread
From: Josef Bacik @ 2016-10-25 18:41 UTC (permalink / raw)
  To: linux-btrfs, kernel-team, david, jack, linux-fsdevel, viro, hch, jweiner

(Sending again as 5/5 got eaten and I used the wrong email address for Dave.)

(Dave, again I apologize: for some reason our email server hates you, so I
didn't get your previous responses again and didn't notice until I was looking
at the patchwork history for my previous submissions.  I'll try to answer all
your questions here, and then I'll be much more judicious about making sure I'm
getting your emails.)

This is my current batch of patches to add better infrastructure for handling
metadata-related memory.  There have been a few changes to these patches since
the last go around, but the approach is essentially the same.

Changes since last go around:
-WB_WRITTEN/WB_DIRTIED are now counted in bytes.  This is necessary to deal
with metadata that is smaller than page size.  Currently btrfs only does things
in pagesize increments, but these patches will let us easily support
sub-pagesize blocksizes, so the counters need to be in bytes.
-Added a METADATA_BYTES counter to account for the total number of bytes used
for metadata.  We need this because the VM adjusts the dirty ratios based on how
much memory it thinks we can dirty, which is just the amount of free memory
plus the pagecache.
-Patch 5, which is a doozy; I will explain it below.

This is a more complete approach to keeping track of memory that is in use for
metadata.  Previously Btrfs had been using a special inode to allocate all of
our metadata pages out of, so with a heavy metadata workload we still benefited
from the balance_dirty_pages() throttling, which kept everything sane.  We want
to kill this in order to make it easier to support pagesize > blocksize in
btrfs, in much the same way XFS does.

One concern that was brought up was the global accounting vs one bad actor.  We
still need the global accounting so we can enforce the global dirty limits, but
we have the per-bdi accounting as well to make sure we are punishing the bad
actor and not all the other file systems that happened to be hooked into this
infrastructure.

The *REFERENCED patch is to fix a behavior I noticed when testing these patches
on a more memory-constrained system than I had previously used.  Now that the
eviction of metadata pages is controlled solely through the shrinker interface I
was seeing a big perf drop when we had to start doing memory reclaim.  This was
because most of the RAM on the box (10gig of 16gig) was used solely by icache
and dcache.  Because this workload was just creating files and never touching
them again, most of this cache was easily evictable.  However, because we by
default send every object in the icache/dcache through the LRU twice, it would
take around 10,000 iterations through the shrinker before it would start to
evict things (there were ~10 million objects between the two caches).  The
pagecache solves this by putting everything on the INACTIVE list first, and
then there's a two-part activation phase before a page is moved to the active
list, so you have to really touch a page a lot before it is considered active.
However, we assume icache/dcache objects are active first, which is very
latency inducing when we want to start reclaiming objects.  Making this change
keeps us in line with what pagecache does and eliminates the performance drop I
was seeing.
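
To make the idea concrete, here is a rough sketch of the inode side of patch 5
(an illustration only, not the actual diff, and it glosses over the lru_lock
and i_lock handling); the dentry side is analogous:

	/*
	 * Illustration only: mark an inode REFERENCED on its final iput only
	 * if it is already sitting on the LRU.  A freshly created inode that
	 * is never touched again can then be evicted on the first shrinker
	 * pass instead of surviving an extra trip through the LRU.
	 */
	static void mark_inode_referenced(struct inode *inode)
	{
		if (!list_empty(&inode->i_lru))
			inode->i_state |= I_REFERENCED;
	}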

Let me know what you think.  I've tried to keep this as minimal and sane as
possible so we can build on it in the future as we want to handle more nuanced
scenarios.  Also if we think this is a bad approach I won't feel as bad throwing
it out ;).  Thanks,

Josef


* [PATCH 1/5] remove mapping from balance_dirty_pages*()
  2016-10-25 18:41 [PATCH 0/5][RESEND] Support for metadata specific accounting Josef Bacik
@ 2016-10-25 18:41 ` Josef Bacik
  2016-10-25 18:47   ` Tejun Heo
  2016-10-25 18:41 ` [PATCH 2/5] writeback: convert WB_WRITTEN/WB_DIRTIED counters to bytes Josef Bacik
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 26+ messages in thread
From: Josef Bacik @ 2016-10-25 18:41 UTC (permalink / raw)
  To: linux-btrfs, kernel-team, david, jack, linux-fsdevel, viro, hch, jweiner

The only reason we pass in the mapping is to get the inode in order to see if
cgroup writeback is enabled, and even then it only checks the bdi and a super
block flag.  balance_dirty_pages() doesn't even use the mapping.  Since
balance_dirty_pages*() works at the bdi level, just pass in the bdi and super
block directly so we can avoid using the mapping.  This will allow us to still
use balance_dirty_pages for dirty metadata pages that are not backed by an
address_space.
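
(Illustrative only, not from this patch: with the new prototype, metadata code
that has no address_space can throttle directly against its bdi and super
block; 'fs_bdi' and 'fs_sb' stand in for a filesystem's own pointers.)

	/* Needs <linux/writeback.h> and <linux/backing-dev.h>. */
	static void throttle_dirty_metadata(struct backing_dev_info *fs_bdi,
					    struct super_block *fs_sb)
	{
		balance_dirty_pages_ratelimited(fs_bdi, fs_sb);
	}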

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 drivers/mtd/devices/block2mtd.c | 12 ++++++++----
 fs/btrfs/disk-io.c              |  4 ++--
 fs/btrfs/file.c                 |  3 ++-
 fs/btrfs/ioctl.c                |  3 ++-
 fs/btrfs/relocation.c           |  3 ++-
 fs/buffer.c                     |  3 ++-
 fs/iomap.c                      |  3 ++-
 fs/ntfs/attrib.c                | 10 +++++++---
 fs/ntfs/file.c                  |  4 ++--
 include/linux/backing-dev.h     | 29 +++++++++++++++++++++++------
 include/linux/writeback.h       |  3 ++-
 mm/filemap.c                    |  4 +++-
 mm/memory.c                     |  9 +++++++--
 mm/page-writeback.c             | 15 +++++++--------
 14 files changed, 71 insertions(+), 34 deletions(-)

diff --git a/drivers/mtd/devices/block2mtd.c b/drivers/mtd/devices/block2mtd.c
index 7c887f1..7892d0b 100644
--- a/drivers/mtd/devices/block2mtd.c
+++ b/drivers/mtd/devices/block2mtd.c
@@ -52,7 +52,8 @@ static struct page *page_read(struct address_space *mapping, int index)
 /* erase a specified part of the device */
 static int _block2mtd_erase(struct block2mtd_dev *dev, loff_t to, size_t len)
 {
-	struct address_space *mapping = dev->blkdev->bd_inode->i_mapping;
+	struct inode *inode = dev->blkdev->bd_inode;
+	struct address_space *mapping = inode->i_mapping;
 	struct page *page;
 	int index = to >> PAGE_SHIFT;	// page index
 	int pages = len >> PAGE_SHIFT;
@@ -71,7 +72,8 @@ static int _block2mtd_erase(struct block2mtd_dev *dev, loff_t to, size_t len)
 				memset(page_address(page), 0xff, PAGE_SIZE);
 				set_page_dirty(page);
 				unlock_page(page);
-				balance_dirty_pages_ratelimited(mapping);
+				balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+								inode->i_sb);
 				break;
 			}
 
@@ -141,7 +143,8 @@ static int _block2mtd_write(struct block2mtd_dev *dev, const u_char *buf,
 		loff_t to, size_t len, size_t *retlen)
 {
 	struct page *page;
-	struct address_space *mapping = dev->blkdev->bd_inode->i_mapping;
+	struct inode *inode = dev->blkdev->bd_inode;
+	struct address_space *mapping = inode->i_mapping;
 	int index = to >> PAGE_SHIFT;	// page index
 	int offset = to & ~PAGE_MASK;	// page offset
 	int cpylen;
@@ -162,7 +165,8 @@ static int _block2mtd_write(struct block2mtd_dev *dev, const u_char *buf,
 			memcpy(page_address(page) + offset, buf, cpylen);
 			set_page_dirty(page);
 			unlock_page(page);
-			balance_dirty_pages_ratelimited(mapping);
+			balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+							inode->i_sb);
 		}
 		put_page(page);
 
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index e720d3e..fa07e81 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4085,8 +4085,8 @@ static void __btrfs_btree_balance_dirty(struct btrfs_root *root,
 	ret = percpu_counter_compare(&root->fs_info->dirty_metadata_bytes,
 				     BTRFS_DIRTY_METADATA_THRESH);
 	if (ret > 0) {
-		balance_dirty_pages_ratelimited(
-				   root->fs_info->btree_inode->i_mapping);
+		balance_dirty_pages_ratelimited(&root->fs_info->bdi,
+						root->fs_info->sb);
 	}
 }
 
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 72a180d..cbefdc8 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1711,7 +1711,8 @@ again:
 
 		cond_resched();
 
-		balance_dirty_pages_ratelimited(inode->i_mapping);
+		balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+						inode->i_sb);
 		if (dirty_pages < (root->nodesize >> PAGE_SHIFT) + 1)
 			btrfs_btree_balance_dirty(root);
 
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index af69129..f68692a 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1410,7 +1410,8 @@ int btrfs_defrag_file(struct inode *inode, struct file *file,
 		}
 
 		defrag_count += ret;
-		balance_dirty_pages_ratelimited(inode->i_mapping);
+		balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+						inode->i_sb);
 		inode_unlock(inode);
 
 		if (newer_than) {
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 0ec8ffa..de4f221 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -3228,7 +3228,8 @@ static int relocate_file_extent_cluster(struct inode *inode,
 		put_page(page);
 
 		index++;
-		balance_dirty_pages_ratelimited(inode->i_mapping);
+		balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+						inode->i_sb);
 		btrfs_throttle(BTRFS_I(inode)->root);
 	}
 	WARN_ON(nr != cluster->nr);
diff --git a/fs/buffer.c b/fs/buffer.c
index 9c8eb9b..9bbe30d 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2386,7 +2386,8 @@ static int cont_expand_zero(struct file *file, struct address_space *mapping,
 		BUG_ON(err != len);
 		err = 0;
 
-		balance_dirty_pages_ratelimited(mapping);
+		balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+						inode->i_sb);
 
 		if (unlikely(fatal_signal_pending(current))) {
 			err = -EINTR;
diff --git a/fs/iomap.c b/fs/iomap.c
index 706270f..9025c43 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -226,7 +226,8 @@ again:
 		written += copied;
 		length -= copied;
 
-		balance_dirty_pages_ratelimited(inode->i_mapping);
+		balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+						inode->i_sb);
 	} while (iov_iter_count(i) && length);
 
 	return written ? written : status;
diff --git a/fs/ntfs/attrib.c b/fs/ntfs/attrib.c
index 44a39a0..0a8a39e 100644
--- a/fs/ntfs/attrib.c
+++ b/fs/ntfs/attrib.c
@@ -2493,6 +2493,7 @@ conv_err_out:
 int ntfs_attr_set(ntfs_inode *ni, const s64 ofs, const s64 cnt, const u8 val)
 {
 	ntfs_volume *vol = ni->vol;
+	struct inode *inode = VFS_I(ni);
 	struct address_space *mapping;
 	struct page *page;
 	u8 *kaddr;
@@ -2545,7 +2546,8 @@ int ntfs_attr_set(ntfs_inode *ni, const s64 ofs, const s64 cnt, const u8 val)
 		kunmap_atomic(kaddr);
 		set_page_dirty(page);
 		put_page(page);
-		balance_dirty_pages_ratelimited(mapping);
+		balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+						inode->i_sb);
 		cond_resched();
 		if (idx == end)
 			goto done;
@@ -2586,7 +2588,8 @@ int ntfs_attr_set(ntfs_inode *ni, const s64 ofs, const s64 cnt, const u8 val)
 		/* Finally unlock and release the page. */
 		unlock_page(page);
 		put_page(page);
-		balance_dirty_pages_ratelimited(mapping);
+		balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+						inode->i_sb);
 		cond_resched();
 	}
 	/* If there is a last partial page, need to do it the slow way. */
@@ -2603,7 +2606,8 @@ int ntfs_attr_set(ntfs_inode *ni, const s64 ofs, const s64 cnt, const u8 val)
 		kunmap_atomic(kaddr);
 		set_page_dirty(page);
 		put_page(page);
-		balance_dirty_pages_ratelimited(mapping);
+		balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+						inode->i_sb);
 		cond_resched();
 	}
 done:
diff --git a/fs/ntfs/file.c b/fs/ntfs/file.c
index f548629..66082eb 100644
--- a/fs/ntfs/file.c
+++ b/fs/ntfs/file.c
@@ -276,7 +276,7 @@ do_non_resident_extend:
 		 * number of pages we read and make dirty in the case of sparse
 		 * files.
 		 */
-		balance_dirty_pages_ratelimited(mapping);
+		balance_dirty_pages_ratelimited(inode_to_bdi(vi), vi->i_sb);
 		cond_resched();
 	} while (++index < end_index);
 	read_lock_irqsave(&ni->size_lock, flags);
@@ -1914,7 +1914,7 @@ again:
 		iov_iter_advance(i, copied);
 		pos += copied;
 		written += copied;
-		balance_dirty_pages_ratelimited(mapping);
+		balance_dirty_pages_ratelimited(inode_to_bdi(vi), vi->i_sb);
 		if (fatal_signal_pending(current)) {
 			status = -EINTR;
 			break;
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 43b93a9..9eb2cf2 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -253,8 +253,9 @@ void wb_blkcg_offline(struct blkcg *blkcg);
 int inode_congested(struct inode *inode, int cong_bits);
 
 /**
- * inode_cgwb_enabled - test whether cgroup writeback is enabled on an inode
- * @inode: inode of interest
+ * bdi_cgwb_enabled - test whether cgroup writeback is enabled on a filesystem
+ * @bdi: the bdi we care about
+ * @sb: the super for the bdi
  *
  * cgroup writeback requires support from both the bdi and filesystem.
  * Also, both memcg and iocg have to be on the default hierarchy.  Test
@@ -263,15 +264,25 @@ int inode_congested(struct inode *inode, int cong_bits);
  * Note that the test result may change dynamically on the same inode
  * depending on how memcg and iocg are configured.
  */
-static inline bool inode_cgwb_enabled(struct inode *inode)
+static inline bool bdi_cgwb_enabled(struct backing_dev_info *bdi,
+				    struct super_block *sb)
 {
-	struct backing_dev_info *bdi = inode_to_bdi(inode);
-
 	return cgroup_subsys_on_dfl(memory_cgrp_subsys) &&
 		cgroup_subsys_on_dfl(io_cgrp_subsys) &&
 		bdi_cap_account_dirty(bdi) &&
 		(bdi->capabilities & BDI_CAP_CGROUP_WRITEBACK) &&
-		(inode->i_sb->s_iflags & SB_I_CGROUPWB);
+		(sb->s_iflags & SB_I_CGROUPWB);
+}
+
+/**
+ * inode_cgwb_enabled - test whether cgroup writeback is enabled on an inode
+ * @inode: inode of interest
+ *
+ * Does the inode have cgroup writeback support.
+ */
+static inline bool inode_cgwb_enabled(struct inode *inode)
+{
+	return bdi_cgwb_enabled(inode_to_bdi(inode), inode->i_sb);
 }
 
 /**
@@ -414,6 +425,12 @@ static inline void unlocked_inode_to_wb_end(struct inode *inode, bool locked)
 
 #else	/* CONFIG_CGROUP_WRITEBACK */
 
+static inline bool bdi_cgwb_enabled(struct backing_dev_info *bdi,
+				    struct super_block *sb)
+{
+	return false;
+}
+
 static inline bool inode_cgwb_enabled(struct inode *inode)
 {
 	return false;
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index fc1e16c..256ffc3 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -364,7 +364,8 @@ unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh);
 
 void wb_update_bandwidth(struct bdi_writeback *wb, unsigned long start_time);
 void page_writeback_init(void);
-void balance_dirty_pages_ratelimited(struct address_space *mapping);
+void balance_dirty_pages_ratelimited(struct backing_dev_info *bdi,
+				     struct super_block *sb);
 bool wb_over_bg_thresh(struct bdi_writeback *wb);
 
 typedef int (*writepage_t)(struct page *page, struct writeback_control *wbc,
diff --git a/mm/filemap.c b/mm/filemap.c
index 8a287df..03fa60a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2667,6 +2667,7 @@ ssize_t generic_perform_write(struct file *file,
 				struct iov_iter *i, loff_t pos)
 {
 	struct address_space *mapping = file->f_mapping;
+	struct inode *inode = mapping->host;
 	const struct address_space_operations *a_ops = mapping->a_ops;
 	long status = 0;
 	ssize_t written = 0;
@@ -2746,7 +2747,8 @@ again:
 		pos += copied;
 		written += copied;
 
-		balance_dirty_pages_ratelimited(mapping);
+		balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+						inode->i_sb);
 	} while (iov_iter_count(i));
 
 	return written ? written : status;
diff --git a/mm/memory.c b/mm/memory.c
index 793fe0f..b47a73f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -64,6 +64,7 @@
 #include <linux/debugfs.h>
 #include <linux/userfaultfd_k.h>
 #include <linux/dax.h>
+#include <linux/backing-dev.h>
 
 #include <asm/io.h>
 #include <asm/mmu_context.h>
@@ -2105,11 +2106,13 @@ static inline int wp_page_reuse(struct fault_env *fe, pte_t orig_pte,
 		put_page(page);
 
 		if ((dirtied || page_mkwrite) && mapping) {
+			struct inode *inode = mapping->host;
 			/*
 			 * Some device drivers do not set page.mapping
 			 * but still dirty their pages
 			 */
-			balance_dirty_pages_ratelimited(mapping);
+			balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+							inode->i_sb);
 		}
 
 		if (!page_mkwrite)
@@ -3291,11 +3294,13 @@ static int do_shared_fault(struct fault_env *fe, pgoff_t pgoff)
 	mapping = page_rmapping(fault_page);
 	unlock_page(fault_page);
 	if ((dirtied || vma->vm_ops->page_mkwrite) && mapping) {
+		struct inode *inode = mapping->host;
 		/*
 		 * Some device drivers do not set page.mapping but still
 		 * dirty their pages
 		 */
-		balance_dirty_pages_ratelimited(mapping);
+		balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+						inode->i_sb);
 	}
 
 	if (!vma->vm_ops->page_mkwrite)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index f4cd7d8..121a6e3 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1559,8 +1559,7 @@ static inline void wb_dirty_limits(struct dirty_throttle_control *dtc)
  * If we're over `background_thresh' then the writeback threads are woken to
  * perform some writeout.
  */
-static void balance_dirty_pages(struct address_space *mapping,
-				struct bdi_writeback *wb,
+static void balance_dirty_pages(struct bdi_writeback *wb,
 				unsigned long pages_dirtied)
 {
 	struct dirty_throttle_control gdtc_stor = { GDTC_INIT(wb) };
@@ -1849,7 +1848,8 @@ DEFINE_PER_CPU(int, dirty_throttle_leaks) = 0;
 
 /**
  * balance_dirty_pages_ratelimited - balance dirty memory state
- * @mapping: address_space which was dirtied
+ * @bdi: the bdi that was dirtied
+ * @sb: the super block that was dirtied
  *
  * Processes which are dirtying memory should call in here once for each page
  * which was newly dirtied.  The function will periodically check the system's
@@ -1860,10 +1860,9 @@ DEFINE_PER_CPU(int, dirty_throttle_leaks) = 0;
  * limit we decrease the ratelimiting by a lot, to prevent individual processes
  * from overshooting the limit by (ratelimit_pages) each.
  */
-void balance_dirty_pages_ratelimited(struct address_space *mapping)
+void balance_dirty_pages_ratelimited(struct backing_dev_info *bdi,
+				     struct super_block *sb)
 {
-	struct inode *inode = mapping->host;
-	struct backing_dev_info *bdi = inode_to_bdi(inode);
 	struct bdi_writeback *wb = NULL;
 	int ratelimit;
 	int *p;
@@ -1871,7 +1870,7 @@ void balance_dirty_pages_ratelimited(struct address_space *mapping)
 	if (!bdi_cap_account_dirty(bdi))
 		return;
 
-	if (inode_cgwb_enabled(inode))
+	if (bdi_cgwb_enabled(bdi, sb))
 		wb = wb_get_create_current(bdi, GFP_KERNEL);
 	if (!wb)
 		wb = &bdi->wb;
@@ -1909,7 +1908,7 @@ void balance_dirty_pages_ratelimited(struct address_space *mapping)
 	preempt_enable();
 
 	if (unlikely(current->nr_dirtied >= ratelimit))
-		balance_dirty_pages(mapping, wb, current->nr_dirtied);
+		balance_dirty_pages(wb, current->nr_dirtied);
 
 	wb_put(wb);
 }
-- 
2.7.4



* [PATCH 2/5] writeback: convert WB_WRITTEN/WB_DIRTIED counters to bytes
  2016-10-25 18:41 [PATCH 0/5][RESEND] Support for metadata specific accounting Josef Bacik
  2016-10-25 18:41 ` [PATCH 1/5] remove mapping from balance_dirty_pages*() Josef Bacik
@ 2016-10-25 18:41 ` Josef Bacik
  2016-10-25 19:03   ` Tejun Heo
  2016-10-30 15:13   ` Jan Kara
  2016-10-25 18:41 ` [PATCH 3/5] writeback: add counters for metadata usage Josef Bacik
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 26+ messages in thread
From: Josef Bacik @ 2016-10-25 18:41 UTC (permalink / raw)
  To: linux-btrfs, kernel-team, david, jack, linux-fsdevel, viro, hch, jweiner

These are counters that constantly go up in order to do bandwidth calculations.
It isn't important what units they are counted in, as long as the two of them
are consistent, so convert them to count bytes written/dirtied, and allow the
metadata accounting code to update the counters as well.
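
As an illustration (not part of this patch), a filesystem doing sub-pagesize
metadata I/O could then feed the real transfer size into the completion
accounting instead of rounding up to pages; the helper below is hypothetical:

	/* Hypothetical caller: account completed metadata writeback in bytes. */
	static void meta_writeback_done(struct bdi_writeback *wb, long blocksize)
	{
		/* blocksize may be e.g. 2048 on a sub-pagesize filesystem */
		wb_writeout_add(wb, blocksize);
	}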

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/fuse/file.c                   |  4 ++--
 include/linux/backing-dev-defs.h |  4 ++--
 include/linux/backing-dev.h      |  2 +-
 mm/backing-dev.c                 |  9 +++++----
 mm/page-writeback.c              | 26 +++++++++++++++-----------
 5 files changed, 25 insertions(+), 20 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 3988b43..81eee7e 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1467,7 +1467,7 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
 	for (i = 0; i < req->num_pages; i++) {
 		dec_wb_stat(&bdi->wb, WB_WRITEBACK);
 		dec_node_page_state(req->pages[i], NR_WRITEBACK_TEMP);
-		wb_writeout_inc(&bdi->wb);
+		wb_writeout_add(&bdi->wb, PAGE_SIZE);
 	}
 	wake_up(&fi->page_waitq);
 }
@@ -1771,7 +1771,7 @@ static bool fuse_writepage_in_flight(struct fuse_req *new_req,
 
 		dec_wb_stat(&bdi->wb, WB_WRITEBACK);
 		dec_node_page_state(page, NR_WRITEBACK_TEMP);
-		wb_writeout_inc(&bdi->wb);
+		wb_writeout_add(&bdi->wb, PAGE_SIZE);
 		fuse_writepage_free(fc, new_req);
 		fuse_request_free(new_req);
 		goto out;
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index c357f27..71ea5a6 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -34,8 +34,8 @@ typedef int (congested_fn)(void *, int);
 enum wb_stat_item {
 	WB_RECLAIMABLE,
 	WB_WRITEBACK,
-	WB_DIRTIED,
-	WB_WRITTEN,
+	WB_DIRTIED_BYTES,
+	WB_WRITTEN_BYTES,
 	NR_WB_STAT_ITEMS
 };
 
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 9eb2cf2..edddcb8 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -114,7 +114,7 @@ static inline s64 wb_stat_sum(struct bdi_writeback *wb, enum wb_stat_item item)
 	return sum;
 }
 
-extern void wb_writeout_inc(struct bdi_writeback *wb);
+extern void wb_writeout_add(struct bdi_writeback *wb, long bytes);
 
 /*
  * maximal error of a stat counter.
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 8fde443..433db42 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -70,14 +70,15 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
 	wb_thresh = wb_calc_thresh(wb, dirty_thresh);
 
 #define K(x) ((x) << (PAGE_SHIFT - 10))
+#define BtoK(x) ((x) >> 10)
 	seq_printf(m,
 		   "BdiWriteback:       %10lu kB\n"
 		   "BdiReclaimable:     %10lu kB\n"
 		   "BdiDirtyThresh:     %10lu kB\n"
 		   "DirtyThresh:        %10lu kB\n"
 		   "BackgroundThresh:   %10lu kB\n"
-		   "BdiDirtied:         %10lu kB\n"
-		   "BdiWritten:         %10lu kB\n"
+		   "BdiDirtiedBytes:    %10lu kB\n"
+		   "BdiWrittenBytes:    %10lu kB\n"
 		   "BdiWriteBandwidth:  %10lu kBps\n"
 		   "b_dirty:            %10lu\n"
 		   "b_io:               %10lu\n"
@@ -90,8 +91,8 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
 		   K(wb_thresh),
 		   K(dirty_thresh),
 		   K(background_thresh),
-		   (unsigned long) K(wb_stat(wb, WB_DIRTIED)),
-		   (unsigned long) K(wb_stat(wb, WB_WRITTEN)),
+		   (unsigned long) BtoK(wb_stat(wb, WB_DIRTIED_BYTES)),
+		   (unsigned long) BtoK(wb_stat(wb, WB_WRITTEN_BYTES)),
 		   (unsigned long) K(wb->write_bandwidth),
 		   nr_dirty,
 		   nr_io,
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 121a6e3..e09b3ad 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -596,11 +596,11 @@ static void wb_domain_writeout_inc(struct wb_domain *dom,
  * Increment @wb's writeout completion count and the global writeout
  * completion count. Called from test_clear_page_writeback().
  */
-static inline void __wb_writeout_inc(struct bdi_writeback *wb)
+static inline void __wb_writeout_inc(struct bdi_writeback *wb, long bytes)
 {
 	struct wb_domain *cgdom;
 
-	__inc_wb_stat(wb, WB_WRITTEN);
+	__add_wb_stat(wb, WB_WRITTEN_BYTES, bytes);
 	wb_domain_writeout_inc(&global_wb_domain, &wb->completions,
 			       wb->bdi->max_prop_frac);
 
@@ -610,15 +610,15 @@ static inline void __wb_writeout_inc(struct bdi_writeback *wb)
 				       wb->bdi->max_prop_frac);
 }
 
-void wb_writeout_inc(struct bdi_writeback *wb)
+void wb_writeout_add(struct bdi_writeback *wb, long bytes)
 {
 	unsigned long flags;
 
 	local_irq_save(flags);
-	__wb_writeout_inc(wb);
+	__wb_writeout_inc(wb, bytes);
 	local_irq_restore(flags);
 }
-EXPORT_SYMBOL_GPL(wb_writeout_inc);
+EXPORT_SYMBOL_GPL(wb_writeout_add);
 
 /*
  * On idle system, we can be called long after we scheduled because we use
@@ -1362,8 +1362,8 @@ static void __wb_update_bandwidth(struct dirty_throttle_control *gdtc,
 	if (elapsed < BANDWIDTH_INTERVAL)
 		return;
 
-	dirtied = percpu_counter_read(&wb->stat[WB_DIRTIED]);
-	written = percpu_counter_read(&wb->stat[WB_WRITTEN]);
+	dirtied = percpu_counter_read(&wb->stat[WB_DIRTIED_BYTES]) >> PAGE_SHIFT;
+	written = percpu_counter_read(&wb->stat[WB_WRITTEN_BYTES]) >> PAGE_SHIFT;
 
 	/*
 	 * Skip quiet periods when disk bandwidth is under-utilized.
@@ -2464,7 +2464,7 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)
 		__inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
 		__inc_node_page_state(page, NR_DIRTIED);
 		__inc_wb_stat(wb, WB_RECLAIMABLE);
-		__inc_wb_stat(wb, WB_DIRTIED);
+		__add_wb_stat(wb, WB_DIRTIED_BYTES, PAGE_SIZE);
 		task_io_account_write(PAGE_SIZE);
 		current->nr_dirtied++;
 		this_cpu_inc(bdp_ratelimits);
@@ -2547,12 +2547,16 @@ void account_page_redirty(struct page *page)
 	if (mapping && mapping_cap_account_dirty(mapping)) {
 		struct inode *inode = mapping->host;
 		struct bdi_writeback *wb;
+		unsigned long flags;
 		bool locked;
 
 		wb = unlocked_inode_to_wb_begin(inode, &locked);
 		current->nr_dirtied--;
-		dec_node_page_state(page, NR_DIRTIED);
-		dec_wb_stat(wb, WB_DIRTIED);
+
+		local_irq_save(flags);
+		__dec_node_page_state(page, NR_DIRTIED);
+		__add_wb_stat(wb, WB_DIRTIED_BYTES, -(long)PAGE_SIZE);
+		local_irq_restore(flags);
 		unlocked_inode_to_wb_end(inode, locked);
 	}
 }
@@ -2772,7 +2776,7 @@ int test_clear_page_writeback(struct page *page)
 				struct bdi_writeback *wb = inode_to_wb(inode);
 
 				__dec_wb_stat(wb, WB_WRITEBACK);
-				__wb_writeout_inc(wb);
+				__wb_writeout_inc(wb, PAGE_SIZE);
 			}
 		}
 
-- 
2.7.4



* [PATCH 3/5] writeback: add counters for metadata usage
  2016-10-25 18:41 [PATCH 0/5][RESEND] Support for metadata specific accounting Josef Bacik
  2016-10-25 18:41 ` [PATCH 1/5] remove mapping from balance_dirty_pages*() Josef Bacik
  2016-10-25 18:41 ` [PATCH 2/5] writeback: convert WB_WRITTEN/WB_DIRTIED counters to bytes Josef Bacik
@ 2016-10-25 18:41 ` Josef Bacik
  2016-10-25 19:50   ` Tejun Heo
  2016-10-30 15:36   ` Jan Kara
  2016-10-25 18:41 ` [PATCH 4/5] writeback: introduce super_operations->write_metadata Josef Bacik
  2016-10-25 18:41 ` [PATCH 5/5] fs: don't set *REFERENCED unless we are on the lru list Josef Bacik
  4 siblings, 2 replies; 26+ messages in thread
From: Josef Bacik @ 2016-10-25 18:41 UTC (permalink / raw)
  To: linux-btrfs, kernel-team, david, jack, linux-fsdevel, viro, hch, jweiner

Btrfs has no bounds except memory on the amount of dirty memory that we have in
use for metadata.  Historically we have used a special inode so we could take
advantage of the balance_dirty_pages throttling that comes with using pagecache.
However, as we'd like to support different blocksizes, it would be nice to not
have to rely on pagecache but still get the balance_dirty_pages throttling
without having to do it ourselves.

So introduce *METADATA_DIRTY_BYTES and *METADATA_WRITEBACK_BYTES.  These are
node and bdi_writeback counters to keep track of how many bytes we have in
flight for METADATA.  We need to count in bytes as blocksizes could be
fractions of pagesize.  We simply convert the bytes to a number of pages where
that is needed for the throttling.

Also introduce NR_METADATA_BYTES so we can keep track of the total number of
bytes used for metadata on the system.  This is also needed so things like
dirty throttling know that this is dirtyable and easily reclaimable memory as
well.
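
For context, a filesystem dirtying a metadata block that is not backed by an
address_space would be expected to drive these counters roughly as below; the
callers are hypothetical and not part of this series:

	/*
	 * Hypothetical usage for a metadata block of 'len' bytes backed by
	 * 'page' on 'bdi' (declarations come from linux/mm.h in this patch).
	 */
	static void meta_block_mark_dirty(struct page *page,
					  struct backing_dev_info *bdi, long len)
	{
		account_metadata_dirtied(page, bdi, len);
	}

	static void meta_block_start_writeback(struct page *page,
					       struct backing_dev_info *bdi,
					       long len)
	{
		/* moves 'len' bytes from the dirty to the writeback counters */
		account_metadata_writeback(page, bdi, len);
	}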

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 arch/tile/mm/pgtable.c           |   5 +-
 drivers/base/node.c              |   8 ++
 fs/fs-writeback.c                |   2 +
 fs/proc/meminfo.c                |   7 ++
 include/linux/backing-dev-defs.h |   2 +
 include/linux/mm.h               |   9 +++
 include/linux/mmzone.h           |   3 +
 include/trace/events/writeback.h |  13 +++-
 mm/backing-dev.c                 |   4 +
 mm/page-writeback.c              | 157 +++++++++++++++++++++++++++++++++++----
 mm/page_alloc.c                  |  20 ++++-
 mm/util.c                        |   2 +
 mm/vmscan.c                      |  19 ++++-
 mm/vmstat.c                      |   3 +
 14 files changed, 229 insertions(+), 25 deletions(-)

diff --git a/arch/tile/mm/pgtable.c b/arch/tile/mm/pgtable.c
index 7cc6ee7..a159459 100644
--- a/arch/tile/mm/pgtable.c
+++ b/arch/tile/mm/pgtable.c
@@ -44,13 +44,16 @@ void show_mem(unsigned int filter)
 {
 	struct zone *zone;
 
-	pr_err("Active:%lu inactive:%lu dirty:%lu writeback:%lu unstable:%lu free:%lu\n slab:%lu mapped:%lu pagetables:%lu bounce:%lu pagecache:%lu swap:%lu\n",
+	pr_err("Active:%lu inactive:%lu dirty:%lu metadata:%lu metadata_dirty:%lu writeback:%lu metadata_writeback:%lu unstable:%lu free:%lu\n slab:%lu mapped:%lu pagetables:%lu bounce:%lu pagecache:%lu swap:%lu\n",
 	       (global_node_page_state(NR_ACTIVE_ANON) +
 		global_node_page_state(NR_ACTIVE_FILE)),
 	       (global_node_page_state(NR_INACTIVE_ANON) +
 		global_node_page_state(NR_INACTIVE_FILE)),
 	       global_node_page_state(NR_FILE_DIRTY),
+	       global_node_page_state(NR_METADATA_BYTES) >> PAGE_SHIFT,
+	       global_node_page_state(NR_METADATA_DIRTY_BYTES) >> PAGE_SHIFT,
 	       global_node_page_state(NR_WRITEBACK),
+	       global_node_page_state(NR_METADATA_WRITEBACK_BYTES) >> PAGE_SHIFT,
 	       global_node_page_state(NR_UNSTABLE_NFS),
 	       global_page_state(NR_FREE_PAGES),
 	       (global_page_state(NR_SLAB_RECLAIMABLE) +
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 5548f96..9d6c239 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -51,6 +51,8 @@ static DEVICE_ATTR(cpumap,  S_IRUGO, node_read_cpumask, NULL);
 static DEVICE_ATTR(cpulist, S_IRUGO, node_read_cpulist, NULL);
 
 #define K(x) ((x) << (PAGE_SHIFT - 10))
+#define BtoK(x) ((x) >> 10)
+
 static ssize_t node_read_meminfo(struct device *dev,
 			struct device_attribute *attr, char *buf)
 {
@@ -99,7 +101,10 @@ static ssize_t node_read_meminfo(struct device *dev,
 #endif
 	n += sprintf(buf + n,
 		       "Node %d Dirty:          %8lu kB\n"
+		       "Node %d MetadataDirty:	%8lu kB\n"
 		       "Node %d Writeback:      %8lu kB\n"
+		       "Node %d MetaWriteback:  %8lu kB\n"
+		       "Node %d Metadata:       %8lu kB\n"
 		       "Node %d FilePages:      %8lu kB\n"
 		       "Node %d Mapped:         %8lu kB\n"
 		       "Node %d AnonPages:      %8lu kB\n"
@@ -119,8 +124,11 @@ static ssize_t node_read_meminfo(struct device *dev,
 #endif
 			,
 		       nid, K(node_page_state(pgdat, NR_FILE_DIRTY)),
+		       nid, BtoK(node_page_state(pgdat, NR_METADATA_DIRTY_BYTES)),
 		       nid, K(node_page_state(pgdat, NR_WRITEBACK)),
+		       nid, BtoK(node_page_state(pgdat, NR_METADATA_WRITEBACK_BYTES)),
 		       nid, K(node_page_state(pgdat, NR_FILE_PAGES)),
+		       nid, BtoK(node_page_state(pgdat, NR_METADATA_BYTES)),
 		       nid, K(node_page_state(pgdat, NR_FILE_MAPPED)),
 		       nid, K(node_page_state(pgdat, NR_ANON_MAPPED)),
 		       nid, K(i.sharedram),
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 05713a5..a5cb1dd 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1802,6 +1802,7 @@ static struct wb_writeback_work *get_next_work_item(struct bdi_writeback *wb)
 	return work;
 }
 
+#define BtoP(x) ((x) >> PAGE_SHIFT)
 /*
  * Add in the number of potentially dirty inodes, because each inode
  * write can dirty pagecache in the underlying blockdev.
@@ -1810,6 +1811,7 @@ static unsigned long get_nr_dirty_pages(void)
 {
 	return global_node_page_state(NR_FILE_DIRTY) +
 		global_node_page_state(NR_UNSTABLE_NFS) +
+		BtoP(global_node_page_state(NR_METADATA_DIRTY_BYTES)) +
 		get_nr_dirty_inodes();
 }
 
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index b9a8c81..72da154 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -36,6 +36,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
  * display in kilobytes.
  */
 #define K(x) ((x) << (PAGE_SHIFT - 10))
+#define BtoK(x) ((x) >> 10)
 	si_meminfo(&i);
 	si_swapinfo(&i);
 	committed = percpu_counter_read_positive(&vm_committed_as);
@@ -60,6 +61,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 		"Buffers:        %8lu kB\n"
 		"Cached:         %8lu kB\n"
 		"SwapCached:     %8lu kB\n"
+		"Metadata:       %8lu kB\n"
 		"Active:         %8lu kB\n"
 		"Inactive:       %8lu kB\n"
 		"Active(anon):   %8lu kB\n"
@@ -80,7 +82,9 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 		"SwapTotal:      %8lu kB\n"
 		"SwapFree:       %8lu kB\n"
 		"Dirty:          %8lu kB\n"
+		"MetadataDirty:  %8lu kB\n"
 		"Writeback:      %8lu kB\n"
+		"MetaWriteback:  %8lu kB\n"
 		"AnonPages:      %8lu kB\n"
 		"Mapped:         %8lu kB\n"
 		"Shmem:          %8lu kB\n"
@@ -119,6 +123,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 		K(i.bufferram),
 		K(cached),
 		K(total_swapcache_pages()),
+		BtoK(global_node_page_state(NR_METADATA_BYTES)),
 		K(pages[LRU_ACTIVE_ANON]   + pages[LRU_ACTIVE_FILE]),
 		K(pages[LRU_INACTIVE_ANON] + pages[LRU_INACTIVE_FILE]),
 		K(pages[LRU_ACTIVE_ANON]),
@@ -139,7 +144,9 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 		K(i.totalswap),
 		K(i.freeswap),
 		K(global_node_page_state(NR_FILE_DIRTY)),
+		BtoK(global_node_page_state(NR_METADATA_DIRTY_BYTES)),
 		K(global_node_page_state(NR_WRITEBACK)),
+		BtoK(global_node_page_state(NR_METADATA_WRITEBACK_BYTES)),
 		K(global_node_page_state(NR_ANON_MAPPED)),
 		K(global_node_page_state(NR_FILE_MAPPED)),
 		K(i.sharedram),
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 71ea5a6..b1f8f70 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -36,6 +36,8 @@ enum wb_stat_item {
 	WB_WRITEBACK,
 	WB_DIRTIED_BYTES,
 	WB_WRITTEN_BYTES,
+	WB_METADATA_DIRTY_BYTES,
+	WB_METADATA_WRITEBACK_BYTES,
 	NR_WB_STAT_ITEMS
 };
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ef815b9..8b425ae 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -31,6 +31,7 @@ struct file_ra_state;
 struct user_struct;
 struct writeback_control;
 struct bdi_writeback;
+struct backing_dev_info;
 
 #ifndef CONFIG_NEED_MULTIPLE_NODES	/* Don't use mapnrs, do it properly */
 extern unsigned long max_mapnr;
@@ -1363,6 +1364,14 @@ int redirty_page_for_writepage(struct writeback_control *wbc,
 void account_page_dirtied(struct page *page, struct address_space *mapping);
 void account_page_cleaned(struct page *page, struct address_space *mapping,
 			  struct bdi_writeback *wb);
+void account_metadata_dirtied(struct page *page, struct backing_dev_info *bdi,
+			      long bytes);
+void account_metadata_cleaned(struct page *page, struct backing_dev_info *bdi,
+			      long bytes);
+void account_metadata_writeback(struct page *page,
+				struct backing_dev_info *bdi, long bytes);
+void account_metadata_end_writeback(struct page *page,
+				    struct backing_dev_info *bdi, long bytes);
 int set_page_dirty(struct page *page);
 int set_page_dirty_lock(struct page *page);
 void cancel_dirty_page(struct page *page);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 7f2ae99..89db46c 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -169,6 +169,9 @@ enum node_stat_item {
 	NR_VMSCAN_IMMEDIATE,	/* Prioritise for reclaim when writeback ends */
 	NR_DIRTIED,		/* page dirtyings since bootup */
 	NR_WRITTEN,		/* page writings since bootup */
+	NR_METADATA_DIRTY_BYTES,	/* Metadata dirty bytes */
+	NR_METADATA_WRITEBACK_BYTES,	/* Metadata writeback bytes */
+	NR_METADATA_BYTES,	/* total metadata bytes in use. */
 	NR_VM_NODE_STAT_ITEMS
 };
 
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 2ccd9cc..f97c8de 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -390,6 +390,8 @@ TRACE_EVENT(writeback_queue_io,
 	)
 );
 
+#define BtoP(x) ((x) >> PAGE_SHIFT)
+
 TRACE_EVENT(global_dirty_state,
 
 	TP_PROTO(unsigned long background_thresh,
@@ -402,7 +404,9 @@ TRACE_EVENT(global_dirty_state,
 
 	TP_STRUCT__entry(
 		__field(unsigned long,	nr_dirty)
+		__field(unsigned long,	nr_metadata_dirty)
 		__field(unsigned long,	nr_writeback)
+		__field(unsigned long,	nr_metadata_writeback)
 		__field(unsigned long,	nr_unstable)
 		__field(unsigned long,	background_thresh)
 		__field(unsigned long,	dirty_thresh)
@@ -413,7 +417,9 @@ TRACE_EVENT(global_dirty_state,
 
 	TP_fast_assign(
 		__entry->nr_dirty	= global_node_page_state(NR_FILE_DIRTY);
+		__entry->nr_metadata_dirty = BtoP(global_node_page_state(NR_METADATA_DIRTY_BYTES));
 		__entry->nr_writeback	= global_node_page_state(NR_WRITEBACK);
+		__entry->nr_metadata_writeback = BtoP(global_node_page_state(NR_METADATA_WRITEBACK_BYTES));
 		__entry->nr_unstable	= global_node_page_state(NR_UNSTABLE_NFS);
 		__entry->nr_dirtied	= global_node_page_state(NR_DIRTIED);
 		__entry->nr_written	= global_node_page_state(NR_WRITTEN);
@@ -424,7 +430,8 @@ TRACE_EVENT(global_dirty_state,
 
 	TP_printk("dirty=%lu writeback=%lu unstable=%lu "
 		  "bg_thresh=%lu thresh=%lu limit=%lu "
-		  "dirtied=%lu written=%lu",
+		  "dirtied=%lu written=%lu metadata_dirty=%lu "
+		  "metadata_writeback=%lu",
 		  __entry->nr_dirty,
 		  __entry->nr_writeback,
 		  __entry->nr_unstable,
@@ -432,7 +439,9 @@ TRACE_EVENT(global_dirty_state,
 		  __entry->dirty_thresh,
 		  __entry->dirty_limit,
 		  __entry->nr_dirtied,
-		  __entry->nr_written
+		  __entry->nr_written,
+		  __entry->nr_metadata_dirty,
+		  __entry->nr_metadata_writeback
 	)
 );
 
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 433db42..da3f68b 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -79,6 +79,8 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
 		   "BackgroundThresh:   %10lu kB\n"
 		   "BdiDirtiedBytes:    %10lu kB\n"
 		   "BdiWrittenBytes:    %10lu kB\n"
+		   "BdiMetadataDirty:   %10lu kB\n"
+		   "BdiMetaWriteback:	%10lu kB\n"
 		   "BdiWriteBandwidth:  %10lu kBps\n"
 		   "b_dirty:            %10lu\n"
 		   "b_io:               %10lu\n"
@@ -93,6 +95,8 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
 		   K(background_thresh),
 		   (unsigned long) BtoK(wb_stat(wb, WB_DIRTIED_BYTES)),
 		   (unsigned long) BtoK(wb_stat(wb, WB_WRITTEN_BYTES)),
+		   (unsigned long) BtoK(wb_stat(wb, WB_METADATA_DIRTY_BYTES)),
+		   (unsigned long) BtoK(wb_stat(wb, WB_METADATA_WRITEBACK_BYTES)),
 		   (unsigned long) K(wb->write_bandwidth),
 		   nr_dirty,
 		   nr_io,
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index e09b3ad..48faf1b 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -296,6 +296,7 @@ static unsigned long node_dirtyable_memory(struct pglist_data *pgdat)
 
 	nr_pages += node_page_state(pgdat, NR_INACTIVE_FILE);
 	nr_pages += node_page_state(pgdat, NR_ACTIVE_FILE);
+	nr_pages += node_page_state(pgdat, NR_METADATA_BYTES) >> PAGE_SHIFT;
 
 	return nr_pages;
 }
@@ -372,6 +373,7 @@ static unsigned long global_dirtyable_memory(void)
 
 	x += global_node_page_state(NR_INACTIVE_FILE);
 	x += global_node_page_state(NR_ACTIVE_FILE);
+	x += global_node_page_state(NR_METADATA_BYTES) >> PAGE_SHIFT;
 
 	if (!vm_highmem_is_dirtyable)
 		x -= highmem_dirtyable_memory(x);
@@ -380,6 +382,30 @@ static unsigned long global_dirtyable_memory(void)
 }
 
 /**
+ * global_dirty_memory - the number of globally dirty pages
+ *
+ * Returns the global number of pages that are dirty in pagecache and metadata.
+ */
+static unsigned long global_dirty_memory(void)
+{
+	return global_node_page_state(NR_FILE_DIRTY) +
+		global_node_page_state(NR_UNSTABLE_NFS) +
+		(global_node_page_state(NR_METADATA_DIRTY_BYTES) >> PAGE_SHIFT);
+}
+
+/**
+ * global_writeback_memory - the number of pages under writeback globally
+ *
+ * Returns the global number of pages under writeback both in pagecache and in
+ * metadata.
+ */
+static unsigned long global_writeback_memory(void)
+{
+	return global_node_page_state(NR_WRITEBACK) +
+		(global_node_page_state(NR_METADATA_WRITEBACK_BYTES) >> PAGE_SHIFT);
+}
+
+/**
  * domain_dirty_limits - calculate thresh and bg_thresh for a wb_domain
  * @dtc: dirty_throttle_control of interest
  *
@@ -1514,7 +1540,7 @@ static long wb_min_pause(struct bdi_writeback *wb,
 static inline void wb_dirty_limits(struct dirty_throttle_control *dtc)
 {
 	struct bdi_writeback *wb = dtc->wb;
-	unsigned long wb_reclaimable;
+	unsigned long wb_reclaimable, wb_writeback;
 
 	/*
 	 * wb_thresh is not treated as some limiting factor as
@@ -1544,12 +1570,17 @@ static inline void wb_dirty_limits(struct dirty_throttle_control *dtc)
 	 * deltas.
 	 */
 	if (dtc->wb_thresh < 2 * wb_stat_error(wb)) {
-		wb_reclaimable = wb_stat_sum(wb, WB_RECLAIMABLE);
-		dtc->wb_dirty = wb_reclaimable + wb_stat_sum(wb, WB_WRITEBACK);
+		wb_reclaimable = wb_stat_sum(wb, WB_RECLAIMABLE) +
+			(wb_stat_sum(wb, WB_METADATA_DIRTY_BYTES) >> PAGE_SHIFT);
+		wb_writeback = wb_stat_sum(wb, WB_WRITEBACK) +
+			(wb_stat_sum(wb, WB_METADATA_WRITEBACK_BYTES) >> PAGE_SHIFT);
 	} else {
-		wb_reclaimable = wb_stat(wb, WB_RECLAIMABLE);
-		dtc->wb_dirty = wb_reclaimable + wb_stat(wb, WB_WRITEBACK);
+		wb_reclaimable = wb_stat(wb, WB_RECLAIMABLE) +
+			(wb_stat(wb, WB_METADATA_DIRTY_BYTES) >> PAGE_SHIFT);
+		wb_writeback = wb_stat(wb, WB_WRITEBACK) +
+			(wb_stat(wb, WB_METADATA_WRITEBACK_BYTES) >> PAGE_SHIFT);
 	}
+	dtc->wb_dirty = wb_reclaimable + wb_writeback;
 }
 
 /*
@@ -1594,10 +1625,9 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
 		 * written to the server's write cache, but has not yet
 		 * been flushed to permanent storage.
 		 */
-		nr_reclaimable = global_node_page_state(NR_FILE_DIRTY) +
-					global_node_page_state(NR_UNSTABLE_NFS);
+		nr_reclaimable = global_dirty_memory();
 		gdtc->avail = global_dirtyable_memory();
-		gdtc->dirty = nr_reclaimable + global_node_page_state(NR_WRITEBACK);
+		gdtc->dirty = nr_reclaimable + global_writeback_memory();
 
 		domain_dirty_limits(gdtc);
 
@@ -1928,20 +1958,22 @@ bool wb_over_bg_thresh(struct bdi_writeback *wb)
 	struct dirty_throttle_control * const gdtc = &gdtc_stor;
 	struct dirty_throttle_control * const mdtc = mdtc_valid(&mdtc_stor) ?
 						     &mdtc_stor : NULL;
+	unsigned long wb_reclaimable;
 
 	/*
 	 * Similar to balance_dirty_pages() but ignores pages being written
 	 * as we're trying to decide whether to put more under writeback.
 	 */
 	gdtc->avail = global_dirtyable_memory();
-	gdtc->dirty = global_node_page_state(NR_FILE_DIRTY) +
-		      global_node_page_state(NR_UNSTABLE_NFS);
+	gdtc->dirty = global_dirty_memory();
 	domain_dirty_limits(gdtc);
 
 	if (gdtc->dirty > gdtc->bg_thresh)
 		return true;
 
-	if (wb_stat(wb, WB_RECLAIMABLE) >
+	wb_reclaimable = wb_stat(wb, WB_RECLAIMABLE) +
+		(wb_stat(wb, WB_METADATA_DIRTY_BYTES) >> PAGE_SHIFT);
+	if (wb_reclaimable >
 	    wb_calc_thresh(gdtc->wb, gdtc->bg_thresh))
 		return true;
 
@@ -1956,7 +1988,7 @@ bool wb_over_bg_thresh(struct bdi_writeback *wb)
 		if (mdtc->dirty > mdtc->bg_thresh)
 			return true;
 
-		if (wb_stat(wb, WB_RECLAIMABLE) >
+		if (wb_reclaimable >
 		    wb_calc_thresh(mdtc->wb, mdtc->bg_thresh))
 			return true;
 	}
@@ -1980,8 +2012,8 @@ void throttle_vm_writeout(gfp_t gfp_mask)
                 dirty_thresh += dirty_thresh / 10;      /* wheeee... */
 
                 if (global_node_page_state(NR_UNSTABLE_NFS) +
-			global_node_page_state(NR_WRITEBACK) <= dirty_thresh)
-                        	break;
+		    global_writeback_memory() <= dirty_thresh)
+			break;
                 congestion_wait(BLK_RW_ASYNC, HZ/10);
 
 		/*
@@ -2008,8 +2040,7 @@ int dirty_writeback_centisecs_handler(struct ctl_table *table, int write,
 void laptop_mode_timer_fn(unsigned long data)
 {
 	struct request_queue *q = (struct request_queue *)data;
-	int nr_pages = global_node_page_state(NR_FILE_DIRTY) +
-		global_node_page_state(NR_UNSTABLE_NFS);
+	int nr_pages = global_dirty_memory();
 	struct bdi_writeback *wb;
 
 	/*
@@ -2473,6 +2504,100 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)
 EXPORT_SYMBOL(account_page_dirtied);
 
 /*
+ * account_metadata_dirtied
+ * @page - the page being dirtied
+ * @bdi - the bdi that owns this page
+ * @bytes - the number of bytes being dirtied
+ *
+ * Do the dirty page accounting for metadata pages that aren't backed by an
+ * address_space.
+ */
+void account_metadata_dirtied(struct page *page, struct backing_dev_info *bdi,
+			      long bytes)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__mod_node_page_state(page_pgdat(page), NR_METADATA_DIRTY_BYTES,
+			      bytes);
+	__add_wb_stat(&bdi->wb, WB_DIRTIED_BYTES, bytes);
+	__add_wb_stat(&bdi->wb, WB_METADATA_DIRTY_BYTES, bytes);
+	current->nr_dirtied++;
+	task_io_account_write(bytes);
+	this_cpu_inc(bdp_ratelimits);
+	local_irq_restore(flags);
+}
+EXPORT_SYMBOL(account_metadata_dirtied);
+
+/*
+ * account_metadata_cleaned
+ * @page - the page being cleaned
+ * @bdi - the bdi that owns this page
+ * @bytes - the number of bytes cleaned
+ *
+ * Called on a no longer dirty metadata page.
+ */
+void account_metadata_cleaned(struct page *page, struct backing_dev_info *bdi,
+			      long bytes)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__mod_node_page_state(page_pgdat(page), NR_METADATA_DIRTY_BYTES,
+			      -bytes);
+	__add_wb_stat(&bdi->wb, WB_METADATA_DIRTY_BYTES, -bytes);
+	task_io_account_cancelled_write(bytes);
+	local_irq_restore(flags);
+}
+EXPORT_SYMBOL(account_metadata_cleaned);
+
+/*
+ * account_metadata_writeback
+ * @page - the page being marked as writeback
+ * @bdi - the bdi that owns this page
+ * @bytes - the number of bytes we are submitting for writeback
+ *
+ * Called on a metadata page that has been marked writeback.
+ */
+void account_metadata_writeback(struct page *page,
+				struct backing_dev_info *bdi, long bytes)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__add_wb_stat(&bdi->wb, WB_METADATA_DIRTY_BYTES, -bytes);
+	__mod_node_page_state(page_pgdat(page), NR_METADATA_DIRTY_BYTES,
+					 -bytes);
+	__add_wb_stat(&bdi->wb, WB_METADATA_WRITEBACK_BYTES, bytes);
+	__mod_node_page_state(page_pgdat(page), NR_METADATA_WRITEBACK_BYTES,
+					 bytes);
+	local_irq_restore(flags);
+}
+EXPORT_SYMBOL(account_metadata_writeback);
+
+/*
+ * account_metadata_end_writeback
+ * @page - the page we are ending writeback on
+ * @bdi - the bdi that owns this page
+ * @bytes - the number of bytes that just ended writeback
+ *
+ * Called on a metadata page that has completed writeback.
+ */
+void account_metadata_end_writeback(struct page *page,
+				    struct backing_dev_info *bdi, long bytes)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__add_wb_stat(&bdi->wb, WB_METADATA_WRITEBACK_BYTES, -bytes);
+	__mod_node_page_state(page_pgdat(page), NR_METADATA_WRITEBACK_BYTES,
+					 -bytes);
+	__wb_writeout_inc(&bdi->wb, bytes);
+	local_irq_restore(flags);
+}
+EXPORT_SYMBOL(account_metadata_end_writeback);
+
+/*
  * Helper function for deaccounting dirty page without writeback.
  *
  * Caller must hold lock_page_memcg().
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a2214c6..b8c9d93 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4113,6 +4113,8 @@ out:
 }
 
 #define K(x) ((x) << (PAGE_SHIFT-10))
+#define BtoK(x) ((x) >> 10)
+#define BtoP(x) ((x) >> PAGE_SHIFT)
 
 static void show_migration_types(unsigned char type)
 {
@@ -4167,10 +4169,11 @@ void show_free_areas(unsigned int filter)
 
 	printk("active_anon:%lu inactive_anon:%lu isolated_anon:%lu\n"
 		" active_file:%lu inactive_file:%lu isolated_file:%lu\n"
-		" unevictable:%lu dirty:%lu writeback:%lu unstable:%lu\n"
-		" slab_reclaimable:%lu slab_unreclaimable:%lu\n"
-		" mapped:%lu shmem:%lu pagetables:%lu bounce:%lu\n"
-		" free:%lu free_pcp:%lu free_cma:%lu\n",
+		" unevictable:%lu metadata:%lu dirty:%lu metadata_dirty:%lu\n"
+		" writeback:%lu unstable:%lu metadata_writeback:%lu\n"
+		" slab_reclaimable:%lu slab_unreclaimable:%lu mapped:%lu\n"
+		" shmem:%lu pagetables:%lu bounce:%lu free:%lu free_pcp:%lu\n"
+	        " free_cma:%lu\n",
 		global_node_page_state(NR_ACTIVE_ANON),
 		global_node_page_state(NR_INACTIVE_ANON),
 		global_node_page_state(NR_ISOLATED_ANON),
@@ -4178,9 +4181,12 @@ void show_free_areas(unsigned int filter)
 		global_node_page_state(NR_INACTIVE_FILE),
 		global_node_page_state(NR_ISOLATED_FILE),
 		global_node_page_state(NR_UNEVICTABLE),
+		BtoP(global_node_page_state(NR_METADATA_BYTES)),
 		global_node_page_state(NR_FILE_DIRTY),
+		BtoP(global_node_page_state(NR_METADATA_DIRTY_BYTES)),
 		global_node_page_state(NR_WRITEBACK),
 		global_node_page_state(NR_UNSTABLE_NFS),
+		BtoP(global_node_page_state(NR_METADATA_WRITEBACK_BYTES)),
 		global_page_state(NR_SLAB_RECLAIMABLE),
 		global_page_state(NR_SLAB_UNRECLAIMABLE),
 		global_node_page_state(NR_FILE_MAPPED),
@@ -4200,9 +4206,12 @@ void show_free_areas(unsigned int filter)
 			" unevictable:%lukB"
 			" isolated(anon):%lukB"
 			" isolated(file):%lukB"
+			" metadata:%lukB"
 			" mapped:%lukB"
 			" dirty:%lukB"
+			" metadata_dirty:%lukB"
 			" writeback:%lukB"
+			" metadata_writeback:%lukB"
 			" shmem:%lukB"
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 			" shmem_thp: %lukB"
@@ -4222,9 +4231,12 @@ void show_free_areas(unsigned int filter)
 			K(node_page_state(pgdat, NR_UNEVICTABLE)),
 			K(node_page_state(pgdat, NR_ISOLATED_ANON)),
 			K(node_page_state(pgdat, NR_ISOLATED_FILE)),
+			BtoK(node_page_state(pgdat, NR_METADATA_BYTES)),
 			K(node_page_state(pgdat, NR_FILE_MAPPED)),
 			K(node_page_state(pgdat, NR_FILE_DIRTY)),
+			BtoK(node_page_state(pgdat, NR_METADATA_DIRTY_BYTES)),
 			K(node_page_state(pgdat, NR_WRITEBACK)),
+			BtoK(node_page_state(pgdat, NR_METADATA_WRITEBACK_BYTES)),
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 			K(node_page_state(pgdat, NR_SHMEM_THPS) * HPAGE_PMD_NR),
 			K(node_page_state(pgdat, NR_SHMEM_PMDMAPPED)
diff --git a/mm/util.c b/mm/util.c
index 662cddf..e8b8b7f 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -548,6 +548,8 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
 		 */
 		free += global_page_state(NR_SLAB_RECLAIMABLE);
 
+		free += global_page_state(NR_METADATA_BYTES) >> PAGE_SHIFT;
+
 		/*
 		 * Leave reserved pages. The pages are not for anonymous pages.
 		 */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0fe8b71..322d6d4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -218,7 +218,8 @@ unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat)
 
 	nr = node_page_state_snapshot(pgdat, NR_ACTIVE_FILE) +
 	     node_page_state_snapshot(pgdat, NR_INACTIVE_FILE) +
-	     node_page_state_snapshot(pgdat, NR_ISOLATED_FILE);
+	     node_page_state_snapshot(pgdat, NR_ISOLATED_FILE) +
+	     (node_page_state_snapshot(pgdat, NR_METADATA_BYTES) >> PAGE_SHIFT);
 
 	if (get_nr_swap_pages() > 0)
 		nr += node_page_state_snapshot(pgdat, NR_ACTIVE_ANON) +
@@ -3680,6 +3681,7 @@ static inline unsigned long node_unmapped_file_pages(struct pglist_data *pgdat)
 static unsigned long node_pagecache_reclaimable(struct pglist_data *pgdat)
 {
 	unsigned long nr_pagecache_reclaimable;
+	unsigned long nr_metadata_reclaimable;
 	unsigned long delta = 0;
 
 	/*
@@ -3701,7 +3703,20 @@ static unsigned long node_pagecache_reclaimable(struct pglist_data *pgdat)
 	if (unlikely(delta > nr_pagecache_reclaimable))
 		delta = nr_pagecache_reclaimable;
 
-	return nr_pagecache_reclaimable - delta;
+	nr_metadata_reclaimable =
+		node_page_state(pgdat, NR_METADATA_BYTES) >> PAGE_SHIFT;
+	/*
+	 * We don't do writeout through the shrinkers so subtract any
+	 * dirty/writeback metadata bytes from the reclaimable count.
+	 */
+	if (nr_metadata_reclaimable) {
+		unsigned long unreclaimable =
+			node_page_state(pgdat, NR_METADATA_DIRTY_BYTES) +
+			node_page_state(pgdat, NR_METADATA_WRITEBACK_BYTES);
+		unreclaimable >>= PAGE_SHIFT;
+		nr_metadata_reclaimable -= unreclaimable;
+	}
+	return nr_metadata_reclaimable + nr_pagecache_reclaimable - delta;
 }
 
 /*
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 89cec42..b762e39 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -973,6 +973,9 @@ const char * const vmstat_text[] = {
 	"nr_vmscan_immediate_reclaim",
 	"nr_dirtied",
 	"nr_written",
+	"nr_metadata_dirty_bytes",
+	"nr_metadata_writeback_bytes",
+	"nr_metadata_bytes",
 
 	/* enum writeback_stat_item counters */
 	"nr_dirty_threshold",
-- 
2.7.4



* [PATCH 4/5] writeback: introduce super_operations->write_metadata
  2016-10-25 18:41 [PATCH 0/5][RESEND] Support for metadata specific accounting Josef Bacik
                   ` (2 preceding siblings ...)
  2016-10-25 18:41 ` [PATCH 3/5] writeback: add counters for metadata usage Josef Bacik
@ 2016-10-25 18:41 ` Josef Bacik
  2016-10-25 20:00   ` Tejun Heo
  2016-10-25 18:41 ` [PATCH 5/5] fs: don't set *REFERENCED unless we are on the lru list Josef Bacik
  4 siblings, 1 reply; 26+ messages in thread
From: Josef Bacik @ 2016-10-25 18:41 UTC (permalink / raw)
  To: linux-btrfs, kernel-team, david, jack, linux-fsdevel, viro, hch, jweiner

Now that we have metadata counters in the VM, we need to provide a way to kick
writeback on dirty metadata.  Introduce super_operations->write_metadata.  This
allows file systems to write back dirty metadata based on the writeback needs
of the system.  Since there is no inode to key off of, we need a list in the
bdi for dirty super blocks to be added.  From there we can find any dirty sb's
on the bdi we are currently doing writeback on and call into their
->write_metadata callback.
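
As a sketch (not taken from any real filesystem), an implementation of the hook
would walk its dirty metadata and consume wbc->nr_to_write, which
writeback_sb_metadata() below sizes in pages via writeback_chunk_size(); the
foo_*() helpers are hypothetical:

	static void foo_write_metadata(struct super_block *sb,
				       struct writeback_control *wbc)
	{
		while (wbc->nr_to_write > 0 && foo_has_dirty_metadata(sb)) {
			/* hypothetical: returns pages worth of metadata queued */
			long written = foo_writeback_some_metadata(sb,
							wbc->nr_to_write);

			if (written <= 0)
				break;
			wbc->nr_to_write -= written;
		}
	}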

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c                | 72 ++++++++++++++++++++++++++++++++++++----
 fs/super.c                       |  7 ++++
 include/linux/backing-dev-defs.h |  2 ++
 include/linux/fs.h               |  4 +++
 mm/backing-dev.c                 |  2 ++
 5 files changed, 81 insertions(+), 6 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index a5cb1dd..8e392b2 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1465,6 +1465,31 @@ static long writeback_chunk_size(struct bdi_writeback *wb,
 	return pages;
 }
 
+static long writeback_sb_metadata(struct super_block *sb,
+				  struct bdi_writeback *wb,
+				  struct wb_writeback_work *work)
+{
+	struct writeback_control wbc = {
+		.sync_mode		= work->sync_mode,
+		.tagged_writepages	= work->tagged_writepages,
+		.for_kupdate		= work->for_kupdate,
+		.for_background		= work->for_background,
+		.for_sync		= work->for_sync,
+		.range_cyclic		= work->range_cyclic,
+		.range_start		= 0,
+		.range_end		= LLONG_MAX,
+	};
+	long write_chunk;
+
+	write_chunk = writeback_chunk_size(wb, work);
+	wbc.nr_to_write = write_chunk;
+	sb->s_op->write_metadata(sb, &wbc);
+	work->nr_pages -= write_chunk - wbc.nr_to_write;
+
+	return write_chunk - wbc.nr_to_write;
+}
+
+
 /*
  * Write a portion of b_io inodes which belong to @sb.
  *
@@ -1491,6 +1516,7 @@ static long writeback_sb_inodes(struct super_block *sb,
 	unsigned long start_time = jiffies;
 	long write_chunk;
 	long wrote = 0;  /* count both pages and inodes */
+	bool done = false;
 
 	while (!list_empty(&wb->b_io)) {
 		struct inode *inode = wb_inode(wb->b_io.prev);
@@ -1607,12 +1633,18 @@ static long writeback_sb_inodes(struct super_block *sb,
 		 * background threshold and other termination conditions.
 		 */
 		if (wrote) {
-			if (time_is_before_jiffies(start_time + HZ / 10UL))
-				break;
-			if (work->nr_pages <= 0)
+			if (time_is_before_jiffies(start_time + HZ / 10UL) ||
+			    work->nr_pages <= 0) {
+				done = true;
 				break;
+			}
 		}
 	}
+	if (!done && sb->s_op->write_metadata) {
+		spin_unlock(&wb->list_lock);
+		wrote += writeback_sb_metadata(sb, wb, work);
+		spin_lock(&wb->list_lock);
+	}
 	return wrote;
 }
 
@@ -1621,6 +1653,7 @@ static long __writeback_inodes_wb(struct bdi_writeback *wb,
 {
 	unsigned long start_time = jiffies;
 	long wrote = 0;
+	bool done = false;
 
 	while (!list_empty(&wb->b_io)) {
 		struct inode *inode = wb_inode(wb->b_io.prev);
@@ -1640,12 +1673,39 @@ static long __writeback_inodes_wb(struct bdi_writeback *wb,
 
 		/* refer to the same tests at the end of writeback_sb_inodes */
 		if (wrote) {
-			if (time_is_before_jiffies(start_time + HZ / 10UL))
-				break;
-			if (work->nr_pages <= 0)
+			if (time_is_before_jiffies(start_time + HZ / 10UL) ||
+			    work->nr_pages <= 0) {
+				done = true;
 				break;
+			}
 		}
 	}
+
+	if (!done && wb_stat(wb, WB_METADATA_DIRTY_BYTES)) {
+		LIST_HEAD(list);
+
+		spin_unlock(&wb->list_lock);
+		spin_lock(&wb->bdi->sb_list_lock);
+		list_splice_init(&wb->bdi->dirty_sb_list, &list);
+		while (!list_empty(&list)) {
+			struct super_block *sb;
+
+			sb = list_first_entry(&list, struct super_block,
+					      s_bdi_dirty_list);
+			list_move_tail(&sb->s_bdi_dirty_list,
+				       &wb->bdi->dirty_sb_list);
+			if (!sb->s_op->write_metadata)
+				continue;
+			if (!trylock_super(sb))
+				continue;
+			spin_unlock(&wb->bdi->sb_list_lock);
+			wrote += writeback_sb_metadata(sb, wb, work);
+			spin_lock(&wb->bdi->sb_list_lock);
+			up_read(&sb->s_umount);
+		}
+		spin_unlock(&wb->bdi->sb_list_lock);
+		spin_lock(&wb->list_lock);
+	}
 	/* Leave any unwritten inodes on b_io */
 	return wrote;
 }
diff --git a/fs/super.c b/fs/super.c
index c2ff475..eb32913 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -215,6 +215,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
 	spin_lock_init(&s->s_inode_list_lock);
 	INIT_LIST_HEAD(&s->s_inodes_wb);
 	spin_lock_init(&s->s_inode_wblist_lock);
+	INIT_LIST_HEAD(&s->s_bdi_dirty_list);
 
 	if (list_lru_init_memcg(&s->s_dentry_lru))
 		goto fail;
@@ -305,6 +306,8 @@ void deactivate_locked_super(struct super_block *s)
 {
 	struct file_system_type *fs = s->s_type;
 	if (atomic_dec_and_test(&s->s_active)) {
+		struct backing_dev_info *bdi = s->s_bdi;
+
 		cleancache_invalidate_fs(s);
 		unregister_shrinker(&s->s_shrink);
 		fs->kill_sb(s);
@@ -317,6 +320,10 @@ void deactivate_locked_super(struct super_block *s)
 		list_lru_destroy(&s->s_dentry_lru);
 		list_lru_destroy(&s->s_inode_lru);
 
+		spin_lock(&bdi->sb_list_lock);
+		list_del_init(&s->s_bdi_dirty_list);
+		spin_unlock(&bdi->sb_list_lock);
+
 		put_filesystem(fs);
 		put_super(s);
 	} else {
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index b1f8f70..8758d31 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -169,6 +169,8 @@ struct backing_dev_info {
 
 	struct timer_list laptop_mode_wb_timer;
 
+	spinlock_t sb_list_lock;
+	struct list_head dirty_sb_list;
 #ifdef CONFIG_DEBUG_FS
 	struct dentry *debug_dir;
 	struct dentry *debug_stats;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 901e25d..0217820 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1431,6 +1431,8 @@ struct super_block {
 
 	spinlock_t		s_inode_wblist_lock;
 	struct list_head	s_inodes_wb;	/* writeback inodes */
+
+	struct list_head	s_bdi_dirty_list;
 };
 
 /* Helper functions so that in most cases filesystems will
@@ -1806,6 +1808,8 @@ struct super_operations {
 				  struct shrink_control *);
 	long (*free_cached_objects)(struct super_block *,
 				    struct shrink_control *);
+	void (*write_metadata)(struct super_block *sb,
+			       struct writeback_control *wbc);
 };
 
 /*
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index da3f68b..10d9286 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -785,6 +785,8 @@ int bdi_init(struct backing_dev_info *bdi)
 	bdi->max_prop_frac = FPROP_FRAC_BASE;
 	INIT_LIST_HEAD(&bdi->bdi_list);
 	INIT_LIST_HEAD(&bdi->wb_list);
+	INIT_LIST_HEAD(&bdi->dirty_sb_list);
+	spin_lock_init(&bdi->sb_list_lock);
 	init_waitqueue_head(&bdi->wb_waitq);
 
 	ret = cgwb_bdi_init(bdi);
-- 
2.7.4



* [PATCH 5/5] fs: don't set *REFERENCED unless we are on the lru list
  2016-10-25 18:41 [PATCH 0/5][RESEND] Support for metadata specific accounting Josef Bacik
                   ` (3 preceding siblings ...)
  2016-10-25 18:41 ` [PATCH 4/5] writeback: introduce super_operations->write_metadata Josef Bacik
@ 2016-10-25 18:41 ` Josef Bacik
  2016-10-25 22:01   ` Dave Chinner
  2016-10-25 22:44   ` Omar Sandoval
  4 siblings, 2 replies; 26+ messages in thread
From: Josef Bacik @ 2016-10-25 18:41 UTC (permalink / raw)
  To: linux-btrfs, kernel-team, david, jack, linux-fsdevel, viro, hch, jweiner

With anything that populates the inode/dentry cache with a lot of one-time-use
inodes we can really put a lot of pressure on the system for things we don't
need to keep in cache.  It takes two runs through the LRU to evict these one-use
entries, and if you have a lot of memory you can end up with tens of millions of
entries in the dcache or icache that have never actually been touched since they
were first instantiated, and it will take a lot of CPU and a lot of pressure to
evict all of them.

So instead do what we do with pagecache, only set the *REFERENCED flags if we
are being used after we've been put onto the LRU.  This makes a significant
difference in the system's ability to evict these useless cache entries.  With a
fs_mark workload that creates 40 million files we get the following results (all
in files/sec)

Btrfs			Patched		Unpatched
Average Files/sec:	72209.3		63254.2
p50 Files/sec:		70850		57560
p90 Files/sec:		68757		53085
p99 Files/sec:		68757		53085

XFS			Patched		Unpatched
Average Files/sec:	61025.5		60719.5
p50 Files/sec:		60107		59465
p90 Files/sec: 		59300		57966
p99 Files/sec: 		59227		57528

Ext4			Patched		Unpatched
Average Files/sec:	121785.4	119248.0
p50 Files/sec:		120852		119274
p90 Files/sec:		116703		112783
p99 Files/sec:		114393		104934

The reason Btrfs sees a much larger improvement is that it holds a lot more
things in memory and so benefits more from faster slab reclaim, but there is an
improvement across the board for each of the file systems.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/dcache.c | 15 ++++++++++-----
 fs/inode.c  |  5 ++++-
 2 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 5c7cc95..a558075 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -779,8 +779,6 @@ repeat:
 			goto kill_it;
 	}
 
-	if (!(dentry->d_flags & DCACHE_REFERENCED))
-		dentry->d_flags |= DCACHE_REFERENCED;
 	dentry_lru_add(dentry);
 
 	dentry->d_lockref.count--;
@@ -803,6 +801,13 @@ static inline void __dget_dlock(struct dentry *dentry)
 	dentry->d_lockref.count++;
 }
 
+static inline void __dget_dlock_reference(struct dentry *dentry)
+{
+	if (dentry->d_flags & DCACHE_LRU_LIST)
+		dentry->d_flags |= DCACHE_REFERENCED;
+	dentry->d_lockref.count++;
+}
+
 static inline void __dget(struct dentry *dentry)
 {
 	lockref_get(&dentry->d_lockref);
@@ -875,7 +880,7 @@ again:
 			    (alias->d_flags & DCACHE_DISCONNECTED)) {
 				discon_alias = alias;
 			} else {
-				__dget_dlock(alias);
+				__dget_dlock_reference(alias);
 				spin_unlock(&alias->d_lock);
 				return alias;
 			}
@@ -886,7 +891,7 @@ again:
 		alias = discon_alias;
 		spin_lock(&alias->d_lock);
 		if (S_ISDIR(inode->i_mode) || !d_unhashed(alias)) {
-			__dget_dlock(alias);
+			__dget_dlock_reference(alias);
 			spin_unlock(&alias->d_lock);
 			return alias;
 		}
@@ -2250,7 +2255,7 @@ struct dentry *__d_lookup(const struct dentry *parent, const struct qstr *name)
 		if (!d_same_name(dentry, parent, name))
 			goto next;
 
-		dentry->d_lockref.count++;
+		__dget_dlock_reference(dentry);
 		found = dentry;
 		spin_unlock(&dentry->d_lock);
 		break;
diff --git a/fs/inode.c b/fs/inode.c
index 7e3ef3a..5937d3a 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -796,6 +796,8 @@ repeat:
 			__wait_on_freeing_inode(inode);
 			goto repeat;
 		}
+		if (!list_empty(&inode->i_lru))
+			inode->i_state |= I_REFERENCED;
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
 		return inode;
@@ -823,6 +825,8 @@ repeat:
 			__wait_on_freeing_inode(inode);
 			goto repeat;
 		}
+		if (!list_empty(&inode->i_lru))
+			inode->i_state |= I_REFERENCED;
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
 		return inode;
@@ -1463,7 +1467,6 @@ static void iput_final(struct inode *inode)
 		drop = generic_drop_inode(inode);
 
 	if (!drop && (sb->s_flags & MS_ACTIVE)) {
-		inode->i_state |= I_REFERENCED;
 		inode_add_lru(inode);
 		spin_unlock(&inode->i_lock);
 		return;
-- 
2.7.4



* Re: [PATCH 1/5] remove mapping from balance_dirty_pages*()
  2016-10-25 18:41 ` [PATCH 1/5] remove mapping from balance_dirty_pages*() Josef Bacik
@ 2016-10-25 18:47   ` Tejun Heo
  0 siblings, 0 replies; 26+ messages in thread
From: Tejun Heo @ 2016-10-25 18:47 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-btrfs, kernel-team, david, jack, linux-fsdevel, viro, hch, jweiner

On Tue, Oct 25, 2016 at 02:41:40PM -0400, Josef Bacik wrote:
> The only reason we pass in the mapping is to get the inode in order to see if
> writeback cgroups is enabled, and even then it only checks the bdi and a super
> block flag.  balance_dirty_pages() doesn't even use the mapping.  Since
> balance_dirty_pages*() works on a bdi level, just pass in the bdi and super
> block directly so we can avoid using mapping.  This will allow us to still use
> balance_dirty_pages for dirty metadata pages that are not backed by an
> address_mapping.
> 
> Signed-off-by: Josef Bacik <jbacik@fb.com>
> Reviewed-by: Jan Kara <jack@suse.cz>

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun


* Re: [PATCH 2/5] writeback: convert WB_WRITTEN/WB_DIRITED counters to bytes
  2016-10-25 18:41 ` [PATCH 2/5] writeback: convert WB_WRITTEN/WB_DIRITED counters to bytes Josef Bacik
@ 2016-10-25 19:03   ` Tejun Heo
  2016-10-25 19:09     ` Josef Bacik
  2016-10-30 15:13   ` Jan Kara
  1 sibling, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2016-10-25 19:03 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-btrfs, kernel-team, david, jack, linux-fsdevel, viro, hch, jweiner

Hello, Josef.

On Tue, Oct 25, 2016 at 02:41:41PM -0400, Josef Bacik wrote:
> These are counters that constantly go up in order to do bandwidth calculations.
> It isn't important what the units are in, as long as they are consistent between
> the two of them, so convert them to count bytes written/dirtied, and allow the
> metadata accounting stuff to change the counters as well.
> 
> Signed-off-by: Josef Bacik <jbacik@fb.com>

Acked-by: Tejun Heo <tj@kernel.org>

A small nit below.

> @@ -2547,12 +2547,16 @@ void account_page_redirty(struct page *page)
>  	if (mapping && mapping_cap_account_dirty(mapping)) {
>  		struct inode *inode = mapping->host;
>  		struct bdi_writeback *wb;
> +		unsigned long flags;
>  		bool locked;
>  
>  		wb = unlocked_inode_to_wb_begin(inode, &locked);
>  		current->nr_dirtied--;
> -		dec_node_page_state(page, NR_DIRTIED);
> -		dec_wb_stat(wb, WB_DIRTIED);
> +
> +		local_irq_save(flags);
> +		__dec_node_page_state(page, NR_DIRTIED);
> +		__add_wb_stat(wb, WB_DIRTIED_BYTES, -(long)PAGE_SIZE);
> +		local_irq_restore(flags);

Hmmm... so, the explicit irq clustering is neutral or win as the code
currently stands but AFAICS that's just because add_wb_stat() doesn't
use the right percpu ops.  If we convert add_wb_stat() to use the
matching percpu ops, the above change would be more expensive at least
on x86.  Maybe just skip this part?

Thanks.

-- 
tejun


* Re: [PATCH 2/5] writeback: convert WB_WRITTEN/WB_DIRITED counters to bytes
  2016-10-25 19:03   ` Tejun Heo
@ 2016-10-25 19:09     ` Josef Bacik
  0 siblings, 0 replies; 26+ messages in thread
From: Josef Bacik @ 2016-10-25 19:09 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-btrfs, kernel-team, david, jack, linux-fsdevel, viro, hch, jweiner

On 10/25/2016 03:03 PM, Tejun Heo wrote:
> Hello, Josef.
>
> On Tue, Oct 25, 2016 at 02:41:41PM -0400, Josef Bacik wrote:
>> These are counters that constantly go up in order to do bandwidth calculations.
>> It isn't important what the units are in, as long as they are consistent between
>> the two of them, so convert them to count bytes written/dirtied, and allow the
>> metadata accounting stuff to change the counters as well.
>>
>> Signed-off-by: Josef Bacik <jbacik@fb.com>
>
> Acked-by: Tejun Heo <tj@kernel.org>
>
> A small nit below.
>
>> @@ -2547,12 +2547,16 @@ void account_page_redirty(struct page *page)
>>  	if (mapping && mapping_cap_account_dirty(mapping)) {
>>  		struct inode *inode = mapping->host;
>>  		struct bdi_writeback *wb;
>> +		unsigned long flags;
>>  		bool locked;
>>
>>  		wb = unlocked_inode_to_wb_begin(inode, &locked);
>>  		current->nr_dirtied--;
>> -		dec_node_page_state(page, NR_DIRTIED);
>> -		dec_wb_stat(wb, WB_DIRTIED);
>> +
>> +		local_irq_save(flags);
>> +		__dec_node_page_state(page, NR_DIRTIED);
>> +		__add_wb_stat(wb, WB_DIRTIED_BYTES, -(long)PAGE_SIZE);
>> +		local_irq_restore(flags);
>
> Hmmm... so, the explicit irq clustering is neutral or win as the code
> currently stands but AFAICS that's just because add_wb_stat() doesn't
> use the right percpu ops.  If we convert add_wb_stat() to use the
> matching percpu ops, the above change would be more expensive at least
> on x86.  Maybe just skip this part?
>

Yeah I can convert it over and then just not do the irq dance.  I'll fix that up 
for the next go around.  Thanks,

Josef



* Re: [PATCH 3/5] writeback: add counters for metadata usage
  2016-10-25 18:41 ` [PATCH 3/5] writeback: add counters for metadata usage Josef Bacik
@ 2016-10-25 19:50   ` Tejun Heo
  2016-10-26 15:20     ` Josef Bacik
  2016-10-30 15:36   ` Jan Kara
  1 sibling, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2016-10-25 19:50 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-btrfs, kernel-team, david, jack, linux-fsdevel, viro, hch, jweiner

Hello,

On Tue, Oct 25, 2016 at 02:41:42PM -0400, Josef Bacik wrote:
> Btrfs has no bounds except memory on the amount of dirty memory that we have in
> use for metadata.  Historically we have used a special inode so we could take
> advantage of the balance_dirty_pages throttling that comes with using pagecache.
> However as we'd like to support different blocksizes it would be nice to not
> have to rely on pagecache, but still get the balance_dirty_pages throttling
> without having to do it ourselves.
> 
> So introduce *METADATA_DIRTY_BYTES and *METADATA_WRITEBACK_BYTES.  These are
> zone and bdi_writeback counters to keep track of how many bytes we have in
> flight for METADATA.  We need to count in bytes as blocksizes could be
> percentages of pagesize.  We simply convert the bytes to number of pages where
> it is needed for the throttling.
> 
> Also introduce NR_METADATA_BYTES so we can keep track of the total amount of
> pages used for metadata on the system.  This is also needed so things like dirty
> throttling know that this is dirtyable memory as well and easily reclaimed.
> 
> Signed-off-by: Josef Bacik <jbacik@fb.com>

Some nits.

It'd be nice to note that this patch just introduces new fields
without using them and thus doesn't cause any behavioral changes.

> @@ -51,6 +51,8 @@ static DEVICE_ATTR(cpumap,  S_IRUGO, node_read_cpumask, NULL);
>  static DEVICE_ATTR(cpulist, S_IRUGO, node_read_cpulist, NULL);
>  
>  #define K(x) ((x) << (PAGE_SHIFT - 10))
> +#define BtoK(x) ((x) >> 10)

This would belong in a separate patch but any chance we can share
these definitions?  It's fine to have the definitions in a couple
places but these are getting duplicated in multiple spots and actually
getting confusing with K meaning pages to kilobytes.  I'm not sure how
it exactly should be tho.
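
For example, one possible shared form (the names here are just a suggestion,
nothing in this series defines them) would be:

	/* in some common header, suggestion only */
	#define PtoK(x)	((x) << (PAGE_SHIFT - 10))	/* pages -> KiB */
	#define BtoK(x)	((x) >> 10)			/* bytes -> KiB */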

> @@ -2473,6 +2504,100 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)
>  EXPORT_SYMBOL(account_page_dirtied);
>  
>  /*

/**

> + * account_metadata_dirtied
> + * @page - the page being dirited
> + * @bdi - the bdi that owns this page
> + * @bytes - the number of bytes being dirtied
> + *
> + * Do the dirty page accounting for metadata pages that aren't backed by an
> + * address_space.
> + */
> +void account_metadata_dirtied(struct page *page, struct backing_dev_info *bdi,
> +			      long bytes)
> +{
> +	unsigned long flags;
> +
> +	local_irq_save(flags);
> +	__mod_node_page_state(page_pgdat(page), NR_METADATA_DIRTY_BYTES,
> +			      bytes);
> +	__add_wb_stat(&bdi->wb, WB_DIRTIED_BYTES, bytes);
> +	__add_wb_stat(&bdi->wb, WB_METADATA_DIRTY_BYTES, bytes);
> +	current->nr_dirtied++;
> +	task_io_account_write(bytes);
> +	this_cpu_inc(bdp_ratelimits);
> +	local_irq_restore(flags);

Again, I'm not sure about the explicit irq ops especially as some of
the counters are already irq safe.

> +}
> +EXPORT_SYMBOL(account_metadata_dirtied);
> +
> +/*

/**

> + * account_metadata_cleaned
> + * @page - the page being cleaned
> + * @bdi - the bdi that owns this page
> + * @bytes - the number of bytes cleaned
> + *
> + * Called on a no longer dirty metadata page.
> + */
> +void account_metadata_cleaned(struct page *page, struct backing_dev_info *bdi,
> +			      long bytes)
> +{
> +	unsigned long flags;
> +
> +	local_irq_save(flags);
> +	__mod_node_page_state(page_pgdat(page), NR_METADATA_DIRTY_BYTES,
> +			      -bytes);
> +	__add_wb_stat(&bdi->wb, WB_METADATA_DIRTY_BYTES, -bytes);
> +	task_io_account_cancelled_write(bytes);
> +	local_irq_restore(flags);

Ditto with irq and the following functions.

> @@ -3701,7 +3703,20 @@ static unsigned long node_pagecache_reclaimable(struct pglist_data *pgdat)
>  	if (unlikely(delta > nr_pagecache_reclaimable))
>  		delta = nr_pagecache_reclaimable;
>  
> -	return nr_pagecache_reclaimable - delta;
> +	nr_metadata_reclaimable =
> +		node_page_state(pgdat, NR_METADATA_BYTES) >> PAGE_SHIFT;
> +	/*
> +	 * We don't do writeout through the shrinkers so subtract any
> +	 * dirty/writeback metadata bytes from the reclaimable count.
> +	 */

Hmm... up until this point, the dirty metadata was handled the same
way as regular dirty data but it deviates here.  Is this right?  The
calculations in writeback code also assume that the dirty pages are
reclaimable.  If this is inherently different, it'd be nice to explain
more explicitly why this is different from others.

Thanks.

-- 
tejun


* Re: [PATCH 4/5] writeback: introduce super_operations->write_metadata
  2016-10-25 18:41 ` [PATCH 4/5] writeback: introduce super_operations->write_metadata Josef Bacik
@ 2016-10-25 20:00   ` Tejun Heo
  0 siblings, 0 replies; 26+ messages in thread
From: Tejun Heo @ 2016-10-25 20:00 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-btrfs, kernel-team, david, jack, linux-fsdevel, viro, hch, jweiner

Hello,

On Tue, Oct 25, 2016 at 02:41:43PM -0400, Josef Bacik wrote:
> Now that we have metadata counters in the VM, we need to provide a way to kick
> writeback on dirty metadata.  Introduce super_operations->write_metadata.  This
> allows file systems to deal with writing back any dirty metadata we need based
> on the writeback needs of the system.  Since there is no inode to key off of we
> need a list in the bdi for dirty super blocks to be added.  From there we can
> find any dirty sb's on the bdi we are currently doing writeback on and call into
> their ->write_metadata callback.
> 
> Signed-off-by: Josef Bacik <jbacik@fb.com>
> Reviewed-by: Jan Kara <jack@suse.cz>

Reviewed-by: Tejun Heo <tj@kernel.org>

> @@ -1491,6 +1516,7 @@ static long writeback_sb_inodes(struct super_block *sb,
>  	unsigned long start_time = jiffies;
>  	long write_chunk;
>  	long wrote = 0;  /* count both pages and inodes */
> +	bool done = false;
>  
>  	while (!list_empty(&wb->b_io)) {
>  		struct inode *inode = wb_inode(wb->b_io.prev);
> @@ -1607,12 +1633,18 @@ static long writeback_sb_inodes(struct super_block *sb,
>  		 * background threshold and other termination conditions.
>  		 */
>  		if (wrote) {
> -			if (time_is_before_jiffies(start_time + HZ / 10UL))
> -				break;
> -			if (work->nr_pages <= 0)
> +			if (time_is_before_jiffies(start_time + HZ / 10UL) ||
> +			    work->nr_pages <= 0) {
> +				done = true;
>  				break;
> +			}
>  		}
>  	}
> +	if (!done && sb->s_op->write_metadata) {
> +		spin_unlock(&wb->list_lock);
> +		wrote += writeback_sb_metadata(sb, wb, work);
> +		spin_lock(&wb->list_lock);

So, here and below, we're writing out metadata only after finishing
writing out data if the writeout budget is limited.  It'd be great to
document the behavior and why it's so in the callback definition.
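
Something along these lines next to the callback declaration would capture it
(the wording is only a sketch of what such a comment could say, not text from
the patch):

	/*
	 * Write back dirty filesystem metadata.  Called from the writeback
	 * path only after the data inodes queued on this wb have been
	 * processed, and only if the time slice and work->nr_pages budget
	 * have not already been exhausted, so data writeback keeps priority
	 * when the budget is limited.  wbc->nr_to_write is a budget in
	 * pages that the callback should decrement by the amount written.
	 */
	void (*write_metadata)(struct super_block *sb,
			       struct writeback_control *wbc);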

Thanks.

-- 
tejun


* Re: [PATCH 5/5] fs: don't set *REFERENCED unless we are on the lru list
  2016-10-25 18:41 ` [PATCH 5/5] fs: don't set *REFERENCED unless we are on the lru list Josef Bacik
@ 2016-10-25 22:01   ` Dave Chinner
  2016-10-25 23:36     ` Dave Chinner
  2016-10-26 15:11     ` Josef Bacik
  2016-10-25 22:44   ` Omar Sandoval
  1 sibling, 2 replies; 26+ messages in thread
From: Dave Chinner @ 2016-10-25 22:01 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-btrfs, kernel-team, jack, linux-fsdevel, viro, hch, jweiner

On Tue, Oct 25, 2016 at 02:41:44PM -0400, Josef Bacik wrote:
> With anything that populates the inode/dentry cache with a lot of one time use
> inodes we can really put a lot of pressure on the system for things we don't
> need to keep in cache.  It takes two runs through the LRU to evict these one use
> entries, and if you have a lot of memory you can end up with 10's of millions of
> entries in the dcache or icache that have never actually been touched since they
> were first instantiated, and it will take a lot of CPU and a lot of pressure to
> evict all of them.
> 
> So instead do what we do with pagecache, only set the *REFERENCED flags if we
> are being used after we've been put onto the LRU.  This makes a significant
> difference in the system's ability to evict these useless cache entries.  With a
> fs_mark workload that creates 40 million files we get the following results (all
> in files/sec)

What's the workload, storage, etc?

> Btrfs			Patched		Unpatched
> Average Files/sec:	72209.3		63254.2
> p50 Files/sec:	70850		57560
> p90 Files/sec:	68757		53085
> p99 Files/sec:	68757		53085

So how much of this is from changing the dentry referenced
behaviour, and how much from the inode? Can you separate out the two
changes so we know which one is actually affecting reclaim
performance?

Indeed, I wonder if just changing the superblock shrinker
default_seeks for btrfs would have exactly the same impact because
that can be used to exactly double the reclaim scan rate for the same
memory pressure.  If that doesn't change performance by a similar
amount (changing default_seeks is the normal way of changing
shrinker balance), then more digging is required here to explain why
the referenced bits make such an impact to steady state
performance...
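
For reference, that knob lives on the superblock shrinker, so the experiment
would be roughly this (illustrative only, not a proposed patch):

	/*
	 * In the filesystem's fill_super/mount path.  shrink_slab() scales
	 * the number of objects it asks a shrinker to scan inversely with
	 * ->seeks, so halving it from DEFAULT_SEEKS (2) doubles the scan
	 * rate for the same memory pressure.
	 */
	sb->s_shrink.seeks = DEFAULT_SEEKS / 2;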

> XFS			Patched		Unpatched
> Average Files/sec:	61025.5		60719.5
> p50 Files/sec:	60107		59465
> p90 Files/sec:	59300		57966
> p99 Files/sec:	59227		57528

You made XFS never use I_REFERENCED at all (hint: not all
filesystems use find_inode/find_inode_fast()), so it's not clear
that the extra scanning (less than 1% difference in average
throughput) is actually the cause of things being slower in btrfs.

> The reason Btrfs has a much larger improvement is because it holds a lot more
> things in memory so benefits more from faster slab reclaim, but across the board
> is an improvement for each of the file systems.

Less than 1% for XFS and ~1.5% for ext4 is well within the
run-to-run variation of fsmark. It looks like it might be slightly
faster, but it's not a cut-and-dried win for anything other than
btrfs.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 5/5] fs: don't set *REFERENCED unless we are on the lru list
  2016-10-25 18:41 ` [PATCH 5/5] fs: don't set *REFERENCED unless we are on the lru list Josef Bacik
  2016-10-25 22:01   ` Dave Chinner
@ 2016-10-25 22:44   ` Omar Sandoval
  2016-10-26  4:17     ` [PATCH 5/5] " Andreas Dilger
  1 sibling, 1 reply; 26+ messages in thread
From: Omar Sandoval @ 2016-10-25 22:44 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-btrfs, kernel-team, david, jack, linux-fsdevel, viro, hch, jweiner

On Tue, Oct 25, 2016 at 02:41:44PM -0400, Josef Bacik wrote:
> With anything that populates the inode/dentry cache with a lot of one time use
> inodes we can really put a lot of pressure on the system for things we don't
> need to keep in cache.  It takes two runs through the LRU to evict these one use
> entries, and if you have a lot of memory you can end up with 10's of millions of
> entries in the dcache or icache that have never actually been touched since they
> were first instantiated, and it will take a lot of CPU and a lot of pressure to
> evict all of them.
> 
> So instead do what we do with pagecache, only set the *REFERENCED flags if we
> are being used after we've been put onto the LRU.  This makes a significant
> difference in the system's ability to evict these useless cache entries.  With a
> fs_mark workload that creates 40 million files we get the following results (all
> in files/sec)
> 
> Btrfs			Patched		Unpatched
> Average Files/sec:	72209.3		63254.2
> p50 Files/sec:		70850		57560
> p90 Files/sec:		68757		53085
> p99 Files/sec:		68757		53085
> 
> XFS			Patched		Unpatched
> Average Files/sec:	61025.5		60719.5
> p50 Files/sec:		60107		59465
> p90 Files/sec: 		59300		57966
> p99 Files/sec: 		59227		57528
> 
> Ext4			Patched		Unpatched
> Average Files/sec:	121785.4	119248.0
> p50 Files/sec:		120852		119274
> p90 Files/sec:		116703		112783
> p99 Files/sec:		114393		104934
> 
> The reason Btrfs has a much larger improvement is because it holds a lot more
> things in memory so benefits more from faster slab reclaim, but across the board
> is an improvement for each of the file systems.
> 
> Signed-off-by: Josef Bacik <jbacik@fb.com>
> ---
>  fs/dcache.c | 15 ++++++++++-----
>  fs/inode.c  |  5 ++++-
>  2 files changed, 14 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 5c7cc95..a558075 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -779,8 +779,6 @@ repeat:
>  			goto kill_it;
>  	}
>  
> -	if (!(dentry->d_flags & DCACHE_REFERENCED))
> -		dentry->d_flags |= DCACHE_REFERENCED;
>  	dentry_lru_add(dentry);
>  
>  	dentry->d_lockref.count--;
> @@ -803,6 +801,13 @@ static inline void __dget_dlock(struct dentry *dentry)
>  	dentry->d_lockref.count++;
>  }
>  
> +static inline void __dget_dlock_reference(struct dentry *dentry)
> +{
> +	if (dentry->d_flags & DCACHE_LRU_LIST)
> +		dentry->d_flags |= DCACHE_REFERENCED;
> +	dentry->d_lockref.count++;
> +}
> +
>  static inline void __dget(struct dentry *dentry)
>  {
>  	lockref_get(&dentry->d_lockref);
> @@ -875,7 +880,7 @@ again:
>  			    (alias->d_flags & DCACHE_DISCONNECTED)) {
>  				discon_alias = alias;
>  			} else {
> -				__dget_dlock(alias);
> +				__dget_dlock_reference(alias);
>  				spin_unlock(&alias->d_lock);
>  				return alias;
>  			}
> @@ -886,7 +891,7 @@ again:
>  		alias = discon_alias;
>  		spin_lock(&alias->d_lock);
>  		if (S_ISDIR(inode->i_mode) || !d_unhashed(alias)) {
> -			__dget_dlock(alias);
> +			__dget_dlock_reference(alias);
>  			spin_unlock(&alias->d_lock);
>  			return alias;
>  		}
> @@ -2250,7 +2255,7 @@ struct dentry *__d_lookup(const struct dentry *parent, const struct qstr *name)
>  		if (!d_same_name(dentry, parent, name))
>  			goto next;
>  
> -		dentry->d_lockref.count++;
> +		__dget_dlock_reference(dentry);

This misses dentries that we get through __d_lookup_rcu(), so I think
your change made it so most things aren't getting DCACHE_REFERENCED set
at all. Maybe something like this instead?

diff --git a/fs/dcache.c b/fs/dcache.c
index 5c7cc95..d7a56a8 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -412,15 +412,6 @@ static void d_lru_shrink_move(struct list_lru_one *lru, struct dentry *dentry,
 	list_lru_isolate_move(lru, &dentry->d_lru, list);
 }
 
-/*
- * dentry_lru_(add|del)_list) must be called with d_lock held.
- */
-static void dentry_lru_add(struct dentry *dentry)
-{
-	if (unlikely(!(dentry->d_flags & DCACHE_LRU_LIST)))
-		d_lru_add(dentry);
-}
-
 /**
  * d_drop - drop a dentry
  * @dentry: dentry to drop
@@ -779,9 +770,12 @@ void dput(struct dentry *dentry)
 			goto kill_it;
 	}
 
-	if (!(dentry->d_flags & DCACHE_REFERENCED))
-		dentry->d_flags |= DCACHE_REFERENCED;
-	dentry_lru_add(dentry);
+	if (likely(dentry->d_flags & DCACHE_LRU_LIST)) {
+		if (!(dentry->d_flags & DCACHE_REFERENCED))
+			dentry->d_flags |= DCACHE_REFERENCED;
+	} else {
+		d_lru_add(dentry);
+	}
 
 	dentry->d_lockref.count--;
 	spin_unlock(&dentry->d_lock);
diff --git a/fs/inode.c b/fs/inode.c
index 88110fd..16faca3 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -798,6 +798,8 @@ static struct inode *find_inode(struct super_block *sb,
 			__wait_on_freeing_inode(inode);
 			goto repeat;
 		}
+		if (!list_empty(&inode->i_lru))
+			inode->i_state |= I_REFERENCED;
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
 		return inode;
@@ -825,6 +827,8 @@ static struct inode *find_inode_fast(struct super_block *sb,
 			__wait_on_freeing_inode(inode);
 			goto repeat;
 		}
+		if (!list_empty(&inode->i_lru))
+			inode->i_state |= I_REFERENCED;
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
 		return inode;
@@ -1492,7 +1496,6 @@ static void iput_final(struct inode *inode)
 		drop = generic_drop_inode(inode);
 
 	if (!drop && (sb->s_flags & MS_ACTIVE)) {
-		inode->i_state |= I_REFERENCED;
 		inode_add_lru(inode);
 		spin_unlock(&inode->i_lock);
 		return;

-- 
Omar


* Re: [PATCH 5/5] fs: don't set *REFERENCED unless we are on the lru list
  2016-10-25 22:01   ` Dave Chinner
@ 2016-10-25 23:36     ` Dave Chinner
  2016-10-26 20:03       ` Josef Bacik
  2016-10-26 15:11     ` Josef Bacik
  1 sibling, 1 reply; 26+ messages in thread
From: Dave Chinner @ 2016-10-25 23:36 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-btrfs, kernel-team, jack, linux-fsdevel, viro, hch, jweiner

On Wed, Oct 26, 2016 at 09:01:13AM +1100, Dave Chinner wrote:
> On Tue, Oct 25, 2016 at 02:41:44PM -0400, Josef Bacik wrote:
> > With anything that populates the inode/dentry cache with a lot of one time use
> > inodes we can really put a lot of pressure on the system for things we don't
> > need to keep in cache.  It takes two runs through the LRU to evict these one use
> > entries, and if you have a lot of memory you can end up with 10's of millions of
> > entries in the dcache or icache that have never actually been touched since they
> > were first instantiated, and it will take a lot of CPU and a lot of pressure to
> > evict all of them.
> > 
> > So instead do what we do with pagecache, only set the *REFERENCED flags if we
> > are being used after we've been put onto the LRU.  This makes a significant
> > difference in the system's ability to evict these useless cache entries.  With a
> > fs_mark workload that creates 40 million files we get the following results (all
> > in files/sec)
> 
> What's the workload, storage, etc?
> 
> > Btrfs			Patched		Unpatched
> > Average Files/sec:	72209.3		63254.2
> > p50 Files/sec:	70850		57560
> > p90 Files/sec:	68757		53085
> > p99 Files/sec:	68757		53085
> 
> So how much of this is from changing the dentry referenced
> behaviour, and how much from the inode? Can you separate out the two
> changes so we know which one is actually affecting reclaim
> performance?

FWIW, I just went to run my usual zero-length file creation fsmark
test (16-way create, large sparse FS image on SSDs). XFS (with debug
enabled) takes 4m10s to run at an average of ~230k files/s, with a
std deviation of +/-1.7k files/s.

Unfortunately, btrfs turns that into more heat than it ever has done
before. It's only managing 35k files/s and the profile looks like
this:

  58.79%  [kernel]  [k] __pv_queued_spin_lock_slowpath
   5.61%  [kernel]  [k] queued_write_lock_slowpath
   1.65%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
   1.38%  [kernel]  [k] reschedule_interrupt
   1.08%  [kernel]  [k] _raw_spin_lock
   0.92%  [kernel]  [k] __radix_tree_lookup
   0.86%  [kernel]  [k] _raw_spin_lock_irqsave
   0.83%  [kernel]  [k] btrfs_set_lock_blocking_rw

I killed it because this would take too long to run.

I reduced the concurrency down to 4-way, spinlock contention went
down to about 40% of the CPU time.  I reduced the concurrency down
to 2 and saw about 16% of CPU time being spent in lock contention.

Throughput results:
				btrfs throughput
			2-way			4-way
unpatched	46938.1+/-2.8e+03	40273.4+/-3.1e+03
patched		45697.2+/-2.4e+03	49287.1+/-3e+03


So, 2-way has not improved. If changing referenced behaviour was an
obvious win for btrfs, we'd expect to see that here as well.
However, because 4-way improved by 20%, I think all we're seeing is
a slight change in lock contention levels inside btrfs. Indeed,
looking at the profiles I see that lock contention time was reduced
to around 32% of the total CPU used (down by about 20%):

  20.79%  [kernel]  [k] __pv_queued_spin_lock_slowpath
   3.85%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
   3.68%  [kernel]  [k] _raw_spin_lock
   3.40%  [kernel]  [k] queued_write_lock_slowpath
   .....

IOWs, the performance increase comes from the fact that changing
inode cache eviction patterns causes slightly less lock contention
in btrfs inode reclaim. IOWs, the problem that needs fixing is the
btrfs lock contention, not the VFS cache LRU algorithms.

Root cause analysis needs to be done properly before behavioural
changes like this are proposed, people!

-Dave.

PS: I caught this profile on unmount when the 8 million inodes
cached inodes were being reclaimed. evict_inodes() ignores the
referenced bit, so this gives a lot of insight into the work being
done by inode reclaim in a filesystem:

  18.54%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
   9.43%  [kernel]  [k] rb_erase
   8.03%  [kernel]  [k] __btrfs_release_delayed_node
   7.23%  [kernel]  [k] _raw_spin_lock
   6.93%  [kernel]  [k] __list_del_entry
   4.35%  [kernel]  [k] __slab_free
   3.93%  [kernel]  [k] __mutex_lock_slowpath
   2.77%  [kernel]  [k] bit_waitqueue
   2.58%  [kernel]  [k] kmem_cache_alloc
   2.50%  [kernel]  [k] __radix_tree_lookup
   2.44%  [kernel]  [k] _raw_spin_lock_irq
   2.18%  [kernel]  [k] kmem_cache_free
   2.17%  [kernel]  [k] evict			<<<<<<<<<<<
   2.13%  [kernel]  [k] fsnotify_destroy_marks
   1.68%  [kernel]  [k] btrfs_remove_delayed_node
   1.61%  [kernel]  [k] __call_rcu.constprop.70
   1.50%  [kernel]  [k] __remove_inode_hash
   1.49%  [kernel]  [k] kmem_cache_alloc_trace
   1.39%  [kernel]  [k] ___might_sleep
   1.15%  [kernel]  [k] __memset
   1.12%  [kernel]  [k] __mutex_unlock_slowpath
   1.03%  [kernel]  [k] evict_inodes		<<<<<<<<<<
   1.02%  [kernel]  [k] cmpxchg_double_slab.isra.66
   0.93%  [kernel]  [k] free_extent_map
   0.83%  [kernel]  [k] _raw_write_lock
   0.69%  [kernel]  [k] __might_sleep
   0.56%  [kernel]  [k] _raw_spin_unlock

The VFS inode cache traversal to evict inodes is a very small part
of the CPU usage being recorded. btrfs lock traffic alone accounts
for more than 10x as much CPU usage as inode cache traversal....
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 5/5] don't set *REFERENCED unless we are on the lru list
  2016-10-25 22:44   ` Omar Sandoval
@ 2016-10-26  4:17     ` Andreas Dilger
  2016-10-26  5:24       ` Omar Sandoval
  0 siblings, 1 reply; 26+ messages in thread
From: Andreas Dilger @ 2016-10-26  4:17 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: Josef Bacik, linux-btrfs, kernel-team, david, jack,
	linux-fsdevel, viro, hch, jweiner

On Oct 25, 2016, at 4:44 PM, Omar Sandoval <osandov@osandov.com> wrote:
> 
> On Tue, Oct 25, 2016 at 02:41:44PM -0400, Josef Bacik wrote:
>> With anything that populates the inode/dentry cache with a lot of one time use
>> inodes we can really put a lot of pressure on the system for things we don't
>> need to keep in cache.  It takes two runs through the LRU to evict these one use
>> entries, and if you have a lot of memory you can end up with 10's of millions of
>> entries in the dcache or icache that have never actually been touched since they
>> were first instantiated, and it will take a lot of CPU and a lot of pressure to
>> evict all of them.
>> 
>> So instead do what we do with pagecache, only set the *REFERENCED flags if we
>> are being used after we've been put onto the LRU.  This makes a significant
>> difference in the system's ability to evict these useless cache entries.  With a
>> fs_mark workload that creates 40 million files we get the following results (all
>> in files/sec)
>> 
>> Btrfs			Patched		Unpatched
>> Average Files/sec:	72209.3		63254.2
>> p50 Files/sec:		70850		57560
>> p90 Files/sec:		68757		53085
>> p99 Files/sec:		68757		53085
>> 
>> XFS			Patched		Unpatched
>> Average Files/sec:	61025.5		60719.5
>> p50 Files/sec:		60107		59465
>> p90 Files/sec: 		59300		57966
>> p99 Files/sec: 		59227		57528
>> 
>> Ext4			Patched		Unpatched
>> Average Files/sec:	121785.4	119248.0
>> p50 Files/sec:		120852		119274
>> p90 Files/sec:		116703		112783
>> p99 Files/sec:		114393		104934
>> 
>> The reason Btrfs has a much larger improvement is because it holds a lot more
>> things in memory so benefits more from faster slab reclaim, but across the board
>> is an improvement for each of the file systems.
>> 
>> Signed-off-by: Josef Bacik <jbacik@fb.com>
>> ---
>> fs/dcache.c | 15 ++++++++++-----
>> fs/inode.c  |  5 ++++-
>> 2 files changed, 14 insertions(+), 6 deletions(-)
>> 
>> diff --git a/fs/dcache.c b/fs/dcache.c
>> index 5c7cc95..a558075 100644
>> --- a/fs/dcache.c
>> +++ b/fs/dcache.c
>> @@ -779,8 +779,6 @@ repeat:
>> 			goto kill_it;
>> 	}
>> 
>> -	if (!(dentry->d_flags & DCACHE_REFERENCED))
>> -		dentry->d_flags |= DCACHE_REFERENCED;
>> 	dentry_lru_add(dentry);
>> 
>> 	dentry->d_lockref.count--;
>> @@ -803,6 +801,13 @@ static inline void __dget_dlock(struct dentry *dentry)
>> 	dentry->d_lockref.count++;
>> }
>> 
>> +static inline void __dget_dlock_reference(struct dentry *dentry)
>> +{
>> +	if (dentry->d_flags & DCACHE_LRU_LIST)
>> +		dentry->d_flags |= DCACHE_REFERENCED;
>> +	dentry->d_lockref.count++;
>> +}
>> +
>> static inline void __dget(struct dentry *dentry)
>> {
>> 	lockref_get(&dentry->d_lockref);
>> @@ -875,7 +880,7 @@ again:
>> 			    (alias->d_flags & DCACHE_DISCONNECTED)) {
>> 				discon_alias = alias;
>> 			} else {
>> -				__dget_dlock(alias);
>> +				__dget_dlock_reference(alias);
>> 				spin_unlock(&alias->d_lock);
>> 				return alias;
>> 			}
>> @@ -886,7 +891,7 @@ again:
>> 		alias = discon_alias;
>> 		spin_lock(&alias->d_lock);
>> 		if (S_ISDIR(inode->i_mode) || !d_unhashed(alias)) {
>> -			__dget_dlock(alias);
>> +			__dget_dlock_reference(alias);
>> 			spin_unlock(&alias->d_lock);
>> 			return alias;
>> 		}
>> @@ -2250,7 +2255,7 @@ struct dentry *__d_lookup(const struct dentry *parent, const struct qstr *name)
>> 		if (!d_same_name(dentry, parent, name))
>> 			goto next;
>> 
>> -		dentry->d_lockref.count++;
>> +		__dget_dlock_reference(dentry);
> 
> This misses dentries that we get through __d_lookup_rcu(), so I think
> your change made it so most things aren't getting DCACHE_REFERENCED set
> at all. Maybe something like this instead?
> 
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 5c7cc95..d7a56a8 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -412,15 +412,6 @@ static void d_lru_shrink_move(struct list_lru_one *lru, struct dentry *dentry,
> 	list_lru_isolate_move(lru, &dentry->d_lru, list);
> }
> 
> -/*
> - * dentry_lru_(add|del)_list) must be called with d_lock held.
> - */
> -static void dentry_lru_add(struct dentry *dentry)
> -{
> -	if (unlikely(!(dentry->d_flags & DCACHE_LRU_LIST)))
> -		d_lru_add(dentry);
> -}
> -
> /**
>  * d_drop - drop a dentry
>  * @dentry: dentry to drop
> @@ -779,9 +770,12 @@ void dput(struct dentry *dentry)
> 			goto kill_it;
> 	}
> 
> -	if (!(dentry->d_flags & DCACHE_REFERENCED))
> -		dentry->d_flags |= DCACHE_REFERENCED;
> -	dentry_lru_add(dentry);
> +	if (likely(dentry->d_flags & DCACHE_LRU_LIST)) {
> +		if (!(dentry->d_flags & DCACHE_REFERENCED))
> +			dentry->d_flags |= DCACHE_REFERENCED;

There is no benefit to checking if DCACHE_REFERENCED is unset before
trying to set it.  You've already accessed the cacheline, so you can
avoid the branch here and just set it unconditionally.

Cheers, Andreas

> +	} else {
> +		d_lru_add(dentry);
> +	}
> 
> 	dentry->d_lockref.count--;
> 	spin_unlock(&dentry->d_lock);
> diff --git a/fs/inode.c b/fs/inode.c
> index 88110fd..16faca3 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -798,6 +798,8 @@ static struct inode *find_inode(struct super_block *sb,
> 			__wait_on_freeing_inode(inode);
> 			goto repeat;
> 		}
> +		if (!list_empty(&inode->i_lru))
> +			inode->i_state |= I_REFERENCED;
> 		__iget(inode);
> 		spin_unlock(&inode->i_lock);
> 		return inode;
> @@ -825,6 +827,8 @@ static struct inode *find_inode_fast(struct super_block *sb,
> 			__wait_on_freeing_inode(inode);
> 			goto repeat;
> 		}
> +		if (!list_empty(&inode->i_lru))
> +			inode->i_state |= I_REFERENCED;
> 		__iget(inode);
> 		spin_unlock(&inode->i_lock);
> 		return inode;
> @@ -1492,7 +1496,6 @@ static void iput_final(struct inode *inode)
> 		drop = generic_drop_inode(inode);
> 
> 	if (!drop && (sb->s_flags & MS_ACTIVE)) {
> -		inode->i_state |= I_REFERENCED;
> 		inode_add_lru(inode);
> 		spin_unlock(&inode->i_lock);
> 		return;
> 
> --
> Omar


Cheers, Andreas








* Re: [PATCH 5/5] don't set *REFERENCED unless we are on the lru list
  2016-10-26  4:17     ` [PATCH 5/5] " Andreas Dilger
@ 2016-10-26  5:24       ` Omar Sandoval
  0 siblings, 0 replies; 26+ messages in thread
From: Omar Sandoval @ 2016-10-26  5:24 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Josef Bacik, linux-btrfs, kernel-team, david, jack,
	linux-fsdevel, viro, hch, jweiner

On Tue, Oct 25, 2016 at 10:17:19PM -0600, Andreas Dilger wrote:
> On Oct 25, 2016, at 4:44 PM, Omar Sandoval <osandov@osandov.com> wrote:
> > 
> > On Tue, Oct 25, 2016 at 02:41:44PM -0400, Josef Bacik wrote:
> >> With anything that populates the inode/dentry cache with a lot of one time use
> >> inodes we can really put a lot of pressure on the system for things we don't
> >> need to keep in cache.  It takes two runs through the LRU to evict these one use
> >> entries, and if you have a lot of memory you can end up with 10's of millions of
> >> entries in the dcache or icache that have never actually been touched since they
> >> were first instantiated, and it will take a lot of CPU and a lot of pressure to
> >> evict all of them.
> >> 
> >> So instead do what we do with pagecache, only set the *REFERENCED flags if we
> >> are being used after we've been put onto the LRU.  This makes a significant
> >> difference in the system's ability to evict these useless cache entries.  With a
> >> fs_mark workload that creates 40 million files we get the following results (all
> >> in files/sec)
> >> 
> >> Btrfs			Patched		Unpatched
> >> Average Files/sec:	72209.3		63254.2
> >> p50 Files/sec:		70850		57560
> >> p90 Files/sec:		68757		53085
> >> p99 Files/sec:		68757		53085
> >> 
> >> XFS			Patched		Unpatched
> >> Average Files/sec:	61025.5		60719.5
> >> p50 Files/sec:		60107		59465
> >> p90 Files/sec: 		59300		57966
> >> p99 Files/sec: 		59227		57528
> >> 
> >> Ext4			Patched		Unpatched
> >> Average Files/sec:	121785.4	119248.0
> >> p50 Files/sec:		120852		119274
> >> p90 Files/sec:		116703		112783
> >> p99 Files/sec:		114393		104934
> >> 
> >> The reason Btrfs has a much larger improvement is because it holds a lot more
> >> things in memory so benefits more from faster slab reclaim, but across the board
> >> is an improvement for each of the file systems.
> >> 
> >> Signed-off-by: Josef Bacik <jbacik@fb.com>
> >> ---
> >> fs/dcache.c | 15 ++++++++++-----
> >> fs/inode.c  |  5 ++++-
> >> 2 files changed, 14 insertions(+), 6 deletions(-)
> >> 
> >> diff --git a/fs/dcache.c b/fs/dcache.c
> >> index 5c7cc95..a558075 100644
> >> --- a/fs/dcache.c
> >> +++ b/fs/dcache.c
> >> @@ -779,8 +779,6 @@ repeat:
> >> 			goto kill_it;
> >> 	}
> >> 
> >> -	if (!(dentry->d_flags & DCACHE_REFERENCED))
> >> -		dentry->d_flags |= DCACHE_REFERENCED;
> >> 	dentry_lru_add(dentry);
> >> 
> >> 	dentry->d_lockref.count--;
> >> @@ -803,6 +801,13 @@ static inline void __dget_dlock(struct dentry *dentry)
> >> 	dentry->d_lockref.count++;
> >> }
> >> 
> >> +static inline void __dget_dlock_reference(struct dentry *dentry)
> >> +{
> >> +	if (dentry->d_flags & DCACHE_LRU_LIST)
> >> +		dentry->d_flags |= DCACHE_REFERENCED;
> >> +	dentry->d_lockref.count++;
> >> +}
> >> +
> >> static inline void __dget(struct dentry *dentry)
> >> {
> >> 	lockref_get(&dentry->d_lockref);
> >> @@ -875,7 +880,7 @@ again:
> >> 			    (alias->d_flags & DCACHE_DISCONNECTED)) {
> >> 				discon_alias = alias;
> >> 			} else {
> >> -				__dget_dlock(alias);
> >> +				__dget_dlock_reference(alias);
> >> 				spin_unlock(&alias->d_lock);
> >> 				return alias;
> >> 			}
> >> @@ -886,7 +891,7 @@ again:
> >> 		alias = discon_alias;
> >> 		spin_lock(&alias->d_lock);
> >> 		if (S_ISDIR(inode->i_mode) || !d_unhashed(alias)) {
> >> -			__dget_dlock(alias);
> >> +			__dget_dlock_reference(alias);
> >> 			spin_unlock(&alias->d_lock);
> >> 			return alias;
> >> 		}
> >> @@ -2250,7 +2255,7 @@ struct dentry *__d_lookup(const struct dentry *parent, const struct qstr *name)
> >> 		if (!d_same_name(dentry, parent, name))
> >> 			goto next;
> >> 
> >> -		dentry->d_lockref.count++;
> >> +		__dget_dlock_reference(dentry);
> > 
> > This misses dentries that we get through __d_lookup_rcu(), so I think
> > your change made it so most things aren't getting DCACHE_REFERENCED set
> > at all. Maybe something like this instead?
> > 
> > diff --git a/fs/dcache.c b/fs/dcache.c
> > index 5c7cc95..d7a56a8 100644
> > --- a/fs/dcache.c
> > +++ b/fs/dcache.c
> > @@ -412,15 +412,6 @@ static void d_lru_shrink_move(struct list_lru_one *lru, struct dentry *dentry,
> > 	list_lru_isolate_move(lru, &dentry->d_lru, list);
> > }
> > 
> > -/*
> > - * dentry_lru_(add|del)_list) must be called with d_lock held.
> > - */
> > -static void dentry_lru_add(struct dentry *dentry)
> > -{
> > -	if (unlikely(!(dentry->d_flags & DCACHE_LRU_LIST)))
> > -		d_lru_add(dentry);
> > -}
> > -
> > /**
> >  * d_drop - drop a dentry
> >  * @dentry: dentry to drop
> > @@ -779,9 +770,12 @@ void dput(struct dentry *dentry)
> > 			goto kill_it;
> > 	}
> > 
> > -	if (!(dentry->d_flags & DCACHE_REFERENCED))
> > -		dentry->d_flags |= DCACHE_REFERENCED;
> > -	dentry_lru_add(dentry);
> > +	if (likely(dentry->d_flags & DCACHE_LRU_LIST)) {
> > +		if (!(dentry->d_flags & DCACHE_REFERENCED))
> > +			dentry->d_flags |= DCACHE_REFERENCED;
> 
> There is no benefit to checking if DCACHE_REFERENCED is unset before
> trying to set it.  You've already accessed the cacheline, so you can
> avoid the branch here and just set it unconditionally.
> 
> Cheers, Andreas

We've accessed the cacheline, but it can still be shared, no? If we did
this unconditionally, we'd dirty the cacheline on every dput() as
opposed to only the first and second times it's been put.
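
To spell out the two variants being compared (this is just the code from the
hunk above side by side, not a new proposal):

	/* unconditional: stores on every dput(), so a Shared cacheline is
	 * pulled Modified each time even when the flag is already set */
	dentry->d_flags |= DCACHE_REFERENCED;

	/* conditional: only the first put after the dentry is on the LRU
	 * actually writes the flag; later puts only read it */
	if (!(dentry->d_flags & DCACHE_REFERENCED))
		dentry->d_flags |= DCACHE_REFERENCED;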

-- 
Omar


* Re: [PATCH 5/5] fs: don't set *REFERENCED unless we are on the lru list
  2016-10-25 22:01   ` Dave Chinner
  2016-10-25 23:36     ` Dave Chinner
@ 2016-10-26 15:11     ` Josef Bacik
  2016-10-27  0:30       ` Dave Chinner
  1 sibling, 1 reply; 26+ messages in thread
From: Josef Bacik @ 2016-10-26 15:11 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-btrfs, kernel-team, jack, linux-fsdevel, viro, hch, jweiner

On 10/25/2016 06:01 PM, Dave Chinner wrote:
> On Tue, Oct 25, 2016 at 02:41:44PM -0400, Josef Bacik wrote:
>> With anything that populates the inode/dentry cache with a lot of one time use
>> inodes we can really put a lot of pressure on the system for things we don't
>> need to keep in cache.  It takes two runs through the LRU to evict these one use
>> entries, and if you have a lot of memory you can end up with 10's of millions of
>> entries in the dcache or icache that have never actually been touched since they
>> were first instantiated, and it will take a lot of CPU and a lot of pressure to
>> evict all of them.
>>
>> So instead do what we do with pagecache, only set the *REFERENCED flags if we
>> are being used after we've been put onto the LRU.  This makes a significant
>> difference in the system's ability to evict these useless cache entries.  With a
>> fs_mark workload that creates 40 million files we get the following results (all
>> in files/sec)
>
> What's the workload, storage, etc?

Oops sorry I thought I said it.  It's fs_mark creating 20 million empty files on 
a single NVME drive.

>
>> Btrfs			Patched		Unpatched
>> Average Files/sec:	72209.3		63254.2
>> p50 Files/sec:	70850		57560
>> p90 Files/sec:	68757		53085
>> p99 Files/sec:	68757		53085
>
> So how much of this is from changing the dentry referenced
> behaviour, and how much from the inode? Can you separate out the two
> changes so we know which one is actually affecting reclaim
> performance?
>
> Indeed, I wonder if just changing the superblock shrinker
> default_seeks for btrfs would have exactly the same impact because
> that can be used to exactly double the reclaim scan rate for the same
> memory pressure.  If that doesn't change performance by a similar
> amount (changing default_seeks is the normal way of changing
> shrinker balance), then more digging is required here to explain why
> the referenced bits make such an impact to steady state
> performance...
>

I'll tease out the impact of changing dcache vs icache vs both.  Yeah I'll 
reduce default_seeks and see what that turns out to be.

>> XFS			Patched		Unpatched
>> Average Files/sec:	61025.5		60719.5
>> p50 Files/sec:	60107		59465
>> p90 Files/sec:	59300		57966
>> p99 Files/sec:	59227		57528
>
> You made XFS never use I_REFERENCED at all (hint: not all
> filesystems use find_inode/find_inode_fast()), so it's not clear
> that the extra scanning (less than 1% difference in average
> throughput) is actually the cause of things being slower in btrfs.
>
>> The reason Btrfs has a much larger improvement is because it holds a lot more
>> things in memory so benefits more from faster slab reclaim, but across the board
>> is an improvement for each of the file systems.
>
> Less than 1% for XFS and ~1.5% for ext4 is well within the
> run-to-run variation of fsmark. It looks like it might be slightly
> faster, but it's not a cut-and-dried win for anything other than
> btrfs.
>

Sure, the win in this benchmark clearly benefits btrfs the most, but I think 
the overall approach is sound and likely to help everybody in theory.  Inside FB 
we definitely have had problems where the memory pressure induced by some 
idi^H^H^Hprocess goes along and runs find / which causes us to evict real things 
that are being used rather than these one-use inodes.  This sort of behavior 
could possibly be mitigated by this patch, but I haven't sat down to figure out 
a reliable way to mirror this workload to test that theory.  Thanks

Josef



* Re: [PATCH 3/5] writeback: add counters for metadata usage
  2016-10-25 19:50   ` Tejun Heo
@ 2016-10-26 15:20     ` Josef Bacik
  2016-10-26 15:49       ` Tejun Heo
  0 siblings, 1 reply; 26+ messages in thread
From: Josef Bacik @ 2016-10-26 15:20 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-btrfs, kernel-team, david, jack, linux-fsdevel, viro, hch, jweiner

On 10/25/2016 03:50 PM, Tejun Heo wrote:
> Hello,
>
> On Tue, Oct 25, 2016 at 02:41:42PM -0400, Josef Bacik wrote:
>> Btrfs has no bounds except memory on the amount of dirty memory that we have in
>> use for metadata.  Historically we have used a special inode so we could take
>> advantage of the balance_dirty_pages throttling that comes with using pagecache.
>> However as we'd like to support different blocksizes it would be nice to not
>> have to rely on pagecache, but still get the balance_dirty_pages throttling
>> without having to do it ourselves.
>>
>> So introduce *METADATA_DIRTY_BYTES and *METADATA_WRITEBACK_BYTES.  These are
>> zone and bdi_writeback counters to keep track of how many bytes we have in
>> flight for METADATA.  We need to count in bytes as blocksizes could be
>> percentages of pagesize.  We simply convert the bytes to number of pages where
>> it is needed for the throttling.
>>
>> Also introduce NR_METADATA_BYTES so we can keep track of the total amount of
>> pages used for metadata on the system.  This is also needed so things like dirty
>> throttling know that this is dirtyable memory as well and easily reclaimed.
>>
>> Signed-off-by: Josef Bacik <jbacik@fb.com>
>
> Some nits.
>
> It'd be nice to note that this patch just introduces new fields
> without using them and thus doesn't cause any behavioral changes.
>
>> @@ -51,6 +51,8 @@ static DEVICE_ATTR(cpumap,  S_IRUGO, node_read_cpumask, NULL);
>>  static DEVICE_ATTR(cpulist, S_IRUGO, node_read_cpulist, NULL);
>>
>>  #define K(x) ((x) << (PAGE_SHIFT - 10))
>> +#define BtoK(x) ((x) >> 10)
>
> This would belong in a separate patch but any chance we can share
> these definitions?  It's fine to have the definitions in a couple
> places but these are getting duplicated in multiple spots and actually
> getting confusing with K meaning pages to kilobytes.  I'm not sure how
> it exactly should be tho.
>
>> @@ -2473,6 +2504,100 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)
>>  EXPORT_SYMBOL(account_page_dirtied);
>>
>>  /*
>
> /**
>
>> + * account_metadata_dirtied
>> + * @page - the page being dirited
>> + * @bdi - the bdi that owns this page
>> + * @bytes - the number of bytes being dirtied
>> + *
>> + * Do the dirty page accounting for metadata pages that aren't backed by an
>> + * address_space.
>> + */
>> +void account_metadata_dirtied(struct page *page, struct backing_dev_info *bdi,
>> +			      long bytes)
>> +{
>> +	unsigned long flags;
>> +
>> +	local_irq_save(flags);
>> +	__mod_node_page_state(page_pgdat(page), NR_METADATA_DIRTY_BYTES,
>> +			      bytes);
>> +	__add_wb_stat(&bdi->wb, WB_DIRTIED_BYTES, bytes);
>> +	__add_wb_stat(&bdi->wb, WB_METADATA_DIRTY_BYTES, bytes);
>> +	current->nr_dirtied++;
>> +	task_io_account_write(bytes);
>> +	this_cpu_inc(bdp_ratelimits);
>> +	local_irq_restore(flags);
>
> Again, I'm not sure about the explicit irq ops especially as some of
> the counters are already irq safe.
>
>> +}
>> +EXPORT_SYMBOL(account_metadata_dirtied);
>> +
>> +/*
>
> /**
>
>> + * account_metadata_cleaned
>> + * @page - the page being cleaned
>> + * @bdi - the bdi that owns this page
>> + * @bytes - the number of bytes cleaned
>> + *
>> + * Called on a no longer dirty metadata page.
>> + */
>> +void account_metadata_cleaned(struct page *page, struct backing_dev_info *bdi,
>> +			      long bytes)
>> +{
>> +	unsigned long flags;
>> +
>> +	local_irq_save(flags);
>> +	__mod_node_page_state(page_pgdat(page), NR_METADATA_DIRTY_BYTES,
>> +			      -bytes);
>> +	__add_wb_stat(&bdi->wb, WB_METADATA_DIRTY_BYTES, -bytes);
>> +	task_io_account_cancelled_write(bytes);
>> +	local_irq_restore(flags);
>
> Ditto with irq and the following functions.
>
>> @@ -3701,7 +3703,20 @@ static unsigned long node_pagecache_reclaimable(struct pglist_data *pgdat)
>>  	if (unlikely(delta > nr_pagecache_reclaimable))
>>  		delta = nr_pagecache_reclaimable;
>>
>> -	return nr_pagecache_reclaimable - delta;
>> +	nr_metadata_reclaimable =
>> +		node_page_state(pgdat, NR_METADATA_BYTES) >> PAGE_SHIFT;
>> +	/*
>> +	 * We don't do writeout through the shrinkers so subtract any
>> +	 * dirty/writeback metadata bytes from the reclaimable count.
>> +	 */
>
> Hmm... up until this point, the dirty metadata was handled the same
> way as regular dirty data but it deviates here.  Is this right?  The
> calculations in writeback code also assumes that the dirty pages are
> reclaimable.  If this is inherently different, it'd be nice to explain
> more explicitly why this is different from others.
>

So there is logic above this that subtracts NR_FILE_DIRTY from the file 
reclaimable count if we can't do writes during reclaim.  Since we can always 
wait on writeback during reclaim it doesn't subtract out writeback.  I took 
this to mean that the general expectation of this function is to only count 
things that the shrinker can specifically reclaim itself, so I discounted 
anything under I/O: the slab shrinkers have no idea whether it's OK to write 
or not, so in the case of btrfs we simply skip anything that is dirty or 
under writeback.  Does that make sense?  I'll fix up the other issues you 
pointed out.  Thanks,
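
In other words, the tail of node_pagecache_reclaimable() ends up doing
something like this (a sketch of the idea rather than the exact hunk; the
NR_METADATA_* names follow the counters introduced in this patch):

	unsigned long nr_metadata_reclaimable, metadata_delta;

	nr_metadata_reclaimable =
		node_page_state(pgdat, NR_METADATA_BYTES) >> PAGE_SHIFT;
	/*
	 * We don't do writeout through the shrinkers, and the btrfs
	 * shrinker skips anything dirty or under writeback, so subtract
	 * those bytes from the reclaimable count.
	 */
	metadata_delta = (node_page_state(pgdat, NR_METADATA_DIRTY_BYTES) +
			  node_page_state(pgdat, NR_METADATA_WRITEBACK_BYTES))
				>> PAGE_SHIFT;
	if (unlikely(metadata_delta > nr_metadata_reclaimable))
		metadata_delta = nr_metadata_reclaimable;

	return (nr_pagecache_reclaimable - delta) +
	       (nr_metadata_reclaimable - metadata_delta);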

Josef


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/5] writeback: add counters for metadata usage
  2016-10-26 15:20     ` Josef Bacik
@ 2016-10-26 15:49       ` Tejun Heo
  0 siblings, 0 replies; 26+ messages in thread
From: Tejun Heo @ 2016-10-26 15:49 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-btrfs, kernel-team, david, jack, linux-fsdevel, viro, hch, jweiner

Hello, Josef.

On Wed, Oct 26, 2016 at 11:20:16AM -0400, Josef Bacik wrote:
> > > @@ -3701,7 +3703,20 @@ static unsigned long node_pagecache_reclaimable(struct pglist_data *pgdat)
> > >  	if (unlikely(delta > nr_pagecache_reclaimable))
> > >  		delta = nr_pagecache_reclaimable;
> > > 
> > > -	return nr_pagecache_reclaimable - delta;
> > > +	nr_metadata_reclaimable =
> > > +		node_page_state(pgdat, NR_METADATA_BYTES) >> PAGE_SHIFT;
> > > +	/*
> > > +	 * We don't do writeout through the shrinkers so subtract any
> > > +	 * dirty/writeback metadata bytes from the reclaimable count.
> > > +	 */
> > 
> > Hmm... up until this point, the dirty metadata was handled the same
> > way as regular dirty data but it deviates here.  Is this right?  The
> > calculations in writeback code also assumes that the dirty pages are
> > reclaimable.  If this is inherently different, it'd be nice to explain
> > more explicitly why this is different from others.
> 
> So there is logic above this that subtracts out the NR_FILE_DIRTY from the
> file reclaimable if we can't do write's during reclaim.  Since we can always
> wait on writeback during reclaim it doesn't subtract out writeback.  I took
> this to mean that the general expectation of this function is to only count
> thing things that the shrinker can specifically reclaim itself, so I
> discounted anything under io since the slab shrinkers have no idea if its ok
> to write or not and so in the case of btrfs simply skip anything that is
> dirty or under writeback.  Does that make sense?  I'll fix up the other
> issues you pointed out.  Thanks,

Yeap, that makes sense to me.

Thanks for the explanation.

-- 
tejun

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 5/5] fs: don't set *REFERENCED unless we are on the lru list
  2016-10-25 23:36     ` Dave Chinner
@ 2016-10-26 20:03       ` Josef Bacik
  2016-10-26 22:20         ` Dave Chinner
  0 siblings, 1 reply; 26+ messages in thread
From: Josef Bacik @ 2016-10-26 20:03 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-btrfs, kernel-team, jack, linux-fsdevel, viro, hch, jweiner

On 10/25/2016 07:36 PM, Dave Chinner wrote:
> On Wed, Oct 26, 2016 at 09:01:13AM +1100, Dave Chinner wrote:
>> On Tue, Oct 25, 2016 at 02:41:44PM -0400, Josef Bacik wrote:
>>> With anything that populates the inode/dentry cache with a lot of one time use
>>> inodes we can really put a lot of pressure on the system for things we don't
>>> need to keep in cache.  It takes two runs through the LRU to evict these one use
>>> entries, and if you have a lot of memory you can end up with 10's of millions of
>>> entries in the dcache or icache that have never actually been touched since they
>>> were first instantiated, and it will take a lot of CPU and a lot of pressure to
>>> evict all of them.
>>>
>>> So instead do what we do with pagecache, only set the *REFERENCED flags if we
>>> are being used after we've been put onto the LRU.  This makes a significant
>>> difference in the system's ability to evict these useless cache entries.  With a
>>> fs_mark workload that creates 40 million files we get the following results (all
>>> in files/sec)
>>
>> What's the workload, storage, etc?
>>
>>> Btrfs			Patched		Unpatched
>>> Average Files/sec:	72209.3		63254.2
>>> p50 Files/sec:	70850		57560
>>> p90 Files/sec:	68757		53085
>>> p99 Files/sec:	68757		53085
>>
>> So how much of this is from changing the dentry referenced
>> behaviour, and how much from the inode? Can you separate out the two
>> changes so we know which one is actually affecting reclaim
>> performance?

(OK, figured out the problem: apparently Outlook thinks you are spam and has 
been filtering all your emails into its junk folder without ever showing them 
to me.  I've fixed this, so now we should be able to get somewhere.)

>
> FWIW, I just went to run my usual zero-length file creation fsmark
> test (16-way create, large sparse FS image on SSDs). XFS (with debug
> enabled) takes 4m10s to run at an average of ~230k files/s, with a
> std deviation of +/-1.7k files/s.
>
> Unfortunately, btrfs turns that into more heat than it ever has done
> before. It's only managing 35k files/s and the profile looks like
> this:
>
>   58.79%  [kernel]  [k] __pv_queued_spin_lock_slowpath
>    5.61%  [kernel]  [k] queued_write_lock_slowpath
>    1.65%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
>    1.38%  [kernel]  [k] reschedule_interrupt
>    1.08%  [kernel]  [k] _raw_spin_lock
>    0.92%  [kernel]  [k] __radix_tree_lookup
>    0.86%  [kernel]  [k] _raw_spin_lock_irqsave
>    0.83%  [kernel]  [k] btrfs_set_lock_blocking_rw
>
> I killed it because this would take too long to run.
>
> I reduced the concurrency down to 4-way, spinlock contention went
> down to about 40% of the CPU time.  I reduced the concurrency down
> to 2 and saw about 16% of CPU time being spent in lock contention.
>
> Throughput results:
> 				btrfs throughput
> 			2-way			4-way
> unpatched	46938.1+/-2.8e+03	40273.4+/-3.1e+03
> patched		45697.2+/-2.4e+03	49287.1+/-3e+03
>
>
> So, 2-way has not improved. If changing referenced behaviour was an
> obvious win for btrfs, we'd expect to see that here as well.
> however, because 4-way improved by 20%, I think all we're seeing is
> a slight change in lock contention levels inside btrfs. Indeed,
> looking at the profiles I see that lock contention time was reduced
> to around 32% of the total CPU used (down by about 20%):
>
>   20.79%  [kernel]  [k] __pv_queued_spin_lock_slowpath
>    3.85%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
>    3.68%  [kernel]  [k] _raw_spin_lock
>    3.40%  [kernel]  [k] queued_write_lock_slowpath
>    .....
>
> IOWs, the performance increase comes from the fact that changing
> inode cache eviction patterns causes slightly less lock contention
> in btrfs inode reclaim. IOWs, the problem that needs fixing is the
> btrfs lock contention, not the VFS cache LRU algorithms.
>
> Root cause analysis needs to be done properly before behavioural
> changes like this are proposed, people!
>

We are doing different things.  Yes, when you do it all into the same subvol 
the lock contention is obviously much worse and becomes the bottleneck, but 
that's not what I'm doing: I'm creating a subvol for each thread to reduce the 
lock contention on the root nodes.  If you retest that way you will see the 
differences I was seeing.  Are you suggesting that I just made these numbers 
up?  Instead of assuming I'm an idiot who doesn't know how to root-cause 
issues, please ask what is different between your run and my run.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 5/5] fs: don't set *REFERENCED unless we are on the lru list
  2016-10-26 20:03       ` Josef Bacik
@ 2016-10-26 22:20         ` Dave Chinner
  0 siblings, 0 replies; 26+ messages in thread
From: Dave Chinner @ 2016-10-26 22:20 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-btrfs, kernel-team, jack, linux-fsdevel, viro, hch, jweiner

On Wed, Oct 26, 2016 at 04:03:54PM -0400, Josef Bacik wrote:
> On 10/25/2016 07:36 PM, Dave Chinner wrote:
> >So, 2-way has not improved. If changing referenced behaviour was an
> >obvious win for btrfs, we'd expect to see that here as well.
> >however, because 4-way improved by 20%, I think all we're seeing is
> >a slight change in lock contention levels inside btrfs. Indeed,
> >looking at the profiles I see that lock contention time was reduced
> >to around 32% of the total CPU used (down by about 20%):
> >
> >  20.79%  [kernel]  [k] __pv_queued_spin_lock_slowpath
> >   3.85%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
> >   3.68%  [kernel]  [k] _raw_spin_lock
> >   3.40%  [kernel]  [k] queued_write_lock_slowpath
> >   .....
> >
> >IOWs, the performance increase comes from the fact that changing
> >inode cache eviction patterns causes slightly less lock contention
> >in btrfs inode reclaim. IOWs, the problem that needs fixing is the
> >btrfs lock contention, not the VFS cache LRU algorithms.
> >
> >Root cause analysis needs to be done properly before behavioural
> >changes like this are proposed, people!
> >
> 
> We are doing different things.

Well, no, we're doing the same thing. Except...

> Yes when you do it all into the same
> subvol the lock contention is obviously much worse and the
> bottleneck, but that's not what I'm doing, I'm creating a subvol for
> each thread to reduce the lock contention on the root nodes.

.. you have a non-default filesystem config that you didn't mention
even in response to a /direct question/. Seriously, this is a major
configuration detail that is necessary for anyone who wants to
replicate your results!

The difference is that I'm using the /default/ btrfs filesystem
configuration and you're using a carefully contrived structure that
is designed to work around fundamental scalability issues the
benchmark normally exposes.

This raises another question: Does this subvol setup reflect
production configurations, or is it simply a means to get the
benchmark to put enough load on the inode cache to demonstrate the
effect of changing the reference bits?

> If you
> retest by doing that then you will see the differences I was seeing.
> Are you suggesting that I just made these numbers up? 

No, I'm suggesting that you haven't explained the problem or how to
reproduce it in sufficient detail.

You haven't explained why reducing scanning (which in and of itself
has minimal overhead and is not visible to btrfs) has such a marked
effect on the behaviour of btrfs. You haven't given us details on
the workload or storage config so we can reproduce the numbers (and
still haven't!). You're arguing that numbers far below variability
and accuracy measurement limits are significant (i.e. the patched
numbers are better for XFS and ext4). There's lots of information
you've chosen to omit that we need to know.

To fill in the gaps, I've done some analysis, provided perf numbers
and CPU profiles, and put them into an explanation of why
there is a difference in performance for a default btrfs config. I
note that you've not actually addressed that analysis and the
problems it uncovers, but instead just said "I'm using a different
configuration".

What about all the btrfs users who aren't using your configuration
and will see these problems? How does the scanning change actually
benefit them more than fixing the locking issues that exist? Memory
reclaim and LRU algorithms affect everyone, not just a benchmark
configuration. We need to know a lot more about the issue at hand
to be able to evaluate whether this is the /right solution for the
problem./

> Instead of
> assuming I'm an idiot and don't know how to root cause issues please
> instead ask what is different from your run versus my run.  Thanks,

I'm assuming you're a competent engineer, Josef, who knows that
details about benchmarks and performance measurement are extremely
important, and that someone reviewing code needs to understand why
the behaviour change had the impact it did to be able to evaluate it
properly. I'm assuming that you also know that if you understand the
root cause of a problem, you put it in the commit message so that
everyone else knows it, too.

So if you know the root cause of the btrfs problem that
reducing reclaim scanning avoids, then please explain it.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 5/5] fs: don't set *REFERENCED unless we are on the lru list
  2016-10-26 15:11     ` Josef Bacik
@ 2016-10-27  0:30       ` Dave Chinner
  2016-10-27 13:13         ` Josef Bacik
  0 siblings, 1 reply; 26+ messages in thread
From: Dave Chinner @ 2016-10-27  0:30 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-btrfs, kernel-team, jack, linux-fsdevel, viro, hch, jweiner

On Wed, Oct 26, 2016 at 11:11:35AM -0400, Josef Bacik wrote:
> On 10/25/2016 06:01 PM, Dave Chinner wrote:
> >On Tue, Oct 25, 2016 at 02:41:44PM -0400, Josef Bacik wrote:
> >>With anything that populates the inode/dentry cache with a lot of one time use
> >>inodes we can really put a lot of pressure on the system for things we don't
> >>need to keep in cache.  It takes two runs through the LRU to evict these one use
> >>entries, and if you have a lot of memory you can end up with 10's of millions of
> >>entries in the dcache or icache that have never actually been touched since they
> >>were first instantiated, and it will take a lot of CPU and a lot of pressure to
> >>evict all of them.
> >>
> >>So instead do what we do with pagecache, only set the *REFERENCED flags if we
> >>are being used after we've been put onto the LRU.  This makes a significant
> >>difference in the system's ability to evict these useless cache entries.  With a
> >>fs_mark workload that creates 40 million files we get the following results (all
> >>in files/sec)
> >
> >What's the workload, storage, etc?
> 
> Oops sorry I thought I said it.  It's fs_mark creating 20 million
> empty files on a single NVME drive.

How big is the drive/filesystem (e.g. has impact on XFS allocation
concurrency)?  And multiple btrfs subvolumes, too, by the sound of
it. How did you set those up?  What about concurrency, directory
sizes, etc?  Can you post the fsmark command line as these details
do actually matter...

Getting the benchmark configuration to reproduce posted results
should not require playing 20 questions!

> >>The reason Btrfs has a much larger improvement is because it holds a lot more
> >>things in memory so benefits more from faster slab reclaim, but across the board
> >>is an improvement for each of the file systems.
> >
> >Less than 1% for XFS and ~1.5% for ext4 is well within the
> >run-to-run variation of fsmark. It looks like it might be slightly
> >faster, but it's not a cut-and-dried win for anything other than
> >btrfs.
> >
> 
> Sure the win in this benchmark is clearly benefiting btrfs the most,
> but I think the overall approach is sound and likely to help
> everybody in theory.

Yup, but without an explanation of why it makes such a big change to
btrfs, we can't really say what effect it's really going to have.
Why does cycling the inode a second time through the LRU make any
difference? Once we're in steady state on this workload, one or two
cycles through the LRU should make no difference at all to
performance because all the inodes are instantiated in identical
states (including the referenced bit) and so scanning treats every
inode identically. i.e. the reclaim rate (i.e. how often
evict_inode() is called) should be exactly the same and the only
difference is the length of time between the inode being put on the
LRU and when it is evicted.

Is there an order difference in reclaim as a result of earlier
reclaim? Is btrfs_evict_inode() blocking on a sleeping lock in btrfs
rather than contending on spinning locks? Is it having to wait for
transaction commits or some other metadata IO because reclaim is
happening earlier? i.e. The question I'm asking is what, exactly,
leads to such a marked performance improvement in steady state
behaviour?

I want to know because if there's behavioural changes in LRU reclaim
order having a significant effect on affecting btrfs, then there is
going to be some effects on other filesystems, too. Maybe not in
this benchmark, but we can't anticipate potential problems if we
don't understand exactly what is going on here.

> Inside FB we definitely have had problems
> where the memory pressure induced by some idi^H^H^Hprocess goes
> along and runs find / which causes us to evict real things that are
> being used rather than these one use inodes.

That's one of the problems the IO throttling in the XFS shrinker
tends to avoid. i.e. This is one of the specific cases we expect to see
on all production systems - backup applications are the common cause
of regular full filesystem traversals.

FWIW, there's an element of deja vu in this thread: that XFS inode
cache shrinker IO throttling is exactly what Chris proposed we gut
last week to solve some other FB memory reclaim problem that had no
explanation of the root cause.

(http://www.spinics.net/lists/linux-xfs/msg01541.html)

> This sort of behavior
> could possibly be mitigated by this patch, but I haven't sat down to
> figure out a reliable way to mirror this workload to test that
> theory.  Thanks

I use fsmark to create filesystems with tens of millions of small
files, then do things like run concurrent greps repeatedly over a
small portion of the directory structure (e.g. 1M cached inodes
and 4GB worth of cached data) and then run concurrent find
traversals across the entire filesystem to simulate this sort of
use-once vs working set reclaim...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 5/5] fs: don't set *REFERENCED unless we are on the lru list
  2016-10-27  0:30       ` Dave Chinner
@ 2016-10-27 13:13         ` Josef Bacik
  2016-10-28  3:48           ` Dave Chinner
  0 siblings, 1 reply; 26+ messages in thread
From: Josef Bacik @ 2016-10-27 13:13 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-btrfs, kernel-team, jack, linux-fsdevel, viro, hch, jweiner

On 10/26/2016 08:30 PM, Dave Chinner wrote:
> On Wed, Oct 26, 2016 at 11:11:35AM -0400, Josef Bacik wrote:
>> On 10/25/2016 06:01 PM, Dave Chinner wrote:
>>> On Tue, Oct 25, 2016 at 02:41:44PM -0400, Josef Bacik wrote:
>>>> With anything that populates the inode/dentry cache with a lot of one time use
>>>> inodes we can really put a lot of pressure on the system for things we don't
>>>> need to keep in cache.  It takes two runs through the LRU to evict these one use
>>>> entries, and if you have a lot of memory you can end up with 10's of millions of
>>>> entries in the dcache or icache that have never actually been touched since they
>>>> were first instantiated, and it will take a lot of CPU and a lot of pressure to
>>>> evict all of them.
>>>>
>>>> So instead do what we do with pagecache, only set the *REFERENCED flags if we
>>>> are being used after we've been put onto the LRU.  This makes a significant
>>>> difference in the system's ability to evict these useless cache entries.  With a
>>>> fs_mark workload that creates 40 million files we get the following results (all
>>>> in files/sec)
>>>
>>> What's the workload, storage, etc?
>>
>> Oops sorry I thought I said it.  It's fs_mark creating 20 million
>> empty files on a single NVME drive.
>
> How big is the drive/filesystem (e.g. has impact on XFS allocation
> concurrency)?  And multiple btrfs subvolumes, too, by the sound of
> it. How did you set those up?  What about concurrency, directory
> sizes, etc?  Can you post the fsmark command line as these details
> do actually matter...
>
> Getting the benchmark configuration to reproduce posted results
> should not require playing 20 questions!

This is the disk

Disk /dev/nvme0n1: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors

This is the script

https://paste.fedoraproject.org/461874/73910147/1

It's on a 1-socket, 8-core CPU with 16 GiB of RAM.

>
>>>> The reason Btrfs has a much larger improvement is because it holds a lot more
>>>> things in memory so benefits more from faster slab reclaim, but across the board
>>>> is an improvement for each of the file systems.
>>>
>>> Less than 1% for XFS and ~1.5% for ext4 is well within the
>>> run-to-run variation of fsmark. It looks like it might be slightly
>>> faster, but it's not a cut-and-dried win for anything other than
>>> btrfs.
>>>
>>
>> Sure the win in this benchmark is clearly benefiting btrfs the most,
>> but I think the overall approach is sound and likely to help
>> everybody in theory.
>
> Yup, but without an explanation of why it makes such a big change to
> btrfs, we can't really say what effect it's really going to have.
> Why does cycling the inode a second time through the LRU make any
> difference? Once we're in steady state on this workload, one or two
> cycles through the LRU should make no difference at all to
> performance because all the inodes are instantiated in identical
> states (including the referenced bit) and so scanning treats every
> inode identically. i.e. the reclaim rate (i.e. how often
> evict_inode() is called) should be exactly the same and the only
> difference is the length of time between the inode being put on the
> LRU and when it is evicted.
>
> Is there an order difference in reclaim as a result of earlier
> reclaim? Is btrfs_evict_inode() blocking on a sleeping lock in btrfs
> rather than contending on spinning locks? Is it having to wait for
> transaction commits or some other metadata IO because reclaim is
> happening earlier? i.e. The question I'm asking is what, exactly,
> leads to such a marked performance improvement in steady state
> behaviour?

I would have seen this in my traces.  There are tons of places to improve 
btrfs's performance and behavior here, no doubt.  But simply moving from 
pagecache to a slab shrinker shouldn't have drastically changed how we perform 
in this test.  I feel like the shrinkers need to be used differently, but I 
completely tore apart vmscan.c trying different approaches and none of them 
made a difference like this patch did.  From what I was seeing in my traces we 
were simply reclaiming less per kswapd scan iteration with the old approach 
than with the new approach.

>
> I want to know because if there's behavioural changes in LRU reclaim
> order having a significant effect on affecting btrfs, then there is
> going to be some effects on other filesystems, too. Maybe not in
> this benchmark, but we can't anticipate potential problems if we
> don't understand exactly what is going on here.

So I'll just describe to you what I was seeing and maybe we can work out where 
we think the problem is.

1) We go at X speed until we fill up all of the memory with the various caches.
2) We lose about 15% when kswapd kicks in, and it never recovers.

Doing tracing I saw that we would try to scan very small batches of objects, 
usually less than the batch size.  Because btrfs doesn't use pagecache for 
anything, and the shrinker scan size is based on how much pagecache was 
scanned, our scan totals would be very much smaller than the total cache size. 
I originally thought this was the problem, but short of just forcing us to 
scan the whole cache every time, nothing I did made any difference.

Once we'd scanned through the entire LRU in one go, we suddenly started 
reclaiming almost as many objects as we scanned, as opposed to fewer than 10 
items.  This is because of the REFERENCED handling.  So we'd see a small perf 
bump whenever we managed to do this, and then it would drop again once the LRU 
was full of fresh inodes.  So we'd get this saw-tooth sort of reclaim.

This is when I realized the whole REFERENCED thing was probably screwing us, 
so I went to look at what pagecache does to avoid this problem.  It does two 
things: first, all pages start out on the inactive list; second, scanning is 
balanced between the active and inactive lists based on the size of each.  I 
rigged up something to do an active/inactive list inside of list_lru, which 
again did basically nothing.  So I changed how REFERENCED is marked to match 
how pagecache does things and voilà, my performance was back.
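
The inode side of the idea looks roughly like this (a sketch of the approach 
rather than the exact diff; the dcache side gets the analogous treatment for 
DCACHE_REFERENCED in dput()):

	static void inode_lru_list_add(struct inode *inode)
	{
		if (list_lru_add(&inode->i_sb->s_inode_lru, &inode->i_lru))
			this_cpu_inc(nr_unused);
		else
			/* already on the LRU: this is a repeat reference */
			inode->i_state |= I_REFERENCED;
	}

with iput_final() no longer setting I_REFERENCED unconditionally before it 
calls inode_add_lru().  That way a create-once, never-touched-again inode sits 
on the LRU without the referenced bit and can be evicted on the first pass, 
much like a page that never makes it off the inactive list.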

I did all sorts of tracing to try to figure out what exactly it was about the 
reclaim that made everything so much slower, but with that many things going 
on it was hard to decide what was noise and what was actually causing the 
problem.  Without this patch I see kswapd stay at higher CPU usage for longer, 
because it has to keep scanning things to satisfy the high watermark, and it 
takes scanning multi-million-object lists multiple times before it starts to 
make any progress towards reclaim.  This logically makes sense and is not 
ideal behavior.

>
>> Inside FB we definitely have had problems
>> where the memory pressure induced by some idi^H^H^Hprocess goes
>> along and runs find / which causes us to evict real things that are
>> being used rather than these one use inodes.
>
> That's one of the problems the IO throttling in the XFS shrinker
> tends to avoid. i.e. This is one of the specific cases we expect to see
> on all production systems - backup applications are the common cause
> of regular full filesystem traversals.
>
> FWIW, there's an element of deja vu in this thread: that XFS inode
> cache shrinker IO throttling is exactly what Chris proposed we gut
> last week to solve some other FB memory reclaim problem that had no
> explanation of the root cause.
>
> (http://www.spinics.net/lists/linux-xfs/msg01541.html)
>
>> This sort of behavior
>> could possibly be mitigated by this patch, but I haven't sat down to
>> figure out a reliable way to mirror this workload to test that
>> theory.  Thanks
>
> I use fsmark to create filesystems with tens of millions of small
> files, then do things like run concurrent greps repeatedly over a
> small portion of the directory structure (e.g. 1M cached inodes
> and 4GB worth of cached data) and then run concurrent find
> traversals across the entire filesystem to simulate this sort
> use-once vs working set reclaim...
>

Yeah, that sounds reasonable; like I said, I just haven't tried to test it 
since my numbers got bigger and I was happy with that.  I'll rig this up and 
see how it performs with my patch, to see if there's a significant difference 
with filesystems other than btrfs.  Thanks,

Josef


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 5/5] fs: don't set *REFERENCED unless we are on the lru list
  2016-10-27 13:13         ` Josef Bacik
@ 2016-10-28  3:48           ` Dave Chinner
  0 siblings, 0 replies; 26+ messages in thread
From: Dave Chinner @ 2016-10-28  3:48 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-btrfs, kernel-team, jack, linux-fsdevel, viro, hch, jweiner

On Thu, Oct 27, 2016 at 09:13:44AM -0400, Josef Bacik wrote:
> On 10/26/2016 08:30 PM, Dave Chinner wrote:
> >On Wed, Oct 26, 2016 at 11:11:35AM -0400, Josef Bacik wrote:
> >>On 10/25/2016 06:01 PM, Dave Chinner wrote:
> >>>On Tue, Oct 25, 2016 at 02:41:44PM -0400, Josef Bacik wrote:
> >>>>With anything that populates the inode/dentry cache with a lot of one time use
> >>>>inodes we can really put a lot of pressure on the system for things we don't
> >>>>need to keep in cache.  It takes two runs through the LRU to evict these one use
> >>>>entries, and if you have a lot of memory you can end up with 10's of millions of
> >>>>entries in the dcache or icache that have never actually been touched since they
> >>>>were first instantiated, and it will take a lot of CPU and a lot of pressure to
> >>>>evict all of them.
> >>>>
> >>>>So instead do what we do with pagecache, only set the *REFERENCED flags if we
> >>>>are being used after we've been put onto the LRU.  This makes a significant
> >>>>difference in the system's ability to evict these useless cache entries.  With a
> >>>>fs_mark workload that creates 40 million files we get the following results (all
> >>>>in files/sec)
> >>>
> >>>What's the workload, storage, etc?
> >>
> >>Oops sorry I thought I said it.  It's fs_mark creating 20 million
> >>empty files on a single NVME drive.
> >
> >How big is the drive/filesystem (e.g. has impact on XFS allocation
> >concurrency)?  And multiple btrfs subvolumes, too, by the sound of
> >it. How did you set those up?  What about concurrency, directory
> >sizes, etc?  Can you post the fsmark command line as these details
> >do actually matter...
> >
> >Getting the benchmark configuration to reproduce posted results
> >should not require playing 20 questions!
> 
> This is the disk
> 
> Disk /dev/nvme0n1: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
> 
> This is the script
> 
> https://paste.fedoraproject.org/461874/73910147/1

Link no workee. This does, though:

https://paste.fedoraproject.org/461874/73910147

> 
> It's on a 1 socket 8 core cpu with 16gib of ram.

Thanks!

> >Is there an order difference in reclaim as a result of earlier
> >reclaim? Is btrfs_evict_inode() blocking on a sleeping lock in btrfs
> >rather than contending on spinning locks? Is it having to wait for
> >transaction commits or some other metadata IO because reclaim is
> >happening earlier? i.e. The question I'm asking is what, exactly,
> >leads to such a marked performance improvement in steady state
> >behaviour?
> 
> I would have seen this in my traces.  There's tooooons of places to
> improve btrfs's performance and behavior here no doubt.  But simply
> moving from pagecache to a slab shrinker shouldn't have drastically
> changed how we perform in this test.

Not sure what you mean by this. Do you mean that because of the
preceding patches that change the page cache accounting for btrfs
metadata there is now not enough pressure being placed on the btrfs
inode cache shrinker?

FWIW, this zero length file fsmark workload produces zero page
cache pressure on XFS. If the case is that btrfs is no longer using the pagecache...

> I feel like the shrinkers need
> to be used differently, but I completely destroyed vmscan.c trying
> different approaches and none of them made a difference like this
> patch made.  From what I was seeing in my trace we were simply
> reclaiming less per kswapd scan iteration with the old approach vs.
> the new approach.

Yup, that's what default_seeks is supposed to be used to tune - the
relative pressure that should be placed on the slab compared to
everything else.

> >I want to know because if there's behavioural changes in LRU reclaim
> >order having a significant effect on affecting btrfs, then there is
> >going to be some effects on other filesystems, too. Maybe not in
> >this benchmark, but we can't anticipate potential problems if we
> >don't understand exactly what is going on here.
> 
> So I'll just describe to you what I was seeing and maybe we can work
> out where we think the problem is.
> 
> 1) We go at X speed until we fill up all of the memory with the various caches.
> 2) We lost about 15% when kswapd kicks in and it never recovered.

That's expected behaviour (i.e. that's exactly what I see when XFS
fills memory and reclaim gates new allocations). Memory reclaim does
not come for free.

> Doing tracing I was seeing that we would try to scan very small
> batches of objects, usually less than the batch size, because with
> btrfs not using pagecache for anything and the shrinker scanning
> being based on how much pagecache was scanned our scan totals would
> be very much smaller than the total cache size.  I originally
> thought this was the problem, but short of just forcing us to scan
> the whole cache every time nothing I did made any difference.

So this is from a kernel just running your patch 5 (the referenced
bit changes) and samples were taken for about a minute at steady
state after memory filled (about 20m files into a 16-way 50m file
create with subvols running at ~130k files/s in steady state) using:

# ~/trace-cmd/trace-cmd record -e mm_shrink_slab\*

There were this many shrinker invocations, but all of them came from
kswapd (no direct reclaim at all).

$ grep 0xffff88036a2774c0 t.t |grep shrink_slab_start |wc -l
1923

The amount of reclaim each shrinker invocation did was:

# grep 0xffff88036a2774c0 t.t  |awk -e '/mm_shrink_slab_end/ { print $23 }' |sort -n |uniq -c |sort -nr
    623 1025
    481 0
    172 3075
    145 2050
    131 6150
     82 12300
     54 7175
     40 4100
     40 11275
     34 5125
     32 14350
     30 13325
     19 23575
     18 24600
      7 15375
      4 8200
      3 22550
      3 10250
      2 47150
      2 16400
      1 9225
      1 25625
      1 14349
      1 11274

I don't see any partial batches there, but I also don't see enough
reclaim pressure being put on the inode cache. I'll come back to
this, but first let's look at how XFS behaves, because it
doesn't create any page cache pressure on these workloads:


$ grep 0xffff880094aea4c0 t.t  |awk -e '/mm_shrink_slab_end/ { print $23 }' |sort -n |uniq -c |sort -nr
   1059 0
      5 1000
      4 7045
      4 1008
      3 7970
      3 7030
      2 996
      2 995
      2 962
      2 9005
      2 8961
      2 8937
      2 8062
      2 8049
.....

A very different look to the work sizes. Sorting on the reclaim batch
sizes shows this:

      1 901131
      1 874354
      1 872911
      1 847679
      1 760696
      1 616491
      1 590231
      1 542313
      1 517882
      1 513400
      1 507431
      1 479111
      1 439889
      1 433734
      1 410811
      1 388127
.....

There are massive batches of inodes being reclaimed. Let's have a
look at one of those traces more closely:

kswapd1-862   [001] 183735.856649: mm_shrink_slab_start:
	super_cache_scan+0x0 0xffff880094aea4c0: nid: 1
	objects to shrink 882936 gfp_flags GFP_KERNEL
	pgs_scanned 32 lru_pgs 6236
	cache items 2416122 delta 24792 total_scan 907728

So here we see only 32 pages were scanned in the page cache, out of
a very small total of 6236 pages - there is no page cache to speak of, and
the 32 pages scanned is the defined minimum for the page cache.

However, the calculated delta is still reasonable at 24792, with the
cache size at 2416122 objects. It's reasonable because the delta is
almost exactly 1% of the cache size and that's a good reclaim batch
size for a single shrinker invocation.

However, total_scan is at 907728, which means there have been lots
of shrinker invocations that have not done work due to GFP_NOFS
context, and so that work has been deferred. kswapd is invoked with
GFP_KERNEL context, so we'd expect it to do most of that work right
now:

kswapd1-862   [009] 183737.582875: mm_shrink_slab_end:
	super_cache_scan+0x0 0xffff880094aea4c0: nid: 1
	unused scan count 882936 new scan count 464 total_scan 464
	last shrinker return val 874354

And it does - it freed 874354 objects from the cache, and the
deferred scan count returned to the shrinker was only 464. I'll also
point out that the reclaim rate here is about 500,000 inodes/s (per
kswapd) so we've got plenty of headroom and scalability here.

This is behaving exactly as we'd expect if there was no page cache
pressure and substantial GFP_NOFS allocations. The delta calculation
is in the correct ballpark, the reclaim batch size is being
determined by the amount of work we are deferring to kswapd.

Just to show what GFP_NOFS windup from direct reclaim looks like:

fs_mark-30202 [006] 183695.757995: mm_shrink_slab_start:
	super_cache_scan+0x0 0xffff880094aea4c0: nid: 1
	objects to shrink 64064 gfp_flags GFP_NOFS|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK
	pgs_scanned 32 lru_pgs 6697
	cache items 2236078 delta 21365 total_scan 85429
fs_mark-30202 [006] 183695.757995: mm_shrink_slab_end:
	super_cache_scan+0x0 0xffff880094aea4c0: nid: 1
	unused scan count 64064 new scan count 85429 total_scan 85429
	last shrinker return val 0

incoming deferred work was 64064 objects, delta added was 21365
(again, 1% of total cache size), zero objects were freed, outgoing
deferred work is now 85429 objects.

So we can see that regardless of the invocation context of the
shrinker, XFS is asking for 1% of the cache to be removed. Knowing
this, let's go back to the btrfs traces and examine that:

kswapd1-862   [013] 185883.279076: mm_shrink_slab_start:
	super_cache_scan+0x0 0xffff8803329404c0: nid: 1
	objects to shrink 1006 gfp_flags GFP_KERNEL
	pgs_scanned 203 lru_pgs 251067
	cache items 1787431 delta 2890 total_scan 3896

OK, the first thing to note here is that the delta is much smaller
than what we see with XFS. There is also no obvious deferred work,
which we'd expect because there are no direct reclaim records in the
trace for btrfs. It's pretty clear from this that each shrinker
invocation is asking for about 0.15% of the cache to be scanned -
there's much lower pressure being put on the shrinker when compared
to XFS. Hence it will take 5-10x as many invocations of the shrinker
to do the same amount of work.
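
As a sanity check on those two deltas: assuming the delta arithmetic
do_shrink_slab() uses in this era, i.e.

	delta = (4 * pgs_scanned / seeks) * freeable / (lru_pgs + 1)

with seeks = DEFAULT_SEEKS = 2, the traced values come out exactly.  A
throwaway userspace check:

	#include <stdio.h>

	/* Assumed do_shrink_slab() delta arithmetic, seeks = DEFAULT_SEEKS = 2. */
	static unsigned long long delta(unsigned long long pgs_scanned,
					unsigned long long lru_pgs,
					unsigned long long freeable,
					unsigned int seeks)
	{
		return (4 * pgs_scanned / seeks) * freeable / (lru_pgs + 1);
	}

	int main(void)
	{
		/* XFS trace: pgs_scanned 32, lru_pgs 6236, cache items 2416122 */
		printf("xfs   delta = %llu (trace: 24792)\n",
		       delta(32, 6236, 2416122, 2));
		/* btrfs trace: pgs_scanned 203, lru_pgs 251067, cache items 1787431 */
		printf("btrfs delta = %llu (trace: 2890)\n",
		       delta(203, 251067, 1787431, 2));
		return 0;
	}

which makes the dependency plain: the cache sizes are comparable, so nearly
all of the difference comes from the pgs_scanned/lru_pgs ratio.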

Occasionally we see the page cache get scanned enough to bring
deltas up to 1% of the cache size:

kswapd1-862   [013] 185883.943647: mm_shrink_slab_start:
	super_cache_scan+0x0 0xffff8803329404c0: nid: 1
	objects to shrink 459 gfp_flags GFP_KERNEL
	pgs_scanned 1585 lru_pgs 249194
	cache items 1784345 delta 22698 total_scan 23157

See the difference here? pgs_scanned is much higher than the
trace from half a second ago, and now the delta reaches that
~1% of cache size number. This is an outlier, though; what it
indicates is that the problem is with the amount of page cache
scanning being done during reclaim.

There are two ways to deal with this: fix whatever problem is
causing the page cache to be underscanned, or make the
pgs_scanned:shrinker delta ratio larger. The latter can be done via
changing the shrinker seek count or modifying
/proc/sys/vm/vfs_cache_pressure to make the cache seem 10x larger
than it really is....
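
For reference, the vfs_cache_pressure knob just scales what the superblock
shrinker's count_objects() reports before the delta is computed (a sketch
from memory, so treat the exact helper and its location as illustrative):

	/* fs/internal.h (roughly); super_cache_count() runs its total
	 * object count through this before reporting it: */
	static inline long vfs_pressure_ratio(long val)
	{
		return (val * sysctl_vfs_cache_pressure) / 100;
	}

so vfs_cache_pressure = 1000 makes freeable look 10x larger in the delta
calculation shown earlier, while dropping seeks from DEFAULT_SEEKS (2) to 1
doubles the same delta from the divisor side.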

> Once we'd scanned through the entire LRU at once suddenly we started
> reclaiming almost as many objects as we scanned, as opposed to < 10
> items.  This is because of the whole referenced thing.

That looks like a side effect of not generating enough pressure on
the shrinker.

> So we'd see
> a small perf bump when we did manage to do this, and then it would
> drop again once the lru was full of fresh inodes again.  So we'd get
> this saw-toothy sort of reclaim.

Yup - but in normal conditions this workload should result in a
relatively even mix of referenced and unreferenced inodes on the
LRU, so this behaviour should not happen.

Indeed, I just pulled the referenced patch out of the kernel and
reran the XFS tests, and I see regular "delta 24000, freed 13000"
traces. There are far fewer huge reclaim peaks (i.e. less GFP_NOFS
windup) and no two shrinker invocations freed the exact same number
of objects.

$ grep 0xffff88033a638cc0 t.t  |awk -e '/mm_shrink_slab_end/ { print $23 }' |sort -n |uniq -c |sort -nr -k2
1 381458
1 333756
1 90265
1 88664
1 82371
1 82340
1 81849
1 80772
1 79268

i.e. changing the referenced bit behaviour has made the XFS inode
shrinker too aggressive for this workload, so rebalancing would be
necessary...

At the same time, I changed btrfs to use "seeks = 1". This shows
/substantially/ more sustained and larger amounts of pressure being
put on the btrfs inode shrinker than the referenced patch:

$ grep 0xffff88004ed79cc0: b.t  |awk -e '/mm_shrink_slab_end/ { print $23 }' |sort -n |uniq -c |sort -nr -k2
1 59021
1 56271
1 53743
1 52885
1 47992
1 44442
1 41265
1 40718
1 36038
1 34872
....

.... but it still has a really, really long tail of small reclaim
attempts so it's better but still not perfect....

Unfortunately, because default_seeks is only 2, we're going to have
to change this value and the constants in do_shrink_slab() to
give us more range to increase reclaim pressure in this manner....

> I did all sorts of tracing to try and figure out what exactly it was
> about the reclaim that made everything so much slower, but with that
> many things going on it was hard to decide what was noise and what
> was actually causing the problem. Without this patch I see kswapd
> stay at higher CPU usage for longer, because it has to keep scanning
> things to satisfy the high watermark and it takes scanning multiple
> million object lists multiple times before it starts to make any
> progress towards reclaim.  This logically makes sense and is not
> ideal behavior.

I would expect the elevated kswapd CPU is due to the fact that it is
being called so many more times to do the inode cache reclaim work.
It has to do more page cache scanning because the shrinker is not
removing the memory pressure, and so it's a downward spiral of
inefficiency. Let's see what happens when we get the shrinker
pressure to reliably scan 1% of the cache on every invocation....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/5] writeback: convert WB_WRITTEN/WB_DIRITED counters to bytes
  2016-10-25 18:41 ` [PATCH 2/5] writeback: convert WB_WRITTEN/WB_DIRITED counters to bytes Josef Bacik
  2016-10-25 19:03   ` Tejun Heo
@ 2016-10-30 15:13   ` Jan Kara
  1 sibling, 0 replies; 26+ messages in thread
From: Jan Kara @ 2016-10-30 15:13 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-btrfs, kernel-team, david, jack, linux-fsdevel, viro, hch, jweiner

On Tue 25-10-16 14:41:41, Josef Bacik wrote:
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 121a6e3..e09b3ad 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -596,11 +596,11 @@ static void wb_domain_writeout_inc(struct wb_domain *dom,
>   * Increment @wb's writeout completion count and the global writeout
>   * completion count. Called from test_clear_page_writeback().
>   */
> -static inline void __wb_writeout_inc(struct bdi_writeback *wb)
> +static inline void __wb_writeout_inc(struct bdi_writeback *wb, long bytes)

Please keep the names consistent - i.e. when you rename wb_writeout_inc to
wb_writeout_add, then you should do the same with __wb_writeout_inc...

>  {
>  	struct wb_domain *cgdom;
>  
> -	__inc_wb_stat(wb, WB_WRITTEN);
> +	__add_wb_stat(wb, WB_WRITTEN_BYTES, bytes);
>  	wb_domain_writeout_inc(&global_wb_domain, &wb->completions,
>  			       wb->bdi->max_prop_frac);

Also I think you will need to change the per-domain writeback statistics to
bytes as well. Otherwise the proportions can get skewed.
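
Concretely, something like this (a sketch based on the quoted hunk; the
per-cgroup wb_domain handling is unchanged and elided):

	static inline void __wb_writeout_add(struct bdi_writeback *wb, long bytes)
	{
		__add_wb_stat(wb, WB_WRITTEN_BYTES, bytes);
		wb_domain_writeout_inc(&global_wb_domain, &wb->completions,
				       wb->bdi->max_prop_frac);
		/* ... per-cgroup wb_domain completion accounting as before ... */
	}

The per-domain point above is about the completion counting that
wb_domain_writeout_inc() feeds into the proportion estimation - that would
need the same byte-based treatment to keep the per-bdi proportions accurate.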

Other than that the patch looks good to me.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/5] writeback: add counters for metadata usage
  2016-10-25 18:41 ` [PATCH 3/5] writeback: add counters for metadata usage Josef Bacik
  2016-10-25 19:50   ` Tejun Heo
@ 2016-10-30 15:36   ` Jan Kara
  1 sibling, 0 replies; 26+ messages in thread
From: Jan Kara @ 2016-10-30 15:36 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-btrfs, kernel-team, david, jack, linux-fsdevel, viro, hch, jweiner

On Tue 25-10-16 14:41:42, Josef Bacik wrote:
> Btrfs has no bounds except memory on the amount of dirty memory that we have in
> use for metadata.  Historically we have used a special inode so we could take
> advantage of the balance_dirty_pages throttling that comes with using pagecache.
> However as we'd like to support different blocksizes it would be nice to not
> have to rely on pagecache, but still get the balance_dirty_pages throttling
> without having to do it ourselves.
> 
> So introduce *METADATA_DIRTY_BYTES and *METADATA_WRITEBACK_BYTES.  These are
> zone and bdi_writeback counters to keep track of how many bytes we have in
> flight for METADATA.  We need to count in bytes as blocksizes could be
> percentages of pagesize.  We simply convert the bytes to number of pages where
> it is needed for the throttling.
> 
> Also introduce NR_METADATA_BYTES so we can keep track of the total amount of
> pages used for metadata on the system.  This is also needed so things like dirty
> throttling know that this is dirtyable memory as well and easily reclaimed.
> 
> Signed-off-by: Josef Bacik <jbacik@fb.com>

The patch looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2016-10-31  1:09 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-10-25 18:41 [PATCH 0/5][RESEND] Support for metadata specific accounting Josef Bacik
2016-10-25 18:41 ` [PATCH 1/5] remove mapping from balance_dirty_pages*() Josef Bacik
2016-10-25 18:47   ` Tejun Heo
2016-10-25 18:41 ` [PATCH 2/5] writeback: convert WB_WRITTEN/WB_DIRITED counters to bytes Josef Bacik
2016-10-25 19:03   ` Tejun Heo
2016-10-25 19:09     ` Josef Bacik
2016-10-30 15:13   ` Jan Kara
2016-10-25 18:41 ` [PATCH 3/5] writeback: add counters for metadata usage Josef Bacik
2016-10-25 19:50   ` Tejun Heo
2016-10-26 15:20     ` Josef Bacik
2016-10-26 15:49       ` Tejun Heo
2016-10-30 15:36   ` Jan Kara
2016-10-25 18:41 ` [PATCH 4/5] writeback: introduce super_operations->write_metadata Josef Bacik
2016-10-25 20:00   ` Tejun Heo
2016-10-25 18:41 ` [PATCH 5/5] fs: don't set *REFERENCED unless we are on the lru list Josef Bacik
2016-10-25 22:01   ` Dave Chinner
2016-10-25 23:36     ` Dave Chinner
2016-10-26 20:03       ` Josef Bacik
2016-10-26 22:20         ` Dave Chinner
2016-10-26 15:11     ` Josef Bacik
2016-10-27  0:30       ` Dave Chinner
2016-10-27 13:13         ` Josef Bacik
2016-10-28  3:48           ` Dave Chinner
2016-10-25 22:44   ` Omar Sandoval
2016-10-26  4:17     ` [PATCH 5/5] " Andreas Dilger
2016-10-26  5:24       ` Omar Sandoval
