From: Johannes Weiner <hannes@cmpxchg.org>
To: Yosry Ahmed <yosryahmed@google.com>
Cc: Nhat Pham <nphamcs@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Chengming Zhou <chengming.zhou@linux.dev>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH 6/9] mm: zswap: drop support for non-zero same-filled pages handling
Date: Fri, 29 Mar 2024 13:37:59 -0400
Message-ID: <20240329173759.GI7597@cmpxchg.org>
In-Reply-To: <CAJD7tkbKY0T6jd5v_GhNFyCO0578SNqfMgs1ZrZiCckY05hzZA@mail.gmail.com>

On Thu, Mar 28, 2024 at 09:27:17PM -0700, Yosry Ahmed wrote:
> On Thu, Mar 28, 2024 at 7:05 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Thu, Mar 28, 2024 at 4:19 PM Nhat Pham <nphamcs@gmail.com> wrote:
> > >
> > > On Thu, Mar 28, 2024 at 2:07 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > >
> > > > On Thu, Mar 28, 2024 at 01:23:42PM -0700, Yosry Ahmed wrote:
> > > > > On Thu, Mar 28, 2024 at 12:31 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > > > >
> > > > > > On Mon, Mar 25, 2024 at 11:50:14PM +0000, Yosry Ahmed wrote:
> > > > > > > The current same-filled pages handling supports pages filled with any
> > > > > > > repeated word-sized pattern. However, in practice, most of these should
> > > > > > > be zero pages anyway. Other patterns should not be nearly as common.
> > > > > > >
> > > > > > > Drop the support for non-zero same-filled pages, but keep the names of
> > > > > > > knobs exposed to userspace as "same_filled", which isn't entirely
> > > > > > > inaccurate.
> > > > > > >
> > > > > > > This yields some nice code simplification and enables a following patch
> > > > > > > that eliminates the need to allocate struct zswap_entry for those pages
> > > > > > > completely.
> > > > > > >
> > > > > > > There is also a very small performance improvement observed over 50 runs
> > > > > > > of kernel build test (kernbench) comparing the mean build time on a
> > > > > > > skylake machine when building the kernel in a cgroup v1 container with a
> > > > > > > 3G limit:
> > > > > > >
> > > > > > >               base            patched         % diff
> > > > > > > real          70.167          69.915          -0.359%
> > > > > > > user          2953.068        2956.147        +0.104%
> > > > > > > sys           2612.811        2594.718        -0.692%
> > > > > > >
> > > > > > > This probably comes from more optimized operations like memchr_inv() and
> > > > > > > clear_highpage(). Note that the percentage of zero-filled pages during
> > > > > > > this test was only around 1.5% on average, and was not affected by this
> > > > > > > patch. Practical workloads could have a larger proportion of such pages
> > > > > > > (e.g. Johannes observed around 10% [1]), so the performance improvement
> > > > > > > should be larger.
> > > > > > >
> > > > > > > [1] https://lore.kernel.org/linux-mm/20240320210716.GH294822@cmpxchg.org/
> > > > > > >
> > > > > > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> > > > > >
> > > > > > This is an interesting direction to pursue, but I actually think it
> > > > > > doesn't go far enough. Either way, I think it needs more data.
> > > > > >
> > > > > > 1) How frequent are non-zero-same-filled pages? Difficult to
> > > > > >    generalize, but if you could gather some from your fleet, that
> > > > > >    would be useful. If you can devise a portable strategy, I'd also be
> > > > > >    more than happy to gather this on ours (although I think you have
> > > > > >    more widespread zswap use, whereas we have more disk swap.)
> > > > >
> > > > > I am trying to collect the data, but there are.. hurdles. It would
> > > > > take some time, so I was hoping the data could be collected elsewhere
> > > > > if possible.
> > > > >
> > > > > The idea I had was to hook a BPF program to the entry of
> > > > > zswap_fill_page() and create a histogram of the "value" argument. We
> > > > > would get more coverage by hooking it to the return of
> > > > > zswap_is_page_same_filled() and only updating the histogram if the
> > > > > return value is true, as it includes pages in zswap that haven't been
> > > > > swapped in.
> > > > >
> > > > > However, with zswap_is_page_same_filled() the BPF program will run in
> > > > > all zswap stores, whereas for zswap_fill_page() it will only run when
> > > > > needed. Not sure if this makes a practical difference tbh.
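
For the record, a minimal kprobe-based sketch of that idea could look like
the below. This assumes zswap_fill_page() keeps its current
(void *ptr, unsigned long value) signature and doesn't get inlined away;
the map layout and program name are made up for illustration, and the
histogram/pretty-printing is left to userspace:

// SPDX-License-Identifier: GPL-2.0
/* Sketch: count how often each repeated word-sized pattern is seen at
 * zswap_fill_page() entry. Requires a CO-RE build (generated vmlinux.h).
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 4096);
        __type(key, u64);       /* the fill pattern ("value" argument) */
        __type(value, u64);     /* number of times it was seen */
} fill_values SEC(".maps");

SEC("kprobe/zswap_fill_page")
int BPF_KPROBE(count_fill_value, void *ptr, unsigned long value)
{
        u64 key = value, one = 1, *cnt;

        cnt = bpf_map_lookup_elem(&fill_values, &key);
        if (cnt)
                __sync_fetch_and_add(cnt, 1);
        else
                bpf_map_update_elem(&fill_values, &key, &one, BPF_NOEXIST);
        return 0;
}

char LICENSE[] SEC("license") = "GPL";

Hooking zswap_is_page_same_filled() instead would mean stashing the output
pointer at entry and reading it from a kretprobe when the return value is
true, so the entry probe above is the simpler variant.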
> > > > >
> > > > > >
> > > > > > 2) The fact that we're doing any of this pattern analysis in zswap at
> > > > > >    all strikes me as a bit misguided. Being efficient about repetitive
> > > > > >    patterns is squarely in the domain of a compression algorithm. Do
> > > > > >    we not trust e.g. zstd to handle this properly?
> > > > >
> > > > > I thought about this briefly, but I didn't follow through. I could try
> > > > > to collect some data by swapping out different patterns and observing
> > > > > how different compression algorithms react. That would be interesting
> > > > > for sure.
> > > > >
> > > > > >
> > > > > >    I'm guessing this goes back to inefficient packing from something
> > > > > >    like zbud, which would waste half a page on one repeating byte.
> > > > > >
> > > > > >    But zsmalloc can do 32 byte objects. It's also a batching slab
> > > > > >    allocator, where storing a series of small, same-sized objects is
> > > > > >    quite fast.
> > > > > >
> > > > > >    Add to that the additional branches, the additional kmap, the extra
> > > > > >    scanning of every single page for patterns - all in the fast path
> > > > > >    of zswap, when we already know that the vast majority of incoming
> > > > > >    pages will need to be properly compressed anyway.
> > > > > >
> > > > > >    Maybe it's time to get rid of the special handling entirely?
> > > > >
> > > > > We would still be wasting some memory (~96 bytes between zswap_entry
> > > > > and zsmalloc object), and wasting cycles allocating them. This could
> > > > > be made up for by cycles saved by removing the handling. We will be
> > > > > saving some branches for sure. I am not worried about kmap as I think
> > > > > it's a noop in most cases.
> > > >
> > > > Yes, true.
> > > >
> > > > > I am interested to see how much we could save by removing scanning for
> > > > > patterns. We may not save much if we abort after reading a few words
> > > > > in most cases, but I guess we could also be scanning a considerable
> > > > > amount before aborting. On the other hand, we would be reading the
> > > > > page contents into cache anyway for compression, so maybe it doesn't
> > > > > really matter?
> > > > >
> > > > > I will try to collect some data about this. I will start by trying to
> > > > > find out how the compression algorithms handle same-filled pages. If
> > > > > they can compress it efficiently, then I will try to get more data on
> > > > > the tradeoff from removing the handling.
> > > >
> > > > I do wonder if this could be overthinking it, too.
> > > >
> > > > Double checking the numbers on our fleet, an additional 96 bytes for
> > > > each same-filled entry would result in a
> > > >
> > > > 1) p50 waste of 0.008% of total memory, and a
> > > >
> > > > 2) p99 waste of 0.06% of total memory.
> >
> > Right. Assuming the compressors do not surprise us and store
> > same-filled pages in an absurd way, it's not worth it in terms of
> > memory savings.
> >
> > > >
> > > > And this is without us having even thought about trying to make
> > > > zsmalloc more efficient for this particular usecase - which might be
> > > > the better point of attack, if we think it's actually worth it.
> > > >
> > > > So my take is that unless removing it would be outright horrible from
> > > > a %sys POV (which seems pretty unlikely), IMO it would be fine to just
> > > > delete it entirely with a "not worth the maintenance cost" argument.
> > > >
> > > > If you turn the argument around, and somebody would submit the code as
> > > > it is today, with the numbers being what they are above, I'm not sure
> > > > we would even accept it!
> > >
> > > The context guy is here :)
> > >
> > > Not arguing for one way or another, but I did find the original patch
> > > that introduced same filled page handling:
> > >
> > > https://github.com/torvalds/linux/commit/a85f878b443f8d2b91ba76f09da21ac0af22e07f
> > >
> > > https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/T/#u
> >
> > Thanks for digging this up. I don't know why I didn't start there :)
> >
> > Following in your footsteps, and given that zram has the same feature,
> > I found the patch that added support for non-zero same-filled pages in
> > zram:
> > https://lore.kernel.org/all/1483692145-75357-1-git-send-email-zhouxianrong@huawei.com/#t
> >
> > Both of them confirm that most same-filled pages are zero pages, but
> > they show a more significant portion of same-filled pages being
> > non-zero (17% to 40%). I suspect this will be less in data centers
> > compared to consumer apps.
> >
> > The zswap patch also reports significant performance improvements from
> > the same-filled handling, but this is with 17-22% same-filled pages.
> > Johannes mentioned around 10% in your data centers, so the performance
> > improvement would be less. In the kernel build tests I ran, with only
> > around 1.5% same-filled pages, I observed a 1.4% improvement just by
> > optimizing their handling (zero-filled only, skipping allocations).
> >
> > So I think removing the same-filled pages handling completely may be
> > too aggressive, because it doesn't only affect memory efficiency, but
> > also the cycles spent handling those pages. Just avoiding going
> > through the allocator and compressor has to account for something :)
> 
> Here is another data point. I tried removing the same-filled handling
> code completely with the diff Johannes sent upthread. I saw 1.3%
> improvement in the kernel build test, very similar to the improvement
> from this patch series. _However_, the kernel build test only produces
> ~1.5% zero-filled pages in my runs. More realistic workloads have
> significantly higher percentages as demonstrated upthread.
> 
> In other words, the kernel build test (at least in my runs) seems to
> be the worst case scenario for same-filled/zero-filled pages. Since
> the improvement from removing same-filled handling is quite small in
> this case, I suspect there will be no improvement, but possibly a
> regression, on real workloads.
> 
> As the zero-filled pages ratio increases:
> - The performance with this series will improve.
> - The performance with removing same-filled handling completely will
> become worse.

Sorry, this thread is still really lacking practical perspective.

As do the numbers that initially justified the patch. Sure, the stores
of same-filled pages are faster. What's the cost of prechecking 90% of
the other pages that need compression?

Also, this is the swap path we're talking about. There is vmscan, swap
slot allocations, page table walks, TLB flushes, zswap tree inserts;
then a page fault and everything in reverse.

I perf'd zswapping out data that is 10% same-filled and 90% data that
always needs compression. The test does nothing but madvise(MADV_PAGEOUT),
and the zswap_store() stack is already only ~60% of the cycles.
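
For reference, the test is roughly of the following shape; this is a
simplified sketch rather than the exact program, and the region size and
the 10%/90% split are illustrative. It needs zswap enabled and enough
swap space behind it to actually page the region out:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21
#endif

int main(void)
{
        size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);
        size_t npages = 262144;                 /* 1 GiB with 4 KiB pages */
        size_t len = npages * pagesz;
        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (buf == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        srandom(1);
        for (size_t i = 0; i < npages; i++) {
                char *p = buf + i * pagesz;

                if (i % 10 == 0) {
                        /* ~10%: zero-filled, hits the same-filled path */
                        memset(p, 0, pagesz);
                } else {
                        /* ~90%: pseudo-random, needs real compression */
                        for (size_t j = 0; j < pagesz; j += sizeof(long))
                                *(long *)(p + j) = random();
                }
        }

        /* Push the whole range through reclaim and into zswap_store() */
        if (madvise(buf, len, MADV_PAGEOUT))
                perror("madvise(MADV_PAGEOUT)");

        munmap(buf, len);
        return 0;
}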

Using zsmalloc + zstd, this is the diff between vanilla and my patch:

# Baseline  Delta Abs  Shared Object         Symbol
# ........  .........  ....................  .....................................................
#
     4.34%     -3.02%  [kernel.kallsyms]     [k] zswap_store
    11.07%     +1.41%  [kernel.kallsyms]     [k] ZSTD_compressBlock_doubleFast
    15.55%     +0.91%  [kernel.kallsyms]     [k] FSE_buildCTable_wksp

As expected, we have to compress a bit more; on the other hand we're
removing the content scan for same-filled for 90% of the pages that
don't benefit from it. The two almost offset each other. Rounding up, the
remaining difference is ~1%.

It's difficult to make the case that this matters to any real
workloads with actual think time in between paging.

But let's say you do make the case that zero-filled pages are worth
optimizing for. Why is this in zswap? Why not do it in vmscan with a
generic zero-swp_entry_t, and avoid the swap backend altogether? No
swap slot allocation, no zswap tree, no *IO on disk swap*.
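
To illustrate, the check itself is trivial and could live next to the rest
of the reclaim-time work. A rough sketch, where the function name, its
placement, and the single-page-folio assumption are all hypothetical (the
interesting part, encoding "zero-filled" in a special swp_entry_t instead
of going to a backend, is not shown):

/* Sketch: reclaim-time zero check, replacing the scan zswap does today */
static bool folio_is_zero_filled(struct folio *folio)
{
        void *addr = kmap_local_folio(folio, 0);
        bool zero = !memchr_inv(addr, 0, PAGE_SIZE);

        kunmap_local(addr);
        return zero;
}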

However you slice it, I fail to see how this has a place in
zswap. It's trying to optimize the slow path of a slow path, at the
wrong layer of the reclaim stack.



Thread overview: 53+ messages
2024-03-25 23:50 [RFC PATCH 0/9] zswap: store zero-filled pages more efficiently Yosry Ahmed
2024-03-25 23:50 ` [RFC PATCH 1/9] mm: zswap: always shrink in zswap_store() if zswap_pool_reached_full Yosry Ahmed
2024-03-26 21:49   ` Nhat Pham
2024-03-27  2:21   ` Chengming Zhou
2024-03-28 19:09   ` Johannes Weiner
2024-03-25 23:50 ` [RFC PATCH 2/9] mm: zswap: refactor storing to the tree out of zswap_store() Yosry Ahmed
2024-03-27  2:25   ` Chengming Zhou
2024-03-27 22:29     ` Yosry Ahmed
2024-03-25 23:50 ` [RFC PATCH 3/9] mm: zswap: refactor limit checking from zswap_store() Yosry Ahmed
2024-03-27  2:42   ` Chengming Zhou
2024-03-27 22:30     ` Yosry Ahmed
2024-03-25 23:50 ` [RFC PATCH 4/9] mm: zswap: move more same-filled pages checks outside of zswap_store() Yosry Ahmed
2024-03-26 21:57   ` Nhat Pham
2024-03-27  2:39   ` Chengming Zhou
2024-03-27 22:32     ` Yosry Ahmed
2024-03-25 23:50 ` [RFC PATCH 5/9] mm: zswap: remove zswap_same_filled_pages_enabled Yosry Ahmed
2024-03-26 22:01   ` Nhat Pham
2024-03-27  2:44   ` Chengming Zhou
2024-03-27 22:34     ` Yosry Ahmed
2024-03-28 19:11   ` Johannes Weiner
2024-03-28 20:06     ` Yosry Ahmed
2024-03-29  2:14       ` Yosry Ahmed
2024-03-29 14:02         ` Maciej S. Szmigiero
2024-03-29 17:44           ` Johannes Weiner
2024-03-29 18:22             ` Yosry Ahmed
2024-04-01 10:37               ` Maciej S. Szmigiero
2024-04-01 18:29                 ` Yosry Ahmed
2024-03-25 23:50 ` [RFC PATCH 6/9] mm: zswap: drop support for non-zero same-filled pages handling Yosry Ahmed
2024-03-27 11:25   ` Chengming Zhou
2024-03-27 16:40   ` Nhat Pham
2024-03-27 22:38     ` Yosry Ahmed
2024-03-28 19:31   ` Johannes Weiner
2024-03-28 20:23     ` Yosry Ahmed
2024-03-28 21:07       ` Johannes Weiner
2024-03-28 23:19         ` Nhat Pham
2024-03-29  2:05           ` Yosry Ahmed
2024-03-29  4:27             ` Yosry Ahmed
2024-03-29 17:37               ` Johannes Weiner [this message]
2024-03-29 18:56                 ` Yosry Ahmed
2024-03-29 21:17                   ` Johannes Weiner
2024-03-29 22:29                     ` Yosry Ahmed
2024-03-28 23:33       ` Nhat Pham
2024-03-29  2:07         ` Yosry Ahmed
2024-03-25 23:50 ` [RFC PATCH 7/9] mm: zswap: store zero-filled pages without a zswap_entry Yosry Ahmed
2024-03-28  8:12   ` Chengming Zhou
2024-03-28 18:45     ` Yosry Ahmed
2024-03-28 19:38   ` Johannes Weiner
2024-03-28 20:29     ` Yosry Ahmed
2024-03-25 23:50 ` [RFC PATCH 8/9] mm: zswap: do not check the global limit for zero-filled pages Yosry Ahmed
2024-03-28  8:15   ` Chengming Zhou
2024-03-25 23:50 ` [RFC PATCH 9/9] mm: zswap: use zswap_entry_free() for partially initialized entries Yosry Ahmed
2024-03-28  8:31   ` Chengming Zhou
2024-03-28 18:49     ` Yosry Ahmed
