Re: [RFC PATCH 6/9] mm: zswap: drop support for non-zero same-filled pages handling

From: Yosry Ahmed <yosryahmed@google.com>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Nhat Pham <nphamcs@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	 Chengming Zhou <chengming.zhou@linux.dev>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	 Srividya Desireddy <srividya.desireddy@gmail.com>
Subject: Re: [RFC PATCH 6/9] mm: zswap: drop support for non-zero same-filled pages handling
Date: Fri, 29 Mar 2024 11:56:46 -0700	[thread overview]
Message-ID: <CAJD7tkaySFP2hBQw4pnZHJJwe3bMdjJ1t9VC2VJd=khn1_TXvA@mail.gmail.com> (raw)
In-Reply-To: <20240329173759.GI7597@cmpxchg.org>

[..]
> > > > The context guy is here :)
> > > >
> > > > Not arguing for one way or another, but I did find the original patch
> > > > that introduced same filled page handling:
> > > >
> > > > https://github.com/torvalds/linux/commit/a85f878b443f8d2b91ba76f09da21ac0af22e07f
> > > >
> > > > https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/T/#u
> > >
> > > Thanks for digging this up. I don't know why I didn't start there :)
> > >
> > > Following in your footsteps, and given that zram has the same feature,
> > > I found the patch that added support for non-zero same-filled pages in
> > > zram:
> > > https://lore.kernel.org/all/1483692145-75357-1-git-send-email-zhouxianrong@huawei.com/#t
> > >
> > > Both of them confirm that most same-filled pages are zero pages, but
> > > they show a more significant portion of same-filled pages being
> > > non-zero (17% to 40%). I suspect this will be less in data centers
> > > compared to consumer apps.
> > >
> > > The zswap patch also reports significant performance improvements from
> > > the same-filled handling, but this is with 17-22% same-filled pages.
> > > Johannes mentioned around 10% in your data centers, so the performance
> > > improvement would be less. In the kernel build tests I ran with only
> > > around 1.5% same-filled pages I observed 1.4% improvements just by
> > > optimizing them (only zero-filled, skipping allocations).
> > >
> > > So I think removing the same-filled pages handling completely may be
> > > too aggressive, because it doesn't only affect the memory efficiency,
> > > but also cycles spent when handling those pages. Just avoiding going
> > > through the allocator and compressor has to account for something :)
> >
> > Here is another data point. I tried removing the same-filled handling
> > code completely with the diff Johannes sent upthread. I saw 1.3%
> > improvement in the kernel build test, very similar to the improvement
> > from this patch series. _However_, the kernel build test only produces
> > ~1.5% zero-filled pages in my runs. More realistic workloads have
> > significantly higher percentages as demonstrated upthread.
> >
> > In other words, the kernel build test (at least in my runs) seems to
> > be the worst case scenario for same-filled/zero-filled pages. Since
> > the improvement from removing same-filled handling is quite small in
> > this case, I suspect there will be no improvement, but possibly a
> > regression, on real workloads.
> >
> > As the zero-filled pages ratio increases:
> > - The performance with this series will improve.
> > - The performance with removing same-filled handling completely will
> > become worse.
>
> Sorry, this thread is still really lacking practical perspective.
>
> As do the numbers that initially justified the patch. Sure, the stores
> of same-filled pages are faster. What's the cost of prechecking 90% of
> the other pages that need compression?

Well, that's exactly what I was trying to find out :)

The kernel build test has over 98% pages that are not same-filled in
my runs, so we are paying the cost of prechecking 98% of pages for
same-filled contents needlessly. However, removing same-filled
handling only resulted in a 1.4% performance improvement. We should
expect even less for workloads that have 90% non-same-filled pages,
right?

It seems like the cost of prechecking is not that bad at all,
potentially because the page contents are read into cache anyway. Did
I miss something?

>
> Also, this is the swap path we're talking about. There is vmscan, swap
> slot allocations, page table walks, TLB flushes, zswap tree inserts;
> then a page fault and everything in reverse.
>
> I perf'd zswapping out data that is 10% same-filled and 90% data that
> always needs compression. It does nothing but madvise(MADV_PAGEOUT),
> and the zswap_store() stack is already only ~60% of the cycles.
>
> Using zsmalloc + zstd, this is the diff between vanilla and my patch:
>
> # Baseline  Delta Abs  Shared Object         Symbol
> # ........  .........  ....................  .....................................................
> #
>      4.34%     -3.02%  [kernel.kallsyms]     [k] zswap_store
>     11.07%     +1.41%  [kernel.kallsyms]     [k] ZSTD_compressBlock_doubleFast
>     15.55%     +0.91%  [kernel.kallsyms]     [k] FSE_buildCTable_wksp
>
> As expected, we have to compress a bit more; on the other hand we're
> removing the content scan for same-filled for 90% of the pages that
> don't benefit from it. They almost amortize each other. Let's round it
> up and the remaining difference is ~1%.

Thanks for the data, this is very useful.

There is also the load path. The original patch that introduced
same-filled pages reports more gains on the load side, which also
happens to be more latency-sensitive. Of course, the data could be
outdated -- but it's a different type of workload than what will be
running in a data center fleet AFAICT.

Is there also no noticeable difference on the load side in your data?

Also, how much increase did you observe in the size of compressed data
with your patch? Just curious about how zstd ended up handling those
pages.

>
> It's difficult to make the case that this matters to any real
> workloads with actual think time in between paging.

If the difference in performance is 1%, I surely agree.

The patch reported 19-32% improvement in store time for same-filled
pages depending on the workload and platform. For 10% same-filled
pages, this should roughly correspond to 2-3% overall improvement,
which is not too far from the 1% you observed.

The patch reported much larger improvement for load times (which
matters more), 49-85% for same-filled pages. If this corresponds to
5-8% overall improvement in zswap load time for a workload with 10%
same-filled pages, that would be very significant. I don't expect to
see such improvements tbh, but we should check.

Let's CC Srividya here for visibility.

>
> But let's say you do make the case that zero-filled pages are worth
> optimizing for.

I am not saying they are for sure, but removing the same-filled
checking didn't seem to improve performance much in my testing, so the
cost seems to be mostly in maintenance. This makes it seem to me that
it *could* be a good tradeoff to only handle zero-filled pages. We can
make them slightly faster and we can trim the complexity -- as shown
by this patch.

> Why is this in zswap? Why not do it in vmscan with a
> generic zero-swp_entry_t, and avoid the swap backend altogether? No
> swap slot allocation, no zswap tree, no *IO on disk swap*.

That part I definitely agree with, especially that the logic is
duplicated in zram.

>
> However you slice it, I fail to see how this has a place in
> zswap. It's trying to optimize the slow path of a slow path, at the
> wrong layer of the reclaim stack.

Agreeing for the store path, we still need some clarity on the load
path. But yeah all-in-all zswap is not the correct place for such
optimizations, but it's the way the handling currently lives.