Re: [PATCH v2 0/5] variable-order, large folios for anonymous memory

From: Yu Zhao <yuzhao@google.com>
To: Ryan Roberts <ryan.roberts@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Matthew Wilcox <willy@infradead.org>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Yin Fengwei <fengwei.yin@intel.com>,
	David Hildenbrand <david@redhat.com>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Will Deacon <will@kernel.org>,
	Anshuman Khandual <anshuman.khandual@arm.com>,
	Yang Shi <shy828301@gmail.com>,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH v2 0/5] variable-order, large folios for anonymous memory
Date: Mon, 3 Jul 2023 20:18:50 -0600	[thread overview]
Message-ID: <CAOUHufYB2kB0r9hhSbzfEzdF85MkXVfWoFOhy3LwLfJ5Qo8H6g@mail.gmail.com> (raw)
In-Reply-To: <20230703135330.1865927-1-ryan.roberts@arm.com>

On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> Hi All,
>
> This is v2 of a series to implement variable order, large folios for anonymous
> memory. The objective of this is to improve performance by allocating larger
> chunks of memory during anonymous page faults. See [1] for background.

Thanks for the quick response!

> I've significantly reworked and simplified the patch set based on comments from
> Yu Zhao (thanks for all your feedback!). I've also renamed the feature to
> VARIABLE_THP, on Yu's advice.
>
> The last patch is for arm64 to explicitly override the default
> arch_wants_pte_order() and is intended as an example. If this series is accepted
> I suggest taking the first 4 patches through the mm tree and the arm64 change
> could be handled through the arm64 tree separately. Neither has any build
> dependency on the other.
>
> The one area where I haven't followed Yu's advice is in the determination of the
> size of folio to use. It was suggested that I have a single preferred large
> order, and if it doesn't fit in the VMA (due to exceeding VMA bounds, or there
> being existing overlapping populated PTEs, etc) then fallback immediately to
> order-0. It turned out that this approach caused a performance regression in the
> Speedometer benchmark.

I suppose it's regression against the v1, not the unpatched kernel.

> With my v1 patch, there were significant quantities of
> memory which could not be placed in the 64K bucket and were instead being
> allocated for the 32K and 16K buckets. With the proposed simplification, that
> memory ended up using the 4K bucket, so page faults increased by 2.75x compared
> to the v1 patch (although due to the 64K bucket, this number is still a bit
> lower than the baseline). So instead, I continue to calculate a folio order that
> is somewhere between the preferred order and 0. (See below for more details).

I suppose the benchmark wasn't running under memory pressure, which is
uncommon for client devices. It could be easier the other way around:
using 32/16KB shows regression whereas order-0 shows better
performance under memory pressure.

I'm not sure we should use v1 as the baseline. Unpatched kernel sounds
more reasonable at this point. If 32/16KB is proven to be better in
most scenarios including under memory pressure, we can reintroduce
that policy. I highly doubt this is the case: we tried 16KB base page
size on client devices, and overall, the regressions outweighs the
benefits.

> The patches are based on top of v6.4 plus Matthew Wilcox's set_ptes() series
> [2], which is a hard dependency. I have a branch at [3].

It's not clear to me why [2] is a hard dependency.

It seems to me we are getting close and I was hoping we could get into
mm-unstable soon without depending on other series...