linux-kernel.vger.kernel.org archive mirror
From: Zi Yan <ziy@nvidia.com>
To: Byungchul Park <byungchul@sk.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	kernel_team@skhynix.com, akpm@linux-foundation.org,
	ying.huang@intel.com, namit@vmware.com, xhao@linux.alibaba.com,
	mgorman@techsingularity.net, hughd@google.com,
	willy@infradead.org, david@redhat.com
Subject: Re: [RFC 2/2] mm: Defer TLB flush by keeping both src and dst folios at migration
Date: Fri, 04 Aug 2023 12:08:07 -0400	[thread overview]
Message-ID: <A59E5AD9-11DA-435A-AD17-6F735F0C0840@nvidia.com> (raw)
In-Reply-To: <20230804061850.21498-3-byungchul@sk.com>


On 4 Aug 2023, at 2:18, Byungchul Park wrote:

> Implementation of CONFIG_MIGRC, which stands for 'Migration Read Copy'.
>
> Migration overhead is always incurred at either promotion or demotion
> when working with tiered memory, e.g. CXL memory, and TLB shootdown
> turned out to be a significant part of it that should be eliminated if
> possible.
>
> Fortunately, the TLB flush can be deferred or even skipped if both the
> source and destination folios are kept during migration until all the
> required TLB flushes have been done, but only if the target PTE entries
> are read-only, or more precisely, do not have write permission.
> Otherwise, the folio could certainly get corrupted.

So this would only reduce or eliminate TLB flushes? The same goal should
be achievable with batched TLB flush, right? You could probably group the
to-be-migrated pages into a read-only group and a writable group and
migrate the read-only group first, then the writable group. That would
reduce or eliminate the TLB flushes for the read-only group of pages, right?
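A rough sketch of that grouping idea, in illustrative Python (none of these names are kernel APIs; this only models the control flow):

```python
# Split to-be-migrated pages by PTE writability, migrate the read-only
# group without an immediate flush, and batch a single flush for the
# writable group. Purely illustrative; not kernel code.

def partition_by_writability(pages):
    """Split pages into (read_only, writable) groups."""
    read_only = [p for p in pages if not p["writable"]]
    writable = [p for p in pages if p["writable"]]
    return read_only, writable

def migrate_batched(pages):
    """Return how many TLB flushes a grouped migration would issue."""
    read_only, writable = partition_by_writability(pages)
    flushes = 0
    # Read-only group: stale translations are harmless for now, so the
    # flush can be deferred. Writable group: one batched flush covers
    # the whole group.
    if writable:
        flushes = 1
    return flushes

pages = [{"pfn": 1, "writable": False},
         {"pfn": 2, "writable": True},
         {"pfn": 3, "writable": False}]
```

With the mixed list above, a grouped migration would issue a single batched flush for the one writable page, and none at all if every page were read-only.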

>
> To achieve that:
>
>    1. For the folios that have only non-writable TLB entries, avoid
>       the TLB flush by keeping both the source and destination folios
>       during migration; the flush will be handled later at a better time.

In this case, the page table points to the destination folio, but TLB
would cache the old translation pointing to the source folio. I wonder
if there would be any correctness issue.
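As a toy model of why the stale translation might be harmless while the source folio is kept (every name below is hypothetical; nothing here is kernel code): the mapping is read-only, so both copies stay byte-identical, and a read through either translation returns the same data.

```python
# Toy model: the page table points at the destination frame, while a
# stale TLB entry still points at the source frame. Since the mapping
# is read-only, both frames hold identical bytes, so reads agree.
frames = {}
frames["src"] = b"folio contents"      # source folio
frames["dst"] = bytes(frames["src"])   # migration copied it verbatim
page_table = {"va": "dst"}             # PTE updated to the destination
stale_tlb = {"va": "src"}              # cached translation not yet flushed

def read(va, cpu_has_stale_tlb):
    # A CPU holding the stale entry reads through it; any other CPU
    # walks the page table and sees the destination frame.
    frame = stale_tlb[va] if cpu_has_stale_tlb else page_table[va]
    return frames[frame]
```

The model only holds as long as no write can land in either copy, which is exactly the read-only restriction the patch imposes.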

>
>    2. When any non-writable TLB entry changes to writable, e.g. through
>       the fault handler, give up the CONFIG_MIGRC mechanism and perform
>       the required TLB flush right away.
>
>    3. TLB flushes can be skipped if all the TLB flushes required to
>       free the duplicated folios have already been done for any reason,
>       not necessarily due to migration.
>
>    4. Adjust the watermark check routine, __zone_watermark_ok(), by the
>       number of duplicated folios, because those folios can be freed
>       and made available right away through the appropriate TLB flushes.
>
>    5. Perform the TLB flushes and free the duplicated folios pending
>       the flushes if the page allocation routine is in trouble due to
>       memory pressure, even more aggressively for high-order allocations.
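The five steps might be summarized by a small state sketch like the following (illustrative Python, not kernel code; every name is made up):

```python
# Toy sketch of the MIGRC lifecycle described in steps 1-5 above.

class MigrcEntry:
    def __init__(self):
        # Step 1: flush deferred, source and destination folios both kept.
        self.pending = True

def on_write_fault(entry):
    # Step 2: the entry becomes writable -> give up MIGRC, flush now.
    entry.pending = False

def on_any_tlb_flush(entry):
    # Step 3: a flush done for any other reason also retires this entry.
    entry.pending = False

def watermark_ok(free_pages, wmark, nr_duplicated):
    # Step 4: duplicated folios are freeable after a flush, so count them
    # toward the watermark check.
    return free_pages + nr_duplicated >= wmark

def reclaim_under_pressure(entries):
    # Step 5: under memory pressure, flush and free the kept folios.
    freed = sum(1 for e in entries if e.pending)
    for e in entries:
        e.pending = False
    return freed
```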
>
> The measurement result:
>
>    Architecture - x86_64
>    QEMU - KVM enabled, host CPU, 2 nodes ((4 CPUs, 2GB) + (cpuless, 6GB))
>    Linux Kernel - v6.4, numa balancing tiering on, demotion enabled
>    Benchmark - XSBench with no parameters changed
>
>    run 'perf stat' using events:
>    (FYI, process wide result ~= system wide result(-a option))
>       1) itlb.itlb_flush
>       2) tlb_flush.dtlb_thread
>       3) tlb_flush.stlb_any
>
>    run 'cat /proc/vmstat' and pick up:
>       1) pgdemote_kswapd
>       2) numa_pages_migrated
>       3) pgmigrate_success
>       4) nr_tlb_remote_flush
>       5) nr_tlb_remote_flush_received
>       6) nr_tlb_local_flush_all
>       7) nr_tlb_local_flush_one
>
>    BEFORE - mainline v6.4
>    ==========================================
>
>    $ perf stat -e itlb.itlb_flush,tlb_flush.dtlb_thread,tlb_flush.stlb_any ./XSBench
>
>    Performance counter stats for './XSBench':
>
>       426856       itlb.itlb_flush
>       6900414      tlb_flush.dtlb_thread
>       7303137      tlb_flush.stlb_any
>
>    33.500486566 seconds time elapsed
>    92.852128000 seconds user
>    10.526718000 seconds sys
>
>    $ cat /proc/vmstat
>
>    ...
>    pgdemote_kswapd 1052596
>    numa_pages_migrated 1052359
>    pgmigrate_success 2161846
>    nr_tlb_remote_flush 72370
>    nr_tlb_remote_flush_received 213711
>    nr_tlb_local_flush_all 3385
>    nr_tlb_local_flush_one 198679
>    ...
>
>    AFTER - mainline v6.4 + CONFIG_MIGRC
>    ==========================================
>
>    $ perf stat -e itlb.itlb_flush,tlb_flush.dtlb_thread,tlb_flush.stlb_any ./XSBench
>
>    Performance counter stats for './XSBench':
>
>       179537       itlb.itlb_flush
>       6131135      tlb_flush.dtlb_thread
>       6920979      tlb_flush.stlb_any
>
>    30.396700625 seconds time elapsed
>    80.331252000 seconds user
>    10.303761000 seconds sys
>
>    $ cat /proc/vmstat
>
>    ...
>    pgdemote_kswapd 1044602
>    numa_pages_migrated 1044202
>    pgmigrate_success 2157808
>    nr_tlb_remote_flush 30453
>    nr_tlb_remote_flush_received 88840
>    nr_tlb_local_flush_all 3039
>    nr_tlb_local_flush_one 198875
>    ...
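For reference, the relative changes can be computed directly from the numbers quoted above:

```python
# Percentage reduction from the BEFORE run to the AFTER run, using the
# figures quoted above.
before = {"nr_tlb_remote_flush": 72370,
          "nr_tlb_remote_flush_received": 213711,
          "elapsed_seconds": 33.500486566}
after = {"nr_tlb_remote_flush": 30453,
         "nr_tlb_remote_flush_received": 88840,
         "elapsed_seconds": 30.396700625}

reduction = {k: round(100.0 * (1 - after[k] / before[k]), 1)
             for k in before}
```

Roughly a 58% cut in remote TLB flushes alongside a 9% shorter elapsed time.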
>
> Signed-off-by: Byungchul Park <byungchul@sk.com>
> ---

--
Best Regards,
Yan, Zi


Thread overview: 15+ messages
2023-08-04  6:18 [RFC 0/2] Reduce TLB flushes under some specific conditions Byungchul Park
2023-08-04  6:18 ` [RFC 1/2] mm/rmap: Recognize non-writable TLB entries during TLB batch flush Byungchul Park
2023-08-04  6:18 ` [RFC 2/2] mm: Defer TLB flush by keeping both src and dst folios at migration Byungchul Park
2023-08-04 16:08   ` Zi Yan [this message]
2023-08-07  0:43     ` Byungchul Park
2023-08-04 17:32   ` Nadav Amit
2023-08-07  1:42     ` Byungchul Park
2023-08-07  5:05     ` Byungchul Park
2023-08-15  1:27   ` Huang, Ying
2023-08-16  0:13     ` Byungchul Park
2023-08-16  1:01       ` Huang, Ying
2023-08-16  2:40         ` Byungchul Park
2023-08-21  1:28         ` Byungchul Park
2023-08-21  2:51           ` Huang, Ying
2023-08-17  8:16     ` Byungchul Park
