Re: [PATCH v1 06/14] mm, x86: support the access bit on non-leaf PMD entries

From: Yu Zhao <yuzhao@google.com>
To: Dave Hansen <dave.hansen@intel.com>
Cc: linux-mm@kvack.org, Alex Shi <alex.shi@linux.alibaba.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Hillf Danton <hdanton@sina.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Joonsoo Kim <iamjoonsoo.kim@lge.com>,
	Matthew Wilcox <willy@infradead.org>,
	Mel Gorman <mgorman@suse.de>, Michal Hocko <mhocko@suse.com>,
	Roman Gushchin <guro@fb.com>, Vlastimil Babka <vbabka@suse.cz>,
	Wei Yang <richard.weiyang@linux.alibaba.com>,
	Yang Shi <shy828301@gmail.com>, Ying Huang <ying.huang@intel.com>,
	linux-kernel@vger.kernel.org, page-reclaim@google.com
Subject: Re: [PATCH v1 06/14] mm, x86: support the access bit on non-leaf PMD entries
Date: Sun, 14 Mar 2021 21:16:35 -0600
Message-ID: <YE7Rk/YA1Uj7yFn2@google.com>
In-Reply-To: <ce818341-ed33-cd8c-5c06-65147f510c4d@intel.com>

On Sun, Mar 14, 2021 at 04:22:03PM -0700, Dave Hansen wrote:
> On 3/12/21 11:57 PM, Yu Zhao wrote:
> > Some architectures support the accessed bit on non-leaf PMD entries
> > (parents) in addition to leaf PTE entries (children) where pages are
> > mapped, e.g., x86_64 sets the accessed bit on a parent when using it
> > as part of linear-address translation [1]. Page table walkers who are
> > interested in the accessed bit on children can take advantage of this:
> > they do not need to search the children when the accessed bit is not
> > set on a parent, given that they have previously cleared the accessed
> > bit on this parent in addition to its children.
> 
> I'd like to hear a *LOT* more about how this is going to be used.
> 
> The one part of this which is entirely missing is the interaction with
> the TLB and mid-level paging structure caches.  The CPU is pretty
> aggressive about setting no-leaf accessed bits when TLB entries are
> created.  This *looks* to be depending on that behavior, but it would be
> nice to spell it out explicitly.

Good point. Let me start with a couple of observations we've made:
  1) some applications create very sparse address spaces, for various
  reasons. A notable example is those using the Scudo memory allocator:
  they usually have double-digit numbers of PTE entries for each PMD
  entry (and thousands of VMAs for just a few hundred MBs of memory
  usage, sigh...).
  2) scans of an address space (from the reclaim path) are much less
  frequent than context switches of it. Under our heaviest memory
  pressure (30%+ overcommitted; guess how much we've profited from
  it :) ), the two frequencies still differ by orders of magnitude.
  Specifically, on our smallest system (2GB, with PCID), we observed
  no difference between flushing and not flushing the TLB in terms of
  page selections. We actually observed more TLB misses under heavier
  memory pressure, and our theory is that this is due to the increased
  memory footprint that causes the pressure.

There are two use cases for the accessed bit on non-leaf PMD entries:
hot tracking and cold tracking. I'll focus on cold tracking, which is
what this series is about.

Since non-leaf entries are more likely to be cached, in theory, the
false negative rate is higher compared with leaf entries as the CPU
won't set the accessed bit again until the next TLB miss. (Here a
false negative means the accessed bit isn't set on an entry that has
been used after we cleared the accessed bit. And IIRC, there are also
false positives, i.e., the accessed bit is set on entries used by
speculative execution only.) But this is not a problem because of the
second observation above.
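
To make the skipping idea concrete, here is a minimal userspace
sketch. It is not kernel code and not taken from this series: struct
pmd_model, model_touch() and model_scan() are made-up names, and the
"CPU" is modeled only for the TLB-miss case, where both the parent and
the child bits get set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* PTEs under one PMD entry on x86_64 with 4KB pages. */
#define PTES_PER_PMD 512

/* Hypothetical model of one non-leaf PMD entry and its PTE table. */
struct pmd_model {
	bool accessed;                   /* accessed bit on the parent */
	bool pte_accessed[PTES_PER_PMD]; /* accessed bits on the children */
};

/* Model of a hardware walk on a TLB miss: sets parent and child bits. */
static void model_touch(struct pmd_model *pmd, size_t idx)
{
	assert(idx < PTES_PER_PMD);
	pmd->accessed = true;
	pmd->pte_accessed[idx] = true;
}

/*
 * Cold-tracking scan: test and clear the accessed bits and return the
 * number of young PTEs found. Children are searched only when the
 * parent bit is set, which is valid because every scan clears the
 * parent together with its children. *visited counts the PTEs that
 * were actually examined.
 */
static size_t model_scan(struct pmd_model *pmds, size_t nr, size_t *visited)
{
	size_t young = 0;

	*visited = 0;
	for (size_t i = 0; i < nr; i++) {
		if (!pmds[i].accessed)
			continue; /* parent clear: skip all 512 children */
		pmds[i].accessed = false;
		for (size_t j = 0; j < PTES_PER_PMD; j++) {
			(*visited)++;
			if (pmds[i].pte_accessed[j]) {
				pmds[i].pte_accessed[j] = false;
				young++;
			}
		}
	}
	return young;
}
```

With four PMD entries of which only one has been touched, the scan
examines 512 PTEs rather than 2048. A false negative in this model
would be an access served from a cached translation, i.e., one that
never calls model_touch() and so sets neither bit.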

Now let's consider the worst case scenario: what happens when we hit
a false negative on a non-leaf PMD entry? We think the pages mapped
by the PTE entries of this PMD entry are inactive and try to reclaim
them, until we see the accessed bit set on one of the PTE entries.
This will cost us one futile attempt for all the 512 PTE entries. A
glance at lru_gen_scan_around() in the 11th patch would explain
exactly why. If you are guessing that function embodies the same idea
as "fault around", you are right.
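
For those who don't want to open the 11th patch, the idea can be
sketched roughly as below. This is a hypothetical userspace model, not
the code of lru_gen_scan_around(): sketch_scan_around() and its
parameters are made up for illustration, and a PTE table is reduced to
an array of accessed bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* PTEs in one PTE table on x86_64 with 4KB pages. */
#define PTES_PER_TABLE 512

/*
 * Hypothetical model: a reclaim attempt on the page at index hit has
 * found its accessed bit set. Instead of rejecting only that page,
 * sweep the whole PTE table once, clear every accessed bit found, and
 * mark the corresponding pages young so they are not tried one by one.
 * Returns the number of young pages found, or 0 if the page at hit is
 * really cold and there is no reason to look around.
 */
static size_t sketch_scan_around(bool pte_accessed[PTES_PER_TABLE],
				 bool page_young[PTES_PER_TABLE],
				 size_t hit)
{
	size_t found = 0;

	assert(hit < PTES_PER_TABLE);
	if (!pte_accessed[hit])
		return 0;

	for (size_t i = 0; i < PTES_PER_TABLE; i++) {
		if (!pte_accessed[i])
			continue;
		pte_accessed[i] = false;
		page_young[i] = true;
		found++;
	}
	return found;
}
```

In this model a false negative on the parent costs one futile attempt:
the first accessed PTE we run into triggers the sweep, and the other
accessed PTEs in the same table are picked up by that sweep.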

And there are two places that could benefit from this patch (and the
next) immediately, independent of this series. One is
clear_refs_test_walk() in fs/proc/task_mmu.c. The other is
madvise_pageout_page_range() and madvise_cold_page_range() in
mm/madvise.c. Both are page table walkers that clear the accessed bit.

I think I've covered a lot of ground but I'm sure there is a lot more.
So please feel free to add and I'll include everything we discuss here
in the next version.