linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Yu Zhao <yuzhao@google.com>
To: Dave Hansen <dave.hansen@intel.com>
Cc: linux-mm@kvack.org, Alex Shi <alex.shi@linux.alibaba.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Hillf Danton <hdanton@sina.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Joonsoo Kim <iamjoonsoo.kim@lge.com>,
	Matthew Wilcox <willy@infradead.org>,
	Mel Gorman <mgorman@suse.de>, Michal Hocko <mhocko@suse.com>,
	Roman Gushchin <guro@fb.com>, Vlastimil Babka <vbabka@suse.cz>,
	Wei Yang <richard.weiyang@linux.alibaba.com>,
	Yang Shi <shy828301@gmail.com>, Ying Huang <ying.huang@intel.com>,
	linux-kernel@vger.kernel.org, page-reclaim@google.com
Subject: Re: [PATCH v1 06/14] mm, x86: support the access bit on non-leaf PMD entries
Date: Sun, 14 Mar 2021 21:16:35 -0600	[thread overview]
Message-ID: <YE7Rk/YA1Uj7yFn2@google.com> (raw)
In-Reply-To: <ce818341-ed33-cd8c-5c06-65147f510c4d@intel.com>

On Sun, Mar 14, 2021 at 04:22:03PM -0700, Dave Hansen wrote:
> On 3/12/21 11:57 PM, Yu Zhao wrote:
> > Some architectures support the accessed bit on non-leaf PMD entries
> > (parents) in addition to leaf PTE entries (children) where pages are
> > mapped, e.g., x86_64 sets the accessed bit on a parent when using it
> > as part of linear-address translation [1]. Page table walkers who are
> > interested in the accessed bit on children can take advantage of this:
> > they do not need to search the children when the accessed bit is not
> > set on a parent, given that they have previously cleared the accessed
> > bit on this parent in addition to its children.
> 
> I'd like to hear a *LOT* more about how this is going to be used.
> 
> The one part of this which is entirely missing is the interaction with
> the TLB and mid-level paging structure caches.  The CPU is pretty
> aggressive about setting no-leaf accessed bits when TLB entries are
> created.  This *looks* to be depending on that behavior, but it would be
> nice to spell it out explicitly.

Good point. Let me start with a couple of observations we've made:
  1) some applications create very sparse address spaces, for various
  reasons. A notable example is those using Scudo memory allocations:
  they usually have double-digit numbers of PTE entries for each PMD
  entry (and thousands of VMAs for just a few hundred MBs of memory
  usage, sigh...).
  2) scans of an address space (from the reclaim path) are much less
  frequent than context switches of it. Under our heaviest memory
  pressure (30%+ overcommitted; guess how much we've profited from
  it :) ), their magnitudes are still on different orders.
  Specifically, on our smallest system (2GB, with PCID), we observed
  no difference between flushing and not flushing TLB in terms of page
  selections. We actually observed more TLB misses under heavier
  memory pressure, and our theory is that this is due to increased
  memory footprint that causes the pressure.

There are two use cases for the accessed bit on non-leaf PMD entries:
the hot tracking and the cold tracking. I'll focus on the cold
tracking, which is what this series about.

Since non-leaf entries are more likely to be cached, in theory, the
false negative rate is higher compared with leaf entries as the CPU
won't set the accessed bit again until the next TLB miss. (Here a
false negative means the accessed bit isn't set on an entry has been
used, after we cleared the accessed bit. And IIRC, there are also
false positives, i.e., the accessed bit is set on entries used by
speculative execution only.) But this is not a problem because of the
second observation aforementioned.

Now let's consider the worst case scenario: what happens when we hit
a false negative on a non-leaf PMD entry? We think the pages mapped
by the PTE entries of this PMD entry are inactive and try to reclaim
them, until we see the accessed bit set on one of the PTE entries.
This will cost us one futile attempt for all the 512 PTE entries. A
glance at lru_gen_scan_around() in the 11th patch would explain
exactly why. If you are guessing that function embodies the same idea
of "fault around", you are right.

And there are two places that could benefit from this patch (and the
next) immediately, independent to this series. One is
clear_refs_test_walk() in fs/proc/task_mmu.c. The other is
madvise_pageout_page_range() and madvise_cold_page_range() in
mm/madvise.c. Both are page table walkers that clear the accessed bit.

I think I've covered a lot of ground but I'm sure there is a lot more.
So please feel free to add and I'll include everything we discuss here
in the next version.


  reply	other threads:[~2021-03-15  3:16 UTC|newest]

Thread overview: 59+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-03-13  7:57 [PATCH v1 00/14] Multigenerational LRU Yu Zhao
2021-03-13  7:57 ` [PATCH v1 01/14] include/linux/memcontrol.h: do not warn in page_memcg_rcu() if !CONFIG_MEMCG Yu Zhao
2021-03-13 15:09   ` Matthew Wilcox
2021-03-14  7:45     ` Yu Zhao
2021-03-13  7:57 ` [PATCH v1 02/14] include/linux/nodemask.h: define next_memory_node() if !CONFIG_NUMA Yu Zhao
2021-03-13  7:57 ` [PATCH v1 03/14] include/linux/huge_mm.h: define is_huge_zero_pmd() if !CONFIG_TRANSPARENT_HUGEPAGE Yu Zhao
2021-03-13  7:57 ` [PATCH v1 04/14] include/linux/cgroup.h: export cgroup_mutex Yu Zhao
2021-03-13  7:57 ` [PATCH v1 05/14] mm/swap.c: export activate_page() Yu Zhao
2021-03-13  7:57 ` [PATCH v1 06/14] mm, x86: support the access bit on non-leaf PMD entries Yu Zhao
2021-03-14 22:12   ` Zi Yan
2021-03-14 22:51     ` Matthew Wilcox
2021-03-15  0:03       ` Yu Zhao
2021-03-15  0:27         ` Zi Yan
2021-03-15  1:04           ` Yu Zhao
2021-03-14 23:22   ` Dave Hansen
2021-03-15  3:16     ` Yu Zhao [this message]
2021-03-13  7:57 ` [PATCH v1 07/14] mm/pagewalk.c: add pud_entry_post() for post-order traversals Yu Zhao
2021-03-13  7:57 ` [PATCH v1 08/14] mm/vmscan.c: refactor shrink_node() Yu Zhao
2021-03-13  7:57 ` [PATCH v1 09/14] mm: multigenerational lru: mm_struct list Yu Zhao
2021-03-15 19:40   ` Rik van Riel
2021-03-16  2:07     ` Huang, Ying
2021-03-16  3:57       ` Yu Zhao
2021-03-16  6:44         ` Huang, Ying
2021-03-16  7:56           ` Yu Zhao
2021-03-17  3:37             ` Huang, Ying
2021-03-17 10:46               ` Yu Zhao
2021-03-22  3:13                 ` Huang, Ying
2021-03-22  8:08                   ` Yu Zhao
2021-03-24  6:58                     ` Huang, Ying
2021-04-10 18:48                       ` Yu Zhao
2021-04-13  3:06                         ` Huang, Ying
2021-03-13  7:57 ` [PATCH v1 10/14] mm: multigenerational lru: core Yu Zhao
2021-03-15  2:02   ` Andi Kleen
2021-03-15  3:37     ` Yu Zhao
2021-03-13  7:57 ` [PATCH v1 11/14] mm: multigenerational lru: page activation Yu Zhao
2021-03-16 16:34   ` Matthew Wilcox
2021-03-16 21:29     ` Yu Zhao
2021-03-13  7:57 ` [PATCH v1 12/14] mm: multigenerational lru: user space interface Yu Zhao
2021-03-13 12:23   ` kernel test robot
2021-03-13  7:57 ` [PATCH v1 13/14] mm: multigenerational lru: Kconfig Yu Zhao
2021-03-13 12:53   ` kernel test robot
2021-03-13 13:36   ` kernel test robot
2021-03-13  7:57 ` [PATCH v1 14/14] mm: multigenerational lru: documentation Yu Zhao
2021-03-19  9:31   ` Alex Shi
2021-03-22  6:09     ` Yu Zhao
2021-03-14 22:48 ` [PATCH v1 00/14] Multigenerational LRU Zi Yan
2021-03-15  0:52   ` Yu Zhao
2021-03-15  1:13 ` Hillf Danton
2021-03-15  6:49   ` Yu Zhao
2021-03-15 18:00 ` Dave Hansen
2021-03-16  2:24   ` Yu Zhao
2021-03-16 14:50     ` Dave Hansen
2021-03-16 20:30       ` Yu Zhao
2021-03-16 21:14         ` Dave Hansen
2021-04-10  9:21           ` Yu Zhao
2021-04-13  3:02             ` Huang, Ying
2021-04-13 23:00               ` Yu Zhao
2021-03-15 18:38 ` Yang Shi
2021-03-16  3:38   ` Yu Zhao

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YE7Rk/YA1Uj7yFn2@google.com \
    --to=yuzhao@google.com \
    --cc=akpm@linux-foundation.org \
    --cc=alex.shi@linux.alibaba.com \
    --cc=dave.hansen@intel.com \
    --cc=dave.hansen@linux.intel.com \
    --cc=guro@fb.com \
    --cc=hannes@cmpxchg.org \
    --cc=hdanton@sina.com \
    --cc=iamjoonsoo.kim@lge.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mgorman@suse.de \
    --cc=mhocko@suse.com \
    --cc=page-reclaim@google.com \
    --cc=richard.weiyang@linux.alibaba.com \
    --cc=shy828301@gmail.com \
    --cc=vbabka@suse.cz \
    --cc=willy@infradead.org \
    --cc=ying.huang@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).