From: Yu Zhao <yuzhao@google.com>
To: Dave Hansen <dave.hansen@intel.com>
Cc: linux-mm@kvack.org, Alex Shi <alex.shi@linux.alibaba.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Hillf Danton <hdanton@sina.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Joonsoo Kim <iamjoonsoo.kim@lge.com>,
	Matthew Wilcox <willy@infradead.org>,
	Mel Gorman <mgorman@suse.de>, Michal Hocko <mhocko@suse.com>,
	Roman Gushchin <guro@fb.com>, Vlastimil Babka <vbabka@suse.cz>,
	Wei Yang <richard.weiyang@linux.alibaba.com>,
	Yang Shi <shy828301@gmail.com>, Ying Huang <ying.huang@intel.com>,
	linux-kernel@vger.kernel.org, page-reclaim@google.com
Subject: Re: [PATCH v1 00/14] Multigenerational LRU
Date: Sat, 10 Apr 2021 03:21:51 -0600
Message-ID: <YHFuL/Ddtiml4biw@google.com>
In-Reply-To: <4e76078c-846f-a0f0-2349-12d9d806d1a8@intel.com>

On Tue, Mar 16, 2021 at 02:14:43PM -0700, Dave Hansen wrote:
> On 3/16/21 1:30 PM, Yu Zhao wrote:
> > On Tue, Mar 16, 2021 at 07:50:23AM -0700, Dave Hansen wrote:
> >> I think it would also be very worthwhile to include some research in
> >> this series about why the kernel moved away from page table scanning.
> >> What has changed?  Are the workloads we were concerned about way back
> >> then not around any more?  Has faster I/O or larger memory sizes with a
> >> stagnating page size changed something?
> > 
> > Sure. Hugh also suggested this too but I personally found that ancient
> > pre-2.4 history too irrelevant (and uninteresting) to the modern age
> > and decided to spare audience of the boredom.
> 
> IIRC, rmap chains showed up in the 2.5 era and the VM was quite bumpy
> until anon_vmas came around, which was early-ish in the 2.6 era.
> 
> But, either way, I think there is a sufficient population of nostalgic
> crusty old folks around to warrant a bit of a history lesson.  We'll
> enjoy the trip down memory lane, fondly remembering the old days in
> Ottawa...
> 
> >>> nr_vmscan_write 24900719
> >>> nr_vmscan_immediate_reclaim 115535
> >>> pgscan_kswapd 320831544
> >>> pgscan_direct 23396383
> >>> pgscan_direct_throttle 0
> >>> pgscan_anon 127491077
> >>> pgscan_file 216736850
> >>> slabs_scanned 400469680
> >>> compact_migrate_scanned 1092813949
> >>> compact_free_scanned 4919523035
> >>> compact_daemon_migrate_scanned 2372223
> >>> compact_daemon_free_scanned 20989310
> >>> unevictable_pgs_scanned 307388545
> > 
> > 10G swap + 8G anon rss + 6G file rss, hmm... an interesting workload.
> > The file rss does seem a bit high to me, my wild speculation is there
> > have been git/make activities in addition to a VM?
> 
> I wish I was doing more git/make activities.  It's been an annoying
> amount of email and web browsers for 12 days.  If anything, I'd suspect
> that Thunderbird is at fault for keeping a bunch of mail in the page
> cache.  There are a couple of VM's running though.

Hi Dave,

Sorry for the late reply. Here are the benchmark results from the
worst-case scenario.

As you suggested, we create many processes that share one large
sparse shmem and access it at random 2MB-aligned offsets. Each PTE
table therefore holds at most one valid PTE, which is the worst-case
scenario for the multigenerational LRU because it is based on page
table scanning.

TL;DR: the multigenerational LRU did not perform worse than the rmap.

My test configurations:

  The size of the shmem: 256GB
  The number of processes: 450
  Total memory size: 200GB
  The number of CPUs: 64
  The number of nodes: 2

There is no clear winner in the background reclaim path (kswapd).

  kswapd (5.12.0-rc6):
    43.99%  kswapd1  page_vma_mapped_walk
    34.86%  kswapd0  page_vma_mapped_walk
     2.43%  kswapd0  count_shadow_nodes
     1.17%  kswapd1  page_referenced_one
     1.15%  kswapd0  _find_next_bit.constprop.0
     0.95%  kswapd0  page_referenced_one
     0.87%  kswapd1  try_to_unmap_one
     0.75%  kswapd0  cpumask_next
     0.67%  kswapd0  shrink_slab
     0.66%  kswapd0  down_read_trylock

  kswapd (the multigenerational LRU):
    33.39%  kswapd0  walk_pud_range
    10.93%  kswapd1  walk_pud_range
     9.36%  kswapd0  page_vma_mapped_walk
     7.15%  kswapd1  page_vma_mapped_walk
     3.83%  kswapd0  count_shadow_nodes
     2.60%  kswapd1  shrink_slab
     2.47%  kswapd1  down_read_trylock
     2.03%  kswapd0  _raw_spin_lock
     1.87%  kswapd0  shrink_slab
     1.67%  kswapd1  count_shadow_nodes

The multigenerational LRU comes out somewhat ahead in the direct
reclaim path (sparse is the test binary name):

  The test process context (5.12.0-rc6):
    65.02%  sparse   page_vma_mapped_walk
     5.49%  sparse   page_counter_try_charge
     3.60%  sparse   propagate_protected_usage
     2.31%  sparse   page_counter_uncharge
     2.06%  sparse   count_shadow_nodes
     1.81%  sparse   native_queued_spin_lock_slowpath
     1.79%  sparse   down_read_trylock
     1.67%  sparse   page_referenced_one
     1.42%  sparse   shrink_slab
     0.87%  sparse   try_to_unmap_one

  CPU % (direct reclaim vs the rest): 71% vs 29%
  # grep oom_kill /proc/vmstat
  oom_kill 81

  The test process context (the multigenerational LRU):
    33.12%  sparse   page_vma_mapped_walk
    10.70%  sparse   walk_pud_range
     9.64%  sparse   page_counter_try_charge
     6.63%  sparse   propagate_protected_usage
     4.43%  sparse   native_queued_spin_lock_slowpath
     3.85%  sparse   page_counter_uncharge
     3.71%  sparse   irqentry_exit_to_user_mode
     2.16%  sparse   _raw_spin_lock
     1.83%  sparse   unmap_page_range
     1.82%  sparse   shrink_slab

  CPU % (direct reclaim vs the rest): 47% vs 53%
  # grep oom_kill /proc/vmstat
  oom_kill 80

I also compared other numbers from /proc/vmstat. They do not provide
any insight beyond what the profiles show, so I will just omit them
here.

The following optimizations, and the stats measuring their efficacy,
explain why the multigenerational LRU did not perform worse:

  Optimization 1: take advantage of the scheduling information.
    # of active processes           270
    # of inactive processes         105

  Optimization 2: take advantage of the accessed bit on non-leaf
  PMD entries.
    # of old non-leaf PMD entries   30523335
    # of young non-leaf PMD entries 1358400

These stats are not currently exposed, but I will add them to the
debugfs interface in the next version, which is coming soon. I will
also add another optimization for Android: it reduces zigzags when
there are many single-page VMAs, i.e., it avoids returning to the PGD
table for each such VMA. Just a heads-up.

During the test, the rmap, on the other hand, had to
  1) lock each (shmem) page it scans
  2) go through five levels of page tables for each page, even though
  some of the pages share the same LCAs (lowest common ancestors)
The second cost is the worse of the two, given that I have 5 levels
of page tables configured.

Any additional benchmarks you would suggest? Thanks.

[-- Attachment #2: sparse.c --]
[-- Type: text/x-csrc, Size: 961 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/types.h>

#define NR_TASKS	450UL
#define MMAP_SIZE	(256UL << 30)

#define PMD_SIZE	(1UL << 21)
#define NR_PMDS		(MMAP_SIZE / PMD_SIZE)
#define NR_LOOPS	(NR_PMDS * 200)

int main(void)
{
	unsigned long i;
	void *start;
	pid_t pid;

	start = mmap(NULL, MMAP_SIZE, PROT_READ | PROT_WRITE,
		     MAP_ANONYMOUS | MAP_SHARED | MAP_NORESERVE, -1, 0);
	if (start == MAP_FAILED) {
		perror("mmap");
		return -1;
	}

	if (madvise(start, MMAP_SIZE, MADV_NOHUGEPAGE)) {
		perror("madvise");
		return -1;
	}

	for (i = 0; i < NR_TASKS; i++) {
		pid = fork();
		if (pid < 0) {
			perror("fork");
			return -1;
		}

		if (!pid)
			break;
	}

	pid = getpid();
	srand48(pid);

	for (i = 0; i < NR_LOOPS; i++) {
		unsigned long offset = (lrand48() % NR_PMDS) * PMD_SIZE;
		/* Cast to char * first: arithmetic on void * is a GNU extension. */
		unsigned long *addr = (unsigned long *)((char *)start + offset);

		*addr = i;
	}

	return 0;
}
