From: Barry Song <21cnbao@gmail.com>
To: yuzhao@google.com
Cc: 21cnbao@gmail.com, Hi-Angel@yandex.ru,
Michael@michaellarabel.com, ak@linux.intel.com,
akpm@linux-foundation.org, aneesh.kumar@linux.ibm.com,
axboe@kernel.dk, bgeffon@google.com, catalin.marinas@arm.com,
corbet@lwn.net, d@chaos-reins.com, dave.hansen@linux.intel.com,
djbyrne@mtu.edu, hannes@cmpxchg.org, hdanton@sina.com,
heftig@archlinux.org, holger@applied-asynchrony.com,
jsbarnes@google.com, linux-arm-kernel@lists.infradead.org,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-mm@kvack.org, mgorman@suse.de, mhocko@kernel.org,
oleksandr@natalenko.name, page-reclaim@google.com,
riel@surriel.com, rppt@kernel.org, sofia.trinh@edi.works,
steven@liquorix.net, suleiman@google.com,
szhai2@cs.rochester.edu, torvalds@linux-foundation.org,
vbabka@suse.cz, will@kernel.org, willy@infradead.org,
x86@kernel.org, ying.huang@intel.com
Subject: Re: [PATCH v7 04/12] mm: multigenerational LRU: groundwork
Date: Tue, 15 Mar 2022 12:38:12 +1300
Message-ID: <20220314233812.9011-1-21cnbao@gmail.com>
In-Reply-To: <CAOUHufbN_56UJBkgA2LjAfbTt9nzPOCHaSeS4P3GHcYst+Y+eg@mail.gmail.com>
On Tue, Mar 15, 2022 at 5:45 AM Yu Zhao <yuzhao@google.com> wrote:
>
> On Mon, Mar 14, 2022 at 5:12 AM Barry Song <21cnbao@gmail.com> wrote:
> >
> > > > > >
> > > > > > > We used to put a faulted file page in inactive, if we access it a
> > > > > > > second time, it can be promoted
> > > > > > > to active. then in recent years, we have also applied this to anon
> > > > > > > pages while kernel adds
> > > > > > > workingset protection for anon pages. so basically both anon and file
> > > > > > > pages go into the inactive
> > > > > > > list for the 1st time, if we access it for the second time, they go to
> > > > > > > the active list. if we don't access
> > > > > > > it any more, they are likely to be reclaimed as they are inactive.
> > > > > > > we do have some special fastpath for code section, executable file
> > > > > > > pages are kept on active list
> > > > > > > as long as they are accessed.
> > > > > >
> > > > > > Yes.
> > > > > >
> > > > > > > so all of the above concerns are actually not that correct?
> > > > > >
> > > > > > They are valid concerns but I don't know any popular workloads that
> > > > > > care about them.
> > > > >
> > > > > Hi Yu,
> > > > > here we can get a workload in Kim's patchset while he added workingset
> > > > > protection
> > > > > for anon pages:
> > > > > https://patchwork.kernel.org/project/linux-mm/cover/1581401993-20041-1-git-send-email-iamjoonsoo.kim@lge.com/
> > > >
> > > > Thanks. I wouldn't call that a workload because it's not a real
> > > > application. By popular workloads, I mean applications that the
> > > > majority of people actually run on phones, in cloud, etc.
> > > >
> > > > > anon pages used to go to active rather than inactive, but kim's patchset
> > > > > moved to use inactive first. then only after the anon page is accessed
> > > > > second time, it can move to active.
> > > >
> > > > Yes. To clarify, the A-bit doesn't really mean the first or second
> > > > access. It can be many accesses each time it's set.
> > > >
> > > > > "In current implementation, newly created or swap-in anonymous page is
> > > > >
> > > > > started on the active list. Growing the active list results in rebalancing
> > > > > active/inactive list so old pages on the active list are demoted to the
> > > > > inactive list. Hence, hot page on the active list isn't protected at all.
> > > > >
> > > > > Following is an example of this situation.
> > > > >
> > > > > Assume that 50 hot pages on active list and system can contain total
> > > > > 100 pages. Numbers denote the number of pages on active/inactive
> > > > > list (active | inactive). (h) stands for hot pages and (uo) stands for
> > > > > used-once pages.
> > > > >
> > > > > 1. 50 hot pages on active list
> > > > > 50(h) | 0
> > > > >
> > > > > 2. workload: 50 newly created (used-once) pages
> > > > > 50(uo) | 50(h)
> > > > >
> > > > > 3. workload: another 50 newly created (used-once) pages
> > > > > 50(uo) | 50(uo), swap-out 50(h)
> > > > >
> > > > > As we can see, hot pages are swapped-out and it would cause swap-in later."
> > > > >
> > > > > Is MGLRU able to avoid the swap-out of the 50 hot pages?
> > > >
> > > > I think the real question is why the 50 hot pages can be moved to the
> > > > inactive list. If they are really hot, the A-bit should protect them.
> > >
> > > This is a good question.
> > >
> > > I guess it is probably because the current lru is trying to maintain a balance
> > > between the sizes of active and inactive lists. Thus, it can shrink active list
> > > even though pages might be still "hot" but not the recently accessed ones.
> > >
> > > 1. 50 hot pages on active list
> > > 50(h) | 0
> > >
> > > 2. workload: 50 newly created (used-once) pages
> > > 50(uo) | 50(h)
> > >
> > > 3. workload: another 50 newly created (used-once) pages
> > > 50(uo) | 50(uo), swap-out 50(h)
> > >
> > > the old kernel without anon workingset protection put workload 2 on active, so
> > > pushed 50 hot pages from active to inactive. workload 3 would further contribute
> > > to evict the 50 hot pages.
> > >
> > > it seems mglru doesn't demote pages from the youngest generation to older
> > > generation only in order to balance the list size? so mglru is probably safe
> > > in these cases.
> > >
> > > I will run some tests mentioned in Kim's patchset and report the result to you
> > > afterwards.
> > >
> >
> > Hi Yu,
> > I did find putting faulted pages to the youngest generation lead to some
> > regression in the case ebizzy Kim's patchset mentioned while he tried
> > to support workingset protection for anon pages.
> > i did a little bit modification for rand_chunk() which is probably similar
> > with the modifcation() Kim mentioned in his patchset. The modification
> > can be found here:
> > https://github.com/21cnbao/ltp/commit/7134413d747bfa9ef
> >
> > The test env is a x86 machine in which I have set memory size to 2.5GB and
> > set zRAM to 2GB and disabled external disk swap.
> >
> > with the vanilla kernel:
> > \time -v ./a.out -vv -t 4 -s 209715200 -S 200000
> >
> > so we have 10 chunks and 4 threads, each trunk is 209715200(200MB)
> >
> > typical result:
> > Command being timed: "./a.out -vv -t 4 -s 209715200 -S 200000"
> > User time (seconds): 36.19
> > System time (seconds): 229.72
> > Percent of CPU this job got: 371%
> > Elapsed (wall clock) time (h:mm:ss or m:ss): 1:11.59
> > Average shared text size (kbytes): 0
> > Average unshared data size (kbytes): 0
> > Average stack size (kbytes): 0
> > Average total size (kbytes): 0
> > Maximum resident set size (kbytes): 2166196
> > Average resident set size (kbytes): 0
> > Major (requiring I/O) page faults: 9990128
> > Minor (reclaiming a frame) page faults: 33315945
> > Voluntary context switches: 59144
> > Involuntary context switches: 167754
> > Swaps: 0
> > File system inputs: 2760
> > File system outputs: 8
> > Socket messages sent: 0
> > Socket messages received: 0
> > Signals delivered: 0
> > Page size (bytes): 4096
> > Exit status: 0
> >
> > with gen_lru and lru_gen/enabled=0x3:
> > typical result:
> > Command being timed: "./a.out -vv -t 4 -s 209715200 -S 200000"
> > User time (seconds): 36.34
> > System time (seconds): 276.07
> > Percent of CPU this job got: 378%
> > Elapsed (wall clock) time (h:mm:ss or m:ss): 1:22.46
> > **** 15% time +
> > Average shared text size (kbytes): 0
> > Average unshared data size (kbytes): 0
> > Average stack size (kbytes): 0
> > Average total size (kbytes): 0
> > Maximum resident set size (kbytes): 2168120
> > Average resident set size (kbytes): 0
> > Major (requiring I/O) page faults: 13362810
> > ***** 30% page fault +
> > Minor (reclaiming a frame) page faults: 33394617
> > Voluntary context switches: 55216
> > Involuntary context switches: 137220
> > Swaps: 0
> > File system inputs: 4088
> > File system outputs: 8
> > Socket messages sent: 0
> > Socket messages received: 0
> > Signals delivered: 0
> > Page size (bytes): 4096
> > Exit status: 0
> >
> > with gen_lru and lru_gen/enabled=0x7:
> > typical result:
> > Command being timed: "./a.out -vv -t 4 -s 209715200 -S 200000"
> > User time (seconds): 36.13
> > System time (seconds): 251.71
> > Percent of CPU this job got: 378%
> > Elapsed (wall clock) time (h:mm:ss or m:ss): 1:16.00
> > *****better than enabled=0x3, worse than vanilla
> > Average shared text size (kbytes): 0
> > Average unshared data size (kbytes): 0
> > Average stack size (kbytes): 0
> > Average total size (kbytes): 0
> > Maximum resident set size (kbytes): 2120988
> > Average resident set size (kbytes): 0
> > Major (requiring I/O) page faults: 12706512
> > Minor (reclaiming a frame) page faults: 33422243
> > Voluntary context switches: 49485
> > Involuntary context switches: 126765
> > Swaps: 0
> > File system inputs: 2976
> > File system outputs: 8
> > Socket messages sent: 0
> > Socket messages received: 0
> > Signals delivered: 0
> > Page size (bytes): 4096
> > Exit status: 0
> >
> > I can also reproduce the problem on arm64.
> >
> > I am not saying this is going to block mglru from being mainlined. But I am
> > still curious if this is an issue worth being addressed somehow in mglru.
>
> You've missed something very important: *thoughput* :)
>
Nope :-)
In the test case there are 4 threads, each searching for a key in 10 chunks
of memory; each chunk is 200MB. rand_chunk() hands the threads a "random"
chunk index to search, but the sequence is skewed: chunk 2 is the hottest,
and chunks 3, 4 and 7 are hotter than the rest.
static inline unsigned int rand_chunk(void)
{
	/* simulate hot and cold chunks */
	unsigned int rand[16] = {2, 2, 3, 4, 5, 2, 6, 7, 9, 2, 8, 3, 7, 2, 2, 4};
	static int nr = 0;

	return rand[nr++ % 16];
}
Each thread runs search_mem():

static unsigned int search_mem(void)
{
	record_t key, *found;
	record_t *src, *copy;
	unsigned int chunk;
	size_t copy_size = chunk_size;
	unsigned int i;
	unsigned int state = 0;

	/* run 160 loops or till timeout */
	for (i = 0; threads_go == 1 && i < 160; i++) {
		chunk = rand_chunk();
		src = mem[chunk];
		...
		copy = alloc_mem(copy_size);
		...
			memcpy(copy, src, copy_size);

			key = rand_num(copy_size / record_size, &state);
			found = bsearch(&key, copy, copy_size / record_size,
					record_size, compare);

			/* Below check is mainly for memory corruption or other bug */
			if (found == NULL) {
				fprintf(stderr, "Couldn't find key %zd\n", key);
				exit(1);
			}
		}		/* end if ! touch_pages */

		free_mem(copy, copy_size);
	}

	return (i);
}
Each thread picks a chunk, allocates a new buffer of the same size, copies the
chunk into it, and then searches for a key in the copy.
Because I set the timeout to a very large value with -S, each thread actually
exits once it has completed its 160 loops:
$ \time -v ./ebizzy -t 4 -s $((200*1024*1024)) -S 6000000
So whichever kernel finishes the whole job sooner also wins on throughput.
> Dollars to doughnuts there was a large increase in throughput -- I
> haven't tried this benchmark but I've seen many reports similar to
> this one.
I have no doubt about that. I am just trying to find out what further
improvement is possible in mglru.
Thanks,
Barry