* [PATCH 0/2] mm/madvise: teach MADV_PAGEOUT about swap cache
@ 2020-03-23 23:41 Dave Hansen
2020-03-23 23:41 ` [PATCH 1/2] mm/madvise: help MADV_PAGEOUT to find swap cache pages Dave Hansen
2020-03-23 23:41 ` [PATCH 2/2] mm/madvise: skip MADV_PAGEOUT on shared " Dave Hansen
0 siblings, 2 replies; 7+ messages in thread
From: Dave Hansen @ 2020-03-23 23:41 UTC (permalink / raw)
To: linux-kernel
Cc: Dave Hansen, mhocko, jannh, vbabka, minchan, dancol, joel, akpm
MADV_PAGEOUT ignores swap cache pages which are not mapped
into the calling process. That's nasty because it means that
some of the most easily reclaimable pages are inaccessible to
a mechanism designed to reclaim pages.
This all turned out to be a bit nastier and more complicated
than I would have liked. This has been lightly tested, but I
did pass a normal barrage of compile tests.
I rigged up a little test program to try to create these
situations. Basically, if the parent "reader" RSS changes
in response to MADV_PAGEOUT actions in the child, there is
a problem.
https://www.sr71.net/~dave/intel/madv-pageout.c
I'd add this to selftests, but it *requires* swap to work
and its parsing of /proc/self/maps is quite icky.
P.S. mincore() really doesn't work at all in this situation,
probably something we need to look at too.
Cc: Michal Hocko <mhocko@suse.com>
Cc: Jann Horn <jannh@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Daniel Colascione <dancol@google.com>
Cc: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH 1/2] mm/madvise: help MADV_PAGEOUT to find swap cache pages
2020-03-23 23:41 [PATCH 0/2] mm/madvise: teach MADV_PAGEOUT about swap cache Dave Hansen
@ 2020-03-23 23:41 ` Dave Hansen
2020-03-26 6:24 ` Minchan Kim
2020-03-23 23:41 ` [PATCH 2/2] mm/madvise: skip MADV_PAGEOUT on shared " Dave Hansen
1 sibling, 1 reply; 7+ messages in thread
From: Dave Hansen @ 2020-03-23 23:41 UTC (permalink / raw)
To: linux-kernel
Cc: Dave Hansen, mhocko, jannh, vbabka, minchan, dancol, joel, akpm
From: Dave Hansen <dave.hansen@linux.intel.com>
tl;dr: MADV_PAGEOUT ignores unmapped swap cache pages. Enable
MADV_PAGEOUT to find and reclaim swap cache.
The long story:
Looking for another issue, I wrote a simple test which had two
processes: a parent and a fork()'d child. The parent reads a
memory buffer shared by the fork() and the child calls
madvise(MADV_PAGEOUT) on the same buffer.
The first call to MADV_PAGEOUT does what is expected: it pages
the memory out and causes faults in the parent. However, after
that, it does not cause any faults in the parent. MADV_PAGEOUT
only works once! This was a surprise.
The PTEs in the shared buffer start out pte_present()==1 in
both parent and child. The first MADV_PAGEOUT operation replaces
those with pte_present()==0 swap PTEs. The parent process
quickly faults and recreates pte_present()==1. However, the
child process (the one calling MADV_PAGEOUT) never touches the
memory and has retained the non-present swap PTEs.
This situation could also happen in the case where a single
process had some of its data placed in the swap cache but where
the memory has not yet been reclaimed.
The MADV_PAGEOUT code has a pte_present()==0 check. It will
essentially ignore any pte_present()==0 pages. This essentially
makes unmapped swap cache immune from MADV_PAGEOUT, which is not
very friendly behavior.
Enable MADV_PAGEOUT to find and reclaim swap cache. Because
swap cache is not pinned by holding the PTE lock, a reference
must be held until the page is isolated, where a second
reference is obtained.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Jann Horn <jannh@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Daniel Colascione <dancol@google.com>
Cc: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
b/mm/madvise.c | 68 +++++++++++++++++++++++++++++++++++++++++++++++----------
1 file changed, 57 insertions(+), 11 deletions(-)
diff -puN mm/madvise.c~madv-pageout-find-swap-cache mm/madvise.c
--- a/mm/madvise.c~madv-pageout-find-swap-cache 2020-03-23 16:30:48.505385896 -0700
+++ b/mm/madvise.c 2020-03-23 16:30:48.509385896 -0700
@@ -250,6 +250,52 @@ static void force_shm_swapin_readahead(s
#endif /* CONFIG_SWAP */
/*
+ * Given a PTE, find the corresponding 'struct page'
+ * and acquire a reference. Also handles non-present
+ * swap PTEs.
+ *
+ * Returns NULL when there is no page to reclaim.
+ */
+static struct page *pte_get_reclaim_page(struct vm_area_struct *vma,
+ unsigned long addr, pte_t ptent)
+{
+ swp_entry_t entry;
+ struct page *page;
+
+ /* Totally empty PTE: */
+ if (pte_none(ptent))
+ return NULL;
+
+ /* Handle present or PROT_NONE ptes: */
+ if (!is_swap_pte(ptent)) {
+ page = vm_normal_page(vma, addr, ptent);
+ if (page)
+ get_page(page);
+ return page;
+ }
+
+ /*
+ * 'ptent' is now definitely a (non-present) swap
+ * PTE in this process. Go look for additional
+ * references to the swap cache.
+ */
+
+ /*
+ * Is it one of the "swap PTEs" that's not really
+ * swap? Do not try to reclaim those.
+ */
+ entry = pte_to_swp_entry(ptent);
+ if (non_swap_entry(entry))
+ return NULL;
+
+ /*
+ * The PTE was a true swap entry. The page may be in
+ * the swap cache.
+ */
+ return lookup_swap_cache(entry, vma, addr);
+}
+
+/*
* Schedule all required I/O operations. Do not wait for completion.
*/
static long madvise_willneed(struct vm_area_struct *vma,
@@ -398,13 +444,8 @@ regular_page:
for (; addr < end; pte++, addr += PAGE_SIZE) {
ptent = *pte;
- if (pte_none(ptent))
- continue;
-
- if (!pte_present(ptent))
- continue;
-
- page = vm_normal_page(vma, addr, ptent);
+ /* 'page' can be mapped, in the swap cache or both */
+ page = pte_get_reclaim_page(vma, addr, ptent);
if (!page)
continue;
@@ -413,9 +454,10 @@ regular_page:
* are sure it's worth. Split it if we are only owner.
*/
if (PageTransCompound(page)) {
- if (page_mapcount(page) != 1)
+ if (page_mapcount(page) != 1) {
+ put_page(page);
break;
- get_page(page);
+ }
if (!trylock_page(page)) {
put_page(page);
break;
@@ -436,12 +478,14 @@ regular_page:
}
/* Do not interfere with other mappings of this page */
- if (page_mapcount(page) != 1)
+ if (page_mapcount(page) != 1) {
+ put_page(page);
continue;
+ }
VM_BUG_ON_PAGE(PageTransCompound(page), page);
- if (pte_young(ptent)) {
+ if (!is_swap_pte(ptent) && pte_young(ptent)) {
ptent = ptep_get_and_clear_full(mm, addr, pte,
tlb->fullmm);
ptent = pte_mkold(ptent);
@@ -466,6 +510,8 @@ regular_page:
}
} else
deactivate_page(page);
+ /* drop ref acquired in pte_get_reclaim_page() */
+ put_page(page);
}
arch_leave_lazy_mmu_mode();
_
^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH 2/2] mm/madvise: skip MADV_PAGEOUT on shared swap cache pages
2020-03-23 23:41 [PATCH 0/2] mm/madvise: teach MADV_PAGEOUT about swap cache Dave Hansen
2020-03-23 23:41 ` [PATCH 1/2] mm/madvise: help MADV_PAGEOUT to find swap cache pages Dave Hansen
@ 2020-03-23 23:41 ` Dave Hansen
2020-03-26 6:28 ` Minchan Kim
1 sibling, 1 reply; 7+ messages in thread
From: Dave Hansen @ 2020-03-23 23:41 UTC (permalink / raw)
To: linux-kernel
Cc: Dave Hansen, mhocko, jannh, vbabka, minchan, dancol, joel, akpm
From: Dave Hansen <dave.hansen@linux.intel.com>
MADV_PAGEOUT might interfere with other processes if it is
allowed to reclaim pages shared with other processses. A
previous patch tried to avoid this for anonymous pages
which were shared by a fork(). It did this by checking
page_mapcount().
That works great for mapped pages. But, it can not detect
unmapped swap cache pages. This has not been a problem,
until the previous patch which added the ability for
MADV_PAGEOUT to *find* swap cache pages.
A process doing MADV_PAGEOUT which finds an unmapped swap
cache page and evicts it might interfere with another process
which had the same page mapped. But, such a page would have
a page_mapcount() of 1 since the page is only actually mapped
in the *other* process. The page_mapcount() test would fail
to detect the situation.
Thankfully, there is a reference count for swap entries.
To fix this, simply consult both page_mapcount() and the swap
reference count via page_swapcount().
I rigged up a little test program to try to create these
situations. Basically, if the parent "reader" RSS changes
in response to MADV_PAGEOUT actions in the child, there is
a problem.
https://www.sr71.net/~dave/intel/madv-pageout.c
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Jann Horn <jannh@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Daniel Colascione <dancol@google.com>
Cc: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
b/mm/madvise.c | 37 +++++++++++++++++++++++++++++--------
1 file changed, 29 insertions(+), 8 deletions(-)
diff -puN mm/madvise.c~madv-pageout-ignore-shared-swap-cache mm/madvise.c
--- a/mm/madvise.c~madv-pageout-ignore-shared-swap-cache 2020-03-23 16:30:52.022385888 -0700
+++ b/mm/madvise.c 2020-03-23 16:41:15.448384333 -0700
@@ -261,6 +261,7 @@ static struct page *pte_get_reclaim_page
{
swp_entry_t entry;
struct page *page;
+ int nr_page_references = 0;
/* Totally empty PTE: */
if (pte_none(ptent))
@@ -271,7 +272,7 @@ static struct page *pte_get_reclaim_page
page = vm_normal_page(vma, addr, ptent);
if (page)
get_page(page);
- return page;
+ goto got_page;
}
/*
@@ -292,7 +293,33 @@ static struct page *pte_get_reclaim_page
* The PTE was a true swap entry. The page may be in
* the swap cache.
*/
- return lookup_swap_cache(entry, vma, addr);
+ page = lookup_swap_cache(entry, vma, addr);
+ if (!page)
+ return NULL;
+got_page:
+ /*
+ * Account for references to the swap entry. These
+ * might be "upgraded" to a normal mapping at any
+ * time.
+ */
+ if (PageSwapCache(page))
+ nr_page_references += page_swapcount(page);
+
+ /*
+ * Account for all mappings of the page, including
+ * when it is in the swap cache. This ensures that
+ * MADV_PAGOUT not interfere with anything shared
+ * with another process.
+ */
+ nr_page_references += page_mapcount(page);
+
+ /* Any extra references? Do not reclaim it. */
+ if (nr_page_references > 1) {
+ put_page(page);
+ return NULL;
+ }
+
+ return page;
}
/*
@@ -477,12 +504,6 @@ regular_page:
continue;
}
- /* Do not interfere with other mappings of this page */
- if (page_mapcount(page) != 1) {
- put_page(page);
- continue;
- }
-
VM_BUG_ON_PAGE(PageTransCompound(page), page);
if (!is_swap_pte(ptent) && pte_young(ptent)) {
_
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH 1/2] mm/madvise: help MADV_PAGEOUT to find swap cache pages
2020-03-23 23:41 ` [PATCH 1/2] mm/madvise: help MADV_PAGEOUT to find swap cache pages Dave Hansen
@ 2020-03-26 6:24 ` Minchan Kim
0 siblings, 0 replies; 7+ messages in thread
From: Minchan Kim @ 2020-03-26 6:24 UTC (permalink / raw)
To: Dave Hansen; +Cc: linux-kernel, mhocko, jannh, vbabka, dancol, joel, akpm
On Mon, Mar 23, 2020 at 04:41:49PM -0700, Dave Hansen wrote:
>
> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> tl;dr: MADV_PAGEOUT ignores unmapped swap cache pages. Enable
> MADV_PAGEOUT to find and reclaim swap cache.
>
> The long story:
>
> Looking for another issue, I wrote a simple test which had two
> processes: a parent and a fork()'d child. The parent reads a
> memory buffer shared by the fork() and the child calls
> madvise(MADV_PAGEOUT) on the same buffer.
>
> The first call to MADV_PAGEOUT does what is expected: it pages
> the memory out and causes faults in the parent. However, after
> that, it does not cause any faults in the parent. MADV_PAGEOUT
> only works once! This was a surprise.
>
> The PTEs in the shared buffer start out pte_present()==1 in
> both parent and child. The first MADV_PAGEOUT operation replaces
> those with pte_present()==0 swap PTEs. The parent process
> quickly faults and recreates pte_present()==1. However, the
> child process (the one calling MADV_PAGEOUT) never touches the
> memory and has retained the non-present swap PTEs.
>
> This situation could also happen in the case where a single
> process had some of its data placed in the swap cache but where
> the memory has not yet been reclaimed.
>
> The MADV_PAGEOUT code has a pte_present()==0 check. It will
> essentially ignore any pte_present()==0 pages. This essentially
> makes unmapped swap cache immune from MADV_PAGEOUT, which is not
> very friendly behavior.
>
> Enable MADV_PAGEOUT to find and reclaim swap cache. Because
> swap cache is not pinned by holding the PTE lock, a reference
> must be held until the page is isolated, where a second
> reference is obtained.
>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Minchan Kim <minchan@kernel.org>
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH 2/2] mm/madvise: skip MADV_PAGEOUT on shared swap cache pages
2020-03-23 23:41 ` [PATCH 2/2] mm/madvise: skip MADV_PAGEOUT on shared " Dave Hansen
@ 2020-03-26 6:28 ` Minchan Kim
2020-03-26 23:00 ` Dave Hansen
0 siblings, 1 reply; 7+ messages in thread
From: Minchan Kim @ 2020-03-26 6:28 UTC (permalink / raw)
To: Dave Hansen; +Cc: linux-kernel, mhocko, jannh, vbabka, dancol, joel, akpm
On Mon, Mar 23, 2020 at 04:41:51PM -0700, Dave Hansen wrote:
>
> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> MADV_PAGEOUT might interfere with other processes if it is
> allowed to reclaim pages shared with other processses. A
> previous patch tried to avoid this for anonymous pages
> which were shared by a fork(). It did this by checking
> page_mapcount().
>
> That works great for mapped pages. But, it can not detect
> unmapped swap cache pages. This has not been a problem,
> until the previous patch which added the ability for
> MADV_PAGEOUT to *find* swap cache pages.
>
> A process doing MADV_PAGEOUT which finds an unmapped swap
> cache page and evicts it might interfere with another process
> which had the same page mapped. But, such a page would have
> a page_mapcount() of 1 since the page is only actually mapped
> in the *other* process. The page_mapcount() test would fail
> to detect the situation.
>
> Thankfully, there is a reference count for swap entries.
> To fix this, simply consult both page_mapcount() and the swap
> reference count via page_swapcount().
>
> I rigged up a little test program to try to create these
> situations. Basically, if the parent "reader" RSS changes
> in response to MADV_PAGEOUT actions in the child, there is
> a problem.
>
> https://www.sr71.net/~dave/intel/madv-pageout.c
>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Jann Horn <jannh@google.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Daniel Colascione <dancol@google.com>
> Cc: "Joel Fernandes (Google)" <joel@joelfernandes.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> ---
>
> b/mm/madvise.c | 37 +++++++++++++++++++++++++++++--------
> 1 file changed, 29 insertions(+), 8 deletions(-)
>
> diff -puN mm/madvise.c~madv-pageout-ignore-shared-swap-cache mm/madvise.c
> --- a/mm/madvise.c~madv-pageout-ignore-shared-swap-cache 2020-03-23 16:30:52.022385888 -0700
> +++ b/mm/madvise.c 2020-03-23 16:41:15.448384333 -0700
> @@ -261,6 +261,7 @@ static struct page *pte_get_reclaim_page
> {
> swp_entry_t entry;
> struct page *page;
> + int nr_page_references = 0;
nit: just 'referenced' would be enough.
Acked-by: Minchan Kim <minchan@kernel.org>
Thanks, Dave!
_
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH 2/2] mm/madvise: skip MADV_PAGEOUT on shared swap cache pages
2020-03-26 6:28 ` Minchan Kim
@ 2020-03-26 23:00 ` Dave Hansen
2020-03-27 6:42 ` Minchan Kim
0 siblings, 1 reply; 7+ messages in thread
From: Dave Hansen @ 2020-03-26 23:00 UTC (permalink / raw)
To: Minchan Kim, Dave Hansen
Cc: linux-kernel, mhocko, jannh, vbabka, dancol, joel, akpm
On 3/25/20 11:28 PM, Minchan Kim wrote:
>> diff -puN mm/madvise.c~madv-pageout-ignore-shared-swap-cache mm/madvise.c
>> --- a/mm/madvise.c~madv-pageout-ignore-shared-swap-cache 2020-03-23 16:30:52.022385888 -0700
>> +++ b/mm/madvise.c 2020-03-23 16:41:15.448384333 -0700
>> @@ -261,6 +261,7 @@ static struct page *pte_get_reclaim_page
>> {
>> swp_entry_t entry;
>> struct page *page;
>> + int nr_page_references = 0;
> nit: just 'referenced' would be enough.
I guess I could track one bit like that. But, it would require checking
both page_mapcount() and page_swapcount() for being >1. This way, I
just accumulate the count and have a check at a single place.
I think it ends up much simpler this way.
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH 2/2] mm/madvise: skip MADV_PAGEOUT on shared swap cache pages
2020-03-26 23:00 ` Dave Hansen
@ 2020-03-27 6:42 ` Minchan Kim
0 siblings, 0 replies; 7+ messages in thread
From: Minchan Kim @ 2020-03-27 6:42 UTC (permalink / raw)
To: Dave Hansen
Cc: Dave Hansen, linux-kernel, mhocko, jannh, vbabka, dancol, joel, akpm
On Thu, Mar 26, 2020 at 04:00:09PM -0700, Dave Hansen wrote:
> On 3/25/20 11:28 PM, Minchan Kim wrote:
> >> diff -puN mm/madvise.c~madv-pageout-ignore-shared-swap-cache mm/madvise.c
> >> --- a/mm/madvise.c~madv-pageout-ignore-shared-swap-cache 2020-03-23 16:30:52.022385888 -0700
> >> +++ b/mm/madvise.c 2020-03-23 16:41:15.448384333 -0700
> >> @@ -261,6 +261,7 @@ static struct page *pte_get_reclaim_page
> >> {
> >> swp_entry_t entry;
> >> struct page *page;
> >> + int nr_page_references = 0;
> > nit: just 'referenced' would be enough.
>
> I guess I could track one bit like that. But, it would require checking
> both page_mapcount() and page_swapcount() for being >1. This way, I
> just accumulate the count and have a check at a single place.
>
> I think it ends up much simpler this way.
I meant the variable name. 'referenced' would be enough for indication
like rmap.c and khugepaged.c. Anyway, it's up to you.
^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2020-03-27 6:42 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-23 23:41 [PATCH 0/2] mm/madvise: teach MADV_PAGEOUT about swap cache Dave Hansen
2020-03-23 23:41 ` [PATCH 1/2] mm/madvise: help MADV_PAGEOUT to find swap cache pages Dave Hansen
2020-03-26 6:24 ` Minchan Kim
2020-03-23 23:41 ` [PATCH 2/2] mm/madvise: skip MADV_PAGEOUT on shared " Dave Hansen
2020-03-26 6:28 ` Minchan Kim
2020-03-26 23:00 ` Dave Hansen
2020-03-27 6:42 ` Minchan Kim
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.