* [PATCH 0/2] mm/madvise: teach MADV_PAGEOUT about swap cache @ 2020-03-23 23:41 Dave Hansen 2020-03-23 23:41 ` [PATCH 1/2] mm/madvise: help MADV_PAGEOUT to find swap cache pages Dave Hansen 2020-03-23 23:41 ` [PATCH 2/2] mm/madvise: skip MADV_PAGEOUT on shared " Dave Hansen 0 siblings, 2 replies; 7+ messages in thread From: Dave Hansen @ 2020-03-23 23:41 UTC (permalink / raw) To: linux-kernel Cc: Dave Hansen, mhocko, jannh, vbabka, minchan, dancol, joel, akpm MADV_PAGEOUT ignores swap cache pages which are not mapped into the calling process. That's nasty because it means that some of the most easily reclaimable pages are inaccessible to a mechanism designed to reclaim pages. This all turned out to be a bit nastier and more complicated than I would have liked. This has been lightly tested, but I did pass a normal barrage of compile tests. I rigged up a little test program to try to create these situations. Basically, if the parent "reader" RSS changes in response to MADV_PAGEOUT actions in the child, there is a problem. https://www.sr71.net/~dave/intel/madv-pageout.c I'd add this to selftests, but it *requires* swap to work and its parsing of /proc/self/maps is quite icky. P.S. mincore() really doesn't work at all in this situation, probably something we need to look at too. Cc: Michal Hocko <mhocko@suse.com> Cc: Jann Horn <jannh@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Cc: Daniel Colascione <dancol@google.com> Cc: "Joel Fernandes (Google)" <joel@joelfernandes.org> Cc: Andrew Morton <akpm@linux-foundation.org> ^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH 1/2] mm/madvise: help MADV_PAGEOUT to find swap cache pages 2020-03-23 23:41 [PATCH 0/2] mm/madvise: teach MADV_PAGEOUT about swap cache Dave Hansen @ 2020-03-23 23:41 ` Dave Hansen 2020-03-26 6:24 ` Minchan Kim 2020-03-23 23:41 ` [PATCH 2/2] mm/madvise: skip MADV_PAGEOUT on shared " Dave Hansen 1 sibling, 1 reply; 7+ messages in thread From: Dave Hansen @ 2020-03-23 23:41 UTC (permalink / raw) To: linux-kernel Cc: Dave Hansen, mhocko, jannh, vbabka, minchan, dancol, joel, akpm From: Dave Hansen <dave.hansen@linux.intel.com> tl;dr: MADV_PAGEOUT ignores unmapped swap cache pages. Enable MADV_PAGEOUT to find and reclaim swap cache. The long story: Looking for another issue, I wrote a simple test which had two processes: a parent and a fork()'d child. The parent reads a memory buffer shared by the fork() and the child calls madvise(MADV_PAGEOUT) on the same buffer. The first call to MADV_PAGEOUT does what is expected: it pages the memory out and causes faults in the parent. However, after that, it does not cause any faults in the parent. MADV_PAGEOUT only works once! This was a surprise. The PTEs in the shared buffer start out pte_present()==1 in both parent and child. The first MADV_PAGEOUT operation replaces those with pte_present()==0 swap PTEs. The parent process quickly faults and recreates pte_present()==1. However, the child process (the one calling MADV_PAGEOUT) never touches the memory and has retained the non-present swap PTEs. This situation could also happen in the case where a single process had some of its data placed in the swap cache but where the memory has not yet been reclaimed. The MADV_PAGEOUT code has a pte_present()==0 check. It will essentially ignore any pte_present()==0 pages. This essentially makes unmapped swap cache immune from MADV_PAGEOUT, which is not very friendly behavior. Enable MADV_PAGEOUT to find and reclaim swap cache. Because swap cache is not pinned by holding the PTE lock, a reference must be held until the page is isolated, where a second reference is obtained. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Jann Horn <jannh@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Cc: Daniel Colascione <dancol@google.com> Cc: "Joel Fernandes (Google)" <joel@joelfernandes.org> Cc: Andrew Morton <akpm@linux-foundation.org> --- b/mm/madvise.c | 68 +++++++++++++++++++++++++++++++++++++++++++++++---------- 1 file changed, 57 insertions(+), 11 deletions(-) diff -puN mm/madvise.c~madv-pageout-find-swap-cache mm/madvise.c --- a/mm/madvise.c~madv-pageout-find-swap-cache 2020-03-23 16:30:48.505385896 -0700 +++ b/mm/madvise.c 2020-03-23 16:30:48.509385896 -0700 @@ -250,6 +250,52 @@ static void force_shm_swapin_readahead(s #endif /* CONFIG_SWAP */ /* + * Given a PTE, find the corresponding 'struct page' + * and acquire a reference. Also handles non-present + * swap PTEs. + * + * Returns NULL when there is no page to reclaim. + */ +static struct page *pte_get_reclaim_page(struct vm_area_struct *vma, + unsigned long addr, pte_t ptent) +{ + swp_entry_t entry; + struct page *page; + + /* Totally empty PTE: */ + if (pte_none(ptent)) + return NULL; + + /* Handle present or PROT_NONE ptes: */ + if (!is_swap_pte(ptent)) { + page = vm_normal_page(vma, addr, ptent); + if (page) + get_page(page); + return page; + } + + /* + * 'ptent' is now definitely a (non-present) swap + * PTE in this process. Go look for additional + * references to the swap cache. + */ + + /* + * Is it one of the "swap PTEs" that's not really + * swap? Do not try to reclaim those. + */ + entry = pte_to_swp_entry(ptent); + if (non_swap_entry(entry)) + return NULL; + + /* + * The PTE was a true swap entry. The page may be in + * the swap cache. + */ + return lookup_swap_cache(entry, vma, addr); +} + +/* * Schedule all required I/O operations. Do not wait for completion. */ static long madvise_willneed(struct vm_area_struct *vma, @@ -398,13 +444,8 @@ regular_page: for (; addr < end; pte++, addr += PAGE_SIZE) { ptent = *pte; - if (pte_none(ptent)) - continue; - - if (!pte_present(ptent)) - continue; - - page = vm_normal_page(vma, addr, ptent); + /* 'page' can be mapped, in the swap cache or both */ + page = pte_get_reclaim_page(vma, addr, ptent); if (!page) continue; @@ -413,9 +454,10 @@ regular_page: * are sure it's worth. Split it if we are only owner. */ if (PageTransCompound(page)) { - if (page_mapcount(page) != 1) + if (page_mapcount(page) != 1) { + put_page(page); break; - get_page(page); + } if (!trylock_page(page)) { put_page(page); break; @@ -436,12 +478,14 @@ regular_page: } /* Do not interfere with other mappings of this page */ - if (page_mapcount(page) != 1) + if (page_mapcount(page) != 1) { + put_page(page); continue; + } VM_BUG_ON_PAGE(PageTransCompound(page), page); - if (pte_young(ptent)) { + if (!is_swap_pte(ptent) && pte_young(ptent)) { ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm); ptent = pte_mkold(ptent); @@ -466,6 +510,8 @@ regular_page: } } else deactivate_page(page); + /* drop ref acquired in pte_get_reclaim_page() */ + put_page(page); } arch_leave_lazy_mmu_mode(); _ ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH 1/2] mm/madvise: help MADV_PAGEOUT to find swap cache pages 2020-03-23 23:41 ` [PATCH 1/2] mm/madvise: help MADV_PAGEOUT to find swap cache pages Dave Hansen @ 2020-03-26 6:24 ` Minchan Kim 0 siblings, 0 replies; 7+ messages in thread From: Minchan Kim @ 2020-03-26 6:24 UTC (permalink / raw) To: Dave Hansen; +Cc: linux-kernel, mhocko, jannh, vbabka, dancol, joel, akpm On Mon, Mar 23, 2020 at 04:41:49PM -0700, Dave Hansen wrote: > > From: Dave Hansen <dave.hansen@linux.intel.com> > > tl;dr: MADV_PAGEOUT ignores unmapped swap cache pages. Enable > MADV_PAGEOUT to find and reclaim swap cache. > > The long story: > > Looking for another issue, I wrote a simple test which had two > processes: a parent and a fork()'d child. The parent reads a > memory buffer shared by the fork() and the child calls > madvise(MADV_PAGEOUT) on the same buffer. > > The first call to MADV_PAGEOUT does what is expected: it pages > the memory out and causes faults in the parent. However, after > that, it does not cause any faults in the parent. MADV_PAGEOUT > only works once! This was a surprise. > > The PTEs in the shared buffer start out pte_present()==1 in > both parent and child. The first MADV_PAGEOUT operation replaces > those with pte_present()==0 swap PTEs. The parent process > quickly faults and recreates pte_present()==1. However, the > child process (the one calling MADV_PAGEOUT) never touches the > memory and has retained the non-present swap PTEs. > > This situation could also happen in the case where a single > process had some of its data placed in the swap cache but where > the memory has not yet been reclaimed. > > The MADV_PAGEOUT code has a pte_present()==0 check. It will > essentially ignore any pte_present()==0 pages. This essentially > makes unmapped swap cache immune from MADV_PAGEOUT, which is not > very friendly behavior. > > Enable MADV_PAGEOUT to find and reclaim swap cache. Because > swap cache is not pinned by holding the PTE lock, a reference > must be held until the page is isolated, where a second > reference is obtained. > > Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Minchan Kim <minchan@kernel.org> ^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH 2/2] mm/madvise: skip MADV_PAGEOUT on shared swap cache pages 2020-03-23 23:41 [PATCH 0/2] mm/madvise: teach MADV_PAGEOUT about swap cache Dave Hansen 2020-03-23 23:41 ` [PATCH 1/2] mm/madvise: help MADV_PAGEOUT to find swap cache pages Dave Hansen @ 2020-03-23 23:41 ` Dave Hansen 2020-03-26 6:28 ` Minchan Kim 1 sibling, 1 reply; 7+ messages in thread From: Dave Hansen @ 2020-03-23 23:41 UTC (permalink / raw) To: linux-kernel Cc: Dave Hansen, mhocko, jannh, vbabka, minchan, dancol, joel, akpm From: Dave Hansen <dave.hansen@linux.intel.com> MADV_PAGEOUT might interfere with other processes if it is allowed to reclaim pages shared with other processses. A previous patch tried to avoid this for anonymous pages which were shared by a fork(). It did this by checking page_mapcount(). That works great for mapped pages. But, it can not detect unmapped swap cache pages. This has not been a problem, until the previous patch which added the ability for MADV_PAGEOUT to *find* swap cache pages. A process doing MADV_PAGEOUT which finds an unmapped swap cache page and evicts it might interfere with another process which had the same page mapped. But, such a page would have a page_mapcount() of 1 since the page is only actually mapped in the *other* process. The page_mapcount() test would fail to detect the situation. Thankfully, there is a reference count for swap entries. To fix this, simply consult both page_mapcount() and the swap reference count via page_swapcount(). I rigged up a little test program to try to create these situations. Basically, if the parent "reader" RSS changes in response to MADV_PAGEOUT actions in the child, there is a problem. https://www.sr71.net/~dave/intel/madv-pageout.c Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Jann Horn <jannh@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Cc: Daniel Colascione <dancol@google.com> Cc: "Joel Fernandes (Google)" <joel@joelfernandes.org> Cc: Andrew Morton <akpm@linux-foundation.org> --- b/mm/madvise.c | 37 +++++++++++++++++++++++++++++-------- 1 file changed, 29 insertions(+), 8 deletions(-) diff -puN mm/madvise.c~madv-pageout-ignore-shared-swap-cache mm/madvise.c --- a/mm/madvise.c~madv-pageout-ignore-shared-swap-cache 2020-03-23 16:30:52.022385888 -0700 +++ b/mm/madvise.c 2020-03-23 16:41:15.448384333 -0700 @@ -261,6 +261,7 @@ static struct page *pte_get_reclaim_page { swp_entry_t entry; struct page *page; + int nr_page_references = 0; /* Totally empty PTE: */ if (pte_none(ptent)) @@ -271,7 +272,7 @@ static struct page *pte_get_reclaim_page page = vm_normal_page(vma, addr, ptent); if (page) get_page(page); - return page; + goto got_page; } /* @@ -292,7 +293,33 @@ static struct page *pte_get_reclaim_page * The PTE was a true swap entry. The page may be in * the swap cache. */ - return lookup_swap_cache(entry, vma, addr); + page = lookup_swap_cache(entry, vma, addr); + if (!page) + return NULL; +got_page: + /* + * Account for references to the swap entry. These + * might be "upgraded" to a normal mapping at any + * time. + */ + if (PageSwapCache(page)) + nr_page_references += page_swapcount(page); + + /* + * Account for all mappings of the page, including + * when it is in the swap cache. This ensures that + * MADV_PAGOUT not interfere with anything shared + * with another process. + */ + nr_page_references += page_mapcount(page); + + /* Any extra references? Do not reclaim it. */ + if (nr_page_references > 1) { + put_page(page); + return NULL; + } + + return page; } /* @@ -477,12 +504,6 @@ regular_page: continue; } - /* Do not interfere with other mappings of this page */ - if (page_mapcount(page) != 1) { - put_page(page); - continue; - } - VM_BUG_ON_PAGE(PageTransCompound(page), page); if (!is_swap_pte(ptent) && pte_young(ptent)) { _ ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH 2/2] mm/madvise: skip MADV_PAGEOUT on shared swap cache pages 2020-03-23 23:41 ` [PATCH 2/2] mm/madvise: skip MADV_PAGEOUT on shared " Dave Hansen @ 2020-03-26 6:28 ` Minchan Kim 2020-03-26 23:00 ` Dave Hansen 0 siblings, 1 reply; 7+ messages in thread From: Minchan Kim @ 2020-03-26 6:28 UTC (permalink / raw) To: Dave Hansen; +Cc: linux-kernel, mhocko, jannh, vbabka, dancol, joel, akpm On Mon, Mar 23, 2020 at 04:41:51PM -0700, Dave Hansen wrote: > > From: Dave Hansen <dave.hansen@linux.intel.com> > > MADV_PAGEOUT might interfere with other processes if it is > allowed to reclaim pages shared with other processses. A > previous patch tried to avoid this for anonymous pages > which were shared by a fork(). It did this by checking > page_mapcount(). > > That works great for mapped pages. But, it can not detect > unmapped swap cache pages. This has not been a problem, > until the previous patch which added the ability for > MADV_PAGEOUT to *find* swap cache pages. > > A process doing MADV_PAGEOUT which finds an unmapped swap > cache page and evicts it might interfere with another process > which had the same page mapped. But, such a page would have > a page_mapcount() of 1 since the page is only actually mapped > in the *other* process. The page_mapcount() test would fail > to detect the situation. > > Thankfully, there is a reference count for swap entries. > To fix this, simply consult both page_mapcount() and the swap > reference count via page_swapcount(). > > I rigged up a little test program to try to create these > situations. Basically, if the parent "reader" RSS changes > in response to MADV_PAGEOUT actions in the child, there is > a problem. > > https://www.sr71.net/~dave/intel/madv-pageout.c > > Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> > Cc: Michal Hocko <mhocko@suse.com> > Cc: Jann Horn <jannh@google.com> > Cc: Vlastimil Babka <vbabka@suse.cz> > Cc: Minchan Kim <minchan@kernel.org> > Cc: Daniel Colascione <dancol@google.com> > Cc: "Joel Fernandes (Google)" <joel@joelfernandes.org> > Cc: Andrew Morton <akpm@linux-foundation.org> > --- > > b/mm/madvise.c | 37 +++++++++++++++++++++++++++++-------- > 1 file changed, 29 insertions(+), 8 deletions(-) > > diff -puN mm/madvise.c~madv-pageout-ignore-shared-swap-cache mm/madvise.c > --- a/mm/madvise.c~madv-pageout-ignore-shared-swap-cache 2020-03-23 16:30:52.022385888 -0700 > +++ b/mm/madvise.c 2020-03-23 16:41:15.448384333 -0700 > @@ -261,6 +261,7 @@ static struct page *pte_get_reclaim_page > { > swp_entry_t entry; > struct page *page; > + int nr_page_references = 0; nit: just 'referenced' would be enough. Acked-by: Minchan Kim <minchan@kernel.org> Thanks, Dave! _ ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH 2/2] mm/madvise: skip MADV_PAGEOUT on shared swap cache pages 2020-03-26 6:28 ` Minchan Kim @ 2020-03-26 23:00 ` Dave Hansen 2020-03-27 6:42 ` Minchan Kim 0 siblings, 1 reply; 7+ messages in thread From: Dave Hansen @ 2020-03-26 23:00 UTC (permalink / raw) To: Minchan Kim, Dave Hansen Cc: linux-kernel, mhocko, jannh, vbabka, dancol, joel, akpm On 3/25/20 11:28 PM, Minchan Kim wrote: >> diff -puN mm/madvise.c~madv-pageout-ignore-shared-swap-cache mm/madvise.c >> --- a/mm/madvise.c~madv-pageout-ignore-shared-swap-cache 2020-03-23 16:30:52.022385888 -0700 >> +++ b/mm/madvise.c 2020-03-23 16:41:15.448384333 -0700 >> @@ -261,6 +261,7 @@ static struct page *pte_get_reclaim_page >> { >> swp_entry_t entry; >> struct page *page; >> + int nr_page_references = 0; > nit: just 'referenced' would be enough. I guess I could track one bit like that. But, it would require checking both page_mapcount() and page_swapcount() for being >1. This way, I just accumulate the count and have a check at a single place. I think it ends up much simpler this way. ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH 2/2] mm/madvise: skip MADV_PAGEOUT on shared swap cache pages 2020-03-26 23:00 ` Dave Hansen @ 2020-03-27 6:42 ` Minchan Kim 0 siblings, 0 replies; 7+ messages in thread From: Minchan Kim @ 2020-03-27 6:42 UTC (permalink / raw) To: Dave Hansen Cc: Dave Hansen, linux-kernel, mhocko, jannh, vbabka, dancol, joel, akpm On Thu, Mar 26, 2020 at 04:00:09PM -0700, Dave Hansen wrote: > On 3/25/20 11:28 PM, Minchan Kim wrote: > >> diff -puN mm/madvise.c~madv-pageout-ignore-shared-swap-cache mm/madvise.c > >> --- a/mm/madvise.c~madv-pageout-ignore-shared-swap-cache 2020-03-23 16:30:52.022385888 -0700 > >> +++ b/mm/madvise.c 2020-03-23 16:41:15.448384333 -0700 > >> @@ -261,6 +261,7 @@ static struct page *pte_get_reclaim_page > >> { > >> swp_entry_t entry; > >> struct page *page; > >> + int nr_page_references = 0; > > nit: just 'referenced' would be enough. > > I guess I could track one bit like that. But, it would require checking > both page_mapcount() and page_swapcount() for being >1. This way, I > just accumulate the count and have a check at a single place. > > I think it ends up much simpler this way. I meant the variable name. 'referenced' would be enough for indication like rmap.c and khugepaged.c. Anyway, it's up to you. ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2020-03-27 6:42 UTC | newest] Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2020-03-23 23:41 [PATCH 0/2] mm/madvise: teach MADV_PAGEOUT about swap cache Dave Hansen 2020-03-23 23:41 ` [PATCH 1/2] mm/madvise: help MADV_PAGEOUT to find swap cache pages Dave Hansen 2020-03-26 6:24 ` Minchan Kim 2020-03-23 23:41 ` [PATCH 2/2] mm/madvise: skip MADV_PAGEOUT on shared " Dave Hansen 2020-03-26 6:28 ` Minchan Kim 2020-03-26 23:00 ` Dave Hansen 2020-03-27 6:42 ` Minchan Kim
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.