* [PATCH v2 1/2] mm: make PageReadahead more strict
@ 2020-02-14 19:29 Minchan Kim
From: Minchan Kim @ 2020-02-14 19:29 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, LKML, Jan Kara, Matthew Wilcox, Josef Bacik,
Johannes Weiner, Minchan Kim
Recently, I received some bug reports that a major page fault sometimes takes several seconds. While reviewing the mmap_sem-dropping logic, I found several bugs.
CPU 1                                                CPU 2

mm_populate
 for ()
  ..
  ret = populate_vma_page_range
   __get_user_pages
    faultin_page
     handle_mm_fault
      filemap_fault
       do_async_mmap_readahead
                                                     shrink_page_list
                                                      pageout
                                                       SetPageReclaim(=SetPageReadahead)
                                                        writepage
                                                         SetPageWriteback
       if (PageReadahead(page))
        maybe_unlock_mmap_for_io
         up_read(mmap_sem)
        page_cache_async_readahead()
         if (PageWriteback(page))
          return;
Here, since the ret from populate_vma_page_range is zero, the loop continues to run with the same address as the previous iteration. It repeats until the page's writeout is done (i.e., until PG_writeback and PG_reclaim are clear).
We could fix this specific case by adding a PageWriteback check:
  ret = populate_vma_page_range
   ...
   ...
   filemap_fault
    do_async_mmap_readahead
     if (!PageWriteback(page) && PageReadahead(page))
      maybe_unlock_mmap_for_io
       up_read(mmap_sem)
      page_cache_async_readahead()
       if (PageWriteback(page))
        return;
Furthermore, to prevent potential issues caused by sharing PG_readahead with PG_reclaim, let's create dedicated page flag wrappers for PageReadahead, with a description. With that, we can remove the PageWriteback check from page_cache_async_readahead, which makes the code clearer for maintenance and readability.
Fixes: 6b4c9f446981 ("filemap: drop the mmap_sem for all blocking operations")
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
include/linux/page-flags.h | 28 ++++++++++++++++++++++++++--
mm/readahead.c | 6 ------
2 files changed, 26 insertions(+), 8 deletions(-)
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 1bf83c8fcaa7..f91a9b2a49bd 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -363,8 +363,32 @@ PAGEFLAG(MappedToDisk, mappedtodisk, PF_NO_TAIL)
/* PG_readahead is only used for reads; PG_reclaim is only for writes */
PAGEFLAG(Reclaim, reclaim, PF_NO_TAIL)
TESTCLEARFLAG(Reclaim, reclaim, PF_NO_TAIL)
-PAGEFLAG(Readahead, reclaim, PF_NO_COMPOUND)
- TESTCLEARFLAG(Readahead, reclaim, PF_NO_COMPOUND)
+
+SETPAGEFLAG(Readahead, reclaim, PF_NO_COMPOUND)
+CLEARPAGEFLAG(Readahead, reclaim, PF_NO_COMPOUND)
+
+/*
+ * Since PG_readahead is shared with PG_reclaim of the page flags,
+ * PageReadahead should double check whether it's readahead marker
+ * or PG_reclaim. It could be done by PageWriteback check because
+ * PG_reclaim is always with PG_writeback.
+ */
+static inline int PageReadahead(struct page *page)
+{
+ VM_BUG_ON_PGFLAGS(PageCompound(page), page);
+
+ return (page->flags & (1UL << PG_reclaim | 1UL << PG_writeback)) ==
+ (1UL << PG_reclaim);
+}
+
+/* Clear PG_readahead only if it's PG_readahead, not PG_reclaim */
+static inline int TestClearPageReadahead(struct page *page)
+{
+ VM_BUG_ON_PGFLAGS(PageCompound(page), page);
+
+ return !PageWriteback(page) &&
+ test_and_clear_bit(PG_reclaim, &page->flags);
+}
#ifdef CONFIG_HIGHMEM
/*
diff --git a/mm/readahead.c b/mm/readahead.c
index 2fe72cd29b47..85b15e5a1d7b 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -553,12 +553,6 @@ page_cache_async_readahead(struct address_space *mapping,
if (!ra->ra_pages)
return;
- /*
- * Same bit is used for PG_readahead and PG_reclaim.
- */
- if (PageWriteback(page))
- return;
-
ClearPageReadahead(page);
/*
--
2.25.0.265.gbab2e86ba0-goog
* [PATCH v2 2/2] mm: fix long time stall from mm_populate
From: Minchan Kim @ 2020-02-14 19:29 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, LKML, Jan Kara, Matthew Wilcox, Josef Bacik,
Johannes Weiner, Minchan Kim
Basically, the fault handler releases mmap_sem before requesting readahead, and is then supposed to retry looking up the page in the page cache with FAULT_FLAG_TRIED so that it avoids the livelock of infinite retry.

However, what happens if the fault handler finds a page in the page cache which has the readahead marker but is waiting under writeback? Add one more condition: this happens under mm_populate, which repeats faulting unless it encounters an error. Let's assemble the conditions below.
CPU 1                                                CPU 2

- first loop
mm_populate
 for ()
  ..
  ret = populate_vma_page_range
   __get_user_pages
    faultin_page
     handle_mm_fault
      filemap_fault
       do_async_mmap_readahead
        if (PageReadahead(pageA))
         maybe_unlock_mmap_for_io
          up_read(mmap_sem)
                                                     shrink_page_list
                                                      pageout
                                                       SetPageReclaim(=SetPageReadahead)(pageA)
                                                        writepage
                                                         SetPageWriteback(pageA)
         page_cache_async_readahead()
          ClearPageReadahead(pageA)
       do_async_mmap_readahead
        lock_page_maybe_drop_mmap
         goto out_retry

pageA is reclaimed and a new pageB is populated at the same file
offset, eventually becoming PG_readahead.

- second loop
  __get_user_pages
   faultin_page
    handle_mm_fault
     filemap_fault
      do_async_mmap_readahead
       if (PageReadahead(pageB))
        maybe_unlock_mmap_for_io
         up_read(mmap_sem)
                                                     shrink_page_list
                                                      pageout
                                                       SetPageReclaim(=SetPageReadahead)(pageB)
                                                        writepage
                                                         SetPageWriteback(pageB)
        page_cache_async_readahead()
         ClearPageReadahead(pageB)
      do_async_mmap_readahead
       lock_page_maybe_drop_mmap
        goto out_retry
This can repeat forever, so it is a livelock. Even without involving reclaim, it can happen if ra_pages becomes zero via fadvise from another thread sharing the same fd (one doing random access while the other is sequential), because page_cache_async_readahead has condition checks like the one below, and PageWriteback and ra_pages are never synchronized with fadvise and shrink_readahead_size_eio from other threads.
void page_cache_async_readahead(struct address_space *mapping,
				struct file_ra_state *ra, struct file *filp,
				struct page *page, pgoff_t offset,
				unsigned long req_size)
{
	/* no read-ahead */
	if (!ra->ra_pages)
		return;
Thus, we need to limit fault retries from mm_populate just as the page fault handler does.
Fixes: 6b4c9f446981 ("filemap: drop the mmap_sem for all blocking operations")
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
mm/gup.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/mm/gup.c b/mm/gup.c
index 1b521e0ac1de..6f6548c63ad5 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1133,7 +1133,7 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
*
* This takes care of mlocking the pages too if VM_LOCKED is set.
*
- * return 0 on success, negative error code on error.
+ * return number of pages pinned on success, negative error code on error.
*
* vma->vm_mm->mmap_sem must be held.
*
@@ -1196,6 +1196,7 @@ int __mm_populate(unsigned long start, unsigned long len, int ignore_errors)
struct vm_area_struct *vma = NULL;
int locked = 0;
long ret = 0;
+ bool tried = false;
end = start + len;
@@ -1226,14 +1227,18 @@ int __mm_populate(unsigned long start, unsigned long len, int ignore_errors)
* double checks the vma flags, so that it won't mlock pages
* if the vma was already munlocked.
*/
- ret = populate_vma_page_range(vma, nstart, nend, &locked);
+ ret = populate_vma_page_range(vma, nstart, nend,
+ tried ? NULL : &locked);
if (ret < 0) {
if (ignore_errors) {
ret = 0;
continue; /* continue at next VMA */
}
break;
- }
+ } else if (ret == 0)
+ tried = true;
+ else
+ tried = false;
nend = nstart + ret * PAGE_SIZE;
ret = 0;
}
--
2.25.0.265.gbab2e86ba0-goog
* Re: [PATCH v2 2/2] mm: fix long time stall from mm_populate
From: Minchan Kim @ 2020-02-21 17:50 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, LKML, Jan Kara, Matthew Wilcox, Josef Bacik, Johannes Weiner
Bumping up.