* [PATCH 1/3] mm: Don't bother dropping mmap_sem for zero size readahead

From: Minchan Kim @ 2020-02-12 22:16 UTC
  To: Andrew Morton
  Cc: linux-mm, LKML, Jan Kara, Matthew Wilcox, Josef Bacik,
      Johannes Weiner, Minchan Kim, Robert Stupp

From: Jan Kara <jack@suse.cz>

When handling a page fault, we drop mmap_sem to start async readahead
so that we don't block on IO submission with mmap_sem held. However,
there is no point in dropping mmap_sem when readahead is disabled.
Handle that case to avoid pointlessly dropping mmap_sem and retrying
the fault. This was actually reported to block mlockall(MCL_CURRENT)
indefinitely.

Fixes: 6b4c9f446981 ("filemap: drop the mmap_sem for all blocking operations")
Reported-by: Minchan Kim <minchan@kernel.org>
Reported-by: Robert Stupp <snazy@gmx.de>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/filemap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 1784478270e1..5bffaa2176cd 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2416,7 +2416,7 @@ static struct file *do_async_mmap_readahead(struct vm_fault *vmf,
 	pgoff_t offset = vmf->pgoff;

 	/* If we don't want any read-ahead, don't bother */
-	if (vmf->vma->vm_flags & VM_RAND_READ)
+	if (vmf->vma->vm_flags & VM_RAND_READ || !ra->ra_pages)
 		return fpin;
 	if (ra->mmap_miss > 0)
 		ra->mmap_miss--;
--
2.25.0.225.g125e21ebc7-goog
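[Editor's note] The one-line change above is easiest to read as a gating predicate. Here is a rough userspace sketch of the patched condition; the `VM_RAND_READ` value and the struct are simplified stand-ins for illustration, not the real kernel definitions:

```c
#include <assert.h>
#include <stdbool.h>

#define VM_RAND_READ 0x1	/* illustrative value, not the kernel's */

struct file_ra_state {
	unsigned int ra_pages;	/* max readahead window, in pages */
};

/*
 * Mirrors the patched check in do_async_mmap_readahead(): skip async
 * readahead (and therefore never drop mmap_sem) when the VMA asks for
 * random access OR readahead is disabled entirely (ra_pages == 0).
 */
static bool skip_async_readahead(unsigned long vm_flags,
				 const struct file_ra_state *ra)
{
	return (vm_flags & VM_RAND_READ) || ra->ra_pages == 0;
}
```

With `ra_pages == 0` the old code fell through, dropped mmap_sem, and returned VM_FAULT_RETRY for no benefit; the new condition short-circuits that path.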
* [PATCH 2/3] mm: fix long time stall from mm_populate

From: Minchan Kim @ 2020-02-12 22:16 UTC
  To: Andrew Morton
  Cc: linux-mm, LKML, Jan Kara, Matthew Wilcox, Josef Bacik,
      Johannes Weiner, Minchan Kim, Robert Stupp

Basically, the fault handler releases mmap_sem before requesting
readahead, and is then supposed to retry the page cache lookup with
FAULT_FLAG_TRIED so that it avoids a livelock of infinite retries.

However, what happens if the fault handler finds a page in the page
cache which carries the readahead marker but is still under writeback?
Add one more condition: this happens under mm_populate, which repeats
faulting until it encounters an error. Putting the conditions
together:

  __mm_populate
    for (...)
      get_user_pages(faulty_address)
        handle_mm_fault
          filemap_fault
            finds a page in the page cache
              (PG_uptodate|PG_readahead|PG_writeback)
            it will return VM_FAULT_RETRY
      continue with faulty_address

IOW, the fault retry logic repeats until the page is finally written
back, which in the long run causes a big latency spike of several
seconds. This patch solves the issue by turning off the fault retry
logic on the second trial.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/gup.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 1b521e0ac1de..b3f825092abf 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1196,6 +1196,7 @@ int __mm_populate(unsigned long start, unsigned long len, int ignore_errors)
 	struct vm_area_struct *vma = NULL;
 	int locked = 0;
 	long ret = 0;
+	bool tried = false;

 	end = start + len;

@@ -1226,14 +1227,18 @@ int __mm_populate(unsigned long start, unsigned long len, int ignore_errors)
 		 * double checks the vma flags, so that it won't mlock pages
 		 * if the vma was already munlocked.
 		 */
-		ret = populate_vma_page_range(vma, nstart, nend, &locked);
+		ret = populate_vma_page_range(vma, nstart, nend,
+					      tried ? NULL : &locked);
 		if (ret < 0) {
 			if (ignore_errors) {
 				ret = 0;
 				continue;	/* continue at next VMA */
 			}
 			break;
-		}
+		} else if (ret == 0)
+			tried = true;
+		else
+			tried = false;
 		nend = nstart + ret * PAGE_SIZE;
 		ret = 0;
 	}
--
2.25.0.225.g125e21ebc7-goog
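[Editor's note] To see why disabling retry on the second attempt breaks the loop, here is a minimal userspace simulation. The helper names and return convention are invented for illustration; only the `tried`/`locked` dance mimics the patch:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Stand-in for populate_vma_page_range(): when a retry pointer is
 * supplied, pretend the page is stuck in PG_readahead|PG_writeback
 * state and report no progress (the VM_FAULT_RETRY path); when it is
 * NULL, the fault handler must wait and therefore makes progress.
 */
static long fake_populate(int *locked)
{
	return locked ? 0 : 1;	/* pages populated this round */
}

/*
 * Mirrors the patch's policy: after a zero-progress round, the next
 * round passes NULL so the fault cannot request yet another retry.
 */
static int populate_with_backoff(long want, int max_rounds)
{
	int locked = 1;
	int tried = 0;
	long done = 0;

	while (done < want && max_rounds-- > 0) {
		long ret = fake_populate(tried ? NULL : &locked);

		if (ret == 0)
			tried = 1;	/* no progress: forbid retry next time */
		else {
			tried = 0;
			done += ret;
		}
	}
	return done >= want;
}
```

Without the `tried` flip, this simulation would spin forever on the retry path, which is exactly the multi-second stall the changelog describes.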
* Re: [PATCH 2/3] mm: fix long time stall from mm_populate

From: Andrew Morton @ 2020-04-21  4:12 UTC
  To: Minchan Kim
  Cc: linux-mm, LKML, Jan Kara, Matthew Wilcox, Josef Bacik,
      Johannes Weiner, Robert Stupp

On Wed, 12 Feb 2020 14:16:13 -0800 Minchan Kim <minchan@kernel.org> wrote:

> Basically, the fault handler releases mmap_sem before requesting
> readahead, and is then supposed to retry the page cache lookup with
> FAULT_FLAG_TRIED so that it avoids a livelock of infinite retries.
>
> However, what happens if the fault handler finds a page in the page
> cache which carries the readahead marker but is still under
> writeback? Add one more condition: this happens under mm_populate,
> which repeats faulting until it encounters an error. Putting the
> conditions together:
>
>   __mm_populate
>     for (...)
>       get_user_pages(faulty_address)
>         handle_mm_fault
>           filemap_fault
>             finds a page in the page cache
>               (PG_uptodate|PG_readahead|PG_writeback)
>             it will return VM_FAULT_RETRY
>       continue with faulty_address
>
> IOW, the fault retry logic repeats until the page is finally written
> back, which in the long run causes a big latency spike of several
> seconds. This patch solves the issue by turning off the fault retry
> logic on the second trial.

I guess a resend would be helpful, to refresh memories.

Please mention in the changelog whether this "long time stall" has
been observed in practice. If so, under what conditions, how often and
how long was it? Etcetera.
* [PATCH 3/3] mm: make PageReadahead more strict

From: Minchan Kim @ 2020-02-12 22:16 UTC
  To: Andrew Morton
  Cc: linux-mm, LKML, Jan Kara, Matthew Wilcox, Josef Bacik,
      Johannes Weiner, Minchan Kim, Robert Stupp

The PG_readahead flag shares a bit with PG_reclaim, but PG_reclaim is
only used in the write context while PG_readahead is used in the read
context. To make this clear, introduce a PageReadahead wrapper that
also checks !PageWriteback. That makes the code clearer and lets us
drop the PageWriteback check in page_cache_async_readahead, which
removes a pointless drop of mmap_sem.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/linux/page-flags.h | 28 ++++++++++++++++++++++++++--
 mm/readahead.c             |  6 ------
 2 files changed, 26 insertions(+), 8 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 1bf83c8fcaa7..f91a9b2a49bd 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -363,8 +363,32 @@ PAGEFLAG(MappedToDisk, mappedtodisk, PF_NO_TAIL)
 /* PG_readahead is only used for reads; PG_reclaim is only for writes */
 PAGEFLAG(Reclaim, reclaim, PF_NO_TAIL)
 	TESTCLEARFLAG(Reclaim, reclaim, PF_NO_TAIL)
-PAGEFLAG(Readahead, reclaim, PF_NO_COMPOUND)
-	TESTCLEARFLAG(Readahead, reclaim, PF_NO_COMPOUND)
+
+SETPAGEFLAG(Readahead, reclaim, PF_NO_COMPOUND)
+CLEARPAGEFLAG(Readahead, reclaim, PF_NO_COMPOUND)
+
+/*
+ * Since PG_readahead is shared with PG_reclaim of the page flags,
+ * PageReadahead should double check whether it's the readahead marker
+ * or PG_reclaim. It can be done with a PageWriteback check because
+ * PG_reclaim is always set together with PG_writeback.
+ */
+static inline int PageReadahead(struct page *page)
+{
+	VM_BUG_ON_PGFLAGS(PageCompound(page), page);
+
+	return (page->flags & (1UL << PG_reclaim | 1UL << PG_writeback)) ==
+		(1UL << PG_reclaim);
+}
+
+/* Clear PG_readahead only if it's PG_readahead, not PG_reclaim */
+static inline int TestClearPageReadahead(struct page *page)
+{
+	VM_BUG_ON_PGFLAGS(PageCompound(page), page);
+
+	return !PageWriteback(page) ||
+		test_and_clear_bit(PG_reclaim, &page->flags);
+}

 #ifdef CONFIG_HIGHMEM
 /*
diff --git a/mm/readahead.c b/mm/readahead.c
index 2fe72cd29b47..85b15e5a1d7b 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -553,12 +553,6 @@ page_cache_async_readahead(struct address_space *mapping,
 	if (!ra->ra_pages)
 		return;

-	/*
-	 * Same bit is used for PG_readahead and PG_reclaim.
-	 */
-	if (PageWriteback(page))
-		return;
-
 	ClearPageReadahead(page);

 	/*
--
2.25.0.225.g125e21ebc7-goog
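[Editor's note] The disambiguation in the new `PageReadahead()` rests on the invariant that PG_reclaim is only ever set together with PG_writeback. A tiny sketch of the flag arithmetic; the bit positions here are arbitrary illustrative values, not the kernel's:

```c
#include <assert.h>

enum { PG_reclaim = 0, PG_writeback = 1 };	/* illustrative positions */

/*
 * Mirrors the new PageReadahead() wrapper: the shared bit only means
 * "readahead marker" when writeback is not in flight; if both bits
 * are set, the bit is really PG_reclaim.
 */
static int page_readahead(unsigned long flags)
{
	return (flags & ((1UL << PG_reclaim) | (1UL << PG_writeback))) ==
	       (1UL << PG_reclaim);
}
```

This is why the open-coded `if (PageWriteback(page)) return;` in page_cache_async_readahead becomes redundant: the wrapper already refuses to report a readahead marker on a page under writeback.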
* Re: [PATCH 3/3] mm: make PageReadahead more strict

From: Jan Kara @ 2020-02-17  9:31 UTC
  To: Minchan Kim
  Cc: Andrew Morton, linux-mm, LKML, Jan Kara, Matthew Wilcox,
      Josef Bacik, Johannes Weiner, Robert Stupp

On Wed 12-02-20 14:16:14, Minchan Kim wrote:
> The PG_readahead flag shares a bit with PG_reclaim, but PG_reclaim is
> only used in the write context while PG_readahead is used in the read
> context. To make this clear, introduce a PageReadahead wrapper that
> also checks !PageWriteback. That makes the code clearer and lets us
> drop the PageWriteback check in page_cache_async_readahead, which
> removes a pointless drop of mmap_sem.
>
> Signed-off-by: Minchan Kim <minchan@kernel.org>

...

> +/* Clear PG_readahead only if it's PG_readahead, not PG_reclaim */
> +static inline int TestClearPageReadahead(struct page *page)
> +{
> +	VM_BUG_ON_PGFLAGS(PageCompound(page), page);
> +
> +	return !PageWriteback(page) ||
> +		test_and_clear_bit(PG_reclaim, &page->flags);
> +}

I think this is still wrong - if PageWriteback is not set, it will
never clear the PG_reclaim bit, so effectively the page will stay in
the PageReadahead state!

The logic you really want to implement is:

	if (PageReadahead(page)) {	<- this is your new PageReadahead
					   implementation
		clear_bit(PG_reclaim, &page->flags);
		return 1;
	}
	return 0;

Now this has the problem that it is not atomic. The only way I see to
make this fully atomic is using cmpxchg(). If we wanted to make this
kinda-sorta OK, the proper condition would look like:

	return !PageWriteback(page) **&&**
		test_and_clear_bit(PG_reclaim, &page->flags);

which is similar to what you originally had, but different because in
C the '&&' operator is not commutative due to side effects committed
at sequence points.

BTW: I share Andrew's view that we are piling hacks on top of problems
caused by older hacks. But I don't see any good option for how to
unalias PG_readahead and PG_reclaim.

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
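[Editor's note] Jan's '||' versus '&&' point is easy to demonstrate in isolation: with '||', a page that is not under writeback short-circuits the expression before the bit is cleared, so the stale marker survives. The helpers below are hypothetical userspace models of the bit operations, not the kernel primitives:

```c
#include <assert.h>

#define PG_reclaim 0	/* illustrative bit position */

/* Non-atomic model of test_and_clear_bit() for demonstration. */
static int test_and_clear(unsigned long *flags, int bit)
{
	int old = (*flags >> bit) & 1;

	*flags &= ~(1UL << bit);
	return old;
}

/* The posted version: '||' short-circuits before clearing the bit. */
static int test_clear_readahead_or(unsigned long *flags, int writeback)
{
	return !writeback || test_and_clear(flags, PG_reclaim);
}

/*
 * The condition Jan suggests: '&&' reaches the clear exactly when the
 * bit can really be the readahead marker (no writeback in flight).
 */
static int test_clear_readahead_and(unsigned long *flags, int writeback)
{
	return !writeback && test_and_clear(flags, PG_reclaim);
}
```

Both versions return the same truth value for a non-writeback page, but only the '&&' version performs the side effect of clearing the bit, which is the whole purpose of a TestClear operation.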
* Re: [PATCH 3/3] mm: make PageReadahead more strict

From: Minchan Kim @ 2020-02-18 22:28 UTC
  To: Jan Kara
  Cc: Andrew Morton, linux-mm, LKML, Matthew Wilcox, Josef Bacik,
      Johannes Weiner, Robert Stupp

On Mon, Feb 17, 2020 at 10:31:28AM +0100, Jan Kara wrote:
> On Wed 12-02-20 14:16:14, Minchan Kim wrote:
> > The PG_readahead flag shares a bit with PG_reclaim, but PG_reclaim
> > is only used in the write context while PG_readahead is used in
> > the read context. To make this clear, introduce a PageReadahead
> > wrapper that also checks !PageWriteback. That makes the code
> > clearer and lets us drop the PageWriteback check in
> > page_cache_async_readahead, which removes a pointless drop of
> > mmap_sem.
> >
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
>
> ...
>
> > +/* Clear PG_readahead only if it's PG_readahead, not PG_reclaim */
> > +static inline int TestClearPageReadahead(struct page *page)
> > +{
> > +	VM_BUG_ON_PGFLAGS(PageCompound(page), page);
> > +
> > +	return !PageWriteback(page) ||
> > +		test_and_clear_bit(PG_reclaim, &page->flags);
> > +}
>
> I think this is still wrong - if PageWriteback is not set, it will
> never clear the PG_reclaim bit, so effectively the page will stay in
> the PageReadahead state!
>
> The logic you really want to implement is:
>
>	if (PageReadahead(page)) {	<- this is your new PageReadahead
>					   implementation
>		clear_bit(PG_reclaim, &page->flags);
>		return 1;
>	}
>	return 0;
>
> Now this has the problem that it is not atomic. The only way I see
> to make this fully atomic is using cmpxchg(). If we wanted to make
> this kinda-sorta OK, the proper condition would look like:
>
>	return !PageWriteback(page) **&&**
>		test_and_clear_bit(PG_reclaim, &page->flags);
>
> which is similar to what you originally had, but different because
> in C the '&&' operator is not commutative due to side effects
> committed at sequence points.

That's accurate. Thanks, Jan.

> BTW: I share Andrew's view that we are piling hacks on top of
> problems caused by older hacks. But I don't see any good option for
> how to unalias PG_readahead and PG_reclaim.

I believe it's okay to remove the PG_writeback check in
page_cache_async_readahead. It's merely an optimization to accelerate
page reclaim when writeback of the LRU victim page is done by the VM.
If we remove the condition, we just lose a few pages for fast reclaim
at that moment, but they can eventually be reclaimed from the inactive
LRU. It should also be *rare*, in that writeback from the reclaim
context is not common and doesn't affect system behavior much.

> 								Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
end of thread, other threads:[~2020-04-21  4:12 UTC | newest]

Thread overview: 6+ messages
-- links below jump to the message on this page --
2020-02-12 22:16 [PATCH 1/3] mm: Don't bother dropping mmap_sem for zero size readahead Minchan Kim
2020-02-12 22:16 ` [PATCH 2/3] mm: fix long time stall from mm_populate Minchan Kim
2020-04-21  4:12   ` Andrew Morton
2020-02-12 22:16 ` [PATCH 3/3] mm: make PageReadahead more strict Minchan Kim
2020-02-17  9:31   ` Jan Kara
2020-02-18 22:28     ` Minchan Kim