From: "zhaowuyun@wingtech.com" <zhaowuyun@wingtech.com>
To: Hugh Dickins <hughd@google.com>
Cc: akpm <akpm@linux-foundation.org>,
	Michal Hocko <mhocko@kernel.org>,
	mgorman <mgorman@techsingularity.net>,
	minchan <minchan@kernel.org>, vinmenon <vinmenon@codeaurora.org>,
	hannes <hannes@cmpxchg.org>,
	"hillf.zj" <hillf.zj@alibaba-inc.com>,
	linux-mm <linux-mm@kvack.org>,
	linux-kernel <linux-kernel@vger.kernel.org>
Subject: Re: Re: [PATCH] [PATCH] mm: disable preemption before swapcache_free
Date: Tue, 7 Aug 2018 10:15:40 +0800	[thread overview]
Message-ID: <20180807101540612373235@wingtech.com> (raw)
In-Reply-To: alpine.LSU.2.11.1808041332410.1120@eggly.anvils


>Andrew's msleep(1) may be a good enough bandaid for you. And I share
>Michal's doubt about your design, in which an RT thread meets swap:
>this may not be the last problem you have with that.
>
>But this is a real bug when CONFIG_PREEMPT=y, RT threads or not: we
>just didn't notice, because it's usually hidden by the cond_resched().
>(I think that was added in 3.10, because in 2.6.29 I had been guilty of
>inserting a discard, and wait for completion, in between allocating swap
>and adding to swap cache; but Shao Hua fixed my discard in 3.12.) Thanks
>a lot for making us aware of this bug.
>
>After reminding myself of the issues here, I disagree with much of what
>has been said: we shall "always" want the loop in __read_swap_cache_async()
>(though some of its error handling is probably superfluous now, agreed);
>and your disabling of preemption is not just a bandaid, it's exactly the
>right approach.
>
>We could do something heavier, perhaps rearranging the swapcache tree work
>to be done under swap_lock as well as tree_lock (I'm talking 4.9-language),
>but that's very unlikely to be an improvement. Disabling preemption yokes
>the two spinlocks together in an efficient way, without overhead on other
>paths; on rare occasions we spin around __read_swap_cache_async() instead
>of spinning around to acquire a spinlock.
>
>But your patch is incomplete. The same needs to be done in delete_from_
>swap_cache(), and we would also need to protect against preemption between
>the get_swap_page() and the add_to_swap_cache(), in add_to_swap() and in
>shmem_writepage(). The latter gets messy, but 4.11 (where Tim Chen uses
>SWAP_HAS_CACHE more widely) gives a good hint: __read_swap_cache_async()
>callers are only interested in swap entries that are already in use and
>still in use. (Though swapoff has to be more careful, partly because one
>of its jobs is to clear out swap-cache-only entries, partly because the
>current interface would mistake a NULL for no-entry as out-of-memory.)
>
>Below is the patch I would do for 4.9 (diffed against 4.9.117), and I'm
>showing that because it's the simplest case. Although the principles stay
>the same, the codebase here has gone through several shifts, and 4.19 will
>probably be different again. So I'll test and post a patch against 4.19-rc
>in a few weeks' time, and that can then be backported to stable: but it will
>need several manual backports because of the intervening changes.
>
>I did wonder whether just to extend the irq-disabled section in
>delete_from_swap_cache() etc: that's easy, and works, and is even better
>protection against spinning too long; but it's not absolutely necessary,
>so all in all, probably better avoided. I did wonder whether to remove
>the cond_resched(), but it's not a bad place for one, so I've left it in.
>
>When checking worst cases of looping around __read_swap_cache_async(),
>after the patch, I was worried for a while. I had naively imagined that
>going more than twice around the loop should be vanishingly rare, but
>that is not so at all. But the bad cases I looked into were all the same:
>after forking, two processes, on HT siblings, each serving do_swap_page(),
>trying to bring the same swap into its mm, with a sparse swapper_space
>tree: one process gets to do all the work of allocating new radix-tree
>nodes and bringing them into D-cache, while the other just spins around
>__read_swap_cache_async() seeing SWAP_HAS_CACHE but not yet the page in
>the radix-tree. That's okay, that makes sense.
>
>Hugh
>---
>
> mm/swap_state.c |   26 +++++++++++++-------------
> mm/swapfile.c   |    8 +++++++-
> mm/vmscan.c     |    3 +++
> 3 files changed, 23 insertions(+), 14 deletions(-)
>
>--- 4.9.117/mm/swap_state.c	2016-12-11 11:17:54.000000000 -0800
>+++ linux/mm/swap_state.c	2018-08-04 11:50:46.577788766 -0700
>@@ -225,9 +225,11 @@ void delete_from_swap_cache(struct page
> 	address_space = swap_address_space(entry);
> 	spin_lock_irq(&address_space->tree_lock);
> 	__delete_from_swap_cache(page);
>+	/* Expedite swapcache_free() to help __read_swap_cache_async() */
>+	preempt_disable();
> 	spin_unlock_irq(&address_space->tree_lock);
>-
> 	swapcache_free(entry);
>+	preempt_enable();
> 	put_page(page);
> }
>
>@@ -337,19 +339,17 @@ struct page *__read_swap_cache_async(swp
> 		if (err == -EEXIST) {
> 			radix_tree_preload_end();
> 			/*
>-			 * We might race against get_swap_page() and stumble
>-			 * across a SWAP_HAS_CACHE swap_map entry whose page
>-			 * has not been brought into the swapcache yet, while
>-			 * the other end is scheduled away waiting on discard
>-			 * I/O completion at scan_swap_map().
>+			 * We might race against __delete_from_swap_cache() and
>+			 * stumble across a swap_map entry whose SWAP_HAS_CACHE
>+			 * has not yet been cleared: hence preempt_disable()
>+			 * in __remove_mapping() and delete_from_swap_cache(),
>+			 * so they cannot schedule away before clearing it.
> 			 *
>-			 * In order to avoid turning this transitory state
>-			 * into a permanent loop around this -EEXIST case
>-			 * if !CONFIG_PREEMPT and the I/O completion happens
>-			 * to be waiting on the CPU waitqueue where we are now
>-			 * busy looping, we just conditionally invoke the
>-			 * scheduler here, if there are some more important
>-			 * tasks to run.
>+			 * We need similar protection against racing calls to
>+			 * __read_swap_cache_async(): preempt_disable() before
>+			 * swapcache_prepare() above, preempt_enable() after
>+			 * __add_to_swap_cache() below: which are already in
>+			 * radix_tree_maybe_preload(), radix_tree_preload_end().
> 			 */
> 			cond_resched();
> 			continue;
>--- 4.9.117/mm/swapfile.c	2018-08-04 11:40:08.463504848 -0700
>+++ linux/mm/swapfile.c	2018-08-04 11:50:46.577788766 -0700
>@@ -2670,7 +2670,13 @@ static int __swap_duplicate(swp_entry_t
> 		/* set SWAP_HAS_CACHE if there is no cache and entry is used */
> 		if (!has_cache && count)
> 			has_cache = SWAP_HAS_CACHE;
>-		else if (has_cache)		/* someone else added cache */
>+		/*
>+		 * __read_swap_cache_async() can usually skip entries without
>+		 * real usage (including those in between being allocated and
>+		 * added to swap cache); but swapoff (!SWP_WRITEOK) must not.
>+		 */
>+		else if (has_cache &&
>+			 (count || !(p->flags & SWP_WRITEOK)))
> 			err = -EEXIST;
> 		else			/* no users remaining */
> 			err = -ENOENT;
>--- 4.9.117/mm/vmscan.c	2018-08-04 11:40:08.471504902 -0700
>+++ linux/mm/vmscan.c	2018-08-04 11:50:46.577788766 -0700
>@@ -709,8 +709,11 @@ static int __remove_mapping(struct addre
> 		swp_entry_t swap = { .val = page_private(page) };
> 		mem_cgroup_swapout(page, swap);
> 		__delete_from_swap_cache(page);
>+		/* Expedite swapcache_free() for __read_swap_cache_async() */
>+		preempt_disable();
> 		spin_unlock_irqrestore(&mapping->tree_lock, flags);
> 		swapcache_free(swap);
>+		preempt_enable();
> 	} else {
> 		void (*freepage)(struct page *);
> 		void *shadow = NULL;
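[To see what the mm/swapfile.c hunk changes, here is a minimal userspace sketch of just that branch. `cache_claim_result` and the flag values are stand-ins invented for illustration, not the kernel's; the real __swap_duplicate() additionally handles swap_info lookup, reference-count continuation, and locking.]

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define SWAP_HAS_CACHE	0x40	/* stand-in flag: entry has (or is getting) swap cache */
#define SWP_WRITEOK	0x02	/* stand-in flag: swapfile accepts allocations */

/*
 * Distills only the branch changed in the hunk above: return 0 when
 * the caller may set SWAP_HAS_CACHE, -EEXIST when the cache is
 * genuinely owned by someone else (or swapoff is running), -ENOENT
 * when no users remain.  After the patch, a swap-cache-only entry
 * (count == 0) no longer yields -EEXIST unless !SWP_WRITEOK.
 */
static int cache_claim_result(int count, bool has_cache, int swp_flags)
{
	if (!has_cache && count)
		return 0;		/* we get to set SWAP_HAS_CACHE */
	else if (has_cache && (count || !(swp_flags & SWP_WRITEOK)))
		return -EEXIST;		/* someone else really owns the cache */
	else
		return -ENOENT;		/* no users remaining */
}
```

[The sketch makes the swapoff carve-out visible: with SWP_WRITEOK clear, a cache-only entry still reports -EEXIST rather than being skipped as -ENOENT.]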


Hi Hugh,

Thanks for affirming that disabling preemption is the right approach, and for
pointing out the incompleteness: delete_from_swap_cache() needs the same protection.
I'm curious why swapcache_free(swap) is not simply placed under the protection of
mapping->tree_lock?
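[As an editorial aside: the transient window under discussion (SWAP_HAS_CACHE set, page not yet published) can be pictured with a userspace analogy. This is not kernel code; the names `allocator` and `reader_spins` and the usleep() delay are illustrative inventions standing in for the delayed task and for __read_swap_cache_async()'s -EEXIST retry loop.]

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>
#include <unistd.h>

static atomic_int has_cache;		/* the swap_map entry's SWAP_HAS_CACHE bit */
static atomic_uintptr_t cached_page;	/* the radix-tree slot */

/* Plays the task that won swapcache_prepare() but is delayed
 * (e.g. preempted) before adding the page to the swap cache. */
static void *allocator(void *page)
{
	atomic_store(&has_cache, 1);			/* swapcache_prepare() won */
	usleep(1000);					/* the preemption window at issue */
	atomic_store(&cached_page, (uintptr_t)page);	/* __add_to_swap_cache() */
	return NULL;
}

/* Plays __read_swap_cache_async(): spin until the page appears,
 * counting how often we see the cache bit without the page. */
static long reader_spins(void)
{
	long spins = 0;

	for (;;) {
		if (atomic_load(&cached_page))
			return spins;		/* page found in "swap cache" */
		if (atomic_load(&has_cache))
			spins++;		/* -EEXIST: go around again */
		sched_yield();			/* the cond_resched() analogue */
	}
}
```

[On a preemptible system the reader can observe many -EEXIST iterations before the page appears; narrowing the window on the allocator side, as the patch does with preempt_disable(), bounds how long the reader spins.]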


Thread overview: 19+ messages
2018-07-25  6:37 [PATCH] [PATCH] mm: disable preemption before swapcache_free zhaowuyun
2018-07-25  6:40 ` zhaowuyun
2018-07-25  7:40 ` Michal Hocko
2018-07-25  7:57   ` zhaowuyun
2018-07-25  8:21     ` Michal Hocko
2018-07-25  9:53       ` zhaowuyun
2018-07-25 10:34         ` Michal Hocko
2018-07-25 11:17           ` zhaowuyun
2018-07-25 10:32       ` Michal Hocko
2018-07-25 21:16 ` Andrew Morton
2018-07-26  2:21   ` zhaowuyun
2018-07-26  6:06     ` Michal Hocko
2018-07-26  7:03       ` zhaowuyun
2018-07-26  7:44         ` Michal Hocko
2018-07-26 22:11         ` Andrew Morton
2018-07-27  6:07           ` zhaowuyun
2018-08-04 23:07             ` Hugh Dickins
2018-08-07  2:15               ` zhaowuyun [this message]
2018-08-07  3:23                 ` Hugh Dickins
