From: "Huang\, Ying" <ying.huang@intel.com>
To: Minchan Kim <minchan@kernel.org>
Cc: "Andrew Morton" <akpm@linux-foundation.org>,
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
"Hugh Dickins" <hughd@google.com>,
"Johannes Weiner" <hannes@cmpxchg.org>,
"Tim Chen" <tim.c.chen@linux.intel.com>,
"Shaohua Li" <shli@fb.com>,
"Mel Gorman" <mgorman@techsingularity.net>,
"J�r�me Glisse" <jglisse@redhat.com>,
"Michal Hocko" <mhocko@suse.com>,
"Andrea Arcangeli" <aarcange@redhat.com>,
"David Rientjes" <rientjes@google.com>,
"Rik van Riel" <riel@redhat.com>, "Jan Kara" <jack@suse.cz>,
"Dave Jiang" <dave.jiang@intel.com>,
"Aaron Lu" <aaron.lu@intel.com>
Subject: Re: [PATCH -mm] mm, swap: Fix race between swapoff and some swap operations
Date: Wed, 13 Dec 2017 16:52:53 +0800 [thread overview]
Message-ID: <87wp1rm2l6.fsf@yhuang-dev.intel.com> (raw)
In-Reply-To: <20171213071556.GA26478@bbox> (Minchan Kim's message of "Wed, 13 Dec 2017 16:15:56 +0900")
Minchan Kim <minchan@kernel.org> writes:
> Hi Huang,
>
> Sorry for the late response. I'm in middle of long vacation.
>
> On Fri, Dec 08, 2017 at 08:32:16PM +0800, Huang, Ying wrote:
>> Minchan Kim <minchan@kernel.org> writes:
>>
>> > On Fri, Dec 08, 2017 at 04:41:38PM +0800, Huang, Ying wrote:
>> >> Minchan Kim <minchan@kernel.org> writes:
>> >>
>> >> > On Fri, Dec 08, 2017 at 01:41:10PM +0800, Huang, Ying wrote:
>> >> >> Minchan Kim <minchan@kernel.org> writes:
>> >> >>
>> >> >> > On Thu, Dec 07, 2017 at 04:29:37PM -0800, Andrew Morton wrote:
>> >> >> >> On Thu, 7 Dec 2017 09:14:26 +0800 "Huang, Ying" <ying.huang@intel.com> wrote:
>> >> >> >>
>> >> >> >> > When the swapin is performed, after getting the swap entry information
>> >> >> >> > from the page table, the PTL (page table lock) will be released, then
>> >> >> >> > system will go to swap in the swap entry, without any lock held to
>> >> >> >> > prevent the swap device from being swapoff. This may cause the race
>> >> >> >> > like below,
>> >> >> >> >
>> >> >> >> > CPU 1 CPU 2
>> >> >> >> > ----- -----
>> >> >> >> > do_swap_page
>> >> >> >> > swapin_readahead
>> >> >> >> > __read_swap_cache_async
>> >> >> >> > swapoff swapcache_prepare
>> >> >> >> > p->swap_map = NULL __swap_duplicate
>> >> >> >> > p->swap_map[?] /* !!! NULL pointer access */
>> >> >> >> >
>> >> >> >> > Because swap off is usually done when system shutdown only, the race
>> >> >> >> > may not hit many people in practice. But it is still a race need to
>> >> >> >> > be fixed.
>> >> >> >>
>> >> >> >> swapoff is so rare that it's hard to get motivated about any fix which
>> >> >> >> adds overhead to the regular codepaths.
>> >> >> >
>> >> >> > That was my concern, too when I see this patch.
>> >> >> >
>> >> >> >>
>> >> >> >> Is there something we can do to ensure that all the overhead of this
>> >> >> >> fix is placed into the swapoff side? stop_machine() may be a bit
>> >> >> >> brutal, but a surprising amount of code uses it. Any other ideas?
>> >> >> >
>> >> >> > How about this?
>> >> >> >
>> >> >> > I think It's same approach with old where we uses si->lock everywhere
>> >> >> > instead of more fine-grained cluster lock.
>> >> >> >
>> >> >> > The reason I repeated to reset p->max to zero in the loop is to avoid
>> >> >> > using lockdep annotation(maybe, spin_lock_nested(something) to prevent
>> >> >> > false positive.
>> >> >> >
>> >> >> > diff --git a/mm/swapfile.c b/mm/swapfile.c
>> >> >> > index 42fe5653814a..9ce007a42bbc 100644
>> >> >> > --- a/mm/swapfile.c
>> >> >> > +++ b/mm/swapfile.c
>> >> >> > @@ -2644,6 +2644,19 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>> >> >> > swap_file = p->swap_file;
>> >> >> > old_block_size = p->old_block_size;
>> >> >> > p->swap_file = NULL;
>> >> >> > +
>> >> >> > + if (p->flags & SWP_SOLIDSTATE) {
>> >> >> > + unsigned long ci, nr_cluster;
>> >> >> > +
>> >> >> > + nr_cluster = DIV_ROUND_UP(p->max, SWAPFILE_CLUSTER);
>> >> >> > + for (ci = 0; ci < nr_cluster; ci++) {
>> >> >> > + struct swap_cluster_info *sci;
>> >> >> > +
>> >> >> > + sci = lock_cluster(p, ci * SWAPFILE_CLUSTER);
>> >> >> > + p->max = 0;
>> >> >> > + unlock_cluster(sci);
>> >> >> > + }
>> >> >> > + }
>> >> >> > p->max = 0;
>> >> >> > swap_map = p->swap_map;
>> >> >> > p->swap_map = NULL;
>> >> >> > @@ -3369,10 +3382,10 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
>> >> >> > goto bad_file;
>> >> >> > p = swap_info[type];
>> >> >> > offset = swp_offset(entry);
>> >> >> > - if (unlikely(offset >= p->max))
>> >> >> > - goto out;
>> >> >> >
>> >> >> > ci = lock_cluster_or_swap_info(p, offset);
>> >> >> > + if (unlikely(offset >= p->max))
>> >> >> > + goto unlock_out;
>> >> >> >
>> >> >> > count = p->swap_map[offset];
>> >> >> >
>> >> >>
>> >> >> Sorry, this doesn't work, because
>> >> >>
>> >> >> lock_cluster_or_swap_info()
>> >> >>
>> >> >> Need to read p->cluster_info, which may be freed during swapoff too.
>> >> >>
>> >> >>
>> >> >> To reduce the added overhead in regular code path, Maybe we can use SRCU
>> >> >> to implement get_swap_device() and put_swap_device()? There is only
>> >> >> increment/decrement on CPU local variable in srcu_read_lock/unlock().
>> >> >> Should be acceptable in not so hot swap path?
>> >> >>
>> >> >> This needs to select CONFIG_SRCU if CONFIG_SWAP is enabled. But I guess
>> >> >> that should be acceptable too?
>> >> >>
>> >> >
>> >> > Why do we need srcu here? Is it enough with rcu like below?
>> >> >
>> >> > It might have a bug/room to be optimized about performance/naming.
>> >> > I just wanted to show my intention.
>> >>
>> >> Yes. rcu should work too. But if we use rcu, it may need to be called
>> >> several times to make sure the swap device under us doesn't go away, for
>> >> example, when checking si->max in __swp_swapcount() and
>> >
>> > I think it's not a big concern performance pov and benefit is good
>> > abstraction through current locking function so we don't need much churn.
>>
>> I think get/put_something() is common practice in Linux kernel to
>> prevent something to go away under us. That makes the programming model
>> easier to be understood than checking whether swap entry is valid here
>> and there.
>>
>> >> add_swap_count_continuation(). And I found we need rcu to protect swap
>> >> cache radix tree array too. So I think it may be better to use one
>> >
>> > Could you elaborate it more about swap cache arrary problem?
>>
>> Like swap_map, cluster_info, swap cache radix tree array for a swap
>> device will be freed at the end of swapoff. So when we look up swap
>> cache, we need to make sure the swap cache array is valid firstly too.
>>
>
> Thanks for the clarification.
>
> I'm not saying refcount approach you suggested is wrong but just wanted
> to find more easier way with just fixing cold path instead of hot path.
> To me, the thought came from from logical sense for the maintainance
> rather than performan problem.
>
> I still need a time to think over it and it would be made after the vacation
> so don't want to make you stuck. A thing I want to suggest is that let's
> think about maintanaince point of view for solution candidates.
> I don't like to put get/put into out of swap code. Instead, let's
> encapsulate the locking code into swap functions inside so any user
> of swap function doesn't need to know the detail.
>
> I think which approach is best for the solution among several
> approaches depends on that how the solution makes code simple without
> exposing the internal much rather than performance at the moment.
>
> Just my two cents.
> Sorry for the vague review. I'm looking forward to seeing new patches.
Thanks! Yes, long-term maintenance is key point. I just thought that
get/put_swap_device() based method makes the whole picture simpler so
that we don't need to worry about the swap device gone under us. Will
send out a new version soon.
Best Regards,
Huang, Ying
next prev parent reply other threads:[~2017-12-13 8:53 UTC|newest]
Thread overview: 16+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-12-07 1:14 [PATCH -mm] mm, swap: Fix race between swapoff and some swap operations Huang, Ying
2017-12-08 0:29 ` Andrew Morton
2017-12-08 1:43 ` Minchan Kim
[not found] ` <87po7pg4jt.fsf@yhuang-dev.intel.com>
2017-12-08 8:26 ` Minchan Kim
2017-12-08 8:41 ` Huang, Ying
2017-12-08 9:10 ` Minchan Kim
2017-12-08 12:32 ` Huang, Ying
2017-12-13 7:15 ` Minchan Kim
2017-12-13 8:52 ` Huang, Ying [this message]
2017-12-08 22:09 ` Andrew Morton
2017-12-11 5:30 ` Huang, Ying
2017-12-11 17:04 ` Paul E. McKenney
2017-12-12 1:12 ` Huang, Ying
2017-12-12 17:11 ` Paul E. McKenney
2017-12-13 2:17 ` Huang, Ying
2017-12-13 3:27 ` Paul E. McKenney
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87wp1rm2l6.fsf@yhuang-dev.intel.com \
--to=ying.huang@intel.com \
--cc=aarcange@redhat.com \
--cc=aaron.lu@intel.com \
--cc=akpm@linux-foundation.org \
--cc=dave.jiang@intel.com \
--cc=hannes@cmpxchg.org \
--cc=hughd@google.com \
--cc=jack@suse.cz \
--cc=jglisse@redhat.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mgorman@techsingularity.net \
--cc=mhocko@suse.com \
--cc=minchan@kernel.org \
--cc=paulmck@linux.vnet.ibm.com \
--cc=riel@redhat.com \
--cc=rientjes@google.com \
--cc=shli@fb.com \
--cc=tim.c.chen@linux.intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).