From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Tue, 12 Dec 2017 19:27:25 -0800
From: "Paul E. McKenney"
To: "Huang, Ying"
Cc: Andrew Morton, Minchan Kim, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Hugh Dickins, Johannes Weiner, Tim Chen, Shaohua Li, Mel Gorman, Jérôme Glisse, Michal Hocko, Andrea Arcangeli, David Rientjes, Rik van Riel, Jan Kara, Dave Jiang, Aaron Lu
Subject: Re: [PATCH -mm] mm, swap: Fix race between swapoff and some swap operations
Reply-To: paulmck@linux.vnet.ibm.com
References: <20171208014346.GA8915@bbox> <87po7pg4jt.fsf@yhuang-dev.intel.com> <20171208082644.GA14361@bbox> <87k1xxbohp.fsf@yhuang-dev.intel.com> <20171208140909.4e31ba4f1235b638ae68fd5c@linux-foundation.org> <87609dvnl0.fsf@yhuang-dev.intel.com> <20171211170449.GS7829@linux.vnet.ibm.com> <87374grbpn.fsf@yhuang-dev.intel.com> <20171212171133.GC7829@linux.vnet.ibm.com> <87indbnzga.fsf@yhuang-dev.intel.com>
In-Reply-To: <87indbnzga.fsf@yhuang-dev.intel.com>
Message-Id: <20171213032725.GJ7829@linux.vnet.ibm.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Dec 13, 2017 at 10:17:41AM +0800, Huang, Ying wrote:
> "Paul E. McKenney" writes:
>
> > On Tue, Dec 12, 2017 at 09:12:20AM +0800, Huang, Ying wrote:
> >> Hi, Paul,
> >>
> >> "Paul E. McKenney" writes:
> >>
> >> > On Mon, Dec 11, 2017 at 01:30:03PM +0800, Huang, Ying wrote:
> >> >> Andrew Morton writes:
> >> >>
> >> >> > On Fri, 08 Dec 2017 16:41:38 +0800 "Huang\, Ying" wrote:
> >> >> >
> >> >> >> > Why do we need srcu here?  Is it enough with rcu like below?
> >> >> >> >
> >> >> >> > It might have a bug or room for optimization in performance/naming.
> >> >> >> > I just wanted to show my intention.
> >> >> >>
> >> >> >> Yes, rcu should work too.  But if we use rcu, rcu_read_lock/unlock()
> >> >> >> may need to be called several times to make sure the swap device under
> >> >> >> us doesn't go away, for example, when checking si->max in
> >> >> >> __swp_swapcount() and in add_swap_count_continuation().  And I found we
> >> >> >> need rcu to protect the swap cache radix tree array too.  So I think it
> >> >> >> may be better to use one call to srcu_read_lock/unlock() instead of
> >> >> >> multiple calls to rcu_read_lock/unlock().
> >> >> >
> >> >> > Or use stop_machine() ;)  It's very crude but it sure is simple.  Does
> >> >> > anyone have a swapoff-intensive workload?
> >> >>
> >> >> Sorry, I don't know how to solve the problem with stop_machine().
> >> >>
> >> >> The problem we are trying to resolve is that we have a swap entry, but
> >> >> that swap entry can become invalid because of a swapoff between the time
> >> >> we check it and the time we use it.  So we need to prevent swapoff from
> >> >> running between the check and the use.
> >> >>
> >> >> I don't know how to use stop_machine() in swapoff to wait for all users
> >> >> of swap entries to finish.  Can anyone help me with this?
> >> >
> >> > You can think of stop_machine() as being sort of like a reader-writer
> >> > lock.  The readers can be any section of code with preemption disabled,
> >> > and the writer is the function passed to stop_machine().
> >> >
> >> > Users running real-time applications on Linux don't tend to like
> >> > stop_machine() much, but perhaps it is nevertheless the right tool
> >> > for this particular job.
> >>
> >> Thanks a lot for the explanation!  Now I understand this.
> >>
> >> Another question: for this specific problem, I think both a
> >> stop_machine() based solution and an rcu_read_lock/unlock() +
> >> synchronize_rcu() based solution would work.  If so, what is the
> >> difference between them?  I guess the rcu based solution would be a
> >> little better for real-time applications?  So what is the advantage of
> >> the stop_machine() based solution?
> >
> > The stop_machine() solution places similar restrictions on readers as
> > does rcu_read_lock/unlock() + synchronize_rcu(), if that is what you
> > are asking.
> >
> > More precisely, the stop_machine() solution places exactly the
> > same restrictions on readers as does preempt_disable/enable() and
> > synchronize_sched().
> >
> > I would expect stop_machine() to be faster than synchronize_rcu(),
> > synchronize_sched(), or synchronize_srcu(), but stop_machine() operates
> > by making each CPU spin with interrupts disabled until all the other
> > CPUs arrive.  This normally does not make real-time people happy.
> >
> > A compromise position is available in the form of
> > synchronize_rcu_expedited() and synchronize_sched_expedited().  These
> > are faster than their non-expedited counterparts, and only momentarily
> > disturb each CPU, rather than spinning with interrupts disabled.
> > However, stop_machine() is probably a bit faster.
> >
> > Finally, synchronize_srcu_expedited() is reasonably fast, but
> > avoids disturbing other CPUs.  Last I checked, it was not quite as fast
> > as synchronize_rcu_expedited() and synchronize_sched_expedited(), though.
> >
> > You asked!  ;-)
>
> Thanks a lot, Paul!  That exceeds my expectations!
>
> The performance of swapoff() isn't very important, so it's probably not
> necessary to accelerate it at the cost of real-time latency.  I think it
> is better to use an rcu or srcu based solution.  I think the cost on the
> reader side should be almost the same between rcu and srcu?  To use srcu,
> we need to select CONFIG_SRCU when CONFIG_SWAP is enabled in Kconfig.
> I think that should be OK?

The thing to do is to try SRCU and see whether you observe significant
performance degradation.  Given that there is swapping involved, I would
be surprised if the added read-side overhead of SRCU was even measurable,
but then again I have been surprised before.

And yes, just select CONFIG_SRCU when you need it.

							Thanx, Paul