From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1753249AbdGCUyI (ORCPT <rfc822;w@1wt.eu>);
        Mon, 3 Jul 2017 16:54:08 -0400
Received: from mx0b-001b2d01.pphosted.com ([148.163.158.5]:53759 "EHLO
        mx0a-001b2d01.pphosted.com" rhost-flags-OK-OK-OK-FAIL)
        by vger.kernel.org with ESMTP id S1753174AbdGCUyE (ORCPT
        <rfc822;linux-kernel@vger.kernel.org>);
        Mon, 3 Jul 2017 16:54:04 -0400
Date: Mon, 3 Jul 2017 13:53:53 -0700
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Alan Stern <stern@rowland.harvard.edu>
Cc: Manfred Spraul <manfred@colorfullife.com>,
        linux-kernel@vger.kernel.org, netfilter-devel@vger.kernel.org,
        netdev@vger.kernel.org, oleg@redhat.com, akpm@linux-foundation.org,
        mingo@redhat.com, dave@stgolabs.net, tj@kernel.org, arnd@arndb.de,
        linux-arch@vger.kernel.org, will.deacon@arm.com, peterz@infradead.org,
        parri.andrea@gmail.com, torvalds@linux-foundation.org,
        Pablo Neira Ayuso <pablo@netfilter.org>,
        Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>,
        Florian Westphal <fw@strlen.de>,
        "David S. Miller" <davem@davemloft.net>, coreteam@netfilter.org
Subject: Re: [PATCH RFC 01/26] netfilter: Replace spin_unlock_wait() with
 lock/unlock pair
Reply-To: paulmck@linux.vnet.ibm.com
References: <20170703171420.GC2393@linux.vnet.ibm.com>
 <Pine.LNX.4.44L0.1707031559380.2027-100000@iolanthe.rowland.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <Pine.LNX.4.44L0.1707031559380.2027-100000@iolanthe.rowland.org>
User-Agent: Mutt/1.5.21 (2010-09-15)
X-TM-AS-GCONF: 00
x-cbid: 17070320-0008-0000-0000-00000254D84E
X-IBM-SpamModules-Scores: 
X-IBM-SpamModules-Versions: BY=3.00007314; HX=3.00000241; KW=3.00000007;
 PH=3.00000004; SC=3.00000214; SDB=6.00882389; UDB=6.00440089; IPR=6.00662579;
 BA=6.00005450; NDR=6.00000001; ZLA=6.00000005; ZF=6.00000009; ZB=6.00000000;
 ZP=6.00000000; ZH=6.00000000; ZU=6.00000002; MB=3.00016061; XFM=3.00000015;
 UTC=2017-07-03 20:53:58
X-IBM-AV-DETECTION: SAVI=unused REMOTE=unused XFE=unused
x-cbparentid: 17070320-0009-0000-0000-000035E60E73
Message-Id: <20170703205353.GG2393@linux.vnet.ibm.com>
X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10432:,, definitions=2017-07-03_13:,,
 signatures=0
X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 spamscore=0 suspectscore=0
 malwarescore=0 phishscore=0 adultscore=0 bulkscore=0 classifier=spam
 adjust=0 reason=mlx scancount=1 engine=8.0.1-1703280000
 definitions=main-1707030341
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Jul 03, 2017 at 04:04:14PM -0400, Alan Stern wrote:
> On Mon, 3 Jul 2017, Paul E. McKenney wrote:
> 
> > On Mon, Jul 03, 2017 at 10:39:49AM -0400, Alan Stern wrote:
> > > On Sat, 1 Jul 2017, Manfred Spraul wrote:
> > > 
> > > > As we want to remove spin_unlock_wait() and replace it with explicit
> > > > spin_lock()/spin_unlock() calls, we can use this to simplify the
> > > > locking.
> > > > 
> > > > In addition:
> > > > - Reading nf_conntrack_locks_all needs ACQUIRE memory ordering.
> > > > - The new code avoids the backwards loop.
> > > > 
> > > > Only slightly tested, I did not manage to trigger calls to
> > > > nf_conntrack_all_lock().
> > > > 
> > > > Fixes: b16c29191dc8
> > > > Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
> > > > Cc: <stable@vger.kernel.org>
> > > > Cc: Sasha Levin <sasha.levin@oracle.com>
> > > > Cc: Pablo Neira Ayuso <pablo@netfilter.org>
> > > > Cc: netfilter-devel@vger.kernel.org
> > > > ---
> > > >  net/netfilter/nf_conntrack_core.c | 44 +++++++++++++++++++++------------------
> > > >  1 file changed, 24 insertions(+), 20 deletions(-)
> > > > 
> > > > diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
> > > > index e847dba..1193565 100644
> > > > --- a/net/netfilter/nf_conntrack_core.c
> > > > +++ b/net/netfilter/nf_conntrack_core.c
> > > > @@ -96,19 +96,24 @@ static struct conntrack_gc_work conntrack_gc_work;
> > > >  
> > > >  void nf_conntrack_lock(spinlock_t *lock) __acquires(lock)
> > > >  {
> > > > +	/* 1) Acquire the lock */
> > > >  	spin_lock(lock);
> > > > -	while (unlikely(nf_conntrack_locks_all)) {
> > > > -		spin_unlock(lock);
> > > >  
> > > > -		/*
> > > > -		 * Order the 'nf_conntrack_locks_all' load vs. the
> > > > -		 * spin_unlock_wait() loads below, to ensure
> > > > -		 * that 'nf_conntrack_locks_all_lock' is indeed held:
> > > > -		 */
> > > > -		smp_rmb(); /* spin_lock(&nf_conntrack_locks_all_lock) */
> > > > -		spin_unlock_wait(&nf_conntrack_locks_all_lock);
> > > > -		spin_lock(lock);
> > > > -	}
> > > > +	/* 2) read nf_conntrack_locks_all, with ACQUIRE semantics */
> > > > +	if (likely(smp_load_acquire(&nf_conntrack_locks_all) == false))
> > > > +		return;
> > > 
> > > As far as I can tell, this read does not need to have ACQUIRE
> > > semantics.
> > > 
> > > You need to guarantee that two things can never happen:
> > > 
> > >     (1) We read nf_conntrack_locks_all == false, and this routine's
> > > 	critical section for nf_conntrack_locks[i] runs after the
> > > 	(empty) critical section for that lock in 
> > > 	nf_conntrack_all_lock().
> > > 
> > >     (2) We read nf_conntrack_locks_all == true, and this routine's 
> > > 	critical section for nf_conntrack_locks_all_lock runs before 
> > > 	the critical section in nf_conntrack_all_lock().
> > > 
> > > In fact, neither one can happen even if smp_load_acquire() is replaced
> > > with READ_ONCE().  The reason is simple enough, using this property of
> > > spinlocks:
> > > 
> > > 	If critical section CS1 runs before critical section CS2 (for 
> > > 	the same lock) then: (a) every write coming before CS1's
> > > 	spin_unlock() will be visible to any read coming after CS2's
> > > 	spin_lock(), and (b) no write coming after CS2's spin_lock()
> > > 	will be visible to any read coming before CS1's spin_unlock().
> > > 
> > > Thus for (1), assuming the critical sections run in the order mentioned
> > > above, since nf_conntrack_all_lock() writes to nf_conntrack_locks_all
> > > before releasing nf_conntrack_locks[i], and since nf_conntrack_lock()
> > > acquires nf_conntrack_locks[i] before reading nf_conntrack_locks_all,
> > > by (a) the read will always see the write.
> > > 
> > > Similarly for (2), since nf_conntrack_all_lock() acquires 
> > > nf_conntrack_locks_all_lock before writing to nf_conntrack_locks_all, 
> > > and since nf_conntrack_lock() reads nf_conntrack_locks_all before 
> > > releasing nf_conntrack_locks_all_lock, by (b) the read cannot see the 
> > > write.
> > 
> > And the Linux kernel memory model (https://lwn.net/Articles/718628/
> > and https://lwn.net/Articles/720550/) agrees with Alan.  Here is
> > a litmus test, which emulates spin_lock() with xchg_acquire() and
> > spin_unlock() with smp_store_release():
> > 
> > ------------------------------------------------------------------------
> > 
> > C C-ManfredSpraul-L1G1xchgnr.litmus
> > 
> > (* Expected result: Never.  *)
> > 
> > {
> > }
> > 
> > P0(int *nfcla, spinlock_t *gbl, int *gbl_held, spinlock_t *lcl, int *lcl_held)
> > {
> > 	/* Acquire local lock. */
> > 	r10 = xchg_acquire(lcl, 1);
> > 	r1 = READ_ONCE(*nfcla);
> > 	if (r1) {
> > 		smp_store_release(lcl, 0);
> > 		r11 = xchg_acquire(gbl, 1);
> > 		r12 = xchg_acquire(lcl, 1);
> > 		smp_store_release(gbl, 0);
> > 	}
> > 	r2 = READ_ONCE(*gbl_held);
> > 	WRITE_ONCE(*lcl_held, 1);
> > 	WRITE_ONCE(*lcl_held, 0);
> > 	smp_store_release(lcl, 0);
> > }
> > 
> > P1(int *nfcla, spinlock_t *gbl, int *gbl_held, spinlock_t *lcl, int *lcl_held)
> > {
> > 	/* Acquire global lock. */
> > 	r10 = xchg_acquire(gbl, 1);
> > 	WRITE_ONCE(*nfcla, 1);
> > 	r11 = xchg_acquire(lcl, 1);
> > 	smp_store_release(lcl, 0);
> > 	r2 = READ_ONCE(*lcl_held);
> > 	WRITE_ONCE(*gbl_held, 1);
> 
> This litmus test is incomplete, because it omits the assignment setting
> nf_conntrack_locks_all back to false when the global lock is released.  
> You should insert
> 
> 	smp_store_release(*nfcla, 0);
> 
> right here.
> 
> > 	WRITE_ONCE(*gbl_held, 0);
> > 	smp_store_release(gbl, 0);
> > }
> > 
> > exists
> > ((0:r2=1 \/ 1:r2=1) /\ 0:r10=0 /\ 0:r11=0 /\ 0:r12=0 /\ 1:r10=0 /\ 1:r11=0)
> 
> With that addition, the litmus test fails unless the read of nfcla in 
> P0 is an smp_load_acquire.  So Manfred's patch should not be changed.

Very good!  Aside from updating litmus tests, my work is done!  ;-)

							Thanx, Paul