From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1754319AbdFWQls (ORCPT <rfc822;w@1wt.eu>);
        Fri, 23 Jun 2017 12:41:48 -0400
Received: from mx0a-001b2d01.pphosted.com ([148.163.156.1]:38242 "EHLO
        mx0a-001b2d01.pphosted.com" rhost-flags-OK-OK-OK-OK)
        by vger.kernel.org with ESMTP id S1752606AbdFWQlr (ORCPT
        <rfc822;linux-kernel@vger.kernel.org>);
        Fri, 23 Jun 2017 12:41:47 -0400
Date: Fri, 23 Jun 2017 09:41:42 -0700
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Tejun Heo <tj@kernel.org>
Cc: jiangshanlai@gmail.com, linux-kernel@vger.kernel.org
Subject: Re: WARN_ON_ONCE() in process_one_work()?
Reply-To: paulmck@linux.vnet.ibm.com
References: <20170613205837.GB7359@htj.duckdns.org>
 <20170613223103.GX3721@linux.vnet.ibm.com>
 <20170614151548.GA14462@linux.vnet.ibm.com>
 <20170615153857.GA27788@linux.vnet.ibm.com>
 <20170616173658.GA451@linux.vnet.ibm.com>
 <20170617115314.GA20758@htj.duckdns.org>
 <20170617173105.GI3721@linux.vnet.ibm.com>
 <20170618104000.GC28042@htj.duckdns.org>
 <20170620164523.GI3721@linux.vnet.ibm.com>
 <20170621153035.GA31181@linux.vnet.ibm.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20170621153035.GA31181@linux.vnet.ibm.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
X-TM-AS-GCONF: 00
x-cbid: 17062316-0036-0000-0000-00000234DC65
X-IBM-SpamModules-Scores: 
X-IBM-SpamModules-Versions: BY=3.00007278; HX=3.00000241; KW=3.00000007;
 PH=3.00000004; SC=3.00000214; SDB=6.00878826; UDB=6.00437942; IPR=6.00658977;
 BA=6.00005438; NDR=6.00000001; ZLA=6.00000005; ZF=6.00000009; ZB=6.00000000;
 ZP=6.00000000; ZH=6.00000000; ZU=6.00000002; MB=3.00015944; XFM=3.00000015;
 UTC=2017-06-23 16:41:45
X-IBM-AV-DETECTION: SAVI=unused REMOTE=unused XFE=unused
x-cbparentid: 17062316-0037-0000-0000-000040D6FBCC
Message-Id: <20170623164142.GA14685@linux.vnet.ibm.com>
X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10432:,, definitions=2017-06-23_10:,,
 signatures=0
X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 spamscore=0 suspectscore=0
 malwarescore=0 phishscore=0 adultscore=0 bulkscore=0 classifier=spam
 adjust=0 reason=mlx scancount=1 engine=8.0.1-1703280000
 definitions=main-1706230280
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Jun 21, 2017 at 08:30:35AM -0700, Paul E. McKenney wrote:
> On Tue, Jun 20, 2017 at 09:45:23AM -0700, Paul E. McKenney wrote:
> > On Sun, Jun 18, 2017 at 06:40:00AM -0400, Tejun Heo wrote:
> > > Hello,
> > > 
> > > On Sat, Jun 17, 2017 at 10:31:05AM -0700, Paul E. McKenney wrote:
> > > > On Sat, Jun 17, 2017 at 07:53:14AM -0400, Tejun Heo wrote:
> > > > > Hello,
> > > > > 
> > > > > On Fri, Jun 16, 2017 at 10:36:58AM -0700, Paul E. McKenney wrote:
> > > > > > And no test failures from yesterday evening.  So it looks like we get
> > > > > > somewhere on the order of one failure per 138 hours of TREE07 rcutorture
> > > > > > runtime with your printk() in the mix.
> > > > > >
> > > > > > Was the above output from your printk() output of any help?
> > > > > 
> > > > > Yeah, if my suspicion is correct, it'd require new kworker creation
> > > > > racing against CPU offline, which would explain why it's so difficult
> > > > > to repro.  Can you please see whether the following patch resolves the
> > > > > issue?
> > > > 
> > > > That could explain why only Steve Rostedt and I saw the issue.  As far
> > > > as I know, we are the only ones who regularly run CPU-hotplug stress
> > > > tests.  ;-)
> > > 
> > > I was a bit confused.  It has to be racing against either new kworker
> > > being created on the wrong CPU or rescuer trying to migrate to the
> > > CPU, and it looks like we're mostly seeing the rescuer condition, but,
> > > yeah, this would only get triggered rarely.  Another contributing
> > > factor could be the vmstat work having recently been put on a
> > > workqueue w/ rescuer.  It runs quite often, so it has probably
> > > increased the chance of hitting the right condition.
> > 
> > Sounds like too much fun!  ;-)
> > 
> > But more constructively...  If I understand correctly, it is now possible
> > to take a CPU partially offline and put it back online again.  This should
> > allow much more intense testing of this sort of interaction.
> > 
> > And no, I haven't yet tried this with RCU because I would probably need
> > to do some mix of just-RCU online/offline and full-up online/offline.
> > Plus RCU requires pretty much a full online/offline cycle to fully
> > exercise it.  :-/
> > 
> > > > I have a weekend-long run going, but will give this a shot overnight on
> > > > Monday, Pacific Time.  Thank you for putting it together, looking forward
> > > > to seeing what it does!
> > > 
> > > Thanks a lot for the testing and patience.  Sorry that it took so
> > > long.  I'm not completely sure the patch is correct.  It might have to
> > > be more specific about which type of migration or require further
> > > synchronization around migration, but hopefully it'll at least be able
> > > to show that this was the cause of the problem.
> > 
> > And last night's tests had no failures.  Which might actually mean
> > something, will get more info when I run without your patch this
> > evening.  ;-)
> 
> And it didn't fail without the patch, either.  45 hours of test vs.
> 60 hours with the patch.  This one is not going to be easy to prove
> either way.  I will try again this evening without the patch and see
> what that gets us.

And another 36 hours (total of 81 hours) without the patch, still no
failure.  Sigh.

In the sense that the patch doesn't cause any new problem:

Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

But I clearly have nothing of statistical significance, so any confidence
in the fix is coming from your reproducer.
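
For a rough feel of why these runs prove so little, here is a
back-of-the-envelope sketch.  It assumes failures arrive as a Poisson
process at the roughly one-per-138-hours rate seen earlier in this thread
(measured with the printk() applied, so it may not carry over), and the
function and constant names are made up for illustration:

```python
import math

# Assumed failure rate: "one failure per 138 hours" of TREE07 rcutorture
# runtime, as reported earlier in the thread (with the printk() applied).
MTBF_HOURS = 138.0


def p_zero_failures(hours, mtbf_hours):
    """Probability of seeing no failure in `hours` of testing, assuming a
    Poisson process with mean time between failures `mtbf_hours`."""
    return math.exp(-hours / mtbf_hours)


# Test durations quoted in this thread.
for label, hours in (("with patch", 60.0), ("without patch", 81.0)):
    p = p_zero_failures(hours, MTBF_HOURS)
    print(f"{label}: {hours:.0f} hours, P(no failure) = {p:.2f}")
```

Under those assumptions, a clean 60-hour run has about a 0.65 chance of
happening on a still-buggy kernel, and a clean 81-hour run about 0.56, so
neither result says much about whether the bug is still there.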

							Thanx, Paul