Date: Tue, 27 Jun 2017 09:27:38 -0700
From: "Paul E. McKenney"
To: Tejun Heo
Cc: jiangshanlai@gmail.com, linux-kernel@vger.kernel.org
Subject: Re: WARN_ON_ONCE() in process_one_work()?
Reply-To: paulmck@linux.vnet.ibm.com
In-Reply-To: <20170623164142.GA14685@linux.vnet.ibm.com>
Message-Id: <20170627162738.GA16289@linux.vnet.ibm.com>

On Fri, Jun 23, 2017 at 09:41:42AM -0700, Paul E. McKenney wrote:
> On Wed, Jun 21, 2017 at 08:30:35AM -0700, Paul E. McKenney wrote:
> > On Tue, Jun 20, 2017 at 09:45:23AM -0700, Paul E. McKenney wrote:
> > > On Sun, Jun 18, 2017 at 06:40:00AM -0400, Tejun Heo wrote:
> > > > Hello,
> > > >
> > > > On Sat, Jun 17, 2017 at 10:31:05AM -0700, Paul E. McKenney wrote:
> > > > > On Sat, Jun 17, 2017 at 07:53:14AM -0400, Tejun Heo wrote:
> > > > > > Hello,
> > > > > >
> > > > > > On Fri, Jun 16, 2017 at 10:36:58AM -0700, Paul E. McKenney wrote:
> > > > > > > And no test failures from yesterday evening.  So it looks like we get
> > > > > > > somewhere on the order of one failure per 138 hours of TREE07 rcutorture
> > > > > > > runtime with your printk() in the mix.
> > > > > > >
> > > > > > > Was the above output from your printk() of any help?
> > > > > >
> > > > > > Yeah, if my suspicion is correct, it'd require new kworker creation
> > > > > > racing against CPU offline, which would explain why it's so difficult
> > > > > > to repro.  Can you please see whether the following patch resolves the
> > > > > > issue?
> > > > >
> > > > > That could explain why only Steve Rostedt and I saw the issue.
> > > > > As far as I know, we are the only ones who regularly run CPU-hotplug
> > > > > stress tests.  ;-)
> > > >
> > > > I was a bit confused.  It has to be racing against either a new kworker
> > > > being created on the wrong CPU or the rescuer trying to migrate to the
> > > > CPU, and it looks like we're mostly seeing the rescuer condition, but,
> > > > yeah, this would only get triggered rarely.  Another contributing
> > > > factor could be the vmstat work recently being put on a workqueue with
> > > > a rescuer.  It runs quite often, so it probably has increased the chance
> > > > of hitting the right condition.
> > >
> > > Sounds like too much fun!  ;-)
> > >
> > > But more constructively...  If I understand correctly, it is now possible
> > > to take a CPU partially offline and put it back online again.  This should
> > > allow much more intense testing of this sort of interaction.
> > >
> > > And no, I haven't yet tried this with RCU because I would probably need
> > > to do some mix of just-RCU online/offline and full-up online-offline.
> > > Plus RCU requires pretty much a full online/offline cycle to fully
> > > exercise it.  :-/
> > >
> > > > > I have a weekend-long run going, but will give this a shot overnight on
> > > > > Monday, Pacific Time.  Thank you for putting it together, looking forward
> > > > > to seeing what it does!
> > > >
> > > > Thanks a lot for the testing and patience.  Sorry that it took so
> > > > long.  I'm not completely sure the patch is correct.  It might have to
> > > > be more specific about which type of migration or require further
> > > > synchronization around migration, but hopefully it'll at least be able
> > > > to show that this was the cause of the problem.
> > >
> > > And last night's tests had no failures.  That might actually mean
> > > something; I will get more info when I run without your patch this
> > > evening.  ;-)
> >
> > And it didn't fail without the patch, either: 45 hours of test vs.
> > 60 hours with the patch.  This one is not going to be easy to prove
> > either way.  I will try again this evening without the patch and see
> > what that gets us.
>
> And another 36 hours (total of 81 hours) without the patch, still no
> failure.  Sigh.
>
> In the sense that the patch doesn't cause any new problems:
>
> Tested-by: Paul E. McKenney
>
> But I clearly have nothing of statistical significance, so any confidence
> in the fix is coming from your reproducer.

And for whatever it is worth, I did finally get a reproduction without
the patch.  The probability of occurrence is quite low with my test
setup, so please queue this patch.  I will accumulate test time on it
over the months to come.  :-/

							Thanx, Paul
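
[For readers following this thread from the archive: the kind of check being
discussed is the one in process_one_work() that fires when a worker bound to a
per-CPU pool finds itself running on some other CPU while the pool is still
associated.  Below is a rough userspace model of that situation, assuming only
what the thread describes; it is not the kernel code and not the patch under
test.  All names here (process_one_work_model, pool_cpu, disassociated) are
illustrative stand-ins, and the affinity change in main() merely stands in for
a CPU going offline while a new kworker or the rescuer is still migrating
toward it.]

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

static int pool_cpu;                /* CPU this "pool" is nominally bound to */
static atomic_int disassociated;    /* would be set once the CPU goes offline */

/* Userspace analogue of the WARN_ON_ONCE() check discussed above. */
static void process_one_work_model(void)
{
	int cpu = sched_getcpu();

	/* Warn if an associated pool's worker runs on the wrong CPU. */
	if (!atomic_load(&disassociated) && cpu != pool_cpu)
		fprintf(stderr, "WARN: worker on CPU %d, expected CPU %d\n",
			cpu, pool_cpu);
}

static void *worker(void *arg)
{
	cpu_set_t set;
	long i;

	/* Bind the "kworker" to its pool's CPU, as an associated pool would. */
	CPU_ZERO(&set);
	CPU_SET(pool_cpu, &set);
	pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

	for (i = 0; i < 10 * 1000 * 1000; i++)
		process_one_work_model();
	return NULL;
}

int main(void)
{
	pthread_t t;
	cpu_set_t all;
	int c;

	pthread_create(&t, NULL, worker, NULL);
	usleep(1000);

	/*
	 * Stand-in for the CPU going offline while a new kworker or the
	 * rescuer is still migrating toward it: the worker's affinity is
	 * widened without the pool being marked disassociated first, so
	 * the check above can trip.
	 */
	CPU_ZERO(&all);
	for (c = 0; c < CPU_SETSIZE; c++)
		CPU_SET(c, &all);
	pthread_setaffinity_np(t, sizeof(all), &all);

	pthread_join(t, NULL);
	return 0;
}

[It builds with "gcc -pthread"; whether the warning actually fires depends on
the scheduler happening to move the worker inside the window, which is in the
same spirit as the low reproduction rates reported above.]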