From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Sun, 23 Nov 2014 16:42:07 -0500
From: Chris Mason
Subject: Re: New crashes walking proc with Saturday's git
To: Thomas Gleixner
CC: Borislav Petkov, Ingo Molnar, Stanislaw Gruszka
Message-ID: <1416778927.3019.1@mail.thefacebook.com>
References: <20141123010239.GA12691@ret.masoncoding.com>
 <1416758187.24312.12@mail.thefacebook.com>
 <20141123161120.GB7070@pd.tnic>
 <1416759411.24312.13@mail.thefacebook.com>
 <20141123163258.GB6436@pd.tnic>
 <1416761342.24312.15@mail.thefacebook.com>
 <1416777079.1732.0@mail.thefacebook.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Sun, Nov 23, 2014 at 4:38 PM, Thomas Gleixner wrote:
> On Sun, 23 Nov 2014, Chris Mason wrote:
>> On Sun, Nov 23, 2014 at 4:05 PM, Thomas Gleixner wrote:
>> > On Sun, 23 Nov 2014, Chris Mason wrote:
>> > > On Sun, Nov 23, 2014 at 11:32 AM, Borislav Petkov wrote:
>> > > > On Sun, Nov 23, 2014 at 11:16:51AM -0500, Chris Mason wrote:
>> > > > > It must be:
>> > > > >
>> > > > > commit 6e998916dfe327e785e7c2447959b2c1a3ea4930
>> > > > > Author: Stanislaw Gruszka
>> > > > > Date:   Wed Nov 12 16:58:44 2014 +0100
>> > > > >
>> > > > >     sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency
>> > > > >
>> > > > > I'll do two runs to confirm, but it's the only related patch between
>> > > > > rc5 and now.
>> > >
>> > > I've added Ingo and Stanislaw to the cc. With
>> > > 6e998916dfe327e785e7c2447959b2c1a3ea4930 reverted, I'm no longer
>> > > crashing.
>> > >
>> > > Repeating the stack trace for the new cc list. I see the crash with atop
>> > > or similar walkers of /proc racing against exiting programs. Given the
>> > > NULL rip, this line from the patch is probably broken, but it really
>> > > feels like we should be falling over on p->sched_class and not on the
>> > > update_curr func.
>> > >
>> > > + p->sched_class->update_curr(rq);
>> > >
>> > > I'm leaving my fork bomb running on two machines with the patch reverted
>> > > to make sure.
>> >
>> > The sched_class instances which do not have update_curr are stop_task
>> > and idle. Patch below.
>> >
>> > I'm sure nobody thought about the stats read code path here.
>> >
>> > [ 1053.759741] [] do_task_stat+0x8b8/0xb00
>> >
>> > do_task_stat()
>> >   thread_group_cputime_adjusted()
>> >     thread_group_cputime()
>> >       task_cputime()
>> >         task_sched_runtime()
>> >           if (task_current(rq, p) && task_on_rq_queued(p)) {
>> >                   update_rq_clock(rq);
>> >                   p->sched_class->update_curr(rq);
>> >           }
>> >
>> > Now if the stats are read for a stomp machine task, aka 'migration/N'
>> > and that task is current on its cpu. Ooops.
>> >
>> > I added the callback for idle tasks as well for completeness sake.
>>
>> This does make sense, but it doesn't match with the crash being much more
>> likely during the fork bomb. The difference is crashing within a few hours vs
>> crashing within 5 minutes.
>
> The fork bomb will kick the migration task pretty often into life, so
> the probability of do_task_stat() to hit a running migration thread is
> higher than on a normally loaded machine.

Fair enough, I just had crashes_in_proc == races_with_exit stuck in my head ;)

I've got a new xfstests run on the second machine to be sure, but this is
definitely better.

-chris
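
For readers following the thread, here is a rough userspace sketch of the failure mode discussed above. It is not the kernel code and not Thomas's patch. Every name ending in _demo is made up for illustration, and the -1 return stands in for the jump to address 0 that the real path performs, since the kernel code has no such check. It may help show why the crash has a NULL rip instead of a fault on p->sched_class: the class pointer itself is valid, only the callback slot in the stop (and idle) class is empty. Adding a no-op callback, as the thread describes, makes the call safe.

```c
#include <stddef.h>

/* Hypothetical, stripped-down stand-in for the kernel's struct rq. */
struct rq_demo {
    unsigned long update_count;   /* how many times a real update ran */
};

/* Hypothetical stand-in for struct sched_class: a table of callbacks. */
struct sched_class_demo {
    const char *name;
    void (*update_curr)(struct rq_demo *rq);   /* NULL if the class never set it */
};

/* Hypothetical stand-in for struct task_struct. */
struct task_demo {
    const struct sched_class_demo *sched_class;
};

/* A class that does real accounting work in its callback. */
static void update_curr_fair_demo(struct rq_demo *rq)
{
    rq->update_count++;
}

/* The shape of the fix described in the thread: a callback that does nothing. */
static void update_curr_noop_demo(struct rq_demo *rq)
{
    (void)rq;
}

static const struct sched_class_demo fair_class_demo = {
    "fair", update_curr_fair_demo
};

/* Before the fix: the stop class leaves the callback slot empty. */
static const struct sched_class_demo stop_class_broken_demo = {
    "stop", NULL
};

/* After the fix: the slot holds a harmless no-op. */
static const struct sched_class_demo stop_class_fixed_demo = {
    "stop", update_curr_noop_demo
};

/*
 * Models the stats read path. p->sched_class is always a valid pointer
 * here, so dereferencing it cannot fault. The danger is calling through
 * an empty update_curr slot. This demo reports that case with -1; the
 * kernel path has no check and simply jumps to address 0.
 */
static int stats_read_demo(const struct task_demo *p, struct rq_demo *rq)
{
    if (p->sched_class->update_curr == NULL)
        return -1;
    p->sched_class->update_curr(rq);
    return 0;
}
```

Calling stats_read_demo() on a task whose class is stop_class_broken_demo returns -1, while the same call with stop_class_fixed_demo returns 0 and leaves the counters untouched.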