From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <93f4ce9486ec4b856ba0f3bfe956fc9b2d3cb4cf.camel@mediatek.com>
Subject: Re: BUG: HANG_DETECT waiting for migration_cpu_stop() complete
From: Jing-Ting Wu
To: Hillf Danton
CC: Peter Zijlstra, Waiman Long, Valentin Schneider, Tejun Heo,
 "chris.redpath@arm.com", Dietmar Eggemann, Vincent Donnefort, Ingo Molnar,
 Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
 Mel Gorman, Christian Brauner
Date: Thu, 22 Sep 2022 13:40:47 +0800
In-Reply-To: <20220907000741.2496-1-hdanton@sina.com>
References: <20220907000741.2496-1-hdanton@sina.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, 2022-09-07 at 08:07 +0800, Hillf Danton wrote:
> On 5 Sep 2022 10:47:36 +0800 Jing-Ting Wu wrote
> >
> > We meet the HANG_DETECT happened in T SW version with kernel-5.15.
> > Many tasks have been blocked for a long time.
> >
> > Root cause:
> > migration_cpu_stop() is not complete due to is_migration_disabled(p) is
> > true, complete is false and complete_all() never get executed.
> > It let other task wait the rwsem.
>
> See if handing task over to stopper again in case of migration disabled
> could survive your tests.
>
> Hillf
>
> --- linux-5.15/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2322,9 +2322,7 @@ static int migration_cpu_stop(void *data
>  	 * holding rq->lock, if p->on_rq == 0 it cannot get enqueued because
>  	 * we're holding p->pi_lock.
>  	 */
> -	if (task_rq(p) == rq) {
> -		if (is_migration_disabled(p))
> -			goto out;
> +	if (task_rq(p) == rq && !is_migration_disabled(p)) {
>
>  		if (pending) {
>  			p->migration_pending = NULL;

Because Peter had some concerns about Waiman's patch, we added Hillf's
patch to our stability test instead.
However, the patch has a side effect: the warning below appeared once in
under two weeks.

Backtrace as follows:
[name:panic&]WARNING: CPU: 6 PID: 32583 at affine_move_task
pc : affine_move_task
lr : __set_cpus_allowed_ptr_locked
Call trace:
affine_move_task
__set_cpus_allowed_ptr_locked
migrate_enable
__cgroup_bpf_run_filter_skb
ip_finish_output
ip_output

The root cause is that when is_migration_disabled(p) is true, the patched
migration_cpu_stop() sets p->migration_pending to NULL.
affine_move_task() then hits WARN_ON_ONCE(!pending).

kernel-5.15/kernel/sched/core.c:
static int affine_move_task(struct rq *rq, struct task_struct *p,
			    struct rq_flags *rf, int dest_cpu, unsigned int flags)
{
	...
	if (WARN_ON_ONCE(!pending)) {
		task_rq_unlock(rq, p, rf);
		return -EINVAL;
	}
	...
}

But the task has not yet been migrated to a CPU in the new affinity mask,
so there is still pending work to process and p->migration_pending should
not be NULL.

Without the patch:
When is_migration_disabled(p) is true, the code jumps to out and does not
set p->migration_pending to NULL.

static int migration_cpu_stop(void *data)
{
	...
	if (task_rq(p) == rq) {
		if (is_migration_disabled(p))
			goto out;
	}
	...
}

With the patch:
When is_migration_disabled(p) is true and pending is set, the code takes
the else-if branch. Because p->cpus_ptr is not updated while migration is
disabled, the cpumask_test_cpu() condition is always true and
p->migration_pending is set to NULL.

static int migration_cpu_stop(void *data)
{
	...
	if (task_rq(p) == rq && !is_migration_disabled(p)) {
		...
	} else if (pending) {
		...
		if (cpumask_test_cpu(task_cpu(p), p->cpus_ptr)) {
			p->migration_pending = NULL;
			complete = true;
			goto out;
		}
		...
	}
	...
}

Best regards,
Jing-Ting Wu
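
To make the difference between the two flows easier to see, here is a
minimal user-space sketch in C. It is not kernel code: every toy_* type and
function is a hypothetical stand-in that models only the branch structure of
migration_cpu_stop() described in the message, and it assumes, as the report
states, that a migration-disabled task always passes the cpus_ptr check.
Locking, the stopper re-queue and the actual migration are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct set_affinity_pending. */
struct toy_pending {
	bool done;			/* models complete_all(&pending->done) */
};

/* Hypothetical stand-in for the task_struct fields involved. */
struct toy_task {
	int cpu;			/* models task_cpu(p) */
	unsigned long cpus_allowed;	/* bitmask modelling *p->cpus_ptr */
	bool on_this_rq;		/* models task_rq(p) == rq */
	bool migration_disabled;	/* models is_migration_disabled(p) */
	struct toy_pending *migration_pending;
};

/* Models cpumask_test_cpu(task_cpu(p), p->cpus_ptr). */
static inline bool toy_cpu_allowed(const struct toy_task *p)
{
	return (p->cpus_allowed >> p->cpu) & 1UL;
}

/* Unpatched flow: a migration-disabled task on this rq bails out early. */
void toy_stop_unpatched(struct toy_task *p, struct toy_pending *pending)
{
	bool complete = false;

	assert(p != NULL);
	if (p->on_this_rq) {
		if (p->migration_disabled)
			goto out;	/* pending request stays installed */
		if (pending) {
			p->migration_pending = NULL;
			complete = true;
		}
	} else if (pending) {
		if (toy_cpu_allowed(p)) {
			p->migration_pending = NULL;
			complete = true;
		}
	}
out:
	if (complete)
		pending->done = true;
}

/* Patched flow: a migration-disabled task falls into the else-if branch. */
void toy_stop_patched(struct toy_task *p, struct toy_pending *pending)
{
	bool complete = false;

	assert(p != NULL);
	if (p->on_this_rq && !p->migration_disabled) {
		if (pending) {
			p->migration_pending = NULL;
			complete = true;
		}
	} else if (pending) {
		if (toy_cpu_allowed(p)) {
			p->migration_pending = NULL;	/* cleared too early */
			complete = true;
		}
	}
	if (complete)
		pending->done = true;
}
```

Calling both functions on a task with on_this_rq and migration_disabled set,
and with its current CPU present in cpus_allowed, leaves migration_pending
installed in the unpatched model and clears it in the patched one. The
cleared state is the one in which a later migrate_enable() would reach
affine_move_task() with pending == NULL.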