From: Mukesh Ojha
Date: Mon, 5 Sep 2022 12:14:19 +0530
Subject: Re: BUG: HANG_DETECT waiting for migration_cpu_stop() complete
To: Jing-Ting Wu, Peter Zijlstra, Valentin Schneider, Tejun Heo
CC: "chris.redpath@arm.com", Dietmar Eggemann, Vincent Donnefort, Ingo Molnar, Juri Lelli, Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman, Christian Brauner
References: <88b2910181bda955ac46011b695c53f7da39ac47.camel@mediatek.com>
In-Reply-To: <88b2910181bda955ac46011b695c53f7da39ac47.camel@mediatek.com>
X-Mailing-List: linux-kernel@vger.kernel.org

This is fixed by the following:
https://lore.kernel.org/lkml/YvrWaml3F+x9Dk+T@slm.duckdns.org/

-Mukesh

On 9/5/2022 8:17 AM, Jing-Ting Wu wrote:
> Hi,
>
> We hit a HANG_DETECT on the T SW version with kernel-5.15.
> Many tasks have been blocked for a long time.
>
>
> Root cause:
> migration_cpu_stop() never signals completion: because
> is_migration_disabled(p) is true, complete stays false and
> complete_all() is never executed.
> This leaves other tasks waiting on the rwsems.
>
> Detail:
> system_server is waiting for cgroup_threadgroup_rwsem.
> OomAdjuster is holding cgroup_threadgroup_rwsem and waiting for
> cpuset_rwsem.
> cpuset_hotplug_workfn is holding cpuset_rwsem and waiting for
> affine_move_task() to complete.
> affine_move_task() is waiting for migration_cpu_stop() to complete.
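
The wait chain in the Detail list can be followed mechanically. Below is a minimal userspace sketch in C, assuming the four edges exactly as reported; `wait_edge`, `wait_edges`, and `root_blocker` are hypothetical names used for illustration and are not kernel APIs.

```c
#include <stddef.h>
#include <string.h>

/* One "waiter is blocked on blocker" edge, as listed in the report. */
struct wait_edge {
    const char *waiter;
    const char *blocker;
};

/* Edges copied from the Detail list; the comments name the object waited on. */
static const struct wait_edge wait_edges[] = {
    { "system_server",         "OomAdjuster" },           /* cgroup_threadgroup_rwsem */
    { "OomAdjuster",           "cpuset_hotplug_workfn" }, /* cpuset_rwsem */
    { "cpuset_hotplug_workfn", "affine_move_task" },      /* waits for it to complete */
    { "affine_move_task",      "migration_cpu_stop" },    /* waits for it to complete */
};

/* Follow the edges from start until an entry with no outgoing edge is reached. */
static const char *root_blocker(const char *start)
{
    const char *cur = start;
    size_t n = sizeof(wait_edges) / sizeof(wait_edges[0]);

    /* At most n hops are needed; the bound also guards against a cycle. */
    for (size_t hop = 0; hop < n; hop++) {
        const char *next = NULL;

        for (size_t i = 0; i < n; i++) {
            if (strcmp(wait_edges[i].waiter, cur) == 0) {
                next = wait_edges[i].blocker;
                break;
            }
        }
        if (!next)
            break;      /* nothing further to wait on */
        cur = next;
    }
    return cur;
}
```

Starting from any task in the list, the walk ends at migration_cpu_stop(), which matches the root cause stated in the report.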
>
> The backtrace of system_server:
> __switch_to
> __schedule
> schedule
> percpu_rwsem_wait
> __percpu_down_read
> cgroup_css_set_fork => wait for cgroup_threadgroup_rwsem
> cgroup_can_fork
> copy_process
> kernel_clone
>
> The backtrace of OomAdjuster:
> __switch_to
> __schedule
> schedule
> percpu_rwsem_wait
> percpu_down_write
> cpuset_can_attach => wait for cpuset_rwsem
> cgroup_migrate_execute
> cgroup_attach_task
> __cgroup1_procs_write => hold cgroup_threadgroup_rwsem
> cgroup1_procs_write
> cgroup_file_write
> kernfs_fop_write_iter
> vfs_write
> ksys_write
>
> The backtrace of cpuset_hotplug_workfn:
> __switch_to
> __schedule
> schedule
> schedule_timeout
> wait_for_common
> affine_move_task => wait for complete
> __set_cpus_allowed_ptr_locked
> update_tasks_cpumask
> cpuset_hotplug_update_tasks => hold cpuset_rwsem
> cpuset_hotplug_workfn
> process_one_work
> worker_thread
> kthread
>
>
> affine_move_task() calls migration_cpu_stop() and waits for it to
> complete.
> In the normal case, when migration_cpu_stop() completes, it notifies
> all waiters that it is done.
> But there is one exceptional case in which no notification is sent:
> if is_migration_disabled(p) is true, complete stays false and
> complete_all() is never executed.
>
> static int migration_cpu_stop(void *data)
> {
>     ...
>     bool complete = false;
>     ...
>
>     if (task_rq(p) == rq) {
>         if (is_migration_disabled(p))
>             goto out;   /* is_migration_disabled(p) is true, so complete stays false */
>         ...
>     }
>     ...
>
> out:
>     ...
>     if (complete)       /* complete is false, so complete_all() is never executed */
>         complete_all(&pending->done);
>
>     return 0;
> }
>
>
> Reviewing the code, we found many places that can change the value of
> is_migration_disabled()
> (such as __rt_spin_lock(), rt_read_lock(), rt_write_lock(), ...).
>
> Do you have any suggestions for this issue?
> Thank you.
>
> Best regards,
> Jing-Ting Wu
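
The completion gate described in the report can be modeled in userspace. This is only a sketch of the quoted control flow, not the kernel implementation: `model_task`, `model_pending`, and `model_migration_cpu_stop` are hypothetical stand-ins, and setting `complete = true` on the path where migration is not disabled is an assumption about the code elided with "..." in the excerpt.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel objects; illustrative names only. */
struct model_task {
    bool on_this_rq;          /* models task_rq(p) == rq */
    bool migration_disabled;  /* models is_migration_disabled(p) */
};

struct model_pending {
    bool done;                /* models the completion in pending->done */
};

/* Mirrors only the branches quoted in the report. */
static int model_migration_cpu_stop(const struct model_task *p,
                                    struct model_pending *pending)
{
    bool complete = false;

    if (p->on_this_rq) {
        if (p->migration_disabled)
            goto out;         /* early exit: complete stays false */
        complete = true;      /* assumption: the elided path sets this */
    }

out:
    if (complete)
        pending->done = true; /* models complete_all(&pending->done) */

    return 0;
}
```

Calling model_migration_cpu_stop() with migration_disabled set leaves pending.done false, which in this model corresponds to the waiter in affine_move_task() never being woken by that invocation.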