From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id F3954C4332F
	for <linux-kernel@archiver.kernel.org>; Wed, 23 Nov 2022 08:23:52 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S235990AbiKWIXw (ORCPT <rfc822;linux-kernel@archiver.kernel.org>);
        Wed, 23 Nov 2022 03:23:52 -0500
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:59294 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S236528AbiKWIXp (ORCPT
        <rfc822;linux-kernel@vger.kernel.org>);
        Wed, 23 Nov 2022 03:23:45 -0500
Received: from mail-pl1-x631.google.com (mail-pl1-x631.google.com [IPv6:2607:f8b0:4864:20::631])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id D8C76FBA9F
        for <linux-kernel@vger.kernel.org>; Wed, 23 Nov 2022 00:23:42 -0800 (PST)
Received: by mail-pl1-x631.google.com with SMTP id b21so15980545plc.9
        for <linux-kernel@vger.kernel.org>; Wed, 23 Nov 2022 00:23:42 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=shopee.com; s=shopee.com;
        h=content-transfer-encoding:mime-version:message-id:date:subject:cc
         :to:from:from:to:cc:subject:date:message-id:reply-to;
        bh=yo2K84hmvJF9oKmVUE8rVWn5muORVTM6UmA/7exBFUM=;
        b=VmEnvFprI1mHKVy+07lphANsK+RLSjSH8ZZGVFzeZ0SQCYsAvYfb8tQwV+GaQs4jMd
         Fhfh4ZC4WvVGyUc3TlX3vl0ZqD5kHFvh/BahqrDF0S8wapyOl1nAVvuhgDLPI6yPojVw
         LuFg+XX0OxUSgPe8Zq2RVjTjeKnvONyFn0cZRTsjOuIaEo7QUKwVu9hXgIFK9LgJ14Ym
         u3w+77ooTwpLJsCE55lO0l2qS8XmJ7T/0uHynGwdGMPgSlRm37X1Cfy9zcnQgo28/bZd
         G3Tcu28J0Id40ep5xIQ7m6QU0kNdS5nxA2SEjOkuuftUUfY2jdqRM7I4BdF79Ew+iIXI
         HGnA==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20210112;
        h=content-transfer-encoding:mime-version:message-id:date:subject:cc
         :to:from:x-gm-message-state:from:to:cc:subject:date:message-id
         :reply-to;
        bh=yo2K84hmvJF9oKmVUE8rVWn5muORVTM6UmA/7exBFUM=;
        b=sRJaOW7Unc06FRsDLJv3mrHwwdk3AQNfy0PQ/dhJPABmmdCtjhZH7ci2EczFcf/JnA
         4auWpxZnYdMafgQ//P8036go3rm+raNCAHERpGf8m6v6v4tcuSoabGhpNOhSWeoZksPf
         DzOqfCOzFuVJ4iOXpn8nixAwQtdjSD720XGyN7F7UfcH4cDLW8NLwrBRv5dFNCPkerNL
         OUUzkv7weklnhnKPVENrl51yaolTm3CJyXWk5g41UtzkNAUaVsEy5wGFSEjhoRQDQ6dO
         1KBLoNUNpVD4JaVdc1eRMme1JD6r20WhzwS254FhpZDRKrApBufwUot5KtFeJZSkIy53
         /z/g==
X-Gm-Message-State: ANoB5pkhJPTJ3Oe+dEFZ+Yyn4mUiTUmgoaoy6Ri22EcOdD4UjgRAxQgG
        6UfmDHbQK8Cpix7GiBsdbKX2Fg==
X-Google-Smtp-Source: AA0mqf4FevgVKHdekbRNTFvbZELocM4lIaNfW3Nt4RsG+HYtzkflHLBVWeGWOkbsn/f2CXfPF6KLSA==
X-Received: by 2002:a17:90b:2688:b0:218:b9e1:ebef with SMTP id pl8-20020a17090b268800b00218b9e1ebefmr12795766pjb.65.1669191822401;
        Wed, 23 Nov 2022 00:23:42 -0800 (PST)
Received: from ubuntu-haifeng.default.svc.cluster.local ([101.127.248.173])
        by smtp.gmail.com with ESMTPSA id h17-20020aa79f51000000b0056c2e497b02sm12454017pfr.173.2022.11.23.00.23.40
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Wed, 23 Nov 2022 00:23:42 -0800 (PST)
From:   "haifeng.xu" <haifeng.xu@shopee.com>
To:     longman@redhat.com
Cc:     lizefan.x@bytedance.com, tj@kernel.org, hannes@cmpxchg.org,
        cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
        "haifeng.xu" <haifeng.xu@shopee.com>
Subject: [PATCH] cgroup/cpuset: Optimize update_tasks_nodemask()
Date:   Wed, 23 Nov 2022 08:21:57 +0000
Message-Id: <20221123082157.71326-1-haifeng.xu@shopee.com>
X-Mailer: git-send-email 2.25.1
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: bulk
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

When changing 'cpuset.mems' under some cgroup, the system can hang
for a long time. According to dmesg, many processes or threads are
stuck in fork/exit. The reason is as follows.

thread A:
cpuset_write_resmask /* takes cpuset_rwsem */
  ...
    update_tasks_nodemask
      mpol_rebind_mm /* waits mmap_lock */

thread B:
worker_thread
  ...
    cpuset_migrate_mm_workfn
      do_migrate_pages /* takes mmap_lock */

thread C:
cgroup_procs_write /* takes cgroup_mutex and cgroup_threadgroup_rwsem */
  ...
    cpuset_can_attach
      percpu_down_write /* waits cpuset_rwsem */

After updating the nodemask of the cpuset, thread A wakes up thread B
to migrate the mm. But when thread A iterates through all tasks,
including child threads and the group leader, it has to wait for the
mmap_lock, which has been taken by thread B. Unfortunately, thread C
wants to migrate tasks into the cgroup at this moment, so it must wait
for thread A to release cpuset_rwsem. If thread B spends a long time
migrating the mm, the fork/exit paths that acquire
cgroup_threadgroup_rwsem also have to wait for a long time.

There is no need to migrate the mm of child threads, because it is
shared with the group leader. The nodemask is still updated for every
task, but the mm is handled for the group leader only.
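
As a rough illustration of the reasoning above, the following
user-space sketch counts how often a shared mm is processed with and
without the leader check. It is not kernel code: toy_mm, toy_task,
toy_is_group_leader() and toy_update_tasks() are hypothetical names
invented for this example, and the "pid == tgid" test is a simplified
stand-in for the kernel's thread_group_leader().

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's mm_struct and task_struct. */
struct toy_mm {
	int rebind_count;	/* times this address space was processed */
};

struct toy_task {
	int pid;
	int tgid;		/* pid of the thread group leader */
	struct toy_mm *mm;	/* shared by all threads in the group */
};

/* Assumption: a leader is modeled as a task whose pid equals its tgid. */
static bool toy_is_group_leader(const struct toy_task *t)
{
	return t->pid == t->tgid;
}

/*
 * Walk every task. Per-task work would apply to all of them, but when
 * leaders_only is set the per-mm work is skipped for non-leaders.
 * Returns how many times per-mm work was performed.
 */
static int toy_update_tasks(struct toy_task *tasks, size_t n, bool leaders_only)
{
	int mm_ops = 0;

	for (size_t i = 0; i < n; i++) {
		/* The per-task nodemask update would happen here. */

		if (leaders_only && !toy_is_group_leader(&tasks[i]))
			continue;
		if (!tasks[i].mm)
			continue;

		/* Stands in for the mm rebind and migration step. */
		tasks[i].mm->rebind_count++;
		mm_ops++;
	}
	return mm_ops;
}
```

With one three-thread process and one single-threaded process, the
unfiltered walk touches the shared mm three times, while the
leader-only walk touches each mm exactly once.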

Signed-off-by: haifeng.xu <haifeng.xu@shopee.com>
---
 kernel/cgroup/cpuset.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 589827ccda8b..43cbd09546d0 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1968,6 +1968,9 @@ static void update_tasks_nodemask(struct cpuset *cs)
 
 		cpuset_change_task_nodemask(task, &newmems);
 
+		if (!thread_group_leader(task))
+			continue;
+
 		mm = get_task_mm(task);
 		if (!mm)
 			continue;
-- 
2.25.1

