Date: Wed, 23 Nov 2022 07:05:05 -1000
From: Tejun Heo
To: "haifeng.xu"
Cc: longman@redhat.com, lizefan.x@bytedance.com, hannes@cmpxchg.org,
    cgroups@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] cgroup/cpuset: Optimize update_tasks_nodemask()
References: <20221123082157.71326-1-haifeng.xu@shopee.com>
In-Reply-To: <20221123082157.71326-1-haifeng.xu@shopee.com>

On Wed, Nov 23, 2022 at 08:21:57AM +0000, haifeng.xu wrote:
> When 'cpuset.mems' is changed under some cgroup, the system hangs
> for a long time. From the dmesg, many processes or threads are
> stuck in fork/exit. The reason is shown below.
>
> thread A:
> cpuset_write_resmask /* takes cpuset_rwsem */
> ...
> update_tasks_nodemask
> mpol_rebind_mm /* waits for mmap_lock */
>
> thread B:
> worker_thread
> ...
> cpuset_migrate_mm_workfn
> do_migrate_pages /* takes mmap_lock */
>
> thread C:
> cgroup_procs_write /* takes cgroup_mutex and cgroup_threadgroup_rwsem */
> ...
> cpuset_can_attach
> percpu_down_write /* waits for cpuset_rwsem */
>
> Once the nodemask of the cpuset is updated, thread A wakes up thread B
> to migrate the mm. But while thread A iterates through all tasks,
> including child threads and the group leader, it has to wait for the
> mmap_lock, which has been taken by thread B. Unfortunately, when
> thread C wants to migrate tasks into the cgroup at this moment, it
> must wait for thread A to release cpuset_rwsem. If thread B spends a
> long time migrating the mm, the fork/exit paths that acquire
> cgroup_threadgroup_rwsem also have to wait for a long time.
>
> There is no need to migrate the mm of child threads, since it is
> shared with the group leader.

This is only a problem in cgroup1, and cgroup1 doesn't require the
threads of a given process to be in the same cgroup. I don't think you
can optimize it this way.

Thanks.
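For readers following the thread: the skip the patch describes would
live in the task walk of update_tasks_nodemask() in
kernel/cgroup/cpuset.c. Below is a minimal sketch of that idea, not the
posted diff. The css_task_iter and mm helpers are real upstream kernel
APIs; the function name and the skip condition are a hypothetical
reconstruction from the commit message.

/*
 * Hypothetical sketch of the proposed skip, modeled on the task walk
 * in update_tasks_nodemask() (kernel/cgroup/cpuset.c). Per-task
 * mempolicy rebinding and the migration work queued via
 * cpuset_migrate_mm() are elided.
 */
static void update_tasks_nodemask_sketch(struct cpuset *cs)
{
        struct css_task_iter it;
        struct task_struct *task;

        css_task_iter_start(&cs->css, 0, &it);
        while ((task = css_task_iter_next(&it))) {
                struct mm_struct *mm = get_task_mm(task);

                if (!mm)
                        continue;

                /*
                 * Proposed skip: a CLONE_VM child thread shares its mm
                 * with the group leader, so rebinding the mm once via
                 * the leader would suffice, but only if the leader is
                 * guaranteed to be in this same cpuset, which cgroup1
                 * does not guarantee.
                 */
                if (task != task->group_leader &&
                    mm == task->group_leader->mm) {
                        mmput(mm);
                        continue;
                }

                mpol_rebind_mm(mm, &cs->mems_allowed);
                /* ... queue cpuset_migrate_mm() work as upstream does ... */
                mmput(mm);
        }
        css_task_iter_end(&it);
}

Because cgroup1 lets a child thread and its group leader sit in
different cgroups, the leader may never be visited by this walk, and a
skipped thread's shared mm would then never be rebound to the new
nodemask, which is the problem pointed out above.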
-- 
tejun