All of lore.kernel.org
 help / color / mirror / Atom feed
From: Valentin Schneider <vschneid@redhat.com>
To: Thomas Gleixner <tglx@linutronix.de>,
	Frederic Weisbecker <frederic@kernel.org>,
	"Paul E. McKenney" <paulmck@kernel.org>
Cc: LKML <linux-kernel@vger.kernel.org>,
	Ingo Molnar <mingo@kernel.org>,
	Anna-Maria Behnsen <anna-maria@linutronix.de>,
	Alex Shi <alexs@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Barry Song <song.bao.hua@hisilicon.com>,
	Tejun Heo <tj@kernel.org>
Subject: Re: for_each_domain()/sched_domain_span() has offline CPUs (was Re: [PATCH 2/2] timers: Fix removed self-IPI on global timer's enqueue in nohz_full)
Date: Thu, 28 Mar 2024 21:39:56 +0100	[thread overview]
Message-ID: <xhsmhle62ay5f.mognet@vschneid-thinkpadt14sgen2i.remote.csb> (raw)
In-Reply-To: <87cyrfe7a3.ffs@tglx>

On 27/03/24 21:42, Thomas Gleixner wrote:
> On Tue, Mar 26 2024 at 17:46, Valentin Schneider wrote:
>> On 22/03/24 14:22, Frederic Weisbecker wrote:
>>> On Fri, Mar 22, 2024 at 12:32:26PM +0100, Frederic Weisbecker wrote:
>> Now, on top of the above, there's one more thing worth noting:
>>   cpu_up_down_serialize_trainwrecks()
>>
>> This just flushes the cpuset work, so after that the sched_domain topology
>> should be sane. However I see it's invoked at the tail end of _cpu_down(),
>> IOW /after/ takedown_cpu() has run, which sounds too late. The comments
>> around this vs. lock ordering aren't very reassuring however, so I need to
>> look into this more.
>
> commit b22afcdf04c96ca58327784e280e10288cfd3303 has more information in
> the change log.
>
> TLDR: The problem is that cpusets have the lock order cpuset_mutex ->
> cpu_hotplug_lock in the hotplug path for whatever silly reason. So
> trying to flush the work from within the cpu hotplug lock write held
> region is guaranteed to dead lock.
>
> That's why all of this got deferred to a work queue. The flush at the
> end of the hotplug code after dropping the hotplug lock is there to
> prevent that user space sees the CPU hotplug uevent before the work is
> done. But of course between bringing the CPU offline and the flush the
> kernel internal state is inconsistent.
>

Thanks for the summary!

> There was an attempt to make the CPU hotplug path synchronous in commit
> a49e4629b5ed ("cpuset: Make cpuset hotplug synchronous") which got
> reverted with commit 2b729fe7f3 for the very wrong reason:
>
> https://lore.kernel.org/all/F0388D99-84D7-453B-9B6B-EEFF0E7BE4CC@lca.pw/T/#u
>
>  (cpu_hotplug_lock){++++}-{0:0}:
>   lock_acquire+0xe4/0x25c
>   cpus_read_lock+0x50/0x154
>   static_key_slow_inc+0x18/0x30
>   mem_cgroup_css_alloc+0x824/0x8b0
>   cgroup_apply_control_enable+0x1d8/0x56c
>   cgroup_apply_control+0x40/0x344
>   cgroup_subtree_control_write+0x664/0x69c
>   cgroup_file_write+0x130/0x2e8
>   kernfs_fop_write+0x228/0x32c
>   __vfs_write+0x84/0x1d8
>   vfs_write+0x13c/0x1b4
>   ksys_write+0xb0/0x120
>
> Instead of the revert this should have been fixed by doing:
>
> +  cpus_read_lock();
>    mutex_lock();
>    mem_cgroup_css_alloc();
> -  static_key_slow_inc();
> +  static_key_slow_inc_cpuslocked();
>

So looking at the state of things now, writing to the
cgroup.subtree_control file looks like: 

  cgroup_file_write()
  `\
    cgroup_subtree_control_write()
    `\
      cgroup_kn_lock_live()
      `\
      | cgroup_lock()
      | `\
      |   mutex_lock(&cgroup_mutex);
      |
      cgroup_apply_control()
      `\
        cgroup_apply_control_enable()
        `\
          css_create()
          `\
            ss->css_alloc() [mem_cgroup_css_alloc()]
            `\
              static_branch_inc()        

and same with cgroup_mkdir(). So if we want to fix the ordering that caused
the revert, we'd probably want to go for:

  static inline void cgroup_lock(void)
  {
+       cpus_read_lock();
	mutex_lock(&cgroup_mutex);
  }

  static inline void cgroup_unlock(void)
  {
	mutex_unlock(&cgroup_mutex);
-       cpus_read_unlock();        
  }

+ a handful of +_cpuslocked() changes.

As for cpuset, it looks like it's following this lock order:

	cpus_read_lock();
	mutex_lock(&cpuset_mutex);

Which AFAICT is what we want.

> Sorry that I did not notice this back then because I was too focussed on
> fixing that uevent nonsense in a halfways sane way.
>
> After that revert cpuset locking became completely insane.
>
> cpuset_hotplug_cpus_read_trylock() is the most recent but also the most
> advanced part of that insanity. Seriously this commit is just tasteless
> and disgusting demonstration of how to paper over a problem which is not
> fully understood.
>
> After staring at it some more (including the history which led up to
> these insanities) I really think that the CPU hotplug path can be made
> truly synchronous and the memory hotplug path should just get the lock
> ordering correct.
>
> Can we please fix this for real and get all of this insanity out of the
> way

Yes please!

> including the nonsensical comments in the cpuset code:
>
>  * Call with cpuset_mutex held.  Takes cpus_read_lock().
>
>  ...
>
>        lockdep_assert_cpus_held();
>        lockdep_assert_held(&cpuset_mutex);
>
> Oh well...
>
> Thanks,
>
>         tglx


  reply	other threads:[~2024-03-28 20:40 UTC|newest]

Thread overview: 28+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-03-18 23:07 [PATCH 0/2] timers: More fixes Frederic Weisbecker
2024-03-18 23:07 ` [PATCH 1/2] timers/migration: Fix endless timer requeue after idle interrupts Frederic Weisbecker
2024-03-21 11:24   ` [tip: timers/urgent] " tip-bot2 for Frederic Weisbecker
2024-03-18 23:07 ` [PATCH 2/2] timers: Fix removed self-IPI on global timer's enqueue in nohz_full Frederic Weisbecker
2024-03-19  9:18   ` Paul E. McKenney
2024-03-20 11:14     ` Paul E. McKenney
2024-03-20 16:15       ` Frederic Weisbecker
2024-03-20 22:55         ` Paul E. McKenney
2024-03-21 11:42           ` Frederic Weisbecker
2024-03-21 12:47             ` Paul E. McKenney
2024-03-22 11:32               ` Frederic Weisbecker
2024-03-22 13:22                 ` for_each_domain()/sched_domain_span() has offline CPUs (was Re: [PATCH 2/2] timers: Fix removed self-IPI on global timer's enqueue in nohz_full) Frederic Weisbecker
2024-03-26 16:46                   ` Valentin Schneider
2024-03-27 12:42                     ` Frederic Weisbecker
2024-03-27 14:28                       ` Valentin Schneider
2024-03-28 14:08                         ` Valentin Schneider
2024-03-28 16:58                           ` Frederic Weisbecker
2024-03-28 20:31                             ` Valentin Schneider
2024-03-27 20:42                     ` Thomas Gleixner
2024-03-28 20:39                       ` Valentin Schneider [this message]
2024-03-29  2:08                         ` Tejun Heo
2024-03-29 17:06                           ` Waiman Long
2024-04-01 21:26               ` [PATCH 2/2] timers: Fix removed self-IPI on global timer's enqueue in nohz_full Paul E. McKenney
2024-04-01 21:56                 ` Frederic Weisbecker
2024-04-02  0:04                   ` Paul E. McKenney
2024-04-02 16:47                     ` Paul E. McKenney
2024-04-03 18:05                       ` Paul E. McKenney
2024-03-21 11:24   ` [tip: timers/urgent] " tip-bot2 for Frederic Weisbecker

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=xhsmhle62ay5f.mognet@vschneid-thinkpadt14sgen2i.remote.csb \
    --to=vschneid@redhat.com \
    --cc=alexs@kernel.org \
    --cc=anna-maria@linutronix.de \
    --cc=frederic@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@kernel.org \
    --cc=paulmck@kernel.org \
    --cc=peterz@infradead.org \
    --cc=song.bao.hua@hisilicon.com \
    --cc=tglx@linutronix.de \
    --cc=tj@kernel.org \
    --cc=vincent.guittot@linaro.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.