linux-kernel.vger.kernel.org archive mirror
* Unbounded priority inversion while assigning tasks into cgroups.
@ 2021-10-25  9:43 Ronny Meeus
  2021-10-27 16:58 ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 5+ messages in thread
From: Ronny Meeus @ 2021-10-25  9:43 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-rt-users

Hello

An unbounded priority inversion is observed when moving tasks into cgroups.
In my case I'm using the cpu and cpuacct cgroups, but the issue is
independent of this.

Kernel version: 4.9.79
CPU: Dual core Cavium Octeon (MIPS)
Kernel configured with CONFIG_PREEMPT=y

I have a small application running at RT priority 92.
Its job is to move high-CPU-consuming applications into a cgroup when
the system is under high load.
Under extreme load conditions (meaning a lot of script processing
(process clone / exec / exit) and high application load), the
application sometimes hangs for a long time (typically a couple of
seconds, but hangs of up to 2 minutes have also been observed).
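
For reference, the move itself is just a write of the task's PID to the
cgroup's cgroup.procs file; a minimal sketch (the cgroup path
"/sys/fs/cgroup/cpu/limited" is only an example, not our real setup):

#include <stdio.h>
#include <sys/types.h>

/* Move one task into the (example) "limited" cpu cgroup. */
static int move_to_cgroup(pid_t pid)
{
	FILE *f = fopen("/sys/fs/cgroup/cpu/limited/cgroup.procs", "w");
	int ret = -1;

	if (f) {
		/* The kernel side of this write is __cgroup_procs_write(). */
		if (fprintf(f, "%d\n", (int)pid) > 0)
			ret = 0;
		fclose(f);
	}
	return ret;
}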

Extending the kernel with traces (see below) showed that the root
cause of the blocking is the global rwsem "cgroup_threadgroup_rwsem".
While adding a task to a cgroup (__cgroup_procs_write), the write lock
is taken, which has to wait until all other writers and readers have
completed their critical sections. This can take very long, especially
since many of them run at a much lower priority and we also have
applications running at medium priority under a very high load.

As an initial attempt I tried applying the RT patch, but this did not
resolve the issue.

The second attempt was to replace the cgroup_threadgroup_rwsem with an
rt_mutex (which offers priority inheritance).
After this change the issue seems to be resolved.
A disadvantage of this approach is that all accesses to the critical
section are serialized across all cores (writes to assign tasks to
cgroups and reads for process clone/exec/exit).
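
A rough sketch of that experiment (paraphrased, not the exact patch; it
assumes the v4.9 helpers cgroup_threadgroup_change_begin()/end(), which
normally just wrap the read side of the percpu rwsem):

#include <linux/rtmutex.h>

static DEFINE_RT_MUTEX(cgroup_threadgroup_lock);

static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk)
{
	/* was: percpu_down_read(&cgroup_threadgroup_rwsem); */
	rt_mutex_lock(&cgroup_threadgroup_lock);
}

static inline void cgroup_threadgroup_change_end(struct task_struct *tsk)
{
	/* was: percpu_up_read(&cgroup_threadgroup_rwsem); */
	rt_mutex_unlock(&cgroup_threadgroup_lock);
}

/*
 * The write side in __cgroup_procs_write() likewise takes
 * rt_mutex_lock()/rt_mutex_unlock() on the same mutex instead of
 * percpu_down_write()/percpu_up_write().
 */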

For the moment I do not see any other alternative to resolve this problem.
Any advice on the right way forward would be appreciated.

Best regards,
Ronny


Relevant part of the instrumented code in __cgroup_procs_write():

trace_cgroup_lock(1000);
percpu_down_write(&cgroup_threadgroup_rwsem);
trace_cgroup_lock(2000);
rcu_read_lock();
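/*
 * idx=1000 fires right before percpu_down_write() and idx=2000 right
 * after it, so the gap between those two markers is the time spent
 * inside percpu_down_write(), i.e. waiting for the write lock.
 */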

A normal trace looks like:
resource_monito-18855 [001] ....  2685.097016: cgroup_lock: idx=2
resource_monito-18855 [001] ....  2685.097017: cgroup_lock: idx=1000
resource_monito-18855 [001] ....  2685.097018: cgroup_lock: idx=2000
resource_monito-18855 [001] ....  2685.097018: cgroup_lock: idx=101

A trace of a blocked application looks like:
resource_monito-18855 [001] ....  2689.736364: cgroup_lock: idx=2
resource_monito-18855 [001] ....  2689.736365: cgroup_lock: idx=1000
resource_monito-18855 [001] ....  2693.780339: cgroup_lock: idx=2000
resource_monito-18855 [001] ....  2693.780339: cgroup_lock: idx=101

In the problematic case above, the resource_monitor application was
blocked for about 4 seconds waiting for the write lock on
cgroup_threadgroup_rwsem.


* Re: Unbounded priority inversion while assigning tasks into cgroups.
  2021-10-25  9:43 Unbounded priority inversion while assigning tasks into cgroups Ronny Meeus
@ 2021-10-27 16:58 ` Sebastian Andrzej Siewior
       [not found]   ` <CAMJ=MEfkQ9VaphaNS_qbWMOANo7P6h2Ln6iYg4JLWbWzxp85mA@mail.gmail.com>
  0 siblings, 1 reply; 5+ messages in thread
From: Sebastian Andrzej Siewior @ 2021-10-27 16:58 UTC (permalink / raw)
  To: Ronny Meeus; +Cc: linux-kernel, linux-rt-users

On 2021-10-25 11:43:52 [+0200], Ronny Meeus wrote:
> Hello
Hi,

> An unbounded priority inversion is observed when moving tasks into cgroups.
> In my case I'm using the cpu and cpuacct cgroups, but the issue is
> independent of this.
> 
> Kernel version: 4.9.79
> CPU: Dual core Cavium Octeon (MIPS)
> Kernel configured with CONFIG_PREEMPT=y
> 
> I have a small application running at RT priority 92.
> Its job is to move high-CPU-consuming applications into a cgroup when
> the system is under high load.
> Under extreme load conditions (meaning a lot of script processing
> (process clone / exec / exit) and high application load), the
> application sometimes hangs for a long time (typically a couple of
> seconds, but hangs of up to 2 minutes have also been observed).
> 
> Extending the kernel with traces (see below) showed that the root
> cause of the blocking is the global rwsem "cgroup_threadgroup_rwsem".
> While adding a task to a cgroup (__cgroup_procs_write), the write lock
> is taken, which has to wait until all other writers and readers have
> completed their critical sections. This can take very long, especially
> since many of them run at a much lower priority and we also have
> applications running at medium priority under a very high load.
> 
> As an initial attempt I tried applying the RT patch, but this did not
> resolve the issue.
> 
> The second attempt was to replace the cgroup_threadgroup_rwsem with an
> rt_mutex (which offers priority inheritance).
> After this change the issue seems to be resolved.
> A disadvantage of this approach is that all accesses to the critical
> section are serialized across all cores (writes to assign tasks to
> cgroups and reads for process clone/exec/exit).
> 
> For the moment I do not see any other alternative to resolve this problem.
> Any advice on the right way forward would be appreciated.

From looking at the percpu_rw_semaphore implementation, no new readers
are allowed as long as there is a writer pending. The writer
(unfortunately) has to wait until all readers are out. But then I doubt
that it takes up to two minutes for all existing readers to leave the
critical section.
Looking at v4.9.84, at least the RT implementation of rw_semaphore
allows new readers if a writer is pending. So this could be the culprit,
as you would have to wait until all readers are gone and the writer
needs to grab the lock before another reader shows up. But this
shouldn't be the case for the generic implementation, where new readers
should wait until the writer got its chance.

> Best regards,
> Ronny

Sebastian


* Re: Unbounded priority inversion while assigning tasks into cgroups.
       [not found]   ` <CAMJ=MEfkQ9VaphaNS_qbWMOANo7P6h2Ln6iYg4JLWbWzxp85mA@mail.gmail.com>
@ 2021-10-28  8:46     ` Sebastian Andrzej Siewior
  2021-10-29  9:42       ` Ronny Meeus
  0 siblings, 1 reply; 5+ messages in thread
From: Sebastian Andrzej Siewior @ 2021-10-28  8:46 UTC (permalink / raw)
  To: Ronny Meeus; +Cc: linux-kernel, linux-rt-users

On 2021-10-27 22:54:33 [+0200], Ronny Meeus wrote:
> > From looking at the percpu_rw_semaphore implementation, no new readers
> > are allowed as long as there is a writer pending. The writer
> > (unfortunately) has to wait until all readers are out. But then I doubt
> > that it takes up to two minutes for all existing readers to leave the
> > critical section.
> >
> 
> The readers can be running at low priority while other threads with
> medium priority consume the complete CPU. So the low-prio readers are
> just waiting to be scheduled and thereby also block the high-prio
> thread.

Hmm. So you have, say, 5 readers stuck in the RW semaphore while
preempted by medium-prio tasks, and the high-prio writer is then stuck
on the semaphore, waiting for the MED tasks to finish so the low-prio
threads can leave the critical section?

> > Looking at v4.9.84, at least the RT implementation of rw_semaphore
> > allows new readers if a writer is pending. So this could be the culprit,
> > as you would have to wait until all readers are gone and the writer
> > needs to grab the lock before another reader shows up. But this
> > shouldn't be the case for the generic implementation, where new readers
> > should wait until the writer got its chance.
> >
> 
> So what do you suggest for the v4.9 kernel as a solution? Move to the RT
> version of the rw_semaphore and hope for the best?

I don't think it will help. Based on what you wrote above it appears
that the problem is that the readers are preempted and are not leaving
the critical section soon enough.

How many CPUs do you have? Maybe using an rtmutex here and allowing
only one reader at a time isn't that bad in your case. With one CPU,
for instance, there isn't much room for multiple readers I guess.

Sebastian


* Re: Unbounded priority inversion while assigning tasks into cgroups.
  2021-10-28  8:46     ` Sebastian Andrzej Siewior
@ 2021-10-29  9:42       ` Ronny Meeus
  2021-10-29 16:48         ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 5+ messages in thread
From: Ronny Meeus @ 2021-10-29  9:42 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior; +Cc: linux-kernel, linux-rt-users

On Thu, 28 Oct 2021 at 10:46, Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> On 2021-10-27 22:54:33 [+0200], Ronny Meeus wrote:
> > > From looking at the percpu_rw_semaphore implementation, no new readers
> > > are allowed as long as there is a writer pending. The writer
> > > (unfortunately) has to wait until all readers are out. But then I doubt
> > > that it takes up to two minutes for all existing readers to leave the
> > > critical section.
> > >
> >
> > The readers can be running at low priority while other threads with
> > medium priority consume the complete CPU. So the low-prio readers are
> > just waiting to be scheduled and thereby also block the high-prio
> > thread.
>
> Hmm. So you have, say, 5 readers stuck in the RW semaphore while
> preempted by medium-prio tasks, and the high-prio writer is then stuck
> on the semaphore, waiting for the MED tasks to finish so the low-prio
> threads can leave the critical section?

Correct. Note that 1 thread stuck on the read side is already
sufficient to get into this.
Most of the heavy processing is done at medium priority and the
background tasks are running at low priority.
Since the background tasks are implemented as scripts, a lot of
accesses to the read side are done at low prio.
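
To make the scenario concrete, the shape of the workload is roughly the
following (illustrative sketch only, not our actual code; the
priorities, the cgroup path and the CPU pinning are made-up examples):

/* Low-prio fork churn (readers), a medium-prio CPU hog and a high-prio
 * "monitor" writing to cgroup.procs, all pinned to CPU 0. Needs root. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static void set_fifo(int prio)
{
	struct sched_param sp = { .sched_priority = prio };

	if (sched_setscheduler(0, SCHED_FIFO, &sp))
		perror("sched_setscheduler");
}

int main(void)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(0, &set);
	sched_setaffinity(0, sizeof(set), &set);  /* everything on CPU 0 */

	if (fork() == 0) {  /* low prio: script-like clone/exit churn */
		set_fifo(1);
		for (;;) {
			if (fork() == 0)  /* fork/exit take the rwsem read side */
				_exit(0);
			wait(NULL);
		}
	}
	if (fork() == 0) {  /* medium prio: busy load preempting the forker */
		set_fifo(50);
		for (;;)
			;
	}

	set_fifo(92);  /* high prio: the resource_monitor role */
	for (;;) {
		/* Example path; the write ends up in __cgroup_procs_write()
		 * and needs the rwsem write side. */
		FILE *f = fopen("/sys/fs/cgroup/cpu/limited/cgroup.procs", "w");

		if (f) {
			fprintf(f, "%d\n", (int)getpid());
			fclose(f);
		}
		usleep(100 * 1000);
	}
	return 0;
}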

> > > Looking at v4.9.84, at least the RT implementation of rw_semaphore
> > > allows new readers if a writer is pending. So this could be the culprit,
> > > as you would have to wait until all readers are gone and the writer
> > > needs to grab the lock before another reader shows up. But this
> > > shouldn't be the case for the generic implementation, where new readers
> > > should wait until the writer got its chance.
> > >
> >
> > So what do you suggest for the v4.9 kernel as a solution? Move to the RT
> > version of the rw_semaphore and hope for the best?
>
> I don't think it will help. Based on what you wrote above it appears
> that the problem is that the readers are preempted and are not leaving
> the critical section soon enough.
>
> How many CPUs do you have? Maybe using an rtmutex here and allowing
> only one reader at a time isn't that bad in your case. With one CPU,
> for instance, there isn't much room for multiple readers I guess.
>

The current system has 1 CPU with 2 cores, but we also have devices
with 14 cores, on which the impact will of course be bigger.
Note that with the rtmutex solution all accesses (read + write) will
be serialized.

I wonder why other people do not see this issue since it is present in
all kernel versions.
And, especially in systems with strict deadlines, I consider this a
serious issue.

Ronny

> Sebastian


* Re: Unbounded priority inversion while assigning tasks into cgroups.
  2021-10-29  9:42       ` Ronny Meeus
@ 2021-10-29 16:48         ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 5+ messages in thread
From: Sebastian Andrzej Siewior @ 2021-10-29 16:48 UTC (permalink / raw)
  To: Ronny Meeus; +Cc: linux-kernel, linux-rt-users, cgroups

On 2021-10-29 11:42:25 [+0200], Ronny Meeus wrote:
> On Thu, 28 Oct 2021 at 10:46, Sebastian Andrzej Siewior
> <bigeasy@linutronix.de> wrote:
> >
> > On 2021-10-27 22:54:33 [+0200], Ronny Meeus wrote:
> > > > From looking at the percpu_rw_semaphore implementation, no new readers
> > > > are allowed as long as there is a writer pending. The writer
> > > > (unfortunately) has to wait until all readers are out. But then I doubt
> > > > that it takes up to two minutes for all existing readers to leave the
> > > > critical section.
> > > >
> > >
> > > The readers can be running at low priority while other threads with
> > > medium priority consume the complete CPU. So the low-prio readers are
> > > just waiting to be scheduled and thereby also block the high-prio
> > > thread.
> >
> > Hmm. So you have, say, 5 readers stuck in the RW semaphore while
> > preempted by medium-prio tasks, and the high-prio writer is then stuck
> > on the semaphore, waiting for the MED tasks to finish so the low-prio
> > threads can leave the critical section?
> 
> Correct. Note that 1 thread stuck on the read side is already
> sufficient to get into this.
> Most of the heavy processing is done at medium priority and the
> background tasks are running at low priority.
> Since the background tasks are implemented as scripts, a lot of
> accesses to the read side are done at low prio.

Yeah, one is enough. My guess would be that it is more visible on the
small ones because on the bigger ones it is more likely that the thread
gets migrated to another core.

> > > > Looking at v4.9.84, at least the RT implementation of rw_semaphore
> > > > allows new readers if a writer is pending. So this could be the culprit,
> > > > as you would have to wait until all readers are gone and the writer
> > > > needs to grab the lock before another reader shows up. But this
> > > > shouldn't be the case for the generic implementation, where new readers
> > > > should wait until the writer got its chance.
> > > >
> > >
> > > So what do you suggest for the v4.9 kernel as a solution? Move to the RT
> > > version of the rw_semaphore and hope for the best?
> >
> > I don't think it will help. Based on what you wrote above it appears
> > that the problem is that the readers are preempted and are not leaving
> > the critical section soon enough.
> >
> > How many CPUs do you have? Maybe using an rtmutex here and allowing
> > only one reader at a time isn't that bad in your case. With one CPU,
> > for instance, there isn't much room for multiple readers I guess.
> >
> 
> The current system has 1 CPU with 2 cores, but we also have devices
> with 14 cores, on which the impact will of course be bigger.
> Note that with the rtmutex solution all accesses (read + write) will
> be serialized.

So for the 1-CPU/2-core system it should make no difference if you use
an rtmutex instead. For the bigger ones it might not be optimal.

> I wonder why other people do not see this issue since it is present in
> all kernel versions.
> And, especially in systems with strict deadlines, I consider this a
> serious issue.

My guess here is that most people don't use RT priorities and don't see
this problem _or_ they have enough cores. Or they simply don't use it
that way.

From the PREEMPT_RT perspective, rwsem/rwlock used to be single-reader
until the late 3.x or early 4.x series (I don't remember exactly when
it changed). Boosting multiple readers was tried once but didn't really
work.
PREEMPT_RT also has this problem where multiple low-prio readers can
block the high-prio writer, but fortunately most critical rwsem users
moved to RCU so it is not much of a problem.

> Ronny
 
Sebastian

