Re: Cgroups "pids" controller does not update "pids.current" count immediately

linux-kernel.vger.kernel.org archive mirror

* Re: Cgroups "pids" controller does not update "pids.current" count immediately
       [not found] <77af3805-e912-2664-f347-e30c0919d0c4@icdsoft.com>
@ 2018-06-14 17:26 ` Aleksa Sarai
  2018-06-14 17:27   ` Aleksa Sarai
       [not found] ` <20180614150650.GU1351649@devbig577.frc2.facebook.com>
  1 sibling, 1 reply; 9+ messages in thread
From: Aleksa Sarai @ 2018-06-14 17:26 UTC (permalink / raw)
  To: Ivan Zahariev; +Cc: cgroups, linux-kernel

On 2018-06-14, Ivan Zahariev <famzah@icdsoft.com> wrote:
> I posted a kernel bug about this a month ago but it did not receive any
> attention: https://bugzilla.kernel.org/show_bug.cgi?id=199713

I believe that very few people watch the kernel bugzilla -- it's almost
always better to send a mail to LKML (speaking of which, you should
always include <linux-kernel@vger.kernel.org> in Cc).

> I've tested this on 4.14.27 and 4.4.0-124-generic Ubuntu.
> 
> If I start a couple of processes which exit very quickly (like a simple Bash
> script with many commands in it), the reported value in "pids.current" is
> not updated immediately when processes exit. This leads to too many
> processes incorrectly accounted in "pids.current" which hits the "pids.max"
> prematurely.

One possible reason for this might be related to zombie processes.
cgroup.procs doesn't include any zombie processes (tasks are removed
when they exit(2)), but the pids controller does track zombies (tasks
are removed when the 'struct task' is put'd). This could explain why
there's a discrepancy which clears itself up after a short period of
time -- though I am not sure that your reproducer will actually produce
zombies (I only took a quick look at it).
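
To make the discrepancy above concrete, here is a minimal sketch that compares "pids.current" with the number of IDs listed for the same cgroup. The helper names and the `listing` parameter are hypothetical, and the cgroup directory is whatever path your hierarchy is mounted at; treat it as an illustration rather than a diagnostic tool.

```python
from pathlib import Path


def read_int(path):
    """Return the integer stored in a single-value cgroup file."""
    return int(Path(path).read_text().strip())


def count_listed_ids(path):
    """Count the non-empty lines (one ID per line) in a cgroup listing file."""
    return sum(1 for line in Path(path).read_text().splitlines() if line.strip())


def pids_discrepancy(cgroup_dir, listing="cgroup.procs"):
    """Return (pids.current, listed count, difference) for one cgroup.

    `listing` is an assumption: "cgroup.procs" lists processes, while the
    pids controller also charges threads, so a per-thread listing
    ("tasks" on cgroup v1, "cgroup.threads" on v2) may be a fairer baseline.
    """
    cgroup_dir = Path(cgroup_dir)
    current = read_int(cgroup_dir / "pids.current")
    listed = count_listed_ids(cgroup_dir / listing)
    return current, listed, current - listed
```

A positive difference that shrinks back towards zero shortly after the processes exit would be consistent with the delayed uncharge described above. The two files are read one after the other, so the result is only a rough snapshot.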

> The "memory" controller, for example, works as expected and does not suffer
> from this asynchronous lag.

I'm not sure what makes the memory controller and the pids controller
comparable in this aspect -- there is no "pids.current" for the memory
controller.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

* Re: Cgroups "pids" controller does not update "pids.current" count immediately
  2018-06-14 17:26 ` Cgroups "pids" controller does not update "pids.current" count immediately Aleksa Sarai
@ 2018-06-14 17:27   ` Aleksa Sarai
  0 siblings, 0 replies; 9+ messages in thread
From: Aleksa Sarai @ 2018-06-14 17:27 UTC (permalink / raw)
  To: Ivan Zahariev; +Cc: cgroups, linux-kernel

On 2018-06-15, Aleksa Sarai <asarai@suse.de> wrote:
> > I've tested this on 4.14.27 and 4.4.0-124-generic Ubuntu.
> > 
> > If I start a couple of processes which exit very quickly (like a simple Bash
> > script with many commands in it), the reported value in "pids.current" is
> > not updated immediately when processes exit. This leads to too many
> > processes incorrectly accounted in "pids.current" which hits the "pids.max"
> > prematurely.
> 
> One possible reason for this might be related to zombie processes.
> cgroup.procs doesn't include any zombie processes (tasks are removed
> when they exit(2)), but the pids controller does track zombies (tasks
> are removed when the 'struct task' is put'd). This could explain why
> there's a discrepancy which clears itself up after a short period of
> time -- though I am not sure that your reproducer will actually produce
> zombies (I only took a quick look at it).

Scratch that -- it can happen even without zombies. Basically it just
depends on when the 'task struct' is freed (which could happen
arbitrarily later than the process exit(2)-ed).
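
One way to observe how long the counter takes to settle is to poll it after the processes have exited. The sketch below is only an illustration: `wait_for_drain` and its parameters are hypothetical names, and `read_current` is any callable that returns the current count (for example, one that reads "pids.current" from your cgroup directory).

```python
import time


def wait_for_drain(read_current, expected, timeout=5.0, interval=0.01):
    """Poll read_current() until it is <= expected.

    Returns the number of seconds waited, or None if the timeout expired first.
    """
    start = time.monotonic()
    while True:
        if read_current() <= expected:
            return time.monotonic() - start
        if time.monotonic() - start >= timeout:
            return None
        time.sleep(interval)
```

The measured time includes the polling interval, so it is an upper bound on the lag rather than an exact figure.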

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

* Re: Cgroups "pids" controller does not update "pids.current" count immediately
       [not found] ` <20180614150650.GU1351649@devbig577.frc2.facebook.com>
@ 2018-06-15 14:26   ` Ivan Zahariev
  2018-06-15 15:41     ` Tejun Heo
  0 siblings, 1 reply; 9+ messages in thread
From: Ivan Zahariev @ 2018-06-15 14:26 UTC (permalink / raw)
  To: Tejun Heo; +Cc: cgroups, linux-kernel

On 14.6.2018 at 18:06, Tejun Heo wrote:
> On Thu, Jun 14, 2018 at 02:56:00PM +0300, Ivan Zahariev wrote:
>> I posted a kernel bug about this a month ago but it did not receive
>> any attention: https://bugzilla.kernel.org/show_bug.cgi?id=199713
>>
>> Here is a copy of the bug report and I hope that this is the correct
>> place to discuss this:
> Well, for now at least, that's the expected behavior.  It's not
> supposed to be able to account all changes immediately (the kernel
> doesn't free a lot of things immediately for performance and other
> reasons).  The intended use is setting up a reasonable upperbound with
> some buffer space.

If that's by design, it's a bit disappointing and at least the docs 
should mention it.

The standard RLIMIT_NPROC does not suffer from such accounting 
discrepancies at any time. The "memory" cgroups controller also does not 
suffer from any discrepancies -- it accounts memory usage in real time 
without any lag on process start or exit. The "tasks" file list is also 
always up-to-date.

Is it really technically not possible to make "pids.current" do 
accounting properly like RLIMIT_NPROC does? We were hoping to replace 
RLIMIT_NPROC with the "pids" controller.

--Ivan

* Re: Cgroups "pids" controller does not update "pids.current" count immediately
  2018-06-15 14:26   ` Ivan Zahariev
@ 2018-06-15 15:41     ` Tejun Heo
  2018-06-15 16:07       ` Ivan Zahariev
  0 siblings, 1 reply; 9+ messages in thread
From: Tejun Heo @ 2018-06-15 15:41 UTC (permalink / raw)
  To: Ivan Zahariev; +Cc: cgroups, linux-kernel

Hello,

On Fri, Jun 15, 2018 at 05:26:04PM +0300, Ivan Zahariev wrote:
> The standard RLIMIT_NPROC does not suffer from such accounting
> discrepancies at any time.

RLIMIT_NPROC uses a dedicated atomic counter which is updated when the
process is getting reaped; however, that doesn't actually coincide
with the pid being freed.  The base pid ref is put then but there can
be other refs and even after that it has to go through RCU grace
period to be actually freed.

They seem equivalent but serve a bit different purposes.  RLIMIT_NPROC
is primarily about limiting what the user can do and doesn't guarantee
that that actually matches resource (pid here) consumption.  pid
controller's primary role is limiting pid consumption - ie. no matter
what happens the cgroup must not be able to take away more than the
specified number from the available pool, which has to account for the
lazy release and draining refs and stuff.
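
The difference between the two counters can be pictured with a toy model. This is not kernel code: the class and its method names are invented for illustration, and `deferred_release` merely stands in for "the last reference is dropped and the RCU grace period has elapsed".

```python
class ToyAccounting:
    """Toy model only: two counters that are decremented at different times."""

    def __init__(self):
        self.nproc = 0          # models the RLIMIT_NPROC-style counter
        self.pids_current = 0   # models the pids controller charge
        self.pending_free = 0   # reaped tasks whose final release is deferred

    def fork(self):
        # Both counters are charged when a task is created.
        self.nproc += 1
        self.pids_current += 1

    def reap(self):
        # Reaping drops the per-user count immediately...
        self.nproc -= 1
        # ...but the pid charge stays until the deferred release runs.
        self.pending_free += 1

    def deferred_release(self):
        # Uncharge every task that has been reaped so far.
        self.pids_current -= self.pending_free
        self.pending_free = 0
```

In this model, a burst of fork/reap pairs leaves `nproc` at zero while `pids_current` keeps growing until `deferred_release` runs, which is the shape of the lag being discussed.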

> The "memory" cgroups controller also does
> not suffer from any discrepancies -- it accounts memory usage in
> real time without any lag on process start or exit. The "tasks" file
> list is also always up-to-date.

The memory controller does the same thing, actually way more
extensively.  It's just less noticeable because people generally don't
try to control at individual page level.

> Is it really technically not possible to make "pids.current" do
> accounting properly like RLIMIT_NPROC does? We were hoping to
> replace RLIMIT_NPROC with the "pids" controller.

It is of course possible but at a cost.  The cost (getting rid of lazy
release optimizations) is just not justifiable for most cases.

Thanks.

-- 
tejun

* Re: Cgroups "pids" controller does not update "pids.current" count immediately
  2018-06-15 15:41     ` Tejun Heo
@ 2018-06-15 16:07       ` Ivan Zahariev
  2018-06-15 16:16         ` Tejun Heo
  0 siblings, 1 reply; 9+ messages in thread
From: Ivan Zahariev @ 2018-06-15 16:07 UTC (permalink / raw)
  To: Tejun Heo; +Cc: cgroups, linux-kernel

Hi,

Thank you for the quick and insightful reply. I have one suggestion below:

On 15.6.2018 at 18:41, Tejun Heo wrote:
> On Fri, Jun 15, 2018 at 05:26:04PM +0300, Ivan Zahariev wrote:
>> The standard RLIMIT_NPROC does not suffer from such accounting
>> discrepancies at any time.
> They seem equivalent but serve a bit different purposes.  RLIMIT_NPROC
> is primarily about limiting what the user can do and doesn't guarantee
> that that actually matches resource (pid here) consumption.
>
>> Is it really technically not possible to make "pids.current" do
>> accounting properly like RLIMIT_NPROC does? We were hoping to
>> replace RLIMIT_NPROC with the "pids" controller.
> It is of course possible but at a cost.  The cost (getting rid of lazy
> release optimizations) is just not justifiable for most cases.

I understand all concerns and design decisions. However, having 
RLIMIT_NPROC support combined with "cgroups" hierarchy would be very handy.

Would it make sense to introduce "nproc.current" and "nproc.max" 
metrics that work in the same atomic, real-time way as RLIMIT_NPROC? 
Or to implement this in a new "nproc" controller?

--
Ivan
--

* Re: Cgroups "pids" controller does not update "pids.current" count immediately
  2018-06-15 16:07       ` Ivan Zahariev
@ 2018-06-15 16:16         ` Tejun Heo
  2018-06-15 17:40           ` Ivan Zahariev
  0 siblings, 1 reply; 9+ messages in thread
From: Tejun Heo @ 2018-06-15 16:16 UTC (permalink / raw)
  To: Ivan Zahariev; +Cc: cgroups, linux-kernel

Hello,

On Fri, Jun 15, 2018 at 07:07:27PM +0300, Ivan Zahariev wrote:
> I understand all concerns and design decisions. However, having
> RLIMIT_NPROC support combined with "cgroups" hierarchy would be very
> handy.
> 
> Would it make sense to introduce "nproc.current" and "nproc.max"
> metrics that work in the same atomic, real-time way as
> RLIMIT_NPROC? Or to implement this in a new "nproc" controller?

I'm skeptical for two reasons.

1. That doesn't sound much like a resource control problem but more of
   a policy enforcement problem.

2. and it's difficult to see why such policies would need to be that
   strict.  Where is the requirement coming from?

Thanks.

-- 
tejun

* Re: Cgroups "pids" controller does not update "pids.current" count immediately
  2018-06-15 16:16         ` Tejun Heo
@ 2018-06-15 17:40           ` Ivan Zahariev
  2018-06-15 19:07             ` Tejun Heo
  0 siblings, 1 reply; 9+ messages in thread
From: Ivan Zahariev @ 2018-06-15 17:40 UTC (permalink / raw)
  To: Tejun Heo; +Cc: cgroups, linux-kernel

Hello,

On 15.6.2018 at 19:16, Tejun Heo wrote:
> On Fri, Jun 15, 2018 at 07:07:27PM +0300, Ivan Zahariev wrote:
>> I understand all concerns and design decisions. However, having
>> RLIMIT_NPROC support combined with "cgroups" hierarchy would be very
>> handy.
>>
>> Would it make sense to introduce "nproc.current" and "nproc.max"
>> metrics that work in the same atomic, real-time way as
>> RLIMIT_NPROC? Or to implement this in a new "nproc" controller?
> I'm skeptical for two reasons.
>
> 1. That doesn't sound much like a resource control problem but more of
>     a policy enforcement problem.
>
> 2. and it's difficult to see why such policies would need to be that
>     strict.  Where is the requirement coming from?
>

Lazy pids accounting combined with modern fast CPUs makes the 
"pids.current" metric practically unusable for resource limiting in our 
case. In a test where we started and ended one single process very 
quickly, we saw "pids.current" reach up to 185 (while the correct value 
at all times is either 0 or 1). If we want a "cgroup" to be able to 
spawn at most 50 processes, we have to use some high value like 300 for 
"pids.max" in order to compensate for the pids uncharge lag (and this 
depends on the speed of the CPU and how busy the system is).
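
As a rough sketch of the padding arithmetic above: the helper names are hypothetical, and simply adding the observed peak lag to the wanted limit is an assumption, not a recommended formula.

```python
from pathlib import Path


def padded_pids_max(wanted, observed_peak_lag):
    """Return a pids.max value that leaves room for not-yet-uncharged pids."""
    if wanted < 0 or observed_peak_lag < 0:
        raise ValueError("both values must be non-negative")
    return wanted + observed_peak_lag


def write_pids_max(cgroup_dir, limit):
    """Write the limit to <cgroup_dir>/pids.max (needs write access to the cgroup)."""
    (Path(cgroup_dir) / "pids.max").write_text(f"{limit}\n")
```

With the numbers from this message, `padded_pids_max(50, 185)` gives 235. Rounding that up to something like 300 adds further margin, since the lag depends on CPU speed and system load.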

Our use case is a shared web hosting service. Our customers start a 
CGI process for each PHP web request, so processes start and exit at a 
very high rate. We don't want customers to be able to launch too many 
CGI processes (NPROC limit) because this exhausts the web & database 
servers and probably also ties up Linux kernel resources (like total 
"opened files" per user). Furthermore, some users are malicious and 
launch fork bombs and other resource-exhaustion attacks.

You may be right that we enforce a policy rather than resource control. 
This has worked for us for 15+ years now. The motivation is that a 
global RLIMIT_NPROC easily lets us limit all system and Linux kernel 
resources "per customer" ("cgroups" allows us to limit only certain 
system resources). Additionally, not all user-space daemons allow for a 
granular "per user" limit or proper grouping (for example, MySQL has 
only users and no "per customer" group support). Now we want to have 
different "cgroups" hierarchies for a customer (SSH, CGI, Crond), each 
with its own RLIMIT_NPROC, and a total RLIMIT_NPROC for the parent 
"per customer" cgroup.

Excuse me for the lengthy post :-)

--
Ivan

* Re: Cgroups "pids" controller does not update "pids.current" count immediately
  2018-06-15 17:40           ` Ivan Zahariev
@ 2018-06-15 19:07             ` Tejun Heo
  2018-06-15 19:38               ` Ivan Zahariev
  0 siblings, 1 reply; 9+ messages in thread
From: Tejun Heo @ 2018-06-15 19:07 UTC (permalink / raw)
  To: Ivan Zahariev; +Cc: cgroups, linux-kernel

Hello, Ivan.

On Fri, Jun 15, 2018 at 08:40:02PM +0300, Ivan Zahariev wrote:
> Lazy pids accounting combined with modern fast CPUs makes the
> "pids.current" metric practically unusable for resource limiting in
> our case. In a test where we started and ended one single process
> very quickly, we saw "pids.current" reach up to 185 (while the
> correct value at all times is either 0 or 1). If we want a "cgroup"
> to be able to spawn at most 50 processes, we have to use some high
> value like 300 for "pids.max" in order to compensate for the pids
> uncharge lag (and this depends on the speed of the CPU and how busy
> the system is).

Yeah, that actually makes a lot of sense.  We can't keep everything
synchronous for obvious performance reasons but we definitely can wait
for RCU grace period before failing.  Forking might become a bit
slower while pids are draining but shouldn't fail and that shouldn't
incur any performance overhead in normal conditions when pids aren't
constrained.
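
Until something like that exists in the kernel, a similar effect can be approximated from user space by retrying a failed spawn after a short pause. The sketch below is not the in-kernel wait described above. It assumes that hitting "pids.max" makes fork() fail with EAGAIN, and `spawn_with_retry` and its parameters are hypothetical names.

```python
import errno
import time


def spawn_with_retry(spawn, attempts=5, delay=0.01):
    """Call spawn(); on EAGAIN wait briefly and retry, up to `attempts` times."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return spawn()
        except OSError as exc:
            # Re-raise anything that is not EAGAIN, and EAGAIN on the last attempt.
            if exc.errno != errno.EAGAIN or attempt == attempts - 1:
                raise
            # Give pending pid uncharges a moment to drain before retrying.
            time.sleep(delay)
```

Here `spawn` would be something like `lambda: subprocess.Popen(cmd)`. Retrying only hides the lag; a cgroup that is genuinely at its limit still fails once the attempts run out.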

Thanks.

-- 
tejun

* Re: Cgroups "pids" controller does not update "pids.current" count immediately
  2018-06-15 19:07             ` Tejun Heo
@ 2018-06-15 19:38               ` Ivan Zahariev
  0 siblings, 0 replies; 9+ messages in thread
From: Ivan Zahariev @ 2018-06-15 19:38 UTC (permalink / raw)
  To: Tejun Heo; +Cc: cgroups, linux-kernel

Hello,


On 15.6.2018 at 22:07, Tejun Heo wrote:
> On Fri, Jun 15, 2018 at 08:40:02PM +0300, Ivan Zahariev wrote:
>> Lazy pids accounting combined with modern fast CPUs makes the
>> "pids.current" metric practically unusable for resource limiting in
>> our case. In a test where we started and ended one single process
>> very quickly, we saw "pids.current" reach up to 185 (while the
>> correct value at all times is either 0 or 1). If we want a "cgroup"
>> to be able to spawn at most 50 processes, we have to use some high
>> value like 300 for "pids.max" in order to compensate for the pids
>> uncharge lag (and this depends on the speed of the CPU and how busy
>> the system is).
> Yeah, that actually makes a lot of sense.  We can't keep everything
> synchronous for obvious performance reasons but we definitely can wait
> for RCU grace period before failing.  Forking might become a bit
> slower while pids are draining but shouldn't fail and that shouldn't
> incur any performance overhead in normal conditions when pids aren't
> constrained.

I lack the expertise to comment on this. As a system administrator, I 
can only point out that machines with 80+ CPU cores are common 
nowadays. I don't know how the RCU grace period scales with an 
increasing number of CPUs.

If you develop a patch for this, we can try it in production and give 
you feedback. Just send me an email notification.

Thank you for your time and attention!

--
Ivan
