* Re: Cgroups "pids" controller does not update "pids.current" count immediately [not found] <77af3805-e912-2664-f347-e30c0919d0c4@icdsoft.com> @ 2018-06-14 17:26 ` Aleksa Sarai 2018-06-14 17:27 ` Aleksa Sarai [not found] ` <20180614150650.GU1351649@devbig577.frc2.facebook.com> 1 sibling, 1 reply; 9+ messages in thread From: Aleksa Sarai @ 2018-06-14 17:26 UTC (permalink / raw) To: Ivan Zahariev; +Cc: cgroups, linux-kernel [-- Attachment #1: Type: text/plain, Size: 1628 bytes --] On 2018-06-14, Ivan Zahariev <famzah@icdsoft.com> wrote: > I posted a kernel bug about this a month ago but it did not receive any > attention: https://bugzilla.kernel.org/show_bug.cgi?id=199713 I believe that very few people watch the kernel bugzilla -- it's almost always better to send a mail to LKML (speaking of which, you should always include <linux-kernel@vger.kernel.org> in Cc). > I've tested this on 4.14.27 and 4.4.0-124-generic Ubuntu. > > If I start a couple of processes which exit very quickly (like a simple Bash > script with many commands in it), the reported value in "pids.current" is > not updated immediately when processes exit. This leads to too many > processes incorrectly accounted in "pids.current" which hits the "pids.max" > prematurely. One possible reason for this might be related to zombie processes. cgroup.procs doesn't include any zombie processes (tasks are removed when they exit(2)), but the pids controller does track zombies (tasks are removed when the 'struct task' is put'd). This could explain why there's a discrepancy which clears itself up after a short period of time -- though I am not sure that your reproducer will actually produce zombies (I only took a quick look at it). > The "memory" controller, for example, works as expected and does not suffer > from this asynchronous lag. I'm not sure what makes the memory controller and the pids controller comparable in this aspect -- there is no "pids.current" for the memory controller. -- Aleksa Sarai Senior Software Engineer (Containers) SUSE Linux GmbH <https://www.cyphar.com/> [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Cgroups "pids" controller does not update "pids.current" count immediately 2018-06-14 17:26 ` Cgroups "pids" controller does not update "pids.current" count immediately Aleksa Sarai @ 2018-06-14 17:27 ` Aleksa Sarai 0 siblings, 0 replies; 9+ messages in thread From: Aleksa Sarai @ 2018-06-14 17:27 UTC (permalink / raw) To: Ivan Zahariev; +Cc: cgroups, linux-kernel [-- Attachment #1: Type: text/plain, Size: 1228 bytes --] On 2018-06-15, Aleksa Sarai <asarai@suse.de> wrote: > > I've tested this on 4.14.27 and 4.4.0-124-generic Ubuntu. > > > > If I start a couple of processes which exit very quickly (like a simple Bash > > script with many commands in it), the reported value in "pids.current" is > > not updated immediately when processes exit. This leads to too many > > processes incorrectly accounted in "pids.current" which hits the "pids.max" > > prematurely. > > One possible reason for this might be related to zombie processes. > cgroup.procs doesn't include any zombie processes (tasks are removed > when they exit(2)), but the pids controller does track zombies (tasks > are removed when the 'struct task' is put'd). This could explain why > there's a discrepancy which clears itself up after a short period of > time -- though I am not sure that your reproducer will actually produce > zombies (I only took a quick look at it). Scratch that -- it can happen even without zombies. Basically it just depends on when the 'task struct' is freed (which could happen arbitrarily later than the process exit(2)-ed). -- Aleksa Sarai Senior Software Engineer (Containers) SUSE Linux GmbH <https://www.cyphar.com/> [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 9+ messages in thread
[parent not found: <20180614150650.GU1351649@devbig577.frc2.facebook.com>]
* Re: Cgroups "pids" controller does not update "pids.current" count immediately [not found] ` <20180614150650.GU1351649@devbig577.frc2.facebook.com> @ 2018-06-15 14:26 ` Ivan Zahariev 2018-06-15 15:41 ` Tejun Heo 0 siblings, 1 reply; 9+ messages in thread From: Ivan Zahariev @ 2018-06-15 14:26 UTC (permalink / raw) To: Tejun Heo; +Cc: cgroups, linux-kernel On 14.6.2018 г. 18:06 ч., Tejun Heo wrote: > On Thu, Jun 14, 2018 at 02:56:00PM +0300, Ivan Zahariev wrote: >> I posted a kernel bug about this a month ago but it did not receive >> any attention: https://bugzilla.kernel.org/show_bug.cgi?id=199713 >> >> Here is a copy of the bug report and I hope that this is the correct >> place to discuss this: > Well, for now at least, that's the expected behavior. It's not > supposed to be able to account all changes immediately (the kernel > doesn't free a lot of things immediately for performance and other > reasons). The intended use is setting up a reasonable upperbound with > some buffer space. If that's by design, it's a bit disappointing and at least the docs should mention it. The standard RLIMIT_NPROC does not suffer from such accounting discrepancies at any time. The "memory" cgroups controller also does not suffer from any discrepancies -- it accounts memory usage in real time without any lag on process start or exit. The "tasks" file list is also always up-to-date. Is it really technically not possible to make "pids.current" do accounting properly like RLIMIT_NPROC does? We were hoping to replace RLIMIT_NPROC with the "pids" controller. --Ivan ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Cgroups "pids" controller does not update "pids.current" count immediately 2018-06-15 14:26 ` Ivan Zahariev @ 2018-06-15 15:41 ` Tejun Heo 2018-06-15 16:07 ` Ivan Zahariev 0 siblings, 1 reply; 9+ messages in thread From: Tejun Heo @ 2018-06-15 15:41 UTC (permalink / raw) To: Ivan Zahariev; +Cc: cgroups, linux-kernel Hello, On Fri, Jun 15, 2018 at 05:26:04PM +0300, Ivan Zahariev wrote: > The standard RLIMIT_NPROC does not suffer from such accounting > discrepancies at any time. RLIMIT_NPROC uses a dedicated atomic counter which is updated when the process is getting reaped; however, that doesn't actually coincide with the pid being freed. The base pid ref is put then but there can be other refs and even after that it has to go through RCU grace period to be actually freed. They seem equivalent but serve a bit different purposes. RLIMIT_NPROC is primarily about limiting what the user can do and doesn't guarantee that that actually matches resource (pid here) consumption. pid controller's primary role is limiting pid consumption - ie. no matter what happens the cgroup must not be able to take away more than the specified number from the available pool, which has to account for the lazy release and draining refs and stuff. > The "memory" cgroups controller also does > not suffer from any discrepancies -- it accounts memory usage in > real time without any lag on process start or exit. The "tasks" file > list is also always up-to-date. The memory controller does the same thing, actually way more extensively. It's just less noticeable because people generally don't try to control at individual page level. > Is it really technically not possible to make "pids.current" do > accounting properly like RLIMIT_NPROC does? We were hoping to > replace RLIMIT_NPROC with the "pids" controller. It is of course possible but at a cost. The cost (getting rid of lazy release optimizations) is just not justifiable for most cases. Thanks. -- tejun ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Cgroups "pids" controller does not update "pids.current" count immediately 2018-06-15 15:41 ` Tejun Heo @ 2018-06-15 16:07 ` Ivan Zahariev 2018-06-15 16:16 ` Tejun Heo 0 siblings, 1 reply; 9+ messages in thread From: Ivan Zahariev @ 2018-06-15 16:07 UTC (permalink / raw) To: Tejun Heo; +Cc: cgroups, linux-kernel Hi, Thank you for the quick and insightful reply. I have one suggestion below: On 15.6.2018 г. 18:41 ч., Tejun Heo wrote: > On Fri, Jun 15, 2018 at 05:26:04PM +0300, Ivan Zahariev wrote: >> The standard RLIMIT_NPROC does not suffer from such accounting >> discrepancies at any time. > They seem equivalent but serve a bit different purposes. RLIMIT_NPROC > is primarily about limiting what the user can do and doesn't guarantee > that that actually matches resource (pid here) consumption. > >> Is it really technically not possible to make "pids.current" do >> accounting properly like RLIMIT_NPROC does? We were hoping to >> replace RLIMIT_NPROC with the "pids" controller. > It is of course possible but at a cost. The cost (getting rid of lazy > release optimizations) is just not justifiable for most cases. I understand all concerns and design decisions. However, having RLIMIT_NPROC support combined with "cgroups" hierarchy would be very handy. Does it make sense that you introduce "nproc.current" and "nproc.max" metrics which work in the same atomic, real-time way like RLIMIT_NPROC? Or make this in a new "nproc" controller? -- Ivan -- ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Cgroups "pids" controller does not update "pids.current" count immediately 2018-06-15 16:07 ` Ivan Zahariev @ 2018-06-15 16:16 ` Tejun Heo 2018-06-15 17:40 ` Ivan Zahariev 0 siblings, 1 reply; 9+ messages in thread From: Tejun Heo @ 2018-06-15 16:16 UTC (permalink / raw) To: Ivan Zahariev; +Cc: cgroups, linux-kernel Hello, On Fri, Jun 15, 2018 at 07:07:27PM +0300, Ivan Zahariev wrote: > I understand all concerns and design decisions. However, having > RLIMIT_NPROC support combined with "cgroups" hierarchy would be very > handy. > > Does it make sense that you introduce "nproc.current" and > "nproc.max" metrics which work in the same atomic, real-time way > like RLIMIT_NPROC? Or make this in a new "nproc" controller? I'm skeptical for two reasons. 1. That doesn't sound much like a resource control problem but more of a policy enforcement problem. 2. and it's difficult to see why such policies would need to be that strict. Where is the requirement coming from? Thanks. -- tejun ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Cgroups "pids" controller does not update "pids.current" count immediately 2018-06-15 16:16 ` Tejun Heo @ 2018-06-15 17:40 ` Ivan Zahariev 2018-06-15 19:07 ` Tejun Heo 0 siblings, 1 reply; 9+ messages in thread From: Ivan Zahariev @ 2018-06-15 17:40 UTC (permalink / raw) To: Tejun Heo; +Cc: cgroups, linux-kernel Hello, On 15.6.2018 г. 19:16 ч., Tejun Heo wrote: > On Fri, Jun 15, 2018 at 07:07:27PM +0300, Ivan Zahariev wrote: >> I understand all concerns and design decisions. However, having >> RLIMIT_NPROC support combined with "cgroups" hierarchy would be very >> handy. >> >> Does it make sense that you introduce "nproc.current" and >> "nproc.max" metrics which work in the same atomic, real-time way >> like RLIMIT_NPROC? Or make this in a new "nproc" controller? > I'm skeptical for two reasons. > > 1. That doesn't sound much like a resource control problem but more of > a policy enforcement problem. > > 2. and it's difficult to see why such policies would need to be that > strict. Where is the requirement coming from? > The lazy pids accounting + modern fast CPUs makes the "pids.current" metric practically unusable for resource limiting in our case. For a test, when we started and ended one single process very quickly, we saw "pids.current" equal up to 185 (while the correct value at all time is either 0 or 1). If we want that a "cgroup" can spawn maximum 50 processes, we should use some high value like 300 for "pids.max", in order to compensate the pids uncharge lag (and this depends on the speed of the CPU and how busy the system is). Our use-case is for a shared web hosting service. Our customers start a CGI process for each PHP web request and therefore process start/end happens at a very high rate. We don't want customers to be able to launch too many CGI processes (NPROC limit) because this exhausts the web & database servers, and probably obsesses Linux kernel resources (like total "opened files" per user). Furthermore, some users are malicious and launch fork-bombs and other resource-exhaustion attacks. You may be right that we enforce a policy rather than resource control. This has worked for us for 15+ years now. The motivation is that a global RLIMIT_NPROC easily let's us limit all system and Linux kernel resources "per customer" ("cgroups" allows us to limit only certain system resources). Additionally, not all user-space daemons allow for a granular "per user" limit or proper grouping (for example, MySQL has only users, and no "per customer" groups support). Now we want to have different "cgroups" hierarchies for a customer (SSH, CGI, Crond), each with their own RLIMIT_NPROC, and a total RLIMIT_NPROC for the parent "per customer" cgroup. Excuse me for the lengthy post :-) -- Ivan ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Cgroups "pids" controller does not update "pids.current" count immediately 2018-06-15 17:40 ` Ivan Zahariev @ 2018-06-15 19:07 ` Tejun Heo 2018-06-15 19:38 ` Ivan Zahariev 0 siblings, 1 reply; 9+ messages in thread From: Tejun Heo @ 2018-06-15 19:07 UTC (permalink / raw) To: Ivan Zahariev; +Cc: cgroups, linux-kernel Hello, Ivan. On Fri, Jun 15, 2018 at 08:40:02PM +0300, Ivan Zahariev wrote: > The lazy pids accounting + modern fast CPUs makes the "pids.current" > metric practically unusable for resource limiting in our case. For a > test, when we started and ended one single process very quickly, we > saw "pids.current" equal up to 185 (while the correct value at all > time is either 0 or 1). If we want that a "cgroup" can spawn maximum > 50 processes, we should use some high value like 300 for "pids.max", > in order to compensate the pids uncharge lag (and this depends on > the speed of the CPU and how busy the system is). Yeah, that actually makes a lot of sense. We can't keep everything synchronous for obvious performance reasons but we definitely can wait for RCU grace period before failing. Forking might become a bit slower while pids are draining but shouldn't fail and that shouldn't incur any performance overhead in normal conditions when pids aren't constrained. Thanks. -- tejun ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Cgroups "pids" controller does not update "pids.current" count immediately 2018-06-15 19:07 ` Tejun Heo @ 2018-06-15 19:38 ` Ivan Zahariev 0 siblings, 0 replies; 9+ messages in thread From: Ivan Zahariev @ 2018-06-15 19:38 UTC (permalink / raw) To: Tejun Heo; +Cc: cgroups, linux-kernel Hello, On 15.6.2018 г. 22:07 ч., Tejun Heo wrote: > On Fri, Jun 15, 2018 at 08:40:02PM +0300, Ivan Zahariev wrote: >> The lazy pids accounting + modern fast CPUs makes the "pids.current" >> metric practically unusable for resource limiting in our case. For a >> test, when we started and ended one single process very quickly, we >> saw "pids.current" equal up to 185 (while the correct value at all >> time is either 0 or 1). If we want that a "cgroup" can spawn maximum >> 50 processes, we should use some high value like 300 for "pids.max", >> in order to compensate the pids uncharge lag (and this depends on >> the speed of the CPU and how busy the system is). > Yeah, that actually makes a lot of sense. We can't keep everything > synchronous for obvious performance reasons but we definitely can wait > for RCU grace period before failing. Forking might become a bit > slower while pids are draining but shouldn't fail and that shouldn't > incur any performance overhead in normal conditions when pids aren't > constrained. I lack expertise to comment on this. As a system administrator, I can only remind that nowadays machines with 80+ CPU cores are something usual. I don't know how the RCU grace period scales with an increasing number of CPUs. If you develop a patch for this, we can try it in production and give you feedback. Just send me an email notification. Thank you for your time and attention! -- Ivan ^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2018-06-15 19:38 UTC | newest] Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- [not found] <77af3805-e912-2664-f347-e30c0919d0c4@icdsoft.com> 2018-06-14 17:26 ` Cgroups "pids" controller does not update "pids.current" count immediately Aleksa Sarai 2018-06-14 17:27 ` Aleksa Sarai [not found] ` <20180614150650.GU1351649@devbig577.frc2.facebook.com> 2018-06-15 14:26 ` Ivan Zahariev 2018-06-15 15:41 ` Tejun Heo 2018-06-15 16:07 ` Ivan Zahariev 2018-06-15 16:16 ` Tejun Heo 2018-06-15 17:40 ` Ivan Zahariev 2018-06-15 19:07 ` Tejun Heo 2018-06-15 19:38 ` Ivan Zahariev
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).