From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Subject: Re: [PATCH 00/14] Present useful limits to user (v2) To: Topi Miettinen , bsingharora@gmail.com References: <1468578983-28229-1-git-send-email-toiwoton@gmail.com> <20160715130458.GB21685@350D> <41b6ca51-1358-0fd7-b45a-dc29a1344865@gmail.com> Cc: linux-kernel@vger.kernel.org, Jonathan Corbet , Tony Luck , Fenghua Yu , Alexander Graf , Paolo Bonzini , =?UTF-8?B?UmFkaW0gS3LEjW3DocWZ?= , Benjamin Herrenschmidt , Paul Mackerras , Michael Ellerman , Thomas Gleixner , Ingo Molnar , "H. Peter Anvin" , "maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)" , Sean Hefty , Hal Rosenstock , Mike Marciniszyn , Dennis Dalessandro , Christian Benvenuti , Dave Goodell , Sudeep Dutt , Ashutosh Dixit , Alex Williamson , Alexander Viro , Tejun Heo , Li Zefan , Johannes Weiner , Peter Zijlstra , Alexei Starovoitov , Arnaldo Carvalho de Melo , Alexander Shishkin , Markus Elfring , "David S. Miller" , Nicolas Dichtel , Andrew Morton , Konstantin Khlebnikov , Jiri Slaby , Cyrill Gorcunov , Michal Hocko , Vlastimil Babka , Dave Hansen , Greg Kroah-Hartman , Dan Carpenter , Michael Kerrisk , "Kirill A. Shutemov" , Marcus Gelderie , Vladimir Davydov , Joe Perches , Frederic Weisbecker , Andrea Arcangeli , "Eric W. 
Biederman" , Andi Kleen , Oleg Nesterov , Stas Sergeev , "Amanieu d'Antras" , Richard Weinberger , Wang Xiaoqiang , Helge Deller , Mateusz Guzik , Alex Thorlton , Ben Segall , John Stultz , Rik van Riel , Eric B Munson , Alexey Klimov , Chen Gang , Andrey Ryabinin , David Rientjes , Hugh Dickins , Alexander Kuleshov , "open list:DOCUMENTATION" , "open list:IA64 (Itanium) PLATFORM" , "open list:KERNEL VIRTUAL MACHINE (KVM) FOR POWERPC" , "open list:KERNEL VIRTUAL MACHINE (KVM)" , "open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)" , "open list:INFINIBAND SUBSYSTEM" , "open list:FILESYSTEMS (VFS and infrastructure)" , "open list:CONTROL GROUP (CGROUP)" , "open list:BPF (Safe dynamic programs and tools)" , "open list:MEMORY MANAGEMENT" From: Doug Ledford Message-ID: Date: Mon, 18 Jul 2016 18:05:31 -0400 MIME-Version: 1.0 In-Reply-To: <41b6ca51-1358-0fd7-b45a-dc29a1344865@gmail.com> Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="F9jkiFT0bsWgPk4OVF6FuXMes50sEeJEW" Sender: owner-linux-mm@kvack.org List-ID:

This is an OpenPGP/MIME signed message (RFC 4880 and 3156)
--F9jkiFT0bsWgPk4OVF6FuXMes50sEeJEW
Content-Type: multipart/mixed; boundary="8Nu7KMlW20LvvH3vroJakcrp9lO3XsH14"

--8Nu7KMlW20LvvH3vroJakcrp9lO3XsH14
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: quoted-printable

On 7/15/2016 12:35 PM, Topi Miettinen wrote:
> On 07/15/16 13:04, Balbir Singh wrote:
>> On Fri, Jul 15, 2016 at 01:35:47PM +0300, Topi Miettinen wrote:
>>> Hello,
>>>
>>> There are many basic ways to control processes, including capabilities,
>>> cgroups and resource limits. However, there are far fewer ways to find out
>>> useful values for the limits, except blind trial and error.
>>>
>>> This patch series attempts to fix that by giving at least a nice starting
>>> point from the highwater mark values of the resources in question.
>>> I looked where each limit is checked and added a call to update the mark
>>> nearby.
>>>
>>> Example run of program from Documentation/accounting/getdelays.c:
>>>
>>> ./getdelays -R -p `pidof smartd`
>>> printing resource accounting
>>> RLIMIT_CPU=0
>>> RLIMIT_FSIZE=0
>>> RLIMIT_DATA=18198528
>>> RLIMIT_STACK=135168
>>> RLIMIT_CORE=0
>>> RLIMIT_RSS=0
>>> RLIMIT_NPROC=1
>>> RLIMIT_NOFILE=55
>>> RLIMIT_MEMLOCK=0
>>> RLIMIT_AS=130879488
>>> RLIMIT_LOCKS=0
>>> RLIMIT_SIGPENDING=0
>>> RLIMIT_MSGQUEUE=0
>>> RLIMIT_NICE=0
>>> RLIMIT_RTPRIO=0
>>> RLIMIT_RTTIME=0
>>>
>>> ./getdelays -R -C /sys/fs/cgroup/systemd/system.slice/smartd.service/
>>> printing resource accounting
>>> sleeping 1, blocked 0, running 0, stopped 0, uninterruptible 0
>>> RLIMIT_CPU=0
>>> RLIMIT_FSIZE=0
>>> RLIMIT_DATA=18198528
>>> RLIMIT_STACK=135168
>>> RLIMIT_CORE=0
>>> RLIMIT_RSS=0
>>> RLIMIT_NPROC=1
>>> RLIMIT_NOFILE=55
>>> RLIMIT_MEMLOCK=0
>>> RLIMIT_AS=130879488
>>> RLIMIT_LOCKS=0
>>> RLIMIT_SIGPENDING=0
>>> RLIMIT_MSGQUEUE=0
>>> RLIMIT_NICE=0
>>> RLIMIT_RTPRIO=0
>>> RLIMIT_RTTIME=0
>>
>> Does this mean that rlimit_data and rlimit_stack should be set to the
>> values as specified by the data above?
>
> My plan is that either system administrator, distro maintainer or even
> upstream developer can get reasonable values for the limits. They may
> still be wrong, but things would be better than without any help to
> configure the system.

This is not necessarily true. It seems like there is a disconnect between what these various values are for and what you are positioning them as. Most of these limits are meant to protect the system from resource starvation crashes. They aren't meant to be any sort of double check on a specific application.
The vast majority of applications can have bugs, leak resources, and do all sorts of other bad things and still not hit these limits. A program that leaks a file handle an hour but normally has only 50 handles in use would take 950 hours of constant leaking before these limits would kick in to bring the program under control. That's over a month. What's more, the kernel couldn't really care less that a single application leaked files until it got to 1000 open. The real point of the limit on file handles (since they are cheap) is just not to let the system get brought down. Someone could maliciously fire up 1000 processes, and they could all attempt to open up as many files as possible in order to drown the system in open inodes. The combination of the limit on maximum user processes and the limit on maximum files per process is intended to prevent this. These limits are not intended to prevent a single, properly running application from operating. In fact, there are very few applications that are likely to break the 1000-files-per-process limit. It is outrageously high for most applications. They will leak files and do all sorts of bad things without this ever stopping them. But it does stop malicious programs. And the process limit stops malicious users too.

The max locked memory is used by almost no processes, and for the very few that use it, the default is more than enough. The major exception is the RDMA stack, which uses locked memory so heavily that we just disable the limit on large systems, because it's impossible to predict how much we'll need and we don't want a job to get killed because it couldn't get the memory it needs for buffers.
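
A minimal sketch of the file-handle arithmetic above, assuming a steady leak rate and the 1000-descriptor figure used in this mail. The helper names hours_until_limit and current_limits are hypothetical and exist only for this illustration; the only real API used is Python's standard resource module, which reports the soft and hard limits of the calling process. Actual defaults vary by distribution.

```python
import resource


def hours_until_limit(limit: int, in_use: int, leaks_per_hour: float) -> float:
    """Hours of steady leaking before `in_use` grows to `limit`."""
    if leaks_per_hour <= 0:
        raise ValueError("leaks_per_hour must be positive")
    return max(limit - in_use, 0) / leaks_per_hour


def current_limits() -> dict:
    """Soft and hard values of a few per-process limits on this machine."""
    names = ("RLIMIT_NOFILE", "RLIMIT_NPROC", "RLIMIT_MEMLOCK")
    limits = {}
    for name in names:
        # Not every platform defines every limit, so skip missing ones.
        res = getattr(resource, name, None)
        if res is not None:
            limits[name] = resource.getrlimit(res)
    return limits


if __name__ == "__main__":
    # Figures from the mail: 50 handles in use, one leaked per hour,
    # and an assumed limit of 1000 open files.
    hours = hours_until_limit(limit=1000, in_use=50, leaks_per_hour=1)
    print(f"{hours:.0f} hours, about {hours / 24:.1f} days")
    for name, (soft, hard) in current_limits().items():
        print(f"{name}: soft={soft} hard={hard}")
```

With the figures from the mail this works out to 950 hours, a little under 40 days, which is the "over a month" quoted above. An unlimited value is reported as resource.RLIM_INFINITY.
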
The limit on POSIX message queues is another one where the default is more than enough for most applications, which don't use this feature at all, and the few systems that do use it adjust the limit to something sane for their workload. We can't make the default sane for these special systems or else it becomes an avenue for denial-of-service attacks, so the default must stay low and servers that make extensive use of this feature must raise their limit on a case-by-case basis.

>>
>> Do we expect a smart user space daemon to then tweak the RLIMIT values?
>
> Someone could write an autotuning daemon that checks if the system has
> changed (for example due to upgrade) and then run some tests to
> reconfigure the system. But the limits are a bit too fragile, or rather,
> applications can't handle failure, so I don't know if that would really
> work.

This misses the point of most of these limits. They aren't there to keep normal processes and normal users in check. They are there to stop runaway use. This runaway situation might be accidental, or it might be the work of a nefarious user. The limits are generally set exceedingly high for those things every application uses, and fairly low for those things that almost no application uses but which could be abused by the nefarious user crowd.

Moreover, for a large percentage of applications, the highwatermark is a source of great trickery. For instance, if you have a web server that is hosting web pages written in Python, and is therefore using mod_python in the httpd server (assuming Apache here), then your highwatermark will never be a reliable, stable thing. If you get 1000 web requests in a minute, all utilizing the mod_python resource in the web server, and you don't have your httpd configured to restart after every few hundred requests handled, then mod_python in your httpd process will grow seemingly without limit. It will consume tons of memory.
And the only limit on how much memory it will consume is the number of web requests it handles between its garbage collection intervals multiplied by the memory it allocates per request. If you don't happen to catch the absolute highest amount while you are gathering your watermarks, then when you actually switch the system to enforcing the limits you learned from all your highwatermarks, load spikes will cause random program failures. (You are planning on doing that, aren't you? I didn't see a copy of patch 1/14, so I don't know if this infrastructure ever goes back to enforcing the limits or not, but I would assume so. What point is there in learning what the limits should be if you then never turn around and enforce them?)

Really, this looks like a solution in search of a problem. Right now, the limits are set where they are because they do two things:

1) Stay out of the way of the vast majority of applications. Those applications that get tripped up by the defaults (like RDMA applications getting stopped by memlock settings) have setup guides that spell out which limits need to be changed and hints on what to change them to.

2) Stop nefarious users or errant applications from a total runaway situation on a machine.

If your applications run without failing on a limit unless they have already failed in some other way, and the whole machine doesn't go down with your failed application, then the limits are working as designed. If your typical machine configuration includes 256GB of RAM, then you could probably stand to increase some of the limits safely if you wanted to. But unless you have applications getting killed because of these limits, why would you?

Right now, I'm inclined to NAK the patch set. I've only seen patch 9/14, since you didn't Cc: everyone on patch 1/14, which added the infrastructure. But, as I mentioned in another email, I think this can be accomplished via a systemtap script instead, so we keep the clutter out of the kernel.
And more importantly, these patches seem to treat these limits as though they are supposed to be some sort of tight-fitting container around applications that catches an errant application as soon as it steps out of bounds. Nothing could be further from the truth, and if we actually implemented something of that sort, programs susceptible to high resource usage during load spikes would suddenly start failing on a frequent basis. The proof that these limits are working is that we rarely hear from users about their programs being killed for resource consumption, and yet we also don't hear from users about their systems going down due to runaway applications. From what I can tell from these patches, I would expect complaints about one of those two issues to increase once these patches are in place and put in use, and that doesn't seem like a good thing.

-- 
Doug Ledford
    GPG Key ID: 0E572FDD

--8Nu7KMlW20LvvH3vroJakcrp9lO3XsH14--
--F9jkiFT0bsWgPk4OVF6FuXMes50sEeJEW--