Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB

From: Jens Axboe <axboe@kernel.dk>
To: David Hildenbrand <david@redhat.com>,
	Andrew Dona-Couch <andrew@donacou.ch>,
	Andrew Morton <akpm@linux-foundation.org>,
	Drew DeVault <sir@cmpwn.com>
Cc: Ammar Faizi <ammarfaizi2@gnuweeb.org>,
	linux-kernel@vger.kernel.org, linux-api@vger.kernel.org,
	io_uring Mailing List <io-uring@vger.kernel.org>,
	Pavel Begunkov <asml.silence@gmail.com>,
	linux-mm@kvack.org
Subject: Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB
Date: Mon, 22 Nov 2021 13:44:11 -0700	[thread overview]
Message-ID: <3adc55d3-f383-efa9-7319-740fc6ab5d7a@kernel.dk> (raw)
In-Reply-To: <5f998bb7-7b5d-9253-2337-b1d9ea59c796@redhat.com>

On 11/22/21 1:08 PM, David Hildenbrand wrote:
> On 22.11.21 20:53, Jens Axboe wrote:
>> On 11/22/21 11:26 AM, David Hildenbrand wrote:
>>> On 22.11.21 18:55, Andrew Dona-Couch wrote:
>>>> Forgive me for jumping in to an already overburdened thread.  But can
>>>> someone pushing back on this clearly explain the issue with applying
>>>> this patch?
>>>
>>> It will allow unprivileged users to easily and even "accidentally"
>>> allocate more unmovable memory than it should in some environments. Such
>>> limits exist for a reason. And there are ways for admins/distros to
>>> tweak these limits if they know what they are doing.
>>
>> But that's entirely the point, the cases where this change is needed are
>> already screwed by a distro and the user is the administrator. This is
>> _exactly_ the case where things should just work out of the box. If
>> you're managing farms of servers, yeah you have competent administration
>> and you can be expected to tweak settings to get the best experience and
>> performance, but the kernel should provide a sane default. 64K isn't a
>> sane default.
> 
> 0.1% of RAM isn't either.

No default is perfect, byt 0.1% will solve 99% of the problem. And most
likely solve 100% of the problems for the important case, which is where
you want things to Just Work on your distro without doing any
administration.  If you're aiming for perfection, it doesn't exist.

>>> This is not a step into the right direction. This is all just trying to
>>> hide the fact that we're exposing FOLL_LONGTERM usage to random
>>> unprivileged users.
>>>
>>> Maybe we could instead try getting rid of FOLL_LONGTERM usage and the
>>> memlock limit in io_uring altogether, for example, by using mmu
>>> notifiers. But I'm no expert on the io_uring code.
>>
>> You can't use mmu notifiers without impacting the fast path. This isn't
>> just about io_uring, there are other users of memlock right now (like
>> bpf) which just makes it even worse.
> 
> 1) Do we have a performance evaluation? Did someone try and come up with
> a conclusion how bad it would be?

I honestly don't remember the details, I took a look at it about a year
ago due to some unrelated reasons. These days it just pertains to
registered buffers, so it's less of an issue than back then when it
dealt with the rings as well. Hence might be feasible, I'm certainly not
against anyone looking into it. Easy enough to review and test for
performance concerns.

> 2) Could be provide a mmu variant to ordinary users that's just good
> enough but maybe not as fast as what we have today? And limit
> FOLL_LONGTERM to special, privileged users?

If it's not as fast, then it's most likely not good enough though...

> 3) Just because there are other memlock users is not an excuse. For
> example, VFIO/VDPA have to use it for a reason, because there is no way
> not do use FOLL_LONGTERM.

It's not an excuse, the statement merely means that the problem is
_worse_ as there are other memlock users.

>>
>> We should just make this 0.1% of RAM (min(0.1% ram, 64KB)) or something
>> like what was suggested, if that will help move things forward. IMHO the
>> 32MB machine is mostly a theoretical case, but whatever .
> 
> 1) I'm deeply concerned about large ZONE_MOVABLE and MIGRATE_CMA ranges
> where FOLL_LONGTERM cannot be used, as that memory is not available.
> 
> 2) With 0.1% RAM it's sufficient to start 1000 processes to break any
> system completely and deeply mess up the MM. Oh my.

We're talking per-user limits here. But if you want to talk hyperbole,
then 64K multiplied by some other random number will also allow
everything to be pinned, potentially.

-- 
Jens Axboe