Re: [PATCH 05/17] Add io_uring IO interface

From: Jens Axboe <axboe@kernel.dk>
To: Roman Penyaev <rpenyaev@suse.de>
Cc: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org,
	linux-block@vger.kernel.org, hch@lst.de, jmoyer@redhat.com,
	avi@scylladb.com, linux-block-owner@vger.kernel.org
Subject: Re: [PATCH 05/17] Add io_uring IO interface
Date: Mon, 21 Jan 2019 09:23:44 -0700
Message-ID: <df5b04ea-1c7c-03e5-087e-d9e3763d6670@kernel.dk>
In-Reply-To: <eb1e623843cd26ced5d06deb7fdb7851@suse.de>

On 1/21/19 8:58 AM, Roman Penyaev wrote:
> On 2019-01-21 16:30, Jens Axboe wrote:
>> On 1/21/19 2:13 AM, Roman Penyaev wrote:
>>> On 2019-01-18 17:12, Jens Axboe wrote:
>>>
>>> [...]
>>>
>>>> +
>>>> +static int io_uring_create(unsigned entries, struct io_uring_params *p,
>>>> +			   bool compat)
>>>> +{
>>>> +	struct user_struct *user = NULL;
>>>> +	struct io_ring_ctx *ctx;
>>>> +	int ret;
>>>> +
>>>> +	if (entries > IORING_MAX_ENTRIES)
>>>> +		return -EINVAL;
>>>> +
>>>> +	/*
>>>> +	 * Use twice as many entries for the CQ ring. It's possible for the
>>>> +	 * application to drive a higher depth than the size of the SQ ring,
>>>> +	 * since the sqes are only used at submission time. This allows for
>>>> +	 * some flexibility in overcommitting a bit.
>>>> +	 */
>>>> +	p->sq_entries = roundup_pow_of_two(entries);
>>>> +	p->cq_entries = 2 * p->sq_entries;
>>>> +
>>>> +	if (!capable(CAP_IPC_LOCK)) {
>>>> +		user = get_uid(current_user());
>>>> +		ret = __io_account_mem(user, ring_pages(p->sq_entries,
>>>> +							p->cq_entries));
>>>> +		if (ret) {
>>>> +			free_uid(user);
>>>> +			return ret;
>>>> +		}
>>>> +	}
>>>> +
>>>> +	ctx = io_ring_ctx_alloc(p);
>>>> +	if (!ctx)
>>>> +		return -ENOMEM;
>>>
>>> Hi Jens,
>>>
>>> It seems pages should be "unaccounted" back here and uid freed if path
>>> with "if (!capable(CAP_IPC_LOCK))" above was taken.
>>
>> Thanks, yes that is leaky. I'll fix that up.
>>
>>> But really, could someone please explain to me what is wrong with
>>> allocating all urings in mmap() without touching RLIMIT_MEMLOCK at
>>> all?  Then all memory will be accounted to the caller app, and if the
>>> app is greedy it will be killed by the OOM killer.  What am I missing?
>>
>> I don't really see what that'd change, whether we do it off the ->mmap()
>> or when we set up the io_uring instance with io_uring_setup(2). We need
>> this memory to be pinned, we can't fault on it.
> 
> Hm, I thought that for pinning there is a separate counter, ->pinned_vm
> (introduced by bc3e53f682d9 ("mm: distinguish between mlocked and pinned
> pages")), which does not seem to be wired up to anything: it is just a
> counter, used by a couple of drivers.

io_uring doesn't inc/dec either of those, but it probably should. Since
->pinned_vm appears to be rather unused, it's probably not a big deal.

> Hmmm.. Frankly, now I am lost. You map these pages through
> remap_pfn_range(), so virtual user mapping won't fault, right?  And
> these pages you allocate with GFP_KERNEL, so they are already pinned.

Right, they will not fault. My point is that it sounded like you want
the application to allocate this memory in userspace, and then have the
kernel map it. I don't want to do that; it brings its own host of
issues (we used to do that). The mmap(2) of kernel memory is much
cleaner.

> So now I do not understand why this accounting is needed at all :)
> The only reason I had in mind is some kind of accounting, to filter out
> greedy and nasty apps.  If this is not the case, then I am lost.
> Could you please explain?

We need some kind of limit, to prevent a user from creating millions of
io_uring instances and pinning down everything. The old aio code realized
this after the fact, and added some silly sysctls to control this. I
want to avoid the same mess, and hence it makes more sense to tie into
some kind of limiting we already have, like RLIMIT_MEMLOCK. Since we're
using that rlimit, accounting the memory as locked is the right way to
go.
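
To make the accounting idea concrete, here is a minimal userspace sketch
of charging ring pages against a per-user limit, including the unwind on
allocation failure that Roman pointed out was missing. It is a model
under stated assumptions, not the kernel code from the patch: struct
user_model, account_mem(), unaccount_mem(), create_ring(), destroy_ring()
and the fail_alloc flag are hypothetical names, page_limit stands in for
RLIMIT_MEMLOCK expressed in pages, and malloc() stands in for
io_ring_ctx_alloc().

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-user locked-page counter. */
struct user_model {
	atomic_ulong locked_pages;	/* pages currently charged */
	unsigned long page_limit;	/* models RLIMIT_MEMLOCK, in pages */
};

/* Charge nr_pages against the limit with a lock-free retry loop. */
static int account_mem(struct user_model *u, unsigned long nr_pages)
{
	unsigned long cur = atomic_load(&u->locked_pages);
	unsigned long new_pages;

	do {
		new_pages = cur + nr_pages;
		if (new_pages > u->page_limit)
			return -ENOMEM;
		/* On failure, cur is refreshed with the current counter value. */
	} while (!atomic_compare_exchange_weak(&u->locked_pages, &cur,
					       new_pages));

	return 0;
}

/* Give back a previous charge. */
static void unaccount_mem(struct user_model *u, unsigned long nr_pages)
{
	unsigned long old = atomic_fetch_sub(&u->locked_pages, nr_pages);

	assert(old >= nr_pages);	/* never return more than was charged */
}

/*
 * Models ring creation: charge first, then allocate. If the allocation
 * fails, the charge is undone before returning, so nothing leaks.
 * fail_alloc is a test hook that simulates allocation failure.
 */
static int create_ring(struct user_model *u, unsigned long nr_pages,
		       int fail_alloc, void **out)
{
	void *ctx;
	int ret;

	ret = account_mem(u, nr_pages);
	if (ret)
		return ret;

	ctx = fail_alloc ? NULL : malloc(64);
	if (!ctx) {
		unaccount_mem(u, nr_pages);
		return -ENOMEM;
	}

	*out = ctx;
	return 0;
}

/* Tear down a ring and return its pages to the user's budget. */
static void destroy_ring(struct user_model *u, unsigned long nr_pages,
			 void *ctx)
{
	free(ctx);
	unaccount_mem(u, nr_pages);
}
```

With a limit of 8 pages in this model, a failed create leaves the
counter where it started, and a user cannot hold more than 8 pages
across all rings at once, however many instances it tries to create.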

-- 
Jens Axboe