From: Roman Penyaev <rpenyaev@suse.de>
To: Jens Axboe <axboe@kernel.dk>
Cc: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org,
	linux-block@vger.kernel.org, hch@lst.de, jmoyer@redhat.com,
	avi@scylladb.com, linux-block-owner@vger.kernel.org
Subject: Re: [PATCH 05/17] Add io_uring IO interface
Date: Mon, 21 Jan 2019 17:49:22 +0100
Message-ID: <4e7ef6f79c1fcd3aafa992ea9652e4ea@suse.de>
In-Reply-To: <df5b04ea-1c7c-03e5-087e-d9e3763d6670@kernel.dk>

On 2019-01-21 17:23, Jens Axboe wrote:
> On 1/21/19 8:58 AM, Roman Penyaev wrote:
>> On 2019-01-21 16:30, Jens Axboe wrote:
>>> On 1/21/19 2:13 AM, Roman Penyaev wrote:
>>>> On 2019-01-18 17:12, Jens Axboe wrote:
>>>> 
>>>> [...]
>>>> 
>>>>> +
>>>>> +static int io_uring_create(unsigned entries, struct io_uring_params *p,
>>>>> +			   bool compat)
>>>>> +{
>>>>> +	struct user_struct *user = NULL;
>>>>> +	struct io_ring_ctx *ctx;
>>>>> +	int ret;
>>>>> +
>>>>> +	if (entries > IORING_MAX_ENTRIES)
>>>>> +		return -EINVAL;
>>>>> +
>>>>> +	/*
>>>>> +	 * Use twice as many entries for the CQ ring. It's possible for the
>>>>> +	 * application to drive a higher depth than the size of the SQ ring,
>>>>> +	 * since the sqes are only used at submission time. This allows for
>>>>> +	 * some flexibility in overcommitting a bit.
>>>>> +	 */
>>>>> +	p->sq_entries = roundup_pow_of_two(entries);
>>>>> +	p->cq_entries = 2 * p->sq_entries;
>>>>> +
>>>>> +	if (!capable(CAP_IPC_LOCK)) {
>>>>> +		user = get_uid(current_user());
>>>>> +		ret = __io_account_mem(user, ring_pages(p->sq_entries,
>>>>> +							p->cq_entries));
>>>>> +		if (ret) {
>>>>> +			free_uid(user);
>>>>> +			return ret;
>>>>> +		}
>>>>> +	}
>>>>> +
>>>>> +	ctx = io_ring_ctx_alloc(p);
>>>>> +	if (!ctx)
>>>>> +		return -ENOMEM;
>>>> 
>>>> Hi Jens,
>>>> 
>>>> It seems the pages should be "unaccounted" here and the uid freed if
>>>> the "if (!capable(CAP_IPC_LOCK))" path above was taken.
>>> 
>>> Thanks, yes that is leaky. I'll fix that up.
>>> 
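To make the leak concrete, here is a minimal userspace model of that
error path, not the kernel code itself.  struct user_model,
account_mem(), unaccount_mem(), ring_create() and the fail_alloc knob
are all hypothetical stand-ins for user_struct, __io_account_mem(), its
missing counterpart, io_uring_create() and a failing
io_ring_ctx_alloc().  It only shows the unwinding that seems to be
needed when the context allocation fails after accounting succeeded:

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for struct user_struct. */
struct user_model {
	unsigned long locked_pages;	/* pages currently accounted */
	unsigned long limit_pages;	/* models RLIMIT_MEMLOCK, in pages */
	int refs;			/* models get_uid()/free_uid() */
};

/* Charge nr pages against the limit, or fail without side effects. */
static int account_mem(struct user_model *u, unsigned long nr)
{
	if (nr > u->limit_pages - u->locked_pages)
		return -ENOMEM;
	u->locked_pages += nr;
	return 0;
}

/* Give back nr pages that were charged earlier. */
static void unaccount_mem(struct user_model *u, unsigned long nr)
{
	assert(u->locked_pages >= nr);
	u->locked_pages -= nr;
}

/* fail_alloc forces the context allocation to fail, for demonstration. */
static int ring_create(struct user_model *u, unsigned long nr_pages,
		       int privileged, int fail_alloc, void **ctx_out)
{
	struct user_model *user = NULL;
	void *ctx;
	int ret;

	if (!privileged) {
		user = u;
		user->refs++;
		ret = account_mem(user, nr_pages);
		if (ret) {
			user->refs--;
			return ret;
		}
	}

	ctx = fail_alloc ? NULL : malloc(64);
	if (!ctx) {
		/* Undo the accounting and drop the reference taken above. */
		if (user) {
			unaccount_mem(user, nr_pages);
			user->refs--;
		}
		return -ENOMEM;
	}

	*ctx_out = ctx;
	return 0;
}
```

Without the "if (user)" block in the failure branch, the model keeps
the pages charged and the reference held, which is the leak described
above.
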
>>>> But really, could someone please explain to me what is wrong with
>>>> allocating all urings in mmap() without touching RLIMIT_MEMLOCK at
>>>> all?  Then all memory will be accounted to the caller app, and if
>>>> the app is greedy it will be killed by the OOM killer.  What am I
>>>> missing?
>>> 
>>> I don't really see what that'd change, whether we do it off the
>>> ->mmap() or when we set up the io_uring instance with
>>> io_uring_setup(2). We need this memory to be pinned, we can't fault
>>> on it.
>> 
>> Hm, I thought that for pinning there is a separate counter ->pinned_vm
>> (introduced by bc3e53f682d9 ("mm: distinguish between mlocked and
>> pinned pages")), which does not seem to be wired up with anything; it
>> is just a counter, used by a couple of drivers.
> 
> io_uring doesn't inc/dec either of those, but it probably should. As it
> appears rather unused, probably not a big deal.
> 
>> Hmmm.. Frankly, now I am lost. You map these pages through
>> remap_pfn_range(), so virtual user mapping won't fault, right?  And
>> these pages you allocate with GFP_KERNEL, so they are already pinned.
> 
> Right, they will not fault. My point is that it sounded like you want
> the application to allocate this memory in userspace, and then have the
> kernel map it. I don't want to do that; it brings its own host of
> issues (we used to do that). The mmap(2) of kernel memory is
> much cleaner.

No, no.  I've explained below.

> 
>> So now I do not understand why this accounting is needed at all :)
>> The only reason I had in mind is some kind of accounting, to filter
>> out greedy and nasty apps.  If this is not the case, then I am lost.
>> Could you please explain?
> 
> We need some kind of limit, to prevent a user from creating millions of
> io_uring instances and pinning down everything. The old aio code
> realized this after the fact, and added some silly sysctls to control
> this. I
> want to avoid the same mess, and hence it makes more sense to tie into
> some kind of limiting we already have, like RLIMIT_MEMLOCK. Since we're
> using that rlimit, accounting the memory as locked is the right way to
> go.

Yes, that is what I thought from the very beginning: RLIMIT_MEMLOCK is
used to limit the allocation somehow.  Thanks for clarifying that.
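
For reference, the limit in question can be inspected from userspace
with getrlimit(2).  The sketch below is only an illustration of the
kind of check the accounting amounts to; fits_memlock() and its
already_locked parameter are hypothetical names, and the kernel does
its own bookkeeping rather than calling anything like this:

```c
#include <sys/resource.h>

/*
 * Return 1 if locking `bytes` more would fit under the soft
 * RLIMIT_MEMLOCK, 0 if it would not, and -1 if the limit cannot be
 * read.  `already_locked` is whatever the caller has accounted so far
 * (a hypothetical parameter, for illustration only).
 */
static int fits_memlock(unsigned long long already_locked,
			unsigned long long bytes)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
		return -1;
	if (rl.rlim_cur == RLIM_INFINITY)
		return 1;
	return already_locked + bytes <= (unsigned long long)rl.rlim_cur;
}
```

With a low soft limit, even a modest ring size makes this check fail
for an unprivileged user.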

But returning to mmap() again: why not do the same allocation of pages
with GFP_KERNEL and remap_pfn_range() (exactly as you do now), but
inside the ->mmap callback?  (That is, simply postpone the allocation
to the mmap(2) step.)  Then the allocated memory will be "atomically"
accounted to the user vma, and a greedy app will be safely killed by
the OOM killer even without the RLIMIT_MEMLOCK limit (which is a pain
if it is low, right?).

So basically you do not have this unsafe gap, where memory is allocated
in io_uring_setup(2) and only accounted to the vma some time later
inside mmap(2).  Instead, allocation and mapping both happen directly
inside the mmap(2) callback, so no rlimit is needed.
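
To illustrate only the ordering I mean, here is a small userspace model
(not kernel code).  struct ring_model, ring_setup(), ring_map(),
ring_unmap() and MODEL_PAGE_SIZE are hypothetical names, and calloc()
merely stands in for the GFP_KERNEL allocation plus remap_pfn_range():

```c
#include <stddef.h>
#include <stdlib.h>

#define MODEL_PAGE_SIZE 4096u	/* assumed page size for the model */

/* Hypothetical model of a ring whose memory is allocated lazily. */
struct ring_model {
	size_t nr_pages;	/* size decided at setup time */
	void *pages;		/* NULL until the ring is mapped */
};

/* Setup step: record the size only, allocate nothing. */
static void ring_setup(struct ring_model *r, size_t nr_pages)
{
	r->nr_pages = nr_pages;
	r->pages = NULL;
}

/* Map step: allocate on first use, so allocation and mapping coincide. */
static void *ring_map(struct ring_model *r)
{
	if (!r->pages)
		r->pages = calloc(r->nr_pages, MODEL_PAGE_SIZE);
	return r->pages;
}

/* Unmap step: the memory goes away together with the mapping. */
static void ring_unmap(struct ring_model *r)
{
	free(r->pages);
	r->pages = NULL;
}
```

In this model there is no window in which the ring memory exists
without belonging to a mapping.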

So this is an attempt to solve the problem of a low RLIMIT_MEMLOCK
limit, which you recently discussed with Jeff Moyer in another thread.

--
Roman

Thread overview: 24+ messages
2019-01-18 16:12 [PATCHSET v6] io_uring IO interface Jens Axboe
2019-01-18 16:12 ` [PATCH 01/17] fs: add an iopoll method to struct file_operations Jens Axboe
2019-01-18 16:12 ` [PATCH 02/17] block: wire up block device iopoll method Jens Axboe
2019-01-18 16:12 ` [PATCH 03/17] block: add bio_set_polled() helper Jens Axboe
2019-01-18 16:12 ` [PATCH 04/17] iomap: wire up the iopoll method Jens Axboe
2019-01-18 16:12 ` [PATCH 05/17] Add io_uring IO interface Jens Axboe
2019-01-21  9:13   ` Roman Penyaev
2019-01-21 15:30     ` Jens Axboe
2019-01-21 15:58       ` Roman Penyaev
2019-01-21 16:23         ` Jens Axboe
2019-01-21 16:49           ` Roman Penyaev [this message]
2019-01-22 16:11             ` Jens Axboe
2019-01-18 16:12 ` [PATCH 06/17] io_uring: add fsync support Jens Axboe
2019-01-18 16:12 ` [PATCH 07/17] io_uring: support for IO polling Jens Axboe
2019-01-18 16:12 ` [PATCH 08/17] fs: add fget_many() and fput_many() Jens Axboe
2019-01-18 16:12 ` [PATCH 09/17] io_uring: use fget/fput_many() for file references Jens Axboe
2019-01-18 16:12 ` [PATCH 10/17] io_uring: batch io_kiocb allocation Jens Axboe
2019-01-18 16:12 ` [PATCH 11/17] block: implement bio helper to add iter bvec pages to bio Jens Axboe
2019-01-18 16:12 ` [PATCH 12/17] io_uring: add support for pre-mapped user IO buffers Jens Axboe
2019-01-18 16:12 ` [PATCH 13/17] io_uring: add file set registration Jens Axboe
2019-01-18 16:12 ` [PATCH 14/17] io_uring: add submission polling Jens Axboe
2019-01-18 16:12 ` [PATCH 15/17] io_uring: add io_kiocb ref count Jens Axboe
2019-01-18 16:12 ` [PATCH 16/17] io_uring: add support for IORING_OP_POLL Jens Axboe
2019-01-18 16:12 ` [PATCH 17/17] io_uring: add io_uring_event cache hit information Jens Axboe
