All of lore.kernel.org
 help / color / mirror / Atom feed
* io_uring_prep_openat_direct() and link/drain
@ 2022-03-29 13:20 Miklos Szeredi
  2022-03-29 16:08 ` Jens Axboe
  0 siblings, 1 reply; 32+ messages in thread
From: Miklos Szeredi @ 2022-03-29 13:20 UTC (permalink / raw)
  To: io-uring

[-- Attachment #1: Type: text/plain, Size: 539 bytes --]

Hi,

I'm trying to read multiple files with io_uring and getting stuck,
because the link and drain flags don't seem to do what they are
documented to do.

Kernel is v5.17 and liburing is compiled from the git tree at
7a3a27b6a384 ("add tests for nonblocking accept sockets").

Without those flags the attached example works some of the time, but
that's probably accidental since ordering is not ensured.

Adding the drain or link flags make it even worse (fail in casese that
the unordered one didn't).

What am I missing?

Thanks,
Miklos

[-- Attachment #2: readfiles.c --]
[-- Type: text/x-csrc, Size: 2277 bytes --]

#include <stdio.h>
#include <fcntl.h>
#include <string.h>
#include <stdlib.h>
#include <err.h>
#include "liburing.h"

#define CHECK_NEGERR(_expr) \
	({ typeof(_expr) _ret = (_expr); if (_ret < 0) { errno = -_ret; err(1, #_expr); } _ret; })
#define CHECK_NULL(_expr) \
	({ typeof(_expr) _ret = (_expr); if (_ret == NULL) { errx(1, #_expr " returned NULL"); } _ret; })


int main(int argc, char *argv[])
{
	struct io_uring ring;
	int ret, o, i, j, x, num, slot;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	char *s, **bufs;
	int *fds;
	const size_t bufsize = 131072;
	const int ops_per_file = 2;

	if (argc < 2)
		errx(1, "usage: %s file [...]", argv[0]);

	num = argc - 1;
	bufs = CHECK_NULL(calloc(num, sizeof(bufs[0])));
	fds = CHECK_NULL(calloc(num, sizeof(fds[0])));
	for (i = 0; i < num; i++) {
		bufs[i] = CHECK_NULL(malloc(bufsize));
		fds[i] = -1;
	}

	ret = CHECK_NEGERR(io_uring_queue_init(num * ops_per_file, &ring, 0));
	ret = CHECK_NEGERR(io_uring_register_files(&ring, fds, num));

	for (i = 0; i < num; i++) {
		slot = i;

		sqe = CHECK_NULL(io_uring_get_sqe(&ring));
		sqe->user_data = i * ops_per_file;
		io_uring_prep_openat_direct(sqe, AT_FDCWD, argv[i + 1],
					    O_RDONLY, 0, slot);
//		sqe->flags |= IOSQE_IO_DRAIN;
//		sqe->flags |= IOSQE_IO_LINK;

		sqe = CHECK_NULL(io_uring_get_sqe(&ring));
		sqe->user_data = i * ops_per_file + 1;
		io_uring_prep_read(sqe, slot, bufs[i], bufsize, 0);
		sqe->flags |= IOSQE_FIXED_FILE;
//		sqe->flags |= IOSQE_IO_DRAIN;
//		sqe->flags |= IOSQE_IO_LINK;
	}

	ret = CHECK_NEGERR(io_uring_submit(&ring));
	if (ret != num * ops_per_file)
		warnx("io_uring_submit submitted less: %d\n", ret);

	for (j = ret; j; j--) {
		CHECK_NEGERR(io_uring_wait_cqe(&ring, &cqe));

		x = cqe->user_data % ops_per_file;
		i = cqe->user_data / ops_per_file;
		printf("%i/%i [%s] = ", i, x, argv[i + 1]);

		ret = cqe->res;
		if (ret < 0) {
			printf("ERROR: %s (%i)\n", strerror(-ret), ret);
		} else if (x == 1) {
			s = bufs[i];

			for (o = 0; o < ret; o += strlen(s + o) + 1)
				printf("\"%.*s\" ", ret - o, s + o);

			printf("(len=%i)\n", ret);
		} else if (x == 0) {
			printf("SUCCESS open\n");
		} else {
			printf("SUCCESS ???\n");
		}
		io_uring_cqe_seen(&ring, cqe);
	}

	io_uring_queue_exit(&ring);
	return 0;
}

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: io_uring_prep_openat_direct() and link/drain
  2022-03-29 13:20 io_uring_prep_openat_direct() and link/drain Miklos Szeredi
@ 2022-03-29 16:08 ` Jens Axboe
  2022-03-29 17:04   ` Jens Axboe
  0 siblings, 1 reply; 32+ messages in thread
From: Jens Axboe @ 2022-03-29 16:08 UTC (permalink / raw)
  To: Miklos Szeredi, io-uring

On 3/29/22 7:20 AM, Miklos Szeredi wrote:
> Hi,
> 
> I'm trying to read multiple files with io_uring and getting stuck,
> because the link and drain flags don't seem to do what they are
> documented to do.
> 
> Kernel is v5.17 and liburing is compiled from the git tree at
> 7a3a27b6a384 ("add tests for nonblocking accept sockets").
> 
> Without those flags the attached example works some of the time, but
> that's probably accidental since ordering is not ensured.
> 
> Adding the drain or link flags make it even worse (fail in casese that
> the unordered one didn't).
> 
> What am I missing?

I don't think you're missing anything, it looks like a bug. What you
want here is:

prep_open_direct(sqe);
sqe->flags |= IOSQE_IO_LINK;
...
prep_read(sqe);

submit();

You don't want link on the read, it just depends on that previous open.
And you don't need drain.

But there's an issue with file assignment, it's done with the read is
prepped, not before it is run. Hence it will fail with EBADF currently,
which is the issue at hand.

Let me write up a fix for this, would be great if you could test.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: io_uring_prep_openat_direct() and link/drain
  2022-03-29 16:08 ` Jens Axboe
@ 2022-03-29 17:04   ` Jens Axboe
  2022-03-29 18:21     ` Miklos Szeredi
  0 siblings, 1 reply; 32+ messages in thread
From: Jens Axboe @ 2022-03-29 17:04 UTC (permalink / raw)
  To: Miklos Szeredi, io-uring

On 3/29/22 10:08 AM, Jens Axboe wrote:
> On 3/29/22 7:20 AM, Miklos Szeredi wrote:
>> Hi,
>>
>> I'm trying to read multiple files with io_uring and getting stuck,
>> because the link and drain flags don't seem to do what they are
>> documented to do.
>>
>> Kernel is v5.17 and liburing is compiled from the git tree at
>> 7a3a27b6a384 ("add tests for nonblocking accept sockets").
>>
>> Without those flags the attached example works some of the time, but
>> that's probably accidental since ordering is not ensured.
>>
>> Adding the drain or link flags make it even worse (fail in casese that
>> the unordered one didn't).
>>
>> What am I missing?
> 
> I don't think you're missing anything, it looks like a bug. What you
> want here is:
> 
> prep_open_direct(sqe);
> sqe->flags |= IOSQE_IO_LINK;
> ...
> prep_read(sqe);
> 
> submit();
> 
> You don't want link on the read, it just depends on that previous open.
> And you don't need drain.
> 
> But there's an issue with file assignment, it's done with the read is
> prepped, not before it is run. Hence it will fail with EBADF currently,
> which is the issue at hand.
> 
> Let me write up a fix for this, would be great if you could test.

Can you try and pull:

git://git.kernel.dk/linux-block for-5.18/io_uring

into v5.17 and see if that works for you? It will merge cleanly, no
rejects.

Thanks!

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: io_uring_prep_openat_direct() and link/drain
  2022-03-29 17:04   ` Jens Axboe
@ 2022-03-29 18:21     ` Miklos Szeredi
  2022-03-29 18:26       ` Jens Axboe
  0 siblings, 1 reply; 32+ messages in thread
From: Miklos Szeredi @ 2022-03-29 18:21 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring

On Tue, 29 Mar 2022 at 19:04, Jens Axboe <axboe@kernel.dk> wrote:
>
> On 3/29/22 10:08 AM, Jens Axboe wrote:
> > On 3/29/22 7:20 AM, Miklos Szeredi wrote:
> >> Hi,
> >>
> >> I'm trying to read multiple files with io_uring and getting stuck,
> >> because the link and drain flags don't seem to do what they are
> >> documented to do.
> >>
> >> Kernel is v5.17 and liburing is compiled from the git tree at
> >> 7a3a27b6a384 ("add tests for nonblocking accept sockets").
> >>
> >> Without those flags the attached example works some of the time, but
> >> that's probably accidental since ordering is not ensured.
> >>
> >> Adding the drain or link flags make it even worse (fail in casese that
> >> the unordered one didn't).
> >>
> >> What am I missing?
> >
> > I don't think you're missing anything, it looks like a bug. What you
> > want here is:
> >
> > prep_open_direct(sqe);
> > sqe->flags |= IOSQE_IO_LINK;
> > ...
> > prep_read(sqe);

So with the below merge this works.   But if instead I do

prep_open_direct(sqe);
 ...
prep_read(sqe);
sqe->flags |= IOSQE_IO_DRAIN;

than it doesn't.  Shouldn't drain have a stronger ordering guarantee than link?

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: io_uring_prep_openat_direct() and link/drain
  2022-03-29 18:21     ` Miklos Szeredi
@ 2022-03-29 18:26       ` Jens Axboe
  2022-03-29 18:31         ` Miklos Szeredi
  0 siblings, 1 reply; 32+ messages in thread
From: Jens Axboe @ 2022-03-29 18:26 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: io-uring

On 3/29/22 12:21 PM, Miklos Szeredi wrote:
> On Tue, 29 Mar 2022 at 19:04, Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 3/29/22 10:08 AM, Jens Axboe wrote:
>>> On 3/29/22 7:20 AM, Miklos Szeredi wrote:
>>>> Hi,
>>>>
>>>> I'm trying to read multiple files with io_uring and getting stuck,
>>>> because the link and drain flags don't seem to do what they are
>>>> documented to do.
>>>>
>>>> Kernel is v5.17 and liburing is compiled from the git tree at
>>>> 7a3a27b6a384 ("add tests for nonblocking accept sockets").
>>>>
>>>> Without those flags the attached example works some of the time, but
>>>> that's probably accidental since ordering is not ensured.
>>>>
>>>> Adding the drain or link flags make it even worse (fail in casese that
>>>> the unordered one didn't).
>>>>
>>>> What am I missing?
>>>
>>> I don't think you're missing anything, it looks like a bug. What you
>>> want here is:
>>>
>>> prep_open_direct(sqe);
>>> sqe->flags |= IOSQE_IO_LINK;
>>> ...
>>> prep_read(sqe);
> 
> So with the below merge this works.   But if instead I do
> 
> prep_open_direct(sqe);
>  ...
> prep_read(sqe);
> sqe->flags |= IOSQE_IO_DRAIN;
> 
> than it doesn't.  Shouldn't drain have a stronger ordering guarantee than link?

I didn't test that, but I bet it's running into the same kind of issue
wrt prep. Are you getting -EBADF? The drain will indeed ensure that
_execution_ doesn't start until the previous requests have completed,
but it's still prepared before.

For your use case, IO_LINK is what you want and that must work.

I'll check the drain case just in case, it may in fact work if you just
edit the code base you're running now and remove these two lines from
io_init_req():

if (unlikely(!req->file)) {
-        if (!ctx->submit_state.link.head)
-                return -EBADF;
        req->result = fd;
        req->flags |= REQ_F_DEFERRED_FILE;
}

to not make it dependent on link.head. Probably not a bad idea in
general, as the rest of the handlers have been audited for req->file
usage in prep.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: io_uring_prep_openat_direct() and link/drain
  2022-03-29 18:26       ` Jens Axboe
@ 2022-03-29 18:31         ` Miklos Szeredi
  2022-03-29 18:40           ` Jens Axboe
  0 siblings, 1 reply; 32+ messages in thread
From: Miklos Szeredi @ 2022-03-29 18:31 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring

On Tue, 29 Mar 2022 at 20:26, Jens Axboe <axboe@kernel.dk> wrote:
>
> On 3/29/22 12:21 PM, Miklos Szeredi wrote:
> > On Tue, 29 Mar 2022 at 19:04, Jens Axboe <axboe@kernel.dk> wrote:
> >>
> >> On 3/29/22 10:08 AM, Jens Axboe wrote:
> >>> On 3/29/22 7:20 AM, Miklos Szeredi wrote:
> >>>> Hi,
> >>>>
> >>>> I'm trying to read multiple files with io_uring and getting stuck,
> >>>> because the link and drain flags don't seem to do what they are
> >>>> documented to do.
> >>>>
> >>>> Kernel is v5.17 and liburing is compiled from the git tree at
> >>>> 7a3a27b6a384 ("add tests for nonblocking accept sockets").
> >>>>
> >>>> Without those flags the attached example works some of the time, but
> >>>> that's probably accidental since ordering is not ensured.
> >>>>
> >>>> Adding the drain or link flags make it even worse (fail in casese that
> >>>> the unordered one didn't).
> >>>>
> >>>> What am I missing?
> >>>
> >>> I don't think you're missing anything, it looks like a bug. What you
> >>> want here is:
> >>>
> >>> prep_open_direct(sqe);
> >>> sqe->flags |= IOSQE_IO_LINK;
> >>> ...
> >>> prep_read(sqe);
> >
> > So with the below merge this works.   But if instead I do
> >
> > prep_open_direct(sqe);
> >  ...
> > prep_read(sqe);
> > sqe->flags |= IOSQE_IO_DRAIN;
> >
> > than it doesn't.  Shouldn't drain have a stronger ordering guarantee than link?
>
> I didn't test that, but I bet it's running into the same kind of issue
> wrt prep. Are you getting -EBADF? The drain will indeed ensure that
> _execution_ doesn't start until the previous requests have completed,
> but it's still prepared before.
>
> For your use case, IO_LINK is what you want and that must work.
>
> I'll check the drain case just in case, it may in fact work if you just
> edit the code base you're running now and remove these two lines from
> io_init_req():
>
> if (unlikely(!req->file)) {
> -        if (!ctx->submit_state.link.head)
> -                return -EBADF;
>         req->result = fd;
>         req->flags |= REQ_F_DEFERRED_FILE;
> }
>
> to not make it dependent on link.head. Probably not a bad idea in
> general, as the rest of the handlers have been audited for req->file
> usage in prep.

Nope, that results in the following Oops:

BUG: kernel NULL pointer dereference, address: 0000000000000044
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] SMP PTI
CPU: 3 PID: 1126 Comm: readfiles Not tainted
5.17.0-00065-g3287b182c9c3-dirty #623
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.15.0-29-g6a62e0cb0dfe-prebuilt.qemu.org 04/01/2014
RIP: 0010:io_rw_init_file+0x15/0x170
Code: 00 6d 22 82 0f 95 c0 83 c0 02 c3 66 2e 0f 1f 84 00 00 00 00 00
0f 1f 44 00 00 41 55 41 54 55 53 4c 8b 2f 4c 8b 67 58 8b 6f 20 <41> 23
75 44 0f 84 28 01 00 00 48 89 fb f6 47 44 01 0f 84 08 01 00
RSP: 0018:ffffc9000108fba8 EFLAGS: 00010207
RAX: 0000000000000001 RBX: ffff888103ddd688 RCX: ffffc9000108fc18
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff888103ddd600
RBP: 0000000000000000 R08: ffffc9000108fbd8 R09: 00007ffffffff000
R10: 0000000000020000 R11: 000056012e2ce2e0 R12: ffff88810276b800
R13: 0000000000000000 R14: 0000000000000000 R15: ffff888103ddd600
FS:  00007f9058d72580(0000) GS:ffff888237d80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000044 CR3: 0000000100966004 CR4: 0000000000370ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 io_read+0x65/0x4d0
 ? select_task_rq_fair+0x602/0xf20
 ? newidle_balance.constprop.0+0x2ff/0x3a0
 io_issue_sqe+0xd86/0x21a0
 ? __schedule+0x228/0x610
 ? timerqueue_del+0x2a/0x40
 io_req_task_submit+0x26/0x100
 tctx_task_work+0x172/0x4b0
 task_work_run+0x5c/0x90
 io_cqring_wait+0x48d/0x790
 ? io_eventfd_put+0x20/0x20
 __do_sys_io_uring_enter+0x28d/0x5e0
 ? __cond_resched+0x16/0x40
 ? task_work_run+0x61/0x90
 do_syscall_64+0x3b/0x90
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x56012d87159c
Code: c2 41 8b 54 24 04 8b bd cc 00 00 00 41 83 ca 10 f6 85 d0 00 00
00 01 4d 8b 44 24 10 44 0f 44 d0 45 8b 4c 24 0c 44 89 f0 0f 05 <41> 89
c3 85 c0 0f 88 4a ff ff ff 41 29 04 24 bf 01 00 00 00 48 85
RSP: 002b:00007ffc8db5c550 EFLAGS: 00000246 ORIG_RAX: 00000000000001aa
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 000056012d87159c
RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000003
RBP: 00007ffc8db5c620 R08: 0000000000000000 R09: 0000000000000008
R10: 0000000000000001 R11: 0000000000000246 R12: 00007ffc8db5c580
R13: 00007ffc8db5c618 R14: 00000000000001aa R15: 0000000000000000
 </TASK>
Modules linked in:
CR2: 0000000000000044
---[ end trace 0000000000000000 ]---

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: io_uring_prep_openat_direct() and link/drain
  2022-03-29 18:31         ` Miklos Szeredi
@ 2022-03-29 18:40           ` Jens Axboe
  2022-03-29 19:30             ` Miklos Szeredi
  0 siblings, 1 reply; 32+ messages in thread
From: Jens Axboe @ 2022-03-29 18:40 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: io-uring

On 3/29/22 12:31 PM, Miklos Szeredi wrote:
> On Tue, 29 Mar 2022 at 20:26, Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 3/29/22 12:21 PM, Miklos Szeredi wrote:
>>> On Tue, 29 Mar 2022 at 19:04, Jens Axboe <axboe@kernel.dk> wrote:
>>>>
>>>> On 3/29/22 10:08 AM, Jens Axboe wrote:
>>>>> On 3/29/22 7:20 AM, Miklos Szeredi wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I'm trying to read multiple files with io_uring and getting stuck,
>>>>>> because the link and drain flags don't seem to do what they are
>>>>>> documented to do.
>>>>>>
>>>>>> Kernel is v5.17 and liburing is compiled from the git tree at
>>>>>> 7a3a27b6a384 ("add tests for nonblocking accept sockets").
>>>>>>
>>>>>> Without those flags the attached example works some of the time, but
>>>>>> that's probably accidental since ordering is not ensured.
>>>>>>
>>>>>> Adding the drain or link flags make it even worse (fail in casese that
>>>>>> the unordered one didn't).
>>>>>>
>>>>>> What am I missing?
>>>>>
>>>>> I don't think you're missing anything, it looks like a bug. What you
>>>>> want here is:
>>>>>
>>>>> prep_open_direct(sqe);
>>>>> sqe->flags |= IOSQE_IO_LINK;
>>>>> ...
>>>>> prep_read(sqe);
>>>
>>> So with the below merge this works.   But if instead I do
>>>
>>> prep_open_direct(sqe);
>>>  ...
>>> prep_read(sqe);
>>> sqe->flags |= IOSQE_IO_DRAIN;
>>>
>>> than it doesn't.  Shouldn't drain have a stronger ordering guarantee than link?
>>
>> I didn't test that, but I bet it's running into the same kind of issue
>> wrt prep. Are you getting -EBADF? The drain will indeed ensure that
>> _execution_ doesn't start until the previous requests have completed,
>> but it's still prepared before.
>>
>> For your use case, IO_LINK is what you want and that must work.
>>
>> I'll check the drain case just in case, it may in fact work if you just
>> edit the code base you're running now and remove these two lines from
>> io_init_req():
>>
>> if (unlikely(!req->file)) {
>> -        if (!ctx->submit_state.link.head)
>> -                return -EBADF;
>>         req->result = fd;
>>         req->flags |= REQ_F_DEFERRED_FILE;
>> }
>>
>> to not make it dependent on link.head. Probably not a bad idea in
>> general, as the rest of the handlers have been audited for req->file
>> usage in prep.
> 
> Nope, that results in the following Oops:
> 
> BUG: kernel NULL pointer dereference, address: 0000000000000044
> #PF: supervisor read access in kernel mode
> #PF: error_code(0x0000) - not-present page
> PGD 0 P4D 0
> Oops: 0000 [#1] SMP PTI
> CPU: 3 PID: 1126 Comm: readfiles Not tainted
> 5.17.0-00065-g3287b182c9c3-dirty #623
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> rel-1.15.0-29-g6a62e0cb0dfe-prebuilt.qemu.org 04/01/2014
> RIP: 0010:io_rw_init_file+0x15/0x170
> Code: 00 6d 22 82 0f 95 c0 83 c0 02 c3 66 2e 0f 1f 84 00 00 00 00 00
> 0f 1f 44 00 00 41 55 41 54 55 53 4c 8b 2f 4c 8b 67 58 8b 6f 20 <41> 23
> 75 44 0f 84 28 01 00 00 48 89 fb f6 47 44 01 0f 84 08 01 00
> RSP: 0018:ffffc9000108fba8 EFLAGS: 00010207
> RAX: 0000000000000001 RBX: ffff888103ddd688 RCX: ffffc9000108fc18
> RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff888103ddd600
> RBP: 0000000000000000 R08: ffffc9000108fbd8 R09: 00007ffffffff000
> R10: 0000000000020000 R11: 000056012e2ce2e0 R12: ffff88810276b800
> R13: 0000000000000000 R14: 0000000000000000 R15: ffff888103ddd600
> FS:  00007f9058d72580(0000) GS:ffff888237d80000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000044 CR3: 0000000100966004 CR4: 0000000000370ee0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
>  <TASK>
>  io_read+0x65/0x4d0
>  ? select_task_rq_fair+0x602/0xf20
>  ? newidle_balance.constprop.0+0x2ff/0x3a0
>  io_issue_sqe+0xd86/0x21a0
>  ? __schedule+0x228/0x610
>  ? timerqueue_del+0x2a/0x40
>  io_req_task_submit+0x26/0x100
>  tctx_task_work+0x172/0x4b0
>  task_work_run+0x5c/0x90
>  io_cqring_wait+0x48d/0x790
>  ? io_eventfd_put+0x20/0x20
>  __do_sys_io_uring_enter+0x28d/0x5e0
>  ? __cond_resched+0x16/0x40
>  ? task_work_run+0x61/0x90
>  do_syscall_64+0x3b/0x90
>  entry_SYSCALL_64_after_hwframe+0x44/0xae

Ah yes that makes sense, since I only worried the prep file part up for
links. Forgot about that... Let me test, I'll see if it's feasible to do
for drain and send you an incremental.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: io_uring_prep_openat_direct() and link/drain
  2022-03-29 18:40           ` Jens Axboe
@ 2022-03-29 19:30             ` Miklos Szeredi
  2022-03-29 20:03               ` Jens Axboe
  0 siblings, 1 reply; 32+ messages in thread
From: Miklos Szeredi @ 2022-03-29 19:30 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring

On Tue, 29 Mar 2022 at 20:40, Jens Axboe <axboe@kernel.dk> wrote:
>
> On 3/29/22 12:31 PM, Miklos Szeredi wrote:
> > On Tue, 29 Mar 2022 at 20:26, Jens Axboe <axboe@kernel.dk> wrote:
> >>
> >> On 3/29/22 12:21 PM, Miklos Szeredi wrote:
> >>> On Tue, 29 Mar 2022 at 19:04, Jens Axboe <axboe@kernel.dk> wrote:
> >>>>
> >>>> On 3/29/22 10:08 AM, Jens Axboe wrote:
> >>>>> On 3/29/22 7:20 AM, Miklos Szeredi wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> I'm trying to read multiple files with io_uring and getting stuck,
> >>>>>> because the link and drain flags don't seem to do what they are
> >>>>>> documented to do.
> >>>>>>
> >>>>>> Kernel is v5.17 and liburing is compiled from the git tree at
> >>>>>> 7a3a27b6a384 ("add tests for nonblocking accept sockets").
> >>>>>>
> >>>>>> Without those flags the attached example works some of the time, but
> >>>>>> that's probably accidental since ordering is not ensured.
> >>>>>>
> >>>>>> Adding the drain or link flags make it even worse (fail in casese that
> >>>>>> the unordered one didn't).
> >>>>>>
> >>>>>> What am I missing?
> >>>>>
> >>>>> I don't think you're missing anything, it looks like a bug. What you
> >>>>> want here is:
> >>>>>
> >>>>> prep_open_direct(sqe);
> >>>>> sqe->flags |= IOSQE_IO_LINK;
> >>>>> ...
> >>>>> prep_read(sqe);
> >>>
> >>> So with the below merge this works.   But if instead I do
> >>>
> >>> prep_open_direct(sqe);
> >>>  ...
> >>> prep_read(sqe);
> >>> sqe->flags |= IOSQE_IO_DRAIN;

And this doesn't work either:

prep_open_direct(sqe);
sqe->flags |= IOSQE_IO_LINK;
...
prep_read(sqe);
sqe->flags |= IOSQE_IO_LINK;
...
prep_open_direct(sqe);
sqe->flags |= IOSQE_IO_LINK;
...
prep_read(sqe);

Yeah, the link is not needed for the read (unless the fixed file slot
is to be reused), but link/drain should work as general ordering
instructions, not just in special cases.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: io_uring_prep_openat_direct() and link/drain
  2022-03-29 19:30             ` Miklos Szeredi
@ 2022-03-29 20:03               ` Jens Axboe
  2022-03-30  8:18                 ` Miklos Szeredi
  0 siblings, 1 reply; 32+ messages in thread
From: Jens Axboe @ 2022-03-29 20:03 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: io-uring

On 3/29/22 1:30 PM, Miklos Szeredi wrote:
> On Tue, 29 Mar 2022 at 20:40, Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 3/29/22 12:31 PM, Miklos Szeredi wrote:
>>> On Tue, 29 Mar 2022 at 20:26, Jens Axboe <axboe@kernel.dk> wrote:
>>>>
>>>> On 3/29/22 12:21 PM, Miklos Szeredi wrote:
>>>>> On Tue, 29 Mar 2022 at 19:04, Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>
>>>>>> On 3/29/22 10:08 AM, Jens Axboe wrote:
>>>>>>> On 3/29/22 7:20 AM, Miklos Szeredi wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I'm trying to read multiple files with io_uring and getting stuck,
>>>>>>>> because the link and drain flags don't seem to do what they are
>>>>>>>> documented to do.
>>>>>>>>
>>>>>>>> Kernel is v5.17 and liburing is compiled from the git tree at
>>>>>>>> 7a3a27b6a384 ("add tests for nonblocking accept sockets").
>>>>>>>>
>>>>>>>> Without those flags the attached example works some of the time, but
>>>>>>>> that's probably accidental since ordering is not ensured.
>>>>>>>>
>>>>>>>> Adding the drain or link flags make it even worse (fail in casese that
>>>>>>>> the unordered one didn't).
>>>>>>>>
>>>>>>>> What am I missing?
>>>>>>>
>>>>>>> I don't think you're missing anything, it looks like a bug. What you
>>>>>>> want here is:
>>>>>>>
>>>>>>> prep_open_direct(sqe);
>>>>>>> sqe->flags |= IOSQE_IO_LINK;
>>>>>>> ...
>>>>>>> prep_read(sqe);
>>>>>
>>>>> So with the below merge this works.   But if instead I do
>>>>>
>>>>> prep_open_direct(sqe);
>>>>>  ...
>>>>> prep_read(sqe);
>>>>> sqe->flags |= IOSQE_IO_DRAIN;
> 
> And this doesn't work either:
> 
> prep_open_direct(sqe);
> sqe->flags |= IOSQE_IO_LINK;
> ...
> prep_read(sqe);
> sqe->flags |= IOSQE_IO_LINK;
> ...
> prep_open_direct(sqe);
> sqe->flags |= IOSQE_IO_LINK;
> ...
> prep_read(sqe);
> 
> Yeah, the link is not needed for the read (unless the fixed file slot
> is to be reused), but link/drain should work as general ordering
> instructions, not just in special cases.

Not disagreeing with that, it had nothing to do with any assumptions
like that, it was just a bug in the change I made in terms of when
things got punted async.

Can you try and repull the branch? I rebased the top two, so reset to
v5.17 first and then pull it.

BTW, I would not recommend using DRAIN, it's very expensive compared to
just a link. Particularly, using just a single link between the
open+read makes sense, and should be efficient. Don't link between all
of them, that's creating a more expensive dependency that doesn't make
any sense for your use case either. Should they both work? Of course,
and I think they will in the current branch. Merely stating that they
make no sense other than as an exercise-the-logic test case.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: io_uring_prep_openat_direct() and link/drain
  2022-03-29 20:03               ` Jens Axboe
@ 2022-03-30  8:18                 ` Miklos Szeredi
  2022-03-30 12:35                   ` Jens Axboe
  0 siblings, 1 reply; 32+ messages in thread
From: Miklos Szeredi @ 2022-03-30  8:18 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring

On Tue, 29 Mar 2022 at 22:03, Jens Axboe <axboe@kernel.dk> wrote:

> Can you try and repull the branch? I rebased the top two, so reset to
> v5.17 first and then pull it.

This one works in all cases, except if there's a link flag on the
read.  In that case the read itself will succeed but following
requests on the link chain will fail with -ECANCELED.

> BTW, I would not recommend using DRAIN, it's very expensive compared to
> just a link. Particularly, using just a single link between the
> open+read makes sense, and should be efficient. Don't link between all
> of them, that's creating a more expensive dependency that doesn't make
> any sense for your use case either. Should they both work? Of course,
> and I think they will in the current branch. Merely stating that they
> make no sense other than as an exercise-the-logic test case.

Understood.   Once all of these cases work, it should be possible to
compare the performance of parallel vs. serial execution.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: io_uring_prep_openat_direct() and link/drain
  2022-03-30  8:18                 ` Miklos Szeredi
@ 2022-03-30 12:35                   ` Jens Axboe
  2022-03-30 12:43                     ` Miklos Szeredi
  0 siblings, 1 reply; 32+ messages in thread
From: Jens Axboe @ 2022-03-30 12:35 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: io-uring

On 3/30/22 2:18 AM, Miklos Szeredi wrote:
> On Tue, 29 Mar 2022 at 22:03, Jens Axboe <axboe@kernel.dk> wrote:
> 
>> Can you try and repull the branch? I rebased the top two, so reset to
>> v5.17 first and then pull it.
> 
> This one works in all cases, except if there's a link flag on the
> read.  In that case the read itself will succeed but following
> requests on the link chain will fail with -ECANCELED.

That sounds like you're asking for more data than what is being read,
in which case the link should correctly break and cancel the rest.
That's expected behavior, a short read is not a successful read in that
regard.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: io_uring_prep_openat_direct() and link/drain
  2022-03-30 12:35                   ` Jens Axboe
@ 2022-03-30 12:43                     ` Miklos Szeredi
  2022-03-30 12:48                       ` Jens Axboe
  0 siblings, 1 reply; 32+ messages in thread
From: Miklos Szeredi @ 2022-03-30 12:43 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring

On Wed, 30 Mar 2022 at 14:35, Jens Axboe <axboe@kernel.dk> wrote:
>
> On 3/30/22 2:18 AM, Miklos Szeredi wrote:
> > On Tue, 29 Mar 2022 at 22:03, Jens Axboe <axboe@kernel.dk> wrote:
> >
> >> Can you try and repull the branch? I rebased the top two, so reset to
> >> v5.17 first and then pull it.
> >
> > This one works in all cases, except if there's a link flag on the
> > read.  In that case the read itself will succeed but following
> > requests on the link chain will fail with -ECANCELED.
>
> That sounds like you're asking for more data than what is being read,
> in which case the link should correctly break and cancel the rest.
> That's expected behavior, a short read is not a successful read in that
> regard.

Sorry, I don't get it.  Does the link flag has other meaning than
"wait until this request completes before starting the next"?

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: io_uring_prep_openat_direct() and link/drain
  2022-03-30 12:43                     ` Miklos Szeredi
@ 2022-03-30 12:48                       ` Jens Axboe
  2022-03-30 12:51                         ` Miklos Szeredi
  0 siblings, 1 reply; 32+ messages in thread
From: Jens Axboe @ 2022-03-30 12:48 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: io-uring

On 3/30/22 6:43 AM, Miklos Szeredi wrote:
> On Wed, 30 Mar 2022 at 14:35, Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 3/30/22 2:18 AM, Miklos Szeredi wrote:
>>> On Tue, 29 Mar 2022 at 22:03, Jens Axboe <axboe@kernel.dk> wrote:
>>>
>>>> Can you try and repull the branch? I rebased the top two, so reset to
>>>> v5.17 first and then pull it.
>>>
>>> This one works in all cases, except if there's a link flag on the
>>> read.  In that case the read itself will succeed but following
>>> requests on the link chain will fail with -ECANCELED.
>>
>> That sounds like you're asking for more data than what is being read,
>> in which case the link should correctly break and cancel the rest.
>> That's expected behavior, a short read is not a successful read in that
>> regard.
> 
> Sorry, I don't get it.  Does the link flag has other meaning than
> "wait until this request completes before starting the next"?

IOSQE_IO_LINK means "wait until this request SUCCESSFULLY completes
before starting the next one". You can't reliably use a link from an eg
read, if you don't know how much data it read. If you don't care about
success, then use IOSQE_IO_HARDLINK. That is semantically like a link,
but it doesn't break on an unexpected result.

If you're using LINK from the read _and_ the read returns less than you
asked for, IOSQE_IO_LINK will break the chain as that is an unexpected
result.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: io_uring_prep_openat_direct() and link/drain
  2022-03-30 12:48                       ` Jens Axboe
@ 2022-03-30 12:51                         ` Miklos Szeredi
  2022-03-30 14:58                           ` Miklos Szeredi
  0 siblings, 1 reply; 32+ messages in thread
From: Miklos Szeredi @ 2022-03-30 12:51 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring

On Wed, 30 Mar 2022 at 14:49, Jens Axboe <axboe@kernel.dk> wrote:

> IOSQE_IO_LINK means "wait until this request SUCCESSFULLY completes
> before starting the next one". You can't reliably use a link from an eg
> read, if you don't know how much data it read. If you don't care about
> success, then use IOSQE_IO_HARDLINK. That is semantically like a link,
> but it doesn't break on an unexpected result.
>
> If you're using LINK from the read _and_ the read returns less than you
> asked for, IOSQE_IO_LINK will break the chain as that is an unexpected
> result.

Ah, okay.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: io_uring_prep_openat_direct() and link/drain
  2022-03-30 12:51                         ` Miklos Szeredi
@ 2022-03-30 14:58                           ` Miklos Szeredi
  2022-03-30 15:05                             ` Jens Axboe
  0 siblings, 1 reply; 32+ messages in thread
From: Miklos Szeredi @ 2022-03-30 14:58 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring

[-- Attachment #1: Type: text/plain, Size: 732 bytes --]

Next issue:  seems like file slot reuse is not working correctly.
Attached program compares reads using io_uring with plain reads of
proc files.

In the below example it is using two slots alternately but the number
of slots does not seem to matter, read is apparently always using a
stale file (the prior one to the most recent open on that slot).  See
how the sizes of the files lag by two lines:

root@kvm:~# ./procreads
procreads: /proc/1/stat: ok (313)
procreads: /proc/2/stat: ok (149)
procreads: /proc/3/stat: read size mismatch 313/150
procreads: /proc/4/stat: read size mismatch 149/154
procreads: /proc/5/stat: read size mismatch 150/161
procreads: /proc/6/stat: read size mismatch 154/171
...

Any ideas?

Thanks,
Miklos

[-- Attachment #2: procreads.c --]
[-- Type: text/x-csrc, Size: 2729 bytes --]

#include <stdio.h>
#include <fcntl.h>
#include <string.h>
#include <stdlib.h>
#include <dirent.h>
#include <unistd.h>
#include <err.h>
#include "liburing.h"

#define CHECK_NEGERR(_expr) \
	({ typeof(_expr) _ret = (_expr); if (_ret < 0) { errno = -_ret; err(1, #_expr); } _ret; })
#define CHECK_NULL(_expr) \
	({ typeof(_expr) _ret = (_expr); if (_ret == NULL) { errx(1, #_expr " returned NULL"); } _ret; })

ssize_t readfile_uring(struct io_uring *ring, int slot,
		       const char *path, char *buf, size_t size)
{
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	int ret, i;
	unsigned int sum = 0;

	for (i = 0; path[i]; i++)
		sum += path[i];

	sqe = io_uring_get_sqe(ring);
	io_uring_prep_openat_direct(sqe, AT_FDCWD, path, O_RDONLY, 0, slot);
	sqe->flags = IOSQE_IO_LINK | IOSQE_CQE_SKIP_SUCCESS;
	sqe->user_data = 0xdead0000 + sum;

	sqe = io_uring_get_sqe(ring);
	io_uring_prep_read(sqe, slot, buf, size, 0);
	sqe->flags = IOSQE_FIXED_FILE;
	sqe->user_data = 0xfeed0000 + sum;

	ret = CHECK_NEGERR(io_uring_submit_and_wait(ring, 1));
	if (ret < 2)
		warnx("short submit count: %i", ret);

	ret = CHECK_NEGERR(io_uring_wait_cqe(ring, &cqe));

	if ((cqe->user_data & 0xffff) != sum)
		warnx("wrong sum: %x (should be %x)",(unsigned int) cqe->user_data, sum);
	if (cqe->res >= 0 && (cqe->user_data & 0xffff0000) != 0xfeed0000)
		warnx("not skipped: %x", (unsigned int) cqe->user_data);

	ret = cqe->res;
	io_uring_cqe_seen(ring, cqe);
	if (ret < 0) {
		errno = -ret;
		warn("failed to open or read %s", path);
	}

	return ret;
}

static ssize_t readfile_plain(const char *path, char *buf, size_t size)
{
	int fd;
	ssize_t ret;

	fd = open(path, O_RDONLY);
	if (fd == -1)
		return -errno;

	ret = read(fd, buf, size);
	if (ret == -1)
		return -errno;

	close(fd);

	return ret;
}

int main(void)
{
	int fds[] = { -1, -1 };
	struct io_uring ring;
	char *name, path[4096], buf1[4096], buf2[4096];
	DIR *dp;
	struct dirent *de;
	ssize_t ret1, ret2;
	int slot = 0;
	unsigned int numslots = sizeof(fds)/sizeof(fds[0]);

	CHECK_NEGERR(io_uring_queue_init(32, &ring, 0));
	CHECK_NEGERR(io_uring_register_files(&ring, fds, numslots));

	dp = CHECK_NULL(opendir("/proc"));
	while ((de = readdir(dp))) {
		name = de->d_name;
		if (name[0] > '0' && name[0] <= '9') {
			sprintf(path, "/proc/%s/stat", name);
			ret1 = readfile_uring(&ring, slot, path, buf1, sizeof(buf1));
			ret2 = readfile_plain(path, buf2, sizeof(buf2));
			if (ret1 != ret2)
				warnx("%s: read size mismatch %zi/%zi",
				      path, ret1, ret2);
			else if (ret1 > 0 && memcmp(buf1, buf2, ret1))
				warnx("%s: data mismatch", path);
			else {
				warnx("%s: ok (%zi)", path, ret1);
			}

			slot = (slot + 1) % numslots;
		}
	}
	closedir(dp);
	return 0;
}

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: io_uring_prep_openat_direct() and link/drain
  2022-03-30 14:58                           ` Miklos Szeredi
@ 2022-03-30 15:05                             ` Jens Axboe
  2022-03-30 15:12                               ` Miklos Szeredi
  0 siblings, 1 reply; 32+ messages in thread
From: Jens Axboe @ 2022-03-30 15:05 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: io-uring

On 3/30/22 8:58 AM, Miklos Szeredi wrote:
> Next issue:  seems like file slot reuse is not working correctly.
> Attached program compares reads using io_uring with plain reads of
> proc files.
> 
> In the below example it is using two slots alternately but the number
> of slots does not seem to matter, read is apparently always using a
> stale file (the prior one to the most recent open on that slot).  See
> how the sizes of the files lag by two lines:
> 
> root@kvm:~# ./procreads
> procreads: /proc/1/stat: ok (313)
> procreads: /proc/2/stat: ok (149)
> procreads: /proc/3/stat: read size mismatch 313/150
> procreads: /proc/4/stat: read size mismatch 149/154
> procreads: /proc/5/stat: read size mismatch 150/161
> procreads: /proc/6/stat: read size mismatch 154/171
> ...
> 
> Any ideas?

Didn't look at your code yet, but with the current tree, this is the
behavior when a fixed file is used:

At prep time, if the slot is valid it is used. If it isn't valid,
assignment is deferred until the request is issued.

Which granted is a bit weird. It means that if you do:

<open fileA into slot 1, slot 1 currently unused><read slot 1>

the read will read from fileA. But for:

<open fileB into slot 1, slot 1 is fileA currently><read slot 1>

since slot 1 is already valid at prep time for the read, the read will
be from fileA again.

Is this what you are seeing? It's definitely a bit confusing, and the
only reason why I didn't change it is because it could potentially break
applications. Don't think there's a high risk of that, however, so may
indeed be worth it to just bite the bullet and the assignment is
consistent (eg always done from the perspective of the previous
dependent request having completed).

Is this what you are seeing?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: io_uring_prep_openat_direct() and link/drain
  2022-03-30 15:05                             ` Jens Axboe
@ 2022-03-30 15:12                               ` Miklos Szeredi
  2022-03-30 15:17                                 ` Jens Axboe
  0 siblings, 1 reply; 32+ messages in thread
From: Miklos Szeredi @ 2022-03-30 15:12 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring

On Wed, 30 Mar 2022 at 17:05, Jens Axboe <axboe@kernel.dk> wrote:
>
> On 3/30/22 8:58 AM, Miklos Szeredi wrote:
> > Next issue:  seems like file slot reuse is not working correctly.
> > Attached program compares reads using io_uring with plain reads of
> > proc files.
> >
> > In the below example it is using two slots alternately but the number
> > of slots does not seem to matter, read is apparently always using a
> > stale file (the prior one to the most recent open on that slot).  See
> > how the sizes of the files lag by two lines:
> >
> > root@kvm:~# ./procreads
> > procreads: /proc/1/stat: ok (313)
> > procreads: /proc/2/stat: ok (149)
> > procreads: /proc/3/stat: read size mismatch 313/150
> > procreads: /proc/4/stat: read size mismatch 149/154
> > procreads: /proc/5/stat: read size mismatch 150/161
> > procreads: /proc/6/stat: read size mismatch 154/171
> > ...
> >
> > Any ideas?
>
> Didn't look at your code yet, but with the current tree, this is the
> behavior when a fixed file is used:
>
> At prep time, if the slot is valid it is used. If it isn't valid,
> assignment is deferred until the request is issued.
>
> Which granted is a bit weird. It means that if you do:
>
> <open fileA into slot 1, slot 1 currently unused><read slot 1>
>
> the read will read from fileA. But for:
>
> <open fileB into slot 1, slot 1 is fileA currently><read slot 1>
>
> since slot 1 is already valid at prep time for the read, the read will
> be from fileA again.
>
> Is this what you are seeing? It's definitely a bit confusing, and the
> only reason why I didn't change it is because it could potentially break
> applications. Don't think there's a high risk of that, however, so may
> indeed be worth it to just bite the bullet and the assignment is
> consistent (eg always done from the perspective of the previous
> dependent request having completed).
>
> Is this what you are seeing?

Right, this explains it.   Then the only workaround would be to wait
for the open to finish before submitting the read, but that would
defeat the whole point of using io_uring for this purpose.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: io_uring_prep_openat_direct() and link/drain
  2022-03-30 15:12                               ` Miklos Szeredi
@ 2022-03-30 15:17                                 ` Jens Axboe
  2022-03-30 15:53                                   ` Jens Axboe
  0 siblings, 1 reply; 32+ messages in thread
From: Jens Axboe @ 2022-03-30 15:17 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: io-uring

On 3/30/22 9:12 AM, Miklos Szeredi wrote:
> On Wed, 30 Mar 2022 at 17:05, Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 3/30/22 8:58 AM, Miklos Szeredi wrote:
>>> Next issue:  seems like file slot reuse is not working correctly.
>>> Attached program compares reads using io_uring with plain reads of
>>> proc files.
>>>
>>> In the below example it is using two slots alternately but the number
>>> of slots does not seem to matter, read is apparently always using a
>>> stale file (the prior one to the most recent open on that slot).  See
>>> how the sizes of the files lag by two lines:
>>>
>>> root@kvm:~# ./procreads
>>> procreads: /proc/1/stat: ok (313)
>>> procreads: /proc/2/stat: ok (149)
>>> procreads: /proc/3/stat: read size mismatch 313/150
>>> procreads: /proc/4/stat: read size mismatch 149/154
>>> procreads: /proc/5/stat: read size mismatch 150/161
>>> procreads: /proc/6/stat: read size mismatch 154/171
>>> ...
>>>
>>> Any ideas?
>>
>> Didn't look at your code yet, but with the current tree, this is the
>> behavior when a fixed file is used:
>>
>> At prep time, if the slot is valid it is used. If it isn't valid,
>> assignment is deferred until the request is issued.
>>
>> Which granted is a bit weird. It means that if you do:
>>
>> <open fileA into slot 1, slot 1 currently unused><read slot 1>
>>
>> the read will read from fileA. But for:
>>
>> <open fileB into slot 1, slot 1 is fileA currently><read slot 1>
>>
>> since slot 1 is already valid at prep time for the read, the read will
>> be from fileA again.
>>
>> Is this what you are seeing? It's definitely a bit confusing, and the
>> only reason why I didn't change it is because it could potentially break
>> applications. Don't think there's a high risk of that, however, so may
>> indeed be worth it to just bite the bullet and the assignment is
>> consistent (eg always done from the perspective of the previous
>> dependent request having completed).
>>
>> Is this what you are seeing?
> 
> Right, this explains it.   Then the only workaround would be to wait
> for the open to finish before submitting the read, but that would
> defeat the whole point of using io_uring for this purpose.

Honestly, I think we should just change it during this round, making it
consistent with the "slot is unused" use case. The old use case is more
more of a "it happened to work" vs the newer consistent behavior of "we
always assign the file when execution starts on the request".

Let me spin a patch, would be great if you could test.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: io_uring_prep_openat_direct() and link/drain
  2022-03-30 15:17                                 ` Jens Axboe
@ 2022-03-30 15:53                                   ` Jens Axboe
  2022-03-30 17:49                                     ` Jens Axboe
  0 siblings, 1 reply; 32+ messages in thread
From: Jens Axboe @ 2022-03-30 15:53 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: io-uring

On 3/30/22 9:17 AM, Jens Axboe wrote:
> On 3/30/22 9:12 AM, Miklos Szeredi wrote:
>> On Wed, 30 Mar 2022 at 17:05, Jens Axboe <axboe@kernel.dk> wrote:
>>>
>>> On 3/30/22 8:58 AM, Miklos Szeredi wrote:
>>>> Next issue:  seems like file slot reuse is not working correctly.
>>>> Attached program compares reads using io_uring with plain reads of
>>>> proc files.
>>>>
>>>> In the below example it is using two slots alternately but the number
>>>> of slots does not seem to matter, read is apparently always using a
>>>> stale file (the prior one to the most recent open on that slot).  See
>>>> how the sizes of the files lag by two lines:
>>>>
>>>> root@kvm:~# ./procreads
>>>> procreads: /proc/1/stat: ok (313)
>>>> procreads: /proc/2/stat: ok (149)
>>>> procreads: /proc/3/stat: read size mismatch 313/150
>>>> procreads: /proc/4/stat: read size mismatch 149/154
>>>> procreads: /proc/5/stat: read size mismatch 150/161
>>>> procreads: /proc/6/stat: read size mismatch 154/171
>>>> ...
>>>>
>>>> Any ideas?
>>>
>>> Didn't look at your code yet, but with the current tree, this is the
>>> behavior when a fixed file is used:
>>>
>>> At prep time, if the slot is valid it is used. If it isn't valid,
>>> assignment is deferred until the request is issued.
>>>
>>> Which granted is a bit weird. It means that if you do:
>>>
>>> <open fileA into slot 1, slot 1 currently unused><read slot 1>
>>>
>>> the read will read from fileA. But for:
>>>
>>> <open fileB into slot 1, slot 1 is fileA currently><read slot 1>
>>>
>>> since slot 1 is already valid at prep time for the read, the read will
>>> be from fileA again.
>>>
>>> Is this what you are seeing? It's definitely a bit confusing, and the
>>> only reason why I didn't change it is because it could potentially break
>>> applications. Don't think there's a high risk of that, however, so may
>>> indeed be worth it to just bite the bullet and the assignment is
>>> consistent (eg always done from the perspective of the previous
>>> dependent request having completed).
>>>
>>> Is this what you are seeing?
>>
>> Right, this explains it.   Then the only workaround would be to wait
>> for the open to finish before submitting the read, but that would
>> defeat the whole point of using io_uring for this purpose.
> 
> Honestly, I think we should just change it during this round, making it
> consistent with the "slot is unused" use case. The old use case is more
> more of a "it happened to work" vs the newer consistent behavior of "we
> always assign the file when execution starts on the request".
> 
> Let me spin a patch, would be great if you could test.

Something like this on top of the current tree should work. Can you
test?

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 4d4ca8e115f6..d36475cefb8c 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -785,7 +785,6 @@ enum {
 	REQ_F_SINGLE_POLL_BIT,
 	REQ_F_DOUBLE_POLL_BIT,
 	REQ_F_PARTIAL_IO_BIT,
-	REQ_F_DEFERRED_FILE_BIT,
 	/* keep async read/write and isreg together and in order */
 	REQ_F_SUPPORT_NOWAIT_BIT,
 	REQ_F_ISREG_BIT,
@@ -850,8 +849,6 @@ enum {
 	REQ_F_DOUBLE_POLL	= BIT(REQ_F_DOUBLE_POLL_BIT),
 	/* request has already done partial IO */
 	REQ_F_PARTIAL_IO	= BIT(REQ_F_PARTIAL_IO_BIT),
-	/* request has file assignment deferred */
-	REQ_F_DEFERRED_FILE	= BIT(REQ_F_DEFERRED_FILE_BIT),
 };
 
 struct async_poll {
@@ -918,7 +915,11 @@ struct io_kiocb {
 	unsigned int			flags;
 
 	u64				user_data;
-	u32				result;
+	/* fd before execution, if valid, result after execution */
+	union {
+		u32			result;
+		s32			fd;
+	};
 	u32				cflags;
 
 	struct io_ring_ctx		*ctx;
@@ -1767,20 +1768,6 @@ static void io_kill_timeout(struct io_kiocb *req, int status)
 	}
 }
 
-static void io_assign_file(struct io_kiocb *req)
-{
-	if (req->file || !io_op_defs[req->opcode].needs_file)
-		return;
-	if (req->flags & REQ_F_DEFERRED_FILE) {
-		req->flags &= ~REQ_F_DEFERRED_FILE;
-		req->file = io_file_get(req->ctx, req, req->result,
-					req->flags & REQ_F_FIXED_FILE);
-		req->result = 0;
-	}
-	if (unlikely(!req->file))
-		req_set_fail(req);
-}
-
 static __cold void io_queue_deferred(struct io_ring_ctx *ctx)
 {
 	while (!list_empty(&ctx->defer_list)) {
@@ -1790,7 +1777,6 @@ static __cold void io_queue_deferred(struct io_ring_ctx *ctx)
 		if (req_need_defer(de->req, de->seq))
 			break;
 		list_del_init(&de->list);
-		io_assign_file(de->req);
 		io_req_task_queue(de->req);
 		kfree(de);
 	}
@@ -2131,7 +2117,6 @@ static void __io_req_complete_post(struct io_kiocb *req, s32 res,
 			if (req->flags & IO_DISARM_MASK)
 				io_disarm_next(req);
 			if (req->link) {
-				io_assign_file(req->link);
 				io_req_task_queue(req->link);
 				req->link = NULL;
 			}
@@ -2443,11 +2428,7 @@ static inline struct io_kiocb *io_req_find_next(struct io_kiocb *req)
 		__io_req_find_next_prep(req);
 	nxt = req->link;
 	req->link = NULL;
-	if (nxt) {
-		io_assign_file(nxt);
-		return nxt;
-	}
-	return NULL;
+	return nxt;
 }
 
 static void ctx_flush_and_put(struct io_ring_ctx *ctx, bool *locked)
@@ -2650,10 +2631,6 @@ static void io_req_task_queue_fail(struct io_kiocb *req, int ret)
 
 static void io_req_task_queue(struct io_kiocb *req)
 {
-	if (unlikely(req->flags & REQ_F_FAIL)) {
-		io_req_task_queue_fail(req, -ECANCELED);
-		return;
-	}
 	req->io_task_work.func = io_req_task_submit;
 	io_req_task_work_add(req, false);
 }
@@ -2668,10 +2645,8 @@ static inline void io_queue_next(struct io_kiocb *req)
 {
 	struct io_kiocb *nxt = io_req_find_next(req);
 
-	if (nxt) {
-		io_assign_file(req);
+	if (nxt)
 		io_req_task_queue(nxt);
-	}
 }
 
 static void io_free_req(struct io_kiocb *req)
@@ -4545,9 +4520,6 @@ static int io_fsync_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
 	struct io_ring_ctx *ctx = req->ctx;
 
-	if (!req->file)
-		return -EBADF;
-
 	if (unlikely(ctx->flags & IORING_SETUP_IOPOLL))
 		return -EINVAL;
 	if (unlikely(sqe->addr || sqe->ioprio || sqe->buf_index ||
@@ -7150,7 +7122,6 @@ static __cold void io_drain_req(struct io_kiocb *req)
 		spin_unlock(&ctx->completion_lock);
 queue:
 		ctx->drain_active = false;
-		io_assign_file(req);
 		io_req_task_queue(req);
 		return;
 	}
@@ -7259,6 +7230,23 @@ static void io_clean_op(struct io_kiocb *req)
 	req->flags &= ~IO_REQ_CLEAN_FLAGS;
 }
 
+static bool io_assign_file(struct io_kiocb *req)
+{
+	if (req->file || !io_op_defs[req->opcode].needs_file)
+		return true;
+
+	req->file = io_file_get(req->ctx, req, req->fd,
+					req->flags & REQ_F_FIXED_FILE);
+	if (unlikely(!req->file)) {
+		req_set_fail(req);
+		req->result = -EBADF;
+		return false;
+	}
+
+	req->result = 0;
+	return true;
+}
+
 static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
 {
 	const struct cred *creds = NULL;
@@ -7269,6 +7257,8 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
 
 	if (!io_op_defs[req->opcode].audit_skip)
 		audit_uring_entry(req->opcode);
+	if (unlikely(!io_assign_file(req)))
+		return -EBADF;
 
 	switch (req->opcode) {
 	case IORING_OP_NOP:
@@ -7429,8 +7419,7 @@ static void io_wq_submit_work(struct io_wq_work *work)
 	if (timeout)
 		io_queue_linked_timeout(timeout);
 
-	io_assign_file(req);
-	if (unlikely(!req->file && def->needs_file)) {
+	if (!io_assign_file(req)) {
 		work->flags |= IO_WQ_WORK_CANCEL;
 		err = -EBADF;
 	}
@@ -7732,12 +7721,12 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
 	int personality;
 	u8 opcode;
 
+	req->file = NULL;
 	/* req is partially pre-initialised, see io_preinit_req() */
 	req->opcode = opcode = READ_ONCE(sqe->opcode);
 	/* same numerical values with corresponding REQ_F_*, safe to copy */
 	req->flags = sqe_flags = READ_ONCE(sqe->flags);
 	req->user_data = READ_ONCE(sqe->user_data);
-	req->file = NULL;
 	req->fixed_rsrc_refs = NULL;
 	req->task = current;
 
@@ -7776,7 +7765,8 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
 
 	if (io_op_defs[opcode].needs_file) {
 		struct io_submit_state *state = &ctx->submit_state;
-		int fd = READ_ONCE(sqe->fd);
+
+		req->fd = READ_ONCE(sqe->fd);
 
 		/*
 		 * Plug now if we have more than 2 IO left after this, and the
@@ -7787,17 +7777,6 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
 			state->need_plug = false;
 			blk_start_plug_nr_ios(&state->plug, state->submit_nr);
 		}
-
-		req->file = io_file_get(ctx, req, fd,
-					(sqe_flags & IOSQE_FIXED_FILE));
-		if (unlikely(!req->file)) {
-			/* unless being deferred, error is final */
-			if (!(ctx->submit_state.link.head ||
-			     (sqe_flags & IOSQE_IO_DRAIN)))
-				return -EBADF;
-			req->result = fd;
-			req->flags |= REQ_F_DEFERRED_FILE;
-		}
 	}
 
 	personality = READ_ONCE(sqe->personality);

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: io_uring_prep_openat_direct() and link/drain
  2022-03-30 15:53                                   ` Jens Axboe
@ 2022-03-30 17:49                                     ` Jens Axboe
  2022-04-01  8:40                                       ` Miklos Szeredi
  0 siblings, 1 reply; 32+ messages in thread
From: Jens Axboe @ 2022-03-30 17:49 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: io-uring

On 3/30/22 9:53 AM, Jens Axboe wrote:
> On 3/30/22 9:17 AM, Jens Axboe wrote:
>> On 3/30/22 9:12 AM, Miklos Szeredi wrote:
>>> On Wed, 30 Mar 2022 at 17:05, Jens Axboe <axboe@kernel.dk> wrote:
>>>>
>>>> On 3/30/22 8:58 AM, Miklos Szeredi wrote:
>>>>> Next issue:  seems like file slot reuse is not working correctly.
>>>>> Attached program compares reads using io_uring with plain reads of
>>>>> proc files.
>>>>>
>>>>> In the below example it is using two slots alternately but the number
>>>>> of slots does not seem to matter, read is apparently always using a
>>>>> stale file (the prior one to the most recent open on that slot).  See
>>>>> how the sizes of the files lag by two lines:
>>>>>
>>>>> root@kvm:~# ./procreads
>>>>> procreads: /proc/1/stat: ok (313)
>>>>> procreads: /proc/2/stat: ok (149)
>>>>> procreads: /proc/3/stat: read size mismatch 313/150
>>>>> procreads: /proc/4/stat: read size mismatch 149/154
>>>>> procreads: /proc/5/stat: read size mismatch 150/161
>>>>> procreads: /proc/6/stat: read size mismatch 154/171
>>>>> ...
>>>>>
>>>>> Any ideas?
>>>>
>>>> Didn't look at your code yet, but with the current tree, this is the
>>>> behavior when a fixed file is used:
>>>>
>>>> At prep time, if the slot is valid it is used. If it isn't valid,
>>>> assignment is deferred until the request is issued.
>>>>
>>>> Which granted is a bit weird. It means that if you do:
>>>>
>>>> <open fileA into slot 1, slot 1 currently unused><read slot 1>
>>>>
>>>> the read will read from fileA. But for:
>>>>
>>>> <open fileB into slot 1, slot 1 is fileA currently><read slot 1>
>>>>
>>>> since slot 1 is already valid at prep time for the read, the read will
>>>> be from fileA again.
>>>>
>>>> Is this what you are seeing? It's definitely a bit confusing, and the
>>>> only reason why I didn't change it is because it could potentially break
>>>> applications. Don't think there's a high risk of that, however, so may
>>>> indeed be worth it to just bite the bullet and the assignment is
>>>> consistent (eg always done from the perspective of the previous
>>>> dependent request having completed).
>>>>
>>>> Is this what you are seeing?
>>>
>>> Right, this explains it.   Then the only workaround would be to wait
>>> for the open to finish before submitting the read, but that would
>>> defeat the whole point of using io_uring for this purpose.
>>
>> Honestly, I think we should just change it during this round, making it
>> consistent with the "slot is unused" use case. The old use case is more
>> more of a "it happened to work" vs the newer consistent behavior of "we
>> always assign the file when execution starts on the request".
>>
>> Let me spin a patch, would be great if you could test.
> 
> Something like this on top of the current tree should work. Can you
> test?

You can also just re-pull for-5.18/io_uring, it has been updated. A last
minute edit make a 0 return from io_assign_file() which should've been
'true'...

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: io_uring_prep_openat_direct() and link/drain
  2022-03-30 17:49                                     ` Jens Axboe
@ 2022-04-01  8:40                                       ` Miklos Szeredi
  2022-04-01 15:36                                         ` Jens Axboe
  0 siblings, 1 reply; 32+ messages in thread
From: Miklos Szeredi @ 2022-04-01  8:40 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring

On Wed, 30 Mar 2022 at 19:49, Jens Axboe <axboe@kernel.dk> wrote:
>
> On 3/30/22 9:53 AM, Jens Axboe wrote:
> > On 3/30/22 9:17 AM, Jens Axboe wrote:
> >> On 3/30/22 9:12 AM, Miklos Szeredi wrote:
> >>> On Wed, 30 Mar 2022 at 17:05, Jens Axboe <axboe@kernel.dk> wrote:
> >>>>
> >>>> On 3/30/22 8:58 AM, Miklos Szeredi wrote:
> >>>>> Next issue:  seems like file slot reuse is not working correctly.
> >>>>> Attached program compares reads using io_uring with plain reads of
> >>>>> proc files.
> >>>>>
> >>>>> In the below example it is using two slots alternately but the number
> >>>>> of slots does not seem to matter, read is apparently always using a
> >>>>> stale file (the prior one to the most recent open on that slot).  See
> >>>>> how the sizes of the files lag by two lines:
> >>>>>
> >>>>> root@kvm:~# ./procreads
> >>>>> procreads: /proc/1/stat: ok (313)
> >>>>> procreads: /proc/2/stat: ok (149)
> >>>>> procreads: /proc/3/stat: read size mismatch 313/150
> >>>>> procreads: /proc/4/stat: read size mismatch 149/154
> >>>>> procreads: /proc/5/stat: read size mismatch 150/161
> >>>>> procreads: /proc/6/stat: read size mismatch 154/171
> >>>>> ...
> >>>>>
> >>>>> Any ideas?
> >>>>
> >>>> Didn't look at your code yet, but with the current tree, this is the
> >>>> behavior when a fixed file is used:
> >>>>
> >>>> At prep time, if the slot is valid it is used. If it isn't valid,
> >>>> assignment is deferred until the request is issued.
> >>>>
> >>>> Which granted is a bit weird. It means that if you do:
> >>>>
> >>>> <open fileA into slot 1, slot 1 currently unused><read slot 1>
> >>>>
> >>>> the read will read from fileA. But for:
> >>>>
> >>>> <open fileB into slot 1, slot 1 is fileA currently><read slot 1>
> >>>>
> >>>> since slot 1 is already valid at prep time for the read, the read will
> >>>> be from fileA again.
> >>>>
> >>>> Is this what you are seeing? It's definitely a bit confusing, and the
> >>>> only reason why I didn't change it is because it could potentially break
> >>>> applications. Don't think there's a high risk of that, however, so may
> >>>> indeed be worth it to just bite the bullet and the assignment is
> >>>> consistent (eg always done from the perspective of the previous
> >>>> dependent request having completed).
> >>>>
> >>>> Is this what you are seeing?
> >>>
> >>> Right, this explains it.   Then the only workaround would be to wait
> >>> for the open to finish before submitting the read, but that would
> >>> defeat the whole point of using io_uring for this purpose.
> >>
> >> Honestly, I think we should just change it during this round, making it
> >> consistent with the "slot is unused" use case. The old use case is more
> >> more of a "it happened to work" vs the newer consistent behavior of "we
> >> always assign the file when execution starts on the request".
> >>
> >> Let me spin a patch, would be great if you could test.
> >
> > Something like this on top of the current tree should work. Can you
> > test?
>
> You can also just re-pull for-5.18/io_uring, it has been updated. A last
> minute edit make a 0 return from io_assign_file() which should've been
> 'true'...

Yep, this works now.

Next issue:  will get ENFILE even though there are just 40 slots.
When running as root, then it will get as far as invoking the OOM
killer, which is really bad.

There's no leak, this apparently only happens when the worker doing
the fputs can't keep up.  Simple solution:  do the fput() of the
previous file synchronously with the open_direct operation; fput
shouldn't be expensive...  Is there a reason why this wouldn't work?

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: io_uring_prep_openat_direct() and link/drain
  2022-04-01  8:40                                       ` Miklos Szeredi
@ 2022-04-01 15:36                                         ` Jens Axboe
  2022-04-01 16:02                                           ` Miklos Szeredi
  0 siblings, 1 reply; 32+ messages in thread
From: Jens Axboe @ 2022-04-01 15:36 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: io-uring

On 4/1/22 2:40 AM, Miklos Szeredi wrote:
> On Wed, 30 Mar 2022 at 19:49, Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 3/30/22 9:53 AM, Jens Axboe wrote:
>>> On 3/30/22 9:17 AM, Jens Axboe wrote:
>>>> On 3/30/22 9:12 AM, Miklos Szeredi wrote:
>>>>> On Wed, 30 Mar 2022 at 17:05, Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>
>>>>>> On 3/30/22 8:58 AM, Miklos Szeredi wrote:
>>>>>>> Next issue:  seems like file slot reuse is not working correctly.
>>>>>>> Attached program compares reads using io_uring with plain reads of
>>>>>>> proc files.
>>>>>>>
>>>>>>> In the below example it is using two slots alternately but the number
>>>>>>> of slots does not seem to matter, read is apparently always using a
>>>>>>> stale file (the prior one to the most recent open on that slot).  See
>>>>>>> how the sizes of the files lag by two lines:
>>>>>>>
>>>>>>> root@kvm:~# ./procreads
>>>>>>> procreads: /proc/1/stat: ok (313)
>>>>>>> procreads: /proc/2/stat: ok (149)
>>>>>>> procreads: /proc/3/stat: read size mismatch 313/150
>>>>>>> procreads: /proc/4/stat: read size mismatch 149/154
>>>>>>> procreads: /proc/5/stat: read size mismatch 150/161
>>>>>>> procreads: /proc/6/stat: read size mismatch 154/171
>>>>>>> ...
>>>>>>>
>>>>>>> Any ideas?
>>>>>>
>>>>>> Didn't look at your code yet, but with the current tree, this is the
>>>>>> behavior when a fixed file is used:
>>>>>>
>>>>>> At prep time, if the slot is valid it is used. If it isn't valid,
>>>>>> assignment is deferred until the request is issued.
>>>>>>
>>>>>> Which granted is a bit weird. It means that if you do:
>>>>>>
>>>>>> <open fileA into slot 1, slot 1 currently unused><read slot 1>
>>>>>>
>>>>>> the read will read from fileA. But for:
>>>>>>
>>>>>> <open fileB into slot 1, slot 1 is fileA currently><read slot 1>
>>>>>>
>>>>>> since slot 1 is already valid at prep time for the read, the read will
>>>>>> be from fileA again.
>>>>>>
>>>>>> Is this what you are seeing? It's definitely a bit confusing, and the
>>>>>> only reason why I didn't change it is because it could potentially break
>>>>>> applications. Don't think there's a high risk of that, however, so may
>>>>>> indeed be worth it to just bite the bullet and the assignment is
>>>>>> consistent (eg always done from the perspective of the previous
>>>>>> dependent request having completed).
>>>>>>
>>>>>> Is this what you are seeing?
>>>>>
>>>>> Right, this explains it.   Then the only workaround would be to wait
>>>>> for the open to finish before submitting the read, but that would
>>>>> defeat the whole point of using io_uring for this purpose.
>>>>
>>>> Honestly, I think we should just change it during this round, making it
>>>> consistent with the "slot is unused" use case. The old use case is more
>>>> more of a "it happened to work" vs the newer consistent behavior of "we
>>>> always assign the file when execution starts on the request".
>>>>
>>>> Let me spin a patch, would be great if you could test.
>>>
>>> Something like this on top of the current tree should work. Can you
>>> test?
>>
>> You can also just re-pull for-5.18/io_uring, it has been updated. A last
>> minute edit make a 0 return from io_assign_file() which should've been
>> 'true'...
> 
> Yep, this works now.
> 
> Next issue:  will get ENFILE even though there are just 40 slots.
> When running as root, then it will get as far as invoking the OOM
> killer, which is really bad.
> 
> There's no leak, this apparently only happens when the worker doing
> the fputs can't keep up.  Simple solution:  do the fput() of the
> previous file synchronously with the open_direct operation; fput
> shouldn't be expensive...  Is there a reason why this wouldn't work?

I take it you're continually reusing those slots? If you have a test
case that'd be ideal. Agree that it sounds like we just need an
appropriate breather to allow fput/task_work to run. Or it could be the
deferral free of the fixed slot.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: io_uring_prep_openat_direct() and link/drain
  2022-04-01 15:36                                         ` Jens Axboe
@ 2022-04-01 16:02                                           ` Miklos Szeredi
  2022-04-01 16:21                                             ` Jens Axboe
  0 siblings, 1 reply; 32+ messages in thread
From: Miklos Szeredi @ 2022-04-01 16:02 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring

[-- Attachment #1: Type: text/plain, Size: 738 bytes --]

On Fri, 1 Apr 2022 at 17:36, Jens Axboe <axboe@kernel.dk> wrote:

> I take it you're continually reusing those slots?

Yes.

>  If you have a test
> case that'd be ideal. Agree that it sounds like we just need an
> appropriate breather to allow fput/task_work to run. Or it could be the
> deferral free of the fixed slot.

Adding a breather could make the worst case latency be large.  I think
doing the fput synchronously would be better in general.

I test this on an VM with 8G of memory and run the following:

./forkbomb 14 &
# wait till 16k processes are forked
for i in `seq 1 100`; do ./procreads u; done

You can compare performance with plain reads (./procreads p), the
other tests don't work on public kernels.

Thanks,
Miklos

[-- Attachment #2: procreads.c --]
[-- Type: text/x-csrc, Size: 6663 bytes --]

#define _GNU_SOURCE

#include <stdio.h>
#include <fcntl.h>
#include <string.h>
#include <stdlib.h>
#include <dirent.h>
#include <unistd.h>
#include <err.h>
#include "liburing.h"

#define CHECK_NEGERR(_expr) \
	({ typeof(_expr) _ret = (_expr); if (_ret < 0) { errno = -_ret; err(1, #_expr); } _ret; })
#define CHECK_NULL(_expr) \
	({ typeof(_expr) _ret = (_expr); if (_ret == NULL) { errx(1, #_expr " returned NULL"); } _ret; })
#define CHECK_ERR(_expr) \
	({ typeof(_expr) _ret = (_expr); if (_ret == -1) { err(1, #_expr); } _ret; })


struct name_val {
	char *name;		/* in */
	struct iovec value_in;	/* in */
	struct iovec value_out;	/* out */
	uint32_t error;		/* out */
	uint32_t reserved;
};

static bool debug;
static const char *proc_list[] = { "stat", "status", "cmdline", "cgroup" };
#define proc_num (sizeof(proc_list)/sizeof(proc_list[0]))
#define batch 10

int getvalues(int dfd, const char *path, struct name_val *vec, size_t num,
	      unsigned int flags)
{
	return syscall(451, dfd, path, vec, num, flags);
}

static void print_val(const char *name, struct name_val *nv)
{
	const char *s = nv->value_out.iov_base;
	size_t len = nv->value_out.iov_len;
	const size_t prmax = 40;
	int prlen = len < prmax ? len : prmax;
	const char *cont = len < prmax ? "" : "...";

	if (nv->error)
		printf("/proc/%s/%s = ERROR %s (%i)\n",
		       name, nv->name, strerror(nv->error), nv->error);
	else if (debug)
		printf("/proc/%s/%s = \"%.*s\"%s (len=%zi)\n",
		       name, nv->name, prlen, s, cont, len);
}

static void print_values(const char *name, struct name_val *vec, size_t num,
			 ssize_t ret)
{
	int i;

	if (ret < 0) {
		errno = -ret; warn("getvalues failed");
	} else {
		if ((size_t) ret < num)
			warnx("%zi values read out of %zi", ret, num);
		for (i = 0; i < ret; i++)
			print_val(name, &vec[i]);
	}
}

static ssize_t readfile_plain(int dfd, const char *path, char *buf, size_t size)
{
	int fd;
	ssize_t ret;

	fd = openat(dfd, path, O_RDONLY);
	if (fd == -1)
		return -errno;

	ret = read(fd, buf, size);
	if (ret == -1)
		ret = -errno;
	else if ((size_t) ret == size)
		ret = -EOVERFLOW;

	close(fd);

	return ret;
}

static int readfiles_plain(int dfd, const char *path, struct name_val *vec,
			   size_t num, int mode)
{
	struct name_val *nv;
	ssize_t ret;
	size_t i;

	if (path[0])
		dfd = CHECK_ERR(openat(dfd, path, O_PATH));

	for (i = 0; i < num; i++) {
		nv = &vec[i];
		if (mode) {
			CHECK_ERR(getvalues(dfd, "", nv, 1, mode == 2));
		} else {
			ret = readfile_plain(dfd, nv->name,
					     nv->value_in.iov_base,
					     nv->value_in.iov_len);
			if (ret < 0) {
				nv->error = -ret;
			} else {
				nv->error = 0;
				nv->value_out.iov_base = nv->value_in.iov_base;
				nv->value_out.iov_len = ret;
			}
		}
	}
	if (path[0])
		close(dfd);

	return num;
}

static int readfiles_uring(struct io_uring *ring, int dfd, const char *path,
			   struct name_val *vec, size_t num)
{
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	size_t slot;
	int ret, i;
	static int seq = 1;
	struct name_val *nv;

	if (path[0])
		dfd = CHECK_ERR(openat(dfd, path, O_PATH));

	for (slot = 0; slot < num; slot++) {
		nv = &vec[slot];
		sqe = io_uring_get_sqe(ring);
		io_uring_prep_openat_direct(sqe, dfd, nv->name, O_RDONLY, 0,
					    slot);
		sqe->flags = IOSQE_IO_LINK | IOSQE_CQE_SKIP_SUCCESS;
		sqe->user_data = seq + slot * 2;

		sqe = io_uring_get_sqe(ring);
		io_uring_prep_read(sqe, slot, nv->value_in.iov_base,
				   nv->value_in.iov_len, 0);
		sqe->flags = IOSQE_FIXED_FILE;
		sqe->user_data = seq + slot * 2 + 1;
	}

	ret = CHECK_NEGERR(io_uring_submit_and_wait(ring, num));
	ret /= 2;
	for (i = 0; i < ret; i++) {
		CHECK_NEGERR(io_uring_wait_cqe(ring, &cqe));
		slot = (cqe->user_data - seq) / 2;
		nv = &vec[slot];
		if (cqe->res < 0) {
			nv->error = -cqe->res;
		} else if ((size_t) cqe->res < nv->value_in.iov_len) {
			nv->error = 0;
			nv->value_out.iov_base = nv->value_in.iov_base;
			nv->value_out.iov_len = cqe->res;
		} else {
			nv->error = EOVERFLOW;
		}
		io_uring_cqe_seen(ring, cqe);
	}
	seq += 2 * num;
	if (path[0])
		close(dfd);

	return ret;
}

static const char *next_name(DIR *dp)
{
	const char *name;
	struct dirent *de;

	while ((de = readdir(dp))) {
		name = de->d_name;
		if (name[0] > '0' && name[0] <= '9')
			return name;
	}
	return NULL;
}

static size_t next_batch(DIR *dp, struct name_val *vec, size_t num,
			 const char **namep)
{
	const char *name;
	size_t i;

	if (batch == 1) {
		name = next_name(dp);
		if (!name)
			return 0;
		*namep = name;
		return 1;
	}

	*namep = "";
	for (i = 0; i < num; i++) {
		if (i % proc_num == 0 && (name = next_name(dp)) == NULL)
			break;
		free(vec[i].name);
		vec[i].name = CHECK_NULL(malloc(128));
		sprintf(vec[i].name, "%s/%s", name, proc_list[i % proc_num]);
	}
	return i;
}

static void test_uring(DIR *dp, struct name_val *vec, size_t num)
{
	int fds[proc_num * batch];
	const size_t numslots = sizeof(fds)/sizeof(fds[0]);
	struct io_uring ring;
	const char *name;
	ssize_t ret;

	memset(fds, -1, sizeof(fds));
	CHECK_NEGERR(io_uring_queue_init(num * 2, &ring, 0));
	CHECK_NEGERR(io_uring_register_files(&ring, fds, numslots));

	while ((num = next_batch(dp, vec, num, &name))) {
		ret = readfiles_uring(&ring, dirfd(dp), name, vec, num);
		print_values(name, vec, num, ret);
	}
	io_uring_queue_exit(&ring);
}

static void test_plain(DIR *dp, struct name_val *vec, size_t num, int mode)
{
	const char *name;
	ssize_t ret;

	while ((num = next_batch(dp, vec, num, &name))) {
		ret = readfiles_plain(dirfd(dp), name, vec, num, mode);
		print_values(name, vec, num, ret);
	}
}

static void test_values(DIR *dp, struct name_val *vec, size_t num, bool rf)
{
	const char *name;
	ssize_t ret;

	while ((num = next_batch(dp, vec, num, &name))) {
		ret = getvalues(dirfd(dp), name, vec, num, rf);
		print_values(name, vec, num, ret);
	}
}

int main(int argc, char *argv[])
{
	const size_t num = proc_num * batch;
	char buf[num][4096];
	struct name_val vec[num];
	DIR *dp;
	size_t i;
	char type = 'p';

	if (argc > 1)
		type = argv[1][0];

	if (argc > 2)
		debug = true;

	for (i = 0; i < num; i++) {
		vec[i].value_in.iov_base = (type != 'w' || !i) ? buf[i] : NULL;
		vec[i].value_in.iov_len = sizeof(buf[i]);
	}

	dp = CHECK_NULL(opendir("/proc"));
	switch (type) {
	case 'p':
		test_plain(dp, vec, num, 0);
		break;
	case 'r':
		test_plain(dp, vec, num, 1);
		break;
	case 's':
		test_plain(dp, vec, num, 2);
		break;
	case 'u':
		test_uring(dp, vec, num);
		break;
	case 'w':
		vec[0].value_in.iov_len = sizeof(buf[0]) * num;
		/* fallthrough */
	case 'v':
	case 'z':
		test_values(dp, vec, num, type == 'z');
		break;
	}
	closedir(dp);
	return 0;
}

[-- Attachment #3: forkbomb.c --]
[-- Type: text/x-csrc, Size: 511 bytes --]

#include <unistd.h>
#include <stdio.h>
#include <err.h>
#include <pthread.h>
#include <stdlib.h>

static void *run(void *)
{
	sleep(1000);
	return NULL;
}

int main(int argc, char *argv[])
{
	int pid, level, i;
	pthread_t thr;
	int maxlevel = atoi(argv[1]);

	for (level = 0; level < maxlevel; level++) {
		pid = fork();
		if (pid == -1)
			err(1, "fork");

		fprintf(stderr, ".");
		#if 0
		if (pid == 0) {
			for (i = 0; i < 4; i++)
				pthread_create(&thr, NULL, run, NULL);

		}
		#endif
	}
	sleep(1000);
}

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: io_uring_prep_openat_direct() and link/drain
  2022-04-01 16:02                                           ` Miklos Szeredi
@ 2022-04-01 16:21                                             ` Jens Axboe
  2022-04-02  1:17                                               ` Jens Axboe
  0 siblings, 1 reply; 32+ messages in thread
From: Jens Axboe @ 2022-04-01 16:21 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: io-uring

On 4/1/22 10:02 AM, Miklos Szeredi wrote:
> On Fri, 1 Apr 2022 at 17:36, Jens Axboe <axboe@kernel.dk> wrote:
> 
>> I take it you're continually reusing those slots?
> 
> Yes.
> 
>>  If you have a test
>> case that'd be ideal. Agree that it sounds like we just need an
>> appropriate breather to allow fput/task_work to run. Or it could be the
>> deferral free of the fixed slot.
> 
> Adding a breather could make the worst case latency be large.  I think
> doing the fput synchronously would be better in general.

fput() isn't sync, it'll just offload to task_work. There are some
dependencies there that would need to be checked. But we'll find a way
to deal with it.

> I test this on an VM with 8G of memory and run the following:
> 
> ./forkbomb 14 &
> # wait till 16k processes are forked
> for i in `seq 1 100`; do ./procreads u; done
> 
> You can compare performance with plain reads (./procreads p), the
> other tests don't work on public kernels.

OK, I'll check up on this, but probably won't have time to do so before
early next week.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: io_uring_prep_openat_direct() and link/drain
  2022-04-01 16:21                                             ` Jens Axboe
@ 2022-04-02  1:17                                               ` Jens Axboe
  2022-04-05  7:45                                                 ` Miklos Szeredi
  0 siblings, 1 reply; 32+ messages in thread
From: Jens Axboe @ 2022-04-02  1:17 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: io-uring

On 4/1/22 10:21 AM, Jens Axboe wrote:
> On 4/1/22 10:02 AM, Miklos Szeredi wrote:
>> On Fri, 1 Apr 2022 at 17:36, Jens Axboe <axboe@kernel.dk> wrote:
>>
>>> I take it you're continually reusing those slots?
>>
>> Yes.
>>
>>>  If you have a test
>>> case that'd be ideal. Agree that it sounds like we just need an
>>> appropriate breather to allow fput/task_work to run. Or it could be the
>>> deferral free of the fixed slot.
>>
>> Adding a breather could make the worst case latency be large.  I think
>> doing the fput synchronously would be better in general.
> 
> fput() isn't sync, it'll just offload to task_work. There are some
> dependencies there that would need to be checked. But we'll find a way
> to deal with it.
> 
>> I test this on an VM with 8G of memory and run the following:
>>
>> ./forkbomb 14 &
>> # wait till 16k processes are forked
>> for i in `seq 1 100`; do ./procreads u; done
>>
>> You can compare performance with plain reads (./procreads p), the
>> other tests don't work on public kernels.
> 
> OK, I'll check up on this, but probably won't have time to do so before
> early next week.

Can you try with this patch? It's not complete yet, there's actually a
bunch of things we can do to improve the direct descriptor case. But
this one is easy enough to pull off, and I think it'll fix your OOM
case. Not a proposed patch, but it'll prove the theory.


diff --git a/fs/io_uring.c b/fs/io_uring.c
index 0e199040f151..d52cd9c98d6d 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -231,7 +231,7 @@ struct io_rsrc_put {
 	u64 tag;
 	union {
 		void *rsrc;
-		struct file *file;
+		unsigned long file_ptr;
 		struct io_mapped_ubuf *buf;
 	};
 };
@@ -1601,7 +1601,12 @@ static bool req_need_defer(struct io_kiocb *req, u32 seq)
 
 #define FFS_NOWAIT		0x1UL
 #define FFS_ISREG		0x2UL
-#define FFS_MASK		~(FFS_NOWAIT|FFS_ISREG)
+#if defined(CONFIG_64BIT)
+#define FFS_DIRECT		0x4UL
+#else
+#define FFS_DIRECT		0x0UL
+#endif
+#define FFS_MASK		~(FFS_NOWAIT|FFS_ISREG|FFS_DIRECT)
 
 static inline bool io_req_ffs_set(struct io_kiocb *req)
 {
@@ -7443,12 +7448,19 @@ static inline struct file *io_file_from_index(struct io_ring_ctx *ctx,
 	return (struct file *) (slot->file_ptr & FFS_MASK);
 }
 
-static void io_fixed_file_set(struct io_fixed_file *file_slot, struct file *file)
+static bool io_fixed_file_set(struct io_fixed_file *file_slot, struct file *file,
+			      bool direct_descriptor)
 {
 	unsigned long file_ptr = (unsigned long) file;
+	bool ret = false;
 
 	file_ptr |= io_file_get_flags(file);
+	if (direct_descriptor) {
+		file_ptr |= FFS_DIRECT;
+		ret = true;
+	}
 	file_slot->file_ptr = file_ptr;
+	return ret;
 }
 
 static inline struct file *io_file_get_fixed(struct io_ring_ctx *ctx,
@@ -8917,7 +8929,7 @@ static int io_sqe_files_scm(struct io_ring_ctx *ctx)
 
 static void io_rsrc_file_put(struct io_ring_ctx *ctx, struct io_rsrc_put *prsrc)
 {
-	struct file *file = prsrc->file;
+	struct file *file = (struct file *) (prsrc->file_ptr & FFS_MASK);
 #if defined(CONFIG_UNIX)
 	struct sock *sock = ctx->ring_sock->sk;
 	struct sk_buff_head list, *head = &sock->sk_receive_queue;
@@ -9083,7 +9095,8 @@ static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
 			fput(file);
 			goto out_fput;
 		}
-		io_fixed_file_set(io_fixed_file_slot(&ctx->file_table, i), file);
+		io_fixed_file_set(io_fixed_file_slot(&ctx->file_table, i), file,
+					false);
 	}
 
 	ret = io_sqe_files_scm(ctx);
@@ -9166,6 +9179,20 @@ static int io_queue_rsrc_removal(struct io_rsrc_data *data, unsigned idx,
 	return 0;
 }
 
+static int io_queue_file_removal(struct io_rsrc_data *data, unsigned idx,
+				 struct io_rsrc_node *node,
+				 unsigned long file_ptr)
+{
+	struct file *file = (struct file *) (file_ptr & FFS_MASK);
+
+	if (file_ptr & FFS_DIRECT) {
+		fput(file);
+		return 0;
+	}
+
+	return io_queue_rsrc_removal(data, idx, node, file);
+}
+
 static int io_install_fixed_file(struct io_kiocb *req, struct file *file,
 				 unsigned int issue_flags, u32 slot_index)
 {
@@ -9189,15 +9216,13 @@ static int io_install_fixed_file(struct io_kiocb *req, struct file *file,
 	file_slot = io_fixed_file_slot(&ctx->file_table, slot_index);
 
 	if (file_slot->file_ptr) {
-		struct file *old_file;
-
 		ret = io_rsrc_node_switch_start(ctx);
 		if (ret)
 			goto err;
 
-		old_file = (struct file *)(file_slot->file_ptr & FFS_MASK);
-		ret = io_queue_rsrc_removal(ctx->file_data, slot_index,
-					    ctx->rsrc_node, old_file);
+		ret = io_queue_file_removal(ctx->file_data, slot_index,
+					    ctx->rsrc_node,
+					    file_slot->file_ptr);
 		if (ret)
 			goto err;
 		file_slot->file_ptr = 0;
@@ -9205,13 +9230,13 @@ static int io_install_fixed_file(struct io_kiocb *req, struct file *file,
 	}
 
 	*io_get_tag_slot(ctx->file_data, slot_index) = 0;
-	io_fixed_file_set(file_slot, file);
-	ret = io_sqe_file_register(ctx, file, slot_index);
-	if (ret) {
-		file_slot->file_ptr = 0;
-		goto err;
+	if (!io_fixed_file_set(file_slot, file, true)) {
+		ret = io_sqe_file_register(ctx, file, slot_index);
+		if (ret) {
+			file_slot->file_ptr = 0;
+			goto err;
+		}
 	}
-
 	ret = 0;
 err:
 	if (needs_switch)
@@ -9228,7 +9253,6 @@ static int io_close_fixed(struct io_kiocb *req, unsigned int issue_flags)
 	struct io_ring_ctx *ctx = req->ctx;
 	bool needs_lock = issue_flags & IO_URING_F_UNLOCKED;
 	struct io_fixed_file *file_slot;
-	struct file *file;
 	int ret, i;
 
 	io_ring_submit_lock(ctx, needs_lock);
@@ -9248,8 +9272,8 @@ static int io_close_fixed(struct io_kiocb *req, unsigned int issue_flags)
 	if (!file_slot->file_ptr)
 		goto out;
 
-	file = (struct file *)(file_slot->file_ptr & FFS_MASK);
-	ret = io_queue_rsrc_removal(ctx->file_data, offset, ctx->rsrc_node, file);
+	ret = io_queue_file_removal(ctx->file_data, offset,
+				    ctx->rsrc_node, file_slot->file_ptr);
 	if (ret)
 		goto out;
 
@@ -9298,9 +9322,9 @@ static int __io_sqe_files_update(struct io_ring_ctx *ctx,
 		file_slot = io_fixed_file_slot(&ctx->file_table, i);
 
 		if (file_slot->file_ptr) {
-			file = (struct file *)(file_slot->file_ptr & FFS_MASK);
-			err = io_queue_rsrc_removal(data, up->offset + done,
-						    ctx->rsrc_node, file);
+			err = io_queue_file_removal(data, up->offset + done,
+						    ctx->rsrc_node,
+						    file_slot->file_ptr);
 			if (err)
 				break;
 			file_slot->file_ptr = 0;
@@ -9326,7 +9350,7 @@ static int __io_sqe_files_update(struct io_ring_ctx *ctx,
 				break;
 			}
 			*io_get_tag_slot(data, up->offset + done) = tag;
-			io_fixed_file_set(file_slot, file);
+			io_fixed_file_set(file_slot, file, false);
 			err = io_sqe_file_register(ctx, file, i);
 			if (err) {
 				file_slot->file_ptr = 0;

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: io_uring_prep_openat_direct() and link/drain
  2022-04-02  1:17                                               ` Jens Axboe
@ 2022-04-05  7:45                                                 ` Miklos Szeredi
  2022-04-05 14:44                                                   ` Jens Axboe
  0 siblings, 1 reply; 32+ messages in thread
From: Miklos Szeredi @ 2022-04-05  7:45 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring

On Sat, 2 Apr 2022 at 03:17, Jens Axboe <axboe@kernel.dk> wrote:
>
> On 4/1/22 10:21 AM, Jens Axboe wrote:
> > On 4/1/22 10:02 AM, Miklos Szeredi wrote:
> >> On Fri, 1 Apr 2022 at 17:36, Jens Axboe <axboe@kernel.dk> wrote:
> >>
> >>> I take it you're continually reusing those slots?
> >>
> >> Yes.
> >>
> >>>  If you have a test
> >>> case that'd be ideal. Agree that it sounds like we just need an
> >>> appropriate breather to allow fput/task_work to run. Or it could be the
> >>> deferral free of the fixed slot.
> >>
> >> Adding a breather could make the worst case latency be large.  I think
> >> doing the fput synchronously would be better in general.
> >
> > fput() isn't sync, it'll just offload to task_work. There are some
> > dependencies there that would need to be checked. But we'll find a way
> > to deal with it.
> >
> >> I test this on an VM with 8G of memory and run the following:
> >>
> >> ./forkbomb 14 &
> >> # wait till 16k processes are forked
> >> for i in `seq 1 100`; do ./procreads u; done
> >>
> >> You can compare performance with plain reads (./procreads p), the
> >> other tests don't work on public kernels.
> >
> > OK, I'll check up on this, but probably won't have time to do so before
> > early next week.
>
> Can you try with this patch? It's not complete yet, there's actually a
> bunch of things we can do to improve the direct descriptor case. But
> this one is easy enough to pull off, and I think it'll fix your OOM
> case. Not a proposed patch, but it'll prove the theory.

Sorry for the delay..

Patch works like charm.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: io_uring_prep_openat_direct() and link/drain
  2022-04-05  7:45                                                 ` Miklos Szeredi
@ 2022-04-05 14:44                                                   ` Jens Axboe
  2022-04-21 12:31                                                     ` Miklos Szeredi
  0 siblings, 1 reply; 32+ messages in thread
From: Jens Axboe @ 2022-04-05 14:44 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: io-uring

On 4/5/22 1:45 AM, Miklos Szeredi wrote:
> On Sat, 2 Apr 2022 at 03:17, Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 4/1/22 10:21 AM, Jens Axboe wrote:
>>> On 4/1/22 10:02 AM, Miklos Szeredi wrote:
>>>> On Fri, 1 Apr 2022 at 17:36, Jens Axboe <axboe@kernel.dk> wrote:
>>>>
>>>>> I take it you're continually reusing those slots?
>>>>
>>>> Yes.
>>>>
>>>>>  If you have a test
>>>>> case that'd be ideal. Agree that it sounds like we just need an
>>>>> appropriate breather to allow fput/task_work to run. Or it could be the
>>>>> deferral free of the fixed slot.
>>>>
>>>> Adding a breather could make the worst case latency be large.  I think
>>>> doing the fput synchronously would be better in general.
>>>
>>> fput() isn't sync, it'll just offload to task_work. There are some
>>> dependencies there that would need to be checked. But we'll find a way
>>> to deal with it.
>>>
>>>> I test this on an VM with 8G of memory and run the following:
>>>>
>>>> ./forkbomb 14 &
>>>> # wait till 16k processes are forked
>>>> for i in `seq 1 100`; do ./procreads u; done
>>>>
>>>> You can compare performance with plain reads (./procreads p), the
>>>> other tests don't work on public kernels.
>>>
>>> OK, I'll check up on this, but probably won't have time to do so before
>>> early next week.
>>
>> Can you try with this patch? It's not complete yet, there's actually a
>> bunch of things we can do to improve the direct descriptor case. But
>> this one is easy enough to pull off, and I think it'll fix your OOM
>> case. Not a proposed patch, but it'll prove the theory.
> 
> Sorry for the delay..
> 
> Patch works like charm.

OK good, then it is the issue I suspected. Thanks for testing!

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: io_uring_prep_openat_direct() and link/drain
  2022-04-05 14:44                                                   ` Jens Axboe
@ 2022-04-21 12:31                                                     ` Miklos Szeredi
  2022-04-21 12:34                                                       ` Jens Axboe
  0 siblings, 1 reply; 32+ messages in thread
From: Miklos Szeredi @ 2022-04-21 12:31 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring

On Tue, 5 Apr 2022 at 16:44, Jens Axboe <axboe@kernel.dk> wrote:
>
> On 4/5/22 1:45 AM, Miklos Szeredi wrote:
> > On Sat, 2 Apr 2022 at 03:17, Jens Axboe <axboe@kernel.dk> wrote:
> >>
> >> On 4/1/22 10:21 AM, Jens Axboe wrote:
> >>> On 4/1/22 10:02 AM, Miklos Szeredi wrote:
> >>>> On Fri, 1 Apr 2022 at 17:36, Jens Axboe <axboe@kernel.dk> wrote:
> >>>>
> >>>>> I take it you're continually reusing those slots?
> >>>>
> >>>> Yes.
> >>>>
> >>>>>  If you have a test
> >>>>> case that'd be ideal. Agree that it sounds like we just need an
> >>>>> appropriate breather to allow fput/task_work to run. Or it could be the
> >>>>> deferral free of the fixed slot.
> >>>>
> >>>> Adding a breather could make the worst case latency be large.  I think
> >>>> doing the fput synchronously would be better in general.
> >>>
> >>> fput() isn't sync, it'll just offload to task_work. There are some
> >>> dependencies there that would need to be checked. But we'll find a way
> >>> to deal with it.
> >>>
> >>>> I test this on an VM with 8G of memory and run the following:
> >>>>
> >>>> ./forkbomb 14 &
> >>>> # wait till 16k processes are forked
> >>>> for i in `seq 1 100`; do ./procreads u; done
> >>>>
> >>>> You can compare performance with plain reads (./procreads p), the
> >>>> other tests don't work on public kernels.
> >>>
> >>> OK, I'll check up on this, but probably won't have time to do so before
> >>> early next week.
> >>
> >> Can you try with this patch? It's not complete yet, there's actually a
> >> bunch of things we can do to improve the direct descriptor case. But
> >> this one is easy enough to pull off, and I think it'll fix your OOM
> >> case. Not a proposed patch, but it'll prove the theory.
> >
> > Sorry for the delay..
> >
> > Patch works like charm.
>
> OK good, then it is the issue I suspected. Thanks for testing!

Tested with v5.18-rc3 and performance seems significantly worse than
with the test patch:

test patch:
        avg     min     max     stdev
real    0.205   0.190   0.266   0.011
user    0.017   0.007   0.029   0.004
sys     0.374   0.336   0.503   0.022

5.18.0-rc3-00016-gb253435746d9:
        avg     min     max     stdev
real    0.725   0.200   18.090  2.279
user    0.019   0.005   0.046   0.006
sys     0.454   0.241   1.022   0.199

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: io_uring_prep_openat_direct() and link/drain
  2022-04-21 12:31                                                     ` Miklos Szeredi
@ 2022-04-21 12:34                                                       ` Jens Axboe
  2022-04-21 12:39                                                         ` Miklos Szeredi
  0 siblings, 1 reply; 32+ messages in thread
From: Jens Axboe @ 2022-04-21 12:34 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: io-uring

On 4/21/22 6:31 AM, Miklos Szeredi wrote:
> On Tue, 5 Apr 2022 at 16:44, Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 4/5/22 1:45 AM, Miklos Szeredi wrote:
>>> On Sat, 2 Apr 2022 at 03:17, Jens Axboe <axboe@kernel.dk> wrote:
>>>>
>>>> On 4/1/22 10:21 AM, Jens Axboe wrote:
>>>>> On 4/1/22 10:02 AM, Miklos Szeredi wrote:
>>>>>> On Fri, 1 Apr 2022 at 17:36, Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>
>>>>>>> I take it you're continually reusing those slots?
>>>>>>
>>>>>> Yes.
>>>>>>
>>>>>>>  If you have a test
>>>>>>> case that'd be ideal. Agree that it sounds like we just need an
>>>>>>> appropriate breather to allow fput/task_work to run. Or it could be the
>>>>>>> deferral free of the fixed slot.
>>>>>>
>>>>>> Adding a breather could make the worst case latency be large.  I think
>>>>>> doing the fput synchronously would be better in general.
>>>>>
>>>>> fput() isn't sync, it'll just offload to task_work. There are some
>>>>> dependencies there that would need to be checked. But we'll find a way
>>>>> to deal with it.
>>>>>
>>>>>> I test this on an VM with 8G of memory and run the following:
>>>>>>
>>>>>> ./forkbomb 14 &
>>>>>> # wait till 16k processes are forked
>>>>>> for i in `seq 1 100`; do ./procreads u; done
>>>>>>
>>>>>> You can compare performance with plain reads (./procreads p), the
>>>>>> other tests don't work on public kernels.
>>>>>
>>>>> OK, I'll check up on this, but probably won't have time to do so before
>>>>> early next week.
>>>>
>>>> Can you try with this patch? It's not complete yet, there's actually a
>>>> bunch of things we can do to improve the direct descriptor case. But
>>>> this one is easy enough to pull off, and I think it'll fix your OOM
>>>> case. Not a proposed patch, but it'll prove the theory.
>>>
>>> Sorry for the delay..
>>>
>>> Patch works like charm.
>>
>> OK good, then it is the issue I suspected. Thanks for testing!
> 
> Tested with v5.18-rc3 and performance seems significantly worse than
> with the test patch:
> 
> test patch:
>         avg     min     max     stdev
> real    0.205   0.190   0.266   0.011
> user    0.017   0.007   0.029   0.004
> sys     0.374   0.336   0.503   0.022
> 
> 5.18.0-rc3-00016-gb253435746d9:
>         avg     min     max     stdev
> real    0.725   0.200   18.090  2.279
> user    0.019   0.005   0.046   0.006
> sys     0.454   0.241   1.022   0.199

It's been a month and I don't remember details of which patches were
tested, when you say "test patch", which one exactly are you referring
to and what base was it applied on?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: io_uring_prep_openat_direct() and link/drain
  2022-04-21 12:34                                                       ` Jens Axboe
@ 2022-04-21 12:39                                                         ` Miklos Szeredi
  2022-04-21 12:41                                                           ` Jens Axboe
  0 siblings, 1 reply; 32+ messages in thread
From: Miklos Szeredi @ 2022-04-21 12:39 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring

On Thu, 21 Apr 2022 at 14:34, Jens Axboe <axboe@kernel.dk> wrote:
>
> On 4/21/22 6:31 AM, Miklos Szeredi wrote:
> > On Tue, 5 Apr 2022 at 16:44, Jens Axboe <axboe@kernel.dk> wrote:
> >>
> >> On 4/5/22 1:45 AM, Miklos Szeredi wrote:
> >>> On Sat, 2 Apr 2022 at 03:17, Jens Axboe <axboe@kernel.dk> wrote:
> >>>>
> >>>> On 4/1/22 10:21 AM, Jens Axboe wrote:
> >>>>> On 4/1/22 10:02 AM, Miklos Szeredi wrote:
> >>>>>> On Fri, 1 Apr 2022 at 17:36, Jens Axboe <axboe@kernel.dk> wrote:
> >>>>>>
> >>>>>>> I take it you're continually reusing those slots?
> >>>>>>
> >>>>>> Yes.
> >>>>>>
> >>>>>>>  If you have a test
> >>>>>>> case that'd be ideal. Agree that it sounds like we just need an
> >>>>>>> appropriate breather to allow fput/task_work to run. Or it could be the
> >>>>>>> deferral free of the fixed slot.
> >>>>>>
> >>>>>> Adding a breather could make the worst case latency be large.  I think
> >>>>>> doing the fput synchronously would be better in general.
> >>>>>
> >>>>> fput() isn't sync, it'll just offload to task_work. There are some
> >>>>> dependencies there that would need to be checked. But we'll find a way
> >>>>> to deal with it.
> >>>>>
> >>>>>> I test this on an VM with 8G of memory and run the following:
> >>>>>>
> >>>>>> ./forkbomb 14 &
> >>>>>> # wait till 16k processes are forked
> >>>>>> for i in `seq 1 100`; do ./procreads u; done
> >>>>>>
> >>>>>> You can compare performance with plain reads (./procreads p), the
> >>>>>> other tests don't work on public kernels.
> >>>>>
> >>>>> OK, I'll check up on this, but probably won't have time to do so before
> >>>>> early next week.
> >>>>
> >>>> Can you try with this patch? It's not complete yet, there's actually a
> >>>> bunch of things we can do to improve the direct descriptor case. But
> >>>> this one is easy enough to pull off, and I think it'll fix your OOM
> >>>> case. Not a proposed patch, but it'll prove the theory.
> >>>
> >>> Sorry for the delay..
> >>>
> >>> Patch works like charm.
> >>
> >> OK good, then it is the issue I suspected. Thanks for testing!
> >
> > Tested with v5.18-rc3 and performance seems significantly worse than
> > with the test patch:
> >
> > test patch:
> >         avg     min     max     stdev
> > real    0.205   0.190   0.266   0.011
> > user    0.017   0.007   0.029   0.004
> > sys     0.374   0.336   0.503   0.022
> >
> > 5.18.0-rc3-00016-gb253435746d9:
> >         avg     min     max     stdev
> > real    0.725   0.200   18.090  2.279
> > user    0.019   0.005   0.046   0.006
> > sys     0.454   0.241   1.022   0.199
>
> It's been a month and I don't remember details of which patches were
> tested, when you say "test patch", which one exactly are you referring
> to and what base was it applied on?

https://lore.kernel.org/all/47912c4c-ccc2-0678-6c2f-3e3c0dd1f04b@kernel.dk/

The base is a good question, it was after the basic fixed slot
assignment issues were fixed.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: io_uring_prep_openat_direct() and link/drain
  2022-04-21 12:39                                                         ` Miklos Szeredi
@ 2022-04-21 12:41                                                           ` Jens Axboe
  2022-04-21 13:10                                                             ` Miklos Szeredi
  0 siblings, 1 reply; 32+ messages in thread
From: Jens Axboe @ 2022-04-21 12:41 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: io-uring

On 4/21/22 6:39 AM, Miklos Szeredi wrote:
> On Thu, 21 Apr 2022 at 14:34, Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 4/21/22 6:31 AM, Miklos Szeredi wrote:
>>> On Tue, 5 Apr 2022 at 16:44, Jens Axboe <axboe@kernel.dk> wrote:
>>>>
>>>> On 4/5/22 1:45 AM, Miklos Szeredi wrote:
>>>>> On Sat, 2 Apr 2022 at 03:17, Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>
>>>>>> On 4/1/22 10:21 AM, Jens Axboe wrote:
>>>>>>> On 4/1/22 10:02 AM, Miklos Szeredi wrote:
>>>>>>>> On Fri, 1 Apr 2022 at 17:36, Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>>>
>>>>>>>>> I take it you're continually reusing those slots?
>>>>>>>>
>>>>>>>> Yes.
>>>>>>>>
>>>>>>>>>  If you have a test
>>>>>>>>> case that'd be ideal. Agree that it sounds like we just need an
>>>>>>>>> appropriate breather to allow fput/task_work to run. Or it could be the
>>>>>>>>> deferral free of the fixed slot.
>>>>>>>>
>>>>>>>> Adding a breather could make the worst case latency be large.  I think
>>>>>>>> doing the fput synchronously would be better in general.
>>>>>>>
>>>>>>> fput() isn't sync, it'll just offload to task_work. There are some
>>>>>>> dependencies there that would need to be checked. But we'll find a way
>>>>>>> to deal with it.
>>>>>>>
>>>>>>>> I test this on an VM with 8G of memory and run the following:
>>>>>>>>
>>>>>>>> ./forkbomb 14 &
>>>>>>>> # wait till 16k processes are forked
>>>>>>>> for i in `seq 1 100`; do ./procreads u; done
>>>>>>>>
>>>>>>>> You can compare performance with plain reads (./procreads p), the
>>>>>>>> other tests don't work on public kernels.
>>>>>>>
>>>>>>> OK, I'll check up on this, but probably won't have time to do so before
>>>>>>> early next week.
>>>>>>
>>>>>> Can you try with this patch? It's not complete yet, there's actually a
>>>>>> bunch of things we can do to improve the direct descriptor case. But
>>>>>> this one is easy enough to pull off, and I think it'll fix your OOM
>>>>>> case. Not a proposed patch, but it'll prove the theory.
>>>>>
>>>>> Sorry for the delay..
>>>>>
>>>>> Patch works like charm.
>>>>
>>>> OK good, then it is the issue I suspected. Thanks for testing!
>>>
>>> Tested with v5.18-rc3 and performance seems significantly worse than
>>> with the test patch:
>>>
>>> test patch:
>>>         avg     min     max     stdev
>>> real    0.205   0.190   0.266   0.011
>>> user    0.017   0.007   0.029   0.004
>>> sys     0.374   0.336   0.503   0.022
>>>
>>> 5.18.0-rc3-00016-gb253435746d9:
>>>         avg     min     max     stdev
>>> real    0.725   0.200   18.090  2.279
>>> user    0.019   0.005   0.046   0.006
>>> sys     0.454   0.241   1.022   0.199
>>
>> It's been a month and I don't remember details of which patches were
>> tested, when you say "test patch", which one exactly are you referring
>> to and what base was it applied on?
> 
> https://lore.kernel.org/all/47912c4c-ccc2-0678-6c2f-3e3c0dd1f04b@kernel.dk/
> 
> The base is a good question, it was after the basic fixed slot
> assignment issues were fixed.

Gotcha, ok then this makes sense. The ordering issues were sorted out
for 5.18-rc3, but the direct descriptor optimization is only in the 5.19
branch.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: io_uring_prep_openat_direct() and link/drain
  2022-04-21 12:41                                                           ` Jens Axboe
@ 2022-04-21 13:10                                                             ` Miklos Szeredi
  0 siblings, 0 replies; 32+ messages in thread
From: Miklos Szeredi @ 2022-04-21 13:10 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring

On Thu, 21 Apr 2022 at 14:41, Jens Axboe <axboe@kernel.dk> wrote:

> Gotcha, ok then this makes sense. The ordering issues were sorted out
> for 5.18-rc3, but the direct descriptor optimization is only in the 5.19
> branch.

Okay, can you please point me to the branch that I could test.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2022-04-21 13:10 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-29 13:20 io_uring_prep_openat_direct() and link/drain Miklos Szeredi
2022-03-29 16:08 ` Jens Axboe
2022-03-29 17:04   ` Jens Axboe
2022-03-29 18:21     ` Miklos Szeredi
2022-03-29 18:26       ` Jens Axboe
2022-03-29 18:31         ` Miklos Szeredi
2022-03-29 18:40           ` Jens Axboe
2022-03-29 19:30             ` Miklos Szeredi
2022-03-29 20:03               ` Jens Axboe
2022-03-30  8:18                 ` Miklos Szeredi
2022-03-30 12:35                   ` Jens Axboe
2022-03-30 12:43                     ` Miklos Szeredi
2022-03-30 12:48                       ` Jens Axboe
2022-03-30 12:51                         ` Miklos Szeredi
2022-03-30 14:58                           ` Miklos Szeredi
2022-03-30 15:05                             ` Jens Axboe
2022-03-30 15:12                               ` Miklos Szeredi
2022-03-30 15:17                                 ` Jens Axboe
2022-03-30 15:53                                   ` Jens Axboe
2022-03-30 17:49                                     ` Jens Axboe
2022-04-01  8:40                                       ` Miklos Szeredi
2022-04-01 15:36                                         ` Jens Axboe
2022-04-01 16:02                                           ` Miklos Szeredi
2022-04-01 16:21                                             ` Jens Axboe
2022-04-02  1:17                                               ` Jens Axboe
2022-04-05  7:45                                                 ` Miklos Szeredi
2022-04-05 14:44                                                   ` Jens Axboe
2022-04-21 12:31                                                     ` Miklos Szeredi
2022-04-21 12:34                                                       ` Jens Axboe
2022-04-21 12:39                                                         ` Miklos Szeredi
2022-04-21 12:41                                                           ` Jens Axboe
2022-04-21 13:10                                                             ` Miklos Szeredi

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.