linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jens Axboe <axboe@kernel.dk>
To: Lennert Buytenhek <buytenh@wantstofly.org>
Cc: linux-kernel@vger.kernel.org, io-uring@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-btrfs@vger.kernel.org
Subject: Re: [RFC PATCH] io_uring: add support for IORING_OP_GETDENTS64
Date: Sat, 23 Jan 2021 10:37:25 -0700	[thread overview]
Message-ID: <b5b978ee-1a56-ead7-43bc-83ae2398b160@kernel.dk> (raw)
In-Reply-To: <20210123114152.GA120281@wantstofly.org>

On 1/23/21 4:41 AM, Lennert Buytenhek wrote:
> IORING_OP_GETDENTS64 behaves like getdents64(2) and takes the same
> arguments.
> 
> Signed-off-by: Lennert Buytenhek <buytenh@wantstofly.org>
> ---
> This seems to work OK, but I'd appreciate a review from someone more
> familiar with io_uring internals than I am, as I'm not entirely sure
> I did everything quite right.
> 
> A dumb test program for IORING_OP_GETDENTS64 is available here:
> 
> 	https://krautbox.wantstofly.org/~buytenh/uringfind.c
> 
> This does more or less what find(1) does: it scans recursively through
> a directory tree and prints the names of all directories and files it
> encounters along the way -- but then using io_uring.  (The uring version
> prints the names of encountered files and directories in an order that's
> determined by SQE completion order, which is somewhat nondeterministic
> and likely to differ between runs.)
> 
> On a directory tree with 14-odd million files in it that's on a
> six-drive (spinning disk) btrfs raid, find(1) takes:
> 
> 	# echo 3 > /proc/sys/vm/drop_caches 
> 	# time find /mnt/repo > /dev/null
> 
> 	real    24m7.815s
> 	user    0m15.015s
> 	sys     0m48.340s
> 	#
> 
> And the io_uring version takes:
> 
> 	# echo 3 > /proc/sys/vm/drop_caches 
> 	# time ./uringfind /mnt/repo > /dev/null
> 
> 	real    10m29.064s
> 	user    0m4.347s
> 	sys     0m1.677s
> 	#
> 
> These timings are repeatable and consistent to within a few seconds.
> 
> (btrfs seems to be sending most metadata reads to the same drive in the
> array during this test, even though this filesystem is using the raid1c4
> profile for metadata, so I suspect that more drive-level parallelism can
> be extracted with some btrfs tweaks.)
> 
> The fully cached case also shows some speedup for the io_uring version:
> 
> 	# time find /mnt/repo > /dev/null
> 
> 	real    0m5.223s
> 	user    0m1.926s
> 	sys     0m3.268s
> 	#
> 
> vs:
> 
> 	# time ./uringfind /mnt/repo > /dev/null
> 
> 	real    0m3.604s
> 	user    0m2.417s
> 	sys     0m0.793s
> 	#
> 
> That said, the point of this patch isn't primarily to enable
> lightning-fast find(1) or du(1), but more to complete the set of
> filesystem I/O primitives available via io_uring, so that applications
> can do all of their filesystem I/O using the same mechanism, without
> having to manually punt some of their work out to worker threads -- and
> indeed, an object storage backend server that I wrote a while ago can
> run with a pure io_uring based event loop with this patch.

The results look nice for sure. Once concern is that io_uring generally
guarantees that any state passed in is stable once submit is done. For
the below implementation, that doesn't hold as the linux_dirent64 isn't
used until later in the process. That means if you do:

submit_getdents64(ring)
{
	struct linux_dirent64 dent;
	struct io_uring_sqe *sqe;

	sqe = io_uring_get_sqe(ring);
	io_uring_prep_getdents64(sqe, ..., &dent);
	io_uring_submit(ring);
}

other_func(ring)
{
	struct io_uring_cqe *cqe;

	submit_getdents64(ring);
	io_uring_wait_cqe(ring, &cqe);
	
}

then the kernel side might get garbage by the time the sqe is actually
submitted. This is true because you don't use it inline, only from the
out-of-line async context. Usually this is solved by having the prep
side copy in the necessary state, eg see io_openat2_prep() for how we
make filename and open_how stable by copying them into kernel memory.
That ensures that if/when these operations need to go async and finish
out-of-line, the contents are stable and there's no requirement for the
application to keep them valid once submission is done.

Not sure how best to solve that, since the vfs side relies heavily on
linux_dirent64 being a user pointer...

Outside of that, implementation looks straight forward.

-- 
Jens Axboe


  reply	other threads:[~2021-01-23 17:38 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-01-23 11:41 [RFC PATCH] io_uring: add support for IORING_OP_GETDENTS64 Lennert Buytenhek
2021-01-23 17:37 ` Jens Axboe [this message]
2021-01-23 18:16   ` Lennert Buytenhek
2021-01-23 18:22     ` Jens Axboe
2021-01-23 23:27 ` Matthew Wilcox
2021-01-23 23:33   ` Jens Axboe
2021-01-28 22:30     ` Lennert Buytenhek
2021-01-23 23:50 ` Andres Freund
2021-01-23 23:56   ` Andres Freund
2021-01-24  1:59   ` Al Viro
2021-01-24  2:17     ` Andres Freund
2021-01-24 22:21 ` David Laight
2021-01-28 23:07   ` Lennert Buytenhek
2021-01-29  5:37     ` Lennert Buytenhek
2021-01-29 10:20       ` Lennert Buytenhek

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=b5b978ee-1a56-ead7-43bc-83ae2398b160@kernel.dk \
    --to=axboe@kernel.dk \
    --cc=buytenh@wantstofly.org \
    --cc=io-uring@vger.kernel.org \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).