Re: Expense of read_iter

From: Mikulas Patocka <mpatocka@redhat.com>
To: Matthew Wilcox <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>,
	Mingkai Dong <mingkaidong@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Jan Kara <jack@suse.cz>, Steven Whitehouse <swhiteho@redhat.com>,
	Eric Sandeen <esandeen@redhat.com>,
	Dave Chinner <dchinner@redhat.com>, Theodore Ts'o <tytso@mit.edu>,
	Wang Jianchao <jianchao.wan9@gmail.com>,
	"Tadakamadla, Rajesh" <rajesh.tadakamadla@hpe.com>,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-nvdimm@lists.01.org
Subject: Re: Expense of read_iter
Date: Mon, 11 Jan 2021 16:10:11 -0500 (EST)	[thread overview]
Message-ID: <alpine.LRH.2.02.2101111126150.31017@file01.intranet.prod.int.rdu2.redhat.com> (raw)
In-Reply-To: <20210111001805.GD35215@casper.infradead.org>



On Mon, 11 Jan 2021, Matthew Wilcox wrote:

> On Sun, Jan 10, 2021 at 04:19:15PM -0500, Mikulas Patocka wrote:
> > I put counters into vfs_read and vfs_readv.
> > 
> > After a fresh boot of the virtual machine, the counters show "13385 4". 
> > After a kernel compilation they show "4475220 8".
> > 
> > So, the readv path is almost unused.
> > 
> > My reasoning was that we should optimize for the "read" path and glue the 
> > "readv" path on the top of that. Currently, the kernel is doing the 
> > opposite - optimizing for "readv" and glueing "read" on the top of it.
> 
> But it's not about optimising for read vs readv.  read_iter handles
> a host of other cases, such as pread(), preadv(), AIO reads, splice,
> and reads to in-kernel buffers.

These things are used rarely compared to "read" and "pread". (BTW. "pread" 
could be handled by the read method too).

What's the reason why do you think that the "read" syscall should use the 
"read_iter" code path? Is it because duplicating the logic is discouraged? 
Or because of code size? Or something else?

> Some device drivers abused read() vs readv() to actually return different 
> information, depending which you called.  That's why there's now a
> prohibition against both.
> 
> So let's figure out how to make iter_read() perform well for sys_read().

I've got another idea - in nvfs_read_iter, test if the iovec contains just 
one entry and call nvfs_read_locked if it does.

diff --git a/file.c b/file.c
index f4b8a1a..e4d87b2 100644
--- a/file.c
+++ b/file.c
@@ -460,6 +460,10 @@ static ssize_t nvfs_read_iter(struct kiocb *iocb, struct iov_iter *iov)
 	if (!IS_DAX(&nmi->vfs_inode)) {
 		r = generic_file_read_iter(iocb, iov);
 	} else {
+		if (likely(iter_is_iovec(iov)) && likely(!iov->iov_offset) && likely(iov->nr_segs == 1)) {
+			r = nvfs_read_locked(nmi, iocb->ki_filp, iov->iov->iov_base, iov->count, true, &iocb->ki_pos);
+			goto unlock_ret;
+		}
 #if 1
 		r = nvfs_rw_iter_locked(iocb, iov, false);
 #else
@@ -467,6 +471,7 @@ static ssize_t nvfs_read_iter(struct kiocb *iocb, struct iov_iter *iov)
 #endif
 	}
 
+unlock_ret:
 	inode_unlock_shared(&nmi->vfs_inode);
 
 	return r;



The result is:

nvfs_read_iter			- 7.307s
Al Viro's read_iter_locked	- 7.147s
test for just one entry		- 7.010s
the read method			- 6.782s

So far, this is the best way how to do it, but it's still 3.3% worse than 
the read method. There's not anything more that could be optimized on the 
filesystem level - the rest of optimizations must be done in the VFS.

Mikulas
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org