* [RFC v2] nvfs: a filesystem for persistent memory
@ 2021-01-07 13:15 Mikulas Patocka
  2021-01-07 15:11 ` Expense of read_iter Matthew Wilcox
  2021-01-10 16:20 ` [RFC v2] nvfs: a filesystem for persistent memory Al Viro
  0 siblings, 2 replies; 30+ messages in thread
From: Mikulas Patocka @ 2021-01-07 13:15 UTC (permalink / raw)
  To: Andrew Morton, Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Matthew Wilcox, Jan Kara, Steven Whitehouse, Eric Sandeen,
	Dave Chinner, Theodore Ts'o, Wang Jianchao, Kani, Toshi,
	Norton, Scott J, Tadakamadla, Rajesh
  Cc: linux-kernel, linux-fsdevel, linux-nvdimm

Hi

I announce a new version of NVFS - a filesystem for persistent memory.
	http://people.redhat.com/~mpatocka/nvfs/
	git://leontynka.twibright.com/nvfs.git

Changes since the last release:

* I added a microjournal to the filesystem; it can hold up to 16 entries. 
  Each CPU has its own journal, so that there is no lock contention. The 
  journal is used to provide atomicity of rename() and extended attribute 
  replace. (A rough sketch of such a per-CPU journal appears after this 
  list.)
  (note that file creation or deletion doesn't use the journal, because 
  these operations can be deterministically cleaned up by fsck)

* I created a framework that can be used to verify the filesystem driver. 
  It logs all writes and memory barriers to a file, the entries in the 
  file are randomly reordered (to simulate reordering in the CPU 
  write-combining buffers), the sequence is cut at a random point (to 
  simulate a system crash) and the result is replayed on a filesystem 
  image.
  With this framework, we can for example check that if a crash happens 
  during rename(), either the old name or the new name will be present in 
  the directory.
  This framework helped to find a few bugs in the sequencing of writes (a 
  sketch of the reorder-and-replay step appears after this list).

* If we map an executable image, we turn off the DAX flag on the inode 
  (because executables run 4% slower from persistent memory). There is 
  also a switch that can force DAX always off or always on.
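
To illustrate the first item (this is just a sketch, not NVFS's actual 
on-disk layout - the names, field sizes and redo-style semantics below 
are only assumptions made for the example), such a per-CPU micro-journal 
could look roughly like:

#include <stdint.h>

#define NVJ_ENTRIES 16		/* per-CPU journal capacity */

struct nvj_entry {
	uint64_t offset;	/* media location touched by the operation */
	uint64_t value;		/* value to (re)apply during recovery */
};

struct nvj_journal {		/* one instance per CPU - no shared lock */
	uint64_t commit_seq;	/* bumped and flushed to commit atomically */
	uint32_t n_entries;	/* 0..NVJ_ENTRIES, 0 means idle */
	uint32_t pad;
	struct nvj_entry e[NVJ_ENTRIES];
};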

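The reorder-and-replay step of the verification framework can be sketched 
like this (illustrative only - the real tool's log format, naming and 
details differ):

#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

struct log_rec {
	uint64_t off;		/* byte offset in the filesystem image */
	uint32_t len;		/* 0 marks a memory barrier */
	unsigned char *data;
};

/* Fisher-Yates shuffle of one barrier-delimited window */
static void shuffle(struct log_rec *r, size_t n)
{
	size_t i;
	for (i = n; i > 1; i--) {
		size_t j = (size_t)rand() % i;
		struct log_rec tmp = r[i - 1];
		r[i - 1] = r[j];
		r[j] = tmp;
	}
}

/* Reorder, cut and replay one logged run onto the image at img_fd. */
static void simulate_crash(struct log_rec *rec, size_t n, int img_fd)
{
	size_t start = 0, cut, i;

	/* writes may be reordered freely between barriers, never across them */
	for (i = 0; i <= n; i++) {
		if (i == n || rec[i].len == 0) {
			shuffle(&rec[start], i - start);
			start = i + 1;
		}
	}

	/* cut the sequence at a random point to simulate the crash ... */
	cut = (size_t)rand() % (n + 1);

	/* ... and replay whatever "reached the media" before it */
	for (i = 0; i < cut; i++)
		if (rec[i].len)
			pwrite(img_fd, rec[i].data, rec[i].len, rec[i].off);

	/* afterwards: fsck/mount the image and check the invariants, e.g.
	   that after a crashed rename() either the old or the new name
	   exists in the directory */
}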



I'd like to ask about this piece of code in __kernel_read:
	if (unlikely(!file->f_op->read_iter || file->f_op->read))
		return warn_unsupported...
and __kernel_write:
	if (unlikely(!file->f_op->write_iter || file->f_op->write))
		return warn_unsupported...

- It exits with an error if both read_iter and read (or both write_iter 
and write) are present.

I found out that on NVFS, reading a file with the read method has 10% 
better performance than the read_iter method. The benchmark just reads the 
same 4k page over and over again - and the cost of creating and parsing 
the kiocb and iov_iter structures is just that high.
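
(For reference, a plain read()/pread() reaches ->read_iter through 
roughly the following VFS wrapper - a paraphrase of new_sync_read() in 
fs/read_write.c from around the 5.10 era, not the exact code - so every 
4k read also has to build a kiocb, an iovec and an iov_iter on the 
stack:)

static ssize_t new_sync_read(struct file *filp, char __user *buf,
			     size_t len, loff_t *ppos)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	struct kiocb kiocb;
	struct iov_iter iter;
	ssize_t ret;

	init_sync_kiocb(&kiocb, filp);	/* fills ki_flags via iocb_flags() */
	kiocb.ki_pos = ppos ? *ppos : 0;
	iov_iter_init(&iter, READ, &iov, 1, len);

	ret = call_read_iter(filp, &kiocb, &iter);
	if (ppos)
		*ppos = kiocb.ki_pos;
	return ret;
}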

So, I'd like to have both read and read_iter methods. Could the above 
conditions be changed, so that they don't fail with an error if the "read" 
or "write" method is present?

Mikulas


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Expense of read_iter
  2021-01-07 13:15 [RFC v2] nvfs: a filesystem for persistent memory Mikulas Patocka
@ 2021-01-07 15:11 ` Matthew Wilcox
  2021-01-07 16:43   ` Mingkai Dong
  2021-01-07 18:59   ` Mikulas Patocka
  2021-01-10 16:20 ` [RFC v2] nvfs: a filesystem for persistent memory Al Viro
  1 sibling, 2 replies; 30+ messages in thread
From: Matthew Wilcox @ 2021-01-07 15:11 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Andrew Morton, Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Jan Kara, Steven Whitehouse, Eric Sandeen, Dave Chinner,
	Theodore Ts'o, Wang Jianchao, Kani, Toshi, Norton, Scott J,
	Tadakamadla, Rajesh, linux-kernel, linux-fsdevel, linux-nvdimm

On Thu, Jan 07, 2021 at 08:15:41AM -0500, Mikulas Patocka wrote:
> I'd like to ask about this piece of code in __kernel_read:
> 	if (unlikely(!file->f_op->read_iter || file->f_op->read))
> 		return warn_unsupported...
> and __kernel_write:
> 	if (unlikely(!file->f_op->write_iter || file->f_op->write))
> 		return warn_unsupported...
> 
> - It exits with an error if both read_iter and read or write_iter and 
> write are present.
> 
> I found out that on NVFS, reading a file with the read method has 10% 
> better performance than the read_iter method. The benchmark just reads the 
> same 4k page over and over again - and the cost of creating and parsing 
> the kiocb and iov_iter structures is just that high.

Which part of it is so expensive?  Is it worth, eg adding an iov_iter
type that points to a single buffer instead of a single-member iov?

+++ b/include/linux/uio.h
@@ -19,6 +19,7 @@ struct kvec {
 
 enum iter_type {
        /* iter types */
+       ITER_UBUF = 2,
        ITER_IOVEC = 4,
        ITER_KVEC = 8,
        ITER_BVEC = 16,
@@ -36,6 +36,7 @@ struct iov_iter {
        size_t iov_offset;
        size_t count;
        union {
+               void __user *buf;
                const struct iovec *iov;
                const struct kvec *kvec;
                const struct bio_vec *bvec;

and then doing all the appropriate changes to make that work.
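
The initialisation for such an iter could be as simple as (purely 
illustrative - iov_iter_ubuf() does not exist in the kernel, and the 
ITER_UBUF case would still have to be wired into the iterate/copy 
primitives in lib/iov_iter.c):

static inline void iov_iter_ubuf(struct iov_iter *i, unsigned int direction,
				 void __user *buf, size_t count)
{
	*i = (struct iov_iter) {
		.type = ITER_UBUF | direction,
		.buf = buf,
		.iov_offset = 0,
		.count = count,
	};
}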

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Expense of read_iter
  2021-01-07 15:11 ` Expense of read_iter Matthew Wilcox
@ 2021-01-07 16:43   ` Mingkai Dong
  2021-01-12 13:45     ` Zhongwei Cai
  2021-01-07 18:59   ` Mikulas Patocka
  1 sibling, 1 reply; 30+ messages in thread
From: Mingkai Dong @ 2021-01-07 16:43 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Mikulas Patocka, Andrew Morton, Jan Kara, Steven Whitehouse,
	Eric Sandeen, Dave Chinner, Theodore Ts'o, Wang Jianchao,
	Tadakamadla, Rajesh, linux-kernel, linux-fsdevel, linux-nvdimm,
	sunrise_l

Hi Matthew,

We have also discovered the expense of `->read_iter` in our study on Ext4-DAX.
In single-thread 4K-reads, the `->read` version could outperform `->read_iter`
by 41.6% in terms of throughput.

According to our observation and evaluation, at least for Ext4-DAX, the cost
also comes from the invocation of `->iomap_begin` (`ext4_iomap_begin`),
which might not be simply avoided by adding a new iter_type.
The slowdown is more significant when multiple threads read different files
concurrently, due to a scalability issue (grabbing a read lock to check the
status of the journal) in `ext4_iomap_begin`.

In our solution, we implemented the `->read` and `->write` interfaces for
Ext4-DAX. Thus, we also think it would be good if both `->read` and `->read_iter`
could exist.

By the way, besides the implementation of `->read` and `->write`, we have
some other optimizations for Ext4-DAX and would like to share them once our
patches are prepared.

Thanks,
Mingkai


> On Jan 7, 2021, at 23:11, Matthew Wilcox <willy@infradead.org> wrote:
> 
> On Thu, Jan 07, 2021 at 08:15:41AM -0500, Mikulas Patocka wrote:
>> I'd like to ask about this piece of code in __kernel_read:
>> 	if (unlikely(!file->f_op->read_iter || file->f_op->read))
>> 		return warn_unsupported...
>> and __kernel_write:
>> 	if (unlikely(!file->f_op->write_iter || file->f_op->write))
>> 		return warn_unsupported...
>> 
>> - It exits with an error if both read_iter and read or write_iter and 
>> write are present.
>> 
>> I found out that on NVFS, reading a file with the read method has 10% 
>> better performance than the read_iter method. The benchmark just reads the 
>> same 4k page over and over again - and the cost of creating and parsing 
>> the kiocb and iov_iter structures is just that high.
> 
> Which part of it is so expensive?  Is it worth, eg adding an iov_iter
> type that points to a single buffer instead of a single-member iov?
> 
> +++ b/include/linux/uio.h
> @@ -19,6 +19,7 @@ struct kvec {
> 
> enum iter_type {
>        /* iter types */
> +       ITER_UBUF = 2,
>        ITER_IOVEC = 4,
>        ITER_KVEC = 8,
>        ITER_BVEC = 16,
> @@ -36,6 +36,7 @@ struct iov_iter {
>        size_t iov_offset;
>        size_t count;
>        union {
> +               void __user *buf;
>                const struct iovec *iov;
>                const struct kvec *kvec;
>                const struct bio_vec *bvec;
> 
> and then doing all the appropriate changes to make that work.
> _______________________________________________
> Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
> To unsubscribe send an email to linux-nvdimm-leave@lists.01.org


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Expense of read_iter
  2021-01-07 15:11 ` Expense of read_iter Matthew Wilcox
  2021-01-07 16:43   ` Mingkai Dong
@ 2021-01-07 18:59   ` Mikulas Patocka
  2021-01-10  6:13     ` Matthew Wilcox
  1 sibling, 1 reply; 30+ messages in thread
From: Mikulas Patocka @ 2021-01-07 18:59 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Jan Kara, Steven Whitehouse, Eric Sandeen, Dave Chinner,
	Theodore Ts'o, Wang Jianchao, Kani, Toshi, Norton, Scott J,
	Tadakamadla, Rajesh, linux-kernel, linux-fsdevel, linux-nvdimm



On Thu, 7 Jan 2021, Matthew Wilcox wrote:

> On Thu, Jan 07, 2021 at 08:15:41AM -0500, Mikulas Patocka wrote:
> > I'd like to ask about this piece of code in __kernel_read:
> > 	if (unlikely(!file->f_op->read_iter || file->f_op->read))
> > 		return warn_unsupported...
> > and __kernel_write:
> > 	if (unlikely(!file->f_op->write_iter || file->f_op->write))
> > 		return warn_unsupported...
> > 
> > - It exits with an error if both read_iter and read or write_iter and 
> > write are present.
> > 
> > I found out that on NVFS, reading a file with the read method has 10% 
> > better performance than the read_iter method. The benchmark just reads the 
> > same 4k page over and over again - and the cost of creating and parsing 
> > the kiocb and iov_iter structures is just that high.
> 
> Which part of it is so expensive?

The read_iter path is much bigger:
vfs_read		- 0x160 bytes
new_sync_read		- 0x160 bytes
nvfs_rw_iter		- 0x100 bytes
nvfs_rw_iter_locked	- 0x4a0 bytes
iov_iter_advance	- 0x300 bytes

If we go with the "read" method, there's just:
vfs_read		- 0x160 bytes
nvfs_read		- 0x200 bytes

> Is it worth, eg adding an iov_iter
> type that points to a single buffer instead of a single-member iov?
> 
> +++ b/include/linux/uio.h
> @@ -19,6 +19,7 @@ struct kvec {
>  
>  enum iter_type {
>         /* iter types */
> +       ITER_UBUF = 2,
>         ITER_IOVEC = 4,
>         ITER_KVEC = 8,
>         ITER_BVEC = 16,
> @@ -36,6 +36,7 @@ struct iov_iter {
>         size_t iov_offset;
>         size_t count;
>         union {
> +               void __user *buf;
>                 const struct iovec *iov;
>                 const struct kvec *kvec;
>                 const struct bio_vec *bvec;
> 
> and then doing all the appropriate changes to make that work.


I tried this benchmark on nvfs:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
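
/* reads the same 4 KiB at offset 0 of fd 0 over and over;
   run it with the test file redirected to stdin */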

int main(void)
{
	unsigned long i;
	unsigned long l = 1UL << 38;
	unsigned s = 4096;
	void *a = valloc(s);
	if (!a) perror("valloc"), exit(1);
	for (i = 0; i < l; i += s) {
		if (pread(0, a, s, 0) != s) perror("read"), exit(1);
	}
	return 0;
}


Result, using the read_iter method:

# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 3K of event 'cycles'
# Event count (approx.): 1049885560
#
# Overhead  Command  Shared Object     Symbol                               
# ........  .......  ................  .....................................
#
    47.32%  pread    [kernel.vmlinux]  [k] copy_user_generic_string
     7.83%  pread    [kernel.vmlinux]  [k] current_time
     6.57%  pread    [nvfs]            [k] nvfs_rw_iter_locked
     5.59%  pread    [kernel.vmlinux]  [k] entry_SYSCALL_64
     4.23%  pread    libc-2.31.so      [.] __libc_pread
     3.51%  pread    [kernel.vmlinux]  [k] syscall_return_via_sysret
     2.34%  pread    [kernel.vmlinux]  [k] entry_SYSCALL_64_after_hwframe
     2.34%  pread    [kernel.vmlinux]  [k] vfs_read
     2.34%  pread    [kernel.vmlinux]  [k] __fsnotify_parent
     2.31%  pread    [kernel.vmlinux]  [k] new_sync_read
     2.21%  pread    [nvfs]            [k] nvfs_bmap
     1.89%  pread    [kernel.vmlinux]  [k] iov_iter_advance
     1.71%  pread    [kernel.vmlinux]  [k] __x64_sys_pread64
     1.59%  pread    [kernel.vmlinux]  [k] atime_needs_update
     1.24%  pread    [nvfs]            [k] nvfs_rw_iter
     0.94%  pread    [kernel.vmlinux]  [k] touch_atime
     0.75%  pread    [kernel.vmlinux]  [k] syscall_enter_from_user_mode
     0.72%  pread    [kernel.vmlinux]  [k] ktime_get_coarse_real_ts64
     0.68%  pread    [kernel.vmlinux]  [k] down_read
     0.62%  pread    [kernel.vmlinux]  [k] exit_to_user_mode_prepare
     0.52%  pread    [kernel.vmlinux]  [k] syscall_exit_to_user_mode
     0.49%  pread    [kernel.vmlinux]  [k] syscall_exit_to_user_mode_prepare
     0.47%  pread    [kernel.vmlinux]  [k] __fget_light
     0.46%  pread    [kernel.vmlinux]  [k] do_syscall_64
     0.42%  pread    pread             [.] main
     0.33%  pread    [kernel.vmlinux]  [k] up_read
     0.29%  pread    [kernel.vmlinux]  [k] iov_iter_init
     0.16%  pread    [kernel.vmlinux]  [k] __fdget
     0.10%  pread    [kernel.vmlinux]  [k] entry_SYSCALL_64_safe_stack
     0.03%  pread    pread             [.] pread@plt
     0.00%  perf     [kernel.vmlinux]  [k] x86_pmu_enable_all


#
# (Tip: Use --symfs <dir> if your symbol files are in non-standard locations)
#



Result, using the read method:

# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 3K of event 'cycles'
# Event count (approx.): 1312158116
#
# Overhead  Command  Shared Object     Symbol                               
# ........  .......  ................  .....................................
#
    60.77%  pread    [kernel.vmlinux]  [k] copy_user_generic_string
     6.14%  pread    [kernel.vmlinux]  [k] current_time
     3.88%  pread    [kernel.vmlinux]  [k] entry_SYSCALL_64
     3.55%  pread    libc-2.31.so      [.] __libc_pread
     3.04%  pread    [nvfs]            [k] nvfs_bmap
     2.84%  pread    [kernel.vmlinux]  [k] syscall_return_via_sysret
     2.71%  pread    [nvfs]            [k] nvfs_read
     2.56%  pread    [kernel.vmlinux]  [k] entry_SYSCALL_64_after_hwframe
     2.00%  pread    [kernel.vmlinux]  [k] __x64_sys_pread64
     1.98%  pread    [kernel.vmlinux]  [k] __fsnotify_parent
     1.77%  pread    [kernel.vmlinux]  [k] vfs_read
     1.35%  pread    [kernel.vmlinux]  [k] atime_needs_update
     0.94%  pread    [kernel.vmlinux]  [k] exit_to_user_mode_prepare
     0.91%  pread    [kernel.vmlinux]  [k] __fget_light
     0.83%  pread    [kernel.vmlinux]  [k] syscall_enter_from_user_mode
     0.70%  pread    [kernel.vmlinux]  [k] down_read
     0.70%  pread    [kernel.vmlinux]  [k] touch_atime
     0.65%  pread    [kernel.vmlinux]  [k] ktime_get_coarse_real_ts64
     0.55%  pread    [kernel.vmlinux]  [k] syscall_exit_to_user_mode
     0.49%  pread    [kernel.vmlinux]  [k] up_read
     0.44%  pread    [kernel.vmlinux]  [k] do_syscall_64
     0.39%  pread    [kernel.vmlinux]  [k] syscall_exit_to_user_mode_prepare
     0.34%  pread    pread             [.] main
     0.26%  pread    [kernel.vmlinux]  [k] __fdget
     0.10%  pread    pread             [.] pread@plt
     0.10%  pread    [kernel.vmlinux]  [k] entry_SYSCALL_64_safe_stack
     0.00%  perf     [kernel.vmlinux]  [k] x86_pmu_enable_all


#
# (Tip: To set sample time separation other than 100ms with --sort time use --time-quantum)
#


Note that if we sum the percentages of nvfs_rw_iter_locked (6.57%), 
new_sync_read (2.31%), iov_iter_advance (1.89%) and nvfs_rw_iter (1.24%), 
we get 12.01%. On the other hand, in the second trace, nvfs_read consumes 
just 2.71% - and it replaces the functionality of all these functions.

That is the reason for that 10% degradation with read_iter.

Mikulas


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Expense of read_iter
  2021-01-07 18:59   ` Mikulas Patocka
@ 2021-01-10  6:13     ` Matthew Wilcox
  2021-01-10 21:19       ` Mikulas Patocka
  2021-01-11 10:11       ` David Laight
  0 siblings, 2 replies; 30+ messages in thread
From: Matthew Wilcox @ 2021-01-10  6:13 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Andrew Morton, Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Jan Kara, Steven Whitehouse, Eric Sandeen, Dave Chinner,
	Theodore Ts'o, Wang Jianchao, Kani, Toshi, Norton, Scott J,
	Tadakamadla, Rajesh, linux-kernel, linux-fsdevel, linux-nvdimm

On Thu, Jan 07, 2021 at 01:59:01PM -0500, Mikulas Patocka wrote:
> On Thu, 7 Jan 2021, Matthew Wilcox wrote:
> > On Thu, Jan 07, 2021 at 08:15:41AM -0500, Mikulas Patocka wrote:
> > > I'd like to ask about this piece of code in __kernel_read:
> > > 	if (unlikely(!file->f_op->read_iter || file->f_op->read))
> > > 		return warn_unsupported...
> > > and __kernel_write:
> > > 	if (unlikely(!file->f_op->write_iter || file->f_op->write))
> > > 		return warn_unsupported...
> > > 
> > > - It exits with an error if both read_iter and read or write_iter and 
> > > write are present.
> > > 
> > > I found out that on NVFS, reading a file with the read method has 10% 
> > > better performance than the read_iter method. The benchmark just reads the 
> > > same 4k page over and over again - and the cost of creating and parsing 
> > > the kiocb and iov_iter structures is just that high.
> > 
> > Which part of it is so expensive?
> 
> The read_iter path is much bigger:
> vfs_read		- 0x160 bytes
> new_sync_read		- 0x160 bytes
> nvfs_rw_iter		- 0x100 bytes
> nvfs_rw_iter_locked	- 0x4a0 bytes
> iov_iter_advance	- 0x300 bytes

Number of bytes in a function isn't really correlated with how expensive
a particular function is.  That said, looking at new_sync_read() shows
one part that's particularly bad, init_sync_kiocb():

static inline int iocb_flags(struct file *file)
{
        int res = 0;
        if (file->f_flags & O_APPEND)
                res |= IOCB_APPEND;
     7ec:       8b 57 40                mov    0x40(%rdi),%edx
     7ef:       48 89 75 80             mov    %rsi,-0x80(%rbp)
        if (file->f_flags & O_DIRECT)
     7f3:       89 d0                   mov    %edx,%eax
     7f5:       c1 e8 06                shr    $0x6,%eax
     7f8:       83 e0 10                and    $0x10,%eax
                res |= IOCB_DIRECT;
        if ((file->f_flags & O_DSYNC) || IS_SYNC(file->f_mapping->host))
     7fb:       89 c1                   mov    %eax,%ecx
     7fd:       81 c9 00 00 02 00       or     $0x20000,%ecx
     803:       f6 c6 40                test   $0x40,%dh
     806:       0f 45 c1                cmovne %ecx,%eax
                res |= IOCB_DSYNC;
     809:       f6 c6 10                test   $0x10,%dh
     80c:       75 18                   jne    826 <new_sync_read+0x66>
     80e:       48 8b 8f d8 00 00 00    mov    0xd8(%rdi),%rcx
     815:       48 8b 09                mov    (%rcx),%rcx
     818:       48 8b 71 28             mov    0x28(%rcx),%rsi
     81c:       f6 46 50 10             testb  $0x10,0x50(%rsi)
     820:       0f 84 e2 00 00 00       je     908 <new_sync_read+0x148>
        if (file->f_flags & __O_SYNC)
     826:       83 c8 02                or     $0x2,%eax
                res |= IOCB_SYNC;
        return res;
     829:       89 c1                   mov    %eax,%ecx
     82b:       83 c9 04                or     $0x4,%ecx
     82e:       81 e2 00 00 10 00       and    $0x100000,%edx

We could optimise this by, eg, checking for (__O_SYNC | O_DIRECT |
O_APPEND) and returning 0 if none of them are set, since they're all
pretty rare.  It might be better to maintain an f_iocb_flags in the
struct file and just copy that unconditionally.  We'd need to remember
to update it in fcntl(F_SETFL), but I think that's the only place.
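
Something like this, as an untested sketch (not actual kernel code; note
that IS_SYNC() on the inode can set IOCB_DSYNC even when no O_* flag is
set, so the early return below doesn't cover that case yet):

static inline int iocb_flags_fast(struct file *file)
{
	int res = 0;

	/* common case: none of the rare flags are set */
	if (likely(!(file->f_flags & (O_APPEND | O_DIRECT | O_DSYNC | __O_SYNC))))
		return 0;

	if (file->f_flags & O_APPEND)
		res |= IOCB_APPEND;
	if (file->f_flags & O_DIRECT)
		res |= IOCB_DIRECT;
	if ((file->f_flags & O_DSYNC) || IS_SYNC(file->f_mapping->host))
		res |= IOCB_DSYNC;
	if (file->f_flags & __O_SYNC)
		res |= IOCB_SYNC;
	return res;
}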


> If we go with the "read" method, there's just:
> vfs_read		- 0x160 bytes
> nvfs_read		- 0x200 bytes
> 
> > Is it worth, eg adding an iov_iter
> > type that points to a single buffer instead of a single-member iov?

>      6.57%  pread    [nvfs]            [k] nvfs_rw_iter_locked
>      2.31%  pread    [kernel.vmlinux]  [k] new_sync_read
>      1.89%  pread    [kernel.vmlinux]  [k] iov_iter_advance
>      1.24%  pread    [nvfs]            [k] nvfs_rw_iter
>      0.29%  pread    [kernel.vmlinux]  [k] iov_iter_init

>      2.71%  pread    [nvfs]            [k] nvfs_read

> Note that if we sum the percentage of nvfs_rw_iter_locked, new_sync_read, 
> iov_iter_advance, nvfs_rw_iter, we get 12.01%. On the other hand, in the 
> second trace, nvfs_read consumes just 2.71% - and it replaces 
> functionality of all these functions.
> 
> That is the reason for that 10% degradation with read_iter.

You seem to be focusing on your argument for "let's just permit
filesystems to implement both ->read and ->read_iter".  My suggestion
is that we need to optimise the ->read_iter path, but to do that we need
to know what's expensive.

nvfs_rw_iter_locked() looks very complicated.  I suspect it can
be simplified.  Of course new_sync_read() needs to be improved too,
as do the other functions here, but fully a third of the difference
between read() and read_iter() is the difference between nvfs_read()
and nvfs_rw_iter_locked().

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC v2] nvfs: a filesystem for persistent memory
  2021-01-07 13:15 [RFC v2] nvfs: a filesystem for persistent memory Mikulas Patocka
  2021-01-07 15:11 ` Expense of read_iter Matthew Wilcox
@ 2021-01-10 16:20 ` Al Viro
  2021-01-10 16:51   ` Al Viro
                     ` (2 more replies)
  1 sibling, 3 replies; 30+ messages in thread
From: Al Viro @ 2021-01-10 16:20 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Andrew Morton, Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Matthew Wilcox, Jan Kara, Steven Whitehouse, Eric Sandeen,
	Dave Chinner, Theodore Ts'o, Wang Jianchao, Kani, Toshi,
	Norton, Scott J, Tadakamadla, Rajesh, linux-kernel,
	linux-fsdevel, linux-nvdimm

On Thu, Jan 07, 2021 at 08:15:41AM -0500, Mikulas Patocka wrote:
> Hi
> 
> I announce a new version of NVFS - a filesystem for persistent memory.
> 	http://people.redhat.com/~mpatocka/nvfs/
Utilities, AFAICS

> 	git://leontynka.twibright.com/nvfs.git
Seems to hang on git pull at the moment...  Do you have it anywhere else?

> I found out that on NVFS, reading a file with the read method has 10% 
> better performance than the read_iter method. The benchmark just reads the 
> same 4k page over and over again - and the cost of creating and parsing 
> the kiocb and iov_iter structures is just that high.

Apples and oranges...  What happens if you take

ssize_t read_iter_locked(struct file *file, struct iov_iter *to, loff_t *ppos)
{
	struct inode *inode = file_inode(file);
	struct nvfs_memory_inode *nmi = i_to_nmi(inode);
	struct nvfs_superblock *nvs = inode->i_sb->s_fs_info;
	ssize_t total = 0;
	loff_t pos = *ppos;
	int r;
	int shift = nvs->log2_page_size;
	size_t i_size;

	i_size = inode->i_size;
	if (pos >= i_size)
		return 0;
	iov_iter_truncate(to, i_size - pos);

	while (iov_iter_count(to)) {
		void *blk, *ptr;
		size_t page_mask = (1UL << shift) - 1;
		unsigned page_offset = pos & page_mask;
		unsigned prealloc = (iov_iter_count(to) + page_mask) >> shift;
		unsigned size;

		blk = nvfs_bmap(nmi, pos >> shift, &prealloc, NULL, NULL, NULL);
		if (unlikely(IS_ERR(blk))) {
			r = PTR_ERR(blk);
			goto ret_r;
		}
		size = ((size_t)prealloc << shift) - page_offset;
		ptr = blk + page_offset;
		if (unlikely(!blk)) {
			size = min(size, (unsigned)PAGE_SIZE);
			ptr = empty_zero_page;
		}
		size = copy_to_iter(to, ptr, size);
		if (unlikely(!size)) {
			r = -EFAULT;
			goto ret_r;
		}

		pos += size;
		total += size;
	} while (iov_iter_count(to));

	r = 0;

ret_r:
	*ppos = pos;

	if (file)
		file_accessed(file);

	return total ? total : r;
}

and use that instead of your nvfs_rw_iter_locked() in your
->read_iter() for DAX read case?  Then the same with
s/copy_to_iter/_copy_to_iter/, to see how much of that is
"hardening" overhead.

Incidentally, what's the point of sharing nvfs_rw_iter() for
read and write cases?  They have practically no overlap -
count the lines common for wr and !wr cases.  And if you
do the same in nvfs_rw_iter_locked(), you'll see that the
shared parts _there_ are bloody pointless on the read side.
Not that it had been more useful on the write side, really,
but that's another story (nvfs_write_pages() handling of
copyin is... interesting).  Let's figure out what's going
on with the read overhead first...

lib/iov_iter.c primitives certainly could use massage for
better code generation, but let's find out how much of the
PITA is due to those and how much comes from you fighting
the damn thing instead of using it sanely...

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC v2] nvfs: a filesystem for persistent memory
  2021-01-10 16:20 ` [RFC v2] nvfs: a filesystem for persistent memory Al Viro
@ 2021-01-10 16:51   ` Al Viro
  2021-01-10 21:14   ` Mikulas Patocka
  2021-01-11 10:29   ` David Laight
  2 siblings, 0 replies; 30+ messages in thread
From: Al Viro @ 2021-01-10 16:51 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Andrew Morton, Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Matthew Wilcox, Jan Kara, Steven Whitehouse, Eric Sandeen,
	Dave Chinner, Theodore Ts'o, Wang Jianchao, Kani, Toshi,
	Norton, Scott J, Tadakamadla, Rajesh, linux-kernel,
	linux-fsdevel, linux-nvdimm

On Sun, Jan 10, 2021 at 04:20:08PM +0000, Al Viro wrote:
> On Thu, Jan 07, 2021 at 08:15:41AM -0500, Mikulas Patocka wrote:
> > Hi
> > 
> > I announce a new version of NVFS - a filesystem for persistent memory.
> > 	http://people.redhat.com/~mpatocka/nvfs/
> Utilities, AFAICS
> 
> > 	git://leontynka.twibright.com/nvfs.git
> Seems to hang on git pull at the moment...  Do you have it anywhere else?

D'oh...  In case it's not obvious from the rest of the reply, I have managed
to grab it - and forgot to remove the question before sending the comments.
My apologies for the confusion; I plead the lack of coffee...

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC v2] nvfs: a filesystem for persistent memory
  2021-01-10 16:20 ` [RFC v2] nvfs: a filesystem for persistent memory Al Viro
  2021-01-10 16:51   ` Al Viro
@ 2021-01-10 21:14   ` Mikulas Patocka
  2021-01-10 23:40     ` Al Viro
  2021-01-11 10:29   ` David Laight
  2 siblings, 1 reply; 30+ messages in thread
From: Mikulas Patocka @ 2021-01-10 21:14 UTC (permalink / raw)
  To: Al Viro
  Cc: Andrew Morton, Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Matthew Wilcox, Jan Kara, Steven Whitehouse, Eric Sandeen,
	Dave Chinner, Theodore Ts'o, Wang Jianchao, Kani, Toshi,
	Norton, Scott J, Tadakamadla, Rajesh, linux-kernel,
	linux-fsdevel, linux-nvdimm



On Sun, 10 Jan 2021, Al Viro wrote:

> On Thu, Jan 07, 2021 at 08:15:41AM -0500, Mikulas Patocka wrote:
> > Hi
> > 
> > I announce a new version of NVFS - a filesystem for persistent memory.
> > 	http://people.redhat.com/~mpatocka/nvfs/
> Utilities, AFAICS
> 
> > 	git://leontynka.twibright.com/nvfs.git
> Seems to hang on git pull at the moment...  Do you have it anywhere else?

I saw some errors 'git-daemon: fatal: the remote end hung up unexpectedly' 
in syslog. I don't know what's causing them.

> > I found out that on NVFS, reading a file with the read method has 10% 
> > better performance than the read_iter method. The benchmark just reads the 
> > same 4k page over and over again - and the cost of creating and parsing 
> > the kiocb and iov_iter structures is just that high.
> 
> Apples and oranges...  What happens if you take
> 
> ssize_t read_iter_locked(struct file *file, struct iov_iter *to, loff_t *ppos)
> {
> 	struct inode *inode = file_inode(file);
> 	struct nvfs_memory_inode *nmi = i_to_nmi(inode);
> 	struct nvfs_superblock *nvs = inode->i_sb->s_fs_info;
> 	ssize_t total = 0;
> 	loff_t pos = *ppos;
> 	int r;
> 	int shift = nvs->log2_page_size;
> 	size_t i_size;
> 
> 	i_size = inode->i_size;
> 	if (pos >= i_size)
> 		return 0;
> 	iov_iter_truncate(to, i_size - pos);
> 
> 	while (iov_iter_count(to)) {
> 		void *blk, *ptr;
> 		size_t page_mask = (1UL << shift) - 1;
> 		unsigned page_offset = pos & page_mask;
> 		unsigned prealloc = (iov_iter_count(to) + page_mask) >> shift;
> 		unsigned size;
> 
> 		blk = nvfs_bmap(nmi, pos >> shift, &prealloc, NULL, NULL, NULL);
> 		if (unlikely(IS_ERR(blk))) {
> 			r = PTR_ERR(blk);
> 			goto ret_r;
> 		}
> 		size = ((size_t)prealloc << shift) - page_offset;
> 		ptr = blk + page_offset;
> 		if (unlikely(!blk)) {
> 			size = min(size, (unsigned)PAGE_SIZE);
> 			ptr = empty_zero_page;
> 		}
> 		size = copy_to_iter(to, ptr, size);
> 		if (unlikely(!size)) {
> 			r = -EFAULT;
> 			goto ret_r;
> 		}
> 
> 		pos += size;
> 		total += size;
> 	} while (iov_iter_count(to));
> 
> 	r = 0;
> 
> ret_r:
> 	*ppos = pos;
> 
> 	if (file)
> 		file_accessed(file);
> 
> 	return total ? total : r;
> }
> 
> and use that instead of your nvfs_rw_iter_locked() in your
> ->read_iter() for DAX read case?  Then the same with
> s/copy_to_iter/_copy_to_iter/, to see how much of that is
> "hardening" overhead.
> 
> Incidentally, what's the point of sharing nvfs_rw_iter() for
> read and write cases?  They have practically no overlap -
> count the lines common for wr and !wr cases.  And if you
> do the same in nvfs_rw_iter_locked(), you'll see that the
> shared parts _there_ are bloody pointless on the read side.

That's a good point. I split nvfs_rw_iter to separate functions 
nvfs_read_iter and nvfs_write_iter - and inlined nvfs_rw_iter_locked into 
both of them. It improved performance by 1.3%.

> Not that it had been more useful on the write side, really,
> but that's another story (nvfs_write_pages() handling of
> copyin is... interesting).  Let's figure out what's going
> on with the read overhead first...
> 
> lib/iov_iter.c primitives certainly could use massage for
> better code generation, but let's find out how much of the
> PITA is due to those and how much comes from you fighing
> the damn thing instead of using it sanely...

The results are:

read:                                           6.744s
read_iter:                                      7.417s
read_iter - separate read and write path:       7.321s
Al's read_iter:                                 7.182s
Al's read_iter with _copy_to_iter:              7.181s

Mikulas


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Expense of read_iter
  2021-01-10  6:13     ` Matthew Wilcox
@ 2021-01-10 21:19       ` Mikulas Patocka
  2021-01-11  0:18         ` Matthew Wilcox
  2021-01-11 10:11       ` David Laight
  1 sibling, 1 reply; 30+ messages in thread
From: Mikulas Patocka @ 2021-01-10 21:19 UTC (permalink / raw)
  To: Matthew Wilcox, Al Viro
  Cc: Andrew Morton, Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Jan Kara, Steven Whitehouse, Eric Sandeen, Dave Chinner,
	Theodore Ts'o, Wang Jianchao, Kani, Toshi, Norton, Scott J,
	Tadakamadla, Rajesh, linux-kernel, linux-fsdevel, linux-nvdimm



On Sun, 10 Jan 2021, Matthew Wilcox wrote:

> > That is the reason for that 10% degradation with read_iter.
> 
> You seem to be focusing on your argument for "let's just permit
> filesystems to implement both ->read and ->read_iter".  My suggestion
> is that we need to optimise the ->read_iter path, but to do that we need
> to know what's expensive.
> 
> nvfs_rw_iter_locked() looks very complicated.  I suspect it can
> be simplified.

I split it into separate read and write functions and it improved 
performance by 1.3%. Using Al Viro's read_iter improves performance by 3%.

> Of course new_sync_read() needs to be improved too,
> as do the other functions here, but fully a third of the difference
> between read() and read_iter() is the difference between nvfs_read()
> and nvfs_rw_iter_locked().

I put counters into vfs_read and vfs_readv.

After a fresh boot of the virtual machine, the counters show "13385 4". 
After a kernel compilation they show "4475220 8".

So, the readv path is almost unused.

My reasoning was that we should optimize for the "read" path and glue the 
"readv" path on top of that. Currently, the kernel does the opposite - it 
optimizes for "readv" and glues "read" on top of it.

Mikulas


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC v2] nvfs: a filesystem for persistent memory
  2021-01-10 21:14   ` Mikulas Patocka
@ 2021-01-10 23:40     ` Al Viro
  2021-01-11 11:41       ` Mikulas Patocka
  0 siblings, 1 reply; 30+ messages in thread
From: Al Viro @ 2021-01-10 23:40 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Andrew Morton, Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Matthew Wilcox, Jan Kara, Steven Whitehouse, Eric Sandeen,
	Dave Chinner, Theodore Ts'o, Wang Jianchao, Kani, Toshi,
	Norton, Scott J, Tadakamadla, Rajesh, linux-kernel,
	linux-fsdevel, linux-nvdimm

On Sun, Jan 10, 2021 at 04:14:55PM -0500, Mikulas Patocka wrote:

> That's a good point. I split nvfs_rw_iter to separate functions 
> nvfs_read_iter and nvfs_write_iter - and inlined nvfs_rw_iter_locked into 
> both of them. It improved performance by 1.3%.
> 
> > Not that it had been more useful on the write side, really,
> > but that's another story (nvfs_write_pages() handling of
> > copyin is... interesting).  Let's figure out what's going
> > on with the read overhead first...
> > 
> > lib/iov_iter.c primitives certainly could use massage for
> > better code generation, but let's find out how much of the
> > PITA is due to those and how much comes from you fighting
> > the damn thing instead of using it sanely...
> 
> The results are:
> 
> read:                                           6.744s
> read_iter:                                      7.417s
> read_iter - separate read and write path:       7.321s
> Al's read_iter:                                 7.182s
> Al's read_iter with _copy_to_iter:              7.181s

So
	* overhead of hardening stuff is noise here
	* switching to more straightforward ->read_iter() cuts
the overhead by about 1/3.

	Interesting...  I wonder how much of that is spent in
iterate_and_advance() glue inside copy_to_iter() here.  There's
certainly quite a bit of optimizations possible in those
primitives and your usecase makes a decent test for that...

	Could you profile that and see where is it spending
the time, on instruction level?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Expense of read_iter
  2021-01-10 21:19       ` Mikulas Patocka
@ 2021-01-11  0:18         ` Matthew Wilcox
  2021-01-11 21:10           ` Mikulas Patocka
  0 siblings, 1 reply; 30+ messages in thread
From: Matthew Wilcox @ 2021-01-11  0:18 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Al Viro, Andrew Morton, Dan Williams, Vishal Verma, Dave Jiang,
	Ira Weiny, Jan Kara, Steven Whitehouse, Eric Sandeen,
	Dave Chinner, Theodore Ts'o, Wang Jianchao, Kani, Toshi,
	Norton, Scott J, Tadakamadla, Rajesh, linux-kernel,
	linux-fsdevel, linux-nvdimm

On Sun, Jan 10, 2021 at 04:19:15PM -0500, Mikulas Patocka wrote:
> I put counters into vfs_read and vfs_readv.
> 
> After a fresh boot of the virtual machine, the counters show "13385 4". 
> After a kernel compilation they show "4475220 8".
> 
> So, the readv path is almost unused.
> 
> My reasoning was that we should optimize for the "read" path and glue the 
> "readv" path on the top of that. Currently, the kernel is doing the 
> opposite - optimizing for "readv" and glueing "read" on the top of it.

But it's not about optimising for read vs readv.  read_iter handles
a host of other cases, such as pread(), preadv(), AIO reads, splice,
and reads to in-kernel buffers.

Some device drivers abused read() vs readv() to actually return different
information, depending on which one you called.  That's why there's now a
prohibition against providing both.

So let's figure out how to make iter_read() perform well for sys_read().

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: Expense of read_iter
  2021-01-10  6:13     ` Matthew Wilcox
  2021-01-10 21:19       ` Mikulas Patocka
@ 2021-01-11 10:11       ` David Laight
  1 sibling, 0 replies; 30+ messages in thread
From: David Laight @ 2021-01-11 10:11 UTC (permalink / raw)
  To: 'Matthew Wilcox', Mikulas Patocka
  Cc: Andrew Morton, Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Jan Kara, Steven Whitehouse, Eric Sandeen, Dave Chinner,
	Theodore Ts'o, Wang Jianchao, Kani, Toshi, Norton, Scott J,
	Tadakamadla, Rajesh, linux-kernel, linux-fsdevel, linux-nvdimm

From: Matthew Wilcox
> Sent: 10 January 2021 06:13
...
> nvfs_rw_iter_locked() looks very complicated.  I suspect it can
> be simplified.  Of course new_sync_read() needs to be improved too,
> as do the other functions here, but fully a third of the difference
> between read() and read_iter() is the difference between nvfs_read()
> and nvfs_rw_iter_locked().

There is also the non-zero cost of import_iovec().
I've got some slight speedups, but haven't measured an
old kernel yet to see how much slower 5.11-rc1 made it.

Basic test is:
	fd = open("/dev/null", O_RDWR);
	for (i = 0; i < 10000; i++) {
		start = rdtsc();
		writev(fd, iovec, count);
		histogram[rdtsc() - start]++;
	}

This doesn't actually copy any data - the iovec
isn't iterated.
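
A self-contained version of that test could look like this (a sketch
along the same lines, not my actual test program; x86-only because of
rdtsc):

#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/uio.h>
#include <x86intrin.h>

#define ITERS	10000
#define NSLOTS	100000

int main(void)
{
	static unsigned int histogram[NSLOTS];
	char buf[64] = { 0 };
	struct iovec iovec[8];
	int count = 8, i, fd;

	fd = open("/dev/null", O_RDWR);
	if (fd < 0)
		return 1;

	for (i = 0; i < count; i++) {
		iovec[i].iov_base = buf;
		iovec[i].iov_len = sizeof(buf);
	}

	for (i = 0; i < ITERS; i++) {
		uint64_t start = __rdtsc();
		writev(fd, iovec, count);	/* /dev/null never touches the data */
		uint64_t delta = __rdtsc() - start;
		if (delta < NSLOTS)
			histogram[delta]++;
	}

	for (i = 0; i < NSLOTS; i++)
		if (histogram[i])
			printf("%d cycles: %u\n", i, histogram[i]);
	return 0;
}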

I'm seeing pretty stable counts for most of the 10000 iterations.
But different program runs can give massively different timings.
I'm guessing that depends on cache collisions due to the addresses
(virtual or physical?) selected for some items.

For 5.11-rc2, -mx32 is slightly faster than 64-bit.
-m32 has a much slower syscall entry/exit path,
but the difference between gettid() and writev() is lower.
The compat code for import_iovec() is actually faster.
This isn't really surprising since copy_from_user() is
absolutely horrid these days - especially with userspace hardening.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [RFC v2] nvfs: a filesystem for persistent memory
  2021-01-10 16:20 ` [RFC v2] nvfs: a filesystem for persistent memory Al Viro
  2021-01-10 16:51   ` Al Viro
  2021-01-10 21:14   ` Mikulas Patocka
@ 2021-01-11 10:29   ` David Laight
  2021-01-11 11:44     ` Mikulas Patocka
  2 siblings, 1 reply; 30+ messages in thread
From: David Laight @ 2021-01-11 10:29 UTC (permalink / raw)
  To: 'Al Viro', Mikulas Patocka
  Cc: Andrew Morton, Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Matthew Wilcox, Jan Kara, Steven Whitehouse, Eric Sandeen,
	Dave Chinner, Theodore Ts'o, Wang Jianchao, Kani, Toshi,
	Norton, Scott J, Tadakamadla, Rajesh, linux-kernel,
	linux-fsdevel, linux-nvdimm

From: Al Viro <viro@ftp.linux.org.uk> On Behalf Of Al Viro
> Sent: 10 January 2021 16:20
> 
> On Thu, Jan 07, 2021 at 08:15:41AM -0500, Mikulas Patocka wrote:
> > Hi
> >
> > I announce a new version of NVFS - a filesystem for persistent memory.
> > 	http://people.redhat.com/~mpatocka/nvfs/
> Utilities, AFAICS
> 
> > 	git://leontynka.twibright.com/nvfs.git
> Seems to hang on git pull at the moment...  Do you have it anywhere else?
> 
> > I found out that on NVFS, reading a file with the read method has 10%
> > better performance than the read_iter method. The benchmark just reads the
> > same 4k page over and over again - and the cost of creating and parsing
> > the kiocb and iov_iter structures is just that high.
> 
> Apples and oranges...  What happens if you take
> 
> ssize_t read_iter_locked(struct file *file, struct iov_iter *to, loff_t *ppos)
> {
> 	struct inode *inode = file_inode(file);
> 	struct nvfs_memory_inode *nmi = i_to_nmi(inode);
> 	struct nvfs_superblock *nvs = inode->i_sb->s_fs_info;
> 	ssize_t total = 0;
> 	loff_t pos = *ppos;
> 	int r;
> 	int shift = nvs->log2_page_size;
> 	size_t i_size;
> 
> 	i_size = inode->i_size;
> 	if (pos >= i_size)
> 		return 0;
> 	iov_iter_truncate(to, i_size - pos);
> 
> 	while (iov_iter_count(to)) {
> 		void *blk, *ptr;
> 		size_t page_mask = (1UL << shift) - 1;
> 		unsigned page_offset = pos & page_mask;
> 		unsigned prealloc = (iov_iter_count(to) + page_mask) >> shift;
> 		unsigned size;
> 
> 		blk = nvfs_bmap(nmi, pos >> shift, &prealloc, NULL, NULL, NULL);
> 		if (unlikely(IS_ERR(blk))) {
> 			r = PTR_ERR(blk);
> 			goto ret_r;
> 		}
> 		size = ((size_t)prealloc << shift) - page_offset;
> 		ptr = blk + page_offset;
> 		if (unlikely(!blk)) {
> 			size = min(size, (unsigned)PAGE_SIZE);
> 			ptr = empty_zero_page;
> 		}
> 		size = copy_to_iter(to, ptr, size);
> 		if (unlikely(!size)) {
> 			r = -EFAULT;
> 			goto ret_r;
> 		}
> 
> 		pos += size;
> 		total += size;
> 	} while (iov_iter_count(to));

That isn't the best formed loop!

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC v2] nvfs: a filesystem for persistent memory
  2021-01-10 23:40     ` Al Viro
@ 2021-01-11 11:41       ` Mikulas Patocka
  0 siblings, 0 replies; 30+ messages in thread
From: Mikulas Patocka @ 2021-01-11 11:41 UTC (permalink / raw)
  To: Al Viro
  Cc: Andrew Morton, Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Matthew Wilcox, Jan Kara, Steven Whitehouse, Eric Sandeen,
	Dave Chinner, Theodore Ts'o, Wang Jianchao, Kani, Toshi,
	Norton, Scott J, Tadakamadla, Rajesh, linux-kernel,
	linux-fsdevel, linux-nvdimm



On Sun, 10 Jan 2021, Al Viro wrote:

> On Sun, Jan 10, 2021 at 04:14:55PM -0500, Mikulas Patocka wrote:
> 
> > That's a good point. I split nvfs_rw_iter to separate functions 
> > nvfs_read_iter and nvfs_write_iter - and inlined nvfs_rw_iter_locked into 
> > both of them. It improved performance by 1.3%.
> > 
> > > Not that it had been more useful on the write side, really,
> > > but that's another story (nvfs_write_pages() handling of
> > > copyin is... interesting).  Let's figure out what's going
> > > on with the read overhead first...
> > > 
> > > lib/iov_iter.c primitives certainly could use massage for
> > > better code generation, but let's find out how much of the
> > > PITA is due to those and how much comes from you fighting
> > > the damn thing instead of using it sanely...
> > 
> > The results are:
> > 
> > read:                                           6.744s
> > read_iter:                                      7.417s
> > read_iter - separate read and write path:       7.321s
> > Al's read_iter:                                 7.182s
> > Al's read_iter with _copy_to_iter:              7.181s
> 
> So
> 	* overhead of hardening stuff is noise here
> 	* switching to more straightforward ->read_iter() cuts
> the overhead by about 1/3.
> 
> 	Interesting...  I wonder how much of that is spent in
> iterate_and_advance() glue inside copy_to_iter() here.  There's
> certainly quite a bit of optimizations possible in those
> primitives and your usecase makes a decent test for that...
> 
> 	Could you profile that and see where is it spending
> the time, on instruction level?

This is the read method profile:

time	9.056s
    52.69%  pread    [kernel.vmlinux]  [k] copy_user_generic_string
     6.24%  pread    [kernel.vmlinux]  [k] current_time
     6.22%  pread    [kernel.vmlinux]  [k] entry_SYSCALL_64
     4.88%  pread    libc-2.31.so      [.] __libc_pread
     3.75%  pread    [kernel.vmlinux]  [k] syscall_return_via_sysret
     3.63%  pread    [nvfs]            [k] nvfs_read
     2.83%  pread    [nvfs]            [k] nvfs_bmap
     2.81%  pread    [kernel.vmlinux]  [k] vfs_read
     2.63%  pread    [kernel.vmlinux]  [k] __x64_sys_pread64
     2.27%  pread    [kernel.vmlinux]  [k] __fsnotify_parent
     2.19%  pread    [kernel.vmlinux]  [k] entry_SYSCALL_64_after_hwframe
     1.55%  pread    [kernel.vmlinux]  [k] atime_needs_update
     1.17%  pread    [kernel.vmlinux]  [k] syscall_enter_from_user_mode
     1.15%  pread    [kernel.vmlinux]  [k] touch_atime
     0.84%  pread    [kernel.vmlinux]  [k] down_read
     0.82%  pread    [kernel.vmlinux]  [k] syscall_exit_to_user_mode
     0.71%  pread    [kernel.vmlinux]  [k] do_syscall_64
     0.68%  pread    [kernel.vmlinux]  [k] ktime_get_coarse_real_ts64
     0.66%  pread    [kernel.vmlinux]  [k] __fget_light
     0.53%  pread    [kernel.vmlinux]  [k] exit_to_user_mode_prepare
     0.45%  pread    [kernel.vmlinux]  [k] up_read
     0.44%  pread    pread             [.] main
     0.44%  pread    [kernel.vmlinux]  [k] syscall_exit_to_user_mode_prepare
     0.26%  pread    [kernel.vmlinux]  [k] entry_SYSCALL_64_safe_stack
     0.12%  pread    pread             [.] pread@plt
     0.07%  pread    [kernel.vmlinux]  [k] __fdget
     0.00%  perf     [kernel.vmlinux]  [k] x86_pmu_enable_all


This is profile of "read_iter - separate read and write path":

time	10.058s
    53.05%  pread    [kernel.vmlinux]  [k] copy_user_generic_string
     6.82%  pread    [kernel.vmlinux]  [k] current_time
     6.27%  pread    [nvfs]            [k] nvfs_read_iter
     4.70%  pread    [kernel.vmlinux]  [k] entry_SYSCALL_64
     3.20%  pread    libc-2.31.so      [.] __libc_pread
     2.77%  pread    [kernel.vmlinux]  [k] syscall_return_via_sysret
     2.31%  pread    [kernel.vmlinux]  [k] vfs_read
     2.15%  pread    [kernel.vmlinux]  [k] new_sync_read
     2.06%  pread    [kernel.vmlinux]  [k] __fsnotify_parent
     2.02%  pread    [nvfs]            [k] nvfs_bmap
     1.87%  pread    [kernel.vmlinux]  [k] entry_SYSCALL_64_after_hwframe
     1.86%  pread    [kernel.vmlinux]  [k] iov_iter_advance
     1.62%  pread    [kernel.vmlinux]  [k] __x64_sys_pread64
     1.40%  pread    [kernel.vmlinux]  [k] atime_needs_update
     0.99%  pread    [kernel.vmlinux]  [k] syscall_enter_from_user_mode
     0.85%  pread    [kernel.vmlinux]  [k] touch_atime
     0.85%  pread    [kernel.vmlinux]  [k] down_read
     0.84%  pread    [kernel.vmlinux]  [k] syscall_exit_to_user_mode
     0.78%  pread    [kernel.vmlinux]  [k] ktime_get_coarse_real_ts64
     0.65%  pread    [kernel.vmlinux]  [k] __fget_light
     0.57%  pread    [kernel.vmlinux]  [k] exit_to_user_mode_prepare
     0.53%  pread    [kernel.vmlinux]  [k] syscall_exit_to_user_mode_prepare
     0.45%  pread    pread             [.] main
     0.43%  pread    [kernel.vmlinux]  [k] up_read
     0.43%  pread    [kernel.vmlinux]  [k] do_syscall_64
     0.28%  pread    [kernel.vmlinux]  [k] iov_iter_init
     0.16%  pread    [kernel.vmlinux]  [k] entry_SYSCALL_64_safe_stack
     0.09%  pread    pread             [.] pread@plt
     0.03%  pread    [kernel.vmlinux]  [k] __fdget
     0.00%  pread    [kernel.vmlinux]  [k] update_rt_rq_load_avg
     0.00%  perf     [kernel.vmlinux]  [k] x86_pmu_enable_all


This is your read_iter_locked profile (read_iter_locked is inlined to 
nvfs_read_iter):

time	10.056s
    50.71%  pread    [kernel.vmlinux]  [k] copy_user_generic_string
     6.95%  pread    [kernel.vmlinux]  [k] current_time
     5.22%  pread    [kernel.vmlinux]  [k] entry_SYSCALL_64
     4.29%  pread    libc-2.31.so      [.] __libc_pread
     4.17%  pread    [nvfs]            [k] nvfs_read_iter
     3.20%  pread    [kernel.vmlinux]  [k] syscall_return_via_sysret
     2.66%  pread    [kernel.vmlinux]  [k] _copy_to_iter
     2.44%  pread    [kernel.vmlinux]  [k] __x64_sys_pread64
     2.38%  pread    [kernel.vmlinux]  [k] new_sync_read
     2.37%  pread    [kernel.vmlinux]  [k] entry_SYSCALL_64_after_hwframe
     2.26%  pread    [kernel.vmlinux]  [k] vfs_read
     2.02%  pread    [nvfs]            [k] nvfs_bmap
     1.88%  pread    [kernel.vmlinux]  [k] __fsnotify_parent
     1.46%  pread    [kernel.vmlinux]  [k] atime_needs_update
     1.08%  pread    [kernel.vmlinux]  [k] touch_atime
     0.83%  pread    [kernel.vmlinux]  [k] syscall_exit_to_user_mode
     0.82%  pread    [kernel.vmlinux]  [k] syscall_enter_from_user_mode
     0.75%  pread    [kernel.vmlinux]  [k] syscall_exit_to_user_mode_prepare
     0.73%  pread    [kernel.vmlinux]  [k] __fget_light
     0.65%  pread    [kernel.vmlinux]  [k] down_read
     0.58%  pread    pread             [.] main
     0.58%  pread    [kernel.vmlinux]  [k] exit_to_user_mode_prepare
     0.52%  pread    [kernel.vmlinux]  [k] ktime_get_coarse_real_ts64
     0.48%  pread    [kernel.vmlinux]  [k] up_read
     0.42%  pread    [kernel.vmlinux]  [k] do_syscall_64
     0.28%  pread    [kernel.vmlinux]  [k] iov_iter_init
     0.13%  pread    [kernel.vmlinux]  [k] __fdget
     0.12%  pread    [kernel.vmlinux]  [k] entry_SYSCALL_64_safe_stack
     0.03%  pread    pread             [.] pread@plt
     0.00%  perf     [kernel.vmlinux]  [k] x86_pmu_enable_all

Mikulas


^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [RFC v2] nvfs: a filesystem for persistent memory
  2021-01-11 10:29   ` David Laight
@ 2021-01-11 11:44     ` Mikulas Patocka
  2021-01-11 11:57       ` David Laight
  0 siblings, 1 reply; 30+ messages in thread
From: Mikulas Patocka @ 2021-01-11 11:44 UTC (permalink / raw)
  To: David Laight
  Cc: 'Al Viro',
	Andrew Morton, Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Matthew Wilcox, Jan Kara, Steven Whitehouse, Eric Sandeen,
	Dave Chinner, Theodore Ts'o, Wang Jianchao, Kani, Toshi,
	Norton, Scott J, Tadakamadla, Rajesh, linux-kernel,
	linux-fsdevel, linux-nvdimm



On Mon, 11 Jan 2021, David Laight wrote:

> From: Al Viro <viro@ftp.linux.org.uk> On Behalf Of Al Viro
> > Sent: 10 January 2021 16:20
> > 
> > On Thu, Jan 07, 2021 at 08:15:41AM -0500, Mikulas Patocka wrote:
> > > Hi
> > >
> > > I announce a new version of NVFS - a filesystem for persistent memory.
> > > 	http://people.redhat.com/~mpatocka/nvfs/
> > Utilities, AFAICS
> > 
> > > 	git://leontynka.twibright.com/nvfs.git
> > Seems to hang on git pull at the moment...  Do you have it anywhere else?
> > 
> > > I found out that on NVFS, reading a file with the read method has 10%
> > > better performance than the read_iter method. The benchmark just reads the
> > > same 4k page over and over again - and the cost of creating and parsing
> > > the kiocb and iov_iter structures is just that high.
> > 
> > Apples and oranges...  What happens if you take
> > 
> > ssize_t read_iter_locked(struct file *file, struct iov_iter *to, loff_t *ppos)
> > {
> > 	struct inode *inode = file_inode(file);
> > 	struct nvfs_memory_inode *nmi = i_to_nmi(inode);
> > 	struct nvfs_superblock *nvs = inode->i_sb->s_fs_info;
> > 	ssize_t total = 0;
> > 	loff_t pos = *ppos;
> > 	int r;
> > 	int shift = nvs->log2_page_size;
> > 	size_t i_size;
> > 
> > 	i_size = inode->i_size;
> > 	if (pos >= i_size)
> > 		return 0;
> > 	iov_iter_truncate(to, i_size - pos);
> > 
> > 	while (iov_iter_count(to)) {
> > 		void *blk, *ptr;
> > 		size_t page_mask = (1UL << shift) - 1;
> > 		unsigned page_offset = pos & page_mask;
> > 		unsigned prealloc = (iov_iter_count(to) + page_mask) >> shift;
> > 		unsigned size;
> > 
> > 		blk = nvfs_bmap(nmi, pos >> shift, &prealloc, NULL, NULL, NULL);
> > 		if (unlikely(IS_ERR(blk))) {
> > 			r = PTR_ERR(blk);
> > 			goto ret_r;
> > 		}
> > 		size = ((size_t)prealloc << shift) - page_offset;
> > 		ptr = blk + page_offset;
> > 		if (unlikely(!blk)) {
> > 			size = min(size, (unsigned)PAGE_SIZE);
> > 			ptr = empty_zero_page;
> > 		}
> > 		size = copy_to_iter(to, ptr, size);
> > 		if (unlikely(!size)) {
> > 			r = -EFAULT;
> > 			goto ret_r;
> > 		}
> > 
> > 		pos += size;
> > 		total += size;
> > 	} while (iov_iter_count(to));
> 
> That isn't the best formed loop!
> 
> 	David

I removed the second "while" statement and fixed the arguments to 
copy_to_iter - other than that, Al's function works.

Mikulas


^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [RFC v2] nvfs: a filesystem for persistent memory
  2021-01-11 11:44     ` Mikulas Patocka
@ 2021-01-11 11:57       ` David Laight
  2021-01-11 14:43         ` Al Viro
  0 siblings, 1 reply; 30+ messages in thread
From: David Laight @ 2021-01-11 11:57 UTC (permalink / raw)
  To: 'Mikulas Patocka'
  Cc: 'Al Viro',
	Andrew Morton, Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Matthew Wilcox, Jan Kara, Steven Whitehouse, Eric Sandeen,
	Dave Chinner, Theodore Ts'o, Wang Jianchao, Kani, Toshi,
	Norton, Scott J, Tadakamadla, Rajesh, linux-kernel,
	linux-fsdevel, linux-nvdimm

From: Mikulas Patocka
> Sent: 11 January 2021 11:44
> On Mon, 11 Jan 2021, David Laight wrote:
> 
> > From: Al Viro <viro@ftp.linux.org.uk> On Behalf Of Al Viro
> > > Sent: 10 January 2021 16:20
...
...
> > > 	while (iov_iter_count(to)) {
...
> > > 		size = copy_to_iter(to, ptr, size);
> > > 		if (unlikely(!size)) {
> > > 			r = -EFAULT;
> > > 			goto ret_r;
> > > 		}
> > >
> > > 		pos += size;
> > > 		total += size;
> > > 	} while (iov_iter_count(to));
> >
> > That isn't the best formed loop!
> >
> > 	David
> 
> I removed the second "while" statement and fixed the arguments to
> copy_to_iter - other than that, Al's function works.

The extra while is easy to write and can be difficult to spot.
I've found them by looking at the object code before now!

Oh - the error return for copy_to_iter() is wrong.
It should (probably) return 'total' if it is nonzero.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC v2] nvfs: a filesystem for persistent memory
  2021-01-11 11:57       ` David Laight
@ 2021-01-11 14:43         ` Al Viro
  2021-01-11 14:54           ` David Laight
  0 siblings, 1 reply; 30+ messages in thread
From: Al Viro @ 2021-01-11 14:43 UTC (permalink / raw)
  To: David Laight
  Cc: 'Mikulas Patocka',
	Andrew Morton, Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Matthew Wilcox, Jan Kara, Steven Whitehouse, Eric Sandeen,
	Dave Chinner, Theodore Ts'o, Wang Jianchao, Kani, Toshi,
	Norton, Scott J, Tadakamadla, Rajesh, linux-kernel,
	linux-fsdevel, linux-nvdimm

On Mon, Jan 11, 2021 at 11:57:08AM +0000, David Laight wrote:
> > > > 		size = copy_to_iter(to, ptr, size);
> > > > 		if (unlikely(!size)) {
> > > > 			r = -EFAULT;
> > > > 			goto ret_r;
> > > > 		}
> > > >
> > > > 		pos += size;
> > > > 		total += size;
> > > > 	} while (iov_iter_count(to));
> > >
> > > That isn't the best formed loop!
> > >
> > > 	David
> > 
> > I removed the second "while" statement and fixed the arguments to
> > copy_to_iter - other than that, Al's function works.
> 
> The extra while is easy to write and can be difficult to spot.
> I've found them looking as the object code before now!

	That extra while comes from editing a cut'n'pasted piece of
source while writing a reply.  Hint: the original had been a do-while.

> Oh - the error return for copy_to_iter() is wrong.
> It should (probably) return 'total' if it is nonzero.

	The copy_to_iter() call there has an obvious problem
(arguments in the wrong order), but the return value is handled
correctly.  It does not do a blind return -EFAULT.  RTFS...

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [RFC v2] nvfs: a filesystem for persistent memory
  2021-01-11 14:43         ` Al Viro
@ 2021-01-11 14:54           ` David Laight
  0 siblings, 0 replies; 30+ messages in thread
From: David Laight @ 2021-01-11 14:54 UTC (permalink / raw)
  To: 'Al Viro'
  Cc: 'Mikulas Patocka',
	Andrew Morton, Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Matthew Wilcox, Jan Kara, Steven Whitehouse, Eric Sandeen,
	Dave Chinner, Theodore Ts'o, Wang Jianchao, Kani, Toshi,
	Norton, Scott J, Tadakamadla, Rajesh, linux-kernel,
	linux-fsdevel, linux-nvdimm

From: Al Viro <viro@ftp.linux.org.uk> On Behalf Of Al Viro
> Sent: 11 January 2021 14:44
> On Mon, Jan 11, 2021 at 11:57:08AM +0000, David Laight wrote:
> > > > > 		size = copy_to_iter(to, ptr, size);
> > > > > 		if (unlikely(!size)) {
> > > > > 			r = -EFAULT;
> > > > > 			goto ret_r;
> > > > > 		}
> > > > >
> > > > > 		pos += size;
> > > > > 		total += size;
> > > > > 	}
> > > >
> > > > 	David
> > >
> > > I fixed the arguments to
> > > copy_to_iter - other than that, Al's function works.
> >
> 
> > Oh - the error return for copy_to_iter() is wrong.
> > It should (probably) return 'total' if it is nonzero.
> 
> 	copy_to_iter() call there has an obvious problem
> (arguments in the wrong order), but return value is handled
> correctly.  It does not do a blind return -EFAULT.  RTFS...

Ah I was looking at the version I'd cut the tail off...

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Expense of read_iter
  2021-01-11  0:18         ` Matthew Wilcox
@ 2021-01-11 21:10           ` Mikulas Patocka
  0 siblings, 0 replies; 30+ messages in thread
From: Mikulas Patocka @ 2021-01-11 21:10 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Al Viro, Mingkai Dong, Andrew Morton, Dan Williams, Vishal Verma,
	Dave Jiang, Ira Weiny, Jan Kara, Steven Whitehouse, Eric Sandeen,
	Dave Chinner, Theodore Ts'o, Wang Jianchao, Kani, Toshi,
	Norton, Scott J, Tadakamadla, Rajesh, linux-kernel,
	linux-fsdevel, linux-nvdimm



On Mon, 11 Jan 2021, Matthew Wilcox wrote:

> On Sun, Jan 10, 2021 at 04:19:15PM -0500, Mikulas Patocka wrote:
> > I put counters into vfs_read and vfs_readv.
> > 
> > After a fresh boot of the virtual machine, the counters show "13385 4". 
> > After a kernel compilation they show "4475220 8".
> > 
> > So, the readv path is almost unused.
> > 
> > My reasoning was that we should optimize for the "read" path and glue the 
> > "readv" path on the top of that. Currently, the kernel is doing the 
> > opposite - optimizing for "readv" and glueing "read" on the top of it.
> 
> But it's not about optimising for read vs readv.  read_iter handles
> a host of other cases, such as pread(), preadv(), AIO reads, splice,
> and reads to in-kernel buffers.

These things are used rarely compared to "read" and "pread". (BTW, "pread"
could be handled by the read method too.)
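
For illustration - the ->read method already takes the file position by
pointer, so pread() only has to pass its own loff_t. A rough sketch with a
made-up function name (the real path goes through ksys_pread64() and
vfs_read()):

static ssize_t pread_via_read_method(struct file *file, char __user *buf,
                                     size_t count, loff_t pos)
{
        if (!(file->f_mode & FMODE_PREAD))
                return -ESPIPE;
        /* assumes the filesystem implements ->read */
        return file->f_op->read(file, buf, count, &pos);
}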

Why do you think that the "read" syscall should use the "read_iter" code
path? Is it because duplicating the logic is discouraged? Or because of
code size? Or something else?

> Some device drivers abused read() vs readv() to actually return different 
> information, depending which you called.  That's why there's now a
> prohibition against both.
> 
> So let's figure out how to make iter_read() perform well for sys_read().

I've got another idea - in nvfs_read_iter, test if the iovec contains just 
one entry and call nvfs_read_locked if it does.

diff --git a/file.c b/file.c
index f4b8a1a..e4d87b2 100644
--- a/file.c
+++ b/file.c
@@ -460,6 +460,10 @@ static ssize_t nvfs_read_iter(struct kiocb *iocb, struct iov_iter *iov)
 	if (!IS_DAX(&nmi->vfs_inode)) {
 		r = generic_file_read_iter(iocb, iov);
 	} else {
+		if (likely(iter_is_iovec(iov)) && likely(!iov->iov_offset) && likely(iov->nr_segs == 1)) {
+			r = nvfs_read_locked(nmi, iocb->ki_filp, iov->iov->iov_base, iov->count, true, &iocb->ki_pos);
+			goto unlock_ret;
+		}
 #if 1
 		r = nvfs_rw_iter_locked(iocb, iov, false);
 #else
@@ -467,6 +471,7 @@ static ssize_t nvfs_read_iter(struct kiocb *iocb, struct iov_iter *iov)
 #endif
 	}
 
+unlock_ret:
 	inode_unlock_shared(&nmi->vfs_inode);
 
 	return r;



The result is:

nvfs_read_iter			- 7.307s
Al Viro's read_iter_locked	- 7.147s
test for just one entry		- 7.010s
the read method			- 6.782s

So far, this is the best way to do it, but it's still 3.3% worse than the
read method. There's nothing more that can be optimized at the filesystem
level - the rest of the optimization must be done in the VFS.

Mikulas


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: Expense of read_iter
  2021-01-07 16:43   ` Mingkai Dong
@ 2021-01-12 13:45     ` Zhongwei Cai
  2021-01-12 14:06       ` David Laight
  2021-01-13 16:44       ` Mikulas Patocka
  0 siblings, 2 replies; 30+ messages in thread
From: Zhongwei Cai @ 2021-01-12 13:45 UTC (permalink / raw)
  To: Mingkai Dong, Matthew Wilcox
  Cc: Mikulas Patocka, Andrew Morton, Jan Kara, Steven Whitehouse,
	Eric Sandeen, Dave Chinner, Theodore Ts'o, Wang Jianchao,
	Tadakamadla, Rajesh, linux-kernel, linux-fsdevel, linux-nvdimm


I'm working with Mingkai on optimizations for Ext4-dax.
We think that optimizing the read-iter method cannot achieve the
same performance as the read method for Ext4-dax.
We tried Mikulas's benchmark on Ext4-dax. The overall time and perf
results are listed below:

Overall time of 2^26 4KB reads:

Method       Time
read         26.782s
read-iter    36.477s

Perf result, using the read_iter method:

# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 1K of event 'cycles'
# Event count (approx.): 13379476464
#
# Overhead  Command  Shared Object     Symbol                                 
# ........  .......  ................  .......................................
#
    20.09%  pread    [kernel.vmlinux]  [k] copy_user_generic_string
     6.58%  pread    [kernel.vmlinux]  [k] iomap_apply
     6.01%  pread    [kernel.vmlinux]  [k] syscall_return_via_sysret
     4.85%  pread    libc-2.31.so      [.] __libc_pread
     3.61%  pread    [kernel.vmlinux]  [k] entry_SYSCALL_64_after_hwframe
     3.25%  pread    [kernel.vmlinux]  [k] _raw_read_lock
     2.80%  pread    [kernel.vmlinux]  [k] entry_SYSCALL_64
     2.71%  pread    [ext4]            [k] ext4_es_lookup_extent
     2.71%  pread    [kernel.vmlinux]  [k] __fsnotify_parent
     2.63%  pread    [kernel.vmlinux]  [k] __srcu_read_unlock
     2.55%  pread    [kernel.vmlinux]  [k] new_sync_read
     2.39%  pread    [ext4]            [k] ext4_iomap_begin
     2.38%  pread    [kernel.vmlinux]  [k] vfs_read
     2.30%  pread    [kernel.vmlinux]  [k] dax_iomap_actor
     2.30%  pread    [kernel.vmlinux]  [k] __srcu_read_lock
     2.14%  pread    [ext4]            [k] ext4_inode_block_valid
     1.97%  pread    [kernel.vmlinux]  [k] _copy_mc_to_iter
     1.97%  pread    [ext4]            [k] ext4_map_blocks
     1.89%  pread    [kernel.vmlinux]  [k] down_read
     1.89%  pread    [kernel.vmlinux]  [k] up_read
     1.65%  pread    [ext4]            [k] ext4_file_read_iter
     1.48%  pread    [kernel.vmlinux]  [k] dax_iomap_rw
     1.48%  pread    [jbd2]            [k] jbd2_transaction_committed
     1.15%  pread    [nd_pmem]         [k] __pmem_direct_access
     1.15%  pread    [kernel.vmlinux]  [k] ksys_pread64
     1.15%  pread    [kernel.vmlinux]  [k] __fget_light
     1.15%  pread    [ext4]            [k] ext4_set_iomap
     1.07%  pread    [kernel.vmlinux]  [k] atime_needs_update
     0.82%  pread    pread             [.] main
     0.82%  pread    [kernel.vmlinux]  [k] do_syscall_64
     0.74%  pread    [kernel.vmlinux]  [k] entry_SYSCALL_64_safe_stack
     0.66%  pread    [kernel.vmlinux]  [k] __x86_indirect_thunk_rax
     0.66%  pread    [nd_pmem]         [k] 0x00000000000001d0
     0.59%  pread    [kernel.vmlinux]  [k] dax_direct_access
     0.58%  pread    [nd_pmem]         [k] 0x00000000000001de
     0.58%  pread    [kernel.vmlinux]  [k] bdev_dax_pgoff
     0.49%  pread    [kernel.vmlinux]  [k] syscall_enter_from_user_mode
     0.49%  pread    [kernel.vmlinux]  [k] exit_to_user_mode_prepare
     0.49%  pread    [kernel.vmlinux]  [k] syscall_exit_to_user_mode
     0.41%  pread    [kernel.vmlinux]  [k] syscall_exit_to_user_mode_prepare
     0.33%  pread    [nd_pmem]         [k] 0x0000000000001083
     0.33%  pread    [kernel.vmlinux]  [k] dax_get_private
     0.33%  pread    [kernel.vmlinux]  [k] timestamp_truncate
     0.33%  pread    [kernel.vmlinux]  [k] percpu_counter_add_batch
     0.33%  pread    [kernel.vmlinux]  [k] copyout_mc
     0.33%  pread    [ext4]            [k] __check_block_validity.constprop.80
     0.33%  pread    [kernel.vmlinux]  [k] touch_atime
     0.25%  pread    [nd_pmem]         [k] 0x000000000000107f
     0.25%  pread    [kernel.vmlinux]  [k] rw_verify_area
     0.25%  pread    [ext4]            [k] ext4_iomap_end
     0.25%  pread    [kernel.vmlinux]  [k] _cond_resched
     0.25%  pread    [kernel.vmlinux]  [k] rcu_all_qs
     0.16%  pread    [kernel.vmlinux]  [k] __fdget
     0.16%  pread    [kernel.vmlinux]  [k] ktime_get_coarse_real_ts64
     0.16%  pread    [kernel.vmlinux]  [k] iov_iter_init
     0.16%  pread    [kernel.vmlinux]  [k] current_time
     0.16%  pread    [nd_pmem]         [k] 0x0000000000001075
     0.16%  pread    [ext4]            [k] ext4_inode_datasync_dirty
     0.16%  pread    [kernel.vmlinux]  [k] copy_mc_to_user
     0.08%  pread    pread             [.] pread@plt
     0.08%  pread    [kernel.vmlinux]  [k] __x86_indirect_thunk_r11
     0.08%  pread    [kernel.vmlinux]  [k] security_file_permission
     0.08%  pread    [kernel.vmlinux]  [k] dax_read_unlock
     0.08%  pread    [kernel.vmlinux]  [k] _raw_spin_unlock_irqrestore
     0.08%  pread    [nd_pmem]         [k] 0x000000000000108f
     0.08%  pread    [nd_pmem]         [k] 0x0000000000001095
     0.08%  pread    [kernel.vmlinux]  [k] rcu_read_unlock_strict
     0.00%  pread    [kernel.vmlinux]  [k] native_write_msr


#
# (Tip: Show current config key-value pairs: perf config --list)
#

Perf result, using the read method we added for Ext4-dax:

# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 1K of event 'cycles'
# Event count (approx.): 13364755903
#
# Overhead  Command  Shared Object     Symbol                                 
# ........  .......  ................  .......................................
#
    28.65%  pread    [kernel.vmlinux]  [k] copy_user_generic_string
     7.99%  pread    [ext4]            [k] ext4_dax_read
     6.50%  pread    [kernel.vmlinux]  [k] syscall_return_via_sysret
     5.43%  pread    libc-2.31.so      [.] __libc_pread
     4.45%  pread    [kernel.vmlinux]  [k] entry_SYSCALL_64
     4.20%  pread    [kernel.vmlinux]  [k] down_read
     3.38%  pread    [kernel.vmlinux]  [k] _raw_read_lock
     3.13%  pread    [ext4]            [k] ext4_es_lookup_extent
     3.05%  pread    [kernel.vmlinux]  [k] __srcu_read_lock
     2.72%  pread    [kernel.vmlinux]  [k] __fsnotify_parent
     2.55%  pread    [kernel.vmlinux]  [k] __srcu_read_unlock
     2.47%  pread    [kernel.vmlinux]  [k] vfs_read
     2.31%  pread    [kernel.vmlinux]  [k] entry_SYSCALL_64_after_hwframe
     1.89%  pread    [kernel.vmlinux]  [k] up_read
     1.73%  pread    [ext4]            [k] ext4_map_blocks
     1.65%  pread    pread             [.] main
     1.56%  pread    [kernel.vmlinux]  [k] __fget_light
     1.48%  pread    [ext4]            [k] ext4_inode_block_valid
     1.34%  pread    [kernel.vmlinux]  [k] ksys_pread64
     1.23%  pread    [kernel.vmlinux]  [k] entry_SYSCALL_64_safe_stack
     1.08%  pread    [kernel.vmlinux]  [k] syscall_exit_to_user_mode
     1.07%  pread    [nd_pmem]         [k] __pmem_direct_access
     0.99%  pread    [kernel.vmlinux]  [k] atime_needs_update
     0.91%  pread    [kernel.vmlinux]  [k] security_file_permission
     0.91%  pread    [kernel.vmlinux]  [k] syscall_enter_from_user_mode
     0.66%  pread    [kernel.vmlinux]  [k] timestamp_truncate
     0.58%  pread    [kernel.vmlinux]  [k] ktime_get_coarse_real_ts64
     0.49%  pread    pread             [.] pread@plt
     0.41%  pread    [kernel.vmlinux]  [k] current_time
     0.41%  pread    [kernel.vmlinux]  [k] dax_direct_access
     0.41%  pread    [kernel.vmlinux]  [k] do_syscall_64
     0.41%  pread    [kernel.vmlinux]  [k] exit_to_user_mode_prepare
     0.41%  pread    [kernel.vmlinux]  [k] percpu_counter_add_batch
     0.33%  pread    [kernel.vmlinux]  [k] touch_atime
     0.33%  pread    [ext4]            [k] __check_block_validity.constprop.80
     0.33%  pread    [kernel.vmlinux]  [k] copy_mc_to_user
     0.25%  pread    [kernel.vmlinux]  [k] dax_get_private
     0.25%  pread    [kernel.vmlinux]  [k] rcu_all_qs
     0.25%  pread    [nd_pmem]         [k] 0x0000000000001095
     0.16%  pread    [kernel.vmlinux]  [k] _raw_spin_lock_irqsave
     0.16%  pread    [kernel.vmlinux]  [k] syscall_exit_to_user_mode_prepare
     0.16%  pread    [nd_pmem]         [k] 0x0000000000001083
     0.16%  pread    [kernel.vmlinux]  [k] rw_verify_area
     0.16%  pread    [kernel.vmlinux]  [k] _raw_spin_unlock_irqrestore
     0.16%  pread    [kernel.vmlinux]  [k] __fdget
     0.16%  pread    [kernel.vmlinux]  [k] dax_read_lock
     0.16%  pread    [kernel.vmlinux]  [k] __x86_indirect_thunk_rax
     0.08%  pread    [kernel.vmlinux]  [k] rcu_read_unlock_strict
     0.08%  pread    [kernel.vmlinux]  [k] dax_read_unlock
     0.08%  pread    [kernel.vmlinux]  [k] update_irq_load_avg
     0.08%  pread    [nd_pmem]         [k] 0x000000000000109d
     0.08%  pread    [nd_pmem]         [k] 0x000000000000107a
     0.08%  pread    [kernel.vmlinux]  [k] __x64_sys_pread64
     0.00%  pread    [kernel.vmlinux]  [k] native_write_msr


#
# (Tip: Sample related events with: perf record -e '{cycles,instructions}:S')
#

Note that the overall time of the read method is 73.42% of that of the
read-iter method. If we sum up the percentages of the read-iter-specific
functions (including ext4_file_read_iter, iomap_apply, dax_iomap_actor,
_copy_mc_to_iter, ext4_iomap_begin, jbd2_transaction_committed,
new_sync_read, dax_iomap_rw, ext4_set_iomap, ext4_iomap_end and
iov_iter_init), we get 20.81%. In the second trace, ext4_dax_read, which
replaces all of these functions, consumes only 7.99%.

The overhead mainly consists of two parts. The first is constructing
struct iov_iter and iterating over it (i.e., new_sync_read, _copy_mc_to_iter
and iov_iter_init). The second is the DAX I/O mechanism provided by the VFS
(i.e., dax_iomap_rw, iomap_apply and ext4_iomap_begin).

There could be two approaches to optimizing this: 1) implementing the read
method without the complexity of iterators and dax_iomap_rw; 2) optimizing
both the iterators and how dax_iomap_rw works. Since dax_iomap_rw requires
ext4_iomap_begin, which further involves the iomap structure and others
(e.g., journaling status locks in Ext4), we think implementing the read
method would be easier.

Thanks,
Zhongwei


^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: Expense of read_iter
  2021-01-12 13:45     ` Zhongwei Cai
@ 2021-01-12 14:06       ` David Laight
  2021-01-13 16:44       ` Mikulas Patocka
  1 sibling, 0 replies; 30+ messages in thread
From: David Laight @ 2021-01-12 14:06 UTC (permalink / raw)
  To: 'Zhongwei Cai', Mingkai Dong, Matthew Wilcox, viro
  Cc: Mikulas Patocka, Andrew Morton, Jan Kara, Steven Whitehouse,
	Eric Sandeen, Dave Chinner, Theodore Ts'o, Wang Jianchao,
	Tadakamadla, Rajesh, linux-kernel, linux-fsdevel, linux-nvdimm

From: Zhongwei Cai
> Sent: 12 January 2021 13:45
..
> The overhead mainly consists of two parts. The first is constructing
> struct iov_iter and iterating it (i.e., new_sync, _copy_mc_to_iter and
> iov_iter_init). The second is the dax io mechanism provided by VFS (i.e.,
> dax_iomap_rw, iomap_apply and ext4_iomap_begin).

Setting up an iov_iter with a single buffer ought to be relatively
cheap - compared to a file system read.

The iteration should be over the total length
calling copy_from/to_iter() for 'chunks' that don't
depend on the size of the iov[] fragments.

So copy_to/from_iter() should directly replace the copy_to/from_user()
calls in the 'read' method.
For a single buffer this really ought to be noise as well.

Clearly, if the iov[] has a lot of short fragments the copy
will be more expensive.

Accesses to /dev/null and /dev/zero are much more likely to show
the additional costs of the iov_iter code than fs code.
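
To illustrate that, here is a hypothetical pair of methods (foo_data() and
foo_bytes_available() are made-up helpers, not code from any patch in this
thread) - the only real difference is the copy primitive:

#include <linux/fs.h>
#include <linux/uaccess.h>
#include <linux/uio.h>

/* made-up helpers standing in for the filesystem's own lookup logic */
static size_t foo_bytes_available(struct file *file, loff_t pos);
static void *foo_data(struct file *file, loff_t pos);

/* classic ->read: one user buffer, position passed by pointer */
static ssize_t foo_read(struct file *file, char __user *buf,
                        size_t len, loff_t *ppos)
{
        size_t n = min(len, foo_bytes_available(file, *ppos));

        if (!n)
                return 0;
        if (copy_to_user(buf, foo_data(file, *ppos), n))
                return -EFAULT;
        *ppos += n;
        return n;
}

/* ->read_iter: the same logic, with copy_to_iter() doing the copy */
static ssize_t foo_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
        struct file *file = iocb->ki_filp;
        size_t n = min(iov_iter_count(to),
                       foo_bytes_available(file, iocb->ki_pos));
        size_t copied;

        if (!n)
                return 0;
        copied = copy_to_iter(foo_data(file, iocb->ki_pos), n, to);
        if (!copied)
                return -EFAULT;
        iocb->ki_pos += copied;
        return copied;
}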


	David



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Expense of read_iter
  2021-01-12 13:45     ` Zhongwei Cai
  2021-01-12 14:06       ` David Laight
@ 2021-01-13 16:44       ` Mikulas Patocka
  2021-01-15  9:40         ` Zhongwei Cai
  1 sibling, 1 reply; 30+ messages in thread
From: Mikulas Patocka @ 2021-01-13 16:44 UTC (permalink / raw)
  To: Zhongwei Cai
  Cc: Mingkai Dong, Matthew Wilcox, Andrew Morton, Jan Kara,
	Steven Whitehouse, Eric Sandeen, Dave Chinner, Theodore Ts'o,
	Wang Jianchao, Tadakamadla, Rajesh, linux-kernel, linux-fsdevel,
	linux-nvdimm



On Tue, 12 Jan 2021, Zhongwei Cai wrote:

> 
> I'm working with Mingkai on optimizations for Ext4-dax.

What specific patch are you working on? Please post it somewhere.

> We think that optmizing the read-iter method cannot achieve the
> same performance as the read method for Ext4-dax. 
> We tried Mikulas's benchmark on Ext4-dax. The overall time and perf
> results are listed below:
> 
> Overall time of 2^26 4KB read.
> 
> Method       Time
> read         26.782s
> read-iter    36.477s

What happens if you use this trick ( https://lkml.org/lkml/2021/1/11/1612 )
- detect in the "read_iter" method that there is just one segment and 
treat it like a "read" method? I think that it should improve performance 
for your case.

Mikulas


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Expense of read_iter
  2021-01-13 16:44       ` Mikulas Patocka
@ 2021-01-15  9:40         ` Zhongwei Cai
  2021-01-20  4:47           ` Dave Chinner
  0 siblings, 1 reply; 30+ messages in thread
From: Zhongwei Cai @ 2021-01-15  9:40 UTC (permalink / raw)
  To: Mikulas Patocka, Theodore Ts'o, Matthew Wilcox, David Laight
  Cc: Mingkai Dong, Andrew Morton, Jan Kara, Steven Whitehouse,
	Eric Sandeen, Dave Chinner, Wang Jianchao, Rajesh Tadakamadla,
	linux-kernel, linux-fsdevel, linux-nvdimm

On Thu, 14 Jan 2021, Mikulas wrote:

>> I'm working with Mingkai on optimizations for Ext4-dax.
>
> What specific patch are you working on? Please, post it somewhere.

Here is the work-in-progress patch: https://ipads.se.sjtu.edu.cn:1312/opensource/linux/-/tree/ext4-read
For now it only contains the "read" implementation for Ext4-dax; we will
add other optimizations to it later.

> What happens if you use this trick ( https://lkml.org/lkml/2021/1/11/1612 )
> - detect in the "read_iter" method that there is just one segment and 
> treat it like a "read" method. I think that it should improve performance 
> for your case.

Note that the original Ext4-dax does not implement the "read" method. Instead,
it calls the "dax_iomap_rw" helper provided by the VFS. So we first rewrote
the "read_iter" method so that it iterates over struct iov_iter and calls our
"read" method, as a baseline for comparison.

Overall time of 2^26 4KB reads:
"read-iter" method with dax-iomap-rw (original)              - 36.477s
"read_iter" method wraps our "read" method                   - 28.950s
"read_iter" method tests for one entry proposed by Mikulas   - 27.947s
"read" method                                                - 26.899s

As we mentioned in the previous email (https://lkml.org/lkml/2021/1/12/710),
the overhead mainly consists of two parts. The first is constructing
struct iov_iter and iterating over it (i.e., new_sync_read, _copy_mc_to_iter
and iov_iter_init). The second is the DAX I/O mechanism provided by the VFS
(i.e., dax_iomap_rw, iomap_apply and ext4_iomap_begin).

For Ext4-dax, the overhead of dax_iomap_rw is significant
compared to the overhead of struct iov_iter. Although the methods
proposed by Mikulas eliminate the iov_iter overhead well, they
cannot be applied to Ext4-dax unless we implement an internal
"read" method in Ext4-dax.

For Ext4-dax, there could be two approaches to optimizing:
1) implementing the internal "read" method without the complexity
of iterators and dax_iomap_rw; 2) optimizing how dax_iomap_rw works.
Since dax_iomap_rw requires ext4_iomap_begin, which further involves
the iomap structure and others (e.g., journaling status locks in Ext4),
we think implementing the internal "read" method would be easier.

As for whether the external .read interface in the VFS should be kept:
since there is still a performance gap (3.9%) between the "read" method
and the optimized "read_iter" method, we think keeping it is better.

Thanks,
Zhongwei

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Expense of read_iter
  2021-01-15  9:40         ` Zhongwei Cai
@ 2021-01-20  4:47           ` Dave Chinner
  2021-01-20 14:18             ` Jan Kara
  0 siblings, 1 reply; 30+ messages in thread
From: Dave Chinner @ 2021-01-20  4:47 UTC (permalink / raw)
  To: Zhongwei Cai
  Cc: Mikulas Patocka, Theodore Ts'o, Matthew Wilcox, David Laight,
	Mingkai Dong, Andrew Morton, Jan Kara, Steven Whitehouse,
	Eric Sandeen, Dave Chinner, Wang Jianchao, Rajesh Tadakamadla,
	linux-kernel, linux-fsdevel, linux-nvdimm

On Fri, Jan 15, 2021 at 05:40:43PM +0800, Zhongwei Cai wrote:
> On Thu, 14 Jan 2021, Mikulas wrote:
> For Ext4-dax, the overhead of dax_iomap_rw is significant
> compared to the overhead of struct iov_iter. Although methods
> proposed by Mikulas can eliminate the overhead of iov_iter
> well, they can not be applied in Ext4-dax unless we implement an
> internal "read" method in Ext4-dax.
> 
> For Ext4-dax, there could be two approaches to optimizing:
> 1) implementing the internal "read" method without the complexity
> of iterators and dax_iomap_rw;

Please do not go and re-invent the wheel just for ext4. If there's a
problem in a shared path - ext2, FUSE and XFS all use dax_iomap_rw()
as well, so any improvements to that path benefit all DAX users, not
just ext4.

> 2) optimizing how dax_iomap_rw works.
> Since dax_iomap_rw requires ext4_iomap_begin, which further involves
> the iomap structure and others (e.g., journaling status locks in Ext4),
> we think implementing the internal "read" method would be easier.

Maybe it is, but it's also very selfish. The DAX iomap path was
written to be correct for all users, not necessarily to provide
optimal performance. There will be lots of things that could be done
to optimise it, so rather than creating a special snowflake in ext4
that makes DAX in ext4 much harder to maintain for non-ext4 DAX
developers, please work to improve the common DAX IO path and so
provide the same benefit to all the filesystems that use it.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Expense of read_iter
  2021-01-20  4:47           ` Dave Chinner
@ 2021-01-20 14:18             ` Jan Kara
  2021-01-20 15:12               ` Mikulas Patocka
  2021-01-21 16:30               ` Zhongwei Cai
  0 siblings, 2 replies; 30+ messages in thread
From: Jan Kara @ 2021-01-20 14:18 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Zhongwei Cai, Mikulas Patocka, Theodore Ts'o, Matthew Wilcox,
	David Laight, Mingkai Dong, Andrew Morton, Jan Kara,
	Steven Whitehouse, Eric Sandeen, Dave Chinner, Wang Jianchao,
	Rajesh Tadakamadla, linux-kernel, linux-fsdevel, linux-nvdimm

On Wed 20-01-21 15:47:00, Dave Chinner wrote:
> On Fri, Jan 15, 2021 at 05:40:43PM +0800, Zhongwei Cai wrote:
> > On Thu, 14 Jan 2021, Mikulas wrote:
> > For Ext4-dax, the overhead of dax_iomap_rw is significant
> > compared to the overhead of struct iov_iter. Although methods
> > proposed by Mikulas can eliminate the overhead of iov_iter
> > well, they can not be applied in Ext4-dax unless we implement an
> > internal "read" method in Ext4-dax.
> > 
> > For Ext4-dax, there could be two approaches to optimizing:
> > 1) implementing the internal "read" method without the complexity
> > of iterators and dax_iomap_rw;
> 
> Please do not go and re-invent the wheel just for ext4. If there's a
> problem in a shared path - ext2, FUSE and XFS all use dax_iomap_rw()
> as well, so any improvements to that path benefit all DAX users, not
> just ext4.
> 
> > 2) optimizing how dax_iomap_rw works.
> > Since dax_iomap_rw requires ext4_iomap_begin, which further involves
> > the iomap structure and others (e.g., journaling status locks in Ext4),
> > we think implementing the internal "read" method would be easier.
> 
> Maybe it is, but it's also very selfish. The DAX iomap path was
> written to be correct for all users, not necessarily to provide
> optimal performance. There will be lots of things that could be done
> to optimise it, so rather than creating a special snowflake in ext4
> that makes DAX in ext4 much harder to maintain for non-ext4 DAX
> developers, please work to improve the common DAX IO path and so
> provide the same benefit to all the filesystems that use it.

Yeah, I agree. I'm against an ext4-private solution for this read problem. And
I'm also against duplicating ->read_iter functionality in the ->read handler.
The maintenance burden of this code duplication is IMHO just too big. We
rather need to improve the generic code so that the fast path is faster.
And every filesystem will benefit, because this is not an ext4-specific
problem.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Expense of read_iter
  2021-01-20 14:18             ` Jan Kara
@ 2021-01-20 15:12               ` Mikulas Patocka
  2021-01-20 15:44                 ` David Laight
  2021-01-21 15:47                 ` Matthew Wilcox
  2021-01-21 16:30               ` Zhongwei Cai
  1 sibling, 2 replies; 30+ messages in thread
From: Mikulas Patocka @ 2021-01-20 15:12 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dave Chinner, Zhongwei Cai, Theodore Ts'o, Matthew Wilcox,
	David Laight, Mingkai Dong, Andrew Morton, Steven Whitehouse,
	Eric Sandeen, Dave Chinner, Wang Jianchao, Rajesh Tadakamadla,
	linux-kernel, linux-fsdevel, linux-nvdimm



On Wed, 20 Jan 2021, Jan Kara wrote:

> Yeah, I agree. I'm against ext4 private solution for this read problem. And
> I'm also against duplicating ->read_iter functionality in ->read handler.
> The maintenance burden of this code duplication is IMHO just too big. We
> rather need to improve the generic code so that the fast path is faster.
> And every filesystem will benefit because this is not ext4 specific
> problem.
> 
> 								Honza

Do you have some idea how to optimize the generic code that calls 
->read_iter?

vfs_read calls ->read if it is present. If not, it calls new_sync_read. 
new_sync_read's frame size is 128 bytes - it holds the structures iovec, 
kiocb and iov_iter. new_sync_read calls ->read_iter.

I have found out that the cost of calling new_sync_read is 3.3%; Zhongwei
found 3.9%. (The benchmark repeatedly reads the same 4k page.)

I don't see any way how to optimize new_sync_read or how to reduce its 
frame size. Do you?

Mikulas


^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: Expense of read_iter
  2021-01-20 15:12               ` Mikulas Patocka
@ 2021-01-20 15:44                 ` David Laight
  2021-01-21 15:47                 ` Matthew Wilcox
  1 sibling, 0 replies; 30+ messages in thread
From: David Laight @ 2021-01-20 15:44 UTC (permalink / raw)
  To: 'Mikulas Patocka', Jan Kara
  Cc: Dave Chinner, Zhongwei Cai, Theodore Ts'o, Matthew Wilcox,
	Mingkai Dong, Andrew Morton, Steven Whitehouse, Eric Sandeen,
	Dave Chinner, Wang Jianchao, Rajesh Tadakamadla, linux-kernel,
	linux-fsdevel, linux-nvdimm

From: Mikulas Patocka
> Sent: 20 January 2021 15:12
> 
> On Wed, 20 Jan 2021, Jan Kara wrote:
> 
> > Yeah, I agree. I'm against ext4 private solution for this read problem. And
> > I'm also against duplicating ->read_iter functionality in ->read handler.
> > The maintenance burden of this code duplication is IMHO just too big. We
> > rather need to improve the generic code so that the fast path is faster.
> > And every filesystem will benefit because this is not ext4 specific
> > problem.
> >
> > 								Honza
> 
> Do you have some idea how to optimize the generic code that calls
> ->read_iter?
> 
> vfs_read calls ->read if it is present. If not, it calls new_sync_read.
> new_sync_read's frame size is 128 bytes - it holds the structures iovec,
> kiocb and iov_iter. new_sync_read calls ->read_iter.
> 
> I have found out that the cost of calling new_sync_read is 3.3%, Zhongwei
> found out 3.9%. (the benchmark repeatedly reads the same 4k page)
> 
> I don't see any way how to optimize new_sync_read or how to reduce its
> frame size. Do you?

Why is the 'read_iter' path not just the same as the 'read' one
but calling copy_to_iter() instead of copy_to_user()?

For a single-fragment iov[] the difference might just be
measurable for a single-byte read.
But by the time you are transferring 4k it ought to be minuscule.

It isn't as though you have the cost of reading the iov[] from userspace.
(That hits sendmsg() vs send().)

	David



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Expense of read_iter
  2021-01-20 15:12               ` Mikulas Patocka
  2021-01-20 15:44                 ` David Laight
@ 2021-01-21 15:47                 ` Matthew Wilcox
  2021-01-21 16:06                   ` Mikulas Patocka
  1 sibling, 1 reply; 30+ messages in thread
From: Matthew Wilcox @ 2021-01-21 15:47 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Jan Kara, Dave Chinner, Zhongwei Cai, Theodore Ts'o,
	David Laight, Mingkai Dong, Andrew Morton, Steven Whitehouse,
	Eric Sandeen, Dave Chinner, Wang Jianchao, Rajesh Tadakamadla,
	linux-kernel, linux-fsdevel, linux-nvdimm

On Wed, Jan 20, 2021 at 10:12:01AM -0500, Mikulas Patocka wrote:
> Do you have some idea how to optimize the generic code that calls 
> ->read_iter?

Yes.

> It might be better to maintain an f_iocb_flags in the
> struct file and just copy that unconditionally.  We'd need to remember
> to update it in fcntl(F_SETFL), but I think that's the only place.

Want to give that a try?
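
A minimal sketch of that idea (f_iocb_flags as named above; the helper name
is made up, and the real init_sync_kiocb() also sets a couple of other
fields):

/* assumed new field in struct file: unsigned int f_iocb_flags; */

/* refresh the cached value wherever f_flags can change,
 * e.g. at open time and in fcntl(F_SETFL) */
static inline void file_update_iocb_flags(struct file *filp)
{
        filp->f_iocb_flags = iocb_flags(filp);
}

/* init_sync_kiocb() then copies one precomputed word instead of
 * rederiving the flags from f_flags and the inode on every call */
static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp)
{
        *kiocb = (struct kiocb) {
                .ki_filp = filp,
                .ki_flags = filp->f_iocb_flags,
                .ki_ioprio = get_current_ioprio(),
        };
}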

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Expense of read_iter
  2021-01-21 15:47                 ` Matthew Wilcox
@ 2021-01-21 16:06                   ` Mikulas Patocka
  0 siblings, 0 replies; 30+ messages in thread
From: Mikulas Patocka @ 2021-01-21 16:06 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jan Kara, Dave Chinner, Zhongwei Cai, Theodore Ts'o,
	David Laight, Mingkai Dong, Andrew Morton, Steven Whitehouse,
	Eric Sandeen, Dave Chinner, Wang Jianchao, Rajesh Tadakamadla,
	linux-kernel, linux-fsdevel, linux-nvdimm



On Thu, 21 Jan 2021, Matthew Wilcox wrote:

> On Wed, Jan 20, 2021 at 10:12:01AM -0500, Mikulas Patocka wrote:
> > Do you have some idea how to optimize the generic code that calls 
> > ->read_iter?
> 
> Yes.
> 
> > It might be better to maintain an f_iocb_flags in the
> > struct file and just copy that unconditionally.  We'd need to remember
> > to update it in fcntl(F_SETFL), but I think that's the only place.
> 
> Want to give that a try?

Yes - send me the patch and I'll benchmark it.

Mikulas


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Expense of read_iter
  2021-01-20 14:18             ` Jan Kara
  2021-01-20 15:12               ` Mikulas Patocka
@ 2021-01-21 16:30               ` Zhongwei Cai
  1 sibling, 0 replies; 30+ messages in thread
From: Zhongwei Cai @ 2021-01-21 16:30 UTC (permalink / raw)
  To: Jan Kara, Dave Chinner
  Cc: Mikulas Patocka, Theodore Ts'o, Matthew Wilcox, David Laight,
	Mingkai Dong, Andrew Morton, Steven Whitehouse, Eric Sandeen,
	Dave Chinner, Wang Jianchao, Rajesh Tadakamadla, linux-kernel,
	linux-fsdevel, linux-nvdimm


On Wed, 20 Jan 2021, Jan Kara wrote:
> On Wed 20-01-21 15:47:00, Dave Chinner wrote:
> > On Fri, Jan 15, 2021 at 05:40:43PM +0800, Zhongwei Cai wrote:
> > > On Thu, 14 Jan 2021, Mikulas wrote:
> > > For Ext4-dax, the overhead of dax_iomap_rw is significant
> > > compared to the overhead of struct iov_iter. Although methods
> > > proposed by Mikulas can eliminate the overhead of iov_iter
> > > well, they can not be applied in Ext4-dax unless we implement an
> > > internal "read" method in Ext4-dax.
> > >
> > > For Ext4-dax, there could be two approaches to optimizing:
> > > 1) implementing the internal "read" method without the complexity
> > > of iterators and dax_iomap_rw;
> >
> > Please do not go an re-invent the wheel just for ext4. If there's a
> > problem in a shared path - ext2, FUSE and XFS all use dax_iomap_rw()
> > as well, so any improvements to that path benefit all DAX users, not
> > just ext4.
> >
> > > 2) optimizing how dax_iomap_rw works.
> > > Since dax_iomap_rw requires ext4_iomap_begin, which further involves
> > > the iomap structure and others (e.g., journaling status locks in Ext4),
> > > we think implementing the internal "read" method would be easier.
> >
> > Maybe it is, but it's also very selfish. The DAX iomap path was
> > written to be correct for all users, not necessarily to provide
> > optimal performance. There will be lots of things that could be done
> > to optimise it, so rather than creating a special snowflake in ext4
> > that makes DAX in ext4 much harder to maintain for non-ext4 DAX
> > developers, please work to improve the common DAX IO path and so
> > provide the same benefit to all the filesystems that use it.
>
> Yeah, I agree. I'm against ext4 private solution for this read problem. And
> I'm also against duplicating ->read_iter functionality in ->read handler.
> The maintenance burden of this code duplication is IMHO just too big. We
> rather need to improve the generic code so that the fast path is faster.
> And every filesystem will benefit because this is not ext4 specific
> problem.
> 
>                                                                 Honza

We agree that the maintenance burden could be a problem here, so we will
take your suggestion and try to optimize the generic path. But as Mikulas
said ( https://lkml.org/lkml/2021/1/20/618 ), some parts of the overhead,
such as new_sync_read, seem hard to avoid, and we are concerned that
optimizing the generic path will have a limited effect. Nevertheless, we
will try and see how much we can improve.

Zhongwei

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread

Thread overview: 30+ messages
2021-01-07 13:15 [RFC v2] nvfs: a filesystem for persistent memory Mikulas Patocka
2021-01-07 15:11 ` Expense of read_iter Matthew Wilcox
2021-01-07 16:43   ` Mingkai Dong
2021-01-12 13:45     ` Zhongwei Cai
2021-01-12 14:06       ` David Laight
2021-01-13 16:44       ` Mikulas Patocka
2021-01-15  9:40         ` Zhongwei Cai
2021-01-20  4:47           ` Dave Chinner
2021-01-20 14:18             ` Jan Kara
2021-01-20 15:12               ` Mikulas Patocka
2021-01-20 15:44                 ` David Laight
2021-01-21 15:47                 ` Matthew Wilcox
2021-01-21 16:06                   ` Mikulas Patocka
2021-01-21 16:30               ` Zhongwei Cai
2021-01-07 18:59   ` Mikulas Patocka
2021-01-10  6:13     ` Matthew Wilcox
2021-01-10 21:19       ` Mikulas Patocka
2021-01-11  0:18         ` Matthew Wilcox
2021-01-11 21:10           ` Mikulas Patocka
2021-01-11 10:11       ` David Laight
2021-01-10 16:20 ` [RFC v2] nvfs: a filesystem for persistent memory Al Viro
2021-01-10 16:51   ` Al Viro
2021-01-10 21:14   ` Mikulas Patocka
2021-01-10 23:40     ` Al Viro
2021-01-11 11:41       ` Mikulas Patocka
2021-01-11 10:29   ` David Laight
2021-01-11 11:44     ` Mikulas Patocka
2021-01-11 11:57       ` David Laight
2021-01-11 14:43         ` Al Viro
2021-01-11 14:54           ` David Laight
