From: Zhongwei Cai <sunrise_l@sjtu.edu.cn>
To: Mingkai Dong <mingkaidong@gmail.com>,
Matthew Wilcox <willy@infradead.org>
Cc: Mikulas Patocka <mpatocka@redhat.com>,
Andrew Morton <akpm@linux-foundation.org>,
Jan Kara <jack@suse.cz>, Steven Whitehouse <swhiteho@redhat.com>,
Eric Sandeen <esandeen@redhat.com>,
Dave Chinner <dchinner@redhat.com>, Theodore Ts'o <tytso@mit.edu>,
Wang Jianchao <jianchao.wan9@gmail.com>,
"Tadakamadla, Rajesh" <rajesh.tadakamadla@hpe.com>,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-nvdimm@lists.01.org
Subject: Re: Expense of read_iter
Date: Tue, 12 Jan 2021 21:45:00 +0800 (CST) [thread overview]
Message-ID: <2041983017.5681521.1610459100858.JavaMail.zimbra@sjtu.edu.cn> (raw)
In-Reply-To: <17045315-CC1F-4165-B8E3-BA55DD16D46B@gmail.com>
I'm working with Mingkai on optimizations for Ext4-dax.
We think that optmizing the read-iter method cannot achieve the
same performance as the read method for Ext4-dax.
We tried Mikulas's benchmark on Ext4-dax. The overall time and perf
results are listed below:
Overall time of 2^26 4KB read.
Method Time
read 26.782s
read-iter 36.477s
Perf result, using the read_iter method:
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 1K of event 'cycles'
# Event count (approx.): 13379476464
#
# Overhead Command Shared Object Symbol
# ........ ....... ................ .......................................
#
20.09% pread [kernel.vmlinux] [k] copy_user_generic_string
6.58% pread [kernel.vmlinux] [k] iomap_apply
6.01% pread [kernel.vmlinux] [k] syscall_return_via_sysret
4.85% pread libc-2.31.so [.] __libc_pread
3.61% pread [kernel.vmlinux] [k] entry_SYSCALL_64_after_hwframe
3.25% pread [kernel.vmlinux] [k] _raw_read_lock
2.80% pread [kernel.vmlinux] [k] entry_SYSCALL_64
2.71% pread [ext4] [k] ext4_es_lookup_extent
2.71% pread [kernel.vmlinux] [k] __fsnotify_parent
2.63% pread [kernel.vmlinux] [k] __srcu_read_unlock
2.55% pread [kernel.vmlinux] [k] new_sync_read
2.39% pread [ext4] [k] ext4_iomap_begin
2.38% pread [kernel.vmlinux] [k] vfs_read
2.30% pread [kernel.vmlinux] [k] dax_iomap_actor
2.30% pread [kernel.vmlinux] [k] __srcu_read_lock
2.14% pread [ext4] [k] ext4_inode_block_valid
1.97% pread [kernel.vmlinux] [k] _copy_mc_to_iter
1.97% pread [ext4] [k] ext4_map_blocks
1.89% pread [kernel.vmlinux] [k] down_read
1.89% pread [kernel.vmlinux] [k] up_read
1.65% pread [ext4] [k] ext4_file_read_iter
1.48% pread [kernel.vmlinux] [k] dax_iomap_rw
1.48% pread [jbd2] [k] jbd2_transaction_committed
1.15% pread [nd_pmem] [k] __pmem_direct_access
1.15% pread [kernel.vmlinux] [k] ksys_pread64
1.15% pread [kernel.vmlinux] [k] __fget_light
1.15% pread [ext4] [k] ext4_set_iomap
1.07% pread [kernel.vmlinux] [k] atime_needs_update
0.82% pread pread [.] main
0.82% pread [kernel.vmlinux] [k] do_syscall_64
0.74% pread [kernel.vmlinux] [k] entry_SYSCALL_64_safe_stack
0.66% pread [kernel.vmlinux] [k] __x86_indirect_thunk_rax
0.66% pread [nd_pmem] [k] 0x00000000000001d0
0.59% pread [kernel.vmlinux] [k] dax_direct_access
0.58% pread [nd_pmem] [k] 0x00000000000001de
0.58% pread [kernel.vmlinux] [k] bdev_dax_pgoff
0.49% pread [kernel.vmlinux] [k] syscall_enter_from_user_mode
0.49% pread [kernel.vmlinux] [k] exit_to_user_mode_prepare
0.49% pread [kernel.vmlinux] [k] syscall_exit_to_user_mode
0.41% pread [kernel.vmlinux] [k] syscall_exit_to_user_mode_prepare
0.33% pread [nd_pmem] [k] 0x0000000000001083
0.33% pread [kernel.vmlinux] [k] dax_get_private
0.33% pread [kernel.vmlinux] [k] timestamp_truncate
0.33% pread [kernel.vmlinux] [k] percpu_counter_add_batch
0.33% pread [kernel.vmlinux] [k] copyout_mc
0.33% pread [ext4] [k] __check_block_validity.constprop.80
0.33% pread [kernel.vmlinux] [k] touch_atime
0.25% pread [nd_pmem] [k] 0x000000000000107f
0.25% pread [kernel.vmlinux] [k] rw_verify_area
0.25% pread [ext4] [k] ext4_iomap_end
0.25% pread [kernel.vmlinux] [k] _cond_resched
0.25% pread [kernel.vmlinux] [k] rcu_all_qs
0.16% pread [kernel.vmlinux] [k] __fdget
0.16% pread [kernel.vmlinux] [k] ktime_get_coarse_real_ts64
0.16% pread [kernel.vmlinux] [k] iov_iter_init
0.16% pread [kernel.vmlinux] [k] current_time
0.16% pread [nd_pmem] [k] 0x0000000000001075
0.16% pread [ext4] [k] ext4_inode_datasync_dirty
0.16% pread [kernel.vmlinux] [k] copy_mc_to_user
0.08% pread pread [.] pread@plt
0.08% pread [kernel.vmlinux] [k] __x86_indirect_thunk_r11
0.08% pread [kernel.vmlinux] [k] security_file_permission
0.08% pread [kernel.vmlinux] [k] dax_read_unlock
0.08% pread [kernel.vmlinux] [k] _raw_spin_unlock_irqrestore
0.08% pread [nd_pmem] [k] 0x000000000000108f
0.08% pread [nd_pmem] [k] 0x0000000000001095
0.08% pread [kernel.vmlinux] [k] rcu_read_unlock_strict
0.00% pread [kernel.vmlinux] [k] native_write_msr
#
# (Tip: Show current config key-value pairs: perf config --list)
#
Perf result, using the read method we added for Ext4-dax:
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 1K of event 'cycles'
# Event count (approx.): 13364755903
#
# Overhead Command Shared Object Symbol
# ........ ....... ................ .......................................
#
28.65% pread [kernel.vmlinux] [k] copy_user_generic_string
7.99% pread [ext4] [k] ext4_dax_read
6.50% pread [kernel.vmlinux] [k] syscall_return_via_sysret
5.43% pread libc-2.31.so [.] __libc_pread
4.45% pread [kernel.vmlinux] [k] entry_SYSCALL_64
4.20% pread [kernel.vmlinux] [k] down_read
3.38% pread [kernel.vmlinux] [k] _raw_read_lock
3.13% pread [ext4] [k] ext4_es_lookup_extent
3.05% pread [kernel.vmlinux] [k] __srcu_read_lock
2.72% pread [kernel.vmlinux] [k] __fsnotify_parent
2.55% pread [kernel.vmlinux] [k] __srcu_read_unlock
2.47% pread [kernel.vmlinux] [k] vfs_read
2.31% pread [kernel.vmlinux] [k] entry_SYSCALL_64_after_hwframe
1.89% pread [kernel.vmlinux] [k] up_read
1.73% pread [ext4] [k] ext4_map_blocks
1.65% pread pread [.] main
1.56% pread [kernel.vmlinux] [k] __fget_light
1.48% pread [ext4] [k] ext4_inode_block_valid
1.34% pread [kernel.vmlinux] [k] ksys_pread64
1.23% pread [kernel.vmlinux] [k] entry_SYSCALL_64_safe_stack
1.08% pread [kernel.vmlinux] [k] syscall_exit_to_user_mode
1.07% pread [nd_pmem] [k] __pmem_direct_access
0.99% pread [kernel.vmlinux] [k] atime_needs_update
0.91% pread [kernel.vmlinux] [k] security_file_permission
0.91% pread [kernel.vmlinux] [k] syscall_enter_from_user_mode
0.66% pread [kernel.vmlinux] [k] timestamp_truncate
0.58% pread [kernel.vmlinux] [k] ktime_get_coarse_real_ts64
0.49% pread pread [.] pread@plt
0.41% pread [kernel.vmlinux] [k] current_time
0.41% pread [kernel.vmlinux] [k] dax_direct_access
0.41% pread [kernel.vmlinux] [k] do_syscall_64
0.41% pread [kernel.vmlinux] [k] exit_to_user_mode_prepare
0.41% pread [kernel.vmlinux] [k] percpu_counter_add_batch
0.33% pread [kernel.vmlinux] [k] touch_atime
0.33% pread [ext4] [k] __check_block_validity.constprop.80
0.33% pread [kernel.vmlinux] [k] copy_mc_to_user
0.25% pread [kernel.vmlinux] [k] dax_get_private
0.25% pread [kernel.vmlinux] [k] rcu_all_qs
0.25% pread [nd_pmem] [k] 0x0000000000001095
0.16% pread [kernel.vmlinux] [k] _raw_spin_lock_irqsave
0.16% pread [kernel.vmlinux] [k] syscall_exit_to_user_mode_prepare
0.16% pread [nd_pmem] [k] 0x0000000000001083
0.16% pread [kernel.vmlinux] [k] rw_verify_area
0.16% pread [kernel.vmlinux] [k] _raw_spin_unlock_irqrestore
0.16% pread [kernel.vmlinux] [k] __fdget
0.16% pread [kernel.vmlinux] [k] dax_read_lock
0.16% pread [kernel.vmlinux] [k] __x86_indirect_thunk_rax
0.08% pread [kernel.vmlinux] [k] rcu_read_unlock_strict
0.08% pread [kernel.vmlinux] [k] dax_read_unlock
0.08% pread [kernel.vmlinux] [k] update_irq_load_avg
0.08% pread [nd_pmem] [k] 0x000000000000109d
0.08% pread [nd_pmem] [k] 0x000000000000107a
0.08% pread [kernel.vmlinux] [k] __x64_sys_pread64
0.00% pread [kernel.vmlinux] [k] native_write_msr
#
# (Tip: Sample related events with: perf record -e '{cycles,instructions}:S')
#
Note that the overall time of read method is 73.42% of the read-iter method.
If we sum up the percentage of read-iter specific functions (including
ext4_file_read_iter, iomap_apply, dax_iomap_actor, _copy_mc_to_iter,
ext4_iomap_begin, jbd2_transaction_committed, new_sync_read, dax_iomap_rw,
ext4_set_iomap, ext4_iomap_end and iov_iter_init), we will get 20.81%.
In the second trace, ext4_dax_read only consumes 7.99%, which can replace
all these functions.
The overhead mainly consists of two parts. The first is constructing
struct iov_iter and iterating it (i.e., new_sync, _copy_mc_to_iter and
iov_iter_init). The second is the dax io mechanism provided by VFS (i.e.,
dax_iomap_rw, iomap_apply and ext4_iomap_begin).
There could be two approaches to optimizing: 1) implementing the read method
without the complexity of iterators and dax_iomap_rw; 2) optimizing both
iterators and how dax_iomap_rw works. Since dax_iomap_rw requires
ext4_iomap_begin, which further involves the iomap structure and others
(e.g., journaling status locks in Ext4), we think implementing the read
method would be easier.
Thanks,
Zhongwei
next prev parent reply other threads:[~2021-01-12 13:54 UTC|newest]
Thread overview: 30+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-01-07 13:15 [RFC v2] nvfs: a filesystem for persistent memory Mikulas Patocka
2021-01-07 15:11 ` Expense of read_iter Matthew Wilcox
2021-01-07 16:43 ` Mingkai Dong
2021-01-12 13:45 ` Zhongwei Cai [this message]
2021-01-12 14:06 ` David Laight
2021-01-13 16:44 ` Mikulas Patocka
2021-01-15 9:40 ` Zhongwei Cai
2021-01-20 4:47 ` Dave Chinner
2021-01-20 14:18 ` Jan Kara
2021-01-20 15:12 ` Mikulas Patocka
2021-01-20 15:44 ` David Laight
2021-01-21 15:47 ` Matthew Wilcox
2021-01-21 16:06 ` Mikulas Patocka
2021-01-21 16:30 ` Zhongwei Cai
2021-01-07 18:59 ` Mikulas Patocka
2021-01-10 6:13 ` Matthew Wilcox
2021-01-10 21:19 ` Mikulas Patocka
2021-01-11 0:18 ` Matthew Wilcox
2021-01-11 21:10 ` Mikulas Patocka
2021-01-11 10:11 ` David Laight
2021-01-10 16:20 ` [RFC v2] nvfs: a filesystem for persistent memory Al Viro
2021-01-10 16:51 ` Al Viro
2021-01-10 21:14 ` Mikulas Patocka
2021-01-10 23:40 ` Al Viro
2021-01-11 11:41 ` Mikulas Patocka
2021-01-11 10:29 ` David Laight
2021-01-11 11:44 ` Mikulas Patocka
2021-01-11 11:57 ` David Laight
2021-01-11 14:43 ` Al Viro
2021-01-11 14:54 ` David Laight
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=2041983017.5681521.1610459100858.JavaMail.zimbra@sjtu.edu.cn \
--to=sunrise_l@sjtu.edu.cn \
--cc=akpm@linux-foundation.org \
--cc=dchinner@redhat.com \
--cc=esandeen@redhat.com \
--cc=jack@suse.cz \
--cc=jianchao.wan9@gmail.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-nvdimm@lists.01.org \
--cc=mingkaidong@gmail.com \
--cc=mpatocka@redhat.com \
--cc=rajesh.tadakamadla@hpe.com \
--cc=swhiteho@redhat.com \
--cc=tytso@mit.edu \
--cc=willy@infradead.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).