From: "Verma, Vishal L" <vishal.l.verma@intel.com>
To: "Elliott@hp.com" <Elliott@hp.com>
Cc: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"mingo@kernel.org" <mingo@kernel.org>,
	"neilb@suse.de" <neilb@suse.de>, "hch@lst.de" <hch@lst.de>,
	"Williams, Dan J" <dan.j.williams@intel.com>,
	"axboe@fb.com" <axboe@fb.com>, "hpa@zytor.com" <hpa@zytor.com>,
	"linux-nvdimm@lists.01.org" <linux-nvdimm@lists.01.org>,
	"gregkh@linuxfoundation.org" <gregkh@linuxfoundation.org>,
	"luto@amacapital.net" <luto@amacapital.net>,
	"david@fromorbit.com" <david@fromorbit.com>
Subject: Re: [Linux-nvdimm] [PATCH v2 19/20] nd_btt: atomic sector updates
Date: Mon, 18 May 2015 22:38:11 +0000	[thread overview]
Message-ID: <1431988690.15857.31.camel@intel.com> (raw)
In-Reply-To: <94D0CD8314A33A4D9D801C0FE68B40295A9097E5@G9W0745.americas.hpqcorp.net>

On Sun, 2015-05-17 at 01:19 +0000, Elliott, Robert (Server Storage)
wrote:
> > -----Original Message-----
> > From: Linux-nvdimm [mailto:linux-nvdimm-bounces@lists.01.org] On Behalf Of
> > Dan Williams
> > Sent: Tuesday, April 28, 2015 1:26 PM
> > To: linux-nvdimm@lists.01.org
> > Cc: Ingo Molnar; Neil Brown; Greg KH; Dave Chinner; linux-
> > kernel@vger.kernel.org; Andy Lutomirski; Jens Axboe; H. Peter Anvin;
> > Christoph Hellwig
> > Subject: [Linux-nvdimm] [PATCH v2 19/20] nd_btt: atomic sector updates
> > 
> > From: Vishal Verma <vishal.l.verma@linux.intel.com>
> > 
> > BTT stands for Block Translation Table, and is a way to provide power
> > fail sector atomicity semantics for block devices that have the ability
> > to perform byte granularity IO. It relies on the ->rw_bytes() capability
> > of provided nd namespace devices.
> > 
> > The BTT works as a stacked block device, and reserves a chunk of space
> > from the backing device for its accounting metadata.  BLK namespaces may
> > mandate use of a BTT and expect the bus to initialize a BTT if not
> > already present.  Otherwise if a BTT is desired for other namespaces (or
> > partitions of a namespace) a BTT may be manually configured.
> ...
> 
> Running btt above pmem with a variety of workloads, I see an awful lot 
> of time spent in two places:
> * _raw_spin_lock 
> * btt_make_request

I'm curious why _raw_spin_lock is taking up so much time - does the
test system have more than 256 CPUs? Even without Dan's nr_cpus fix
(coming in v3), which will allow the lockless-path optimization to kick
in, the lane spinlocks should be largely uncontested on a system with
256 or fewer CPUs.
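
For context, lane acquisition works roughly like the following - a
simplified sketch with illustrative names, not the exact patch code:

	#include <linux/spinlock.h>
	#include <linux/smp.h>
	#include <linux/cpumask.h>

	struct lane_region {			/* hypothetical stand-in */
		unsigned int num_lanes;		/* 256 in the current layout */
		spinlock_t *lane_locks;		/* one lock per lane */
	};

	static unsigned int acquire_lane(struct lane_region *r)
	{
		unsigned int cpu = get_cpu();	/* pin to this CPU */
		unsigned int lane = cpu % r->num_lanes;

		/* With at most num_lanes CPUs, every CPU owns a private
		 * lane and no lock is taken; only bigger systems contend. */
		if (nr_cpu_ids > r->num_lanes)
			spin_lock(&r->lane_locks[lane]);
		return lane;
	}

	static void release_lane(struct lane_region *r, unsigned int lane)
	{
		if (nr_cpu_ids > r->num_lanes)
			spin_unlock(&r->lane_locks[lane]);
		put_cpu();
	}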

One explanation could be that the time is dominated by the 'map_locks'
- I'd need to get my perf setup going again to find out. (The more I
think about it, the more I'm convinced this is the case. Fortunately,
map_lock contention is simple to reduce by increasing the number of
locks, as in the sketch below.)
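
Something like the following (illustrative names, not the patch code) -
each map update hashes its premap block into one of N striped
spinlocks, so raising N directly lowers the odds of two CPUs colliding
on the same lock:

	#include <linux/spinlock.h>
	#include <linux/types.h>

	struct map_locks {
		unsigned int nlocks;	/* the tunable: more locks, less contention */
		spinlock_t *locks;	/* array of nlocks striped locks */
	};

	static void lock_map(struct map_locks *ml, u32 premap)
	{
		/* updates for different premap blocks spread evenly
		 * across the lock array */
		spin_lock(&ml->locks[premap % ml->nlocks]);
	}

	static void unlock_map(struct map_locks *ml, u32 premap)
	{
		spin_unlock(&ml->locks[premap % ml->nlocks]);
	}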

> 
> This occurs for fio to raw /dev/ndN devices, ddpt over ext4 or xfs,
> cp -R of large directories, and running make on the linux kernel.
> 
> Some specific results:
> 
> fio 4 KiB random reads, WC cache type, memcpy:
> * 43175 MB/s,   8 M IOPS  pmem0 and pmem1
> * 18500 MB/s, 1.5 M IOPS  nd0 and nd1
> 
> fio 4 KiB random reads, WC cache type, memcpy with non-temporal
> loads (when everything is 64-byte aligned):
> * 33814 MB/s, 4.3 M IOPS  nd0 and nd1
> 
> Zeroing out 32 MiB with ddpt:
> * 19 s, 1800 MiB/s	pmem
> * 55 s,  625 MiB/s	btt
> 
> If btt_make_request needs to stall this much, maybe it'd be better
> to utilize the blk-mq request queues, keeping requests in per-CPU
> queues while they're waiting, and using IPIs for completion 
> interrupts when they're finally done.
> 
> 
> fio 4 KiB random reads without non-temporal memcpy
> ==================================================
> perf top shows memcpy_erms taking nearly all the time, a function
> that uses the byte-granular REP MOVSB instruction:
>  85.78%  [kernel]             [k] memcpy_erms
>   1.21%  [kernel]             [k] _raw_spin_lock
>   0.72%  [nd_btt]             [k] btt_make_request
>   0.67%  [kernel]             [k] do_blockdev_direct_IO
>   0.47%  fio                  [.] get_io_u
> 
> fio 4 KiB random reads with non-temporal memcpy
> ===============================================
> perf top shows there are still quite a few unaligned accesses
> resulting in legacy memcpy, but about equal time is now spent
> in legacy vs NT memcpy:

I take it memcpy_lnt_st_64 expects 64B-aligned accesses? The BTT will
always have some unaligned accesses - the map updates are 4B long, and
while flog updates are 16B long, about half of them will not start at a
64B alignment due to how the flog is laid out (see the sketch below).
I'm open to hearing ideas about how this could be improved. One way, as
Dan already said, is to reduce the number of metadata updates by
configuring a larger (4K) sector size.
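
To illustrate the flog half of that - a sketch of how I understand the
layout (exact field names and padding in the patch may differ): each
16B update lands in one of the two slots of a 64B-aligned pair, and the
slots are written alternately, so roughly every other update starts at
offset 32:

	#include <stdint.h>

	struct flog_update {		/* the 16B payload that gets written */
		uint32_t lba;
		uint32_t old_map;
		uint32_t new_map;
		uint32_t seq;
	};

	struct flog_pair {		/* assumed 64B-aligned on media */
		struct flog_update slot0;	/* offset  0: 64B-aligned     */
		uint8_t pad0[16];
		struct flog_update slot1;	/* offset 32: not 64B-aligned */
		uint8_t pad1[16];
	};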

>  30.47%  [kernel]            [k] memcpy_erms
>  26.27%  [kernel]            [k] memcpy_lnt_st_64
>   5.37%  [kernel]            [k] _raw_spin_lock
>   2.20%  [kernel]            [k] btt_make_request
>   2.03%  [kernel]            [k] do_blockdev_direct_IO
>   1.41%  fio                 [.] get_io_u
>   1.22%  [kernel]            [k] btt_map_read
>   1.15%  [kernel]            [k] pmem_rw_bytes
>   1.01%  [kernel]            [k] nd_btt_rw_bytes
>   0.98%  [kernel]            [k] nd_region_acquire_lane
>   0.89%  fio                 [.] get_next_rand_block
>   0.88%  fio                 [.] thread_main
>   0.79%  fio                 [.] ios_completed
>   0.76%  fio                 [.] td_io_queue
>   0.75%  [kernel]            [k] _raw_spin_lock_irqsave
>   0.68%  [kernel]            [k] kmem_cache_free
>   0.66%  [kernel]            [k] kmem_cache_alloc
>   0.59%  [kernel]            [k] __audit_syscall_exit
>   0.57%  [kernel]            [k] aio_complete
>   0.54%  [kernel]            [k] do_io_submit
>   0.52%  [kernel]            [k] _raw_spin_unlock_irqrestore
> 
> fio randrw workload
> ===================
> perf top shows that adding writes to the mix brings
> btt_make_request's cpu_relax() loop to the forefront:

This is good data - we've thought about benchmarking an alternate
strategy for the 'Read Tracking Table': instead of a linear array, use
a hash table (hash new_postmap into an RTT slot). It would add false
waiting due to hash collisions, but writers would no longer have to
walk the array - they could just hash straight to the right slot.
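
Roughly like this (a sketch of the idea only - the hash, table size,
and names are assumptions, and we haven't benchmarked it; RTT_VALID is
the 0x80000000 flag visible in your annotate output):

	#include <linux/compiler.h>	/* READ_ONCE()/WRITE_ONCE() */
	#include <linux/types.h>
	#include <asm/processor.h>	/* cpu_relax() */

	#define RTT_VALID	(1U << 31)
	#define RTT_HASH_SLOTS	256	/* assumed table size */

	static u32 rtt[RTT_HASH_SLOTS];

	static inline u32 rtt_slot(u32 postmap)
	{
		return postmap % RTT_HASH_SLOTS;	/* or a better hash */
	}

	/* reader side: publish the postmap block being read */
	static void rtt_publish(u32 postmap)
	{
		WRITE_ONCE(rtt[rtt_slot(postmap)], RTT_VALID | postmap);
	}

	/* writer side: poll exactly one slot instead of scanning all
	 * nfree entries; a collision only adds harmless extra waiting */
	static void rtt_wait_for_readers(u32 new_postmap)
	{
		while (READ_ONCE(rtt[rtt_slot(new_postmap)]) ==
		       (RTT_VALID | new_postmap))
			cpu_relax();
	}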

Another option, which Dan suggested, is that instead of waiting in the
cpu_relax() loop, we drop the lane and start over with _acquire_lane.
This way the waiting happens in the spin_lock stage, giving other
threads/CPUs that resolve to the same spinlock a chance to run.
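
In sketch form (reader_active() is a hypothetical helper standing in
for the RTT check, and the struct/field names are only approximations
of the patch code):

	static int btt_write_sketch(struct nd_region *nd_region,
				    struct arena_info *arena)
	{
		unsigned int lane;
		u32 new_postmap;

	retry:
		lane = nd_region_acquire_lane(nd_region);
		new_postmap = arena->freelist[lane].block;

		if (reader_active(arena, new_postmap)) {
			/* don't cpu_relax() here: give the lane back and
			 * requeue on the lane spinlock instead */
			nd_region_release_lane(nd_region, lane);
			goto retry;
		}

		/* ... data write, then flog and map updates ... */
		nd_region_release_lane(nd_region, lane);
		return 0;
	}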

>   21.09%  [nd_btt]                              [k] btt_make_request 
>   19.06%  [kernel]                              [k] memcpy_erms  
>   14.35%  [kernel]                              [k] _raw_spin_lock   
>   10.38%  [nd_pmem]                             [k] memcpy_lnt_st_64    
>    1.57%  [kernel]                              [k] do_blockdev_direct_IO   
>    1.51%  [nd_pmem]                             [k] memcpy_lt_snt_64      
>    1.43%  [nd_btt]                              [k] nd_btt_rw_bytes       
>    1.39%  [kernel]                              [k] radix_tree_next_chunk  
>    1.33%  [kernel]                              [k] put_page             
>    1.21%  [nd_pmem]                             [k] pmem_rw_bytes      
>    1.11%  fio                                   [.] get_io_u          
>    0.90%  fio                                   [.] io_u_queued_complete  
>    0.74%  [kernel]                              [k] system_call         
>    0.72%  [libnd]                               [k] nd_region_acquire_lane   
>    0.71%  [nd_btt]                              [k] btt_map_read            
>    0.62%  fio                                   [.] thread_main           
> 
> inside btt_make_request:
> 
>        ¦                     /* Wait if the new block is being read from */
>        ¦                     for (i = 0; i < arena->nfree; i++)
>   2.98 ¦     ? je     2b4
>   0.05 ¦       mov    0x60(%r14),%rax
>   0.00 ¦       mov    %ebx,%edx
>        ¦       xor    %esi,%esi
>   0.03 ¦       or     $0x80000000,%edx
>   0.05 ¦       nop
>        ¦                             while (arena->rtt[i] == (RTT_VALID | new_postmap))
>  22.98 ¦290:   mov    %esi,%edi
>   0.01 ¦       cmp    %edx,(%rax,%rdi,4)
>  30.97 ¦       lea    0x0(,%rdi,4),%rcx
>  21.05 ¦     ? jne    2ab
>        ¦       nop
>        ¦     }
>        ¦
>        ¦     /* REP NOP (PAUSE) is a good thing to insert into busy-wait loops. */
>        ¦     static inline void rep_nop(void)
>        ¦     {
>        ¦             asm volatile("rep; nop" ::: "memory");
>        ¦2a0:   pause
>        ¦       mov    0x60(%r14),%rax
>        ¦       cmp    (%rax,%rcx,1),%edx
>        ¦     ? je     2a0
>        ¦                     }
> 
> 
> ddpt zeroing out
> ================
> perf top shows 27% in spinlocks, and 14% in btt_make_request (all in 
> the "wait if the new block is being read from" loop).
> 
>   26.48%  [kernel]                      [k] _raw_spin_lock   
>   14.46%  [nd_btt]                      [k] btt_make_request  
>   13.14%  [kernel]                      [k] memcpy_erms    
>   10.34%  [kernel]                      [k] copy_user_enhanced_fast_string 
>    3.12%  [nd_pmem]                     [k] memcpy_lt_snt_64  
>    1.15%  [kernel]                      [k] __block_commit_write.isra.21 
>    0.96%  [nd_pmem]                     [k] pmem_rw_bytes 
>    0.96%  [nd_btt]                      [k] nd_btt_rw_bytes 
>    0.86%  [kernel]                      [k] unlock_page     
>    0.65%  [kernel]                      [k] _raw_spin_lock_irqsave 
>    0.58%  [kernel]                      [k] bdev_read_only 
>    0.56%  [kernel]                      [k] release_pages  
>    0.54%  [nd_pmem]                     [k] memcpy_lnt_st_64  
>    0.53%  [ext4]                        [k] ext4_mark_iloc_dirty   
>    0.52%  [kernel]                      [k] __wake_up_bit   
>    0.52%  [kernel]                      [k] __clear_user   
> 
> ---
> Robert Elliott, HP Server Storage

Thanks, this is great analysis - I'd be quite interested in how these
numbers look with the two changes Dan mentioned.

	-Vishal
