* [PATCH][RFC] nvdimm: pmem: always flush nvdimm for write request
@ 2019-04-12  6:47 Li RongQing
  2019-04-12 16:32 ` Dan Williams
  0 siblings, 1 reply; 6+ messages in thread
From: Li RongQing @ 2019-04-12  6:47 UTC (permalink / raw)
  To: linux-nvdimm

Flushing the nvdimm on every write request can speed up random writes,
giving about a 20% performance improvement.

Below are fio 4k random write results for an nvdimm exposed as /dev/pmem0:

Before:
Jobs: 32 (f=32): [W(32)][14.2%][w=1884MiB/s][w=482k IOPS][eta 01m:43s]
After:
Jobs: 32 (f=32): [W(32)][8.3%][w=2378MiB/s][w=609k IOPS][eta 01m:50s]

This also makes sure that the newly written data is durable.

Co-developed-by: Liang ZhiCheng <liangzhicheng@baidu.com>
Signed-off-by: Liang ZhiCheng <liangzhicheng@baidu.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
---
This test was done on Intel AEP (Apache Pass) DIMMs.
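As a side note (not in the original mail): the quoted before/after IOPS figures work out to slightly more than the stated 20%. A quick check:

```shell
# Sanity-check the fio IOPS figures quoted above (482k before, 609k after).
awk 'BEGIN {
    before = 482000; after = 609000   # IOPS from the two fio status lines
    printf "IOPS gain: %.1f%%\n", (after / before - 1) * 100
}'
# prints: IOPS gain: 26.3%
```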

 drivers/nvdimm/pmem.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 1d432c5ed..9f8f25880 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -197,6 +197,7 @@ static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
 	unsigned long start;
 	struct bio_vec bvec;
 	struct bvec_iter iter;
+	unsigned int op = bio_op(bio);
 	struct pmem_device *pmem = q->queuedata;
 	struct nd_region *nd_region = to_region(pmem);
 
@@ -206,7 +207,7 @@ static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
 	do_acct = nd_iostat_start(bio, &start);
 	bio_for_each_segment(bvec, bio, iter) {
 		rc = pmem_do_bvec(pmem, bvec.bv_page, bvec.bv_len,
-				bvec.bv_offset, bio_op(bio), iter.bi_sector);
+				bvec.bv_offset, op, iter.bi_sector);
 		if (rc) {
 			bio->bi_status = rc;
 			break;
@@ -215,7 +216,7 @@ static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
 	if (do_acct)
 		nd_iostat_end(bio, start);
 
-	if (bio->bi_opf & REQ_FUA)
+	if (bio->bi_opf & REQ_FUA || op_is_write(op))
 		nvdimm_flush(nd_region);
 
 	bio_endio(bio);
-- 
2.16.2

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


* Re: [PATCH][RFC] nvdimm: pmem: always flush nvdimm for write request
  2019-04-12  6:47 [PATCH][RFC] nvdimm: pmem: always flush nvdimm for write request Li RongQing
@ 2019-04-12 16:32 ` Dan Williams
  2019-04-13 13:04   ` Re: " Li,Rongqing
  0 siblings, 1 reply; 6+ messages in thread
From: Dan Williams @ 2019-04-12 16:32 UTC (permalink / raw)
  To: Li RongQing; +Cc: linux-nvdimm

On Thu, Apr 11, 2019 at 11:47 PM Li RongQing <lirongqing@baidu.com> wrote:
>
> Flushing the nvdimm on every write request can speed up random writes,
> giving about a 20% performance improvement.
>
> Below are fio 4k random write results for an nvdimm exposed as /dev/pmem0:
>
> Before:
> Jobs: 32 (f=32): [W(32)][14.2%][w=1884MiB/s][w=482k IOPS][eta 01m:43s]
> After:
> Jobs: 32 (f=32): [W(32)][8.3%][w=2378MiB/s][w=609k IOPS][eta 01m:50s]

Interesting result. Another experiment proposed below...

> This also makes sure that the newly written data is durable.

It's overkill for durability because ADR already handles the necessary
write-queue flush at power-loss.

> Co-developed-by: Liang ZhiCheng <liangzhicheng@baidu.com>
> Signed-off-by: Liang ZhiCheng <liangzhicheng@baidu.com>
> Signed-off-by: Li RongQing <lirongqing@baidu.com>
> ---
> This test was done on Intel AEP (Apache Pass) DIMMs.
>
>  drivers/nvdimm/pmem.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 1d432c5ed..9f8f25880 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -197,6 +197,7 @@ static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
>         unsigned long start;
>         struct bio_vec bvec;
>         struct bvec_iter iter;
> +       unsigned int op = bio_op(bio);
>         struct pmem_device *pmem = q->queuedata;
>         struct nd_region *nd_region = to_region(pmem);
>
> @@ -206,7 +207,7 @@ static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
>         do_acct = nd_iostat_start(bio, &start);
>         bio_for_each_segment(bvec, bio, iter) {
>                 rc = pmem_do_bvec(pmem, bvec.bv_page, bvec.bv_len,
> -                               bvec.bv_offset, bio_op(bio), iter.bi_sector);
> +                               bvec.bv_offset, op, iter.bi_sector);
>                 if (rc) {
>                         bio->bi_status = rc;
>                         break;
> @@ -215,7 +216,7 @@ static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
>         if (do_acct)
>                 nd_iostat_end(bio, start);
>
> -       if (bio->bi_opf & REQ_FUA)
> +       if (bio->bi_opf & REQ_FUA || op_is_write(op))
>                 nvdimm_flush(nd_region);

One question is whether this benefit is coming from just the write
ordering from the barrier, or the WPQ flush itself. Can you try the
following and see if you get the same result?

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index bc2f700feef8..10c24de52395 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -217,6 +217,8 @@ static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)

        if (bio->bi_opf & REQ_FUA)
                nvdimm_flush(nd_region);
+       else if (op_is_write(bio_op(bio)))
+               wmb();

        bio_endio(bio);
        return BLK_QC_T_NONE;


* Re: [PATCH][RFC] nvdimm: pmem: always flush nvdimm for write request
  2019-04-12 16:32 ` Dan Williams
@ 2019-04-13 13:04   ` Li,Rongqing
  2019-04-14  3:20     ` Elliott, Robert (Servers)
  0 siblings, 1 reply; 6+ messages in thread
From: Li,Rongqing @ 2019-04-13 13:04 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-nvdimm




________________________________
From: Dan Williams <dan.j.williams@intel.com>
Sent: April 13, 2019 0:32
To: Li,Rongqing
Cc: linux-nvdimm
Subject: Re: [PATCH][RFC] nvdimm: pmem: always flush nvdimm for write request

>--- a/drivers/nvdimm/pmem.c
>+++ b/drivers/nvdimm/pmem.c
>@@ -217,6 +217,8 @@ static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
>
>        if (bio->bi_opf & REQ_FUA)
>                nvdimm_flush(nd_region);
>+       else if (op_is_write(bio_op(bio)))
>+               wmb();

Testing shows this change gives no performance improvement.

-RongQing



* RE: [PATCH][RFC] nvdimm: pmem: always flush nvdimm for write request
  2019-04-13 13:04   ` Re: " Li,Rongqing
@ 2019-04-14  3:20     ` Elliott, Robert (Servers)
  2019-04-15  3:26       ` 答复: " Li,Rongqing
  0 siblings, 1 reply; 6+ messages in thread
From: Elliott, Robert (Servers) @ 2019-04-14  3:20 UTC (permalink / raw)
  To: Li,Rongqing, Dan Williams; +Cc: linux-nvdimm



>> @@ -215,7 +216,7 @@ static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
>> 	if (do_acct)
>> 		nd_iostat_end(bio, start);
>> 
>> -	if (bio->bi_opf & REQ_FUA)
>> +	if (bio->bi_opf & REQ_FUA || op_is_write(op))
>>  		nvdimm_flush(nd_region);
...
>> Before:
>> Jobs: 32 (f=32): [W(32)][14.2%][w=1884MiB/s][w=482k IOPS][eta 01m:43s]
>> After:
>> Jobs: 32 (f=32): [W(32)][8.3%][w=2378MiB/s][w=609k IOPS][eta 01m:50s]
>>
>> -RongQing


Doing more work cannot be faster than doing less work, so something else
must be happening here.

Please post the full fio job file and how you invoke it (i.e., with numactl).

These tools help show what is happening on the CPUs and memory channels:
    perf top
    pcm.x
    pcm-memory.x -pmm




* Re: [PATCH][RFC] nvdimm: pmem: always flush nvdimm for write request
  2019-04-14  3:20     ` Elliott, Robert (Servers)
@ 2019-04-15  3:26       ` Li,Rongqing
  2019-04-15 17:04         ` Dan Williams
  0 siblings, 1 reply; 6+ messages in thread
From: Li,Rongqing @ 2019-04-15  3:26 UTC (permalink / raw)
  To: Elliott, Robert (Servers), Dan Williams; +Cc: linux-nvdimm



> -----Original Message-----
> From: Elliott, Robert (Servers) [mailto:elliott@hpe.com]
> Sent: April 14, 2019 11:21
> To: Li,Rongqing <lirongqing@baidu.com>; Dan Williams
> <dan.j.williams@intel.com>
> Cc: linux-nvdimm <linux-nvdimm@lists.01.org>
> Subject: RE: [PATCH][RFC] nvdimm: pmem: always flush nvdimm for write request
> 
> 
> 
> >> @@ -215,7 +216,7 @@ static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
> >> 	if (do_acct)
> >> 		nd_iostat_end(bio, start);
> >>
> >> -	if (bio->bi_opf & REQ_FUA)
> >> +	if (bio->bi_opf & REQ_FUA || op_is_write(op))
> >>  		nvdimm_flush(nd_region);
> ...
> >> Before:
> >> Jobs: 32 (f=32): [W(32)][14.2%][w=1884MiB/s][w=482k IOPS][eta 01m:43s]
> >> After:
> >> Jobs: 32 (f=32): [W(32)][8.3%][w=2378MiB/s][w=609k IOPS][eta 01m:50s]
> >>
> >> -RongQing
> 
> 
> Doing more work cannot be faster than doing less work, so something else
> must be happening here.
> 

Dan Williams may know more.


> Please post the full fio job file and how you invoke it (i.e., with numactl).
> 
The fio job file is below; we bind fio to a CPU range and memory node: numactl --membind=0 taskset -c 2-24 ./fio test_io_raw

[global]
numjobs=32
direct=1
filename=/dev/pmem0.1
iodepth=32
ioengine=libaio
group_reporting=1
bs=4K
time_based=1
 
[write1]
rw=randwrite
runtime=60
stonewall
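For reference, the job file and binding invocation above can be combined into a small reproduction script. This is only a sketch: the device path /dev/pmem0.1, the CPU range 2-24, and the membind node are taken from this report and will likely need adjusting on other machines; the run also destroys data on the target device.

```shell
#!/bin/sh
# Sketch: reproduce the 4k random-write benchmark from this thread.
# Assumes fio and numactl are installed and /dev/pmem0.1 is a scratch
# pmem namespace whose contents may be overwritten.

cat > test_io_raw <<'EOF'
[global]
numjobs=32
direct=1
filename=/dev/pmem0.1
iodepth=32
ioengine=libaio
group_reporting=1
bs=4K
time_based=1

[write1]
rw=randwrite
runtime=60
stonewall
EOF

# Bind fio's memory to node 0 and its threads to CPUs 2-24, as in the report
# (left commented out: it needs root and real pmem hardware):
#   numactl --membind=0 taskset -c 2-24 fio test_io_raw
echo "job file test_io_raw written"
```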

> These tools help show what is happening on the CPUs and memory channels:
>     perf top

 62.40%  [kernel]            [k] memcpy_flushcache
 21.17%  [kernel]            [k] fput
  6.12%  [kernel]            [k] apic_timer_interrupt
  0.89%  [kernel]            [k] rq_qos_done_bio
  0.66%  [kernel]            [k] bio_endio
  0.44%  [kernel]            [k] aio_complete_rw
  0.39%  [kernel]            [k] blkdev_bio_end_io
  0.31%  [kernel]            [k] entry_SYSCALL_64
  0.26%  [kernel]            [k] bio_disassociate_task
  0.23%  [kernel]            [k] read_tsc
  0.21%  fio                 [.] axmap_isset
  0.20%  [kernel]            [k] ktime_get_raw_ts64
  0.19%  [vdso]              [.] 0x7ffc475e2b30
  0.18%  [kernel]            [k] gup_pgd_range
  0.18%  [kernel]            [k] entry_SYSCALL_64_after_hwframe
  0.16%  [kernel]            [k] __audit_syscall_exit
  0.16%  [kernel]            [k] __x86_indirect_thunk_rax
  0.13%  [kernel]            [k] copy_user_enhanced_fast_string
  0.13%  [kernel]            [k] syscall_return_via_sysret
  0.12%  [kernel]            [k] preempt_count_add
  0.12%  [kernel]            [k] preempt_count_sub
  0.11%  [kernel]            [k] __x64_sys_clock_gettime
  0.11%  [kernel]            [k] tracer_hardirqs_off
  0.10%  [kernel]            [k] native_write_msr
  0.10%  [kernel]            [k] posix_get_monotonic_raw
  0.10%  fio                 [.] get_io_u

>     pcm.x

http://pasted.co/6fc93b42

>     pcm-memory.x -pmm
> 
http://pasted.co/d5c0c96b


If fio is not bound to a CPU range and NUMA node, performance is much lower overall, but this optimization helps in both cases; it sometimes gives about a 40% improvement.

-Li


* Re: [PATCH][RFC] nvdimm: pmem: always flush nvdimm for write request
  2019-04-15  3:26       ` Re: " Li,Rongqing
@ 2019-04-15 17:04         ` Dan Williams
  0 siblings, 0 replies; 6+ messages in thread
From: Dan Williams @ 2019-04-15 17:04 UTC (permalink / raw)
  To: Li,Rongqing; +Cc: linux-nvdimm

On Sun, Apr 14, 2019 at 8:26 PM Li,Rongqing <lirongqing@baidu.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Elliott, Robert (Servers) [mailto:elliott@hpe.com]
> > Sent: April 14, 2019 11:21
> > To: Li,Rongqing <lirongqing@baidu.com>; Dan Williams
> > <dan.j.williams@intel.com>
> > Cc: linux-nvdimm <linux-nvdimm@lists.01.org>
> > Subject: RE: [PATCH][RFC] nvdimm: pmem: always flush nvdimm for write request
> >
> >
> >
> > >> @@ -215,7 +216,7 @@ static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
> > >>    if (do_acct)
> > >>            nd_iostat_end(bio, start);
> > >>
> > >> -  if (bio->bi_opf & REQ_FUA)
> > >> +  if (bio->bi_opf & REQ_FUA || op_is_write(op))
> > >>            nvdimm_flush(nd_region);
> > ...
> > >> Before:
> > >> Jobs: 32 (f=32): [W(32)][14.2%][w=1884MiB/s][w=482k IOPS][eta 01m:43s]
> > >> After:
> > >> Jobs: 32 (f=32): [W(32)][8.3%][w=2378MiB/s][w=609k IOPS][eta 01m:50s]
> > >>
> > >> -RongQing
> >
> >
> > Doing more work cannot be faster than doing less work, so something else
> > must be happening here.
> >
>
> Dan Williams may know more.

One thought is that back-pressure from awaiting write-posted-queue
flush completion causes full buffer writes to coalesce at the
controller, i.e. write-combining effects from the flush-delay.

