Date: Thu, 3 Aug 2017 10:06:15 +0200
From: Johannes Thumshirn
Subject: Re: [PATCH 5/5] libnvdimm: add DMA support for pmem blk-mq
Message-ID: <20170803080615.GB4333@linux-x5ow.site>
To: Dan Williams
Cc: Vinod Koul, dmaengine@vger.kernel.org, linux-nvdimm@lists.01.org

On Tue, Aug 01, 2017 at 10:43:30AM -0700, Dan Williams wrote:
> On Tue, Aug 1, 2017 at 12:34 AM, Johannes Thumshirn wrote:
> > Dave Jiang writes:
> >
> >> Adding DMA support for pmem blk reads. This provides a significant
> >> reduction in CPU utilization for large memory reads while keeping good
> >> performance. DMAs are triggered by a test against
> >> bio_multiple_segments(), so small I/Os (4k or less?) are still
> >> performed by the CPU in order to reduce latency. By default the pmem
> >> driver will be using blk-mq with DMA.
> >>
> >> Numbers below are measured against pmem simulated via DRAM using
> >> memmap=NN!SS. The DMA engine used is the ioatdma on an Intel Skylake
> >> Xeon platform. Keep in mind that the performance for actual persistent
> >> memory will differ. Fio 2.21 was used.
> >>
> >> 64k: 1 task queuedepth=1
> >> CPU Read:  7631 MB/s   99.7% CPU    DMA Read:  2415 MB/s    54% CPU
> >> CPU Write: 3552 MB/s    100% CPU    DMA Write: 2173 MB/s    54% CPU
> >>
> >> 64k: 16 tasks queuedepth=16
> >> CPU Read:  36800 MB/s  1593% CPU    DMA Read:  29100 MB/s  607% CPU
> >> CPU Write: 20900 MB/s  1589% CPU    DMA Write: 23400 MB/s  585% CPU
> >>
> >> 2M: 1 task queuedepth=1
> >> CPU Read:  6013 MB/s   99.3% CPU    DMA Read:  7986 MB/s  59.3% CPU
> >> CPU Write: 3579 MB/s    100% CPU    DMA Write: 5211 MB/s  58.3% CPU
> >>
> >> 2M: 16 tasks queuedepth=16
> >> CPU Read:  18100 MB/s  1588% CPU    DMA Read:  21300 MB/s 180.9% CPU
> >> CPU Write: 14100 MB/s  1594% CPU    DMA Write: 20400 MB/s 446.9% CPU
> >>
> >> Signed-off-by: Dave Jiang
> >> ---
> >
> > Hi Dave,
> >
> > The above table shows that there's a performance benefit for 2M
> > transfers but a regression for 64k transfers, if we forget about the
> > CPU utilization for a second.
>
> I don't think we can forget about cpu utilization, and I would expect
> most users would value cpu over small bandwidth increases. Especially
> when those numbers show efficiency like this
>
> +160% cpu +26% read bandwidth
> +171% cpu -10% write bandwidth
>
> > Would it be beneficial to have heuristics on the transfer size that
> > decide when to use dma and when not? You introduced this hunk:
> >
> > -       rc = pmem_handle_cmd(cmd);
> > +       if (cmd->chan && bio_multiple_segments(req->bio))
> > +               rc = pmem_handle_cmd_dma(cmd, op_is_write(req_op(req)));
> > +       else
> > +               rc = pmem_handle_cmd(cmd);
> >
> > which uses dma for bios with multiple segments and the old path for
> > single segment bios. Maybe the single/multi segment logic can be
> > amended to something like:
> >
> > if (cmd->chan && bio_segments(req->bio) > PMEM_DMA_THRESH)
> >         rc = pmem_handle_cmd_dma(cmd, op_is_write(req_op(req)));
> > else
> >         rc = pmem_handle_cmd(cmd);
> >
> > Just something worth considering IMHO.
>
> The thing we want to avoid most is having the cpu stall doing nothing
> waiting for memory access, so I think once the efficiency of adding
> more cpu goes non-linear it's better to let the dma engine handle
> that. The current heuristic seems to achieve that, but the worry is
> does it over-select the cpu for cases where requests could be merged
> into a bulkier i/o that is more suitable for dma.
>
> I'm not sure we want another tunable vs predictable / efficient
> behavior by default.

Sorry for my late reply, but lots of automated performance regression CI
tools _will_ report errors because of the performance drop. Yes, these may
be false positives, but it will cost hours to resolve them. If we have a
tunable it's way easier to set it correctly for the use cases. And please
keep in mind that some people don't really care about CPU utilization but
about maximum throughput.
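For illustration, the threshold could be as simple as a module parameter
(a rough, untested sketch; the pmem_dma_thresh name and its default are
invented here, the rest reuses the identifiers from the hunk above):

    #include <linux/module.h>
    #include <linux/moduleparam.h>
    #include <linux/bio.h>
    #include <linux/blkdev.h>

    /*
     * Sketch only: replace the hard-coded bio_multiple_segments() test
     * with a runtime-writable module parameter. The parameter name and
     * its default value are made up for illustration.
     */
    static unsigned int pmem_dma_thresh = 2;
    module_param(pmem_dma_thresh, uint, 0644);
    MODULE_PARM_DESC(pmem_dma_thresh,
            "Min number of bio segments before offloading a request to DMA");

and then in the request handler:

            /* Offload sufficiently segmented bios to the DMA engine. */
            if (cmd->chan && bio_segments(req->bio) >= pmem_dma_thresh)
                    rc = pmem_handle_cmd_dma(cmd, op_is_write(req_op(req)));
            else
                    rc = pmem_handle_cmd(cmd);

Being writable (0644) it would show up under
/sys/module/<module>/parameters/, so a CI or benchmark setup could adjust
it at runtime without rebuilding the driver, while the default keeps the
behavior predictable.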
Thanks,
        Johannes

--
Johannes Thumshirn                                          Storage
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850