Date: Fri, 4 Aug 2017 17:17:40 +0900
From: Minchan Kim
To: Ross Zwisler
Cc: Jens Axboe, Jerome Marchand, linux-nvdimm@lists.01.org, Dave Chinner,
	linux-kernel@vger.kernel.org, Matthew Wilcox, Christoph Hellwig,
	Jan Kara, Andrew Morton, "karam . lee", seungho1.park@lge.com,
	Nitin Gupta
Subject: Re: [PATCH 0/3] remove rw_page() from brd, pmem and btt
Message-ID: <20170804081740.GA2083@bbox>
References: <20170728165604.10455-1-ross.zwisler@linux.intel.com>
	<20170728173143.GE15980@bombadil.infradead.org>
	<20170802221359.GA20666@linux.intel.com>
	<20170803001315.GF32020@bbox>
	<20170803211335.GA1260@linux.intel.com>
	<20170804035441.GA305@bbox>
In-Reply-To: <20170804035441.GA305@bbox>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Disposition: inline

On Fri, Aug 04, 2017 at 12:54:41PM +0900, Minchan Kim wrote:
> On Thu, Aug 03, 2017 at 03:13:35PM -0600, Ross Zwisler wrote:
> > On Thu, Aug 03, 2017 at 09:13:15AM +0900, Minchan Kim wrote:
> > > Hi Ross,
> > >
> > > On Wed, Aug 02, 2017 at 04:13:59PM -0600, Ross Zwisler wrote:
> > > > On Fri, Jul 28, 2017 at 10:31:43AM -0700, Matthew Wilcox wrote:
> > > > > On Fri, Jul 28, 2017 at 10:56:01AM -0600, Ross Zwisler wrote:
> > > > > > Dan Williams and Christoph Hellwig have recently expressed doubt
> > > > > > about whether the rw_page() interface made sense for synchronous
> > > > > > memory drivers [1][2].  It's unclear whether this interface has
> > > > > > any performance benefit for these drivers, but as we continue to
> > > > > > fix bugs it is clear that it does have a maintenance burden.  This
> > > > > > series removes the rw_page() implementations in brd, pmem and btt
> > > > > > to relieve this burden.
> > > > >
> > > > > Why don't you measure whether it has performance benefits?  I don't
> > > > > understand why zram would see performance benefits and not other
> > > > > drivers.  If it's going to be removed, then the whole interface
> > > > > should be removed, not just have the implementations removed from
> > > > > some drivers.
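[For readers who don't live in the block layer: the hook being argued over is
the ->rw_page() entry in struct block_device_operations, which lets a driver
service a single page synchronously without building a bio.  The sketch below
shows roughly how the swap path reaches it, assuming the ~v4.13 block layer;
the helper name sketch_read_page() and its trimmed-down body are illustrative,
not the exact mainline bdev_read_page().]

#include <linux/blkdev.h>
#include <linux/genhd.h>

/*
 * The hook itself, as it looked around v4.13:
 *
 *     int (*rw_page)(struct block_device *bdev, sector_t sector,
 *                    struct page *page, bool is_write);
 *
 * Simplified read-side caller; error handling and request_queue
 * refcounting are dropped for brevity.
 */
static int sketch_read_page(struct block_device *bdev, sector_t sector,
			    struct page *page)
{
	const struct block_device_operations *ops = bdev->bd_disk->fops;

	if (!ops->rw_page)
		return -EOPNOTSUPP;	/* caller falls back to a normal bio */

	/* No bio, no request: the driver moves the page synchronously. */
	return ops->rw_page(bdev, sector + get_start_sect(bdev), page, false);
}

The argument in this thread is whether skipping the bio path like this buys
pmem/btt enough to justify keeping a second I/O path in every driver.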
> > > >
> > > > Okay, I've run a bunch of performance tests with the PMEM and with
> > > > BTT entry points for rw_page() in a swap workload, and in all cases I
> > > > do see an improvement over the code when rw_page() is removed.  Here
> > > > are the results from my random lab box:
> > > >
> > > > Average latency of swap_writepage()
> > > > +------+------------+---------+-------------+
> > > > |      | no rw_page | rw_page | Improvement |
> > > > +-------------------------------------------+
> > > > | PMEM |   5.0 us   | 4.7 us  |      6%     |
> > > > +-------------------------------------------+
> > > > | BTT  |   6.8 us   | 6.1 us  |     10%     |
> > > > +------+------------+---------+-------------+
> > > >
> > > > Average latency of swap_readpage()
> > > > +------+------------+---------+-------------+
> > > > |      | no rw_page | rw_page | Improvement |
> > > > +-------------------------------------------+
> > > > | PMEM |   3.3 us   | 2.9 us  |     12%     |
> > > > +-------------------------------------------+
> > > > | BTT  |   3.7 us   | 3.4 us  |      8%     |
> > > > +------+------------+---------+-------------+
> > > >
> > > > The workload was pmbench, a memory benchmark, run on a system where I
> > > > had severely restricted the amount of memory in the system with the
> > > > 'mem' kernel command line parameter.  The benchmark was set up to test
> > > > more memory than I allowed the OS to have so it spilled over into swap.
> > > >
> > > > The PMEM or BTT device was set up as my swap device, and during the
> > > > test I got a few hundred thousand samples of each of swap_writepage()
> > > > and swap_readpage().  The PMEM/BTT device was just memory reserved with
> > > > the memmap kernel command line parameter.
> > > >
> > > > Thanks, Matthew, for asking for performance data.  It looks like
> > > > removing this code would have been a mistake.
> > >
> > > At Christoph Hellwig's suggestion, I made a quick patch which does IO
> > > without dynamic bio allocation for swap IO.  It's not a formal patch
> > > worth sending to mainline yet, but I believe it's enough to test the
> > > improvement.
> > >
> > > Could you test the patchset on pmem and btt without rw_page?
> > >
> > > For the patch to work, block drivers need to declare that they are
> > > synchronous IO devices via BDI_CAP_SYNC, but if that's hard you can just
> > > force every swap IO down the (sis->flags & SWP_SYNC_IO) path by removing
> > > the "if (!(sis->flags & SWP_SYNC_IO))" check in swap_[read|write]page.
> > >
> > > The patchset is based on 4.13-rc3.
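[The quoted idea above amounts to building the bio on the stack for devices
that complete I/O synchronously, so the swap path never waits on the bio
mempool.  A minimal sketch of that pattern against the ~v4.13 bio API follows;
it is not the actual patchset, and BDI_CAP_SYNC / SWP_SYNC_IO come from that
unposted RFC rather than mainline.]

#include <linux/bio.h>
#include <linux/blkdev.h>

/*
 * On-stack bio for a synchronous swap device: no bio_alloc(), no
 * mempool pressure, no completion callback -- just submit and wait.
 * Field names follow ~v4.13 (bio->bi_bdev was later replaced by
 * bio_set_dev()).
 */
static int sync_swap_rw(struct block_device *bdev, sector_t sector,
			struct page *page, bool is_write)
{
	struct bio bio;
	struct bio_vec bvec;

	bio_init(&bio, &bvec, 1);	/* bio and its single bvec live on the stack */
	bio.bi_bdev = bdev;
	bio.bi_iter.bi_sector = sector;
	bio_add_page(&bio, page, PAGE_SIZE, 0);
	bio_set_op_attrs(&bio, is_write ? REQ_OP_WRITE : REQ_OP_READ, REQ_SYNC);

	/* Waiting inline is only sane because the device (pmem, zram, btt)
	 * completes synchronously -- the property BDI_CAP_SYNC advertises
	 * in the RFC. */
	return submit_bio_wait(&bio);
}

Unlike rw_page(), this still goes through the generic submit_bio() machinery,
so the per-driver hook disappears but some generic setup cost remains; that
difference is roughly what the follow-up numbers below are probing.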
> >
> > Thanks for the patch, here are the updated results from my test box:
> >
> > Average latency of swap_writepage()
> > +------+------------+---------+---------+
> > |      | no rw_page | minchan | rw_page |
> > +----------------------------------------
> > | PMEM |   5.0 us   | 4.98 us |  4.7 us |
> > +----------------------------------------
> > | BTT  |   6.8 us   | 6.3 us  |  6.1 us |
> > +------+------------+---------+---------+
> >
> > Average latency of swap_readpage()
> > +------+------------+---------+---------+
> > |      | no rw_page | minchan | rw_page |
> > +----------------------------------------
> > | PMEM |   3.3 us   | 3.27 us |  2.9 us |
> > +----------------------------------------
> > | BTT  |   3.7 us   | 3.44 us |  3.4 us |
> > +------+------------+---------+---------+
> >
> > I've added another digit of precision in some cases to help differentiate
> > the various results.
> >
> > In all cases your patches did perform better than with the regularly
> > allocated BIO, but in all cases the rw_page() path was still the fastest,
> > even if only marginally.
>
> Thanks for the testing.  Are your numbers within the noise level?
>
> I can't understand why PMEM doesn't see much gain while BTT is a significant
> win (8%).  I guess the no-rw_page BTT runs had more chances to wait on
> dynamic bio allocation, and both my patch and rw_page reduced that
> significantly.  However, in the no-rw_page PMEM run there weren't many waits
> on bio allocation because the device is so fast, so the difference comes
> purely from the number of instructions executed.  At a quick glance, bio
> init/submit is not trivial, so I do understand where the 12% improvement
> comes from, but I'm not sure it is a really big difference in practice
> against the cost of the maintenance burden.

I tested pmbench 10 times on my local machine (4 cores) with zram as the swap
device.  On my machine the on-stack bio is even faster than rw_page.
Unbelievable.  I guess it's really hard to get a stable result under severe
memory pressure; the difference is probably within the noise level (see the
stddev below).  So I think it's hard to conclude that rw_page is far faster
than the on-stack bio.

rw_page
        avg     5.54 us
        stddev  8.89%
        max     6.02 us
        min     4.20 us

on-stack bio
        avg     5.27 us
        stddev  13.03%
        max     5.96 us
        min     3.55 us
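[Taking the stddev figures as a fraction of the averages, one standard
deviation around each mean works out roughly to:

    rw_page:       5.54 us * (1 +/- 0.0889)  ~=  5.05 .. 6.03 us
    on-stack bio:  5.27 us * (1 +/- 0.1303)  ~=  4.58 .. 5.96 us

The two ranges overlap almost completely, which is consistent with the
"within the noise" reading above.  The per-run samples aren't in the mail, so
this is only a rough sanity check, not a significance test.]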