From: Dave Chinner <david@fromorbit.com>
To: Dan Williams <dan.j.williams@intel.com>
Cc: axboe@fb.com, jack@suse.cz, linux-nvdimm@lists.01.org,
	linux-kernel@vger.kernel.org, Jeff Moyer <jmoyer@redhat.com>,
	Jan Kara <jack@suse.com>,
	ross.zwisler@linux.intel.com, hch@lst.de
Subject: Re: [PATCH v3 02/15] dax: increase granularity of dax_clear_blocks() operations
Date: Tue, 3 Nov 2015 11:51:13 +1100
Message-ID: <20151103005113.GN10656@dastard>
In-Reply-To: <20151102042952.6610.7185.stgit@dwillia2-desk3.amr.corp.intel.com>

On Sun, Nov 01, 2015 at 11:29:53PM -0500, Dan Williams wrote:
> dax_clear_blocks is currently performing a cond_resched() after every
> PAGE_SIZE memset.  We need not check so frequently, for example md-raid
> only calls cond_resched() at stripe granularity.  Also, in preparation
> for introducing a dax_map_atomic() operation that temporarily pins a dax
> mapping move the call to cond_resched() to the outer loop.
> 
> Reviewed-by: Jan Kara <jack@suse.com>
> Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  fs/dax.c |   27 ++++++++++++---------------
>  1 file changed, 12 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 5dc33d788d50..f8e543839e5c 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -28,6 +28,7 @@
>  #include <linux/sched.h>
>  #include <linux/uio.h>
>  #include <linux/vmstat.h>
> +#include <linux/sizes.h>
>  
>  int dax_clear_blocks(struct inode *inode, sector_t block, long size)
>  {
> @@ -38,24 +39,20 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long size)
>  	do {
>  		void __pmem *addr;
>  		unsigned long pfn;
> -		long count;
> +		long count, sz;
>  
> -		count = bdev_direct_access(bdev, sector, &addr, &pfn, size);
> +		sz = min_t(long, size, SZ_1M);
> +		count = bdev_direct_access(bdev, sector, &addr, &pfn, sz);
>  		if (count < 0)
>  			return count;
> -		BUG_ON(size < count);
> -		while (count > 0) {
> -			unsigned pgsz = PAGE_SIZE - offset_in_page(addr);
> -			if (pgsz > count)
> -				pgsz = count;
> -			clear_pmem(addr, pgsz);
> -			addr += pgsz;
> -			size -= pgsz;
> -			count -= pgsz;
> -			BUG_ON(pgsz & 511);
> -			sector += pgsz / 512;
> -			cond_resched();
> -		}
> +		if (count < sz)
> +			sz = count;
> +		clear_pmem(addr, sz);
> +		addr += sz;
> +		size -= sz;
> +		BUG_ON(sz & 511);
> +		sector += sz / 512;
> +		cond_resched();
>  	} while (size);
>  
>  	wmb_pmem();

dax_clear_blocks() needs to go away and be replaced by a driver
level implementation of blkdev_issue_zeroout(). This is effectively a
block device operation (we're taking sector addresses and zeroing
them), so it really belongs in the pmem drivers rather than the DAX
code.

I suspect a REQ_WRITE_SAME implementation is the way to go here, as
then the filesystems can just call sb_issue_zeroout() and the block
layer zeroing will work on all types of storage without the
filesystem having to care whether DAX is in use or not.
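
To illustrate the layering, here is a minimal userspace sketch, not
kernel code. Every name in it (struct bdev_model, zero_sectors,
model_issue_zeroout, pmem_model_zero_sectors, model_demo) is
hypothetical and made up for the example, and it does not model
REQ_WRITE_SAME itself. It only shows the shape: the caller uses one
generic zeroing entry point, and the driver may or may not supply a
fast path behind it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_SECTOR_SIZE 512

/* Hypothetical stand-in for a block device; not a kernel structure. */
struct bdev_model {
	unsigned char *data;   /* backing store */
	size_t nr_sectors;     /* device size in sectors */
	/* Optional driver fast path; NULL when the driver has none. */
	int (*zero_sectors)(struct bdev_model *bdev, size_t sector, size_t nr);
	unsigned long fast_path_calls;
};

/* What a pmem-like driver might do: zero the range directly. */
static int pmem_model_zero_sectors(struct bdev_model *bdev, size_t sector,
				   size_t nr)
{
	memset(bdev->data + sector * MODEL_SECTOR_SIZE, 0,
	       nr * MODEL_SECTOR_SIZE);
	bdev->fast_path_calls++;
	return 0;
}

/* Generic entry point: the caller never asks what kind of device it is. */
static int model_issue_zeroout(struct bdev_model *bdev, size_t sector,
			       size_t nr)
{
	static const unsigned char zeroes[MODEL_SECTOR_SIZE];
	size_t i;

	if (sector > bdev->nr_sectors || nr > bdev->nr_sectors - sector)
		return -1;
	if (bdev->zero_sectors)
		return bdev->zero_sectors(bdev, sector, nr);
	/* Fallback: write a buffer of zeroes one sector at a time. */
	for (i = 0; i < nr; i++)
		memcpy(bdev->data + (sector + i) * MODEL_SECTOR_SIZE,
		       zeroes, MODEL_SECTOR_SIZE);
	return 0;
}

/* Zero sectors 1-2 of a 4 sector device; return 1 if only they changed. */
static int model_demo(int with_fast_path)
{
	static unsigned char disk[4 * MODEL_SECTOR_SIZE];
	struct bdev_model bdev = {
		disk, 4, with_fast_path ? pmem_model_zero_sectors : NULL, 0
	};
	size_t i;

	memset(disk, 0xff, sizeof(disk));
	if (model_issue_zeroout(&bdev, 1, 2) != 0)
		return 0;
	for (i = 0; i < sizeof(disk); i++) {
		int in_range = i >= MODEL_SECTOR_SIZE &&
			       i < 3 * MODEL_SECTOR_SIZE;

		if (disk[i] != (in_range ? 0x00 : 0xff))
			return 0;
	}
	return 1;
}
```

model_demo() gives the same result with or without the driver fast
path, which is the point: the caller's code does not change.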

Putting the implementation of the zeroing in the pmem drivers will
enable the drivers to optimise the caching behaviour of block
zeroing.  The synchronous cache flushing behaviour of this function
is a performance killer as we are now block zeroing on allocation
and that results in two synchronous data writes (zero on alloc,
commit, write data, commit) for each page.

The zeroing (and the data, for that matter) doesn't need to be
committed to persistent store until the allocation is written and
committed to the journal - that will happen with a REQ_FLUSH|REQ_FUA
write, so it makes sense to deploy the big hammer and delay the
blocking CPU cache flushes until the last possible moment in cases
like this.
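
As a rough userspace model of the difference (again not kernel code;
struct pmem_model and the model_* helpers are hypothetical names made
up for this sketch, and model_flush() merely counts where a blocking
cache flush would be issued):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096

/* Hypothetical model of a pmem region; not a kernel structure. */
struct pmem_model {
	unsigned char *base;    /* ordinary memory standing in for pmem */
	size_t size;            /* region size in bytes */
	int dirty;              /* nonzero while unflushed stores exist */
	unsigned long flushes;  /* blocking flushes issued so far */
};

/* Stand-in for a blocking CPU cache flush; only counts the event. */
static void model_flush(struct pmem_model *p)
{
	if (p->dirty) {
		p->flushes++;
		p->dirty = 0;
	}
}

/* Zero a range, flushing immediately only when sync is set. */
static void model_zero(struct pmem_model *p, size_t off, size_t len, int sync)
{
	assert(off <= p->size && len <= p->size - off);
	memset(p->base + off, 0, len);
	p->dirty = 1;
	if (sync)
		model_flush(p);
}

/* Write data to a range, flushing immediately only when sync is set. */
static void model_write(struct pmem_model *p, size_t off, const void *src,
			size_t len, int sync)
{
	assert(off <= p->size && len <= p->size - off);
	memcpy(p->base + off, src, len);
	p->dirty = 1;
	if (sync)
		model_flush(p);
}

/* Models the journal commit: everything outstanding is flushed here. */
static void model_commit(struct pmem_model *p)
{
	model_flush(p);
}

/* Allocate-and-write npages pages; return how many flushes that cost. */
static unsigned long model_alloc_and_write(size_t npages, int sync)
{
	static unsigned char mem[16 * MODEL_PAGE_SIZE];
	static const unsigned char data[MODEL_PAGE_SIZE] = { 0x5a };
	struct pmem_model p = { mem, sizeof(mem), 0, 0 };
	size_t i;

	assert(npages <= 16);
	for (i = 0; i < npages; i++) {
		model_zero(&p, i * MODEL_PAGE_SIZE, MODEL_PAGE_SIZE, sync);
		model_write(&p, i * MODEL_PAGE_SIZE, data, MODEL_PAGE_SIZE,
			    sync);
	}
	model_commit(&p);
	return p.flushes;
}
```

In this model, model_alloc_and_write(4, 1) costs 8 flushes (two per
page) while model_alloc_and_write(4, 0) costs 1, issued at commit.
Real hardware costs will differ; the sketch only counts flush points.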

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
