linux-nvdimm.lists.01.org archive mirror
* [RFC][PATCH] dax: Do not try to clear poison for partial pages
@ 2020-01-29 21:03 Vivek Goyal
  2020-01-31  5:42 ` Christoph Hellwig
  2020-02-05 20:26 ` jane.chu
  0 siblings, 2 replies; 7+ messages in thread
From: Vivek Goyal @ 2020-01-29 21:03 UTC (permalink / raw)
  To: Dan Williams, vishal.l.verma, Christoph Hellwig
  Cc: linux-nvdimm, linux-fsdevel

I am looking into getting rid of the dependency on the block device in
the dax path. One such place is __dax_zero_page_range(), which checks
whether the range being zeroed is aligned to the block device's logical
block size, and if so calls blkdev_issue_zeroout() instead of doing a
memset(). Calling blkdev_issue_zeroout() also clears bad blocks and
poison, if any, in that range.

This path is used by iomap_zero_range(), which in turn is used by
filesystems to zero partial filesystem blocks. For zeroing full
filesystem blocks we already seem to be calling blkdev_issue_zeroout(),
which clears bad blocks and poison in that range.

So this code currently only seems to help with partial filesystem
block zeroing, and only when the truncation/hole-punch happens on a
device logical block size boundary.

To avoid using blkdev_issue_zeroout() in this path, I proposed another
patch here which adds another dax operation to zero out a range and
clear poison.

https://lore.kernel.org/linux-fsdevel/20200123165249.GA7664@redhat.com/

Thinking more about it, that might be overkill for solving this problem.

How about we simply do not clear poison/bad blocks when a partial page
is being zeroed? IOW, users will need to hole-punch a whole filesystem
block's worth of data, which will free that block; it will be zeroed
some other time, clearing the poison in the process. For a partial fs
block truncation/punch_hole, we don't clear poison.
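From userspace, the contract proposed here could be exercised with
something like the following sketch (a hypothetical helper, not part of
any API; the 4096-byte block size is an assumption, real code should
use st_blksize):

```c
/* Sketch of the proposed contract: to have poison cleared, userspace
 * must punch a hole covering whole filesystem blocks; a sub-block
 * truncate/punch would take the memset() path that no longer clears
 * poison. BLK (4096) is an assumed block size; query st_blksize in
 * real code. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>
#include <sys/types.h>

#define BLK 4096UL

static int punch_whole_blocks(int fd, off_t offset, off_t len)
{
	/* Round inward to whole-block boundaries: only these blocks are
	 * freed and later reallocated as clean, zeroed blocks. */
	off_t start = (offset + BLK - 1) & ~(off_t)(BLK - 1);
	off_t end = (offset + len) & ~(off_t)(BLK - 1);

	if (end <= start)
		return 0;	/* nothing block-aligned to punch */
	return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			 start, end - start);
}
```

A sub-block range (say, 100 bytes in the middle of a block) rounds
inward to nothing and is left alone, which matches the "partial zeroing
does not clear poison" behavior being proposed.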

If the above interface is acceptable, then we can get rid of the code
which tries to call blkdev_issue_zeroout() in the iomap_zero_range()
path, and we don't have to implement another dax operation.

Looking for some feedback on this.

Vivek
 
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 fs/dax.c |   50 +++++++++++++++-----------------------------------
 1 file changed, 15 insertions(+), 35 deletions(-)

Index: redhat-linux/fs/dax.c
===================================================================
--- redhat-linux.orig/fs/dax.c	2020-01-29 15:19:18.551902448 -0500
+++ redhat-linux/fs/dax.c	2020-01-29 15:40:56.584824549 -0500
@@ -1044,47 +1044,27 @@ static vm_fault_t dax_load_hole(struct x
 	return ret;
 }
 
-static bool dax_range_is_aligned(struct block_device *bdev,
-				 unsigned int offset, unsigned int length)
-{
-	unsigned short sector_size = bdev_logical_block_size(bdev);
-
-	if (!IS_ALIGNED(offset, sector_size))
-		return false;
-	if (!IS_ALIGNED(length, sector_size))
-		return false;
-
-	return true;
-}
-
 int __dax_zero_page_range(struct block_device *bdev,
 		struct dax_device *dax_dev, sector_t sector,
 		unsigned int offset, unsigned int size)
 {
-	if (dax_range_is_aligned(bdev, offset, size)) {
-		sector_t start_sector = sector + (offset >> 9);
-
-		return blkdev_issue_zeroout(bdev, start_sector,
-				size >> 9, GFP_NOFS, 0);
-	} else {
-		pgoff_t pgoff;
-		long rc, id;
-		void *kaddr;
-
-		rc = bdev_dax_pgoff(bdev, sector, PAGE_SIZE, &pgoff);
-		if (rc)
-			return rc;
-
-		id = dax_read_lock();
-		rc = dax_direct_access(dax_dev, pgoff, 1, &kaddr, NULL);
-		if (rc < 0) {
-			dax_read_unlock(id);
-			return rc;
-		}
-		memset(kaddr + offset, 0, size);
-		dax_flush(dax_dev, kaddr + offset, size);
+	pgoff_t pgoff;
+	long rc, id;
+	void *kaddr;
+
+	rc = bdev_dax_pgoff(bdev, sector, PAGE_SIZE, &pgoff);
+	if (rc)
+		return rc;
+
+	id = dax_read_lock();
+	rc = dax_direct_access(dax_dev, pgoff, 1, &kaddr, NULL);
+	if (rc < 0) {
 		dax_read_unlock(id);
+		return rc;
 	}
+	memset(kaddr + offset, 0, size);
+	dax_flush(dax_dev, kaddr + offset, size);
+	dax_read_unlock(id);
 	return 0;
 }
 EXPORT_SYMBOL_GPL(__dax_zero_page_range);
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org


* Re: [RFC][PATCH] dax: Do not try to clear poison for partial pages
  2020-01-29 21:03 [RFC][PATCH] dax: Do not try to clear poison for partial pages Vivek Goyal
@ 2020-01-31  5:42 ` Christoph Hellwig
  2020-02-05 20:26 ` jane.chu
  1 sibling, 0 replies; 7+ messages in thread
From: Christoph Hellwig @ 2020-01-31  5:42 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Christoph Hellwig, linux-nvdimm, linux-fsdevel

On Wed, Jan 29, 2020 at 04:03:37PM -0500, Vivek Goyal wrote:
> I am looking into getting rid of dependency on block device in dax
> path. One such place is __dax_zero_page_range() which checks if
> range being zeroed is aligned to block device logical size, then
> it calls bdev_issue_zeroout() instead of doing memset(). Calling
> blkdev_issue_zeroout() also clears bad blocks and poison if any
> in that range.
> 
> This path is used by iomap_zero_range() which in-turn is being
> used by filesystems to zero partial filesystem system blocks.
> For zeroing full filesystem blocks we seem to be calling
> blkdev_issue_zeroout() which clears bad blocks and poison in that
> range.
> 
> So this code currently only seems to help with partial filesystem
> block zeroing. That too only if truncation/hole_punch happens on
> device logical block size boundary.
> 
> To avoid using blkdev_issue_zeroout() in this path, I proposed another
> patch here which adds another dax operation to zero out a range and
> clear poison.

I'll have to defer to Dan and others on the poison clearing issue,
as it keeps confusing me every time I look into how it is supposed
to work...  But in the end we'll need a path that doesn't rely on
the block device to clear poison anyway, so I think a method will
ultimately be needed.  That being said, this patch looks nice :)


* Re: [RFC][PATCH] dax: Do not try to clear poison for partial pages
  2020-01-29 21:03 [RFC][PATCH] dax: Do not try to clear poison for partial pages Vivek Goyal
  2020-01-31  5:42 ` Christoph Hellwig
@ 2020-02-05 20:26 ` jane.chu
  2020-02-06  0:37   ` Dan Williams
  1 sibling, 1 reply; 7+ messages in thread
From: jane.chu @ 2020-02-05 20:26 UTC (permalink / raw)
  To: Vivek Goyal, Dan Williams, vishal.l.verma, Christoph Hellwig
  Cc: linux-nvdimm, linux-fsdevel

Hello,

I haven't seen a response to this proposal; I'm unsure if there is a
different but related discussion ongoing...

I'd like to express my wish: please make it easier for pmem
applications when possible.

If the kernel does not clear poison when it could legitimately do so,
applications have to go to great lengths to clear poison themselves.
For cloud pmem applications that have an upper bound on error recovery
time, not clearing poison while zeroing out is quite undesirable.

Thanks!
-jane

On 1/29/20 1:03 PM, Vivek Goyal wrote:
> I am looking into getting rid of dependency on block device in dax
> path. One such place is __dax_zero_page_range() which checks if
> range being zeroed is aligned to block device logical size, then
> it calls bdev_issue_zeroout() instead of doing memset(). Calling
> blkdev_issue_zeroout() also clears bad blocks and poison if any
> in that range.
> 
> This path is used by iomap_zero_range() which in-turn is being
> used by filesystems to zero partial filesystem system blocks.
> For zeroing full filesystem blocks we seem to be calling
> blkdev_issue_zeroout() which clears bad blocks and poison in that
> range.
> 
> So this code currently only seems to help with partial filesystem
> block zeroing. That too only if truncation/hole_punch happens on
> device logical block size boundary.
> 
> To avoid using blkdev_issue_zeroout() in this path, I proposed another
> patch here which adds another dax operation to zero out a range and
> clear poison.
> 
> https://lore.kernel.org/linux-fsdevel/20200123165249.GA7664@redhat.com/
> 
> Thinking more about it, it might be an overkill to solve this problem.
> 
> How about if we just not clear poison/bad blocks when a partial page
> is being zeroed. IOW, users will need to hole_punch whole filesystem
> block worth of data which will free that block and it will be zeroed
> some other time and clear poison in the process. For partial fs block
> truncation/punch_hole, we don't clear poison.
> 
> If above interface is acceptable, then we can get rid of code which
> tries to call blkdev_issue_zeroout() in iomap_zero_range() path and
> we don't have to implement another dax operation.
> 
> Looking for some feedback on this.
> 
> Vivek


* Re: [RFC][PATCH] dax: Do not try to clear poison for partial pages
  2020-02-05 20:26 ` jane.chu
@ 2020-02-06  0:37   ` Dan Williams
  2020-02-18 19:50     ` Jeff Moyer
  0 siblings, 1 reply; 7+ messages in thread
From: Dan Williams @ 2020-02-06  0:37 UTC (permalink / raw)
  To: Jane Chu; +Cc: Christoph Hellwig, linux-nvdimm, linux-fsdevel

On Wed, Feb 5, 2020 at 12:27 PM <jane.chu@oracle.com> wrote:
>
> Hello,
>
> I haven't seen response to this proposal, unsure if there is a different
> but related discussion ongoing...
>
> I'd like to express my wish: please make it easier for the pmem
> applications when possible.
>
> If kernel does not clear poison when it could legitimately do so,

The only path where this happens today is the write() syscall in dax
mode; otherwise fallocate(PUNCH_HOLE) is currently the only guaranteed
way to trigger error clearing from userspace (outside of sending raw
commands to the device).

> applications have to go through lengths to clear poisons.
> For Cloud pmem applications that have upper bound on error recovery
> time, not clearing poison while zeroing-out is quite undesirable.

The complicating factor in all of this is the alignment requirement
for clearing and the inability of native cpu instructions to clear
errors. On current platforms, talking to firmware is required, and that
interface may require 256-byte block clearing. This is why the
implementation glommed on to clearing errors on block-I/O path writes:
we at least knew that all of those I/Os were 512-byte aligned.

This gets better with cpus that support the movdir64b instruction, in
that case there is still a 64-byte alignment requirement, but there's
no need to talk to the BIOS and therefore no need to talk to a driver.
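The alignment constraint can be made concrete with a small sketch (the
helper name is made up; the 256- and 64-byte units are the ones
mentioned above) showing how an arbitrary byte range shrinks to the
sub-range whose errors could actually be cleared:

```c
/* Sketch: shrink an arbitrary byte range to the largest sub-range whose
 * start and length are multiples of the clear-error unit (256 bytes for
 * current firmware interfaces, 64 for movdir64b, per the discussion
 * above). unit must be a power of two. Bytes outside the returned
 * sub-range would remain poisoned. */
#include <stdbool.h>
#include <stddef.h>

struct clear_range {
	size_t start;
	size_t len;
};

static bool clearable_subrange(size_t start, size_t len, size_t unit,
			       struct clear_range *out)
{
	size_t head = (start + unit - 1) & ~(unit - 1);	/* round start up */
	size_t tail = (start + len) & ~(unit - 1);	/* round end down */

	if (tail <= head)
		return false;	/* too small/misaligned to clear anything */
	out->start = head;
	out->len = tail - head;
	return true;
}
```

With the same 600-byte range, a 64-byte unit can clear twice as many
bytes as a 256-byte unit, which is the improvement movdir64b-class
hardware brings.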

So we have this awkward dependency on block-device I/O semantics only
because it happened to organize I/O in a way that supports error
clearing.

Right now the kernel does not install a pte on faults that land on a
page with known poison, but only because the error clearing path is so
convoluted and could only claim that fallocate(PUNCH_HOLE) cleared
errors because that was guaranteed to send 512-byte aligned zeros
down the block-I/O path when the fs blocks got reallocated. In a world
where native cpu instructions can clear errors, the dax write() syscall
case could be covered (modulo 64-byte alignment), and the kernel could
just let the page be mapped so that the application could attempt its
own fine-grained clearing without calling back into the kernel.


* Re: [RFC][PATCH] dax: Do not try to clear poison for partial pages
  2020-02-06  0:37   ` Dan Williams
@ 2020-02-18 19:50     ` Jeff Moyer
  2020-02-18 20:45       ` jane.chu
  0 siblings, 1 reply; 7+ messages in thread
From: Jeff Moyer @ 2020-02-18 19:50 UTC (permalink / raw)
  To: Dan Williams; +Cc: Christoph Hellwig, linux-nvdimm, linux-fsdevel

Dan Williams <dan.j.williams@intel.com> writes:

> Right now the kernel does not install a pte on faults that land on a
> page with known poison, but only because the error clearing path is so
> convoluted and could only claim that fallocate(PUNCH_HOLE) cleared
> errors because that was guaranteed to send 512-byte aligned zero's
> down the block-I/O path when the fs-blocks got reallocated. In a world
> where native cpu instructions can clear errors the dax write() syscall
> case could be covered (modulo 64-byte alignment), and the kernel could
> just let the page be mapped so that the application could attempt it's
> own fine-grained clearing without calling back into the kernel.

I'm not sure we'd want to allow mapping the PTEs even if there was
support for clearing errors via CPU instructions.  Any load from a
poisoned page will result in an MCE, and there exists the possibility
that you will hit an unrecoverable error (Processor Context Corrupt).
It's just safer to catch these cases by not mapping the page, and
forcing recovery through the driver.

-Jeff


* Re: [RFC][PATCH] dax: Do not try to clear poison for partial pages
  2020-02-18 19:50     ` Jeff Moyer
@ 2020-02-18 20:45       ` jane.chu
  2020-02-18 22:50         ` jane.chu
  0 siblings, 1 reply; 7+ messages in thread
From: jane.chu @ 2020-02-18 20:45 UTC (permalink / raw)
  To: Jeff Moyer, Dan Williams; +Cc: Christoph Hellwig, linux-nvdimm, linux-fsdevel

On 2/18/20 11:50 AM, Jeff Moyer wrote:
> Dan Williams <dan.j.williams@intel.com> writes:
> 
>> Right now the kernel does not install a pte on faults that land on a
>> page with known poison, but only because the error clearing path is so
>> convoluted and could only claim that fallocate(PUNCH_HOLE) cleared
>> errors because that was guaranteed to send 512-byte aligned zero's
>> down the block-I/O path when the fs-blocks got reallocated. In a world
>> where native cpu instructions can clear errors the dax write() syscall
>> case could be covered (modulo 64-byte alignment), and the kernel could
>> just let the page be mapped so that the application could attempt it's
>> own fine-grained clearing without calling back into the kernel.
> 
> I'm not sure we'd want to do allow mapping the PTEs even if there was
> support for clearing errors via CPU instructions.  Any load from a
> poisoned page will result in an MCE, and there exists the possiblity
> that you will hit an unrecoverable error (Processor Context Corrupt).
> It's just safer to catch these cases by not mapping the page, and
> forcing recovery through the driver.
> 
> -Jeff
> 

I'm still in the process of trying a number of things before making an
attempt to respond to Dan's response. But I'm too slow, so I'd like
to share some concerns I have here.

If poison in a file is consumed, the signal handler does the repair
and recovery as follows: punch a hole of at least 4K, then pwrite the
correct data into the 'hole', then resume the operation.
However, because the newly allocated pmem block (due to the pwrite
into the 'hole') is a different, clean physical pmem block while the
poisoned block remains unfixed, we have a provisioning problem:
  1. DCPMM is expensive, hence users are likely to provide little
spare provisioning;
  2. there is no API between the dax filesystem and the pmem driver
for clearing poison at each legitimate point, such as when the
filesystem allocates a pmem block or zeroes out a range.
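The repair flow described above (punch at least a 4K hole, then pwrite
the good data back) might be sketched as follows. This is a hypothetical
helper with an assumed 4K page size, not an endorsement of the
provisioning cost it incurs:

```c
/* Sketch of the recovery sequence described above: punch at least a
 * page-sized hole over the poisoned offset, then pwrite the correct
 * data back. The rewritten extent lands on a different, clean physical
 * block; the poisoned physical block stays quarantined, which is the
 * provisioning concern. PAGE (4096) is an assumption. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>
#include <unistd.h>

#define PAGE 4096UL

static int repair_range(int fd, off_t off, const void *good, size_t len)
{
	off_t start = off & ~(off_t)(PAGE - 1);
	off_t hole = ((off - start) + (off_t)len + PAGE - 1) &
		     ~(off_t)(PAGE - 1);

	/* Free the poisoned block(s); the clean replacement is
	 * allocated by the pwrite below. */
	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      start, hole))
		return -1;
	return pwrite(fd, good, len, off) == (ssize_t)len ? 0 : -1;
}
```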

As DCPMM is used for its performance and capacity in cloud
applications, this means the performance code paths include the error
handling and recovery code paths...

With respect to the new cpu instruction, my concern is about the API,
including the error blast radius as reported in the signal payload.
Is there a venue where we could discuss this in more detail?

Regards,
-jane




* Re: [RFC][PATCH] dax: Do not try to clear poison for partial pages
  2020-02-18 20:45       ` jane.chu
@ 2020-02-18 22:50         ` jane.chu
  0 siblings, 0 replies; 7+ messages in thread
From: jane.chu @ 2020-02-18 22:50 UTC (permalink / raw)
  To: Jeff Moyer, Dan Williams; +Cc: Christoph Hellwig, linux-nvdimm, linux-fsdevel

On 2/18/20 12:45 PM, jane.chu@oracle.com wrote:
> 
> I'm still in the process of trying a number of things before making an
> attempt to respond to Dan's response. But I'm too slow, so I'd like
> to share some concerns I have here.
> 
> If a poison in a file is consumed, and the signal handle does the
> repair and recover as follow: punch a hole the size at least 4K, then
> pwrite the correct data in to the 'hole', then resume the operation.
> However, because the newly allocated pmem block (due to pwrite to the 
> 'hole') is a different clean physical pmem block while the poisoned
> block remain unfixed, so we have a provisioning problem, because
>   1. DCPMEM is expensive hence there is likely little provision being
> provided by users;
>   2. lack up API between dax-filesystem and pmem driver for clearing
> poison at each legitimate point, such as when the filesystem tries
> to allocate a pmem block, or zeroing out a range.
>
> As DCPMM is used for its performance and capacity in cloud application,
> which translates to that the performance code paths include the error
> handling and recovery code path...
> 
> With respect to the new cpu instruction, my concern is about the API 
> including the error blast radius as reported in the signal payload.
> Is there a venue where we could discuss more in detail ?

For all the quarantined poison blocks, it's not practical to clear the
poison via ndctl/libndctl at per-namespace granularity, for fear of
poison that occurred in valid pmem blocks while the data was at rest.

How do we ultimately clear poison in a dax-fs in the current framework?
It seems to me poison needs to be cleared on the go, automatically.

Regards,
-jane

> 
> Regards,
> -jane
> 
> 
> 

