nvdimm.lists.linux.dev archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/3] dax: clear poison on the fly along pwrite
@ 2021-09-14 23:31 Jane Chu
  2021-09-14 23:31 ` [PATCH 1/3] dax: introduce dax_operation dax_clear_poison Jane Chu
                   ` (4 more replies)
  0 siblings, 5 replies; 24+ messages in thread
From: Jane Chu @ 2021-09-14 23:31 UTC (permalink / raw)
  To: dan.j.williams, vishal.l.verma, dave.jiang, ira.weiny, viro,
	willy, jack, nvdimm, linux-kernel, linux-fsdevel

If pwrite(2) encounters poison in a pmem range, it fails with EIO.
This is unecessary if hardware is capable of clearing the poison.

Though not all dax backend hardware has the capability of clearing
poison on the fly, but dax backed by Intel DCPMEM has such capability,
and it's desirable to, first, speed up repairing by means of it;
second, maintain backend continuity instead of fragmenting it in
search for clean blocks.

Jane Chu (3):
  dax: introduce dax_operation dax_clear_poison
  dax: introduce dax_clear_poison to dax pwrite operation
  libnvdimm/pmem: Provide pmem_dax_clear_poison for dax operation

 drivers/dax/super.c   | 13 +++++++++++++
 drivers/nvdimm/pmem.c | 17 +++++++++++++++++
 fs/dax.c              |  9 +++++++++
 include/linux/dax.h   |  6 ++++++
 4 files changed, 45 insertions(+)

-- 
2.18.4


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 1/3] dax: introduce dax_operation dax_clear_poison
  2021-09-14 23:31 [PATCH 0/3] dax: clear poison on the fly along pwrite Jane Chu
@ 2021-09-14 23:31 ` Jane Chu
  2021-11-04 17:53   ` Christoph Hellwig
  2021-09-14 23:31 ` [PATCH 2/3] dax: introduce dax_clear_poison to dax pwrite operation Jane Chu
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 24+ messages in thread
From: Jane Chu @ 2021-09-14 23:31 UTC (permalink / raw)
  To: dan.j.williams, vishal.l.verma, dave.jiang, ira.weiny, viro,
	willy, jack, nvdimm, linux-kernel, linux-fsdevel

Though not all dax backend hardware has the capability of clearing
poison on the fly, but dax backed by Intel DCPMEM has such capability,
and it's desirable to, first, speed up repairing by means of it;
second, maintain backend continuity instead of fragmenting it in
search for clean blocks.

Signed-off-by: Jane Chu <jane.chu@oracle.com>
---
 drivers/dax/super.c | 13 +++++++++++++
 include/linux/dax.h |  6 ++++++
 2 files changed, 19 insertions(+)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 44736cbd446e..935d496fa7db 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -373,6 +373,19 @@ int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
 }
 EXPORT_SYMBOL_GPL(dax_zero_page_range);
 
+int dax_clear_poison(struct dax_device *dax_dev, pgoff_t pgoff,
+			size_t nr_pages)
+{
+	if (!dax_alive(dax_dev))
+		return -ENXIO;
+
+	if (!dax_dev->ops->clear_poison)
+		return -EOPNOTSUPP;
+
+	return dax_dev->ops->clear_poison(dax_dev, pgoff, nr_pages);
+}
+EXPORT_SYMBOL_GPL(dax_clear_poison);
+
 #ifdef CONFIG_ARCH_HAS_PMEM_API
 void arch_wb_cache_pmem(void *addr, size_t size);
 void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
diff --git a/include/linux/dax.h b/include/linux/dax.h
index b52f084aa643..c54c1087ece1 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -36,6 +36,11 @@ struct dax_operations {
 			struct iov_iter *);
 	/* zero_page_range: required operation. Zero page range   */
 	int (*zero_page_range)(struct dax_device *, pgoff_t, size_t);
+	/*
+	 * clear_poison: clear media error in the given page aligned range via
+	 * vendor appropriate method. Optional operation.
+	 */
+	int (*clear_poison)(struct dax_device *, pgoff_t, size_t);
 };
 
 extern struct attribute_group dax_attribute_group;
@@ -226,6 +231,7 @@ size_t dax_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
 		size_t bytes, struct iov_iter *i);
 int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
 			size_t nr_pages);
+int dax_clear_poison(struct dax_device *dax_dev, pgoff_t pgoff, size_t nr_pages);
 void dax_flush(struct dax_device *dax_dev, void *addr, size_t size);
 
 ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
-- 
2.18.4


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 2/3] dax: introduce dax_clear_poison to dax pwrite operation
  2021-09-14 23:31 [PATCH 0/3] dax: clear poison on the fly along pwrite Jane Chu
  2021-09-14 23:31 ` [PATCH 1/3] dax: introduce dax_operation dax_clear_poison Jane Chu
@ 2021-09-14 23:31 ` Jane Chu
  2021-11-04 17:53   ` Christoph Hellwig
  2021-09-14 23:31 ` [PATCH 2/3] dax: introduce dax clear poison to page aligned " Jane Chu
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 24+ messages in thread
From: Jane Chu @ 2021-09-14 23:31 UTC (permalink / raw)
  To: dan.j.williams, vishal.l.verma, dave.jiang, ira.weiny, viro,
	willy, jack, nvdimm, linux-kernel, linux-fsdevel

When pwrite(2) encounters poison in a dax range, it fails with EIO.
But if the backend hardware of the dax device is capable of clearing
poison, try that and resume the write.

Signed-off-by: Jane Chu <jane.chu@oracle.com>
---
 fs/dax.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/fs/dax.c b/fs/dax.c
index 99b4e78d888f..592a156abbf2 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1156,8 +1156,17 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		if (ret)
 			break;
 
+		/*
+		 * If WRITE operation encounters media error in a page aligned
+		 * range, try to clear the error, then resume, for just once.
+		 */
 		map_len = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size),
 				&kaddr, NULL);
+		if ((map_len == -EIO) && (iov_iter_rw(iter) == WRITE)) {
+			if (dax_clear_poison(dax_dev, pgoff, PHYS_PFN(size)) == 0)
+				map_len = dax_direct_access(dax_dev, pgoff,
+						PHYS_PFN(size), &kaddr, NULL);
+		}
 		if (map_len < 0) {
 			ret = map_len;
 			break;
-- 
2.18.4


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 2/3] dax: introduce dax clear poison to page aligned dax pwrite operation
  2021-09-14 23:31 [PATCH 0/3] dax: clear poison on the fly along pwrite Jane Chu
  2021-09-14 23:31 ` [PATCH 1/3] dax: introduce dax_operation dax_clear_poison Jane Chu
  2021-09-14 23:31 ` [PATCH 2/3] dax: introduce dax_clear_poison to dax pwrite operation Jane Chu
@ 2021-09-14 23:31 ` Jane Chu
  2021-09-14 23:31 ` [PATCH 3/3] libnvdimm/pmem: Provide pmem_dax_clear_poison for dax operation Jane Chu
  2021-09-15  4:44 ` [PATCH 0/3] dax: clear poison on the fly along pwrite Dan Williams
  4 siblings, 0 replies; 24+ messages in thread
From: Jane Chu @ 2021-09-14 23:31 UTC (permalink / raw)
  To: dan.j.williams, vishal.l.verma, dave.jiang, ira.weiny, viro,
	willy, jack, nvdimm, linux-kernel, linux-fsdevel

Currenty, when pwrite(2) s issued to a dax range that contains poison,
the pwrite(2) fails with EIO. Well, if the hardware backend of the
dax device is capable of clearing poison, try that and resume the write.

Signed-off-by: Jane Chu <jane.chu@oracle.com>
---
 fs/dax.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/fs/dax.c b/fs/dax.c
index 99b4e78d888f..592a156abbf2 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1156,8 +1156,17 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		if (ret)
 			break;
 
+		/*
+		 * If WRITE operation encounters media error in a page aligned
+		 * range, try to clear the error, then resume, for just once.
+		 */
 		map_len = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size),
 				&kaddr, NULL);
+		if ((map_len == -EIO) && (iov_iter_rw(iter) == WRITE)) {
+			if (dax_clear_poison(dax_dev, pgoff, PHYS_PFN(size)) == 0)
+				map_len = dax_direct_access(dax_dev, pgoff,
+						PHYS_PFN(size), &kaddr, NULL);
+		}
 		if (map_len < 0) {
 			ret = map_len;
 			break;
-- 
2.18.4


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 3/3] libnvdimm/pmem: Provide pmem_dax_clear_poison for dax operation
  2021-09-14 23:31 [PATCH 0/3] dax: clear poison on the fly along pwrite Jane Chu
                   ` (2 preceding siblings ...)
  2021-09-14 23:31 ` [PATCH 2/3] dax: introduce dax clear poison to page aligned " Jane Chu
@ 2021-09-14 23:31 ` Jane Chu
  2021-11-04 17:55   ` Christoph Hellwig
  2021-09-15  4:44 ` [PATCH 0/3] dax: clear poison on the fly along pwrite Dan Williams
  4 siblings, 1 reply; 24+ messages in thread
From: Jane Chu @ 2021-09-14 23:31 UTC (permalink / raw)
  To: dan.j.williams, vishal.l.verma, dave.jiang, ira.weiny, viro,
	willy, jack, nvdimm, linux-kernel, linux-fsdevel

Provide pmem_dax_clear_poison() to struct dax_operations.clear_poison.

Signed-off-by: Jane Chu <jane.chu@oracle.com>
---
 drivers/nvdimm/pmem.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 1e0615b8565e..307a53aa3432 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -294,6 +294,22 @@ static int pmem_dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
 				   PAGE_SIZE));
 }
 
+static int pmem_dax_clear_poison(struct dax_device *dax_dev, pgoff_t pgoff,
+					size_t nr_pages)
+{
+	unsigned int len = PFN_PHYS(nr_pages);
+	sector_t sector = PFN_PHYS(pgoff) >> SECTOR_SHIFT;
+	struct pmem_device *pmem = dax_get_private(dax_dev);
+	phys_addr_t pmem_off = sector * 512 + pmem->data_offset;
+	blk_status_t ret;
+
+	if (!is_bad_pmem(&pmem->bb, sector, len))
+		return 0;
+
+	ret = pmem_clear_poison(pmem, pmem_off, len);
+	return (ret == BLK_STS_OK) ? 0 : -EIO;
+}
+
 static long pmem_dax_direct_access(struct dax_device *dax_dev,
 		pgoff_t pgoff, long nr_pages, void **kaddr, pfn_t *pfn)
 {
@@ -326,6 +342,7 @@ static const struct dax_operations pmem_dax_ops = {
 	.copy_from_iter = pmem_copy_from_iter,
 	.copy_to_iter = pmem_copy_to_iter,
 	.zero_page_range = pmem_dax_zero_page_range,
+	.clear_poison = pmem_dax_clear_poison,
 };
 
 static const struct attribute_group *pmem_attribute_groups[] = {
-- 
2.18.4


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/3] dax: clear poison on the fly along pwrite
  2021-09-14 23:31 [PATCH 0/3] dax: clear poison on the fly along pwrite Jane Chu
                   ` (3 preceding siblings ...)
  2021-09-14 23:31 ` [PATCH 3/3] libnvdimm/pmem: Provide pmem_dax_clear_poison for dax operation Jane Chu
@ 2021-09-15  4:44 ` Dan Williams
  2021-09-15  7:22   ` Jane Chu
  4 siblings, 1 reply; 24+ messages in thread
From: Dan Williams @ 2021-09-15  4:44 UTC (permalink / raw)
  To: Jane Chu
  Cc: Vishal L Verma, Dave Jiang, Weiny, Ira, Al Viro, Matthew Wilcox,
	Jan Kara, Linux NVDIMM, Linux Kernel Mailing List, linux-fsdevel

On Tue, Sep 14, 2021 at 4:32 PM Jane Chu <jane.chu@oracle.com> wrote:
>
> If pwrite(2) encounters poison in a pmem range, it fails with EIO.
> This is unecessary if hardware is capable of clearing the poison.
>
> Though not all dax backend hardware has the capability of clearing
> poison on the fly, but dax backed by Intel DCPMEM has such capability,
> and it's desirable to, first, speed up repairing by means of it;
> second, maintain backend continuity instead of fragmenting it in
> search for clean blocks.
>
> Jane Chu (3):
>   dax: introduce dax_operation dax_clear_poison

The problem with new dax operations is that they need to be plumbed
not only through fsdax and pmem, but also through device-mapper.

In this case I think we're already covered by dax_zero_page_range().
That will ultimately trigger pmem_clear_poison() and it is routed
through device-mapper properly.

Can you clarify why the existing dax_zero_page_range() is not sufficient?

>   dax: introduce dax_clear_poison to dax pwrite operation
>   libnvdimm/pmem: Provide pmem_dax_clear_poison for dax operation
>
>  drivers/dax/super.c   | 13 +++++++++++++
>  drivers/nvdimm/pmem.c | 17 +++++++++++++++++
>  fs/dax.c              |  9 +++++++++
>  include/linux/dax.h   |  6 ++++++
>  4 files changed, 45 insertions(+)
>
> --
> 2.18.4
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/3] dax: clear poison on the fly along pwrite
  2021-09-15  4:44 ` [PATCH 0/3] dax: clear poison on the fly along pwrite Dan Williams
@ 2021-09-15  7:22   ` Jane Chu
  2021-09-15 16:15     ` Darrick J. Wong
  0 siblings, 1 reply; 24+ messages in thread
From: Jane Chu @ 2021-09-15  7:22 UTC (permalink / raw)
  To: Dan Williams
  Cc: Vishal L Verma, Dave Jiang, Weiny, Ira, Al Viro, Matthew Wilcox,
	Jan Kara, Linux NVDIMM, Linux Kernel Mailing List, linux-fsdevel

Hi, Dan,

On 9/14/2021 9:44 PM, Dan Williams wrote:
> On Tue, Sep 14, 2021 at 4:32 PM Jane Chu <jane.chu@oracle.com> wrote:
>>
>> If pwrite(2) encounters poison in a pmem range, it fails with EIO.
>> This is unecessary if hardware is capable of clearing the poison.
>>
>> Though not all dax backend hardware has the capability of clearing
>> poison on the fly, but dax backed by Intel DCPMEM has such capability,
>> and it's desirable to, first, speed up repairing by means of it;
>> second, maintain backend continuity instead of fragmenting it in
>> search for clean blocks.
>>
>> Jane Chu (3):
>>    dax: introduce dax_operation dax_clear_poison
> 
> The problem with new dax operations is that they need to be plumbed
> not only through fsdax and pmem, but also through device-mapper.
> 
> In this case I think we're already covered by dax_zero_page_range().
> That will ultimately trigger pmem_clear_poison() and it is routed
> through device-mapper properly.
> 
> Can you clarify why the existing dax_zero_page_range() is not sufficient?

fallocate ZERO_RANGE is in itself a functionality that applied to dax
should lead to zero out the media range.  So one may argue it is part
of a block operations, and not something explicitly aimed at clearing
poison. I'm also thinking about the MOVEDIR64B instruction and how it
might be used to clear poison on the fly with a single 'store'.
Of course, that means we need to figure out how to narrow down the
error blast radius first.

With respect to plumbing through device-mapper, I thought about that,
and wasn't sure. I mean the clear-poison work will eventually fall on
the pmem driver, and thru the DM layers, how does that play out thru
DM?  BTW, our customer doesn't care about creating dax volume thru DM, so.

thanks!
-jane


> 
>>    dax: introduce dax_clear_poison to dax pwrite operation
>>    libnvdimm/pmem: Provide pmem_dax_clear_poison for dax operation
>>
>>   drivers/dax/super.c   | 13 +++++++++++++
>>   drivers/nvdimm/pmem.c | 17 +++++++++++++++++
>>   fs/dax.c              |  9 +++++++++
>>   include/linux/dax.h   |  6 ++++++
>>   4 files changed, 45 insertions(+)
>>
>> --
>> 2.18.4
>>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/3] dax: clear poison on the fly along pwrite
  2021-09-15  7:22   ` Jane Chu
@ 2021-09-15 16:15     ` Darrick J. Wong
  2021-09-15 20:27       ` Dan Williams
  2021-09-23 20:55       ` Jane Chu
  0 siblings, 2 replies; 24+ messages in thread
From: Darrick J. Wong @ 2021-09-15 16:15 UTC (permalink / raw)
  To: Jane Chu
  Cc: Dan Williams, Vishal L Verma, Dave Jiang, Weiny, Ira, Al Viro,
	Matthew Wilcox, Jan Kara, Linux NVDIMM,
	Linux Kernel Mailing List, linux-fsdevel

On Wed, Sep 15, 2021 at 12:22:05AM -0700, Jane Chu wrote:
> Hi, Dan,
> 
> On 9/14/2021 9:44 PM, Dan Williams wrote:
> > On Tue, Sep 14, 2021 at 4:32 PM Jane Chu <jane.chu@oracle.com> wrote:
> > > 
> > > If pwrite(2) encounters poison in a pmem range, it fails with EIO.
> > > This is unecessary if hardware is capable of clearing the poison.
> > > 
> > > Though not all dax backend hardware has the capability of clearing
> > > poison on the fly, but dax backed by Intel DCPMEM has such capability,
> > > and it's desirable to, first, speed up repairing by means of it;
> > > second, maintain backend continuity instead of fragmenting it in
> > > search for clean blocks.
> > > 
> > > Jane Chu (3):
> > >    dax: introduce dax_operation dax_clear_poison
> > 
> > The problem with new dax operations is that they need to be plumbed
> > not only through fsdax and pmem, but also through device-mapper.
> > 
> > In this case I think we're already covered by dax_zero_page_range().
> > That will ultimately trigger pmem_clear_poison() and it is routed
> > through device-mapper properly.
> > 
> > Can you clarify why the existing dax_zero_page_range() is not sufficient?
> 
> fallocate ZERO_RANGE is in itself a functionality that applied to dax
> should lead to zero out the media range.  So one may argue it is part
> of a block operations, and not something explicitly aimed at clearing
> poison.

Yeah, Christoph suggested that we make the clearing operation explicit
in a related thread a few weeks ago:
https://lore.kernel.org/linux-fsdevel/YRtnlPERHfMZ23Tr@infradead.org/

I like Jane's patchset far better than the one that I sent, because it
doesn't require a block device wrapper for the pmem, and it enables us
to tell application writers that they can handle media errors by
pwrite()ing the bad region, just like they do for nvme and spinners.

> I'm also thinking about the MOVEDIR64B instruction and how it
> might be used to clear poison on the fly with a single 'store'.
> Of course, that means we need to figure out how to narrow down the
> error blast radius first.

That was one of the advantages of Shiyang Ruan's NAKed patchset to
enable byte-granularity media errors to pass upwards through the stack
back to the filesystem, which could then tell applications exactly what
they lost.

I want to get back to that, though if Dan won't withdraw the NAK then I
don't know how to move forward...

> With respect to plumbing through device-mapper, I thought about that,
> and wasn't sure. I mean the clear-poison work will eventually fall on
> the pmem driver, and thru the DM layers, how does that play out thru
> DM?

Each of the dm drivers has to add their own ->clear_poison operation
that remaps the incoming (sector, len) parameters as appropriate for
that device and then calls the lower device's ->clear_poison with the
translated parameters.

This (AFAICT) has already been done for dax_zero_page_range, so I sense
that Dan is trying to save you a bunch of code plumbing work by nudging
you towards doing s/dax_clear_poison/dax_zero_page_range/ to this series
and then you only need patches 2-3.

> BTW, our customer doesn't care about creating dax volume thru DM, so.

They might not care, but anything going upstream should work in the
general case.

--D

> thanks!
> -jane
> 
> 
> > 
> > >    dax: introduce dax_clear_poison to dax pwrite operation
> > >    libnvdimm/pmem: Provide pmem_dax_clear_poison for dax operation
> > > 
> > >   drivers/dax/super.c   | 13 +++++++++++++
> > >   drivers/nvdimm/pmem.c | 17 +++++++++++++++++
> > >   fs/dax.c              |  9 +++++++++
> > >   include/linux/dax.h   |  6 ++++++
> > >   4 files changed, 45 insertions(+)
> > > 
> > > --
> > > 2.18.4
> > > 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/3] dax: clear poison on the fly along pwrite
  2021-09-15 16:15     ` Darrick J. Wong
@ 2021-09-15 20:27       ` Dan Williams
  2021-09-16  0:05         ` Darrick J. Wong
                           ` (2 more replies)
  2021-09-23 20:55       ` Jane Chu
  1 sibling, 3 replies; 24+ messages in thread
From: Dan Williams @ 2021-09-15 20:27 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Jane Chu, Vishal L Verma, Dave Jiang, Weiny, Ira, Al Viro,
	Matthew Wilcox, Jan Kara, Linux NVDIMM,
	Linux Kernel Mailing List, linux-fsdevel

On Wed, Sep 15, 2021 at 9:15 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Wed, Sep 15, 2021 at 12:22:05AM -0700, Jane Chu wrote:
> > Hi, Dan,
> >
> > On 9/14/2021 9:44 PM, Dan Williams wrote:
> > > On Tue, Sep 14, 2021 at 4:32 PM Jane Chu <jane.chu@oracle.com> wrote:
> > > >
> > > > If pwrite(2) encounters poison in a pmem range, it fails with EIO.
> > > > This is unecessary if hardware is capable of clearing the poison.
> > > >
> > > > Though not all dax backend hardware has the capability of clearing
> > > > poison on the fly, but dax backed by Intel DCPMEM has such capability,
> > > > and it's desirable to, first, speed up repairing by means of it;
> > > > second, maintain backend continuity instead of fragmenting it in
> > > > search for clean blocks.
> > > >
> > > > Jane Chu (3):
> > > >    dax: introduce dax_operation dax_clear_poison
> > >
> > > The problem with new dax operations is that they need to be plumbed
> > > not only through fsdax and pmem, but also through device-mapper.
> > >
> > > In this case I think we're already covered by dax_zero_page_range().
> > > That will ultimately trigger pmem_clear_poison() and it is routed
> > > through device-mapper properly.
> > >
> > > Can you clarify why the existing dax_zero_page_range() is not sufficient?
> >
> > fallocate ZERO_RANGE is in itself a functionality that applied to dax
> > should lead to zero out the media range.  So one may argue it is part
> > of a block operations, and not something explicitly aimed at clearing
> > poison.
>
> Yeah, Christoph suggested that we make the clearing operation explicit
> in a related thread a few weeks ago:
> https://lore.kernel.org/linux-fsdevel/YRtnlPERHfMZ23Tr@infradead.org/

That seemed to be tied to a proposal to plumb it all the way out to an
explicit fallocate() mode, not make it a silent side effect of
pwrite(). That said pwrite() does clear errors in hard drives in
not-DAX mode, but I like the change in direction to make it explicit
going forward.

> I like Jane's patchset far better than the one that I sent, because it
> doesn't require a block device wrapper for the pmem, and it enables us
> to tell application writers that they can handle media errors by
> pwrite()ing the bad region, just like they do for nvme and spinners.

pwrite(), hmm, so you're not onboard with the explicit clearing API
proposal, or...?

> > I'm also thinking about the MOVEDIR64B instruction and how it
> > might be used to clear poison on the fly with a single 'store'.
> > Of course, that means we need to figure out how to narrow down the
> > error blast radius first.

It turns out the MOVDIR64B error clearing idea runs into problem with
the device poison tracking. Without the explicit notification that
software wanted the error cleared the device may ghost report errors
that are not there anymore. I think we should continue explicit error
clearing and notification of the device that the error has been
cleared (by asking the device to clear it).

> That was one of the advantages of Shiyang Ruan's NAKed patchset to
> enable byte-granularity media errors

...the method of triggering reverse mapping had review feedback, I
apologize if that came across of a NAK of the whole proposal. As I
clarified to Eric this morning, I think the solution is iterating
towards upstream inclusion.

> to pass upwards through the stack
> back to the filesystem, which could then tell applications exactly what
> they lost.
>
> I want to get back to that, though if Dan won't withdraw the NAK then I
> don't know how to move forward...

No NAK in place. Let's go!

>
> > With respect to plumbing through device-mapper, I thought about that,
> > and wasn't sure. I mean the clear-poison work will eventually fall on
> > the pmem driver, and thru the DM layers, how does that play out thru
> > DM?
>
> Each of the dm drivers has to add their own ->clear_poison operation
> that remaps the incoming (sector, len) parameters as appropriate for
> that device and then calls the lower device's ->clear_poison with the
> translated parameters.
>
> This (AFAICT) has already been done for dax_zero_page_range, so I sense
> that Dan is trying to save you a bunch of code plumbing work by nudging
> you towards doing s/dax_clear_poison/dax_zero_page_range/ to this series
> and then you only need patches 2-3.

Yes, but it sounds like Christoph was saying don't overload
dax_zero_page_range(). I'd be ok splitting the difference and having a
new fallocate clear poison mode map to dax_zero_page_range()
internally.

>
> > BTW, our customer doesn't care about creating dax volume thru DM, so.
>
> They might not care, but anything going upstream should work in the
> general case.

Agree.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/3] dax: clear poison on the fly along pwrite
  2021-09-15 20:27       ` Dan Williams
@ 2021-09-16  0:05         ` Darrick J. Wong
  2021-09-16  7:11         ` Christoph Hellwig
  2021-09-23 20:48         ` Jane Chu
  2 siblings, 0 replies; 24+ messages in thread
From: Darrick J. Wong @ 2021-09-16  0:05 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jane Chu, Vishal L Verma, Dave Jiang, Weiny, Ira, Al Viro,
	Matthew Wilcox, Jan Kara, Linux NVDIMM,
	Linux Kernel Mailing List, linux-fsdevel

On Wed, Sep 15, 2021 at 01:27:47PM -0700, Dan Williams wrote:
> On Wed, Sep 15, 2021 at 9:15 AM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > On Wed, Sep 15, 2021 at 12:22:05AM -0700, Jane Chu wrote:
> > > Hi, Dan,
> > >
> > > On 9/14/2021 9:44 PM, Dan Williams wrote:
> > > > On Tue, Sep 14, 2021 at 4:32 PM Jane Chu <jane.chu@oracle.com> wrote:
> > > > >
> > > > > If pwrite(2) encounters poison in a pmem range, it fails with EIO.
> > > > > This is unecessary if hardware is capable of clearing the poison.
> > > > >
> > > > > Though not all dax backend hardware has the capability of clearing
> > > > > poison on the fly, but dax backed by Intel DCPMEM has such capability,
> > > > > and it's desirable to, first, speed up repairing by means of it;
> > > > > second, maintain backend continuity instead of fragmenting it in
> > > > > search for clean blocks.
> > > > >
> > > > > Jane Chu (3):
> > > > >    dax: introduce dax_operation dax_clear_poison
> > > >
> > > > The problem with new dax operations is that they need to be plumbed
> > > > not only through fsdax and pmem, but also through device-mapper.
> > > >
> > > > In this case I think we're already covered by dax_zero_page_range().
> > > > That will ultimately trigger pmem_clear_poison() and it is routed
> > > > through device-mapper properly.
> > > >
> > > > Can you clarify why the existing dax_zero_page_range() is not sufficient?
> > >
> > > fallocate ZERO_RANGE is in itself a functionality that applied to dax
> > > should lead to zero out the media range.  So one may argue it is part
> > > of a block operations, and not something explicitly aimed at clearing
> > > poison.
> >
> > Yeah, Christoph suggested that we make the clearing operation explicit
> > in a related thread a few weeks ago:
> > https://lore.kernel.org/linux-fsdevel/YRtnlPERHfMZ23Tr@infradead.org/
> 
> That seemed to be tied to a proposal to plumb it all the way out to an
> explicit fallocate() mode, not make it a silent side effect of
> pwrite(). That said pwrite() does clear errors in hard drives in
> not-DAX mode, but I like the change in direction to make it explicit
> going forward.
> 
> > I like Jane's patchset far better than the one that I sent, because it
> > doesn't require a block device wrapper for the pmem, and it enables us
> > to tell application writers that they can handle media errors by
> > pwrite()ing the bad region, just like they do for nvme and spinners.
> 
> pwrite(), hmm, so you're not onboard with the explicit clearing API
> proposal, or...?

I don't really care either way.  I was going to send a reworked version
of that earlier patchset which would add an explicit fallocate mode and
make it work on regular block storage too, but then Jane sent this. :)

Hmm, maybe I should rework my patchset to call dax_zero_page_range
directly...?

> > > I'm also thinking about the MOVEDIR64B instruction and how it
> > > might be used to clear poison on the fly with a single 'store'.
> > > Of course, that means we need to figure out how to narrow down the
> > > error blast radius first.
> 
> It turns out the MOVDIR64B error clearing idea runs into problem with
> the device poison tracking. Without the explicit notification that
> software wanted the error cleared the device may ghost report errors
> that are not there anymore. I think we should continue explicit error
> clearing and notification of the device that the error has been
> cleared (by asking the device to clear it).

If the poison clearing is entirely OOB (i.e. you have to call ACPI
methods) and can't be made part of the memory controller, then I guess
you can't use movdir64b at all, right?

> > That was one of the advantages of Shiyang Ruan's NAKed patchset to
> > enable byte-granularity media errors
> 
> ...the method of triggering reverse mapping had review feedback, I
> apologize if that came across of a NAK of the whole proposal. As I
> clarified to Eric this morning, I think the solution is iterating
> towards upstream inclusion.
> 
> > to pass upwards through the stack
> > back to the filesystem, which could then tell applications exactly what
> > they lost.
> >
> > I want to get back to that, though if Dan won't withdraw the NAK then I
> > don't know how to move forward...
> 
> No NAK in place. Let's go!

Ok, thanks.  I'll start looking through Shiyang's patches tomorrow.

> 
> >
> > > With respect to plumbing through device-mapper, I thought about that,
> > > and wasn't sure. I mean the clear-poison work will eventually fall on
> > > the pmem driver, and thru the DM layers, how does that play out thru
> > > DM?
> >
> > Each of the dm drivers has to add their own ->clear_poison operation
> > that remaps the incoming (sector, len) parameters as appropriate for
> > that device and then calls the lower device's ->clear_poison with the
> > translated parameters.
> >
> > This (AFAICT) has already been done for dax_zero_page_range, so I sense
> > that Dan is trying to save you a bunch of code plumbing work by nudging
> > you towards doing s/dax_clear_poison/dax_zero_page_range/ to this series
> > and then you only need patches 2-3.
> 
> Yes, but it sounds like Christoph was saying don't overload
> dax_zero_page_range(). I'd be ok splitting the difference and having a
> new fallocate clear poison mode map to dax_zero_page_range()
> internally.

Ok.

--D

> >
> > > BTW, our customer doesn't care about creating dax volume thru DM, so.
> >
> > They might not care, but anything going upstream should work in the
> > general case.
> 
> Agree.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/3] dax: clear poison on the fly along pwrite
  2021-09-15 20:27       ` Dan Williams
  2021-09-16  0:05         ` Darrick J. Wong
@ 2021-09-16  7:11         ` Christoph Hellwig
  2021-09-16 18:40           ` Dan Williams
  2021-09-23 20:48         ` Jane Chu
  2 siblings, 1 reply; 24+ messages in thread
From: Christoph Hellwig @ 2021-09-16  7:11 UTC (permalink / raw)
  To: Dan Williams
  Cc: Darrick J. Wong, Jane Chu, Vishal L Verma, Dave Jiang, Weiny,
	Ira, Al Viro, Matthew Wilcox, Jan Kara, Linux NVDIMM,
	Linux Kernel Mailing List, linux-fsdevel

On Wed, Sep 15, 2021 at 01:27:47PM -0700, Dan Williams wrote:
> > Yeah, Christoph suggested that we make the clearing operation explicit
> > in a related thread a few weeks ago:
> > https://lore.kernel.org/linux-fsdevel/YRtnlPERHfMZ23Tr@infradead.org/
> 
> That seemed to be tied to a proposal to plumb it all the way out to an
> explicit fallocate() mode, not make it a silent side effect of
> pwrite().

Yes.

> >
> > Each of the dm drivers has to add their own ->clear_poison operation
> > that remaps the incoming (sector, len) parameters as appropriate for
> > that device and then calls the lower device's ->clear_poison with the
> > translated parameters.
> >
> > This (AFAICT) has already been done for dax_zero_page_range, so I sense
> > that Dan is trying to save you a bunch of code plumbing work by nudging
> > you towards doing s/dax_clear_poison/dax_zero_page_range/ to this series
> > and then you only need patches 2-3.
> 
> Yes, but it sounds like Christoph was saying don't overload
> dax_zero_page_range(). I'd be ok splitting the difference and having a
> new fallocate clear poison mode map to dax_zero_page_range()
> internally.

That was my gut feeling.  If everyone feels 100% comfortable with
zeroingas the mechanism to clear poisoning I'll cave in.  The most
important bit is that we do that through a dedicated DAX path instead
of abusing the block layer even more.

> 
> >
> > > BTW, our customer doesn't care about creating dax volume thru DM, so.
> >
> > They might not care, but anything going upstream should work in the
> > general case.
> 
> Agree.

I'm really worried about both patartitions on DAX and DM passing through
DAX because they deeply bind DAX to the block layer, which is just a bad
idea.  I think we also need to sort that whole story out before removing
the EXPERIMENTAL tags.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/3] dax: clear poison on the fly along pwrite
  2021-09-16  7:11         ` Christoph Hellwig
@ 2021-09-16 18:40           ` Dan Williams
  2021-09-17 12:53             ` Christoph Hellwig
  0 siblings, 1 reply; 24+ messages in thread
From: Dan Williams @ 2021-09-16 18:40 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Darrick J. Wong, Jane Chu, Vishal L Verma, Dave Jiang, Weiny,
	Ira, Al Viro, Matthew Wilcox, Jan Kara, Linux NVDIMM,
	Linux Kernel Mailing List, linux-fsdevel

On Thu, Sep 16, 2021 at 12:12 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Wed, Sep 15, 2021 at 01:27:47PM -0700, Dan Williams wrote:
> > > Yeah, Christoph suggested that we make the clearing operation explicit
> > > in a related thread a few weeks ago:
> > > https://lore.kernel.org/linux-fsdevel/YRtnlPERHfMZ23Tr@infradead.org/
> >
> > That seemed to be tied to a proposal to plumb it all the way out to an
> > explicit fallocate() mode, not make it a silent side effect of
> > pwrite().
>
> Yes.
>
> > >
> > > Each of the dm drivers has to add their own ->clear_poison operation
> > > that remaps the incoming (sector, len) parameters as appropriate for
> > > that device and then calls the lower device's ->clear_poison with the
> > > translated parameters.
> > >
> > > This (AFAICT) has already been done for dax_zero_page_range, so I sense
> > > that Dan is trying to save you a bunch of code plumbing work by nudging
> > > you towards doing s/dax_clear_poison/dax_zero_page_range/ to this series
> > > and then you only need patches 2-3.
> >
> > Yes, but it sounds like Christoph was saying don't overload
> > dax_zero_page_range(). I'd be ok splitting the difference and having a
> > new fallocate clear poison mode map to dax_zero_page_range()
> > internally.
>
> That was my gut feeling.  If everyone feels 100% comfortable with
> zeroingas the mechanism to clear poisoning I'll cave in.  The most
> important bit is that we do that through a dedicated DAX path instead
> of abusing the block layer even more.

...or just rename dax_zero_page_range() to dax_reset_page_range()?
Where reset == "zero + clear-poison"?

> > > > BTW, our customer doesn't care about creating dax volume thru DM, so.
> > >
> > > They might not care, but anything going upstream should work in the
> > > general case.
> >
> > Agree.
>
> I'm really worried about both patartitions on DAX and DM passing through
> DAX because they deeply bind DAX to the block layer, which is just a bad
> idea.  I think we also need to sort that whole story out before removing
> the EXPERIMENTAL tags.

I do think it was a mistake to allow for DAX on partitions of a pmemX
block-device.

DAX-reflink support may be the opportunity to start deprecating that
support. Only enable DAX-reflink for direct mounting on /dev/pmemX
without partitions (later add dax-device direct mounting), change
DAX-experimental warning to a deprecation notification for DAX on
DM/partitions, continue to fail / never fix DAX-reflink for
DM/partitions, direct people to use namespace provisioning for
sub-divisions of PMEM capacity, and finally look into adding
concatenation and additional software striping support to the new CXL
region creation facility.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/3] dax: clear poison on the fly along pwrite
  2021-09-16 18:40           ` Dan Williams
@ 2021-09-17 12:53             ` Christoph Hellwig
  2021-09-17 15:27               ` Darrick J. Wong
  2021-09-17 19:37               ` Dan Williams
  0 siblings, 2 replies; 24+ messages in thread
From: Christoph Hellwig @ 2021-09-17 12:53 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, Darrick J. Wong, Jane Chu, Vishal L Verma,
	Dave Jiang, Weiny, Ira, Al Viro, Matthew Wilcox, Jan Kara,
	Linux NVDIMM, Linux Kernel Mailing List, linux-fsdevel

On Thu, Sep 16, 2021 at 11:40:28AM -0700, Dan Williams wrote:
> > That was my gut feeling.  If everyone feels 100% comfortable with
> > zeroingas the mechanism to clear poisoning I'll cave in.  The most
> > important bit is that we do that through a dedicated DAX path instead
> > of abusing the block layer even more.
> 
> ...or just rename dax_zero_page_range() to dax_reset_page_range()?
> Where reset == "zero + clear-poison"?

I'd say that naming is more confusing than overloading zero.

> > I'm really worried about both patartitions on DAX and DM passing through
> > DAX because they deeply bind DAX to the block layer, which is just a bad
> > idea.  I think we also need to sort that whole story out before removing
> > the EXPERIMENTAL tags.
> 
> I do think it was a mistake to allow for DAX on partitions of a pmemX
> block-device.
> 
> DAX-reflink support may be the opportunity to start deprecating that
> support. Only enable DAX-reflink for direct mounting on /dev/pmemX
> without partitions (later add dax-device direct mounting),

I think we need to fully or almost fully sort this out.

Here is my bold suggestions:

 1) drop no drop the EXPERMINTAL on the current block layer overload
    at all
 2) add direct mounting of the nvdimm namespaces ASAP.  Because all
    the filesystem currently also need the /dev/pmem0 device add a way
    to open the block device by the dax_device instead of our current
    way of doing the reverse
 3) deprecate DAX support through block layer mounts with a say 2 year
    deprecation period
 4) add DAX remapping devices as needed

I'll volunteer to write the initial code for 2).  And I think we should
not allow DAX+reflink on the block device shim at all.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/3] dax: clear poison on the fly along pwrite
  2021-09-17 12:53             ` Christoph Hellwig
@ 2021-09-17 15:27               ` Darrick J. Wong
  2021-09-17 20:21                 ` Dan Williams
  2021-09-17 19:37               ` Dan Williams
  1 sibling, 1 reply; 24+ messages in thread
From: Darrick J. Wong @ 2021-09-17 15:27 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dan Williams, Jane Chu, Vishal L Verma, Dave Jiang, Weiny, Ira,
	Al Viro, Matthew Wilcox, Jan Kara, Linux NVDIMM,
	Linux Kernel Mailing List, linux-fsdevel

On Fri, Sep 17, 2021 at 01:53:33PM +0100, Christoph Hellwig wrote:
> On Thu, Sep 16, 2021 at 11:40:28AM -0700, Dan Williams wrote:
> > > That was my gut feeling.  If everyone feels 100% comfortable with
> > > zeroingas the mechanism to clear poisoning I'll cave in.  The most
> > > important bit is that we do that through a dedicated DAX path instead
> > > of abusing the block layer even more.
> > 
> > ...or just rename dax_zero_page_range() to dax_reset_page_range()?
> > Where reset == "zero + clear-poison"?
> 
> I'd say that naming is more confusing than overloading zero.

How about dax_zeroinit_range() ?

To go with its fallocate flag (yeah I've been too busy sorting out -rc1
regressions to repost this) FALLOC_FL_ZEROINIT_RANGE that will reset the
hardware (whatever that means) and set the contents to the known value
zero.

Userspace usage model:

void handle_media_error(int fd, loff_t pos, size_t len)
{
	/* yell about this for posterior's sake */

	ret = fallocate(fd, FALLOC_FL_ZEROINIT_RANGE, pos, len);

	/* yay our disk drive / pmem / stone table engraver is online */
}

> > > I'm really worried about both patartitions on DAX and DM passing through
> > > DAX because they deeply bind DAX to the block layer, which is just a bad
> > > idea.  I think we also need to sort that whole story out before removing
> > > the EXPERIMENTAL tags.
> > 
> > I do think it was a mistake to allow for DAX on partitions of a pmemX
> > block-device.
> > 
> > DAX-reflink support may be the opportunity to start deprecating that
> > support. Only enable DAX-reflink for direct mounting on /dev/pmemX
> > without partitions (later add dax-device direct mounting),
> 
> I think we need to fully or almost fully sort this out.
> 
> Here is my bold suggestions:
> 
>  1) drop no drop the EXPERMINTAL on the current block layer overload
>     at all

I don't understand this.

>  2) add direct mounting of the nvdimm namespaces ASAP.  Because all
>     the filesystem currently also need the /dev/pmem0 device add a way
>     to open the block device by the dax_device instead of our current
>     way of doing the reverse
>  3) deprecate DAX support through block layer mounts with a say 2 year
>     deprecation period
>  4) add DAX remapping devices as needed

What devices are needed?  linear for lvm, and maybe error so we can
actually test all this stuff?

> I'll volunteer to write the initial code for 2).  And I think we should
> not allow DAX+reflink on the block device shim at all.

/me has other questions about daxreflink, but I'll ask them on shiyang's
thread.

--D

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/3] dax: clear poison on the fly along pwrite
  2021-09-17 12:53             ` Christoph Hellwig
  2021-09-17 15:27               ` Darrick J. Wong
@ 2021-09-17 19:37               ` Dan Williams
  1 sibling, 0 replies; 24+ messages in thread
From: Dan Williams @ 2021-09-17 19:37 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Darrick J. Wong, Jane Chu, Vishal L Verma, Dave Jiang, Weiny,
	Ira, Al Viro, Matthew Wilcox, Jan Kara, Linux NVDIMM,
	Linux Kernel Mailing List, linux-fsdevel, Shiyang Ruan

On Fri, Sep 17, 2021 at 5:57 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Thu, Sep 16, 2021 at 11:40:28AM -0700, Dan Williams wrote:
> > > That was my gut feeling.  If everyone feels 100% comfortable with
> > > zeroingas the mechanism to clear poisoning I'll cave in.  The most
> > > important bit is that we do that through a dedicated DAX path instead
> > > of abusing the block layer even more.
> >
> > ...or just rename dax_zero_page_range() to dax_reset_page_range()?
> > Where reset == "zero + clear-poison"?
>
> I'd say that naming is more confusing than overloading zero.

Ok, I see Darrick has a better suggestion for the shed color.

>
> > > I'm really worried about both patartitions on DAX and DM passing through
> > > DAX because they deeply bind DAX to the block layer, which is just a bad
> > > idea.  I think we also need to sort that whole story out before removing
> > > the EXPERIMENTAL tags.
> >
> > I do think it was a mistake to allow for DAX on partitions of a pmemX
> > block-device.
> >
> > DAX-reflink support may be the opportunity to start deprecating that
> > support. Only enable DAX-reflink for direct mounting on /dev/pmemX
> > without partitions (later add dax-device direct mounting),
>
> I think we need to fully or almost fully sort this out.
>
> Here is my bold suggestions:
>
>  1) drop no drop the EXPERMINTAL on the current block layer overload
>     at all

s/drop no drop/do not drop/?

>  2) add direct mounting of the nvdimm namespaces ASAP.  Because all
>     the filesystem currently also need the /dev/pmem0 device add a way
>     to open the block device by the dax_device instead of our current
>     way of doing the reverse

Oh, interesting. I can get on board with that. There's currently no
/dev entry for namespaces. It's either /dev/pmemX, or /dev/daxX.Y as a
child of /sys/bus/nd/devices/namespaceX.Y. However, I see nothing
glaringly wrong with having /dev/daxX.Y always published regardless of
whether /dev/pmemX is also present.

>  3) deprecate DAX support through block layer mounts with a say 2 year
>     deprecation period
>  4) add DAX remapping devices as needed
>
> I'll volunteer to write the initial code for 2).  And I think we should
> not allow DAX+reflink on the block device shim at all.

Yeah, I think this can fly.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/3] dax: clear poison on the fly along pwrite
  2021-09-17 15:27               ` Darrick J. Wong
@ 2021-09-17 20:21                 ` Dan Williams
  2021-09-18  0:07                   ` Darrick J. Wong
  0 siblings, 1 reply; 24+ messages in thread
From: Dan Williams @ 2021-09-17 20:21 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, Jane Chu, Vishal L Verma, Dave Jiang, Weiny,
	Ira, Al Viro, Matthew Wilcox, Jan Kara, Linux NVDIMM,
	Linux Kernel Mailing List, linux-fsdevel

On Fri, Sep 17, 2021 at 8:27 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Fri, Sep 17, 2021 at 01:53:33PM +0100, Christoph Hellwig wrote:
> > On Thu, Sep 16, 2021 at 11:40:28AM -0700, Dan Williams wrote:
> > > > That was my gut feeling.  If everyone feels 100% comfortable with
> > > > zeroingas the mechanism to clear poisoning I'll cave in.  The most
> > > > important bit is that we do that through a dedicated DAX path instead
> > > > of abusing the block layer even more.
> > >
> > > ...or just rename dax_zero_page_range() to dax_reset_page_range()?
> > > Where reset == "zero + clear-poison"?
> >
> > I'd say that naming is more confusing than overloading zero.
>
> How about dax_zeroinit_range() ?

Works for me.

>
> To go with its fallocate flag (yeah I've been too busy sorting out -rc1
> regressions to repost this) FALLOC_FL_ZEROINIT_RANGE that will reset the
> hardware (whatever that means) and set the contents to the known value
> zero.
>
> Userspace usage model:
>
> void handle_media_error(int fd, loff_t pos, size_t len)
> {
>         /* yell about this for posterior's sake */
>
>         ret = fallocate(fd, FALLOC_FL_ZEROINIT_RANGE, pos, len);
>
>         /* yay our disk drive / pmem / stone table engraver is online */

The fallocate mode can still be error-aware though, right? When the FS
has knowledge of the error locations the fallocate mode could be
fallocate(fd, FALLOC_FL_OVERWRITE_ERRORS, pos, len) with the semantics
of attempting to zero out any known poison extents in the given file
range? At the risk of going overboard on new fallocate modes there
could also (or instead of) be FALLOC_FL_PUNCH_ERRORS to skip trying to
clear them and just ask the FS to throw error extents away.

> }
>
> > > > I'm really worried about both patartitions on DAX and DM passing through
> > > > DAX because they deeply bind DAX to the block layer, which is just a bad
> > > > idea.  I think we also need to sort that whole story out before removing
> > > > the EXPERIMENTAL tags.
> > >
> > > I do think it was a mistake to allow for DAX on partitions of a pmemX
> > > block-device.
> > >
> > > DAX-reflink support may be the opportunity to start deprecating that
> > > support. Only enable DAX-reflink for direct mounting on /dev/pmemX
> > > without partitions (later add dax-device direct mounting),
> >
> > I think we need to fully or almost fully sort this out.
> >
> > Here is my bold suggestions:
> >
> >  1) drop no drop the EXPERMINTAL on the current block layer overload
> >     at all
>
> I don't understand this.
>
> >  2) add direct mounting of the nvdimm namespaces ASAP.  Because all
> >     the filesystem currently also need the /dev/pmem0 device add a way
> >     to open the block device by the dax_device instead of our current
> >     way of doing the reverse
> >  3) deprecate DAX support through block layer mounts with a say 2 year
> >     deprecation period
> >  4) add DAX remapping devices as needed
>
> What devices are needed?  linear for lvm, and maybe error so we can
> actually test all this stuff?

The proposal would be zero lvm support. The nvdimm namespace
definition would need to grow support for concatenation + striping.
Soft error injection could be achieved by writing to the badblocks
interface.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/3] dax: clear poison on the fly along pwrite
  2021-09-17 20:21                 ` Dan Williams
@ 2021-09-18  0:07                   ` Darrick J. Wong
  0 siblings, 0 replies; 24+ messages in thread
From: Darrick J. Wong @ 2021-09-18  0:07 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, Jane Chu, Vishal L Verma, Dave Jiang, Weiny,
	Ira, Al Viro, Matthew Wilcox, Jan Kara, Linux NVDIMM,
	Linux Kernel Mailing List, linux-fsdevel

On Fri, Sep 17, 2021 at 01:21:25PM -0700, Dan Williams wrote:
> On Fri, Sep 17, 2021 at 8:27 AM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > On Fri, Sep 17, 2021 at 01:53:33PM +0100, Christoph Hellwig wrote:
> > > On Thu, Sep 16, 2021 at 11:40:28AM -0700, Dan Williams wrote:
> > > > > That was my gut feeling.  If everyone feels 100% comfortable with
> > > > > zeroingas the mechanism to clear poisoning I'll cave in.  The most
> > > > > important bit is that we do that through a dedicated DAX path instead
> > > > > of abusing the block layer even more.
> > > >
> > > > ...or just rename dax_zero_page_range() to dax_reset_page_range()?
> > > > Where reset == "zero + clear-poison"?
> > >
> > > I'd say that naming is more confusing than overloading zero.
> >
> > How about dax_zeroinit_range() ?
> 
> Works for me.
> 
> >
> > To go with its fallocate flag (yeah I've been too busy sorting out -rc1
> > regressions to repost this) FALLOC_FL_ZEROINIT_RANGE that will reset the
> > hardware (whatever that means) and set the contents to the known value
> > zero.
> >
> > Userspace usage model:
> >
> > void handle_media_error(int fd, loff_t pos, size_t len)
> > {
> >         /* yell about this for posterior's sake */
> >
> >         ret = fallocate(fd, FALLOC_FL_ZEROINIT_RANGE, pos, len);
> >
> >         /* yay our disk drive / pmem / stone table engraver is online */
> 
> The fallocate mode can still be error-aware though, right? When the FS
> has knowledge of the error locations the fallocate mode could be
> fallocate(fd, FALLOC_FL_OVERWRITE_ERRORS, pos, len) with the semantics
> of attempting to zero out any known poison extents in the given file
> range? At the risk of going overboard on new fallocate modes there
> could also (or instead of) be FALLOC_FL_PUNCH_ERRORS to skip trying to
> clear them and just ask the FS to throw error extents away.

It /could/ be, but for now I've stuck to what you see is what you get --
if you tell it to 'zero initialize' 1MB of pmem, it'll write zeroes and
clear the poison on all 1MB, regardless of the old contents.

IOWs, you can use it from a poison handler on just the range that it
told you about, or you could use it to bulk-clear a lot of space all at
once.

A dorky thing here is that the dax_zero_page_range function returns EIO
if you tell it to do more than one page...


> 
> > }
> >
> > > > > I'm really worried about both patartitions on DAX and DM passing through
> > > > > DAX because they deeply bind DAX to the block layer, which is just a bad
> > > > > idea.  I think we also need to sort that whole story out before removing
> > > > > the EXPERIMENTAL tags.
> > > >
> > > > I do think it was a mistake to allow for DAX on partitions of a pmemX
> > > > block-device.
> > > >
> > > > DAX-reflink support may be the opportunity to start deprecating that
> > > > support. Only enable DAX-reflink for direct mounting on /dev/pmemX
> > > > without partitions (later add dax-device direct mounting),
> > >
> > > I think we need to fully or almost fully sort this out.
> > >
> > > Here is my bold suggestions:
> > >
> > >  1) drop no drop the EXPERMINTAL on the current block layer overload
> > >     at all
> >
> > I don't understand this.
> >
> > >  2) add direct mounting of the nvdimm namespaces ASAP.  Because all
> > >     the filesystem currently also need the /dev/pmem0 device add a way
> > >     to open the block device by the dax_device instead of our current
> > >     way of doing the reverse
> > >  3) deprecate DAX support through block layer mounts with a say 2 year
> > >     deprecation period
> > >  4) add DAX remapping devices as needed
> >
> > What devices are needed?  linear for lvm, and maybe error so we can
> > actually test all this stuff?
> 
> The proposal would be zero lvm support. The nvdimm namespace
> definition would need to grow support for concatenation + striping.

Ah, ok.

> Soft error injection could be achieved by writing to the badblocks
> interface.

<nod>

I'll send out an RFC of what I have currently.

--D

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/3] dax: clear poison on the fly along pwrite
  2021-09-15 20:27       ` Dan Williams
  2021-09-16  0:05         ` Darrick J. Wong
  2021-09-16  7:11         ` Christoph Hellwig
@ 2021-09-23 20:48         ` Jane Chu
  2 siblings, 0 replies; 24+ messages in thread
From: Jane Chu @ 2021-09-23 20:48 UTC (permalink / raw)
  To: Dan Williams, Darrick J. Wong
  Cc: Vishal L Verma, Dave Jiang, Weiny, Ira, Al Viro, Matthew Wilcox,
	Jan Kara, Linux NVDIMM, Linux Kernel Mailing List, linux-fsdevel

On 9/15/2021 1:27 PM, Dan Williams wrote:
>>> I'm also thinking about the MOVEDIR64B instruction and how it
>>> might be used to clear poison on the fly with a single 'store'.
>>> Of course, that means we need to figure out how to narrow down the
>>> error blast radius first.
> It turns out the MOVDIR64B error clearing idea runs into problem with
> the device poison tracking. Without the explicit notification that
> software wanted the error cleared the device may ghost report errors
> that are not there anymore. I think we should continue explicit error
> clearing and notification of the device that the error has been
> cleared (by asking the device to clear it).
> 

Sorry for the late response, I was out for several days.

Your concern is understood.  I wasn't thinking of an out-of-band
MOVDIR64B to clear poison, I was thinking about adding a case to
pmem_clear_poison(), such that if CPUID feature shows that
MOVDIR64B is supported, instead of calling the BIOS interface
to clear poison, MOVDIR64B could be called. The advantage is
a. a lot faster; b. smaller radius.  And the driver has a chance
to update its ->bb record.

thanks,
-jane


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/3] dax: clear poison on the fly along pwrite
  2021-09-15 16:15     ` Darrick J. Wong
  2021-09-15 20:27       ` Dan Williams
@ 2021-09-23 20:55       ` Jane Chu
  2021-09-23 21:42         ` Dan Williams
  1 sibling, 1 reply; 24+ messages in thread
From: Jane Chu @ 2021-09-23 20:55 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Dan Williams, Vishal L Verma, Dave Jiang, Weiny, Ira, Al Viro,
	Matthew Wilcox, Jan Kara, Linux NVDIMM,
	Linux Kernel Mailing List, linux-fsdevel

On 9/15/2021 9:15 AM, Darrick J. Wong wrote:
> On Wed, Sep 15, 2021 at 12:22:05AM -0700, Jane Chu wrote:
>> Hi, Dan,
>>
>> On 9/14/2021 9:44 PM, Dan Williams wrote:
>>> On Tue, Sep 14, 2021 at 4:32 PM Jane Chu <jane.chu@oracle.com> wrote:
>>>>
>>>> If pwrite(2) encounters poison in a pmem range, it fails with EIO.
>>>> This is unecessary if hardware is capable of clearing the poison.
>>>>
>>>> Though not all dax backend hardware has the capability of clearing
>>>> poison on the fly, but dax backed by Intel DCPMEM has such capability,
>>>> and it's desirable to, first, speed up repairing by means of it;
>>>> second, maintain backend continuity instead of fragmenting it in
>>>> search for clean blocks.
>>>>
>>>> Jane Chu (3):
>>>>     dax: introduce dax_operation dax_clear_poison
>>>
>>> The problem with new dax operations is that they need to be plumbed
>>> not only through fsdax and pmem, but also through device-mapper.
>>>
>>> In this case I think we're already covered by dax_zero_page_range().
>>> That will ultimately trigger pmem_clear_poison() and it is routed
>>> through device-mapper properly.
>>>
>>> Can you clarify why the existing dax_zero_page_range() is not sufficient?
>>
>> fallocate ZERO_RANGE is in itself a functionality that applied to dax
>> should lead to zero out the media range.  So one may argue it is part
>> of a block operations, and not something explicitly aimed at clearing
>> poison.
> 
> Yeah, Christoph suggested that we make the clearing operation explicit
> in a related thread a few weeks ago:
> https://lore.kernel.org/linux-fsdevel/YRtnlPERHfMZ23Tr@infradead.org/
> 
> I like Jane's patchset far better than the one that I sent, because it
> doesn't require a block device wrapper for the pmem, and it enables us
> to tell application writers that they can handle media errors by
> pwrite()ing the bad region, just like they do for nvme and spinners.
> 
>> I'm also thinking about the MOVEDIR64B instruction and how it
>> might be used to clear poison on the fly with a single 'store'.
>> Of course, that means we need to figure out how to narrow down the
>> error blast radius first.
> 
> That was one of the advantages of Shiyang Ruan's NAKed patchset to
> enable byte-granularity media errors to pass upwards through the stack
> back to the filesystem, which could then tell applications exactly what
> they lost.
> 
> I want to get back to that, though if Dan won't withdraw the NAK then I
> don't know how to move forward...
> 
>> With respect to plumbing through device-mapper, I thought about that,
>> and wasn't sure. I mean the clear-poison work will eventually fall on
>> the pmem driver, and thru the DM layers, how does that play out thru
>> DM?
> 
> Each of the dm drivers has to add their own ->clear_poison operation
> that remaps the incoming (sector, len) parameters as appropriate for
> that device and then calls the lower device's ->clear_poison with the
> translated parameters.
> 
> This (AFAICT) has already been done for dax_zero_page_range, so I sense
> that Dan is trying to save you a bunch of code plumbing work by nudging
> you towards doing s/dax_clear_poison/dax_zero_page_range/ to this series
> and then you only need patches 2-3.

Thanks Darrick for the explanation!
I don't mind to add DM layer support, it sounds straight forward.
I also like your latest patch and am wondering if the clear_poison API
is still of value.

thanks,
-jane

> 
>> BTW, our customer doesn't care about creating dax volume thru DM, so.
> 
> They might not care, but anything going upstream should work in the
> general case.
> 
> --D
> 
>> thanks!
>> -jane
>>
>>
>>>
>>>>     dax: introduce dax_clear_poison to dax pwrite operation
>>>>     libnvdimm/pmem: Provide pmem_dax_clear_poison for dax operation
>>>>
>>>>    drivers/dax/super.c   | 13 +++++++++++++
>>>>    drivers/nvdimm/pmem.c | 17 +++++++++++++++++
>>>>    fs/dax.c              |  9 +++++++++
>>>>    include/linux/dax.h   |  6 ++++++
>>>>    4 files changed, 45 insertions(+)
>>>>
>>>> --
>>>> 2.18.4
>>>>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/3] dax: clear poison on the fly along pwrite
  2021-09-23 20:55       ` Jane Chu
@ 2021-09-23 21:42         ` Dan Williams
  0 siblings, 0 replies; 24+ messages in thread
From: Dan Williams @ 2021-09-23 21:42 UTC (permalink / raw)
  To: Jane Chu
  Cc: Darrick J. Wong, Vishal L Verma, Dave Jiang, Weiny, Ira, Al Viro,
	Matthew Wilcox, Jan Kara, Linux NVDIMM,
	Linux Kernel Mailing List, linux-fsdevel

On Thu, Sep 23, 2021 at 1:56 PM Jane Chu <jane.chu@oracle.com> wrote:
[..]
> > This (AFAICT) has already been done for dax_zero_page_range, so I sense
> > that Dan is trying to save you a bunch of code plumbing work by nudging
> > you towards doing s/dax_clear_poison/dax_zero_page_range/ to this series
> > and then you only need patches 2-3.
>
> Thanks Darrick for the explanation!
> I don't mind to add DM layer support, it sounds straight forward.
> I also like your latest patch and am wondering if the clear_poison API
> is still of value.

No, the discussion about fallocate(...ZEROINIT...) has lead to a
better solution. Instead of making error clearing a silent /
opportunistic side-effect of writes, or trying to define new fallocate
mode, just add a new RWF_CLEAR_HWERROR flag to pwritev2(). This allows
for dax_direct_access() to map the page regardless of poison and
trigger pmem_copy_from_iter() to precisely handle sub-page poison.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] dax: introduce dax_operation dax_clear_poison
  2021-09-14 23:31 ` [PATCH 1/3] dax: introduce dax_operation dax_clear_poison Jane Chu
@ 2021-11-04 17:53   ` Christoph Hellwig
  0 siblings, 0 replies; 24+ messages in thread
From: Christoph Hellwig @ 2021-11-04 17:53 UTC (permalink / raw)
  To: Jane Chu
  Cc: dan.j.williams, vishal.l.verma, dave.jiang, ira.weiny, viro,
	willy, jack, nvdimm, linux-kernel, linux-fsdevel

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/3] dax: introduce dax_clear_poison to dax pwrite operation
  2021-09-14 23:31 ` [PATCH 2/3] dax: introduce dax_clear_poison to dax pwrite operation Jane Chu
@ 2021-11-04 17:53   ` Christoph Hellwig
  0 siblings, 0 replies; 24+ messages in thread
From: Christoph Hellwig @ 2021-11-04 17:53 UTC (permalink / raw)
  To: Jane Chu
  Cc: dan.j.williams, vishal.l.verma, dave.jiang, ira.weiny, viro,
	willy, jack, nvdimm, linux-kernel, linux-fsdevel

On Tue, Sep 14, 2021 at 05:31:30PM -0600, Jane Chu wrote:
> +		if ((map_len == -EIO) && (iov_iter_rw(iter) == WRITE)) {

No need for the inner braces.

> +			if (dax_clear_poison(dax_dev, pgoff, PHYS_PFN(size)) == 0)

Overly long line.

Otherwise looks good, but it might need a rebase to the iomap_iter
changes.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/3] libnvdimm/pmem: Provide pmem_dax_clear_poison for dax operation
  2021-09-14 23:31 ` [PATCH 3/3] libnvdimm/pmem: Provide pmem_dax_clear_poison for dax operation Jane Chu
@ 2021-11-04 17:55   ` Christoph Hellwig
  2021-11-04 20:27     ` Jane Chu
  0 siblings, 1 reply; 24+ messages in thread
From: Christoph Hellwig @ 2021-11-04 17:55 UTC (permalink / raw)
  To: Jane Chu
  Cc: dan.j.williams, vishal.l.verma, dave.jiang, ira.weiny, viro,
	willy, jack, nvdimm, linux-kernel, linux-fsdevel

On Tue, Sep 14, 2021 at 05:31:32PM -0600, Jane Chu wrote:
> +static int pmem_dax_clear_poison(struct dax_device *dax_dev, pgoff_t pgoff,
> +					size_t nr_pages)
> +{
> +	unsigned int len = PFN_PHYS(nr_pages);
> +	sector_t sector = PFN_PHYS(pgoff) >> SECTOR_SHIFT;
> +	struct pmem_device *pmem = dax_get_private(dax_dev);
> +	phys_addr_t pmem_off = sector * 512 + pmem->data_offset;
> +	blk_status_t ret;
> +
> +	if (!is_bad_pmem(&pmem->bb, sector, len))
> +		return 0;
> +
> +	ret = pmem_clear_poison(pmem, pmem_off, len);
> +	return (ret == BLK_STS_OK) ? 0 : -EIO;

No need for the braces here (and I'd prefer a good old if anyway).

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/3] libnvdimm/pmem: Provide pmem_dax_clear_poison for dax operation
  2021-11-04 17:55   ` Christoph Hellwig
@ 2021-11-04 20:27     ` Jane Chu
  0 siblings, 0 replies; 24+ messages in thread
From: Jane Chu @ 2021-11-04 20:27 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: dan.j.williams, vishal.l.verma, dave.jiang, ira.weiny, viro,
	willy, jack, nvdimm, linux-kernel, linux-fsdevel

On 11/4/2021 10:55 AM, Christoph Hellwig wrote:
> On Tue, Sep 14, 2021 at 05:31:32PM -0600, Jane Chu wrote:
>> +static int pmem_dax_clear_poison(struct dax_device *dax_dev, pgoff_t pgoff,
>> +					size_t nr_pages)
>> +{
>> +	unsigned int len = PFN_PHYS(nr_pages);
>> +	sector_t sector = PFN_PHYS(pgoff) >> SECTOR_SHIFT;
>> +	struct pmem_device *pmem = dax_get_private(dax_dev);
>> +	phys_addr_t pmem_off = sector * 512 + pmem->data_offset;
>> +	blk_status_t ret;
>> +
>> +	if (!is_bad_pmem(&pmem->bb, sector, len))
>> +		return 0;
>> +
>> +	ret = pmem_clear_poison(pmem, pmem_off, len);
>> +	return (ret == BLK_STS_OK) ? 0 : -EIO;
> 
> No need for the braces here (and I'd prefer a good old if anyway).
> 
> Otherwise looks good:
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> 

Thanks a lot!  I'll keep in mind your comments.

-jane

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2021-11-04 20:28 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-14 23:31 [PATCH 0/3] dax: clear poison on the fly along pwrite Jane Chu
2021-09-14 23:31 ` [PATCH 1/3] dax: introduce dax_operation dax_clear_poison Jane Chu
2021-11-04 17:53   ` Christoph Hellwig
2021-09-14 23:31 ` [PATCH 2/3] dax: introduce dax_clear_poison to dax pwrite operation Jane Chu
2021-11-04 17:53   ` Christoph Hellwig
2021-09-14 23:31 ` [PATCH 2/3] dax: introduce dax clear poison to page aligned " Jane Chu
2021-09-14 23:31 ` [PATCH 3/3] libnvdimm/pmem: Provide pmem_dax_clear_poison for dax operation Jane Chu
2021-11-04 17:55   ` Christoph Hellwig
2021-11-04 20:27     ` Jane Chu
2021-09-15  4:44 ` [PATCH 0/3] dax: clear poison on the fly along pwrite Dan Williams
2021-09-15  7:22   ` Jane Chu
2021-09-15 16:15     ` Darrick J. Wong
2021-09-15 20:27       ` Dan Williams
2021-09-16  0:05         ` Darrick J. Wong
2021-09-16  7:11         ` Christoph Hellwig
2021-09-16 18:40           ` Dan Williams
2021-09-17 12:53             ` Christoph Hellwig
2021-09-17 15:27               ` Darrick J. Wong
2021-09-17 20:21                 ` Dan Williams
2021-09-18  0:07                   ` Darrick J. Wong
2021-09-17 19:37               ` Dan Williams
2021-09-23 20:48         ` Jane Chu
2021-09-23 20:55       ` Jane Chu
2021-09-23 21:42         ` Dan Williams

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).