linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/2] "big hammer" for DAX msync/fsync correctness
@ 2015-10-28 22:09 Ross Zwisler
  2015-10-28 22:09 ` [PATCH 1/2] pmem: add wb_cache_pmem() to the PMEM API Ross Zwisler
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Ross Zwisler @ 2015-10-28 22:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, H. Peter Anvin, Dan Williams, Ingo Molnar,
	Thomas Gleixner, linux-nvdimm, x86, Dave Chinner, Jan Kara

This series implements the very slow but correct handling for
blkdev_issue_flush() with DAX mappings, as discussed here:

https://lkml.org/lkml/2015/10/26/116

I don't think that we can actually do the

    on_each_cpu(sync_cache, ...);

...where sync_cache is something like:

    cache_disable();
    wbinvd();
    pcommit();
    cache_enable();

solution as proposed by Dan because WBINVD + PCOMMIT doesn't guarantee that
your writes actually make it durably onto the DIMMs.  I believe you really do
need to loop through the cache lines, flush them with CLWB, then fence and
PCOMMIT.

I do worry that the cost of blindly flushing the entire PMEM namespace on each
fsync or msync will be prohibitively expensive, and that we'll be very
incentivized to move to the radix tree based dirty page tracking as soon as
possible. :)

Ross Zwisler (2):
  pmem: add wb_cache_pmem() to the PMEM API
  pmem: Add simple and slow fsync/msync support

 arch/x86/include/asm/pmem.h | 11 ++++++-----
 drivers/nvdimm/pmem.c       | 10 +++++++++-
 include/linux/pmem.h        | 22 +++++++++++++++++++++-
 3 files changed, 36 insertions(+), 7 deletions(-)

-- 
2.1.0


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 1/2] pmem: add wb_cache_pmem() to the PMEM API
  2015-10-28 22:09 [PATCH 0/2] "big hammer" for DAX msync/fsync correctness Ross Zwisler
@ 2015-10-28 22:09 ` Ross Zwisler
  2015-10-28 22:09 ` [PATCH 2/2] pmem: Add simple and slow fsync/msync support Ross Zwisler
  2015-10-28 22:24 ` [PATCH 0/2] "big hammer" for DAX msync/fsync correctness Jeff Moyer
  2 siblings, 0 replies; 19+ messages in thread
From: Ross Zwisler @ 2015-10-28 22:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, H. Peter Anvin, Dan Williams, Ingo Molnar,
	Thomas Gleixner, linux-nvdimm, x86, Dave Chinner, Jan Kara

The function __arch_wb_cache_pmem() was already an internal implementation
detail of the x86 PMEM API, but this functionality needs to be exported as
part of the general PMEM API to handle the fsync/msync case for DAX mmaps.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 arch/x86/include/asm/pmem.h | 11 ++++++-----
 include/linux/pmem.h        | 22 +++++++++++++++++++++-
 2 files changed, 27 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
index d8ce3ec..6c7ade0 100644
--- a/arch/x86/include/asm/pmem.h
+++ b/arch/x86/include/asm/pmem.h
@@ -67,18 +67,19 @@ static inline void arch_wmb_pmem(void)
 }
 
 /**
- * __arch_wb_cache_pmem - write back a cache range with CLWB
+ * arch_wb_cache_pmem - write back a cache range with CLWB
  * @vaddr:	virtual start address
  * @size:	number of bytes to write back
  *
  * Write back a cache range using the CLWB (cache line write back)
  * instruction.  This function requires explicit ordering with an
- * arch_wmb_pmem() call.  This API is internal to the x86 PMEM implementation.
+ * arch_wmb_pmem() call.
  */
-static inline void __arch_wb_cache_pmem(void *vaddr, size_t size)
+static inline void arch_wb_cache_pmem(void __pmem *addr, size_t size)
 {
 	u16 x86_clflush_size = boot_cpu_data.x86_clflush_size;
 	unsigned long clflush_mask = x86_clflush_size - 1;
+	void *vaddr = (void __force *)addr;
 	void *vend = vaddr + size;
 	void *p;
 
@@ -115,7 +116,7 @@ static inline size_t arch_copy_from_iter_pmem(void __pmem *addr, size_t bytes,
 	len = copy_from_iter_nocache(vaddr, bytes, i);
 
 	if (__iter_needs_pmem_wb(i))
-		__arch_wb_cache_pmem(vaddr, bytes);
+		arch_wb_cache_pmem(addr, bytes);
 
 	return len;
 }
@@ -138,7 +139,7 @@ static inline void arch_clear_pmem(void __pmem *addr, size_t size)
 	else
 		memset(vaddr, 0, size);
 
-	__arch_wb_cache_pmem(vaddr, size);
+	arch_wb_cache_pmem(addr, size);
 }
 
 static inline bool __arch_has_wmb_pmem(void)
diff --git a/include/linux/pmem.h b/include/linux/pmem.h
index 85f810b3..2cd5003 100644
--- a/include/linux/pmem.h
+++ b/include/linux/pmem.h
@@ -53,12 +53,18 @@ static inline void arch_clear_pmem(void __pmem *addr, size_t size)
 {
 	BUG();
 }
+
+static inline void arch_wb_cache_pmem(void __pmem *addr, size_t size)
+{
+	BUG();
+}
 #endif
 
 /*
  * Architectures that define ARCH_HAS_PMEM_API must provide
  * implementations for arch_memcpy_to_pmem(), arch_wmb_pmem(),
- * arch_copy_from_iter_pmem(), arch_clear_pmem() and arch_has_wmb_pmem().
+ * arch_copy_from_iter_pmem(), arch_clear_pmem(), arch_wb_cache_pmem()
+ * and arch_has_wmb_pmem().
  */
 static inline void memcpy_from_pmem(void *dst, void __pmem const *src, size_t size)
 {
@@ -202,4 +208,18 @@ static inline void clear_pmem(void __pmem *addr, size_t size)
 	else
 		default_clear_pmem(addr, size);
 }
+
+/**
+ * wb_cache_pmem - write back processor cache for PMEM memory range
+ * @addr:	virtual start address
+ * @size:	number of bytes to write back
+ *
+ * Write back the processor cache range starting at 'addr' for 'size' bytes.
+ * This function requires explicit ordering with a wmb_pmem() call.
+ */
+static inline void wb_cache_pmem(void __pmem *addr, size_t size)
+{
+	if (arch_has_pmem_api())
+		arch_wb_cache_pmem(addr, size);
+}
 #endif /* __PMEM_H__ */
-- 
2.1.0



* [PATCH 2/2] pmem: Add simple and slow fsync/msync support
  2015-10-28 22:09 [PATCH 0/2] "big hammer" for DAX msync/fsync correctness Ross Zwisler
  2015-10-28 22:09 ` [PATCH 1/2] pmem: add wb_cache_pmem() to the PMEM API Ross Zwisler
@ 2015-10-28 22:09 ` Ross Zwisler
  2015-10-28 23:02   ` Dan Williams
  2015-10-28 22:24 ` [PATCH 0/2] "big hammer" for DAX msync/fsync correctness Jeff Moyer
  2 siblings, 1 reply; 19+ messages in thread
From: Ross Zwisler @ 2015-10-28 22:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, H. Peter Anvin, Dan Williams, Ingo Molnar,
	Thomas Gleixner, linux-nvdimm, x86, Dave Chinner, Jan Kara

Make blkdev_issue_flush() behave correctly according to its required
semantics - all volatile cached data is flushed to stable storage.

Eventually this needs to be replaced with something much more precise by
tracking dirty DAX entries via the radix tree in struct address_space, but
for now this gives us correctness even if the performance is quite bad.

Userspace applications looking to avoid the fsync/msync penalty should
consider more fine-grained flushing via the NVML library:

https://github.com/pmem/nvml

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 drivers/nvdimm/pmem.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 0ba6a97..eea7997 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -80,7 +80,14 @@ static void pmem_make_request(struct request_queue *q, struct bio *bio)
 	if (do_acct)
 		nd_iostat_end(bio, start);
 
-	if (bio_data_dir(bio))
+	if (bio->bi_rw & REQ_FLUSH) {
+		void __pmem *addr = pmem->virt_addr + pmem->data_offset;
+		size_t size = pmem->size - pmem->data_offset;
+
+		wb_cache_pmem(addr, size);
+	}
+
+	if (bio_data_dir(bio) || (bio->bi_rw & REQ_FLUSH))
 		wmb_pmem();
 
 	bio_endio(bio);
@@ -189,6 +196,7 @@ static int pmem_attach_disk(struct device *dev,
 	blk_queue_physical_block_size(pmem->pmem_queue, PAGE_SIZE);
 	blk_queue_max_hw_sectors(pmem->pmem_queue, UINT_MAX);
 	blk_queue_bounce_limit(pmem->pmem_queue, BLK_BOUNCE_ANY);
+	blk_queue_flush(pmem->pmem_queue, REQ_FLUSH);
 	queue_flag_set_unlocked(QUEUE_FLAG_NONROT, pmem->pmem_queue);
 
 	disk = alloc_disk(0);
-- 
2.1.0



* Re: [PATCH 0/2] "big hammer" for DAX msync/fsync correctness
  2015-10-28 22:09 [PATCH 0/2] "big hammer" for DAX msync/fsync correctness Ross Zwisler
  2015-10-28 22:09 ` [PATCH 1/2] pmem: add wb_cache_pmem() to the PMEM API Ross Zwisler
  2015-10-28 22:09 ` [PATCH 2/2] pmem: Add simple and slow fsync/msync support Ross Zwisler
@ 2015-10-28 22:24 ` Jeff Moyer
  2015-10-28 22:49   ` Dan Williams
  2015-10-28 22:51   ` Ross Zwisler
  2 siblings, 2 replies; 19+ messages in thread
From: Jeff Moyer @ 2015-10-28 22:24 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, linux-nvdimm, Dave Chinner, x86, Ingo Molnar,
	H. Peter Anvin, Thomas Gleixner, Jan Kara

Ross Zwisler <ross.zwisler@linux.intel.com> writes:

> This series implements the very slow but correct handling for
> blkdev_issue_flush() with DAX mappings, as discussed here:
>
> https://lkml.org/lkml/2015/10/26/116
>
> I don't think that we can actually do the
>
>     on_each_cpu(sync_cache, ...);
>
> ...where sync_cache is something like:
>
>     cache_disable();
>     wbinvd();
>     pcommit();
>     cache_enable();
>
> solution as proposed by Dan because WBINVD + PCOMMIT doesn't guarantee that
> your writes actually make it durably onto the DIMMs.  I believe you really do
> need to loop through the cache lines, flush them with CLWB, then fence and
> PCOMMIT.

*blink*
*blink*

So much for not violating the principle of least surprise.  I suppose
you've asked the hardware folks, and they've sent you down this path?

> I do worry that the cost of blindly flushing the entire PMEM namespace on each
> fsync or msync will be prohibitively expensive, and that we'll be very
> incentivized to move to the radix tree based dirty page tracking as soon as
> possible. :)

Sure, but wbinvd would be quite costly as well.  Either way I think a
better solution will be required in the near term.

Cheers,
Jeff


* Re: [PATCH 0/2] "big hammer" for DAX msync/fsync correctness
  2015-10-28 22:24 ` [PATCH 0/2] "big hammer" for DAX msync/fsync correctness Jeff Moyer
@ 2015-10-28 22:49   ` Dan Williams
  2015-10-28 22:51   ` Ross Zwisler
  1 sibling, 0 replies; 19+ messages in thread
From: Dan Williams @ 2015-10-28 22:49 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Ross Zwisler, linux-nvdimm, X86 ML, Dave Chinner, linux-kernel,
	Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Jan Kara

On Thu, Oct 29, 2015 at 7:24 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
> Ross Zwisler <ross.zwisler@linux.intel.com> writes:
>
>> This series implements the very slow but correct handling for
>> blkdev_issue_flush() with DAX mappings, as discussed here:
>>
>> https://lkml.org/lkml/2015/10/26/116
>>
>> I don't think that we can actually do the
>>
>>     on_each_cpu(sync_cache, ...);
>>
>> ...where sync_cache is something like:
>>
>>     cache_disable();
>>     wbinvd();
>>     pcommit();
>>     cache_enable();
>>
>> solution as proposed by Dan because WBINVD + PCOMMIT doesn't guarantee that
>> your writes actually make it durably onto the DIMMs.  I believe you really do
>> need to loop through the cache lines, flush them with CLWB, then fence and
>> PCOMMIT.
>
> *blink*
> *blink*
>
> So much for not violating the principle of least surprise.  I suppose
> you've asked the hardware folks, and they've sent you down this path?

The SDM states that wbinvd only asynchronously "signals" L3 to flush.

>> I do worry that the cost of blindly flushing the entire PMEM namespace on each
>> fsync or msync will be prohibitively expensive, and that we'll be very
>> incentivized to move to the radix tree based dirty page tracking as soon as
>> possible. :)
>
> Sure, but wbinvd would be quite costly as well.  Either way I think a
> better solution will be required in the near term.
>

As Peter points out the irqoff latency that wbinvd introduces also
makes it not optimal.


* Re: [PATCH 0/2] "big hammer" for DAX msync/fsync correctness
  2015-10-28 22:24 ` [PATCH 0/2] "big hammer" for DAX msync/fsync correctness Jeff Moyer
  2015-10-28 22:49   ` Dan Williams
@ 2015-10-28 22:51   ` Ross Zwisler
  2015-11-05 23:59     ` Dan Williams
  1 sibling, 1 reply; 19+ messages in thread
From: Ross Zwisler @ 2015-10-28 22:51 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Ross Zwisler, linux-kernel, linux-nvdimm, Dave Chinner, x86,
	Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Jan Kara

On Wed, Oct 28, 2015 at 06:24:29PM -0400, Jeff Moyer wrote:
> Ross Zwisler <ross.zwisler@linux.intel.com> writes:
> 
> > This series implements the very slow but correct handling for
> > blkdev_issue_flush() with DAX mappings, as discussed here:
> >
> > https://lkml.org/lkml/2015/10/26/116
> >
> > I don't think that we can actually do the
> >
> >     on_each_cpu(sync_cache, ...);
> >
> > ...where sync_cache is something like:
> >
> >     cache_disable();
> >     wbinvd();
> >     pcommit();
> >     cache_enable();
> >
> > solution as proposed by Dan because WBINVD + PCOMMIT doesn't guarantee that
> > your writes actually make it durably onto the DIMMs.  I believe you really do
> > need to loop through the cache lines, flush them with CLWB, then fence and
> > PCOMMIT.
> 
> *blink*
> *blink*
> 
> So much for not violating the principle of least surprise.  I suppose
> you've asked the hardware folks, and they've sent you down this path?

Sadly, yes, this was the guidance from the hardware folks.


* Re: [PATCH 2/2] pmem: Add simple and slow fsync/msync support
  2015-10-28 22:09 ` [PATCH 2/2] pmem: Add simple and slow fsync/msync support Ross Zwisler
@ 2015-10-28 23:02   ` Dan Williams
  0 siblings, 0 replies; 19+ messages in thread
From: Dan Williams @ 2015-10-28 23:02 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, H. Peter Anvin, Ingo Molnar, Thomas Gleixner,
	linux-nvdimm@lists.01.org, X86 ML, Dave Chinner, Jan Kara

On Thu, Oct 29, 2015 at 7:09 AM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> Make blkdev_issue_flush() behave correctly according to its required
> semantics - all volatile cached data is flushed to stable storage.
>
> Eventually this needs to be replaced with something much more precise by
> tracking dirty DAX entries via the radix tree in struct address_space, but
> for now this gives us correctness even if the performance is quite bad.
>
> Userspace applications looking to avoid the fsync/msync penalty should
> consider more fine-grained flushing via the NVML library:
>
> https://github.com/pmem/nvml
>
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> ---
>  drivers/nvdimm/pmem.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 0ba6a97..eea7997 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -80,7 +80,14 @@ static void pmem_make_request(struct request_queue *q, struct bio *bio)
>         if (do_acct)
>                 nd_iostat_end(bio, start);
>
> -       if (bio_data_dir(bio))
> +       if (bio->bi_rw & REQ_FLUSH) {
> +               void __pmem *addr = pmem->virt_addr + pmem->data_offset;
> +               size_t size = pmem->size - pmem->data_offset;
> +
> +               wb_cache_pmem(addr, size);
> +       }
> +

So I think this will be too expensive to run synchronously in the
submission path for very large pmem ranges and should be farmed out to
an async thread. Then, as long as we're farming it out, might as well
farm it out to more than one cpu.  I'll take a stab at this on the
flight back from KS.

Another optimization is that we can make the flush a nop up until
pmem_direct_access() is first called, because we know there is nothing
to flush when all the i/o is coming through the driver.  That at least
helps the "pmem as a fast SSD" use case avoid the overhead.

Bikeshed alert... wb_cache_pmem() should probably become
mmio_wb_cache() and live next to mmio_flush_cache() since it is not
specific to persistent memory.


* Re: [PATCH 0/2] "big hammer" for DAX msync/fsync correctness
  2015-10-28 22:51   ` Ross Zwisler
@ 2015-11-05 23:59     ` Dan Williams
  2015-11-06  8:06       ` Thomas Gleixner
  2015-11-06 20:25       ` H. Peter Anvin
  0 siblings, 2 replies; 19+ messages in thread
From: Dan Williams @ 2015-11-05 23:59 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jeff Moyer, linux-nvdimm, X86 ML, Dave Chinner, linux-kernel,
	Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Jan Kara

On Wed, Oct 28, 2015 at 3:51 PM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> On Wed, Oct 28, 2015 at 06:24:29PM -0400, Jeff Moyer wrote:
>> Ross Zwisler <ross.zwisler@linux.intel.com> writes:
>>
>> > This series implements the very slow but correct handling for
>> > blkdev_issue_flush() with DAX mappings, as discussed here:
>> >
>> > https://lkml.org/lkml/2015/10/26/116
>> >
>> > I don't think that we can actually do the
>> >
>> >     on_each_cpu(sync_cache, ...);
>> >
>> > ...where sync_cache is something like:
>> >
>> >     cache_disable();
>> >     wbinvd();
>> >     pcommit();
>> >     cache_enable();
>> >
>> > solution as proposed by Dan because WBINVD + PCOMMIT doesn't guarantee that
>> > your writes actually make it durably onto the DIMMs.  I believe you really do
>> > need to loop through the cache lines, flush them with CLWB, then fence and
>> > PCOMMIT.
>>
>> *blink*
>> *blink*
>>
>> So much for not violating the principle of least surprise.  I suppose
>> you've asked the hardware folks, and they've sent you down this path?
>
> Sadly, yes, this was the guidance from the hardware folks.

So it turns out we weren't asking the right question.  wbinvd may
indeed be viable... we're still working through the caveats.


* Re: [PATCH 0/2] "big hammer" for DAX msync/fsync correctness
  2015-11-05 23:59     ` Dan Williams
@ 2015-11-06  8:06       ` Thomas Gleixner
  2015-11-06 16:04         ` Dan Williams
  2015-11-06 20:25       ` H. Peter Anvin
  1 sibling, 1 reply; 19+ messages in thread
From: Thomas Gleixner @ 2015-11-06  8:06 UTC (permalink / raw)
  To: Dan Williams
  Cc: Ross Zwisler, Jeff Moyer, linux-nvdimm, X86 ML, Dave Chinner,
	linux-kernel, Ingo Molnar, H. Peter Anvin, Jan Kara

On Thu, 5 Nov 2015, Dan Williams wrote:
> On Wed, Oct 28, 2015 at 3:51 PM, Ross Zwisler
> <ross.zwisler@linux.intel.com> wrote:
> > On Wed, Oct 28, 2015 at 06:24:29PM -0400, Jeff Moyer wrote:
> >> Ross Zwisler <ross.zwisler@linux.intel.com> writes:
> >>
> >> > This series implements the very slow but correct handling for
> >> > blkdev_issue_flush() with DAX mappings, as discussed here:
> >> >
> >> > https://lkml.org/lkml/2015/10/26/116
> >> >
> >> > I don't think that we can actually do the
> >> >
> >> >     on_each_cpu(sync_cache, ...);
> >> >
> >> > ...where sync_cache is something like:
> >> >
> >> >     cache_disable();
> >> >     wbinvd();
> >> >     pcommit();
> >> >     cache_enable();
> >> >
> >> > solution as proposed by Dan because WBINVD + PCOMMIT doesn't guarantee that
> >> > your writes actually make it durably onto the DIMMs.  I believe you really do
> >> > need to loop through the cache lines, flush them with CLWB, then fence and
> >> > PCOMMIT.
> >>
> >> *blink*
> >> *blink*
> >>
> >> So much for not violating the principle of least surprise.  I suppose
> >> you've asked the hardware folks, and they've sent you down this path?
> >
> > Sadly, yes, this was the guidance from the hardware folks.
> 
> So it turns out we weren't asking the right question.  wbinvd may
> indeed be viable... we're still working through the caveats.

Just for the record. Such a flush mechanism with 

     on_each_cpu()
	wbinvd()
	...
 
will make that stuff completely unusable on Real-Time systems. We've
been there with the big hammer approach of the intel graphics
driver.

Thanks,

	tglx


* Re: [PATCH 0/2] "big hammer" for DAX msync/fsync correctness
  2015-11-06  8:06       ` Thomas Gleixner
@ 2015-11-06 16:04         ` Dan Williams
  2015-11-06 17:35           ` Thomas Gleixner
  0 siblings, 1 reply; 19+ messages in thread
From: Dan Williams @ 2015-11-06 16:04 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ross Zwisler, Jeff Moyer, linux-nvdimm, X86 ML, Dave Chinner,
	linux-kernel, Ingo Molnar, H. Peter Anvin, Jan Kara

On Fri, Nov 6, 2015 at 12:06 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Thu, 5 Nov 2015, Dan Williams wrote:
>> On Wed, Oct 28, 2015 at 3:51 PM, Ross Zwisler
>> <ross.zwisler@linux.intel.com> wrote:
>> > On Wed, Oct 28, 2015 at 06:24:29PM -0400, Jeff Moyer wrote:
>> >> Ross Zwisler <ross.zwisler@linux.intel.com> writes:
>> >>
>> >> > This series implements the very slow but correct handling for
>> >> > blkdev_issue_flush() with DAX mappings, as discussed here:
>> >> >
>> >> > https://lkml.org/lkml/2015/10/26/116
>> >> >
>> >> > I don't think that we can actually do the
>> >> >
>> >> >     on_each_cpu(sync_cache, ...);
>> >> >
>> >> > ...where sync_cache is something like:
>> >> >
>> >> >     cache_disable();
>> >> >     wbinvd();
>> >> >     pcommit();
>> >> >     cache_enable();
>> >> >
>> >> > solution as proposed by Dan because WBINVD + PCOMMIT doesn't guarantee that
>> >> > your writes actually make it durably onto the DIMMs.  I believe you really do
>> >> > need to loop through the cache lines, flush them with CLWB, then fence and
>> >> > PCOMMIT.
>> >>
>> >> *blink*
>> >> *blink*
>> >>
>> >> So much for not violating the principle of least surprise.  I suppose
>> >> you've asked the hardware folks, and they've sent you down this path?
>> >
>> > Sadly, yes, this was the guidance from the hardware folks.
>>
>> So it turns out we weren't asking the right question.  wbinvd may
>> indeed be viable... we're still working through the caveats.
>
> Just for the record. Such a flush mechanism with
>
>      on_each_cpu()
>         wbinvd()
>         ...
>
> will make that stuff completely unusable on Real-Time systems. We've
> been there with the big hammer approach of the intel graphics
> driver.

Noted.  This means RT systems either need to disable DAX or avoid
fsync.  Yes, this is a wart, but not an unexpected one in a first
generation persistent memory platform.


* Re: [PATCH 0/2] "big hammer" for DAX msync/fsync correctness
  2015-11-06 16:04         ` Dan Williams
@ 2015-11-06 17:35           ` Thomas Gleixner
  2015-11-06 23:17             ` Dan Williams
  0 siblings, 1 reply; 19+ messages in thread
From: Thomas Gleixner @ 2015-11-06 17:35 UTC (permalink / raw)
  To: Dan Williams
  Cc: Ross Zwisler, Jeff Moyer, linux-nvdimm, X86 ML, Dave Chinner,
	linux-kernel, Ingo Molnar, H. Peter Anvin, Jan Kara

On Fri, 6 Nov 2015, Dan Williams wrote:
> On Fri, Nov 6, 2015 at 12:06 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > Just for the record. Such a flush mechanism with
> >
> >      on_each_cpu()
> >         wbinvd()
> >         ...
> >
> > will make that stuff completely unusable on Real-Time systems. We've
> > been there with the big hammer approach of the intel graphics
> > driver.
> 
> Noted.  This means RT systems either need to disable DAX or avoid
> fsync.  Yes, this is a wart, but not an unexpected one in a first
> generation persistent memory platform.

And it's not just RT. The folks who are aiming for 100%
undisturbed user space (NOHZ_FULL) will be massively unhappy about
that as well.

Is it really required to do that on all cpus?

Thanks,

	tglx


* Re: [PATCH 0/2] "big hammer" for DAX msync/fsync correctness
  2015-11-05 23:59     ` Dan Williams
  2015-11-06  8:06       ` Thomas Gleixner
@ 2015-11-06 20:25       ` H. Peter Anvin
  1 sibling, 0 replies; 19+ messages in thread
From: H. Peter Anvin @ 2015-11-06 20:25 UTC (permalink / raw)
  To: Dan Williams, Ross Zwisler
  Cc: Jeff Moyer, linux-nvdimm, X86 ML, Dave Chinner, linux-kernel,
	Ingo Molnar, Thomas Gleixner, Jan Kara

On November 5, 2015 3:59:46 PM PST, Dan Williams <dan.j.williams@intel.com> wrote:
>On Wed, Oct 28, 2015 at 3:51 PM, Ross Zwisler
><ross.zwisler@linux.intel.com> wrote:
>> On Wed, Oct 28, 2015 at 06:24:29PM -0400, Jeff Moyer wrote:
>>> Ross Zwisler <ross.zwisler@linux.intel.com> writes:
>>>
>>> > This series implements the very slow but correct handling for
>>> > blkdev_issue_flush() with DAX mappings, as discussed here:
>>> >
>>> > https://lkml.org/lkml/2015/10/26/116
>>> >
>>> > I don't think that we can actually do the
>>> >
>>> >     on_each_cpu(sync_cache, ...);
>>> >
>>> > ...where sync_cache is something like:
>>> >
>>> >     cache_disable();
>>> >     wbinvd();
>>> >     pcommit();
>>> >     cache_enable();
>>> >
>>> > solution as proposed by Dan because WBINVD + PCOMMIT doesn't
>guarantee that
>>> > your writes actually make it durably onto the DIMMs.  I believe
>you really do
>>> > need to loop through the cache lines, flush them with CLWB, then
>fence and
>>> > PCOMMIT.
>>>
>>> *blink*
>>> *blink*
>>>
>>> So much for not violating the principle of least surprise.  I
>suppose
>>> you've asked the hardware folks, and they've sent you down this
>path?
>>
>> Sadly, yes, this was the guidance from the hardware folks.
>
>So it turns out we weren't asking the right question.  wbinvd may
>indeed be viable... we're still working through the caveats.

Do not disable the caches here.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


* Re: [PATCH 0/2] "big hammer" for DAX msync/fsync correctness
  2015-11-06 17:35           ` Thomas Gleixner
@ 2015-11-06 23:17             ` Dan Williams
  2015-11-07  0:51               ` H. Peter Anvin
  0 siblings, 1 reply; 19+ messages in thread
From: Dan Williams @ 2015-11-06 23:17 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ross Zwisler, Jeff Moyer, linux-nvdimm, X86 ML, Dave Chinner,
	linux-kernel, Ingo Molnar, H. Peter Anvin, Jan Kara

On Fri, Nov 6, 2015 at 9:35 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Fri, 6 Nov 2015, Dan Williams wrote:
>> On Fri, Nov 6, 2015 at 12:06 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>> > Just for the record. Such a flush mechanism with
>> >
>> >      on_each_cpu()
>> >         wbinvd()
>> >         ...
>> >
>> > will make that stuff completely unusable on Real-Time systems. We've
>> > been there with the big hammer approach of the intel graphics
>> > driver.
>>
>> Noted.  This means RT systems either need to disable DAX or avoid
>> fsync.  Yes, this is a wart, but not an unexpected one in a first
>> generation persistent memory platform.
>
> And it's not just only RT. The folks who are aiming for 100%
> undisturbed user space (NOHZ_FULL) will be massively unhappy about
> that as well.
>
> Is it really required to do that on all cpus?
>

I believe it is, but I'll double check.

I assume the folks that want undisturbed userspace are ok with the
mitigation to modify their application to flush by individual cache
lines if they want to use DAX without fsync.  At least until the
platform can provide a cheaper fsync implementation.

The option to drive cache flushing from the radix is at least
interruptible, but it might be long running depending on how much
virtual address space is dirty.  Altogether, the options in the
current generation are:

1/ wbinvd driven: quick flush O(size of cache), but long interrupt-off latency

2/ radix driven: long flush O(size of dirty range), but at least preempt-able

3/ DAX without calling fsync: userspace takes direct responsibility
for cache management of DAX mappings

4/ DAX disabled: fsync is the standard page cache writeback latency

We could potentially argue about 1 vs 2 ad nauseam, but I wonder if
there is room to punt it to a configuration option or make it
dynamic?  My stance is do 1 with the hope of riding options 3 and 4
until the platform happens to provide a better alternative.


* Re: [PATCH 0/2] "big hammer" for DAX msync/fsync correctness
  2015-11-06 23:17             ` Dan Williams
@ 2015-11-07  0:51               ` H. Peter Anvin
  2015-11-07  6:50                 ` Thomas Gleixner
  0 siblings, 1 reply; 19+ messages in thread
From: H. Peter Anvin @ 2015-11-07  0:51 UTC (permalink / raw)
  To: Dan Williams, Thomas Gleixner
  Cc: Ross Zwisler, Jeff Moyer, linux-nvdimm, X86 ML, Dave Chinner,
	linux-kernel, Ingo Molnar, Jan Kara

On 11/06/15 15:17, Dan Williams wrote:
>>
>> Is it really required to do that on all cpus?
> 
> I believe it is, but I'll double check.
> 

It's required on all CPUs on which the DAX memory may have been dirtied.
 This is similar to the way we flush TLBs.

	-hpa




* Re: [PATCH 0/2] "big hammer" for DAX msync/fsync correctness
  2015-11-07  0:51               ` H. Peter Anvin
@ 2015-11-07  6:50                 ` Thomas Gleixner
  2015-11-07  8:12                   ` Dan Williams
  0 siblings, 1 reply; 19+ messages in thread
From: Thomas Gleixner @ 2015-11-07  6:50 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Dan Williams, Ross Zwisler, Jeff Moyer, linux-nvdimm, X86 ML,
	Dave Chinner, linux-kernel, Ingo Molnar, Jan Kara

On Fri, 6 Nov 2015, H. Peter Anvin wrote:
> On 11/06/15 15:17, Dan Williams wrote:
> >>
> >> Is it really required to do that on all cpus?
> > 
> > I believe it is, but I'll double check.
> > 
> 
> It's required on all CPUs on which the DAX memory may have been dirtied.
>  This is similar to the way we flush TLBs.

Right. And that's exactly the problem: "may have been dirtied"

If DAX is used on 50% of the CPUs and the other 50% are plugging away
happily in user space or run low latency RT tasks w/o ever touching
it, then having an unconditional flush on ALL CPUs is just wrong
because you penalize the uninvolved cores with a completely pointless
SMP function call and drain their caches.

Thanks,

	tglx


* Re: [PATCH 0/2] "big hammer" for DAX msync/fsync correctness
  2015-11-07  6:50                 ` Thomas Gleixner
@ 2015-11-07  8:12                   ` Dan Williams
  2015-11-07  8:38                     ` Thomas Gleixner
  0 siblings, 1 reply; 19+ messages in thread
From: Dan Williams @ 2015-11-07  8:12 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: H. Peter Anvin, Ross Zwisler, Jeff Moyer, linux-nvdimm, X86 ML,
	Dave Chinner, linux-kernel, Ingo Molnar, Jan Kara

On Fri, Nov 6, 2015 at 10:50 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Fri, 6 Nov 2015, H. Peter Anvin wrote:
>> On 11/06/15 15:17, Dan Williams wrote:
>> >>
>> >> Is it really required to do that on all cpus?
>> >
>> > I believe it is, but I'll double check.
>> >
>>
>> It's required on all CPUs on which the DAX memory may have been dirtied.
>>  This is similar to the way we flush TLBs.
>
> Right. And that's exactly the problem: "may have been dirtied"
>
> If DAX is used on 50% of the CPUs and the other 50% are plugging away
> happily in user space or run low latency RT tasks w/o ever touching
> it, then having an unconditional flush on ALL CPUs is just wrong
> because you penalize the uninvolved cores with a completely pointless
> SMP function call and drain their caches.
>

It's not wrong and pointless, it's all we have available outside of
having the kernel remember every virtual address that might have been
touched since the last fsync and sit in a loop flushing those virtual
address cache line by cache line.

There is a crossover point where wbinvd is better than a clwb loop
that needs to be determined.


* Re: [PATCH 0/2] "big hammer" for DAX msync/fsync correctness
  2015-11-07  8:12                   ` Dan Williams
@ 2015-11-07  8:38                     ` Thomas Gleixner
  2015-11-07  9:02                       ` Dan Williams
  0 siblings, 1 reply; 19+ messages in thread
From: Thomas Gleixner @ 2015-11-07  8:38 UTC (permalink / raw)
  To: Dan Williams
  Cc: H. Peter Anvin, Ross Zwisler, Jeff Moyer, linux-nvdimm, X86 ML,
	Dave Chinner, linux-kernel, Ingo Molnar, Jan Kara

On Sat, 7 Nov 2015, Dan Williams wrote:
> On Fri, Nov 6, 2015 at 10:50 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > On Fri, 6 Nov 2015, H. Peter Anvin wrote:
> >> On 11/06/15 15:17, Dan Williams wrote:
> >> >>
> >> >> Is it really required to do that on all cpus?
> >> >
> >> > I believe it is, but I'll double check.
> >> >
> >>
> >> It's required on all CPUs on which the DAX memory may have been dirtied.
> >>  This is similar to the way we flush TLBs.
> >
> > Right. And that's exactly the problem: "may have been dirtied"
> >
> > If DAX is used on 50% of the CPUs and the other 50% are plumming away
> > happily in user space or run low latency RT tasks w/o ever touching
> > it, then having an unconditional flush on ALL CPUs is just wrong
> > because you penalize the uninvolved cores with a completely pointless
> > SMP function call and drain their caches.
> >
> 
> It's not wrong and pointless, it's all we have available outside of
> having the kernel remember every virtual address that might have been
> touched since the last fsync and sit in a loop flushing those virtual
> address cache line by cache line.
> 
> There is a crossover point where wbinvd is better than a clwb loop
> that needs to be determined.

This is a totally different issue and I'm well aware that there is a
tradeoff between wbinvd() and a clwb loop. wbinvd() might be more
efficient performance-wise above some number of cache lines, but then
again it's draining all unrelated stuff as well, which can result in an
even larger performance hit.

Now what really concerns me more is that you just unconditionally
flush on all CPUs whether they were involved in that DAX stuff or not.

Assume a DAX-using application on CPUs 0-3 and some other unrelated
workload on CPUs 4-7. That flush will

  - Interrupt CPU4-7 for no reason (whether you use clwb or wbinvd)

  - Drain the cache for CPU4-7 for no reason if done with wbinvd()
   
  - Render Cache Allocation useless if done with wbinvd()

And we are not talking about a few microseconds here. Assume that
CPUs 4-7 have cache allocated and it's mostly dirty. We've measured the
wbinvd() impact on RT, back when the graphics folks used it as a
big hammer. The maximum latency spike was way above one millisecond.

We have similar issues with TLB flushing, but there we

  - are tracking where it was used and never flush on innocent CPUs

  - one can design an application in a way that it uses different
    processes so cross-CPU flushing does not happen

I know that this is not an easy problem to solve, but you should be
aware that various application scenarios are going to be massively
unhappy about that.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/2] "big hammer" for DAX msync/fsync correctness
  2015-11-07  8:38                     ` Thomas Gleixner
@ 2015-11-07  9:02                       ` Dan Williams
  2015-11-07  9:24                         ` Thomas Gleixner
  0 siblings, 1 reply; 19+ messages in thread
From: Dan Williams @ 2015-11-07  9:02 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: H. Peter Anvin, Ross Zwisler, Jeff Moyer, linux-nvdimm, X86 ML,
	Dave Chinner, linux-kernel, Ingo Molnar, Jan Kara

On Sat, Nov 7, 2015 at 12:38 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Sat, 7 Nov 2015, Dan Williams wrote:
>> On Fri, Nov 6, 2015 at 10:50 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>> > On Fri, 6 Nov 2015, H. Peter Anvin wrote:
>> >> On 11/06/15 15:17, Dan Williams wrote:
>> >> >>
>> >> >> Is it really required to do that on all cpus?
>> >> >
>> >> > I believe it is, but I'll double check.
>> >> >
>> >>
>> >> It's required on all CPUs on which the DAX memory may have been dirtied.
>> >>  This is similar to the way we flush TLBs.
>> >
>> > Right. And that's exactly the problem: "may have been dirtied"
>> >
>> > If DAX is used on 50% of the CPUs and the other 50% are plumming away
>> > happily in user space or run low latency RT tasks w/o ever touching
>> > it, then having an unconditional flush on ALL CPUs is just wrong
>> > because you penalize the uninvolved cores with a completely pointless
>> > SMP function call and drain their caches.
>> >
>>
>> It's not wrong and pointless, it's all we have available outside of
>> having the kernel remember every virtual address that might have been
>> touched since the last fsync and sit in a loop flushing those virtual
>> address cache line by cache line.
>>
>> There is a crossover point where wbinvd is better than a clwb loop
>> that needs to be determined.
>
> This is a totally different issue and I'm well aware that there is a
> tradeoff between wbinvd() and a clwb loop. wbinvd() might be more
> efficient performance wise above some number of cache lines, but then
> again it's draining all unrelated stuff as well, which can result in a
> even larger performance hit.
>
> Now what really concerns me more is that you just unconditionally
> flush on all CPUs whether they were involved in that DAX stuff or not.
>
> Assume that DAX using application on CPU 0-3 and some other unrelated
> workload on CPU4-7. That flush will
>
>   - Interrupt CPU4-7 for no reason (whether you use clwb or wbinvd)
>
>   - Drain the cache for CPU4-7 for no reason if done with wbinvd()
>
>   - Render Cache Allocation useless if done with wbinvd()
>
> And we are not talking about a few micro seconds here. Assume that
> CPU4-7 have cache allocated and it's mostly dirty. We've measured the
> wbinvd() impact on RT, back then when the graphic folks used it as a
> big hammer. The maximum latency spike was way above one millisecond.
>
> We have similar issues with TLB flushing, but there we
>
>   - are tracking where it was used and never flush on innocent cpus
>
>   - one can design his application in a way that it uses different
>     processes so cross CPU flushing does not happen
>
> I know that this is not an easy problem to solve, but you should be
> aware that various application scenarios are going to be massively
> unhappy about that.
>

Thanks for that explanation.  Peter had alluded to it at KS, but I
indeed did not know that it was as horrible as milliseconds of
latency, hmm...

One other mitigation, which follows on from Dave's plan of per-inode DAX
control, is to also track when an inode has a writable DAX mmap
established.  With that we could have a REQ_DAX flag to augment
REQ_FLUSH and potentially reduce committing violence on the cache.  In
an earlier thread I also recall an idea to have an mmap flag that an
app can use to say "yes, I'm doing a writable DAX mapping, but I'm
taking care of the cache myself".  We could track innocent CPUs, but
I'm thinking that would require a core change to write-protect pages
when a thread migrates?  In general I feel there's a limit to how much
hardware workaround is reasonable to do in the core kernel vs waiting
for the platform to offer better options...

Sorry if I'm being a bit punchy, but I'm still feeling like I need to
defend the notion that DAX may just need to be turned off in some
situations.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/2] "big hammer" for DAX msync/fsync correctness
  2015-11-07  9:02                       ` Dan Williams
@ 2015-11-07  9:24                         ` Thomas Gleixner
  0 siblings, 0 replies; 19+ messages in thread
From: Thomas Gleixner @ 2015-11-07  9:24 UTC (permalink / raw)
  To: Dan Williams
  Cc: H. Peter Anvin, Ross Zwisler, Jeff Moyer, linux-nvdimm, X86 ML,
	Dave Chinner, linux-kernel, Ingo Molnar, Jan Kara

On Sat, 7 Nov 2015, Dan Williams wrote:
> Thanks for that explanation.  Peter had alluded to it at KS, but I
> indeed did not know that it was as horrible as milliseconds of
> latency, hmm...

Yes, I was pretty surprised as well. But even if it's just in the
hundreds of microseconds it can be too much for latency sensitive
applications.
 
> One other mitigation that follows on with Dave's plan of per-inode DAX
> control, is to also track when an inode has a writable DAX mmap
> established.  With that we could have a REQ_DAX flag to augment
> REQ_FLUSH to potentially reduce committing violence on the cache.  In
> an earlier thread I also recall an idea to have an mmap flag that an
> app can use to say "yes, I'm doing a writable DAX mapping, but I'm
> taking care of the cache myself".  We could track innocent cpus, but
> I'm thinking that would be a core change to write-protect pages when a
> thread migrates?  In general I feel there's a limit for how much
> hardware workaround is reasonable to do in the core kernel vs waiting
> for the platform to offer better options...

One thing vs. the mmaps: we know exactly which CPUs are involved in
that mapping. We know that from the TLB management, so we can probably
make use of that knowledge.

> Sorry if I'm being a bit punchy, but I'm still feeling like I need to
> defend the notion that DAX may just need to be turned off in some
> situations.

That's fine, if there is no reasonable way around it. It just needs to
be documented so people won't be surprised.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2015-11-07  9:25 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-10-28 22:09 [PATCH 0/2] "big hammer" for DAX msync/fsync correctness Ross Zwisler
2015-10-28 22:09 ` [PATCH 1/2] pmem: add wb_cache_pmem() to the PMEM API Ross Zwisler
2015-10-28 22:09 ` [PATCH 2/2] pmem: Add simple and slow fsync/msync support Ross Zwisler
2015-10-28 23:02   ` Dan Williams
2015-10-28 22:24 ` [PATCH 0/2] "big hammer" for DAX msync/fsync correctness Jeff Moyer
2015-10-28 22:49   ` Dan Williams
2015-10-28 22:51   ` Ross Zwisler
2015-11-05 23:59     ` Dan Williams
2015-11-06  8:06       ` Thomas Gleixner
2015-11-06 16:04         ` Dan Williams
2015-11-06 17:35           ` Thomas Gleixner
2015-11-06 23:17             ` Dan Williams
2015-11-07  0:51               ` H. Peter Anvin
2015-11-07  6:50                 ` Thomas Gleixner
2015-11-07  8:12                   ` Dan Williams
2015-11-07  8:38                     ` Thomas Gleixner
2015-11-07  9:02                       ` Dan Williams
2015-11-07  9:24                         ` Thomas Gleixner
2015-11-06 20:25       ` H. Peter Anvin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).