nvdimm.lists.linux.dev archive mirror
* Re: [RFC PATCH v1 0/6] use mm to manage NVDIMM (pmem) zone
       [not found] <1525704627-30114-1-git-send-email-yehs1@lenovo.com>
@ 2018-05-07 18:46 ` Matthew Wilcox
  2018-05-07 18:57   ` Dan Williams
  2018-05-08  0:54   ` [External] " Huaisheng HS1 Ye
  0 siblings, 2 replies; 21+ messages in thread
From: Matthew Wilcox @ 2018-05-07 18:46 UTC (permalink / raw)
  To: Huaisheng Ye
  Cc: akpm, linux-mm, mhocko, vbabka, mgorman, pasha.tatashin,
	alexander.levin, hannes, penguin-kernel, colyli, chengnt,
	linux-kernel, linux-nvdimm

On Mon, May 07, 2018 at 10:50:21PM +0800, Huaisheng Ye wrote:
> Traditionally, the mm (memory management) subsystem treats NVDIMMs as the
> DEVICE zone, a virtual zone whose start and end pfns are both 0, so mm does
> not manage NVDIMM directly the way it manages DRAM. Instead, the kernel
> relies on the corresponding drivers under drivers/nvdimm/ and
> drivers/acpi/nfit/, plus the filesystems, to implement NVDIMM memory
> allocation and freeing on top of the memory hotplug implementation.

You probably want to let linux-nvdimm know about this patch set.
Adding to the cc.  Also, I only received patch 0 and 4.  What happened
to 1-3,5 and 6?

> With the current kernel, many of mm's classical features, such as the
> buddy system, the swap mechanism and the page cache, cannot be used with
> NVDIMM. What we are doing is expanding the kernel mm's capability so that
> it can handle NVDIMM like DRAM. Furthermore, we make mm treat DRAM and
> NVDIMM separately, so that mm can place only the critical pages in the
> NVDIMM zone; for this we created a new zone type, the NVM zone. That is to
> say, traditional (or normal) pages are still stored in the DRAM-backed
> zones such as Normal, DMA32 and DMA. But the critical pages, which we hope
> can be recovered after a power failure or system crash, are made
> persistent by storing them in the NVM zone.
> 
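
As background for the zone change proposed above, a stand-alone model of the idea named in patch 3 ("create ZONE_NVM and fill into GFP_ZONE_TABLE") might look like the sketch below. ZONE_NVM and __GFP_NVM are taken from the patch titles and cover letter; the selection logic is only an assumption, since patches 1-3 were not received here.

/*
 * Illustrative model, not the posted patches: how a gfp flag could steer
 * allocations into a new NVM zone.  ZONE_NVM and __GFP_NVM are hypothetical
 * stand-ins; the real kernel encodes this in include/linux/gfp.h.
 */
#include <stdio.h>

enum zone_type {
	ZONE_DMA,
	ZONE_DMA32,
	ZONE_NORMAL,
	ZONE_NVM,	/* proposed: pages that must survive power loss */
	MAX_NR_ZONES,
};

#define __GFP_DMA	0x01u
#define __GFP_DMA32	0x02u
#define __GFP_NVM	0x04u	/* proposed allocation modifier */

/* Pick the zone implied by the gfp flags, defaulting to ZONE_NORMAL. */
static enum zone_type gfp_zone(unsigned int gfp_flags)
{
	if (gfp_flags & __GFP_NVM)
		return ZONE_NVM;
	if (gfp_flags & __GFP_DMA)
		return ZONE_DMA;
	if (gfp_flags & __GFP_DMA32)
		return ZONE_DMA32;
	return ZONE_NORMAL;
}

int main(void)
{
	printf("plain allocation  -> zone %d\n", gfp_zone(0));
	printf("__GFP_NVM request -> zone %d\n", gfp_zone(__GFP_NVM));
	return 0;
}
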
> We installed two NVDIMMs, each with 125GB of capacity, in a Lenovo
> ThinkSystem product as the development platform. With the patches below,
> mm can create NVM zones for the NVDIMMs.
> 
> Here is dmesg info,
>  Initmem setup node 0 [mem 0x0000000000001000-0x000000237fffffff]
>  On node 0 totalpages: 36879666
>    DMA zone: 64 pages used for memmap
>    DMA zone: 23 pages reserved
>    DMA zone: 3999 pages, LIFO batch:0
>  mminit::memmap_init Initialising map node 0 zone 0 pfns 1 -> 4096 
>    DMA32 zone: 10935 pages used for memmap
>    DMA32 zone: 699795 pages, LIFO batch:31
>  mminit::memmap_init Initialising map node 0 zone 1 pfns 4096 -> 1048576
>    Normal zone: 53248 pages used for memmap
>    Normal zone: 3407872 pages, LIFO batch:31
>  mminit::memmap_init Initialising map node 0 zone 2 pfns 1048576 -> 4456448
>    NVM zone: 512000 pages used for memmap
>    NVM zone: 32768000 pages, LIFO batch:31
>  mminit::memmap_init Initialising map node 0 zone 3 pfns 4456448 -> 37224448
>  Initmem setup node 1 [mem 0x0000002380000000-0x00000046bfffffff]
>  On node 1 totalpages: 36962304
>    Normal zone: 65536 pages used for memmap
>    Normal zone: 4194304 pages, LIFO batch:31
>  mminit::memmap_init Initialising map node 1 zone 2 pfns 37224448 -> 41418752
>    NVM zone: 512000 pages used for memmap
>    NVM zone: 32768000 pages, LIFO batch:31
>  mminit::memmap_init Initialising map node 1 zone 3 pfns 41418752 -> 74186752
> 
> This comes from /proc/zoneinfo:
> Node 0, zone      NVM
>   pages free     32768000
>         min      15244
>         low      48012
>         high     80780
>         spanned  32768000
>         present  32768000
>         managed  32768000
>         protection: (0, 0, 0, 0, 0, 0)
>         nr_free_pages 32768000
> Node 1, zone      NVM
>   pages free     32768000
>         min      15244
>         low      48012
>         high     80780
>         spanned  32768000
>         present  32768000
>         managed  32768000
> 
> Huaisheng Ye (6):
>   mm/memblock: Expand definition of flags to support NVDIMM
>   mm/page_alloc.c: get pfn range with flags of memblock
>   mm, zone_type: create ZONE_NVM and fill into GFP_ZONE_TABLE
>   arch/x86/kernel: mark NVDIMM regions from e820_table
>   mm: get zone spanned pages separately for DRAM and NVDIMM
>   arch/x86/mm: create page table mapping for DRAM and NVDIMM both
> 
>  arch/x86/include/asm/e820/api.h |  3 +++
>  arch/x86/kernel/e820.c          | 20 +++++++++++++-
>  arch/x86/kernel/setup.c         |  8 ++++++
>  arch/x86/mm/init_64.c           | 16 +++++++++++
>  include/linux/gfp.h             | 57 ++++++++++++++++++++++++++++++++++++---
>  include/linux/memblock.h        | 19 +++++++++++++
>  include/linux/mm.h              |  4 +++
>  include/linux/mmzone.h          |  3 +++
>  mm/Kconfig                      | 16 +++++++++++
>  mm/memblock.c                   | 46 +++++++++++++++++++++++++++----
>  mm/nobootmem.c                  |  5 ++--
>  mm/page_alloc.c                 | 60 ++++++++++++++++++++++++++++++++++++++++-
>  12 files changed, 245 insertions(+), 12 deletions(-)
> 
> -- 
> 1.8.3.1
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH v1 0/6] use mm to manage NVDIMM (pmem) zone
  2018-05-07 18:46 ` [RFC PATCH v1 0/6] use mm to manage NVDIMM (pmem) zone Matthew Wilcox
@ 2018-05-07 18:57   ` Dan Williams
  2018-05-07 19:08     ` Jeff Moyer
  2018-05-07 19:18     ` Matthew Wilcox
  2018-05-08  0:54   ` [External] " Huaisheng HS1 Ye
  1 sibling, 2 replies; 21+ messages in thread
From: Dan Williams @ 2018-05-07 18:57 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Michal Hocko, Huaisheng Ye, linux-nvdimm, Tetsuo Handa, chengnt,
	Dave Hansen, Linux Kernel Mailing List, pasha.tatashin, Linux MM,
	colyli, Johannes Weiner, Andrew Morton, Sasha Levin, Mel Gorman,
	Vlastimil Babka

On Mon, May 7, 2018 at 11:46 AM, Matthew Wilcox <willy@infradead.org> wrote:
> On Mon, May 07, 2018 at 10:50:21PM +0800, Huaisheng Ye wrote:
>> Traditionally, NVDIMMs are treated by mm(memory management) subsystem as
>> DEVICE zone, which is a virtual zone and both its start and end of pfn
>> are equal to 0, mm wouldn’t manage NVDIMM directly as DRAM, kernel uses
>> corresponding drivers, which locate at \drivers\nvdimm\ and
>> \drivers\acpi\nfit and fs, to realize NVDIMM memory alloc and free with
>> memory hot plug implementation.
>
> You probably want to let linux-nvdimm know about this patch set.
> Adding to the cc.

Yes, thanks for that!

> Also, I only received patch 0 and 4.  What happened
> to 1-3,5 and 6?
>
>> With current kernel, many mm’s classical features like the buddy
>> system, swap mechanism and page cache couldn’t be supported to NVDIMM.
>> What we are doing is to expand kernel mm’s capacity to make it to handle
>> NVDIMM like DRAM. Furthermore we make mm could treat DRAM and NVDIMM
>> separately, that means mm can only put the critical pages to NVDIMM
>> zone, here we created a new zone type as NVM zone. That is to say for
>> traditional(or normal) pages which would be stored at DRAM scope like
>> Normal, DMA32 and DMA zones. But for the critical pages, which we hope
>> them could be recovered from power fail or system crash, we make them
>> to be persistent by storing them to NVM zone.
>>
>> We installed two NVDIMMs to Lenovo Thinksystem product as development
>> platform, which has 125GB storage capacity respectively. With these
>> patches below, mm can create NVM zones for NVDIMMs.
>>
>> Here is dmesg info,
>>  Initmem setup node 0 [mem 0x0000000000001000-0x000000237fffffff]
>>  On node 0 totalpages: 36879666
>>    DMA zone: 64 pages used for memmap
>>    DMA zone: 23 pages reserved
>>    DMA zone: 3999 pages, LIFO batch:0
>>  mminit::memmap_init Initialising map node 0 zone 0 pfns 1 -> 4096
>>    DMA32 zone: 10935 pages used for memmap
>>    DMA32 zone: 699795 pages, LIFO batch:31
>>  mminit::memmap_init Initialising map node 0 zone 1 pfns 4096 -> 1048576
>>    Normal zone: 53248 pages used for memmap
>>    Normal zone: 3407872 pages, LIFO batch:31
>>  mminit::memmap_init Initialising map node 0 zone 2 pfns 1048576 -> 4456448
>>    NVM zone: 512000 pages used for memmap
>>    NVM zone: 32768000 pages, LIFO batch:31
>>  mminit::memmap_init Initialising map node 0 zone 3 pfns 4456448 -> 37224448
>>  Initmem setup node 1 [mem 0x0000002380000000-0x00000046bfffffff]
>>  On node 1 totalpages: 36962304
>>    Normal zone: 65536 pages used for memmap
>>    Normal zone: 4194304 pages, LIFO batch:31
>>  mminit::memmap_init Initialising map node 1 zone 2 pfns 37224448 -> 41418752
>>    NVM zone: 512000 pages used for memmap
>>    NVM zone: 32768000 pages, LIFO batch:31
>>  mminit::memmap_init Initialising map node 1 zone 3 pfns 41418752 -> 74186752
>>
>> This comes /proc/zoneinfo
>> Node 0, zone      NVM
>>   pages free     32768000
>>         min      15244
>>         low      48012
>>         high     80780
>>         spanned  32768000
>>         present  32768000
>>         managed  32768000
>>         protection: (0, 0, 0, 0, 0, 0)
>>         nr_free_pages 32768000
>> Node 1, zone      NVM
>>   pages free     32768000
>>         min      15244
>>         low      48012
>>         high     80780
>>         spanned  32768000
>>         present  32768000
>>         managed  32768000

I think adding yet one more mm-zone is the wrong direction. Instead,
what we have been considering is a mechanism to allow a device-dax
instance to be given back to the kernel as a distinct numa node
managed by the VM. It seems it's time to dust off those patches.
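
To make that alternative concrete: once a device-dax instance is handed back to the kernel as its own NUMA node, applications consume the capacity through the ordinary NUMA APIs rather than a new zone. The sketch below uses libnuma and assumes the pmem capacity shows up as node 1; the node number and size are illustrative only, not part of any posted patches.

/*
 * Illustrative sketch of the consumption model: pmem exposed as a regular
 * (volatile-by-contract) NUMA node.  Build with: gcc pmem_node.c -lnuma
 * Assumes libnuma is installed and node 1 is the pmem-backed node.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <numa.h>

int main(void)
{
	const int pmem_node = 1;          /* assumed node id of the NVDIMM capacity */
	const size_t len = 64UL << 20;    /* 64 MiB */

	if (numa_available() < 0) {
		fprintf(stderr, "NUMA not available\n");
		return 1;
	}

	/* Ask the kernel to place these pages on the pmem-backed node. */
	void *buf = numa_alloc_onnode(len, pmem_node);
	if (!buf) {
		perror("numa_alloc_onnode");
		return 1;
	}

	memset(buf, 0, len);              /* fault the pages in on that node */
	printf("64 MiB allocated on node %d\n", pmem_node);

	numa_free(buf, len);
	return 0;
}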

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH v1 0/6] use mm to manage NVDIMM (pmem) zone
  2018-05-07 18:57   ` Dan Williams
@ 2018-05-07 19:08     ` Jeff Moyer
  2018-05-07 19:17       ` Dan Williams
  2018-05-08  2:59       ` [External] " Huaisheng HS1 Ye
  2018-05-07 19:18     ` Matthew Wilcox
  1 sibling, 2 replies; 21+ messages in thread
From: Jeff Moyer @ 2018-05-07 19:08 UTC (permalink / raw)
  To: Dan Williams
  Cc: Sasha Levin, Michal Hocko, Huaisheng Ye, linux-nvdimm,
	Tetsuo Handa, chengnt, Linux Kernel Mailing List, Matthew Wilcox,
	pasha.tatashin, Linux MM, Dave Hansen, Johannes Weiner,
	Andrew Morton, colyli, Mel Gorman, Vlastimil Babka

Dan Williams <dan.j.williams@intel.com> writes:

> On Mon, May 7, 2018 at 11:46 AM, Matthew Wilcox <willy@infradead.org> wrote:
>> On Mon, May 07, 2018 at 10:50:21PM +0800, Huaisheng Ye wrote:
>>> Traditionally, NVDIMMs are treated by mm(memory management) subsystem as
>>> DEVICE zone, which is a virtual zone and both its start and end of pfn
>>> are equal to 0, mm wouldn’t manage NVDIMM directly as DRAM, kernel uses
>>> corresponding drivers, which locate at \drivers\nvdimm\ and
>>> \drivers\acpi\nfit and fs, to realize NVDIMM memory alloc and free with
>>> memory hot plug implementation.
>>
>> You probably want to let linux-nvdimm know about this patch set.
>> Adding to the cc.
>
> Yes, thanks for that!
>
>> Also, I only received patch 0 and 4.  What happened
>> to 1-3,5 and 6?
>>
>>> With current kernel, many mm’s classical features like the buddy
>>> system, swap mechanism and page cache couldn’t be supported to NVDIMM.
>>> What we are doing is to expand kernel mm’s capacity to make it to handle
>>> NVDIMM like DRAM. Furthermore we make mm could treat DRAM and NVDIMM
>>> separately, that means mm can only put the critical pages to NVDIMM

Please define "critical pages."

>>> zone, here we created a new zone type as NVM zone. That is to say for
>>> traditional(or normal) pages which would be stored at DRAM scope like
>>> Normal, DMA32 and DMA zones. But for the critical pages, which we hope
>>> them could be recovered from power fail or system crash, we make them
>>> to be persistent by storing them to NVM zone.

[...]

> I think adding yet one more mm-zone is the wrong direction. Instead,
> what we have been considering is a mechanism to allow a device-dax
> instance to be given back to the kernel as a distinct numa node
> managed by the VM. It seems it's time to dust off those patches.

What's the use case?  The above patch description seems to indicate an
intent to recover contents after a power loss.  Without seeing the whole
series, I'm not sure how that's accomplished in a safe or meaningful
way.

Huaisheng, could you provide a bit more background?

Thanks!
Jeff

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH v1 0/6] use mm to manage NVDIMM (pmem) zone
  2018-05-07 19:08     ` Jeff Moyer
@ 2018-05-07 19:17       ` Dan Williams
  2018-05-07 19:28         ` Jeff Moyer
  2018-05-08  2:59       ` [External] " Huaisheng HS1 Ye
  1 sibling, 1 reply; 21+ messages in thread
From: Dan Williams @ 2018-05-07 19:17 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Matthew Wilcox, Michal Hocko, Huaisheng Ye, linux-nvdimm,
	Tetsuo Handa, chengnt, Dave Hansen, Linux Kernel Mailing List,
	pasha.tatashin, Linux MM, colyli, Johannes Weiner, Andrew Morton,
	Sasha Levin, Mel Gorman, Vlastimil Babka

On Mon, May 7, 2018 at 12:08 PM, Jeff Moyer <jmoyer@redhat.com> wrote:
> Dan Williams <dan.j.williams@intel.com> writes:
>
>> On Mon, May 7, 2018 at 11:46 AM, Matthew Wilcox <willy@infradead.org> wrote:
>>> On Mon, May 07, 2018 at 10:50:21PM +0800, Huaisheng Ye wrote:
>>>> Traditionally, NVDIMMs are treated by mm(memory management) subsystem as
>>>> DEVICE zone, which is a virtual zone and both its start and end of pfn
>>>> are equal to 0, mm wouldn’t manage NVDIMM directly as DRAM, kernel uses
>>>> corresponding drivers, which locate at \drivers\nvdimm\ and
>>>> \drivers\acpi\nfit and fs, to realize NVDIMM memory alloc and free with
>>>> memory hot plug implementation.
>>>
>>> You probably want to let linux-nvdimm know about this patch set.
>>> Adding to the cc.
>>
>> Yes, thanks for that!
>>
>>> Also, I only received patch 0 and 4.  What happened
>>> to 1-3,5 and 6?
>>>
>>>> With current kernel, many mm’s classical features like the buddy
>>>> system, swap mechanism and page cache couldn’t be supported to NVDIMM.
>>>> What we are doing is to expand kernel mm’s capacity to make it to handle
>>>> NVDIMM like DRAM. Furthermore we make mm could treat DRAM and NVDIMM
>>>> separately, that means mm can only put the critical pages to NVDIMM
>
> Please define "critical pages."
>
>>>> zone, here we created a new zone type as NVM zone. That is to say for
>>>> traditional(or normal) pages which would be stored at DRAM scope like
>>>> Normal, DMA32 and DMA zones. But for the critical pages, which we hope
>>>> them could be recovered from power fail or system crash, we make them
>>>> to be persistent by storing them to NVM zone.
>
> [...]
>
>> I think adding yet one more mm-zone is the wrong direction. Instead,
>> what we have been considering is a mechanism to allow a device-dax
>> instance to be given back to the kernel as a distinct numa node
>> managed by the VM. It seems it's time to dust off those patches.
>
> What's the use case?

Use NVDIMMs as System-RAM given their potentially higher capacity than
DDR. The expectation in that case is that data is forfeit (not
persisted) after a crash. Any persistent use case would need to go
through the pmem driver, filesystem-dax or device-dax.
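
For contrast, the persistent paths mentioned here look roughly like the device-dax sketch below; /dev/dax0.0 and the 2 MiB mapping size are assumptions about one particular namespace configuration, not a prescribed setup.

/*
 * Illustrative sketch of the persistent path via device-dax.  Assumes a
 * device-dax namespace exists at /dev/dax0.0 with >= 2 MiB alignment;
 * names and sizes are illustrative only.
 */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
	const size_t len = 2UL << 20;     /* one 2 MiB mapping unit */

	int fd = open("/dev/dax0.0", O_RDWR);
	if (fd < 0) {
		perror("open /dev/dax0.0");
		return 1;
	}

	void *pmem = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (pmem == MAP_FAILED) {
		perror("mmap");
		close(fd);
		return 1;
	}

	/* Stores land directly in persistent memory; the application is
	 * responsible for flushing CPU caches (e.g. clwb) before it relies
	 * on the data being durable. */
	strcpy(pmem, "hello, pmem");

	munmap(pmem, len);
	close(fd);
	return 0;
}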

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH v1 0/6] use mm to manage NVDIMM (pmem) zone
  2018-05-07 18:57   ` Dan Williams
  2018-05-07 19:08     ` Jeff Moyer
@ 2018-05-07 19:18     ` Matthew Wilcox
  2018-05-07 19:30       ` Dan Williams
  1 sibling, 1 reply; 21+ messages in thread
From: Matthew Wilcox @ 2018-05-07 19:18 UTC (permalink / raw)
  To: Dan Williams
  Cc: Michal Hocko, Huaisheng Ye, linux-nvdimm, Tetsuo Handa, chengnt,
	Dave Hansen, Linux Kernel Mailing List, pasha.tatashin, Linux MM,
	colyli, Johannes Weiner, Andrew Morton, Sasha Levin, Mel Gorman,
	Vlastimil Babka

On Mon, May 07, 2018 at 11:57:10AM -0700, Dan Williams wrote:
> I think adding yet one more mm-zone is the wrong direction. Instead,
> what we have been considering is a mechanism to allow a device-dax
> instance to be given back to the kernel as a distinct numa node
> managed by the VM. It seems it's time to dust off those patches.

I was wondering how "safe" we think that ability is.  NV-DIMM pages
(obviously) differ from normal pages by their non-volatility.  Do we
want their contents from the previous boot to be observable?  If not,
then we need the BIOS to clear them at boot-up, which means we would
want no kernel changes at all; rather the BIOS should just describe
those pages as if they were DRAM (after zeroing them).

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH v1 0/6] use mm to manage NVDIMM (pmem) zone
  2018-05-07 19:17       ` Dan Williams
@ 2018-05-07 19:28         ` Jeff Moyer
  2018-05-07 19:29           ` Dan Williams
  0 siblings, 1 reply; 21+ messages in thread
From: Jeff Moyer @ 2018-05-07 19:28 UTC (permalink / raw)
  To: Dan Williams
  Cc: Sasha Levin, Michal Hocko, Huaisheng Ye, linux-nvdimm,
	Tetsuo Handa, chengnt, Linux Kernel Mailing List, Matthew Wilcox,
	pasha.tatashin, Linux MM, Dave Hansen, Johannes Weiner,
	Andrew Morton, colyli, Mel Gorman, Vlastimil Babka

Dan Williams <dan.j.williams@intel.com> writes:

> On Mon, May 7, 2018 at 12:08 PM, Jeff Moyer <jmoyer@redhat.com> wrote:
>> Dan Williams <dan.j.williams@intel.com> writes:
>>
>>> On Mon, May 7, 2018 at 11:46 AM, Matthew Wilcox <willy@infradead.org> wrote:
>>>> On Mon, May 07, 2018 at 10:50:21PM +0800, Huaisheng Ye wrote:
>>>>> Traditionally, NVDIMMs are treated by mm(memory management) subsystem as
>>>>> DEVICE zone, which is a virtual zone and both its start and end of pfn
>>>>> are equal to 0, mm wouldn’t manage NVDIMM directly as DRAM, kernel uses
>>>>> corresponding drivers, which locate at \drivers\nvdimm\ and
>>>>> \drivers\acpi\nfit and fs, to realize NVDIMM memory alloc and free with
>>>>> memory hot plug implementation.
>>>>
>>>> You probably want to let linux-nvdimm know about this patch set.
>>>> Adding to the cc.
>>>
>>> Yes, thanks for that!
>>>
>>>> Also, I only received patch 0 and 4.  What happened
>>>> to 1-3,5 and 6?
>>>>
>>>>> With current kernel, many mm’s classical features like the buddy
>>>>> system, swap mechanism and page cache couldn’t be supported to NVDIMM.
>>>>> What we are doing is to expand kernel mm’s capacity to make it to handle
>>>>> NVDIMM like DRAM. Furthermore we make mm could treat DRAM and NVDIMM
>>>>> separately, that means mm can only put the critical pages to NVDIMM
>>
>> Please define "critical pages."
>>
>>>>> zone, here we created a new zone type as NVM zone. That is to say for
>>>>> traditional(or normal) pages which would be stored at DRAM scope like
>>>>> Normal, DMA32 and DMA zones. But for the critical pages, which we hope
>>>>> them could be recovered from power fail or system crash, we make them
>>>>> to be persistent by storing them to NVM zone.
>>
>> [...]
>>
>>> I think adding yet one more mm-zone is the wrong direction. Instead,
>>> what we have been considering is a mechanism to allow a device-dax
>>> instance to be given back to the kernel as a distinct numa node
>>> managed by the VM. It seems it's time to dust off those patches.
>>
>> What's the use case?
>
> Use NVDIMMs as System-RAM given their potentially higher capacity than
> DDR. The expectation in that case is that data is forfeit (not
> persisted) after a crash. Any persistent use case would need to go
> through the pmem driver, filesystem-dax or device-dax.

OK, but that sounds different from what was being proposed, here.  I'll
quote from above:

>>>>> But for the critical pages, which we hope them could be recovered
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>>> from power fail or system crash, we make them to be persistent by
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>>> storing them to NVM zone.

Hence my confusion.

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH v1 0/6] use mm to manage NVDIMM (pmem) zone
  2018-05-07 19:28         ` Jeff Moyer
@ 2018-05-07 19:29           ` Dan Williams
  0 siblings, 0 replies; 21+ messages in thread
From: Dan Williams @ 2018-05-07 19:29 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Sasha Levin, Michal Hocko, Huaisheng Ye, linux-nvdimm,
	Tetsuo Handa, chengnt, Linux Kernel Mailing List, Matthew Wilcox,
	pasha.tatashin, Linux MM, Dave Hansen, Johannes Weiner,
	Andrew Morton, colyli, Mel Gorman, Vlastimil Babka

On Mon, May 7, 2018 at 12:28 PM, Jeff Moyer <jmoyer@redhat.com> wrote:
> Dan Williams <dan.j.williams@intel.com> writes:
[..]
>>> What's the use case?
>>
>> Use NVDIMMs as System-RAM given their potentially higher capacity than
>> DDR. The expectation in that case is that data is forfeit (not
>> persisted) after a crash. Any persistent use case would need to go
>> through the pmem driver, filesystem-dax or device-dax.
>
> OK, but that sounds different from what was being proposed, here.  I'll
> quote from above:
>
>>>>>> But for the critical pages, which we hope them could be recovered
>                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>>>> from power fail or system crash, we make them to be persistent by
>       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>>>> storing them to NVM zone.
>
> Hence my confusion.

Yes, now mine too, I overlooked that.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH v1 0/6] use mm to manage NVDIMM (pmem) zone
  2018-05-07 19:18     ` Matthew Wilcox
@ 2018-05-07 19:30       ` Dan Williams
  0 siblings, 0 replies; 21+ messages in thread
From: Dan Williams @ 2018-05-07 19:30 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Michal Hocko, Huaisheng Ye, linux-nvdimm, Tetsuo Handa, chengnt,
	Dave Hansen, Linux Kernel Mailing List, pasha.tatashin, Linux MM,
	colyli, Johannes Weiner, Andrew Morton, Sasha Levin, Mel Gorman,
	Vlastimil Babka

On Mon, May 7, 2018 at 12:18 PM, Matthew Wilcox <willy@infradead.org> wrote:
> On Mon, May 07, 2018 at 11:57:10AM -0700, Dan Williams wrote:
>> I think adding yet one more mm-zone is the wrong direction. Instead,
>> what we have been considering is a mechanism to allow a device-dax
>> instance to be given back to the kernel as a distinct numa node
>> managed by the VM. It seems it's time to dust off those patches.
>
> I was wondering how "safe" we think that ability is.  NV-DIMM pages
> (obviously) differ from normal pages by their non-volatility.  Do we
> want their contents from the previous boot to be observable?  If not,
> then we need the BIOS to clear them at boot-up, which means we would
> want no kernel changes at all; rather the BIOS should just describe
> those pages as if they were DRAM (after zeroing them).

Certainly the BIOS could do it, but the impetus for having a kernel
mechanism to do the same is to support the configuration flexibility
afforded by namespaces, or otherwise to have the capability when the
BIOS does not offer it. However, you are right that there are extra
security implications when System-RAM is persisted; perhaps requiring
the capacity to be explicitly locked / unlocked could address that
concern?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: [External]  Re: [RFC PATCH v1 0/6] use mm to manage NVDIMM (pmem) zone
  2018-05-07 18:46 ` [RFC PATCH v1 0/6] use mm to manage NVDIMM (pmem) zone Matthew Wilcox
  2018-05-07 18:57   ` Dan Williams
@ 2018-05-08  0:54   ` Huaisheng HS1 Ye
  1 sibling, 0 replies; 21+ messages in thread
From: Huaisheng HS1 Ye @ 2018-05-08  0:54 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: akpm, linux-mm, mhocko, vbabka, mgorman, pasha.tatashin,
	alexander.levin, hannes, penguin-kernel, colyli, NingTing Cheng,
	linux-kernel, linux-nvdimm


> 
> On Mon, May 07, 2018 at 10:50:21PM +0800, Huaisheng Ye wrote:
> > Traditionally, NVDIMMs are treated by mm(memory management) subsystem as
> > DEVICE zone, which is a virtual zone and both its start and end of pfn
> > are equal to 0, mm wouldn’t manage NVDIMM directly as DRAM, kernel uses
> > corresponding drivers, which locate at \drivers\nvdimm\ and
> > \drivers\acpi\nfit and fs, to realize NVDIMM memory alloc and free with
> > memory hot plug implementation.
> 
> You probably want to let linux-nvdimm know about this patch set.
> Adding to the cc.  Also, I only received patch 0 and 4.  What happened
> to 1-3,5 and 6?

Sorry, something may have gone wrong with my git send-email, but my mailbox has received all of them.
Anyway, I will send them again and Cc linux-nvdimm.

Thanks
Huaisheng

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: [External] Re: [RFC PATCH v1 0/6] use mm to manage NVDIMM (pmem) zone
  2018-05-07 19:08     ` Jeff Moyer
  2018-05-07 19:17       ` Dan Williams
@ 2018-05-08  2:59       ` Huaisheng HS1 Ye
  2018-05-08  3:09         ` Matthew Wilcox
  2018-05-08  3:52         ` Dan Williams
  1 sibling, 2 replies; 21+ messages in thread
From: Huaisheng HS1 Ye @ 2018-05-08  2:59 UTC (permalink / raw)
  To: Jeff Moyer, Dan Williams
  Cc: Sasha Levin, Michal Hocko, linux-nvdimm, Tetsuo Handa,
	NingTing Cheng, Linux Kernel Mailing List, Matthew Wilcox,
	pasha.tatashin, Linux MM, Dave Hansen, Johannes Weiner,
	Andrew Morton, colyli, Mel Gorman, Vlastimil Babka

>
>Dan Williams <dan.j.williams@intel.com> writes:
>
>> On Mon, May 7, 2018 at 11:46 AM, Matthew Wilcox <willy@infradead.org> wrote:
>>> On Mon, May 07, 2018 at 10:50:21PM +0800, Huaisheng Ye wrote:
>>>> Traditionally, NVDIMMs are treated by mm(memory management) subsystem as
>>>> DEVICE zone, which is a virtual zone and both its start and end of pfn
>>>> are equal to 0, mm wouldn’t manage NVDIMM directly as DRAM, kernel uses
>>>> corresponding drivers, which locate at \drivers\nvdimm\ and
>>>> \drivers\acpi\nfit and fs, to realize NVDIMM memory alloc and free with
>>>> memory hot plug implementation.
>>>
>>> You probably want to let linux-nvdimm know about this patch set.
>>> Adding to the cc.
>>
>> Yes, thanks for that!
>>
>>> Also, I only received patch 0 and 4.  What happened
>>> to 1-3,5 and 6?
>>>
>>>> With current kernel, many mm’s classical features like the buddy
>>>> system, swap mechanism and page cache couldn’t be supported to NVDIMM.
>>>> What we are doing is to expand kernel mm’s capacity to make it to handle
>>>> NVDIMM like DRAM. Furthermore we make mm could treat DRAM and NVDIMM
>>>> separately, that means mm can only put the critical pages to NVDIMM
>
>Please define "critical pages."
>
>>>> zone, here we created a new zone type as NVM zone. That is to say for
>>>> traditional(or normal) pages which would be stored at DRAM scope like
>>>> Normal, DMA32 and DMA zones. But for the critical pages, which we hope
>>>> them could be recovered from power fail or system crash, we make them
>>>> to be persistent by storing them to NVM zone.
>
>[...]
>
>> I think adding yet one more mm-zone is the wrong direction. Instead,
>> what we have been considering is a mechanism to allow a device-dax
>> instance to be given back to the kernel as a distinct numa node
>> managed by the VM. It seems it's time to dust off those patches.
>
>What's the use case?  The above patch description seems to indicate an
>intent to recover contents after a power loss.  Without seeing the whole
>series, I'm not sure how that's accomplished in a safe or meaningful
>way.
>
>Huaisheng, could you provide a bit more background?
>

Currently, the ideal use scenario in our mind is to put all page caches into
zone_nvm. Without any doubt, the page cache is an efficient and common cache
implementation, but it has the disadvantage that any dirty data it holds is at
risk of being lost on a power failure or system crash. If we put all page
caches on NVDIMMs, all of that dirty data will be safe.

Most importantly, the page cache is different from dm-cache or bcache: the
page cache lives in mm, so it performs much better than write caches that sit
at the storage level.

At present we have brought up the NVM zone on a two-socket (NUMA) product
based on the Lenovo Purley platform, and we can extend the NVM flag into the
page cache allocation interface so that all of the system's page cache is
stored safely on NVDIMM.
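
A stand-alone model of that "extend the NVM flag into the page cache allocation interface" step might look like the following; AS_NVM, __GFP_NVM and the hook itself are hypothetical stand-ins, not code from this series.

/*
 * Illustrative model of routing page-cache allocations to a persistent
 * zone.  AS_NVM, __GFP_NVM and the structure below are hypothetical;
 * this is plain C, not kernel code.
 */
#include <stdio.h>

#define __GFP_NVM  0x04u                 /* proposed allocation modifier */
#define AS_NVM     0x01u                 /* proposed address_space flag  */

struct address_space_model {
	unsigned int flags;
	unsigned int gfp_mask;
};

/* Model of __page_cache_alloc(): OR in __GFP_NVM for flagged mappings. */
static unsigned int page_cache_gfp(const struct address_space_model *m)
{
	unsigned int gfp = m->gfp_mask;

	if (m->flags & AS_NVM)
		gfp |= __GFP_NVM;
	return gfp;
}

int main(void)
{
	struct address_space_model ordinary = { .flags = 0,      .gfp_mask = 0 };
	struct address_space_model critical = { .flags = AS_NVM, .gfp_mask = 0 };

	printf("ordinary mapping gfp: 0x%x\n", page_cache_gfp(&ordinary));
	printf("critical mapping gfp: 0x%x (falls into the NVM zone)\n",
	       page_cache_gfp(&critical));
	return 0;
}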

Now we are focusing on how to recover the page cache data after power-on. That
way the dirty pages would be safe, and a lot of the cache warm-up cost would be
saved, because many pages would already be stored in ZONE_NVM before the power
failure.

Thanks,
Huaisheng Ye


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [External]  Re: [RFC PATCH v1 0/6] use mm to manage NVDIMM (pmem) zone
  2018-05-08  2:59       ` [External] " Huaisheng HS1 Ye
@ 2018-05-08  3:09         ` Matthew Wilcox
  2018-05-09  4:47           ` Huaisheng HS1 Ye
  2018-05-08  3:52         ` Dan Williams
  1 sibling, 1 reply; 21+ messages in thread
From: Matthew Wilcox @ 2018-05-08  3:09 UTC (permalink / raw)
  To: Huaisheng HS1 Ye
  Cc: Sasha Levin, Michal Hocko, linux-nvdimm, Tetsuo Handa,
	NingTing Cheng, Linux Kernel Mailing List, pasha.tatashin,
	Linux MM, Dave Hansen, Johannes Weiner, colyli, Mel Gorman,
	Andrew Morton, Vlastimil Babka

On Tue, May 08, 2018 at 02:59:40AM +0000, Huaisheng HS1 Ye wrote:
> Currently in our mind, an ideal use scenario is that, we put all page caches to
> zone_nvm, without any doubt, page cache is an efficient and common cache
> implement, but it has a disadvantage that all dirty data within it would has risk
> to be missed by power failure or system crash. If we put all page caches to NVDIMMs,
> all dirty data will be safe. 

That's a common misconception.  Some dirty data will still be in the
CPU caches.  Are you planning on building servers which have enough
capacitance to allow the CPU to flush all dirty data from LLC to NV-DIMM?
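
This is why persistent-memory code flushes CPU caches explicitly instead of assuming NVDIMM-backed pages are durable. A minimal userspace illustration of the flush step is below; it uses clflush because that is available on all x86-64 CPUs, while real pmem code prefers clwb/clflushopt, and the buffer here is ordinary DRAM, so it only demonstrates the instruction sequence, not actual persistence.

/*
 * Illustration of explicit cacheline write-back, the step that is missing
 * if dirty page-cache data is assumed "safe" merely because the backing
 * pages live in an NVDIMM.
 */
#include <stdint.h>
#include <string.h>
#include <stdio.h>
#include <immintrin.h>   /* _mm_clflush, _mm_sfence */

#define CACHELINE 64

static void flush_range(const void *addr, size_t len)
{
	uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1);
	uintptr_t end = (uintptr_t)addr + len;

	for (; p < end; p += CACHELINE)
		_mm_clflush((const void *)p);
	/* Not strictly required with clflush, but clwb/clflushopt (which
	 * production pmem code uses) must be ordered with an sfence. */
	_mm_sfence();
}

int main(void)
{
	static char page[4096];              /* stand-in for one 4 KiB page */

	memset(page, 0xab, sizeof(page));    /* dirty all 64 cachelines */
	flush_range(page, sizeof(page));     /* without this, data may sit in the LLC */

	printf("flushed %zu bytes in %zu cachelines\n",
	       sizeof(page), sizeof(page) / CACHELINE);
	return 0;
}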

Then there's the problem of reconnecting the page cache (which is
pointed to by ephemeral data structures like inodes and dentries) to
the new inodes.

And then you have to convince customers that what you're doing is safe
enough for them to trust it ;-)


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [External] Re: [RFC PATCH v1 0/6] use mm to manage NVDIMM (pmem) zone
  2018-05-08  2:59       ` [External] " Huaisheng HS1 Ye
  2018-05-08  3:09         ` Matthew Wilcox
@ 2018-05-08  3:52         ` Dan Williams
  1 sibling, 0 replies; 21+ messages in thread
From: Dan Williams @ 2018-05-08  3:52 UTC (permalink / raw)
  To: Huaisheng HS1 Ye
  Cc: Sasha Levin, Michal Hocko, linux-nvdimm, Tetsuo Handa,
	NingTing Cheng, Linux Kernel Mailing List, Matthew Wilcox,
	pasha.tatashin, Linux MM, Dave Hansen, Mikulas Patocka,
	Johannes Weiner, Andrew Morton, colyli, Mel Gorman,
	Vlastimil Babka

On Mon, May 7, 2018 at 7:59 PM, Huaisheng HS1 Ye <yehs1@lenovo.com> wrote:
>>
>>Dan Williams <dan.j.williams@intel.com> writes:
>>
>>> On Mon, May 7, 2018 at 11:46 AM, Matthew Wilcox <willy@infradead.org> wrote:
>>>> On Mon, May 07, 2018 at 10:50:21PM +0800, Huaisheng Ye wrote:
>>>>> Traditionally, NVDIMMs are treated by mm(memory management) subsystem as
>>>>> DEVICE zone, which is a virtual zone and both its start and end of pfn
>>>>> are equal to 0, mm wouldn’t manage NVDIMM directly as DRAM, kernel uses
>>>>> corresponding drivers, which locate at \drivers\nvdimm\ and
>>>>> \drivers\acpi\nfit and fs, to realize NVDIMM memory alloc and free with
>>>>> memory hot plug implementation.
>>>>
>>>> You probably want to let linux-nvdimm know about this patch set.
>>>> Adding to the cc.
>>>
>>> Yes, thanks for that!
>>>
>>>> Also, I only received patch 0 and 4.  What happened
>>>> to 1-3,5 and 6?
>>>>
>>>>> With current kernel, many mm’s classical features like the buddy
>>>>> system, swap mechanism and page cache couldn’t be supported to NVDIMM.
>>>>> What we are doing is to expand kernel mm’s capacity to make it to handle
>>>>> NVDIMM like DRAM. Furthermore we make mm could treat DRAM and NVDIMM
>>>>> separately, that means mm can only put the critical pages to NVDIMM
>>
>>Please define "critical pages."
>>
>>>>> zone, here we created a new zone type as NVM zone. That is to say for
>>>>> traditional(or normal) pages which would be stored at DRAM scope like
>>>>> Normal, DMA32 and DMA zones. But for the critical pages, which we hope
>>>>> them could be recovered from power fail or system crash, we make them
>>>>> to be persistent by storing them to NVM zone.
>>
>>[...]
>>
>>> I think adding yet one more mm-zone is the wrong direction. Instead,
>>> what we have been considering is a mechanism to allow a device-dax
>>> instance to be given back to the kernel as a distinct numa node
>>> managed by the VM. It seems it's time to dust off those patches.
>>
>>What's the use case?  The above patch description seems to indicate an
>>intent to recover contents after a power loss.  Without seeing the whole
>>series, I'm not sure how that's accomplished in a safe or meaningful
>>way.
>>
>>Huaisheng, could you provide a bit more background?
>>
>
> Currently in our mind, an ideal use scenario is that, we put all page caches to
> zone_nvm, without any doubt, page cache is an efficient and common cache
> implement, but it has a disadvantage that all dirty data within it would has risk
> to be missed by power failure or system crash. If we put all page caches to NVDIMMs,
> all dirty data will be safe.
>
> And the most important is that, Page cache is different from dm-cache or B-cache.
> Page cache exists at mm. So, it has much more performance than other Write
> caches, which locate at storage level.

Can you be more specific? I think the only fundamental performance
difference between page cache and a block caching driver is that page
cache pages can be DMA'ed directly to lower level storage. However, I
believe that problem is solvable, i.e. we can teach dm-cache to
perform the equivalent of in-kernel direct-I/O when transferring data
between the cache and the backing storage when the cache is comprised
of persistent memory.

>
> At present we have realized NVM zone to be supported by two sockets(NUMA)
> product based on Lenovo Purley platform, and we can expand NVM flag into
> Page Cache allocation interface, so all Page Caches of system had been stored
> to NVDIMM safely.
>
> Now we are focusing how to recover data from Page cache after power on. That is,
> The dirty pages could be safe and the time cost of cache training would be saved a lot.
> Because many pages have already stored to ZONE_NVM before power failture.

I don't see how ZONE_NVM fits into a persistent page cache solution.
All of the mm structures to maintain the page cache are built to be
volatile. Once you build the infrastructure to persist and restore the
state of the page cache it is no longer the traditional page cache.
I.e. it will become something much closer to dm-cache or a filesystem.

One nascent idea from Dave Chinner is to teach xfs how to be a block
server for an upper level filesystem. His aim is sub-volume and
snapshot support, but I wonder if caching could be adapted into that
model?

In any event I think persisting and restoring cache state needs to be
designed before deciding if changes to the mm are needed.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: [External] Re: [RFC PATCH v1 0/6] use mm to manage NVDIMM (pmem) zone
  2018-05-08  3:09         ` Matthew Wilcox
@ 2018-05-09  4:47           ` Huaisheng HS1 Ye
  2018-05-10 16:27             ` Matthew Wilcox
  0 siblings, 1 reply; 21+ messages in thread
From: Huaisheng HS1 Ye @ 2018-05-09  4:47 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Sasha Levin, Michal Hocko, linux-nvdimm, Tetsuo Handa,
	NingTing Cheng, Linux Kernel Mailing List, pasha.tatashin,
	Linux MM, Dave Hansen, Johannes Weiner, colyli, Mel Gorman,
	Andrew Morton, Vlastimil Babka


> 
> On Tue, May 08, 2018 at 02:59:40AM +0000, Huaisheng HS1 Ye wrote:
> > Currently in our mind, an ideal use scenario is that, we put all page caches to
> > zone_nvm, without any doubt, page cache is an efficient and common cache
> > implement, but it has a disadvantage that all dirty data within it would has risk
> > to be missed by power failure or system crash. If we put all page caches to NVDIMMs,
> > all dirty data will be safe.
> 
> That's a common misconception.  Some dirty data will still be in the
> CPU caches.  Are you planning on building servers which have enough
> capacitance to allow the CPU to flush all dirty data from LLC to NV-DIMM?
> 
Sorry for not being clear.
For the CPU caches, if there is a power failure, the NVDIMM's ADR guarantees
that an interrupt is reported to the CPU, and an interrupt handler should be
responsible for flushing all dirty data to the NVDIMM.
If there is a system crash, the CPU may not get the chance to run that handler.

It is hard to make everything safe; what we can do is save the dirty data that
has already reached the page cache, though not the data still in the CPU caches.
Is this an improvement over the current situation?

> Then there's the problem of reconnecting the page cache (which is
> pointed to by ephemeral data structures like inodes and dentries) to
> the new inodes.
Yes, it is not easy.

> 
> And then you have to convince customers that what you're doing is safe
> enough for them to trust it ;-)
Sure. 😊

Sincerely,
Huaisheng Ye

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [External]  Re: [RFC PATCH v1 0/6] use mm to manage NVDIMM (pmem) zone
  2018-05-09  4:47           ` Huaisheng HS1 Ye
@ 2018-05-10 16:27             ` Matthew Wilcox
  2018-05-15 16:07               ` Huaisheng HS1 Ye
  0 siblings, 1 reply; 21+ messages in thread
From: Matthew Wilcox @ 2018-05-10 16:27 UTC (permalink / raw)
  To: Huaisheng HS1 Ye
  Cc: Sasha Levin, Michal Hocko, linux-nvdimm, Tetsuo Handa,
	NingTing Cheng, Linux Kernel Mailing List, pasha.tatashin,
	Linux MM, Dave Hansen, Johannes Weiner, colyli, Mel Gorman,
	Andrew Morton, Vlastimil Babka

On Wed, May 09, 2018 at 04:47:54AM +0000, Huaisheng HS1 Ye wrote:
> > On Tue, May 08, 2018 at 02:59:40AM +0000, Huaisheng HS1 Ye wrote:
> > > Currently in our mind, an ideal use scenario is that, we put all page caches to
> > > zone_nvm, without any doubt, page cache is an efficient and common cache
> > > implement, but it has a disadvantage that all dirty data within it would has risk
> > > to be missed by power failure or system crash. If we put all page caches to NVDIMMs,
> > > all dirty data will be safe.
> > 
> > That's a common misconception.  Some dirty data will still be in the
> > CPU caches.  Are you planning on building servers which have enough
> > capacitance to allow the CPU to flush all dirty data from LLC to NV-DIMM?
> > 
> Sorry for not being clear.
> For CPU caches if there is a power failure, NVDIMM has ADR to guarantee an interrupt will be reported to CPU, an interrupt response function should be responsible to flush all dirty data to NVDIMM.
> If there is a system crush, perhaps CPU couldn't have chance to execute this response.
> 
> It is hard to make sure everything is safe, what we can do is just to save the dirty data which is already stored to Pagecache, but not in CPU cache.
> Is this an improvement than current?

No.  In the current situation, the user knows that either the entire
page was written back from the pagecache or none of it was (at least
with a journalling filesystem).  With your proposal, we may have pages
splintered along cacheline boundaries, with a mix of old and new data.
This is completely unacceptable to most customers.

> > Then there's the problem of reconnecting the page cache (which is
> > pointed to by ephemeral data structures like inodes and dentries) to
> > the new inodes.
> Yes, it is not easy.

Right ... and until we have that ability, there's no point in this patch.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: [External] Re: [RFC PATCH v1 0/6] use mm to manage NVDIMM (pmem) zone
  2018-05-10 16:27             ` Matthew Wilcox
@ 2018-05-15 16:07               ` Huaisheng HS1 Ye
  2018-05-15 16:20                 ` Matthew Wilcox
  0 siblings, 1 reply; 21+ messages in thread
From: Huaisheng HS1 Ye @ 2018-05-15 16:07 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Sasha Levin, Michal Hocko, linux-nvdimm, Tetsuo Handa,
	NingTing Cheng, Ocean HY1 He, Linux Kernel Mailing List,
	pasha.tatashin, Linux MM, Dave Hansen, Johannes Weiner, colyli,
	Mel Gorman, Andrew Morton, Vlastimil Babka




> From: owner-linux-mm@kvack.org [mailto:owner-linux-mm@kvack.org] On Behalf Of Matthew
> Wilcox
> Sent: Friday, May 11, 2018 12:28 AM
> On Wed, May 09, 2018 at 04:47:54AM +0000, Huaisheng HS1 Ye wrote:
> > > On Tue, May 08, 2018 at 02:59:40AM +0000, Huaisheng HS1 Ye wrote:
> > > > Currently in our mind, an ideal use scenario is that, we put all page caches to
> > > > zone_nvm, without any doubt, page cache is an efficient and common cache
> > > > implement, but it has a disadvantage that all dirty data within it would has risk
> > > > to be missed by power failure or system crash. If we put all page caches to NVDIMMs,
> > > > all dirty data will be safe.
> > >
> > > That's a common misconception.  Some dirty data will still be in the
> > > CPU caches.  Are you planning on building servers which have enough
> > > capacitance to allow the CPU to flush all dirty data from LLC to NV-DIMM?
> > >
> > Sorry for not being clear.
> > For CPU caches if there is a power failure, NVDIMM has ADR to guarantee an interrupt
> > will be reported to CPU, an interrupt response function should be responsible to flush
> > all dirty data to NVDIMM.
> > If there is a system crush, perhaps CPU couldn't have chance to execute this response.
> >
> > It is hard to make sure everything is safe, what we can do is just to save the dirty
> > data which is already stored to Pagecache, but not in CPU cache.
> > Is this an improvement than current?
> 
> No.  In the current situation, the user knows that either the entire
> page was written back from the pagecache or none of it was (at least
> with a journalling filesystem).  With your proposal, we may have pages
> splintered along cacheline boundaries, with a mix of old and new data.
> This is completely unacceptable to most customers.

Dear Matthew,

Thanks for your great help; I really hadn't considered this case.
I want to make it a little clearer to myself, so please correct me if anything is wrong.

Is it the case that this mix of old and new data in one page can only happen when the CPU fails to flush all dirty data from the LLC to the NVDIMM?
But if an interrupt can be reported to the CPU, and the CPU successfully flushes all dirty cache lines to the NVDIMM in the interrupt handler, this mix of old and new data can be avoided.

Current x86_64 CPUs use an N-way set-associative cache, and every cache line is 64 bytes.
So a 4096-byte page is split across 64 (4096/64) cache lines. Is that right?


> > > Then there's the problem of reconnecting the page cache (which is
> > > pointed to by ephemeral data structures like inodes and dentries) to
> > > the new inodes.
> > Yes, it is not easy.
> 
> Right ... and until we have that ability, there's no point in this patch.
We are focusing on realizing this ability.

Sincerely,
Huaisheng Ye



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [External]  Re: [RFC PATCH v1 0/6] use mm to manage NVDIMM (pmem) zone
  2018-05-15 16:07               ` Huaisheng HS1 Ye
@ 2018-05-15 16:20                 ` Matthew Wilcox
  2018-05-16  2:05                   ` Huaisheng HS1 Ye
  0 siblings, 1 reply; 21+ messages in thread
From: Matthew Wilcox @ 2018-05-15 16:20 UTC (permalink / raw)
  To: Huaisheng HS1 Ye
  Cc: Sasha Levin, Michal Hocko, linux-nvdimm, Tetsuo Handa,
	NingTing Cheng, Ocean HY1 He, Linux Kernel Mailing List,
	pasha.tatashin, Linux MM, Dave Hansen, Johannes Weiner, colyli,
	Mel Gorman, Andrew Morton, Vlastimil Babka

On Tue, May 15, 2018 at 04:07:28PM +0000, Huaisheng HS1 Ye wrote:
> > From: owner-linux-mm@kvack.org [mailto:owner-linux-mm@kvack.org] On Behalf Of Matthew
> > Wilcox
> > No.  In the current situation, the user knows that either the entire
> > page was written back from the pagecache or none of it was (at least
> > with a journalling filesystem).  With your proposal, we may have pages
> > splintered along cacheline boundaries, with a mix of old and new data.
> > This is completely unacceptable to most customers.
> 
> Dear Matthew,
> 
> Thanks for your great help, I really didn't consider this case.
> I want to make it a little bit clearer to me. So, correct me if anything wrong.
> 
> Is that to say this mix of old and new data in one page, which only has chance to happen when CPU failed to flush all dirty data from LLC to NVDIMM?
> But if an interrupt can be reported to CPU, and CPU successfully flush all dirty data from cache lines to NVDIMM within interrupt response function, this mix of old and new data can be avoided.

If you can keep the CPU and the memory (and all the busses between them)
alive for long enough after the power signal has been tripped, yes.
Talk to your hardware designers about what it will take to achieve this
:-) Be sure to ask about the number of retries which may be necessary
on the CPU interconnect to flush all data to an NV-DIMM attached to a
remote CPU.

> Current X86_64 uses N-way set associative cache, and every cache line has 64 bytes.
> For 4096 bytes page, one page shall be splintered to 64 (4096/64) lines. Is it right?

That's correct.

> > > > Then there's the problem of reconnecting the page cache (which is
> > > > pointed to by ephemeral data structures like inodes and dentries) to
> > > > the new inodes.
> > > Yes, it is not easy.
> > 
> > Right ... and until we have that ability, there's no point in this patch.
> We are focusing to realize this ability.

But is it the right approach?  So far we have (I think) two parallel
activities.  The first is for local storage, using DAX to store files
directly on the pmem.  The second is a physical block cache for network
filesystems (both NAS and SAN).  You seem to be wanting to supplant the
second effort, but I think it's much harder to reconnect the logical cache
(ie the page cache) than it is the physical cache (ie the block cache).


^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: [External] Re: [RFC PATCH v1 0/6] use mm to manage NVDIMM (pmem) zone
  2018-05-15 16:20                 ` Matthew Wilcox
@ 2018-05-16  2:05                   ` Huaisheng HS1 Ye
  2018-05-16  2:48                     ` Dan Williams
  2018-05-16  2:52                     ` Matthew Wilcox
  0 siblings, 2 replies; 21+ messages in thread
From: Huaisheng HS1 Ye @ 2018-05-16  2:05 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Sasha Levin, Michal Hocko, linux-nvdimm, Tetsuo Handa,
	NingTing Cheng, Ocean HY1 He, Linux Kernel Mailing List,
	pasha.tatashin, Linux MM, Dave Hansen, Johannes Weiner, colyli,
	Mel Gorman, Andrew Morton, Vlastimil Babka

> From: Matthew Wilcox [mailto:willy@infradead.org]
> Sent: Wednesday, May 16, 2018 12:20 AM
> > > > > Then there's the problem of reconnecting the page cache (which is
> > > > > pointed to by ephemeral data structures like inodes and dentries) to
> > > > > the new inodes.
> > > > Yes, it is not easy.
> > >
> > > Right ... and until we have that ability, there's no point in this patch.
> > We are focusing to realize this ability.
> 
> But is it the right approach?  So far we have (I think) two parallel
> activities.  The first is for local storage, using DAX to store files
> directly on the pmem.  The second is a physical block cache for network
> filesystems (both NAS and SAN).  You seem to be wanting to supplant the
> second effort, but I think it's much harder to reconnect the logical cache
> (ie the page cache) than it is the physical cache (ie the block cache).

Dear Matthew,

Thanks for correcting my understanding of the cache-line issue.
But I have a question about that. Assuming the NVDIMM works in pmem mode, even
if we use it as a physical block cache, like dm-cache, there is still a
potential risk from this cache-line issue, because NVDIMMs are byte-addressable
storage, right?
If a system crash happens, the CPU has no opportunity to flush all dirty data
from the cache lines to the NVDIMM while copying the data pointed to by
bio_vec.bv_page into the NVDIMM.
I know the btt is used to guarantee sector atomicity in block mode, but in pmem
mode a crash will likely leave a mix of new and old data in one page of the
NVDIMM.
Correct me if anything is wrong.

Another question: if we use NVDIMMs as a physical block cache for network
filesystems, does the industry have an existing implementation that bypasses
the page cache in a DAX-like way, that is, one that stores data to the NVDIMMs
directly from userspace rather than copying it from kernel-space memory to the
NVDIMMs?

BRs,
Huaisheng

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [External] Re: [RFC PATCH v1 0/6] use mm to manage NVDIMM (pmem) zone
  2018-05-16  2:05                   ` Huaisheng HS1 Ye
@ 2018-05-16  2:48                     ` Dan Williams
  2018-05-16  8:33                       ` Huaisheng HS1 Ye
  2018-05-16  2:52                     ` Matthew Wilcox
  1 sibling, 1 reply; 21+ messages in thread
From: Dan Williams @ 2018-05-16  2:48 UTC (permalink / raw)
  To: Huaisheng HS1 Ye
  Cc: Sasha Levin, Michal Hocko, linux-nvdimm, Tetsuo Handa,
	NingTing Cheng, Ocean HY1 He, Linux Kernel Mailing List,
	Matthew Wilcox, pasha.tatashin, Linux MM, Dave Hansen,
	Johannes Weiner, Andrew Morton, colyli, Mel Gorman,
	Vlastimil Babka

On Tue, May 15, 2018 at 7:05 PM, Huaisheng HS1 Ye <yehs1@lenovo.com> wrote:
>> From: Matthew Wilcox [mailto:willy@infradead.org]
>> Sent: Wednesday, May 16, 2018 12:20 AM
>> > > > > Then there's the problem of reconnecting the page cache (which is
>> > > > > pointed to by ephemeral data structures like inodes and dentries) to
>> > > > > the new inodes.
>> > > > Yes, it is not easy.
>> > >
>> > > Right ... and until we have that ability, there's no point in this patch.
>> > We are focusing to realize this ability.
>>
>> But is it the right approach?  So far we have (I think) two parallel
>> activities.  The first is for local storage, using DAX to store files
>> directly on the pmem.  The second is a physical block cache for network
>> filesystems (both NAS and SAN).  You seem to be wanting to supplant the
>> second effort, but I think it's much harder to reconnect the logical cache
>> (ie the page cache) than it is the physical cache (ie the block cache).
>
> Dear Matthew,
>
> Thanks for correcting my idea with cache line.
> But I have questions about that, assuming NVDIMM works with pmem mode, even we
> used it as physical block cache, like dm-cache, there is potential risk with
> this cache line issue, because NVDIMMs are bytes-address storage, right?

No, there is no risk if the cache is designed properly. The pmem
driver will not report that the I/O is complete until the entire
payload of the data write has made it to persistent memory. The cache
driver will not report that the write succeeded until the pmem driver
completes the I/O. There is no risk from losing power while the pmem
driver is operating, because the cache will recover to its last
acknowledged stable state, i.e. it will roll back / undo the
incomplete write.

> If system crash happens, that means CPU doesn't have opportunity to flush all dirty
> data from cache lines to NVDIMM, during copying data pointed by bio_vec.bv_page to
> NVDIMM.
> I know there is btt which is used to guarantee sector atomic with block mode,
> but for pmem mode that will likely cause mix of new and old data in one page
> of NVDIMM.
> Correct me if anything wrong.

dm-cache performs metadata management similar to the btt driver's to
ensure safe forward progress of the cache state across power loss or a
system crash.

> Another question, if we used NVDIMMs as physical block cache for network filesystems,
> Does industry have existing implementation to bypass Page Cache similarly like DAX way,
> that is to say, directly storing data to NVDIMMs from userspace, rather than copying
> data from kernel space memory to NVDIMMs.

Any caching solution with associated metadata requires coordination
with the kernel, so it is not possible for the kernel to stay
completely out of the way. Especially when we're talking about a cache
in front of the network there is not much room for DAX to offer
improved performance, because we need the kernel to take over all
write-persist operations to update the cache metadata.

So, I'm still struggling to see why dm-cache is not a suitable
solution for this case. It seems suitable if it is updated to allow
direct dma-access to the pmem cache pages from the backing device
storage / networking driver.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [External]  Re: [RFC PATCH v1 0/6] use mm to manage NVDIMM (pmem) zone
  2018-05-16  2:05                   ` Huaisheng HS1 Ye
  2018-05-16  2:48                     ` Dan Williams
@ 2018-05-16  2:52                     ` Matthew Wilcox
  2018-05-16  4:10                       ` Dan Williams
  1 sibling, 1 reply; 21+ messages in thread
From: Matthew Wilcox @ 2018-05-16  2:52 UTC (permalink / raw)
  To: Huaisheng HS1 Ye
  Cc: Sasha Levin, Michal Hocko, linux-nvdimm, Tetsuo Handa,
	NingTing Cheng, Ocean HY1 He, Linux Kernel Mailing List,
	pasha.tatashin, Linux MM, Dave Hansen, Johannes Weiner, colyli,
	Mel Gorman, Andrew Morton, Vlastimil Babka

On Wed, May 16, 2018 at 02:05:05AM +0000, Huaisheng HS1 Ye wrote:
> > From: Matthew Wilcox [mailto:willy@infradead.org]
> > Sent: Wednesday, May 16, 2018 12:20 AM
> > > > > > Then there's the problem of reconnecting the page cache (which is
> > > > > > pointed to by ephemeral data structures like inodes and dentries) to
> > > > > > the new inodes.
> > > > > Yes, it is not easy.
> > > >
> > > > Right ... and until we have that ability, there's no point in this patch.
> > > We are focusing to realize this ability.
> > 
> > But is it the right approach?  So far we have (I think) two parallel
> > activities.  The first is for local storage, using DAX to store files
> > directly on the pmem.  The second is a physical block cache for network
> > filesystems (both NAS and SAN).  You seem to be wanting to supplant the
> > second effort, but I think it's much harder to reconnect the logical cache
> > (ie the page cache) than it is the physical cache (ie the block cache).
> 
> Dear Matthew,
> 
> Thanks for correcting my idea about cache lines.
> But I have a question about that: assuming the NVDIMM works in pmem
> mode, even if we use it as a physical block cache, like dm-cache, there
> is still a potential risk from this cache-line issue, because NVDIMMs
> are byte-addressable storage, right?
> If a system crash happens while copying the data pointed to by
> bio_vec.bv_page to the NVDIMM, the CPU has no opportunity to flush all
> the dirty data from its cache lines to the NVDIMM.
> I know there is btt, which is used to guarantee sector atomicity in
> block mode, but in pmem mode a crash will likely leave a mix of new and
> old data in one page of the NVDIMM.
> Correct me if anything is wrong.

Right, we do have BTT.  I'm not sure how it's being used with the block
cache ... but the principle is the same; write the new data to a new
page and then update the metadata to point to the new page.
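
As a sketch of that principle (not the real btt on-media layout;
persist() is a hypothetical flush helper), think of a remap table in
pmem plus a pool of free blocks:

#include <stdint.h>
#include <string.h>

#define BLK_SIZE 4096
#define NLBA     1024                   /* arbitrary size for the sketch */

void persist(const void *addr, size_t len);  /* hypothetical flush helper */

struct remap {
    uint64_t map[NLBA];                 /* logical block -> physical block */
};

/*
 * Page/sector-atomic write: the old block stays mapped until the single
 * aligned 8-byte map update becomes durable, so a crash never exposes a
 * page that mixes old and new data.
 */
void atomic_blk_write(struct remap *rt, uint8_t (*blocks)[BLK_SIZE],
                      uint64_t lba, uint64_t free_blk, const void *buf)
{
    memcpy(blocks[free_blk], buf, BLK_SIZE);
    persist(blocks[free_blk], BLK_SIZE);

    rt->map[lba] = free_blk;            /* single 64-bit store */
    persist(&rt->map[lba], sizeof(rt->map[lba]));
    /* the block previously mapped at lba returns to the free pool */
}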

> Another question: if we use NVDIMMs as a physical block cache for
> network filesystems, does the industry have an existing implementation
> that bypasses the page cache, similar to the DAX approach, that is,
> storing data to the NVDIMMs directly from userspace rather than copying
> data from kernel-space memory to the NVDIMMs?

The important part about DAX is that the kernel gets entirely out of the
way and userspace takes care of handling flushing and synchronisation.
I'm not sure how that works with the block cache; for a network
filesystem, the filesystem needs to be in charge of deciding when and
how to write the buffered data back to the storage.
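
(For reference, the userspace side is roughly the sketch below;
/mnt/pmem is assumed to be an fsdax mount, and with a MAP_SYNC mapping
the msync() could be replaced by CPU cache-flush instructions, but
plain msync() keeps the example portable.)

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/pmem/data", O_CREAT | O_RDWR, 0644);
    if (fd < 0 || ftruncate(fd, 4096) < 0)
        return 1;

    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return 1;

    /* stores land in the pmem-backed page; no page cache copy exists */
    memcpy(p, "hello, pmem", 12);

    /* durability is userspace's problem: flush/commit the range */
    if (msync(p, 4096, MS_SYNC) < 0)
        return 1;

    munmap(p, 4096);
    close(fd);
    return 0;
}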

Dan, Vishal, perhaps you could jump in here; I'm not really sure where
this effort has got to.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [External] Re: [RFC PATCH v1 0/6] use mm to manage NVDIMM (pmem) zone
  2018-05-16  2:52                     ` Matthew Wilcox
@ 2018-05-16  4:10                       ` Dan Williams
  0 siblings, 0 replies; 21+ messages in thread
From: Dan Williams @ 2018-05-16  4:10 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Sasha Levin, Michal Hocko, Huaisheng HS1 Ye, linux-nvdimm,
	Tetsuo Handa, NingTing Cheng, Ocean HY1 He,
	Linux Kernel Mailing List, pasha.tatashin, Linux MM, Dave Hansen,
	Johannes Weiner, Andrew Morton, colyli, Mel Gorman,
	Vlastimil Babka

On Tue, May 15, 2018 at 7:52 PM, Matthew Wilcox <willy@infradead.org> wrote:
> On Wed, May 16, 2018 at 02:05:05AM +0000, Huaisheng HS1 Ye wrote:
>> > From: Matthew Wilcox [mailto:willy@infradead.org]
>> > Sent: Wednesday, May 16, 2018 12:20 AM>
>> > > > > > Then there's the problem of reconnecting the page cache (which is
>> > > > > > pointed to by ephemeral data structures like inodes and dentries) to
>> > > > > > the new inodes.
>> > > > > Yes, it is not easy.
>> > > >
>> > > > Right ... and until we have that ability, there's no point in this patch.
>> > > We are focusing on realizing this ability.
>> >
>> > But is it the right approach?  So far we have (I think) two parallel
>> > activities.  The first is for local storage, using DAX to store files
>> > directly on the pmem.  The second is a physical block cache for network
>> > filesystems (both NAS and SAN).  You seem to be wanting to supplant the
>> > second effort, but I think it's much harder to reconnect the logical cache
>> > (ie the page cache) than it is the physical cache (ie the block cache).
>>
>> Dear Matthew,
>>
>> Thanks for correcting my idea about cache lines.
>> But I have a question about that: assuming the NVDIMM works in pmem
>> mode, even if we use it as a physical block cache, like dm-cache, there
>> is still a potential risk from this cache-line issue, because NVDIMMs
>> are byte-addressable storage, right?
>> If a system crash happens while copying the data pointed to by
>> bio_vec.bv_page to the NVDIMM, the CPU has no opportunity to flush all
>> the dirty data from its cache lines to the NVDIMM.
>> I know there is btt, which is used to guarantee sector atomicity in
>> block mode, but in pmem mode a crash will likely leave a mix of new and
>> old data in one page of the NVDIMM.
>> Correct me if anything is wrong.
>
> Right, we do have BTT.  I'm not sure how it's being used with the block
> cache ... but the principle is the same; write the new data to a new
> page and then update the metadata to point to the new page.
>
>> Another question: if we use NVDIMMs as a physical block cache for
>> network filesystems, does the industry have an existing implementation
>> that bypasses the page cache, similar to the DAX approach, that is,
>> storing data to the NVDIMMs directly from userspace rather than copying
>> data from kernel-space memory to the NVDIMMs?
>
> The important part about DAX is that the kernel gets entirely out of the
> way and userspace takes care of handling flushing and synchronisation.
> I'm not sure how that works with the block cache; for a network
> filesystem, the filesystem needs to be in charge of deciding when and
> how to write the buffered data back to the storage.
>
> Dan, Vishal, perhaps you could jump in here; I'm not really sure where
> this effort has got to.

Which effort? I think we're saying that there is no such thing as a
DAX-capable block cache, and it is not clear that one makes sense.

We can certainly teach existing block caches some optimizations in the
presence of pmem, and perhaps that is sufficient.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: [External] Re: [RFC PATCH v1 0/6] use mm to manage NVDIMM (pmem) zone
  2018-05-16  2:48                     ` Dan Williams
@ 2018-05-16  8:33                       ` Huaisheng HS1 Ye
  0 siblings, 0 replies; 21+ messages in thread
From: Huaisheng HS1 Ye @ 2018-05-16  8:33 UTC (permalink / raw)
  To: Dan Williams
  Cc: Sasha Levin, Michal Hocko, linux-nvdimm, Tetsuo Handa,
	NingTing Cheng, Ocean HY1 He, Linux Kernel Mailing List,
	Matthew Wilcox, pasha.tatashin, Linux MM, Dave Hansen,
	Johannes Weiner, Andrew Morton, colyli, Mel Gorman,
	Vlastimil Babka

> From: Dan Williams [mailto:dan.j.williams@intel.com]
> Sent: Wednesday, May 16, 2018 10:49 AM
> On Tue, May 15, 2018 at 7:05 PM, Huaisheng HS1 Ye <yehs1@lenovo.com> wrote:
> >> From: Matthew Wilcox [mailto:willy@infradead.org]
> >> Sent: Wednesday, May 16, 2018 12:20 AM>
> >> > > > > Then there's the problem of reconnecting the page cache (which is
> >> > > > > pointed to by ephemeral data structures like inodes and dentries) to
> >> > > > > the new inodes.
> >> > > > Yes, it is not easy.
> >> > >
> >> > > Right ... and until we have that ability, there's no point in this patch.
> >> > We are focusing on realizing this ability.
> >>
> >> But is it the right approach?  So far we have (I think) two parallel
> >> activities.  The first is for local storage, using DAX to store files
> >> directly on the pmem.  The second is a physical block cache for network
> >> filesystems (both NAS and SAN).  You seem to be wanting to supplant the
> >> second effort, but I think it's much harder to reconnect the logical cache
> >> (ie the page cache) than it is the physical cache (ie the block cache).
> >
> > Dear Matthew,
> >
> > Thanks for correcting my idea about cache lines.
> > But I have a question about that: assuming the NVDIMM works in pmem
> > mode, even if we use it as a physical block cache, like dm-cache, there
> > is still a potential risk from this cache-line issue, because NVDIMMs
> > are byte-addressable storage, right?
> 
> No, there is no risk if the cache is designed properly. The pmem
> driver will not report that the I/O is complete until the entire
> payload of the data write has made it to persistent memory. The cache
> driver will not report that the write succeeded until the pmem driver
> completes the I/O. There is no risk from losing power while the pmem
> driver is operating, because the cache will recover to its last
> acknowledged stable state, i.e. it will roll back / undo the
> incomplete write.
> 
> > If a system crash happens while copying the data pointed to by
> > bio_vec.bv_page to the NVDIMM, the CPU has no opportunity to flush all
> > the dirty data from its cache lines to the NVDIMM.
> > I know there is btt, which is used to guarantee sector atomicity in
> > block mode, but in pmem mode a crash will likely leave a mix of new and
> > old data in one page of the NVDIMM.
> > Correct me if anything is wrong.
> 
> dm-cache performs metadata management similar to the btt driver's to
> ensure safe forward progress of the cache state relative to power loss
> or system crash.

Dear Dan,

Thanks for your introduction; I've learned a lot from your comments.
I suppose there should be implementations that protect both the data and the metadata in NVDIMMs from system crash or power loss.
Not only the data but also the metadata itself needs to be correct and consistent, so that the kernel has a chance to recover the data to the target device after rebooting, right?
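
For instance, I imagine the metadata being kept self-validating, say
two slots with a sequence number and a checksum, so that after reboot
the newest slot that verifies is trusted; a rough sketch (csum64() is
only a placeholder, not any driver's real format):

#include <stddef.h>
#include <stdint.h>

/* Placeholder checksum; a real format would use CRC32C or similar. */
uint64_t csum64(const void *buf, size_t len);

struct meta_slot {
    uint64_t seq;       /* bumped on every metadata commit    */
    uint64_t map_root;  /* e.g. root of the mapping structure */
    uint64_t csum;      /* covers the fields above            */
};

/*
 * After reboot: trust the newest slot whose checksum verifies.  If the
 * crash interrupted an update, that slot fails the check and we fall
 * back to the older, still-consistent one.
 */
struct meta_slot *recover(struct meta_slot s[2])
{
    struct meta_slot *best = NULL;

    for (int i = 0; i < 2; i++) {
        uint64_t c = csum64(&s[i], offsetof(struct meta_slot, csum));

        if (c == s[i].csum && (!best || s[i].seq > best->seq))
            best = &s[i];
    }
    return best;    /* NULL: no consistent metadata at all */
}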

> 
> > Another question: if we use NVDIMMs as a physical block cache for
> > network filesystems, does the industry have an existing implementation
> > that bypasses the page cache, similar to the DAX approach, that is,
> > storing data to the NVDIMMs directly from userspace rather than copying
> > data from kernel-space memory to the NVDIMMs?
> 
> Any caching solution with associated metadata requires coordination
> with the kernel, so it is not possible for the kernel to stay
> completely out of the way. Especially when we're talking about a cache
> in front of the network, there is not much room for DAX to offer
> improved performance, because we need the kernel to take over all
> write-persist operations to update the cache metadata.

Agree.

> So, I'm still struggling to see why dm-cache is not a suitable
> solution for this case. It seems suitable if it is updated to allow
> direct DMA access to the pmem cache pages from the backing-device
> storage / networking driver.

Sincerely,
Huaisheng Ye

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2018-05-16  8:33 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <1525704627-30114-1-git-send-email-yehs1@lenovo.com>
2018-05-07 18:46 ` [RFC PATCH v1 0/6] use mm to manage NVDIMM (pmem) zone Matthew Wilcox
2018-05-07 18:57   ` Dan Williams
2018-05-07 19:08     ` Jeff Moyer
2018-05-07 19:17       ` Dan Williams
2018-05-07 19:28         ` Jeff Moyer
2018-05-07 19:29           ` Dan Williams
2018-05-08  2:59       ` [External] " Huaisheng HS1 Ye
2018-05-08  3:09         ` Matthew Wilcox
2018-05-09  4:47           ` Huaisheng HS1 Ye
2018-05-10 16:27             ` Matthew Wilcox
2018-05-15 16:07               ` Huaisheng HS1 Ye
2018-05-15 16:20                 ` Matthew Wilcox
2018-05-16  2:05                   ` Huaisheng HS1 Ye
2018-05-16  2:48                     ` Dan Williams
2018-05-16  8:33                       ` Huaisheng HS1 Ye
2018-05-16  2:52                     ` Matthew Wilcox
2018-05-16  4:10                       ` Dan Williams
2018-05-08  3:52         ` Dan Williams
2018-05-07 19:18     ` Matthew Wilcox
2018-05-07 19:30       ` Dan Williams
2018-05-08  0:54   ` [External] " Huaisheng HS1 Ye
