All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC 1/1] memremap: devm_memremap_pages has wrong nid
@ 2015-11-11 21:16 Boaz Harrosh
  2015-11-11 21:46 ` Dan Williams
  0 siblings, 1 reply; 11+ messages in thread
From: Boaz Harrosh @ 2015-11-11 21:16 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-fsdevel, linux-nvdimm


The pmem dev as received in devm_memremap_pages() does not have
its dev->numa_node properly initialized yet.

What I see is that all pmem devices will call arch_add_memory
with nid==0 regardless of the real nid the memory is actually
on. Needless to say that after that all the NUMA page and zone
members (at page->flags) come out wrong.

If I do the below code it will all work properly.

RFC because probably we want to fix dev->numa_node before the
call to devm_memremap_pages.

This is on top of v4.3. I will please need the final fix for Stable@

Thanks
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 kernel/memremap.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/kernel/memremap.c b/kernel/memremap.c
index 9d6b555..9409e23 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -183,9 +183,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res)
 
 	memcpy(&page_map->res, res, sizeof(*res));
 
-	nid = dev_to_node(dev);
-	if (nid < 0)
-		nid = 0;
+	nid = memory_add_physaddr_to_nid(res->start);
 
 	error = arch_add_memory(nid, res->start, resource_size(res), true);
 	if (error) {
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [RFC 1/1] memremap: devm_memremap_pages has wrong nid
  2015-11-11 21:16 [RFC 1/1] memremap: devm_memremap_pages has wrong nid Boaz Harrosh
@ 2015-11-11 21:46 ` Dan Williams
  2015-11-11 22:05   ` Boaz Harrosh
  2015-11-13 16:00   ` [RFC 1/1] memremap: devm_memremap_pages has wrong nid Toshi Kani
  0 siblings, 2 replies; 11+ messages in thread
From: Dan Williams @ 2015-11-11 21:46 UTC (permalink / raw)
  To: Boaz Harrosh; +Cc: linux-fsdevel, linux-nvdimm

On Wed, Nov 11, 2015 at 1:16 PM, Boaz Harrosh <boaz@plexistor.com> wrote:
>
> The pmem dev as received in devm_memremap_pages() does not have
> its dev->numa_node properly initialized yet.
>
> What I see is that all pmem devices will call arch_add_memory
> with nid==0 regardless of the real nid the memory is actually
> on. Needless to say that after that all the NUMA page and zone
> members (at page->flags) come out wrong.
>
> If I do the below code it will all work properly.
>
> RFC because probably we want to fix dev->numa_node before the
> call to devm_memremap_pages.

Let's just do that instead.  I.e. in the case of NFIT numa node should
already be set, and in the case of the memmap=ss!nn or e820-type-12 we
can set the numa node like this:

diff --git a/drivers/nvdimm/e820.c b/drivers/nvdimm/e820.c
index 8282db2ef99e..e40df8fedf73 100644
--- a/drivers/nvdimm/e820.c
+++ b/drivers/nvdimm/e820.c
@@ -48,7 +48,7 @@ static int e820_pmem_probe(struct platform_device *pdev)
               memset(&ndr_desc, 0, sizeof(ndr_desc));
               ndr_desc.res = p;
               ndr_desc.attr_groups = e820_pmem_region_attribute_groups;
-               ndr_desc.numa_node = NUMA_NO_NODE;
+               ndr_desc.numa_node = memory_add_physaddr_to_nid(p->start);
               set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
               if (!nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc))
                       goto err;

Thanks for pointing out memory_add_physaddr_to_nid(). I did not know
that existed.

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [RFC 1/1] memremap: devm_memremap_pages has wrong nid
  2015-11-11 21:46 ` Dan Williams
@ 2015-11-11 22:05   ` Boaz Harrosh
  2015-11-12 13:10     ` [PATCH] nvdimm: proper NID in e820_pmem_probe Boaz Harrosh
  2015-11-13 16:00   ` [RFC 1/1] memremap: devm_memremap_pages has wrong nid Toshi Kani
  1 sibling, 1 reply; 11+ messages in thread
From: Boaz Harrosh @ 2015-11-11 22:05 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-fsdevel, linux-nvdimm

On 11/11/2015 11:46 PM, Dan Williams wrote:
> On Wed, Nov 11, 2015 at 1:16 PM, Boaz Harrosh <boaz@plexistor.com> wrote:
>>
>> The pmem dev as received in devm_memremap_pages() does not have
>> its dev->numa_node properly initialized yet.
>>
>> What I see is that all pmem devices will call arch_add_memory
>> with nid==0 regardless of the real nid the memory is actually
>> on. Needless to say that after that all the NUMA page and zone
>> members (at page->flags) come out wrong.
>>
>> If I do the below code it will all work properly.
>>
>> RFC because probably we want to fix dev->numa_node before the
>> call to devm_memremap_pages.
> 
> Let's just do that instead.  I.e. in the case of NFIT numa node should
> already be set, and in the case of the memmap=ss!nn or e820-type-12 we
> can set the numa node like this:
> 
> diff --git a/drivers/nvdimm/e820.c b/drivers/nvdimm/e820.c
> index 8282db2ef99e..e40df8fedf73 100644
> --- a/drivers/nvdimm/e820.c
> +++ b/drivers/nvdimm/e820.c
> @@ -48,7 +48,7 @@ static int e820_pmem_probe(struct platform_device *pdev)
>                memset(&ndr_desc, 0, sizeof(ndr_desc));
>                ndr_desc.res = p;
>                ndr_desc.attr_groups = e820_pmem_region_attribute_groups;
> -               ndr_desc.numa_node = NUMA_NO_NODE;
> +               ndr_desc.numa_node = memory_add_physaddr_to_nid(p->start);
>                set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
>                if (!nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc))
>                        goto err;
> 
> Thanks for pointing out memory_add_physaddr_to_nid(). I did not know
> that existed.
> 

Thanks.

Its kind of late here now, please let me test this tomorrow and come back
to you, but it sounds good, thanks

Boaz


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH] nvdimm: proper NID in e820_pmem_probe
  2015-11-11 22:05   ` Boaz Harrosh
@ 2015-11-12 13:10     ` Boaz Harrosh
  2015-11-12 13:58       ` Boaz Harrosh
  0 siblings, 1 reply; 11+ messages in thread
From: Boaz Harrosh @ 2015-11-12 13:10 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-fsdevel, linux-nvdimm, Stable Tree

From: Dan Williams <dan.j.williams@intel.com>

[Boaz]
What I see is that in the call to arch_add_memory() nid==0 regardless of the
real NID the memory is actually on.

[Dan]
In the case of NFIT numa node should already be set, and in the
case of the memmap=ss!nn or e820-type-12 we can set the numa node
like this:

[Needed for v4.3]
CC: Stable Tree <stable@vger.kernel.org>
Tested-by: Boaz Harrosh <boaz@plexistor.com>
---
 drivers/nvdimm/e820.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/nvdimm/e820.c b/drivers/nvdimm/e820.c
index 8282db2..e40df8f 100644
--- a/drivers/nvdimm/e820.c
+++ b/drivers/nvdimm/e820.c
@@ -48,7 +48,7 @@ static int e820_pmem_probe(struct platform_device *pdev)
 		memset(&ndr_desc, 0, sizeof(ndr_desc));
 		ndr_desc.res = p;
 		ndr_desc.attr_groups = e820_pmem_region_attribute_groups;
-		ndr_desc.numa_node = NUMA_NO_NODE;
+		ndr_desc.numa_node = memory_add_physaddr_to_nid(p->start);
 		set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
 		if (!nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc))
 			goto err;
-- 
1.9.3



^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] nvdimm: proper NID in e820_pmem_probe
  2015-11-12 13:10     ` [PATCH] nvdimm: proper NID in e820_pmem_probe Boaz Harrosh
@ 2015-11-12 13:58       ` Boaz Harrosh
  2015-11-12 17:13         ` Dan Williams
  0 siblings, 1 reply; 11+ messages in thread
From: Boaz Harrosh @ 2015-11-12 13:58 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-fsdevel, linux-nvdimm, Stable Tree

On 11/12/2015 03:10 PM, Boaz Harrosh wrote:
> From: Dan Williams <dan.j.williams@intel.com>
> 
> [Boaz]
> What I see is that in the call to arch_add_memory() nid==0 regardless of the
> real NID the memory is actually on.
> 
> [Dan]
> In the case of NFIT numa node should already be set, and in the
> case of the memmap=ss!nn or e820-type-12 we can set the numa node
> like this:
> 
> [Needed for v4.3]
> CC: Stable Tree <stable@vger.kernel.org>
> Tested-by: Boaz Harrosh <boaz@plexistor.com>

Dan thanks, of course it works perfectly well. I'm not sure if you also need my:
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>

So I'm happy to say that with this small fix.

And a big struggle to enable CONFIG_EXPERT so to disable ZONE_DMA and enable
ZONE_DEVICE. Would you support reverting the completely dead code ZONE_DMA
for x86_64 "on" by default so to allow an easier ZONE_DEVICE to be turned on?
(We currently have a script sent to clients to manipulate their .config before
 compiling their 4.3 based Kernel)

So as I said I'm happy to announce that with the 4.3 Kernel (+ fix) I'm able
to run my all system, same as with my old system, but without any Kernel patching.
(Almost, just one optimization for write page-faults).

Including:
- Direct IO of pmem-pages to slower SSD / harddisk / iscai block devices
- RDMA from pmem-pages directly.
- pmem direct RDMA target machine.
- Cluster wide unified pmem access
- VM access to pmem

We still carry a few of our own persistent assembly calls, but just because
the Kernel's ones are a bit of a mixed mess.

The only complain I have with 4.3 is the wrong and scary message in my logs
on my perfectly healthy and thriving ADR system with NvDIMMs that says:

	"d_pmem namespace0.0: unable to guarantee persistence of writes"

As I told you in our talk, (Ever so gently and with full respect), you guys
made a bit of mess with the none-existent PCOMMIT instruction and NvDIMM persistency.
With a complete ADR system, even CPUs without PCOMMIT instruction are persistence
safe because of system support in flushing of MEM/IO buffers on a power loss.

So you see the Kernel can not really say if the system is actually
"guarantee persistence". I'd send a fix for this all mess, once I have a bit
of time. (The mess I mean the all PCOMMIT thing that not a single CPU in existence
has support off, and actually it was put on hold for any real hardware. And some
missing corner cases of wrongness with persistency, as we found in testing)

So Cheers Sir Dan. 4.3 rocks and we are able to work without any Kernel patches.
4.4 stuff I have not touched yet at all. Will do ASAP and report once I tested it.

[And we are still waiting for any NFIT system which is currently a complete
 vaporware. Intel said it will not upgrade any of our ADR systems BIOS/EUFI
 to NFIT. And there is no date yet for any new NFIT systems.
 You need to please send me instructions on how to compile my own QEMU with
 NFIT support, because I do not have any means currently to test my code with
 NFIT]

Thanks
Boaz

> ---
>  drivers/nvdimm/e820.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/nvdimm/e820.c b/drivers/nvdimm/e820.c
> index 8282db2..e40df8f 100644
> --- a/drivers/nvdimm/e820.c
> +++ b/drivers/nvdimm/e820.c
> @@ -48,7 +48,7 @@ static int e820_pmem_probe(struct platform_device *pdev)
>  		memset(&ndr_desc, 0, sizeof(ndr_desc));
>  		ndr_desc.res = p;
>  		ndr_desc.attr_groups = e820_pmem_region_attribute_groups;
> -		ndr_desc.numa_node = NUMA_NO_NODE;
> +		ndr_desc.numa_node = memory_add_physaddr_to_nid(p->start);
>  		set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
>  		if (!nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc))
>  			goto err;
> 


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] nvdimm: proper NID in e820_pmem_probe
  2015-11-12 13:58       ` Boaz Harrosh
@ 2015-11-12 17:13         ` Dan Williams
  2015-11-15 10:08           ` Boaz Harrosh
  0 siblings, 1 reply; 11+ messages in thread
From: Dan Williams @ 2015-11-12 17:13 UTC (permalink / raw)
  To: Boaz Harrosh; +Cc: linux-fsdevel, linux-nvdimm, Stable Tree

On Thu, Nov 12, 2015 at 5:58 AM, Boaz Harrosh <boaz@plexistor.com> wrote:
> On 11/12/2015 03:10 PM, Boaz Harrosh wrote:
>> From: Dan Williams <dan.j.williams@intel.com>
>>
>> [Boaz]
>> What I see is that in the call to arch_add_memory() nid==0 regardless of the
>> real NID the memory is actually on.
>>
>> [Dan]
>> In the case of NFIT numa node should already be set, and in the
>> case of the memmap=ss!nn or e820-type-12 we can set the numa node
>> like this:
>>
>> [Needed for v4.3]
>> CC: Stable Tree <stable@vger.kernel.org>
>> Tested-by: Boaz Harrosh <boaz@plexistor.com>
>
> Dan thanks, of course it works perfectly well. I'm not sure if you also need my:
> Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
>
> So I'm happy to say that with this small fix.

Thanks for the test!  There's a small compile fix I need to add that
0day discovered, but I'll get this into a pull request before the end
of the week.

> And a big struggle to enable CONFIG_EXPERT so to disable ZONE_DMA and enable
> ZONE_DEVICE. Would you support reverting the completely dead code ZONE_DMA
> for x86_64 "on" by default so to allow an easier ZONE_DEVICE to be turned on?
> (We currently have a script sent to clients to manipulate their .config before
>  compiling their 4.3 based Kernel)

Changing that default would be a question to the x86 maintainers.  I
agree that ZONE_DMA should be opt-in rather than opt-out on modern
systems.

> So as I said I'm happy to announce that with the 4.3 Kernel (+ fix) I'm able
> to run my all system, same as with my old system, but without any Kernel patching.
> (Almost, just one optimization for write page-faults).
>
> Including:
> - Direct IO of pmem-pages to slower SSD / harddisk / iscai block devices
> - RDMA from pmem-pages directly.
> - pmem direct RDMA target machine.

Really?  How do you achieve these 3 features without get_user_pages()
for DAX mappings?  Do you have a custom driver in the kernel that is
just going pfn_to_page()?

> - Cluster wide unified pmem access
> - VM access to pmem
>
> We still carry a few of our own persistent assembly calls, but just because
> the Kernel's ones are a bit of a mixed mess.

Would be interested to see them.  We're currently looking at
performance enhancements in this area.

>
> The only complain I have with 4.3 is the wrong and scary message in my logs
> on my perfectly healthy and thriving ADR system with NvDIMMs that says:
>
>         "d_pmem namespace0.0: unable to guarantee persistence of writes"
>
> As I told you in our talk, (Ever so gently and with full respect), you guys
> made a bit of mess with the none-existent PCOMMIT instruction and NvDIMM persistency.
> With a complete ADR system, even CPUs without PCOMMIT instruction are persistence
> safe because of system support in flushing of MEM/IO buffers on a power loss.
>
> So you see the Kernel can not really say if the system is actually
> "guarantee persistence". I'd send a fix for this all mess, once I have a bit
> of time. (The mess I mean the all PCOMMIT thing that not a single CPU in existence
> has support off, and actually it was put on hold for any real hardware. And some
> missing corner cases of wrongness with persistency, as we found in testing)

I agree that the ADR situation is a bit of a mess since ACPI provides
no mechanism to tell you that it is available.  I wouldn't be opposed
to whitelisting certain platforms or use a sideband mechanism to check
for ADR.  It's just not clear to me how to reliably determine if ADR
is available and functional.

How about a kernel parameter like "libnvdimm.adr=1" to tell the kernel
to ignore the absence of pcommit?

> 4.4 stuff I have not touched yet at all. Will do ASAP and report once I tested it.

Appreciate it.

> [And we are still waiting for any NFIT system which is currently a complete
>  vaporware. Intel said it will not upgrade any of our ADR systems BIOS/EUFI
>  to NFIT. And there is no date yet for any new NFIT systems.
>  You need to please send me instructions on how to compile my own QEMU with
>  NFIT support, because I do not have any means currently to test my code with
>  NFIT]

The QEMU NFIT enabling is progressing, but not yet merged it seems.

http://marc.info/?l=qemu-devel&m=144645659908290&w=2

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC 1/1] memremap: devm_memremap_pages has wrong nid
  2015-11-11 21:46 ` Dan Williams
  2015-11-11 22:05   ` Boaz Harrosh
@ 2015-11-13 16:00   ` Toshi Kani
  2015-11-15  9:17     ` Boaz Harrosh
  1 sibling, 1 reply; 11+ messages in thread
From: Toshi Kani @ 2015-11-13 16:00 UTC (permalink / raw)
  To: Dan Williams, Boaz Harrosh; +Cc: linux-fsdevel, linux-nvdimm

On Wed, 2015-11-11 at 13:46 -0800, Dan Williams wrote:
> On Wed, Nov 11, 2015 at 1:16 PM, Boaz Harrosh <boaz@plexistor.com> wrote:
> > 
> > The pmem dev as received in devm_memremap_pages() does not have
> > its dev->numa_node properly initialized yet.
> > 
> > What I see is that all pmem devices will call arch_add_memory
> > with nid==0 regardless of the real nid the memory is actually
> > on. Needless to say that after that all the NUMA page and zone
> > members (at page->flags) come out wrong.
> > 
> > If I do the below code it will all work properly.
> > 
> > RFC because probably we want to fix dev->numa_node before the
> > call to devm_memremap_pages.
> 
> Let's just do that instead.  I.e. in the case of NFIT numa node should
> already be set, and in the case of the memmap=ss!nn or e820-type-12 we
> can set the numa node like this:

Agreed.  memory_add_physaddr_to_nid() uses the SRAT info, which does not work
with the NFIT case. 

Thanks,
-Toshi 


> diff --git a/drivers/nvdimm/e820.c b/drivers/nvdimm/e820.c
> index 8282db2ef99e..e40df8fedf73 100644
> --- a/drivers/nvdimm/e820.c
> +++ b/drivers/nvdimm/e820.c
> @@ -48,7 +48,7 @@ static int e820_pmem_probe(struct platform_device *pdev)
>                memset(&ndr_desc, 0, sizeof(ndr_desc));
>                ndr_desc.res = p;
>                ndr_desc.attr_groups = e820_pmem_region_attribute_groups;
> -               ndr_desc.numa_node = NUMA_NO_NODE;
> +               ndr_desc.numa_node = memory_add_physaddr_to_nid(p->start);
>                set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
>                if (!nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc))
>                        goto err;
> 
> Thanks for pointing out memory_add_physaddr_to_nid(). I did not know
> that existed.




^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC 1/1] memremap: devm_memremap_pages has wrong nid
  2015-11-13 16:00   ` [RFC 1/1] memremap: devm_memremap_pages has wrong nid Toshi Kani
@ 2015-11-15  9:17     ` Boaz Harrosh
  2015-11-16 17:19       ` Toshi Kani
  0 siblings, 1 reply; 11+ messages in thread
From: Boaz Harrosh @ 2015-11-15  9:17 UTC (permalink / raw)
  To: Toshi Kani, Dan Williams; +Cc: linux-fsdevel, linux-nvdimm

On 11/13/2015 06:00 PM, Toshi Kani wrote:
<>
> 
> Agreed.  memory_add_physaddr_to_nid() uses the SRAT info, which does not work
> with the NFIT case. 
> 

Thanks Toshi, I did not know that NFIT would not work. (As I already ranted NFIT
is hard to find)

Would it be hard to fix? I mean the way it is today NvDIMM is always put at the
*end* of the NUMA address range, so all the NUMA boundaries (start) are there, all
we need is to make sure max_pfn is advanced behind the last NvDIMM range.

(Ok and we might have a slight problem with an NFIT only Node, where there
 is no volatile memory at all)

I think it is worth fixing there are surprising places this might be used.
I know that it works with type-12 and emulated pmem.

(Once I set up my NFIT QEMU I'll see what I can find)

> Thanks,
> -Toshi 
> 

Thanks
Boaz


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] nvdimm: proper NID in e820_pmem_probe
  2015-11-12 17:13         ` Dan Williams
@ 2015-11-15 10:08           ` Boaz Harrosh
  0 siblings, 0 replies; 11+ messages in thread
From: Boaz Harrosh @ 2015-11-15 10:08 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-fsdevel, linux-nvdimm, Stable Tree

On 11/12/2015 07:13 PM, Dan Williams wrote:
<>
> 
> Thanks for the test!  There's a small compile fix I need to add that
> 0day discovered, but I'll get this into a pull request before the end
> of the week.
> 

Thanks

<>
> 
> Really?  How do you achieve these 3 features without get_user_pages()
> for DAX mappings?  Do you have a custom driver in the kernel that is
> just going pfn_to_page()?
> 

Yes both Target and Initiator are Kernel based. The target is currently
a driver that receives a /dev/pmemX parameter to operate on. (The initiator
is inside this pmem cluster FS, half Kernel half user-space)

<>
>>
>> We still carry a few of our own persistent assembly calls, but just because
>> the Kernel's ones are a bit of a mixed mess.
> 
> Would be interested to see them.  We're currently looking at
> performance enhancements in this area.
> 

I have an old patch to clean up pmem.h and stuff but is old and will not patch.
Is there anything beyond what Linus pulled for 4.4-rc1 ? I'll try to base it
on your tree. But no promises this is low priority for me.

performance wise we are using the regular __copy_from_user_inatomic_nocache()
as you do.
Only difference is the: "No need for memory-barrier with clflush" and our
clflush loop with some fixes.
(And our own clean implementation of copy_from_iter_persistent inside
 iov_iter.c for also KVEC and BVEC NT support)

Actually we are counting on your future patches to not FAULT on memory
errors and return with an error code.

<>
> 
> I agree that the ADR situation is a bit of a mess since ACPI provides
> no mechanism to tell you that it is available.  I wouldn't be opposed
> to whitelisting certain platforms or use a sideband mechanism to check
> for ADR.  It's just not clear to me how to reliably determine if ADR
> is available and functional.
> 

I think that for now the presence of ACPI both type-12 and NFIT is a clear
indication of an ADR system. Do you know of any system in existence that this
is not true? (Why talk about theoretical none existent systems?).

Any system builder will not provide ACPI NvDIMM tables if it would not work,
otherwise what is the point to provide it? so just trust the system builder
I think.

I think the all warning is mute. The chain of events that need to happen so
a pmem driver will actually appear already means the system is capable of
NvDIMM. So please let us just remove this warning. Also with emulated memmap=ss!aa
the warning means nothing.

[BTW Also with (none-existent) pcommit you do not actually know that the system
 actually works, the CPU might have it, but not the firmware]

> How about a kernel parameter like "libnvdimm.adr=1" to tell the kernel
> to ignore the absence of pcommit?
> 

Again I think the presence of NFIT or type-12 already means this.
For x86 this is ADR. For other ARCHs they will need to provide there
own Software or firmware support down the ARCH space. I think

The only real point is ARCHs that do not define any pmem support, but
for those we will BUG() so fast that the warning is pointless there too.

<>
> 
> The QEMU NFIT enabling is progressing, but not yet merged it seems.
> 
> http://marc.info/?l=qemu-devel&m=144645659908290&w=2
> 

I will make an attempt to compile this, and produce a document.

Thanks Dan, good day
Boaz


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC 1/1] memremap: devm_memremap_pages has wrong nid
  2015-11-15  9:17     ` Boaz Harrosh
@ 2015-11-16 17:19       ` Toshi Kani
  2015-11-17 13:15         ` Boaz Harrosh
  0 siblings, 1 reply; 11+ messages in thread
From: Toshi Kani @ 2015-11-16 17:19 UTC (permalink / raw)
  To: Boaz Harrosh, Dan Williams; +Cc: linux-fsdevel, linux-nvdimm

On Sun, 2015-11-15 at 11:17 +0200, Boaz Harrosh wrote:
> On 11/13/2015 06:00 PM, Toshi Kani wrote:
> <>
> > 
> > Agreed.  memory_add_physaddr_to_nid() uses the SRAT info, which does not 
> > work with the NFIT case.
> > 
> 
> Thanks Toshi, I did not know that NFIT would not work. (As I already ranted 
> NFIT is hard to find)
> 
> Would it be hard to fix? I mean the way it is today NvDIMM is always put at 
> the *end* of the NUMA address range, so all the NUMA boundaries (start) are 
> there, all we need is to make sure max_pfn is advanced behind the last NvDIMM
> range.

I understand that both memory_add_physaddr_to_nid() and max_pfn cover NVDIMM
ranges on platforms with legacy E820_PRAM (12), which differs from the NFIT
case.  I agree that such difference is not desirable.

NFIT FW I have does not put NVDIMM ranges into SRAT, but ACPI spec is not very
clear about it [1].  We currently consider NVDIMM as device memory, not regular
memory.  So, this depends on how we define the "memory" info.

As for max_pfn, yes, it may make sense to cover NVDIMM when ZONE_DEVICE is used.

> (Ok and we might have a slight problem with an NFIT only Node, where there
>  is no volatile memory at all)

The NFIT driver sets it to the closest on-line node (i.e. where regular memory
resides) in this case.

> I think it is worth fixing there are surprising places this might be used.
> I know that it works with type-12 and emulated pmem.
> 
> (Once I set up my NFIT QEMU I'll see what I can find)

Thanks,
-Toshi

[1] https://lkml.org/lkml/2015/9/1/484


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC 1/1] memremap: devm_memremap_pages has wrong nid
  2015-11-16 17:19       ` Toshi Kani
@ 2015-11-17 13:15         ` Boaz Harrosh
  0 siblings, 0 replies; 11+ messages in thread
From: Boaz Harrosh @ 2015-11-17 13:15 UTC (permalink / raw)
  To: Toshi Kani, Dan Williams; +Cc: linux-fsdevel, linux-nvdimm

On 11/16/2015 07:19 PM, Toshi Kani wrote:
<>
>> (Ok and we might have a slight problem with an NFIT only Node, where there
>>  is no volatile memory at all)
> 
> The NFIT driver sets it to the closest on-line node (i.e. where regular memory
> resides) in this case.
> 

This is a real problem then. I know it used to be the same with type-12 pmem
and we had real trouble with it on systems performance. Eventually one of your
patches fixed it.

We should fix this for NFIT, yes specially with ZONE_DEVICE and page support.
It will hurt performance a lot.

Eventually I'll get to it, again when NFIT becomes more relevant to us.

<>
> 
> Thanks,
> -Toshi
> 

Thanks Toshi
Boaz



^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2015-11-17 13:15 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-11-11 21:16 [RFC 1/1] memremap: devm_memremap_pages has wrong nid Boaz Harrosh
2015-11-11 21:46 ` Dan Williams
2015-11-11 22:05   ` Boaz Harrosh
2015-11-12 13:10     ` [PATCH] nvdimm: proper NID in e820_pmem_probe Boaz Harrosh
2015-11-12 13:58       ` Boaz Harrosh
2015-11-12 17:13         ` Dan Williams
2015-11-15 10:08           ` Boaz Harrosh
2015-11-13 16:00   ` [RFC 1/1] memremap: devm_memremap_pages has wrong nid Toshi Kani
2015-11-15  9:17     ` Boaz Harrosh
2015-11-16 17:19       ` Toshi Kani
2015-11-17 13:15         ` Boaz Harrosh

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.