nvdimm.lists.linux.dev archive mirror
* Re: Detecting NUMA per pmem
       [not found] ` <20171020162227.GA8576@linux.intel.com>
@ 2017-10-22 11:33   ` Oren Berman
  2017-10-22 13:52     ` Dan Williams
  0 siblings, 1 reply; 16+ messages in thread
From: Oren Berman @ 2017-10-22 11:33 UTC (permalink / raw)
  To: Ross Zwisler; +Cc: linux-nvdimm

Hi Ross

Thanks for the speedy reply. I am also adding the public list to this
thread as you suggested.

We have tried to dump the SPA table and this is what we get:

/*
 * Intel ACPI Component Architecture
 * AML/ASL+ Disassembler version 20160108-64
 * Copyright (c) 2000 - 2016 Intel Corporation
 *
 * Disassembly of NFIT, Sun Oct 22 10:46:19 2017
 *
 * ACPI Data Table [NFIT]
 *
 * Format: [HexOffset DecimalOffset ByteLength]  FieldName : FieldValue
 */

[000h 0000   4]                    Signature : "NFIT"    [NVDIMM Firmware
Interface Table]
[004h 0004   4]                 Table Length : 00000028
[008h 0008   1]                     Revision : 01
[009h 0009   1]                     Checksum : B2
[00Ah 0010   6]                       Oem ID : "SUPERM"
[010h 0016   8]                 Oem Table ID : "SMCI--MB"
[018h 0024   4]                 Oem Revision : 00000001
[01Ch 0028   4]              Asl Compiler ID : " "
[020h 0032   4]        Asl Compiler Revision : 00000001

[024h 0036   4]                     Reserved : 00000000

Raw Table Data: Length 40 (0x28)

  0000: 4E 46 49 54 28 00 00 00 01 B2 53 55 50 45 52 4D  // NFIT(.....SUPERM
  0010: 53 4D 43 49 2D 2D 4D 42 01 00 00 00 01 00 00 00  // SMCI--MB........
  0020: 01 00 00 00 00 00 00 00

As you can see the memory region info is missing.

This specific check was done on a Supermicro server.
We also performed a BIOS update, but the results were the same.

As said before, the pmem devices are detected correctly, and we verified
that they correspond to different NUMA nodes using the PCM utility. However,
Linux still reports both pmem devices to be on the same NUMA node - node 0.

If this information is missing, why are the pmem devices and address ranges
still detected correctly?
Is there another table that we need to check?

I also ran dmidecode and the NVDIMMs are listed (we tested with Netlist
NVDIMMs). I can also see the bank locator showing P0 and P1, which I think
indicates the NUMA node. Here is an example:

Handle 0x002D, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x002A
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 16384 MB
Form Factor: DIMM
Set: None
Locator: P1-DIMMA3
Bank Locator: P0_Node0_Channel0_Dimm2
Type: DDR4
Type Detail: Synchronous
Speed: 2400 MHz
Manufacturer: Netlist
Serial Number: 66F50006
Asset Tag: P1-DIMMA3_AssetTag (date:16/42)
Part Number: NV3A74SBT20-000
Rank: 1
Configured Clock Speed: 1600 MHz
Minimum Voltage: Unknown
Maximum Voltage: Unknown
Configured Voltage: Unknown


Handle 0x003B, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x0038
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 16384 MB
Form Factor: DIMM
Set: None
Locator: P2-DIMME3
Bank Locator: P1_Node1_Channel0_Dimm2
Type: DDR4
Type Detail: Synchronous
Speed: 2400 MHz
Manufacturer: Netlist
Serial Number: 66B50010
Asset Tag: P2-DIMME3_AssetTag (date:16/42)
Part Number: NV3A74SBT20-000
Rank: 1
Configured Clock Speed: 1600 MHz
Minimum Voltage: Unknown
Maximum Voltage: Unknown
Configured Voltage: Unknown
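
For reference, the type 17 "Memory Device" records above can be dumped on
their own with something like the following (exact flags may vary by
dmidecode version):

  # dmidecode -t 17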

Did you encounter such a case? We would appreciate any insight you might
have.

BR
Oren Berman


On 20 October 2017 at 19:22, Ross Zwisler <ross.zwisler@linux.intel.com>
wrote:

> On Thu, Oct 19, 2017 at 06:12:24PM +0300, Oren Berman wrote:
> >    Hi Ross
> >    My name is Oren Berman and I am a senior developer at lightbitslabs.
> >    We are working with NDIMMs but we encountered a problem that the
> kernel
> >     does not seem to detect the numa id per PMEM device.
> >    It always reports numa 0 although we have NVDIMM devices on both
> nodes.
> >    We checked that it always returns 0 from sysfs and also from
> retrieving
> >    the device of pmem in the kernel and calling dev_to_node.
> >    The result is always 0 for both pmem0 and pmem1.
> >    In order to make sure that indeed both numa sockets are used we ran
> >    intel's pcm utlity. We verified that writing to pmem 0 increases
> socket 0
> >    utilization and  writing to pmem1 increases socket 1 utilization so
> the hw
> >    works properly.
> >    Only the detection seems to be invalid.
> >    Did you encounter such a problem?
> >    We are using kernel version 4.9 - are you aware of any fix for this
> issue
> >    or workaround that we can use.
> >    Are we missing something?
> >    Thanks for any help you can give us.
> >    BR
> >    Oren Berman
>
> Hi Oren,
>
> My first guess is that your platform isn't properly filling out the
> "proximity
> domain" field in the NFIT SPA table.
>
> See section 5.2.25.2 in ACPI 6.2:
> http://uefi.org/sites/default/files/resources/ACPI_6_2.pdf
>
> Here's how to check that:
>
>   # cd /tmp
>   # cp /sys/firmware/acpi/tables/NFIT .
>   # iasl NFIT
>
>   Intel ACPI Component Architecture
>   ASL+ Optimizing Compiler version 20160831-64
>   Copyright (c) 2000 - 2016 Intel Corporation
>
>   Binary file appears to be a valid ACPI table, disassembling
>   Input file NFIT, Length 0xE0 (224) bytes
>   ACPI: NFIT 0x0000000000000000 0000E0 (v01 BOCHS  BXPCNFIT 00000001 BXPC
>   00000001)
>   Acpi Data Table [NFIT] decoded
>   Formatted output:  NFIT.dsl - 5191 bytes
>
> This will give you an NFIT.dsl file which you can look at.  Here is what my
> SPA table looks like for an emulated QEMU NVDIMM:
>
>   [028h 0040   2]                Subtable Type : 0000 [System Physical
> Address Range]
>   [02Ah 0042   2]                       Length : 0038
>
>   [02Ch 0044   2]                  Range Index : 0002
>   [02Eh 0046   2]        Flags (decoded below) : 0003
>                      Add/Online Operation Only : 1
>                         Proximity Domain Valid : 1
>   [030h 0048   4]                     Reserved : 00000000
>   [034h 0052   4]             Proximity Domain : 00000000
>   [038h 0056  16]           Address Range GUID :
> 66F0D379-B4F3-4074-AC43-0D3318B78CDB
>   [048h 0072   8]           Address Range Base : 0000000240000000
>   [050h 0080   8]         Address Range Length : 0000000440000000
>   [058h 0088   8]         Memory Map Attribute : 0000000000008008
>
> So, the "Proximity Domain" field is 0, and this lets the system know which
> NUMA node to associate with this memory region.
>
> BTW, in the future it's best to CC our public list,
> linux-nvdimm@lists.01.org,
> as a) someone else might have the same question and b) someone else might
> know
> the answer.
>
> Thanks,
> - Ross
>
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


* Re: Detecting NUMA per pmem
  2017-10-22 11:33   ` Detecting NUMA per pmem Oren Berman
@ 2017-10-22 13:52     ` Dan Williams
  2017-12-27 18:53       ` Oren Berman
  0 siblings, 1 reply; 16+ messages in thread
From: Dan Williams @ 2017-10-22 13:52 UTC (permalink / raw)
  To: Oren Berman; +Cc: linux-nvdimm

On Sun, Oct 22, 2017 at 4:33 AM, Oren Berman <oren@lightbitslabs.com> wrote:
> Hi Ross
>
> Thanks for the speedy reply. I am also adding the public list to this
> thread as you suggested.
>
> We have tried to dump the SPA table and this is what we get:
>
> /*
>  * Intel ACPI Component Architecture
>  * AML/ASL+ Disassembler version 20160108-64
>  * Copyright (c) 2000 - 2016 Intel Corporation
>  *
>  * Disassembly of NFIT, Sun Oct 22 10:46:19 2017
>  *
>  * ACPI Data Table [NFIT]
>  *
>  * Format: [HexOffset DecimalOffset ByteLength]  FieldName : FieldValue
>  */
>
> [000h 0000   4]                    Signature : "NFIT"    [NVDIMM Firmware
> Interface Table]
> [004h 0004   4]                 Table Length : 00000028
> [008h 0008   1]                     Revision : 01
> [009h 0009   1]                     Checksum : B2
> [00Ah 0010   6]                       Oem ID : "SUPERM"
> [010h 0016   8]                 Oem Table ID : "SMCI--MB"
> [018h 0024   4]                 Oem Revision : 00000001
> [01Ch 0028   4]              Asl Compiler ID : " "
> [020h 0032   4]        Asl Compiler Revision : 00000001
>
> [024h 0036   4]                     Reserved : 00000000
>
> Raw Table Data: Length 40 (0x28)
>
>   0000: 4E 46 49 54 28 00 00 00 01 B2 53 55 50 45 52 4D  // NFIT(.....SUPERM
>   0010: 53 4D 43 49 2D 2D 4D 42 01 00 00 00 01 00 00 00  // SMCI--MB........
>   0020: 01 00 00 00 00 00 00 00
>
> As you can see the memory region info is missing.
>
> This specific check was done on a supermicro server.
> We also performed a bios update but the results were the same.
>
> As said before ,the pmem devices are detected correctly and we verified
> that they correspond to different numa nodes using the PCM utility.However,
>  linux still reports both pmem devices to be on the same numa - Numa 0.
>
> If this information is missing, why pmem devices and address ranges are
> still detected correctly?

I suspect your BIOS might be using E820-type-12 to describe the pmem
ranges, which is not compliant with the ACPI specification and would
need a BIOS change.

> Is there another table that we need to check?

You can dump /proc/iomem.  If it shows "Persistent Memory (legacy)"
then the BIOS is using the E820-type-12 description scheme which does
not include NUMA information.
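
For example, a quick check (the address range below is illustrative and will
vary by platform; run it as root so the ranges are visible):

  # grep -i "persistent memory" /proc/iomem
    880000000-107fffffff : Persistent Memory (legacy)

An entry labeled just "Persistent Memory", without the "(legacy)" suffix,
indicates an NFIT-described range instead.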
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


* Re: Detecting NUMA per pmem
  2017-10-22 13:52     ` Dan Williams
@ 2017-12-27 18:53       ` Oren Berman
  2017-12-28  9:14         ` Dan Williams
  0 siblings, 1 reply; 16+ messages in thread
From: Oren Berman @ 2017-12-27 18:53 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-nvdimm

Hi

I have a question regarding NVDIMM detection.

When we are working with an NVDIMM of type 12, it is detected by Linux in
legacy mode and we can access it as a pmem or dax device. We have an e820
BIOS.

When we are using a type 7 NVDIMM, it is reported by Linux as persistent
type 7 memory, but no pmem or dax device is created.
The Linux kernel identifies this memory in the e820 table but does not
trigger an nvdimm probe for it.
Do you know what could be the cause? Is there a workaround for that?
Can it still be treated as legacy mode so we can access it through a
pmem/dax device?

BR
Oren Berman

On 22 October 2017 at 16:52, Dan Williams <dan.j.williams@intel.com> wrote:

> On Sun, Oct 22, 2017 at 4:33 AM, Oren Berman <oren@lightbitslabs.com>
> wrote:
> > Hi Ross
> >
> > Thanks for the speedy reply. I am also adding the public list to this
> > thread as you suggested.
> >
> > We have tried to dump the SPA table and this is what we get:
> >
> > /*
> >  * Intel ACPI Component Architecture
> >  * AML/ASL+ Disassembler version 20160108-64
> >  * Copyright (c) 2000 - 2016 Intel Corporation
> >  *
> >  * Disassembly of NFIT, Sun Oct 22 10:46:19 2017
> >  *
> >  * ACPI Data Table [NFIT]
> >  *
> >  * Format: [HexOffset DecimalOffset ByteLength]  FieldName : FieldValue
> >  */
> >
> > [000h 0000   4]                    Signature : "NFIT"    [NVDIMM Firmware
> > Interface Table]
> > [004h 0004   4]                 Table Length : 00000028
> > [008h 0008   1]                     Revision : 01
> > [009h 0009   1]                     Checksum : B2
> > [00Ah 0010   6]                       Oem ID : "SUPERM"
> > [010h 0016   8]                 Oem Table ID : "SMCI--MB"
> > [018h 0024   4]                 Oem Revision : 00000001
> > [01Ch 0028   4]              Asl Compiler ID : " "
> > [020h 0032   4]        Asl Compiler Revision : 00000001
> >
> > [024h 0036   4]                     Reserved : 00000000
> >
> > Raw Table Data: Length 40 (0x28)
> >
> >   0000: 4E 46 49 54 28 00 00 00 01 B2 53 55 50 45 52 4D  //
> NFIT(.....SUPERM
> >   0010: 53 4D 43 49 2D 2D 4D 42 01 00 00 00 01 00 00 00  //
> SMCI--MB........
> >   0020: 01 00 00 00 00 00 00 00
> >
> > As you can see the memory region info is missing.
> >
> > This specific check was done on a supermicro server.
> > We also performed a bios update but the results were the same.
> >
> > As said before ,the pmem devices are detected correctly and we verified
> > that they correspond to different numa nodes using the PCM
> utility.However,
> >  linux still reports both pmem devices to be on the same numa - Numa 0.
> >
> > If this information is missing, why pmem devices and address ranges are
> > still detected correctly?
>
> I suspect your BIOS might be using E820-type-12 to describe the pmem
> ranges which is not compliant with the ACPI specification and would
> need a BIOS change.
>
> > Is there another table that we need to check?
>
> You can dump /proc/iomem.  If it shows "Persistent Memory (legacy)"
> then the BIOS is using the E820-type-12 description scheme which does
> not include NUMA information.
>
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


* Re: Detecting NUMA per pmem
  2017-12-27 18:53       ` Oren Berman
@ 2017-12-28  9:14         ` Dan Williams
  2017-12-28 10:03           ` Oren Berman
  0 siblings, 1 reply; 16+ messages in thread
From: Dan Williams @ 2017-12-28  9:14 UTC (permalink / raw)
  To: Oren Berman; +Cc: linux-nvdimm

[sent from my phone, forgive formatting]

Your BIOS would need to put SPA range entries in the ACPI NFIT. The problem
with legacy pmem ranges in the e820 table is that it omits critical details
like battery status and whether the platform supports flushing memory
controller buffers at power loss (ADR).

The NFIT can also reliably communicate NUMA information for NVDIMMs, which
e820 does not.

On Wednesday, December 27, 2017, Oren Berman <oren@lightbitslabs.com> wrote:

> Hi
>
> I have a question regrading NVDIMM detection.
>
> When we are working with NVDIMM of type 12 it is detected by the linux in
> legacy mode and we can
> accesses it as pmem or dax device. we have an e820 bios.
>
> When we are using a type 7 NVDIMM it is reported by the linux as
> persistence type 7 memory but there is no pmem or dax device created.
> Linux Kernel identifies this memory in the e820 table but it does not
> trigger nvdimm probe for it.
> Do you know what could be the cause? Is their a workaround for that?
> Can it still be treated as legacy mode so we can access it through pmem/dax
> device?
>
> BR
> Oren Berman
>
> On 22 October 2017 at 16:52, Dan Williams <dan.j.williams@intel.com>
> wrote:
>
> > On Sun, Oct 22, 2017 at 4:33 AM, Oren Berman <oren@lightbitslabs.com>
> > wrote:
> > > Hi Ross
> > >
> > > Thanks for the speedy reply. I am also adding the public list to this
> > > thread as you suggested.
> > >
> > > We have tried to dump the SPA table and this is what we get:
> > >
> > > /*
> > >  * Intel ACPI Component Architecture
> > >  * AML/ASL+ Disassembler version 20160108-64
> > >  * Copyright (c) 2000 - 2016 Intel Corporation
> > >  *
> > >  * Disassembly of NFIT, Sun Oct 22 10:46:19 2017
> > >  *
> > >  * ACPI Data Table [NFIT]
> > >  *
> > >  * Format: [HexOffset DecimalOffset ByteLength]  FieldName : FieldValue
> > >  */
> > >
> > > [000h 0000   4]                    Signature : "NFIT"    [NVDIMM
> Firmware
> > > Interface Table]
> > > [004h 0004   4]                 Table Length : 00000028
> > > [008h 0008   1]                     Revision : 01
> > > [009h 0009   1]                     Checksum : B2
> > > [00Ah 0010   6]                       Oem ID : "SUPERM"
> > > [010h 0016   8]                 Oem Table ID : "SMCI--MB"
> > > [018h 0024   4]                 Oem Revision : 00000001
> > > [01Ch 0028   4]              Asl Compiler ID : " "
> > > [020h 0032   4]        Asl Compiler Revision : 00000001
> > >
> > > [024h 0036   4]                     Reserved : 00000000
> > >
> > > Raw Table Data: Length 40 (0x28)
> > >
> > >   0000: 4E 46 49 54 28 00 00 00 01 B2 53 55 50 45 52 4D  //
> > NFIT(.....SUPERM
> > >   0010: 53 4D 43 49 2D 2D 4D 42 01 00 00 00 01 00 00 00  //
> > SMCI--MB........
> > >   0020: 01 00 00 00 00 00 00 00
> > >
> > > As you can see the memory region info is missing.
> > >
> > > This specific check was done on a supermicro server.
> > > We also performed a bios update but the results were the same.
> > >
> > > As said before ,the pmem devices are detected correctly and we verified
> > > that they correspond to different numa nodes using the PCM
> > utility.However,
> > >  linux still reports both pmem devices to be on the same numa - Numa 0.
> > >
> > > If this information is missing, why pmem devices and address ranges are
> > > still detected correctly?
> >
> > I suspect your BIOS might be using E820-type-12 to describe the pmem
> > ranges which is not compliant with the ACPI specification and would
> > need a BIOS change.
> >
> > > Is there another table that we need to check?
> >
> > You can dump /proc/iomem.  If it shows "Persistent Memory (legacy)"
> > then the BIOS is using the E820-type-12 description scheme which does
> > not include NUMA information.
> >
> _______________________________________________
> Linux-nvdimm mailing list
> Linux-nvdimm@lists.01.org
> https://lists.01.org/mailman/listinfo/linux-nvdimm
>
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


* Re: Detecting NUMA per pmem
  2017-12-28  9:14         ` Dan Williams
@ 2017-12-28 10:03           ` Oren Berman
  2017-12-28 18:16             ` Dan Williams
  0 siblings, 1 reply; 16+ messages in thread
From: Oren Berman @ 2017-12-28 10:03 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-nvdimm

Thanks Dan.
I understand the shortcomings of using legacy mode, but currently my problem
is that type 12 is detected and I can use dax even in legacy mode, while for
some reason type 7 is not. Is there a way to force it to be treated as legacy
as well?
The reason I am asking is that I am not sure I can change my BIOS, and I
know at least that a type 12 NVDIMM is working for me.

BR
Oren



On 28 December 2017 at 11:14, Dan Williams <dan.j.williams@intel.com> wrote:

> [sent from my phone, forgive formatting]
>
> Your BIOS would need to put SPA range entries in the ACPI NFIT. The
> problem with legacy pmem ranges in the e820 table is that it omits critical
> details like battery status and whether the platform supports flushing
> memory controller buffers at power loss (ADR).
>
> The NFIT can also reliably communicate NUMA information  for NVDIMMs that
> e820 does not.
>
> On Wednesday, December 27, 2017, Oren Berman <oren@lightbitslabs.com>
> wrote:
>
>> Hi
>>
>> I have a question regrading NVDIMM detection.
>>
>> When we are working with NVDIMM of type 12 it is detected by the linux in
>> legacy mode and we can
>> accesses it as pmem or dax device. we have an e820 bios.
>>
>> When we are using a type 7 NVDIMM it is reported by the linux as
>> persistence type 7 memory but there is no pmem or dax device created.
>> Linux Kernel identifies this memory in the e820 table but it does not
>> trigger nvdimm probe for it.
>> Do you know what could be the cause? Is their a workaround for that?
>> Can it still be treated as legacy mode so we can access it through
>> pmem/dax
>> device?
>>
>> BR
>> Oren Berman
>>
>> On 22 October 2017 at 16:52, Dan Williams <dan.j.williams@intel.com>
>> wrote:
>>
>> > On Sun, Oct 22, 2017 at 4:33 AM, Oren Berman <oren@lightbitslabs.com>
>> > wrote:
>> > > Hi Ross
>> > >
>> > > Thanks for the speedy reply. I am also adding the public list to this
>> > > thread as you suggested.
>> > >
>> > > We have tried to dump the SPA table and this is what we get:
>> > >
>> > > /*
>> > >  * Intel ACPI Component Architecture
>> > >  * AML/ASL+ Disassembler version 20160108-64
>> > >  * Copyright (c) 2000 - 2016 Intel Corporation
>> > >  *
>> > >  * Disassembly of NFIT, Sun Oct 22 10:46:19 2017
>> > >  *
>> > >  * ACPI Data Table [NFIT]
>> > >  *
>> > >  * Format: [HexOffset DecimalOffset ByteLength]  FieldName :
>> FieldValue
>> > >  */
>> > >
>> > > [000h 0000   4]                    Signature : "NFIT"    [NVDIMM
>> Firmware
>> > > Interface Table]
>> > > [004h 0004   4]                 Table Length : 00000028
>> > > [008h 0008   1]                     Revision : 01
>> > > [009h 0009   1]                     Checksum : B2
>> > > [00Ah 0010   6]                       Oem ID : "SUPERM"
>> > > [010h 0016   8]                 Oem Table ID : "SMCI--MB"
>> > > [018h 0024   4]                 Oem Revision : 00000001
>> > > [01Ch 0028   4]              Asl Compiler ID : " "
>> > > [020h 0032   4]        Asl Compiler Revision : 00000001
>> > >
>> > > [024h 0036   4]                     Reserved : 00000000
>> > >
>> > > Raw Table Data: Length 40 (0x28)
>> > >
>> > >   0000: 4E 46 49 54 28 00 00 00 01 B2 53 55 50 45 52 4D  //
>> > NFIT(.....SUPERM
>> > >   0010: 53 4D 43 49 2D 2D 4D 42 01 00 00 00 01 00 00 00  //
>> > SMCI--MB........
>> > >   0020: 01 00 00 00 00 00 00 00
>> > >
>> > > As you can see the memory region info is missing.
>> > >
>> > > This specific check was done on a supermicro server.
>> > > We also performed a bios update but the results were the same.
>> > >
>> > > As said before ,the pmem devices are detected correctly and we
>> verified
>> > > that they correspond to different numa nodes using the PCM
>> > utility.However,
>> > >  linux still reports both pmem devices to be on the same numa - Numa
>> 0.
>> > >
>> > > If this information is missing, why pmem devices and address ranges
>> are
>> > > still detected correctly?
>> >
>> > I suspect your BIOS might be using E820-type-12 to describe the pmem
>> > ranges which is not compliant with the ACPI specification and would
>> > need a BIOS change.
>> >
>> > > Is there another table that we need to check?
>> >
>> > You can dump /proc/iomem.  If it shows "Persistent Memory (legacy)"
>> > then the BIOS is using the E820-type-12 description scheme which does
>> > not include NUMA information.
>> >
>> _______________________________________________
>> Linux-nvdimm mailing list
>> Linux-nvdimm@lists.01.org
>> https://lists.01.org/mailman/listinfo/linux-nvdimm
>>
>
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


* Re: Detecting NUMA per pmem
  2017-12-28 10:03           ` Oren Berman
@ 2017-12-28 18:16             ` Dan Williams
  2017-12-31  8:23               ` Yigal Korman
  0 siblings, 1 reply; 16+ messages in thread
From: Dan Williams @ 2017-12-28 18:16 UTC (permalink / raw)
  To: Oren Berman; +Cc: linux-nvdimm

Type-7 only tells the kernel to reserve the memory range. NFIT carves that
reservation into pmem devices. Type-12 skips the reservation and creates a
pmem device directly. There is no workaround if the platform only has a
BIOS that produces a type-12 range.


On Thursday, December 28, 2017, Oren Berman <oren@lightbitslabs.com> wrote:

> Thanks Dan.
> I understand the shortcomings of using legacy mode but currently my problem
> is that TYPE 12 is detected and I can use dax even in legacy mode but for
> some reason type 7 is not. Is there a way to force it be treated as legacy
> as well.
> The reason I am asking is that I am not sure I can change my bios and I
> know at least that type 12 NVDIMM is working for me.
>
> BR
> Oren
>
>
>
> On 28 December 2017 at 11:14, Dan Williams <dan.j.williams@intel.com>
> wrote:
>
> > [sent from my phone, forgive formatting]
> >
> > Your BIOS would need to put SPA range entries in the ACPI NFIT. The
> > problem with legacy pmem ranges in the e820 table is that it omits
> critical
> > details like battery status and whether the platform supports flushing
> > memory controller buffers at power loss (ADR).
> >
> > The NFIT can also reliably communicate NUMA information  for NVDIMMs that
> > e820 does not.
> >
> > On Wednesday, December 27, 2017, Oren Berman <oren@lightbitslabs.com>
> > wrote:
> >
> >> Hi
> >>
> >> I have a question regrading NVDIMM detection.
> >>
> >> When we are working with NVDIMM of type 12 it is detected by the linux
> in
> >> legacy mode and we can
> >> accesses it as pmem or dax device. we have an e820 bios.
> >>
> >> When we are using a type 7 NVDIMM it is reported by the linux as
> >> persistence type 7 memory but there is no pmem or dax device created.
> >> Linux Kernel identifies this memory in the e820 table but it does not
> >> trigger nvdimm probe for it.
> >> Do you know what could be the cause? Is their a workaround for that?
> >> Can it still be treated as legacy mode so we can access it through
> >> pmem/dax
> >> device?
> >>
> >> BR
> >> Oren Berman
> >>
> >> On 22 October 2017 at 16:52, Dan Williams <dan.j.williams@intel.com>
> >> wrote:
> >>
> >> > On Sun, Oct 22, 2017 at 4:33 AM, Oren Berman <oren@lightbitslabs.com>
> >> > wrote:
> >> > > Hi Ross
> >> > >
> >> > > Thanks for the speedy reply. I am also adding the public list to
> this
> >> > > thread as you suggested.
> >> > >
> >> > > We have tried to dump the SPA table and this is what we get:
> >> > >
> >> > > /*
> >> > >  * Intel ACPI Component Architecture
> >> > >  * AML/ASL+ Disassembler version 20160108-64
> >> > >  * Copyright (c) 2000 - 2016 Intel Corporation
> >> > >  *
> >> > >  * Disassembly of NFIT, Sun Oct 22 10:46:19 2017
> >> > >  *
> >> > >  * ACPI Data Table [NFIT]
> >> > >  *
> >> > >  * Format: [HexOffset DecimalOffset ByteLength]  FieldName :
> >> FieldValue
> >> > >  */
> >> > >
> >> > > [000h 0000   4]                    Signature : "NFIT"    [NVDIMM
> >> Firmware
> >> > > Interface Table]
> >> > > [004h 0004   4]                 Table Length : 00000028
> >> > > [008h 0008   1]                     Revision : 01
> >> > > [009h 0009   1]                     Checksum : B2
> >> > > [00Ah 0010   6]                       Oem ID : "SUPERM"
> >> > > [010h 0016   8]                 Oem Table ID : "SMCI--MB"
> >> > > [018h 0024   4]                 Oem Revision : 00000001
> >> > > [01Ch 0028   4]              Asl Compiler ID : " "
> >> > > [020h 0032   4]        Asl Compiler Revision : 00000001
> >> > >
> >> > > [024h 0036   4]                     Reserved : 00000000
> >> > >
> >> > > Raw Table Data: Length 40 (0x28)
> >> > >
> >> > >   0000: 4E 46 49 54 28 00 00 00 01 B2 53 55 50 45 52 4D  //
> >> > NFIT(.....SUPERM
> >> > >   0010: 53 4D 43 49 2D 2D 4D 42 01 00 00 00 01 00 00 00  //
> >> > SMCI--MB........
> >> > >   0020: 01 00 00 00 00 00 00 00
> >> > >
> >> > > As you can see the memory region info is missing.
> >> > >
> >> > > This specific check was done on a supermicro server.
> >> > > We also performed a bios update but the results were the same.
> >> > >
> >> > > As said before ,the pmem devices are detected correctly and we
> >> verified
> >> > > that they correspond to different numa nodes using the PCM
> >> > utility.However,
> >> > >  linux still reports both pmem devices to be on the same numa - Numa
> >> 0.
> >> > >
> >> > > If this information is missing, why pmem devices and address ranges
> >> are
> >> > > still detected correctly?
> >> >
> >> > I suspect your BIOS might be using E820-type-12 to describe the pmem
> >> > ranges which is not compliant with the ACPI specification and would
> >> > need a BIOS change.
> >> >
> >> > > Is there another table that we need to check?
> >> >
> >> > You can dump /proc/iomem.  If it shows "Persistent Memory (legacy)"
> >> > then the BIOS is using the E820-type-12 description scheme which does
> >> > not include NUMA information.
> >> >
> >> _______________________________________________
> >> Linux-nvdimm mailing list
> >> Linux-nvdimm@lists.01.org
> >> https://lists.01.org/mailman/listinfo/linux-nvdimm
> >>
> >
> _______________________________________________
> Linux-nvdimm mailing list
> Linux-nvdimm@lists.01.org
> https://lists.01.org/mailman/listinfo/linux-nvdimm
>
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


* Re: Detecting NUMA per pmem
  2017-12-28 18:16             ` Dan Williams
@ 2017-12-31  8:23               ` Yigal Korman
  2018-01-09 22:25                 ` Oren Berman
  0 siblings, 1 reply; 16+ messages in thread
From: Yigal Korman @ 2017-12-31  8:23 UTC (permalink / raw)
  To: Dan Williams; +Cc: Oren Berman, linux-nvdimm

You can try to force a legacy pmem device with the memmap=XX!YY kernel
parameter.
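
For example (the size and offset are placeholders; they need to describe a
real physical memory range on the machine):

  memmap=32G!224G

reserves 32 GiB of RAM starting at the 224 GiB physical offset and exposes
it as a legacy pmem device.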

On Thu, Dec 28, 2017 at 8:16 PM, Dan Williams <dan.j.williams@intel.com>
wrote:

> Type-7 only tells the kernel to reserve the memory range. NFIT carves that
> reservation into pmem devices. Type-12 skips the reservation and creates a
> pmem device directly. There is no workaround if the platform only has a
> BIOS that produces a type-12 range.
>
>
> On Thursday, December 28, 2017, Oren Berman <oren@lightbitslabs.com>
> wrote:
>
> > Thanks Dan.
> > I understand the shortcomings of using legacy mode but currently my
> problem
> > is that TYPE 12 is detected and I can use dax even in legacy mode but for
> > some reason type 7 is not. Is there a way to force it be treated as
> legacy
> > as well.
> > The reason I am asking is that I am not sure I can change my bios and I
> > know at least that type 12 NVDIMM is working for me.
> >
> > BR
> > Oren
> >
> >
> >
> > On 28 December 2017 at 11:14, Dan Williams <dan.j.williams@intel.com>
> > wrote:
> >
> > > [sent from my phone, forgive formatting]
> > >
> > > Your BIOS would need to put SPA range entries in the ACPI NFIT. The
> > > problem with legacy pmem ranges in the e820 table is that it omits
> > critical
> > > details like battery status and whether the platform supports flushing
> > > memory controller buffers at power loss (ADR).
> > >
> > > The NFIT can also reliably communicate NUMA information  for NVDIMMs
> that
> > > e820 does not.
> > >
> > > On Wednesday, December 27, 2017, Oren Berman <oren@lightbitslabs.com>
> > > wrote:
> > >
> > >> Hi
> > >>
> > >> I have a question regrading NVDIMM detection.
> > >>
> > >> When we are working with NVDIMM of type 12 it is detected by the linux
> > in
> > >> legacy mode and we can
> > >> accesses it as pmem or dax device. we have an e820 bios.
> > >>
> > >> When we are using a type 7 NVDIMM it is reported by the linux as
> > >> persistence type 7 memory but there is no pmem or dax device created.
> > >> Linux Kernel identifies this memory in the e820 table but it does not
> > >> trigger nvdimm probe for it.
> > >> Do you know what could be the cause? Is their a workaround for that?
> > >> Can it still be treated as legacy mode so we can access it through
> > >> pmem/dax
> > >> device?
> > >>
> > >> BR
> > >> Oren Berman
> > >>
> > >> On 22 October 2017 at 16:52, Dan Williams <dan.j.williams@intel.com>
> > >> wrote:
> > >>
> > >> > On Sun, Oct 22, 2017 at 4:33 AM, Oren Berman <
> oren@lightbitslabs.com>
> > >> > wrote:
> > >> > > Hi Ross
> > >> > >
> > >> > > Thanks for the speedy reply. I am also adding the public list to
> > this
> > >> > > thread as you suggested.
> > >> > >
> > >> > > We have tried to dump the SPA table and this is what we get:
> > >> > >
> > >> > > /*
> > >> > >  * Intel ACPI Component Architecture
> > >> > >  * AML/ASL+ Disassembler version 20160108-64
> > >> > >  * Copyright (c) 2000 - 2016 Intel Corporation
> > >> > >  *
> > >> > >  * Disassembly of NFIT, Sun Oct 22 10:46:19 2017
> > >> > >  *
> > >> > >  * ACPI Data Table [NFIT]
> > >> > >  *
> > >> > >  * Format: [HexOffset DecimalOffset ByteLength]  FieldName :
> > >> FieldValue
> > >> > >  */
> > >> > >
> > >> > > [000h 0000   4]                    Signature : "NFIT"    [NVDIMM
> > >> Firmware
> > >> > > Interface Table]
> > >> > > [004h 0004   4]                 Table Length : 00000028
> > >> > > [008h 0008   1]                     Revision : 01
> > >> > > [009h 0009   1]                     Checksum : B2
> > >> > > [00Ah 0010   6]                       Oem ID : "SUPERM"
> > >> > > [010h 0016   8]                 Oem Table ID : "SMCI--MB"
> > >> > > [018h 0024   4]                 Oem Revision : 00000001
> > >> > > [01Ch 0028   4]              Asl Compiler ID : " "
> > >> > > [020h 0032   4]        Asl Compiler Revision : 00000001
> > >> > >
> > >> > > [024h 0036   4]                     Reserved : 00000000
> > >> > >
> > >> > > Raw Table Data: Length 40 (0x28)
> > >> > >
> > >> > >   0000: 4E 46 49 54 28 00 00 00 01 B2 53 55 50 45 52 4D  //
> > >> > NFIT(.....SUPERM
> > >> > >   0010: 53 4D 43 49 2D 2D 4D 42 01 00 00 00 01 00 00 00  //
> > >> > SMCI--MB........
> > >> > >   0020: 01 00 00 00 00 00 00 00
> > >> > >
> > >> > > As you can see the memory region info is missing.
> > >> > >
> > >> > > This specific check was done on a supermicro server.
> > >> > > We also performed a bios update but the results were the same.
> > >> > >
> > >> > > As said before ,the pmem devices are detected correctly and we
> > >> verified
> > >> > > that they correspond to different numa nodes using the PCM
> > >> > utility.However,
> > >> > >  linux still reports both pmem devices to be on the same numa -
> Numa
> > >> 0.
> > >> > >
> > >> > > If this information is missing, why pmem devices and address
> ranges
> > >> are
> > >> > > still detected correctly?
> > >> >
> > >> > I suspect your BIOS might be using E820-type-12 to describe the pmem
> > >> > ranges which is not compliant with the ACPI specification and would
> > >> > need a BIOS change.
> > >> >
> > >> > > Is there another table that we need to check?
> > >> >
> > >> > You can dump /proc/iomem.  If it shows "Persistent Memory (legacy)"
> > >> > then the BIOS is using the E820-type-12 description scheme which
> does
> > >> > not include NUMA information.
> > >> >
> > >> _______________________________________________
> > >> Linux-nvdimm mailing list
> > >> Linux-nvdimm@lists.01.org
> > >> https://lists.01.org/mailman/listinfo/linux-nvdimm
> > >>
> > >
> > _______________________________________________
> > Linux-nvdimm mailing list
> > Linux-nvdimm@lists.01.org
> > https://lists.01.org/mailman/listinfo/linux-nvdimm
> >
> _______________________________________________
> Linux-nvdimm mailing list
> Linux-nvdimm@lists.01.org
> https://lists.01.org/mailman/listinfo/linux-nvdimm
>
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


* Re: Detecting NUMA per pmem
  2017-12-31  8:23               ` Yigal Korman
@ 2018-01-09 22:25                 ` Oren Berman
  2018-01-09 23:05                   ` Dan Williams
  0 siblings, 1 reply; 16+ messages in thread
From: Oren Berman @ 2018-01-09 22:25 UTC (permalink / raw)
  To: Yigal Korman; +Cc: linux-nvdimm

Hi

I would like to know if you encountered such a problem.

We are accessing the nvram as memory from within the kernel.
By mapping a dax device and reading its mapping, we can learn the physical
address of the nvram. As a result, we can access this address range in the
kernel by calling phys_to_virt.
This works in most cases, but we saw an issue where, after a reboot, when
trying to read the info saved on the nvram before the power-off, one kernel
thread was able to read from this range but another kernel thread got a page
fault.

This is not recreated very easily, and we need to run many reboot sequences
to hit this failure again.
Are you aware of any mapping issues of nvram into kernel space?

Thanks for any suggestions you might have.
BR
Oren

On 31 December 2017 at 10:23, Yigal Korman <yigal@plexistor.com> wrote:

> You can try to force a legacy pmem device with memmap=XX!YY kernel
> parameter.
>
> On Thu, Dec 28, 2017 at 8:16 PM, Dan Williams <dan.j.williams@intel.com>
> wrote:
>
>> Type-7 only tells the kernel to reserve the memory range. NFIT carves that
>> reservation into pmem devices. Type-12 skips the reservation and creates a
>> pmem device directly. There is no workaround if the platform only has a
>> BIOS that produces a type-12 range.
>>
>>
>> On Thursday, December 28, 2017, Oren Berman <oren@lightbitslabs.com>
>> wrote:
>>
>> > Thanks Dan.
>> > I understand the shortcomings of using legacy mode but currently my
>> problem
>> > is that TYPE 12 is detected and I can use dax even in legacy mode but
>> for
>> > some reason type 7 is not. Is there a way to force it be treated as
>> legacy
>> > as well.
>> > The reason I am asking is that I am not sure I can change my bios and I
>> > know at least that type 12 NVDIMM is working for me.
>> >
>> > BR
>> > Oren
>> >
>> >
>> >
>> > On 28 December 2017 at 11:14, Dan Williams <dan.j.williams@intel.com>
>> > wrote:
>> >
>> > > [sent from my phone, forgive formatting]
>> > >
>> > > Your BIOS would need to put SPA range entries in the ACPI NFIT. The
>> > > problem with legacy pmem ranges in the e820 table is that it omits
>> > critical
>> > > details like battery status and whether the platform supports flushing
>> > > memory controller buffers at power loss (ADR).
>> > >
>> > > The NFIT can also reliably communicate NUMA information  for NVDIMMs
>> that
>> > > e820 does not.
>> > >
>> > > On Wednesday, December 27, 2017, Oren Berman <oren@lightbitslabs.com>
>> > > wrote:
>> > >
>> > >> Hi
>> > >>
>> > >> I have a question regrading NVDIMM detection.
>> > >>
>> > >> When we are working with NVDIMM of type 12 it is detected by the
>> linux
>> > in
>> > >> legacy mode and we can
>> > >> accesses it as pmem or dax device. we have an e820 bios.
>> > >>
>> > >> When we are using a type 7 NVDIMM it is reported by the linux as
>> > >> persistence type 7 memory but there is no pmem or dax device created.
>> > >> Linux Kernel identifies this memory in the e820 table but it does not
>> > >> trigger nvdimm probe for it.
>> > >> Do you know what could be the cause? Is their a workaround for that?
>> > >> Can it still be treated as legacy mode so we can access it through
>> > >> pmem/dax
>> > >> device?
>> > >>
>> > >> BR
>> > >> Oren Berman
>> > >>
>> > >> On 22 October 2017 at 16:52, Dan Williams <dan.j.williams@intel.com>
>> > >> wrote:
>> > >>
>> > >> > On Sun, Oct 22, 2017 at 4:33 AM, Oren Berman <
>> oren@lightbitslabs.com>
>> > >> > wrote:
>> > >> > > Hi Ross
>> > >> > >
>> > >> > > Thanks for the speedy reply. I am also adding the public list to
>> > this
>> > >> > > thread as you suggested.
>> > >> > >
>> > >> > > We have tried to dump the SPA table and this is what we get:
>> > >> > >
>> > >> > > /*
>> > >> > >  * Intel ACPI Component Architecture
>> > >> > >  * AML/ASL+ Disassembler version 20160108-64
>> > >> > >  * Copyright (c) 2000 - 2016 Intel Corporation
>> > >> > >  *
>> > >> > >  * Disassembly of NFIT, Sun Oct 22 10:46:19 2017
>> > >> > >  *
>> > >> > >  * ACPI Data Table [NFIT]
>> > >> > >  *
>> > >> > >  * Format: [HexOffset DecimalOffset ByteLength]  FieldName :
>> > >> FieldValue
>> > >> > >  */
>> > >> > >
>> > >> > > [000h 0000   4]                    Signature : "NFIT"    [NVDIMM
>> > >> Firmware
>> > >> > > Interface Table]
>> > >> > > [004h 0004   4]                 Table Length : 00000028
>> > >> > > [008h 0008   1]                     Revision : 01
>> > >> > > [009h 0009   1]                     Checksum : B2
>> > >> > > [00Ah 0010   6]                       Oem ID : "SUPERM"
>> > >> > > [010h 0016   8]                 Oem Table ID : "SMCI--MB"
>> > >> > > [018h 0024   4]                 Oem Revision : 00000001
>> > >> > > [01Ch 0028   4]              Asl Compiler ID : " "
>> > >> > > [020h 0032   4]        Asl Compiler Revision : 00000001
>> > >> > >
>> > >> > > [024h 0036   4]                     Reserved : 00000000
>> > >> > >
>> > >> > > Raw Table Data: Length 40 (0x28)
>> > >> > >
>> > >> > >   0000: 4E 46 49 54 28 00 00 00 01 B2 53 55 50 45 52 4D  //
>> > >> > NFIT(.....SUPERM
>> > >> > >   0010: 53 4D 43 49 2D 2D 4D 42 01 00 00 00 01 00 00 00  //
>> > >> > SMCI--MB........
>> > >> > >   0020: 01 00 00 00 00 00 00 00
>> > >> > >
>> > >> > > As you can see the memory region info is missing.
>> > >> > >
>> > >> > > This specific check was done on a supermicro server.
>> > >> > > We also performed a bios update but the results were the same.
>> > >> > >
>> > >> > > As said before ,the pmem devices are detected correctly and we
>> > >> verified
>> > >> > > that they correspond to different numa nodes using the PCM
>> > >> > utility.However,
>> > >> > >  linux still reports both pmem devices to be on the same numa -
>> Numa
>> > >> 0.
>> > >> > >
>> > >> > > If this information is missing, why pmem devices and address
>> ranges
>> > >> are
>> > >> > > still detected correctly?
>> > >> >
>> > >> > I suspect your BIOS might be using E820-type-12 to describe the
>> pmem
>> > >> > ranges which is not compliant with the ACPI specification and would
>> > >> > need a BIOS change.
>> > >> >
>> > >> > > Is there another table that we need to check?
>> > >> >
>> > >> > You can dump /proc/iomem.  If it shows "Persistent Memory (legacy)"
>> > >> > then the BIOS is using the E820-type-12 description scheme which
>> does
>> > >> > not include NUMA information.
>> > >> >
>> > >> _______________________________________________
>> > >> Linux-nvdimm mailing list
>> > >> Linux-nvdimm@lists.01.org
>> > >> https://lists.01.org/mailman/listinfo/linux-nvdimm
>> > >>
>> > >
>> > _______________________________________________
>> > Linux-nvdimm mailing list
>> > Linux-nvdimm@lists.01.org
>> > https://lists.01.org/mailman/listinfo/linux-nvdimm
>> >
>> _______________________________________________
>> Linux-nvdimm mailing list
>> Linux-nvdimm@lists.01.org
>> https://lists.01.org/mailman/listinfo/linux-nvdimm
>>
>
>
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


* Re: Detecting NUMA per pmem
  2018-01-09 22:25                 ` Oren Berman
@ 2018-01-09 23:05                   ` Dan Williams
  2018-01-10  7:21                     ` Oren Berman
  0 siblings, 1 reply; 16+ messages in thread
From: Dan Williams @ 2018-01-09 23:05 UTC (permalink / raw)
  To: Oren Berman; +Cc: linux-nvdimm

On Tue, Jan 9, 2018 at 2:25 PM, Oren Berman <oren@lightbitslabs.com> wrote:
> Hi
>
> I would like to know if you encountered such a problem.
>
> We are accessing the nvram as memory from withing the kernel.
> By mapping dax device and reading its mapping we can know the physical
> address of the nvram.
> As a result we can access this address range in the kernel by calling
> phys_to_virt.
> This  is working in most case but we saw some issue that after reboot, when
> trying to read the info saved
> on the nvram before the power off, one kernel thread was able to read
> from this range but another kernel thread got page fault.
>
> This is not recreated very easily and we need run many reboot sequences to
> get this failure again.
> Are you aware of any mapping issues of nvram to kernel space?

When are you using phys_to_virt()? That will only return a valid
virtual address as long as the driver is loaded. It sounds like you
may be losing a race with the driver setting up or tearing down the
mappings.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


* Re: Detecting NUMA per pmem
  2018-01-09 23:05                   ` Dan Williams
@ 2018-01-10  7:21                     ` Oren Berman
  2018-01-10 13:13                       ` Oren Berman
  0 siblings, 1 reply; 16+ messages in thread
From: Oren Berman @ 2018-01-10  7:21 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-nvdimm

Hi Dan

Which driver are you referring to?
If it is the dax driver, then it is always loaded - we see /dev/dax0.
If you are referring to the user-space application that called mmap on the
dax device, then that application is not running anymore.
We used this application to get the virtual address mapping (doing mmap on
the dax device) and then, by going over the /proc pagemap, we got the
physical address.
After that the application terminates, and we pass this physical address to
our kernel thread.
Then, from the kernel thread, we access this range by using phys_to_virt (we
know the physical address, so we convert it to virtual).
As far as I know, once in kernel space the whole address range should be
mapped in the kernel page tables (on a 64-bit architecture, of course), and
thus accessible using phys_to_virt.
Is this a wrong assumption when dealing with NVRAM?
If I know the physical address of the nvram, isn't it accessible from the
kernel using the simple conversion of phys_to_virt?
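
For reference, a minimal sketch of that user-space translation step (the
device path, mapping size, and lack of error handling are simplifications,
and reading PFNs from pagemap requires root):

/* Sketch: mmap a device-dax node, touch it so a page table entry exists,
 * then look up the physical address through /proc/self/pagemap. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        size_t len = 2UL << 20;                 /* assumed 2 MiB mapping   */
        int fd = open("/dev/dax0.0", O_RDWR);   /* assumed device node     */
        char *va = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

        va[0] = 0;                              /* fault the first page in */

        int pm = open("/proc/self/pagemap", O_RDONLY);
        uint64_t entry;
        pread(pm, &entry, sizeof(entry), ((uintptr_t)va / 4096) * sizeof(entry));

        if (entry & (1ULL << 63)) {             /* bit 63: page present    */
                uint64_t pfn = entry & ((1ULL << 55) - 1);  /* bits 0-54: PFN */
                printf("phys = 0x%jx\n", (uintmax_t)pfn * 4096);
        }
        return 0;
}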

Thanks
Oren




On 10 January 2018 at 01:05, Dan Williams <dan.j.williams@intel.com> wrote:

> On Tue, Jan 9, 2018 at 2:25 PM, Oren Berman <oren@lightbitslabs.com>
> wrote:
> > Hi
> >
> > I would like to know if you encountered such a problem.
> >
> > We are accessing the nvram as memory from withing the kernel.
> > By mapping dax device and reading its mapping we can know the physical
> > address of the nvram.
> > As a result we can access this address range in the kernel by calling
> > phys_to_virt.
> > This  is working in most case but we saw some issue that after reboot,
> when
> > trying to read the info saved
> > on the nvram before the power off, one kernel thread was able to read
> > from this range but another kernel thread got page fault.
> >
> > This is not recreated very easily and we need run many reboot sequences
> to
> > get this failure again.
> > Are you aware of any mapping issues of nvram to kernel space?
>
> When are you using phys_to_virt()? That will only return a valid
> virtual address as long as the driver is loaded. It sounds like you
> may be losing a race with the driver setting up or tearing down the
> mappings.
>
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


* Re: Detecting NUMA per pmem
  2018-01-10  7:21                     ` Oren Berman
@ 2018-01-10 13:13                       ` Oren Berman
  2018-01-10 14:51                         ` Dan Williams
  0 siblings, 1 reply; 16+ messages in thread
From: Oren Berman @ 2018-01-10 13:13 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-nvdimm

Hi

A few more updates:

a) This issue does not happen often; I need to do a lot of power cycles for
it to happen.
b) We use 2 NUMA nodes, but for some reason it happens only on the second
node.
c) We added a debug feature: when a page fault occurs, we can trigger a
kernel thread from the command line (through configfs) that reads from any
physical address we give it. When we give it the faulty address, the thread
succeeds in reading from it with no page fault.
We verified that it accessed the same virtual address that caused the page
fault.

BR
Oren


On 10 January 2018 at 09:21, Oren Berman <oren@lightbitslabs.com> wrote:

> Hi Dan
>
> Which driver are you referring to?
> If it is the dax driver than it is always loaded - we see /dev/dax0.
> If you refer to the user space application which called the mmap on the
> dax device then this application is not running anymore.
> We used this application to get the virtual address mapping(doing mmap on
> dax) and then by going over the proc pagemap we got the physical address.
> After that the application terminates and we pass this physical address to
> our kernel thread .
> Then from the kernel thread we access this range by using phys_to_virt (we
> know the physical so we convert it virtual).
> As far as I know once in kernel space all address range  should be mapped
> to the kernel page tables in 64 bit architecture ofcourse,
> thus accessible using phys_to_virt.
> Is this a wrong assumption when dealing with NVRAM?
> If I know the physical address of the nvram isn't it accessible from the
> kernel  using the simple conversion of phys_to_virt?
>
> Thanks
> Oren
>
>
>
>
> On 10 January 2018 at 01:05, Dan Williams <dan.j.williams@intel.com>
> wrote:
>
>> On Tue, Jan 9, 2018 at 2:25 PM, Oren Berman <oren@lightbitslabs.com>
>> wrote:
>> > Hi
>> >
>> > I would like to know if you encountered such a problem.
>> >
>> > We are accessing the nvram as memory from withing the kernel.
>> > By mapping dax device and reading its mapping we can know the physical
>> > address of the nvram.
>> > As a result we can access this address range in the kernel by calling
>> > phys_to_virt.
>> > This  is working in most case but we saw some issue that after reboot,
>> when
>> > trying to read the info saved
>> > on the nvram before the power off, one kernel thread was able to read
>> > from this range but another kernel thread got page fault.
>> >
>> > This is not recreated very easily and we need run many reboot sequences
>> to
>> > get this failure again.
>> > Are you aware of any mapping issues of nvram to kernel space?
>>
>> When are you using phys_to_virt()? That will only return a valid
>> virtual address as long as the driver is loaded. It sounds like you
>> may be losing a race with the driver setting up or tearing down the
>> mappings.
>>
>
>
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


* Re: Detecting NUMA per pmem
  2018-01-10 13:13                       ` Oren Berman
@ 2018-01-10 14:51                         ` Dan Williams
  2018-01-10 15:23                           ` Oren Berman
  0 siblings, 1 reply; 16+ messages in thread
From: Dan Williams @ 2018-01-10 14:51 UTC (permalink / raw)
  To: Oren Berman; +Cc: linux-nvdimm

On Wed, Jan 10, 2018 at 5:13 AM, Oren Berman <oren@lightbitslabs.com> wrote:
> Hi
>
> A few more updates:
>
> a) This issue does not happen a lot I need to do a lot of power cycles for
> this to happen.
> b) We use 2 numa nodes but for some reason it happens only on the second
> numa.
> c) We added a debug feature that when page fault occurs we will trigger a
> thread from command line in the kernel(through configfs) that will read
> from
> any physical address that we give it. When we give it the faulty address -
> the thread succeeds reading from it with no page fault.
> We verified that it accessed the same virtual address that caused the page
> fault.

Does the problem go away if you specify:

    nokaslr

...on the kernel command line?
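
For example, on a GRUB-based distribution this typically means appending the
option to GRUB_CMDLINE_LINUX in /etc/default/grub and regenerating the
config (exact file locations and commands vary by distro):

  GRUB_CMDLINE_LINUX="... nokaslr"
  # grub2-mkconfig -o /boot/grub2/grub.cfg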
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


* Re: Detecting NUMA per pmem
  2018-01-10 14:51                         ` Dan Williams
@ 2018-01-10 15:23                           ` Oren Berman
  2018-01-10 16:38                             ` Dan Williams
  0 siblings, 1 reply; 16+ messages in thread
From: Oren Berman @ 2018-01-10 15:23 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-nvdimm

Now to all of the forum

Hi Dan

Thanks we are going to try this.

Can you explain why this could cause the issue - is the NVDIMM memory space
also being randomized?
Is it done at runtime?

BR
Oren

On 10 January 2018 at 16:51, Dan Williams <dan.j.williams@intel.com> wrote:

> On Wed, Jan 10, 2018 at 5:13 AM, Oren Berman <oren@lightbitslabs.com>
> wrote:
> > Hi
> >
> > A few more updates:
> >
> > a) This issue does not happen a lot I need to do a lot of power cycles
> for
> > this to happen.
> > b) We use 2 numa nodes but for some reason it happens only on the second
> > numa.
> > c) We added a debug feature that when page fault occurs we will trigger a
> > thread from command line in the kernel(through configfs) that will read
> > from
> > any physical address that we give it. When we give it the faulty address
> -
> > the thread succeeds reading from it with no page fault.
> > We verified that it accessed the same virtual address that caused the
> page
> > fault.
>
> Does the problem go away if you specify:
>
>     nokaslr
>
> ...on the kernel command line?
>
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


* Re: Detecting NUMA per pmem
  2018-01-10 15:23                           ` Oren Berman
@ 2018-01-10 16:38                             ` Dan Williams
  2018-01-10 17:41                               ` Oren Berman
  0 siblings, 1 reply; 16+ messages in thread
From: Dan Williams @ 2018-01-10 16:38 UTC (permalink / raw)
  To: Oren Berman; +Cc: linux-nvdimm

On Wed, Jan 10, 2018 at 7:23 AM, Oren Berman <oren@lightbitslabs.com> wrote:
> Now to all of the forum
>
> Hi Dan
>
> Thanks we are going to try this.
>
> Can you explain why this can cause this issue - is the NVDIMM memory space
> also being randomized?
> Is it done during runtime?
>

Yes, kaslr randomizes the direct map. We have seen problems with it in
the past relative to setting up pmem mappings. We fixed one such bug
with this commit:

fc5f9d5f151c x86/mm: Fix boot crash caused by incorrect loop count
calculation in sync_global_pgds()

...but it appears we may have another bug in this area.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


* Re: Detecting NUMA per pmem
  2018-01-10 16:38                             ` Dan Williams
@ 2018-01-10 17:41                               ` Oren Berman
  2018-06-29  5:17                                 ` Oren Berman
  0 siblings, 1 reply; 16+ messages in thread
From: Oren Berman @ 2018-01-10 17:41 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-nvdimm

Hi
Thanks for your answer
If we do a memremap on the physical address of the nvram from within the
kernel to get a new virtual address mapping, will it lock the mapping?
Can this also be a workaround?
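
Roughly, a sketch of what we have in mind (the physical base and size below
are placeholders, not our real values):

/* Sketch only: map a known pmem physical range with memremap() instead of
 * relying on phys_to_virt() and the direct map. */
#include <linux/io.h>
#include <linux/module.h>

static void *nvram_va;

static int __init nvram_map_init(void)
{
        phys_addr_t base = 0x3800000000ULL;  /* placeholder: 224 GiB offset */
        size_t size = 32UL << 30;            /* placeholder: 32 GiB range   */

        nvram_va = memremap(base, size, MEMREMAP_WB);
        if (!nvram_va)
                return -ENOMEM;
        return 0;
}

static void __exit nvram_map_exit(void)
{
        memunmap(nvram_va);
}

module_init(nvram_map_init);
module_exit(nvram_map_exit);
MODULE_LICENSE("GPL");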
Oren

Sent from my iPhone

On 10 Jan 2018, at 18:38, Dan Williams <dan.j.williams@intel.com> wrote:

>> On Wed, Jan 10, 2018 at 7:23 AM, Oren Berman <oren@lightbitslabs.com> wrote:
>> Now to all of the forum
>> 
>> Hi Dan
>> 
>> Thanks we are going to try this.
>> 
>> Can you explain why this can cause this issue - is the NVDIMM memory space
>> also being randomized?
>> Is it done during runtime?
>> 
> 
> Yes, kaslr randomizes the direct map. We have seen problems with it in
> the past relative to setting up pmem mappings. We fixed one such bug
> with this commit:
> 
> fc5f9d5f151c x86/mm: Fix boot crash caused by incorrect loop count
> calculation in sync_global_pgds()
> 
> ...but it appears we may have another bug in this area.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


* Re: Detecting NUMA per pmem
  2018-01-10 17:41                               ` Oren Berman
@ 2018-06-29  5:17                                 ` Oren Berman
  0 siblings, 0 replies; 16+ messages in thread
From: Oren Berman @ 2018-06-29  5:17 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-nvdimm

Hi All

We encountered a strange issue using pmem emulation.
We configure emulated pmem devices in our system by adding these parameters
to the boot command:
memmap=32G!224G memmap=32G!480G

PMEM devices look OK.

This issue is specific to the 4.14 kernel AND CentOS 7.5; other combinations
follow exactly the same flow but don't crash.
It happens when ndctl converts an <emulated> /dev/pmem0 to /dev/dax0.0 with
the command: ndctl create-namespace -f -e namespace0.0 --type=pmem --mode=dax

This crashes the kernel in udev when trying to free a page in the
do_munmap() syscall, and it then hangs on the unbind:
echo namespace0.0 >
/sys/devices/platform/e820_pmem/ndbus0/region0/namespace0.0/driver/unbind
Call trace (from ndctl):
ndctl_unbind()
ndctl_namespace_disable()
ndctl_namespace_disable_invalidate()
ndctl_namespace_disable_safe()
namespace_destroy()
namespace_reconfig()
do_xaction_namespace()

When we use, for example, Ubuntu 16.04 with this kernel version, it does not
happen.
When we use CentOS 7.5 and kernel version 4.9, it also does not happen.
When working with actual NVDIMMs this does not happen either.

Did you encounter such an issue?
Why does the kernel think that this area is mapped, and who might be mapping
it?
If no one maps it, why does the kernel have an indication that these pages
are mapped?

Any help would be highly appreciated.
Thanks
Oren Berman



On 10 January 2018 at 09:41, Oren Berman <oren@lightbitslabs.com> wrote:

> Hi
> Thanks for your answer
> If we do memremap on the physical address
> Of the nvram from within the kernel to get a new virtual address mapping
> will it lock the mapping?
> Can this be also a workaround?
> Oren
>
> Sent from my iPhone
>
> On 10 Jan 2018, at 18:38, Dan Williams <dan.j.williams@intel.com>
> wrote:
>
> On Wed, Jan 10, 2018 at 7:23 AM, Oren Berman <oren@lightbitslabs.com>
> wrote:
>
> Now to all of the forum
>
>
> Hi Dan
>
>
> Thanks we are going to try this.
>
>
> Can you explain why this can cause this issue - is the NVDIMM memory space
>
> also being randomized?
>
> Is it done during runtime?
>
>
>
> Yes, kaslr randomizes the direct map. We have seen problems with it in
> the past relative to setting up pmem mappings. We fixed one such bug
> with this commit:
>
> fc5f9d5f151c x86/mm: Fix boot crash caused by incorrect loop count
> calculation in sync_global_pgds()
>
> ...but it appears we may have another bug in this area.
>
>
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


end of thread

Thread overview: 16+ messages
     [not found] <CAN=ZobSmQ97gRKFaho7DPvXVD18W4145JfTQ5Ncf80Tw17fkGA@mail.gmail.com>
     [not found] ` <20171020162227.GA8576@linux.intel.com>
2017-10-22 11:33   ` Detecting NUMA per pmem Oren Berman
2017-10-22 13:52     ` Dan Williams
2017-12-27 18:53       ` Oren Berman
2017-12-28  9:14         ` Dan Williams
2017-12-28 10:03           ` Oren Berman
2017-12-28 18:16             ` Dan Williams
2017-12-31  8:23               ` Yigal Korman
2018-01-09 22:25                 ` Oren Berman
2018-01-09 23:05                   ` Dan Williams
2018-01-10  7:21                     ` Oren Berman
2018-01-10 13:13                       ` Oren Berman
2018-01-10 14:51                         ` Dan Williams
2018-01-10 15:23                           ` Oren Berman
2018-01-10 16:38                             ` Dan Williams
2018-01-10 17:41                               ` Oren Berman
2018-06-29  5:17                                 ` Oren Berman
