* [RFC][nvdimm][crash] pmem memmap dump support
@ 2023-02-23  6:24 ` lizhijian
From: lizhijian @ 2023-02-23  6:24 UTC (permalink / raw)
  To: kexec, nvdimm, linux-mm
  Cc: Baoquan He, vgoyal, dyoung, vishal.l.verma, dan.j.williams,
	dave.jiang, horms, k-hagio-ab, akpm, Yasunori Gotou (Fujitsu),
	yangx.jy, ruansy.fnst

Hello folks,

This mail raises a requirement for dumping the pmem memmap and sketches some possible solutions,
all of which are still premature. We would really appreciate your feedback.

In this mail, "pmem memmap" and "pmem metadata" are used interchangeably.

### Background and motivation ###
---
Crash dump is an important feature for kernel troubleshooting. It is the last resort for finding out
what happened at the time of a kernel panic, slowdown, and so on, and it is the most important tool
for customer support. However, part of the data on pmem is not included in the crash dump, which can
make it difficult to analyze problems around pmem (especially filesystem DAX).


A pmem namespace in "fsdax" or "devdax" mode requires the allocation of per-page metadata[1]. The
allocation can be drawn from either "mem" (system memory) or "dev" (the pmem device itself); see
`ndctl help create-namespace` for more details. In fsdax, the struct page array becomes very
important: it is one of the key pieces of data for finding the status of the reverse map.
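
For reference, the two metadata placements correspond to the --map= option of `ndctl create-namespace`
(illustrative invocations only; see the man page for the full set of options):

    ndctl create-namespace --mode=fsdax --map=mem   # per-page metadata in system RAM
    ndctl create-namespace --mode=fsdax --map=dev   # per-page metadata on the pmem device itself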

So when the metadata is stored on pmem, even pmem's per-page metadata will not be dumped. That means
troubleshooters are unable to check further details about pmem from the dumpfile.

### Adding pmem memmap dump support ###
---
Our goal is that, no matter whether the metadata is stored in mem or on pmem, the metadata can be
dumped so that the crash utilities can read more details about the pmem. Of course, this feature can
be enabled/disabled.

First, based on our previous investigation, we can divide the problem into the following four cases,
A, B, C and D, according to the location of the metadata and the scope of the dump.
Note that although cases A&B are mentioned below, we do not want them to be part of this feature,
because dumping the entire pmem would consume a lot of space and, more importantly, it may contain
sensitive user data.

+-------------+-----------------------+
|             |  metadata location    |
+-------------+----------+------------+
| dump scope  |  mem     |   PMEM     |
+-------------+----------+------------+
| entire pmem |     A    |     B      |
+-------------+----------+------------+
| metadata    |     C    |     D      |
+-------------+----------+------------+

Case A&B: unsupported
- Only the regions listed in the vmcore's PT_LOAD entries are dumpable. This could be resolved by
adding the pmem region to the vmcore's PT_LOADs in kexec-tools.
- makedumpfile assumes that all page objects of the entire regions described in the PT_LOADs are
readable, and then skips/excludes specific pages according to their attributes. But in the case of
pmem, the 1st kernel only allocates page objects for the namespaces of the pmem, so makedumpfile
throws errors[2] when certain -d options are specified.
Accordingly, we would have to make makedumpfile ignore these errors for pmem regions.

Because the cases above are not our goal, we must consider how to prevent the data part of pmem from
being read by the dump application (makedumpfile).

Case C: supported natively
The metadata is stored in mem, and the entire mem/RAM is dumpable.

Case D: unsupported && needs your input
To support this case, makedumpfile needs to know the location of the metadata for each pmem
namespace, i.e. the address range [start, end) of the metadata within the pmem.
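
To illustrate what makedumpfile would do with that information, here is a minimal sketch (not
makedumpfile's actual code; the struct and function names are made up): once the metadata range of
each namespace is known, deciding whether a pmem page must be dumped (metadata) or excluded (data)
is just a range check.

#include <stdbool.h>
#include <stdint.h>

/* hypothetical representation of one namespace's metadata range */
struct pmem_meta_range {
	uint64_t start;		/* physical address, inclusive */
	uint64_t end;		/* physical address, exclusive */
};

/* return true if the physical address falls inside any metadata range */
static bool paddr_is_pmem_metadata(uint64_t paddr,
				   const struct pmem_meta_range *ranges,
				   int nr_ranges)
{
	for (int i = 0; i < nr_ranges; i++) {
		if (paddr >= ranges[i].start && paddr < ranges[i].end)
			return true;
	}
	return false;
}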

We have thought of a few possible options:

1) In the 2nd kernel, use the information in /sys/bus/nd/devices/{namespaceX.Y, daxX.Y, pfnX.Y}
exported by the pmem drivers so that makedumpfile can calculate the address and size of the metadata.
2) In the 1st kernel, add a new symbol to the vmcore that describes the layout of each namespace.
makedumpfile reads the symbol and figures out the address and size of the metadata (a rough sketch
follows right after this list).
3) Others?
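
Below is a rough sketch of what option 2 might look like on the kernel side. None of these names
exist today (pmem_meta_ranges, pmem_meta_save_vmcoreinfo and the "PMEM_METADATA=" note are all made
up), and when/how the pmem driver would record the ranges and get the vmcoreinfo note rebuilt is
exactly the open part of the proposal; the snippet only shows the kind of entry makedumpfile could
parse.

#include <linux/crash_core.h>
#include <linux/types.h>

/* hypothetical: one entry per pmem namespace whose memmap lives on pmem */
struct pmem_meta_range {
	u64 start;	/* physical start of the per-page metadata */
	u64 end;	/* physical end, exclusive */
};

static struct pmem_meta_range pmem_meta_ranges[16];
static int nr_pmem_meta_ranges;

/* emit one vmcoreinfo line per namespace, e.g. "PMEM_METADATA=0x...-0x..." */
static void pmem_meta_save_vmcoreinfo(void)
{
	int i;

	for (i = 0; i < nr_pmem_meta_ranges; i++)
		vmcoreinfo_append_str("PMEM_METADATA=0x%llx-0x%llx\n",
				      (unsigned long long)pmem_meta_ranges[i].start,
				      (unsigned long long)pmem_meta_ranges[i].end);
}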

But then we realized that we had overlooked one use case: the user could save the dumpfile to the
pmem itself. Neither of the two options above solves this problem, because the pmem drivers
re-initialize the metadata while they are loading, which means the metadata we dump would be
inconsistent with the metadata at the moment of the crash.
Could we simply disable the pmem in the 2nd kernel so that the previous metadata is not destroyed?
That would have the inconvenient side effect that the 2nd kernel could no longer store the dumpfile
on a filesystem/partition backed by pmem.

So I hope you can provide some ideas about this feature/requirement and about possible solutions for
cases A, B and D mentioned above; it would be greatly appreciated.

If I'm missing something, feel free to let me know. Any feedback and comments are very welcome.


[1] Pmem region layout:
   ^<--namespace0.0---->^<--namespace0.1------>^
   |                    |                      |
   +--+m----------------+--+m------------------+---------------------+-+a
   |++|e                |++|e                  |                     |+|l
   |++|t                |++|t                  |                     |+|i
   |++|a                |++|a                  |                     |+|g
   |++|d  namespace0.0  |++|d  namespace0.1    |     un-allocated    |+|n
   |++|a    fsdax       |++|a     devdax       |                     |+|m
   |++|t                |++|t                  |                     |+|e
   +--+a----------------+--+a------------------+---------------------+-+n
   |                                                                   |t
   v<-----------------------pmem region------------------------------->v

[2] https://lore.kernel.org/linux-mm/70F971CF-1A96-4D87-B70C-B971C2A1747C@roc.cs.umass.edu/T/


Thanks
Zhijian


* Re: [RFC][nvdimm][crash] pmem memmap dump support
  2023-02-23  6:24 ` lizhijian
@ 2023-02-28 14:03   ` Baoquan He
From: Baoquan He @ 2023-02-28 14:03 UTC (permalink / raw)
  To: lizhijian
  Cc: kexec, nvdimm, linux-mm, vgoyal, dyoung, vishal.l.verma,
	dan.j.williams, dave.jiang, horms, k-hagio-ab, akpm,
	Yasunori Gotou (Fujitsu),
	yangx.jy, ruansy.fnst

On 02/23/23 at 06:24am, lizhijian@fujitsu.com wrote:
> ......

1) On the kernel side, export info about the pmem metadata;
2) On the makedumpfile side, add an option to specify whether we want to
   dump the pmem metadata; should that be an option or a dump level?
3) In the glue script, detect and warn if the pmem metadata is on pmem and
   wanted, and the dump target is the same pmem.

Does this work for you?

Not sure if the above items are all doable. As for parking the pmem device
until we are in the kdump kernel, I believe the Intel pmem experts know how
to achieve that. If there's no way to park pmem across the kdump jump,
case D) is a daydream.

Thanks
Baoquan



* Re: [RFC][nvdimm][crash] pmem memmap dump support
  2023-02-28 14:03   ` Baoquan He
@ 2023-03-01  6:27     ` lizhijian
From: lizhijian @ 2023-03-01  6:27 UTC (permalink / raw)
  To: Baoquan He
  Cc: kexec, nvdimm, linux-mm, vgoyal, dyoung, vishal.l.verma,
	dan.j.williams, dave.jiang, horms, k-hagio-ab, akpm,
	Yasunori Gotou (Fujitsu),
	yangx.jy, ruansy.fnst



On 28/02/2023 22:03, Baoquan He wrote:
> On 02/23/23 at 06:24am, lizhijian@fujitsu.com wrote:
>> ......
> 


Hi Baoquan

Greatly appreciate your feedback.


> 1) On the kernel side, export info about the pmem metadata;
> 2) On the makedumpfile side, add an option to specify whether we want to
>     dump the pmem metadata; should that be an option or a dump level?

Yes, I'm working on these two steps.

> 3) In the glue script, detect and warn if the pmem metadata is on pmem and
>     wanted, and the dump target is the same pmem.
> 

By 'glue script', do you mean a script like '/usr/bin/kdump.sh' in the 2nd kernel? That would be an
option. Shall we abort the dump if "the pmem metadata is on pmem and wanted, and the dump target is
the same pmem"?


> Does this work for you?
> 
> Not sure if above items are all do-able. As for parking pmem device
> till in kdump kernel, I believe intel pmem expert know how to achieve
> that. If there's no way to park pmem during kdump jumping, case D) is
> daydream.

What does "kdump jumping" refer to here?
A. The 1st kernel crashes and jumps to the 2nd kernel, or
B. The 2nd/kdump kernel does the dump operation.

In my understanding, the dumping application (makedumpfile) in the kdump kernel will do the dump
operation after the modules are loaded. Does "parking pmem" mean postponing the loading of the pmem
modules until the dump operation has finished? If so, I think it has the same effect as disabling
the pmem device in the kdump kernel.


Thanks
Zhijian

> 
> Thanks
> Baoquan
> 


* Re: [RFC][nvdimm][crash] pmem memmap dump support
  2023-03-01  6:27     ` lizhijian
@ 2023-03-01  8:17       ` Baoquan He
From: Baoquan He @ 2023-03-01  8:17 UTC (permalink / raw)
  To: lizhijian
  Cc: kexec, nvdimm, linux-mm, vgoyal, dyoung, vishal.l.verma,
	dan.j.williams, dave.jiang, horms, k-hagio-ab, akpm,
	Yasunori Gotou (Fujitsu),
	yangx.jy, ruansy.fnst

On 03/01/23 at 06:27am, lizhijian@fujitsu.com wrote:
...... 
> Hi Baoquan
> 
> Greatly appreciate your feedback.
> 
> 
> > 1) In kernel side, export info of pmem meta data;
> > 2) in makedumpfile size, add an option to specify if we want to dump
> >     pmem meta data; An option or in dump level?
> 
> Yes, I'm working on these 2 step.
> 
> > 3) In glue script, detect and warn if pmem data is in pmem and wanted,
> >     and dump target is the same pmem.
> > 
> 
> The 'glue script' means the scirpt like '/usr/bin/kdump.sh' in 2nd kernel? That would be an option,
> Shall we abort this dump if "pmem data is in pmem and wanted, and dump target is the same pmem" ?

Guess you are talking about the scripts in RHEL/CentOS/Fedora, and yes if
I guess right. Other distros could have different scripts. For kdump, we
need to load the kdump kernel/initramfs in advance, then wait to capture
any crash. At load time, we can detect and check whether the environment
and setup are as expected; if not, we can warn or print an error message
to users. We don't need to postpone the checking until a crash is
triggered and then decide whether to abort the dump.

> > Does this work for you?
> > 
> > Not sure if above items are all do-able. As for parking pmem device
> > till in kdump kernel, I believe intel pmem expert know how to achieve
> > that. If there's no way to park pmem during kdump jumping, case D) is
> > daydream.
> 
> What's "kdump jumping" timing here ?
> A. 1st kernel crashed and jumping to 2nd kernel or
> B. 2nd/kdump kernel do the dump operation.
> 
> In my understanding, dumping application(makedumpfile) in kdump kernel will do the dump operation
> after modules loaded. Does "parking pmem" mean to postpone pmem modules loading until dump
> operation finished ? if so, i think it has the same effect with disabling pmem device in kdump kernel.

"Parking" was probably the wrong word. When a crash happens, we currently
only shut down the unrelated CPUs and the interrupt controller, but keep
other devices in flight. This is why we can preserve the contents of the
crashed kernel's memory. For normal memory, we reserve a small part as
crashkernel to run the kdump kernel and the dumping, keeping the 1st
kernel's memory untouched. For pmem, we may need to do something similar
to keep its content untouched. I am not sure whether disabling the pmem
device is the thing we need to do in the kdump kernel; what we want is
1) do not shut down pmem in the 1st kernel when it crashes;
2) do not re-initialize pmem, or at least do not remove its content.

1) is already the case with the current handling. Do we need to do
something to guarantee 2)? I don't know pmem well, just a personal thought.



* Re: [RFC][nvdimm][crash] pmem memmap dump support
  2023-03-01  8:17       ` Baoquan He
@ 2023-03-03  2:27         ` lizhijian
From: lizhijian @ 2023-03-03  2:27 UTC (permalink / raw)
  To: Baoquan He
  Cc: kexec, nvdimm, linux-mm, vgoyal, dyoung, vishal.l.verma,
	dan.j.williams, dave.jiang, horms, k-hagio-ab, akpm,
	Yasunori Gotou (Fujitsu),
	yangx.jy, ruansy.fnst



On 01/03/2023 16:17, Baoquan He wrote:
> On 03/01/23 at 06:27am, lizhijian@fujitsu.com wrote:
> ......
>> Hi Baoquan
>>
>> Greatly appreciate your feedback.
>>
>>
>>> 1) In kernel side, export info of pmem meta data;
>>> 2) in makedumpfile size, add an option to specify if we want to dump
>>>      pmem meta data; An option or in dump level?
>>
>> Yes, I'm working on these 2 step.
>>
>>> 3) In glue script, detect and warn if pmem data is in pmem and wanted,
>>>      and dump target is the same pmem.
>>>
>>
>> The 'glue script' means the scirpt like '/usr/bin/kdump.sh' in 2nd kernel? That would be an option,
>> Shall we abort this dump if "pmem data is in pmem and wanted, and dump target is the same pmem" ?
> 
> Guess you are saying scripts in RHEL/centos/fedora, and yes if I guess
> righ. Other distros could have different scripts. For kdump, we need
> load kdump kernel/initramfs in advance, then wait to capture any crash.
> When we load, we can detect and check whether the environment and
> setup is expected. If not, we can warn or error out message to users.


IIUC, take Fedora for example:
T1: in the 1st kernel, kdump.service (/usr/bin/kdumpctl) does a sanity check before loading the kernel and initramfs.
     At that moment, as you said, we can detect and check whether the environment and setup are as
     expected, and if not, warn or error out to users.
     I think we should abort the kdump service if "the pmem metadata is on pmem and wanted, and the dump target is the same pmem".
     OS administrators could then either change the dump target or disable the pmem metadata dump to make
     kdump.service work again.

But kdump.service is distro-specific; some OS administrators will use the `kexec` command directly instead of the service/script helpers.


> We don't need to do the checking until crash is triggered, then decide
> to abort the dump or not.

T2: in the 2nd kernel, since the 1st kernel's glue scripts vary by distribution, we have to do the sanity check again
to decide whether to abort the dump or not.



> 
>>> Does this work for you?
>>>
>>> Not sure if above items are all do-able. As for parking pmem device
>>> till in kdump kernel, I believe intel pmem expert know how to achieve
>>> that. If there's no way to park pmem during kdump jumping, case D) is
>>> daydream.
>>
>> What's "kdump jumping" timing here ?
>> A. 1st kernel crashed and jumping to 2nd kernel or
>> B. 2nd/kdump kernel do the dump operation.
>>
>> In my understanding, dumping application(makedumpfile) in kdump kernel will do the dump operation
>> after modules loaded. Does "parking pmem" mean to postpone pmem modules loading until dump
>> operation finished ? if so, i think it has the same effect with disabling pmem device in kdump kernel.
> 
> I used parking which should be wrong. When crash happened, we currently
> only shutdown unrelated CPU and interupt controller, but keep other
> devices on-flight. This is why we can preserve the content of crash-ed
> kernel's memory. For normal memory device, we reserve small part as
> crashkernel to run kdump kernel and dumping, keep the 1st kernel's
> memory untouched. For pmem, we may need to do something similar to keep
> its content untouched. I am not sure if disabling pmem device is the
> thing we need do in kdump kernel, what we want is
> 1) not shutdown pmem in 1st kernel when crash-ed
> 2) do not re-initialize pmem, at least do not remove its content
> 
> 1) has been there with the current handling. 

I think so.


> We need do something to
> guarantee 2)? I don't know pmem well, just personal thought.

Thanks for your idea, I will take a deeper look.


Thanks
Zhijian


> 


* Re: [RFC][nvdimm][crash] pmem memmap dump support
  2023-03-03  2:27         ` lizhijian
@ 2023-03-03  9:21           ` Baoquan He
From: Baoquan He @ 2023-03-03  9:21 UTC (permalink / raw)
  To: lizhijian
  Cc: kexec, nvdimm, linux-mm, vgoyal, dyoung, vishal.l.verma,
	dan.j.williams, dave.jiang, horms, k-hagio-ab, akpm,
	Yasunori Gotou (Fujitsu),
	yangx.jy, ruansy.fnst

On 03/03/23 at 02:27am, lizhijian@fujitsu.com wrote:
> 
> 
> On 01/03/2023 16:17, Baoquan He wrote:
> > On 03/01/23 at 06:27am, lizhijian@fujitsu.com wrote:
> > ......
> >> Hi Baoquan
> >>
> >> Greatly appreciate your feedback.
> >>
> >>
> >>> 1) In kernel side, export info of pmem meta data;
> >>> 2) in makedumpfile size, add an option to specify if we want to dump
> >>>      pmem meta data; An option or in dump level?
> >>
> >> Yes, I'm working on these 2 step.
> >>
> >>> 3) In glue script, detect and warn if pmem data is in pmem and wanted,
> >>>      and dump target is the same pmem.
> >>>
> >>
> >> The 'glue script' means the scirpt like '/usr/bin/kdump.sh' in 2nd kernel? That would be an option,
> >> Shall we abort this dump if "pmem data is in pmem and wanted, and dump target is the same pmem" ?
> > 
> > Guess you are saying scripts in RHEL/centos/fedora, and yes if I guess
> > righ. Other distros could have different scripts. For kdump, we need
> > load kdump kernel/initramfs in advance, then wait to capture any crash.
> > When we load, we can detect and check whether the environment and
> > setup is expected. If not, we can warn or error out message to users.
> 
> 
> IIUC, take fedora for example,
> T1: in 1st kernel, kdump.service(/usr/bin/kdumpctl) will do a sanity check before loading kernel and initramfs.
>      In this moment, as you said "we can detect and check whether the environment and setup is expected. If not,
>      we can warn or error out message to users."
>      I think we should abort the kdump service if "pmem data is in pmem and wanted, and dump target is the same pmem".
>      For OS administrators, they could either change the dump target or disable the pmem metadadata dump to make
>      kdump.service work again.
> 
> But kdump.service is distros independent, some OS administrators will use `kexec` command directly instead of service/script helpers.

Yeah, we can add documentation in the kernel or somewhere else saying that dumping to pmem is
dangerous, especially when we want to dump the pmem metadata. People who dare to use the kexec
command directly should handle it on their own.

> 
> > We don't need to do the checking until crash is triggered, then decide
> > to abort the dump or not.
> 
> T2: in 2nd kernel, since 1st kernel's glue scripts vary by distribution, we have to do the sanity check again to decide
> to abort the dump or not.

Hmm, we may not need to worry about that. The kernel just needs to do its own business: not touch
the pmem data during the kdump jump and boot, and provide a way to allow makedumpfile to read out
the pmem metadata. Anything else should be taken care of by users or distros.



* Re: [RFC][nvdimm][crash] pmem memmap dump support
  2023-02-23  6:24 ` lizhijian
@ 2023-03-07  2:05   ` HAGIO KAZUHITO(萩尾 一仁)
  -1 siblings, 0 replies; 24+ messages in thread
From: HAGIO KAZUHITO(萩尾 一仁) @ 2023-03-07  2:05 UTC (permalink / raw)
  To: lizhijian, kexec, nvdimm, linux-mm
  Cc: Baoquan He, vgoyal, dyoung, vishal.l.verma, dan.j.williams,
	dave.jiang, horms, akpm, Yasunori Gotou (Fujitsu),
	yangx.jy, ruansy.fnst

On 2023/02/23 15:24, lizhijian@fujitsu.com wrote:
> Hello folks,
> 
> This mail raises a pmem memmap dump requirement and possible solutions, but they are all still premature.
> I really hope you can provide some feedback.
> 
> pmem memmap can also be called pmem metadata here.
> 
> ### Background and motivate overview ###
> ---
> Crash dump is an important feature for trouble shooting of kernel. It is the final way to chase what
> happened at the kernel panic, slowdown, and so on. It is the most important tool for customer support.
> However, a part of data on pmem is not included in crash dump, it may cause difficulty to analyze
> trouble around pmem (especially Filesystem-DAX).
> 
> 
> A pmem namespace in "fsdax" or "devdax" mode requires allocation of per-page metadata[1]. The allocation
> can be drawn from either mem(system memory) or dev(pmem device), see `ndctl help create-namespace` for
> more details. In fsdax, struct page array becomes very important, it is one of the key data to find
> status of reverse map.
> 
> So, when metadata was stored in pmem, even pmem's per-page metadata will not be dumped. That means
> troubleshooters are unable to check more details about pmem from the dumpfile.
> 
> ### Make pmem memmap dump support ###
> ---
> Our goal is that whether metadata is stored on mem or pmem, its metadata can be dumped and then the
> crash-utilities can read more details about the pmem. Of course, this feature can be enabled/disabled.
> 
> First, based on our previous investigation, according to the location of metadata and the scope of
> dump, we can divide it into the following four cases: A, B, C, D.
> It should be noted that although we mentioned case A&B below, we do not want these two cases to be
> part of this feature, because dumping the entire pmem will consume a lot of space, and more importantly,
> it may contain user sensitive data.
> 
> +-------------+----------+------------+
> |\+--------+\     metadata location   |
> |            ++-----------------------+
> | dump scope  |  mem     |   PMEM     |
> +-------------+----------+------------+
> | entire pmem |     A    |     B      |
> +-------------+----------+------------+
> | metadata    |     C    |     D      |
> +-------------+----------+------------+
> 
> Case A&B: unsupported
> - Only the regions listed in PT_LOAD in vmcore are dumpable. This can be resolved by adding the pmem
> region into vmcore's PT_LOADs in kexec-tools.
> - For makedumpfile which will assume that all page objects of the entire region described in PT_LOADs
> are readable, and then skips/excludes the specific page according to its attributes. But in the case
> of pmem, 1st kernel only allocates page objects for the namespaces of pmem, so makedumpfile will throw
> errors[2] when specific -d options are specified.
> Accordingly, we should make makedumpfile to ignore these errors if it's pmem region.
> 
> Because these above cases are not in our goal, we must consider how to prevent the data part of pmem
> from reading by the dump application(makedumpfile).
> 
> Case C: native supported
> metadata is stored in mem, and the entire mem/ram is dumpable.
> 
> Case D: unsupported && need your input
> To support this situation, the makedumpfile needs to know the location of metadata for each pmem
> namespace and the address and size of metadata in the pmem [start, end)
> 
> We have thought of a few possible options:
> 
> 1) In the 2nd kernel, with the help of the information from /sys/bus/nd/devices/{namespaceX.Y, daxX.Y, pfnX.Y}
> exported by pmem drivers, makedumpfile is able to calculate the address and size of metadata
> 2) In the 1st kernel, add a new symbol to the vmcore. The symbol is associated with the layout of
> each namespace. The makedumpfile reads the symbol and figures out the address and size of the metadata.

Hi Zhijian,

Sorry, I probably don't understand this well enough, but do these mean that
  1. /proc/vmcore exports pmem regions with PT_LOADs, which contain
     unreadable ones, and
  2. makedumpfile gets to know the readable regions somehow?

Then /proc/vmcore with pmem cannot be captured by other commands,
e.g. cp command?

Thanks,
Kazu

> 3) others ?
> 
> But then we found that we have always ignored a user case, that is, the user could save the dumpfile
> to the pmem. Neither of these two options can solve this problem, because the pmem drivers will
> re-initialize the metadata during the pmem drivers loading process, which leads to the metadata
> we dumped is inconsistent with the metadata at the moment of the crash happening.
> Simply, can we just disable the pmem directly in 2nd kernel so that previous metadata will not be
> destroyed? But this operation will bring us inconvenience that 2nd kernel doesn’t allow user storing
> dumpfile on the filesystem/partition based on pmem.
> 
> So here I hope you can provide some ideas about this feature/requirement and on the possible solution
> for the cases A&B&D mentioned above, it would be greatly appreciated.
> 
> If I’m missing something, feel free to let me know. Any feedback & comment are very welcome.
> 
> 
> [1] Pmem region layout:
>     ^<--namespace0.0---->^<--namespace0.1------>^
>     |                    |                      |
>     +--+m----------------+--+m------------------+---------------------+-+a
>     |++|e                |++|e                  |                     |+|l
>     |++|t                |++|t                  |                     |+|i
>     |++|a                |++|a                  |                     |+|g
>     |++|d  namespace0.0  |++|d  namespace0.1    |     un-allocated    |+|n
>     |++|a    fsdax       |++|a     devdax       |                     |+|m
>     |++|t                |++|t                  |                     |+|e
>     +--+a----------------+--+a------------------+---------------------+-+n
>     |                                                                   |t
>     v<-----------------------pmem region------------------------------->v
> 
> [2] https://lore.kernel.org/linux-mm/70F971CF-1A96-4D87-B70C-B971C2A1747C@roc.cs.umass.edu/T/
> 
> 
> Thanks
> Zhijian

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][nvdimm][crash] pmem memmap dump support
  2023-03-07  2:05   ` HAGIO KAZUHITO(萩尾 一仁)
@ 2023-03-07  2:49     ` lizhijian
  -1 siblings, 0 replies; 24+ messages in thread
From: lizhijian @ 2023-03-07  2:49 UTC (permalink / raw)
  To: HAGIO KAZUHITO(萩尾 一仁),
	kexec, nvdimm, linux-mm
  Cc: Baoquan He, vgoyal, dyoung, vishal.l.verma, dan.j.williams,
	dave.jiang, horms, akpm, Yasunori Gotou (Fujitsu),
	yangx.jy, ruansy.fnst



On 07/03/2023 10:05, HAGIO KAZUHITO(萩尾 一仁) wrote:
> On 2023/02/23 15:24, lizhijian@fujitsu.com wrote:
>> Hello folks,
>>
>> This mail raises a pmem memmap dump requirement and possible solutions, but they are all still premature.
>> I really hope you can provide some feedback.
>>
>> pmem memmap can also be called pmem metadata here.
>>
>> ### Background and motivate overview ###
>> ---
>> Crash dump is an important feature for trouble shooting of kernel. It is the final way to chase what
>> happened at the kernel panic, slowdown, and so on. It is the most important tool for customer support.
>> However, a part of data on pmem is not included in crash dump, it may cause difficulty to analyze
>> trouble around pmem (especially Filesystem-DAX).
>>
>>
>> A pmem namespace in "fsdax" or "devdax" mode requires allocation of per-page metadata[1]. The allocation
>> can be drawn from either mem(system memory) or dev(pmem device), see `ndctl help create-namespace` for
>> more details. In fsdax, struct page array becomes very important, it is one of the key data to find
>> status of reverse map.
>>
>> So, when metadata was stored in pmem, even pmem's per-page metadata will not be dumped. That means
>> troubleshooters are unable to check more details about pmem from the dumpfile.
>>
>> ### Make pmem memmap dump support ###
>> ---
>> Our goal is that whether metadata is stored on mem or pmem, its metadata can be dumped and then the
>> crash-utilities can read more details about the pmem. Of course, this feature can be enabled/disabled.
>>
>> First, based on our previous investigation, according to the location of metadata and the scope of
>> dump, we can divide it into the following four cases: A, B, C, D.
>> It should be noted that although we mentioned case A&B below, we do not want these two cases to be
>> part of this feature, because dumping the entire pmem will consume a lot of space, and more importantly,
>> it may contain user sensitive data.
>>
>> +-------------+----------+------------+
>> |\+--------+\     metadata location   |
>> |            ++-----------------------+
>> | dump scope  |  mem     |   PMEM     |
>> +-------------+----------+------------+
>> | entire pmem |     A    |     B      |
>> +-------------+----------+------------+
>> | metadata    |     C    |     D      |
>> +-------------+----------+------------+
>>
>> Case A&B: unsupported
>> - Only the regions listed in PT_LOAD in vmcore are dumpable. This can be resolved by adding the pmem
>> region into vmcore's PT_LOADs in kexec-tools.
>> - For makedumpfile which will assume that all page objects of the entire region described in PT_LOADs
>> are readable, and then skips/excludes the specific page according to its attributes. But in the case
>> of pmem, 1st kernel only allocates page objects for the namespaces of pmem, so makedumpfile will throw
>> errors[2] when specific -d options are specified.
>> Accordingly, we should make makedumpfile to ignore these errors if it's pmem region.
>>
>> Because these above cases are not in our goal, we must consider how to prevent the data part of pmem
>> from reading by the dump application(makedumpfile).
>>
>> Case C: native supported
>> metadata is stored in mem, and the entire mem/ram is dumpable.
>>
>> Case D: unsupported && need your input
>> To support this situation, the makedumpfile needs to know the location of metadata for each pmem
>> namespace and the address and size of metadata in the pmem [start, end)
>>
>> We have thought of a few possible options:
>>
>> 1) In the 2nd kernel, with the help of the information from /sys/bus/nd/devices/{namespaceX.Y, daxX.Y, pfnX.Y}
>> exported by pmem drivers, makedumpfile is able to calculate the address and size of metadata
>> 2) In the 1st kernel, add a new symbol to the vmcore. The symbol is associated with the layout of
>> each namespace. The makedumpfile reads the symbol and figures out the address and size of the metadata.
> 
> Hi Zhijian,
> 
> sorry, probably I don't understand enough, but do these mean that
>    1. /proc/vmcore exports pmem regions with PT_LOADs, which contain
>       unreadable ones, and
>    2. makedumpfile gets to know the readable regions somehow?

Kazu,

Generally, only the metadata of pmem is readable by the crash-utilities, because the metadata contains pmem's own memmap (page array).
The rest of the pmem can be used as a block device (DAX filesystem) or for other purposes, so it is not very helpful
for troubleshooting.

In my understanding, PT_LOAD entries are part of the ELF format, and they comply with that format.
My current thoughts are:
1. The crash tooling will export the entire pmem region to /proc/vmcore; makedumpfile/cp and other commands can then read the
entire pmem region directly.
2. Export the namespace layout to vmcore as a symbol, so that dumping applications (makedumpfile) can figure out where
the metadata is and read only the metadata (a rough sketch follows below).
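
To make point 2 a bit more concrete, below is a rough sketch (not a real
patch) of what the 1st kernel could export. The structure, array and
function names are invented for illustration; only the VMCOREINFO_*
helpers from <linux/crash_core.h> are existing kernel infrastructure, and
whether a driver should append to vmcoreinfo like this is exactly the kind
of feedback I am hoping for.

    /* Hypothetical sketch -- names invented for illustration only. */
    #include <linux/crash_core.h>
    #include <linux/types.h>

    struct pmem_meta_range {
            u64 start;      /* physical start of a namespace's metadata  */
            u64 end;        /* physical end (exclusive) of that metadata */
    };

    static struct pmem_meta_range pmem_meta_ranges[16];  /* one per namespace */
    static int pmem_meta_range_count;

    /* Called once the nvdimm core knows each namespace's layout. */
    static void pmem_export_meta_ranges(void)
    {
            VMCOREINFO_SYMBOL(pmem_meta_ranges);
            VMCOREINFO_SYMBOL(pmem_meta_range_count);
            VMCOREINFO_STRUCT_SIZE(pmem_meta_range);
            VMCOREINFO_OFFSET(pmem_meta_range, start);
            VMCOREINFO_OFFSET(pmem_meta_range, end);
    }

With something like this, makedumpfile could resolve pmem_meta_ranges from
the vmcoreinfo note, read the array out of /proc/vmcore, and treat only
those ranges as readable pmem metadata.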

Not sure whether this reply is helpful; if you have any other questions, feel free to let me know. :)


Thanks
Zhijian

> 
> Then /proc/vmcore with pmem cannot be captured by other commands,
> e.g. cp command?
> 
> Thanks,
> Kazu
> 
>> 3) others ?
>>
>> But then we found that we have always ignored a user case, that is, the user could save the dumpfile
>> to the pmem. Neither of these two options can solve this problem, because the pmem drivers will
>> re-initialize the metadata during the pmem drivers loading process, which leads to the metadata
>> we dumped is inconsistent with the metadata at the moment of the crash happening.
>> Simply, can we just disable the pmem directly in 2nd kernel so that previous metadata will not be
>> destroyed? But this operation will bring us inconvenience that 2nd kernel doesn’t allow user storing
>> dumpfile on the filesystem/partition based on pmem.
>>
>> So here I hope you can provide some ideas about this feature/requirement and on the possible solution
>> for the cases A&B&D mentioned above, it would be greatly appreciated.
>>
>> If I’m missing something, feel free to let me know. Any feedback & comment are very welcome.
>>
>>
>> [1] Pmem region layout:
>>      ^<--namespace0.0---->^<--namespace0.1------>^
>>      |                    |                      |
>>      +--+m----------------+--+m------------------+---------------------+-+a
>>      |++|e                |++|e                  |                     |+|l
>>      |++|t                |++|t                  |                     |+|i
>>      |++|a                |++|a                  |                     |+|g
>>      |++|d  namespace0.0  |++|d  namespace0.1    |     un-allocated    |+|n
>>      |++|a    fsdax       |++|a     devdax       |                     |+|m
>>      |++|t                |++|t                  |                     |+|e
>>      +--+a----------------+--+a------------------+---------------------+-+n
>>      |                                                                   |t
>>      v<-----------------------pmem region------------------------------->v
>>
>> [2] https://lore.kernel.org/linux-mm/70F971CF-1A96-4D87-B70C-B971C2A1747C@roc.cs.umass.edu/T/
>>
>>
>> Thanks
>> Zhijian

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][nvdimm][crash] pmem memmap dump support
  2023-03-07  2:49     ` lizhijian
@ 2023-03-07  8:31       ` HAGIO KAZUHITO(萩尾 一仁)
  -1 siblings, 0 replies; 24+ messages in thread
From: HAGIO KAZUHITO(萩尾 一仁) @ 2023-03-07  8:31 UTC (permalink / raw)
  To: lizhijian, kexec, nvdimm, linux-mm
  Cc: Baoquan He, vgoyal, dyoung, vishal.l.verma, dan.j.williams,
	dave.jiang, horms, akpm, Yasunori Gotou (Fujitsu),
	yangx.jy, ruansy.fnst

On 2023/03/07 11:49, lizhijian@fujitsu.com wrote:
> On 07/03/2023 10:05, HAGIO KAZUHITO(萩尾 一仁) wrote:
>> On 2023/02/23 15:24, lizhijian@fujitsu.com wrote:
>>> Hello folks,
>>>
>>> This mail raises a pmem memmap dump requirement and possible solutions, but they are all still premature.
>>> I really hope you can provide some feedback.
>>>
>>> pmem memmap can also be called pmem metadata here.
>>>
>>> ### Background and motivate overview ###
>>> ---
>>> Crash dump is an important feature for trouble shooting of kernel. It is the final way to chase what
>>> happened at the kernel panic, slowdown, and so on. It is the most important tool for customer support.
>>> However, a part of data on pmem is not included in crash dump, it may cause difficulty to analyze
>>> trouble around pmem (especially Filesystem-DAX).
>>>
>>>
>>> A pmem namespace in "fsdax" or "devdax" mode requires allocation of per-page metadata[1]. The allocation
>>> can be drawn from either mem(system memory) or dev(pmem device), see `ndctl help create-namespace` for
>>> more details. In fsdax, struct page array becomes very important, it is one of the key data to find
>>> status of reverse map.
>>>
>>> So, when metadata was stored in pmem, even pmem's per-page metadata will not be dumped. That means
>>> troubleshooters are unable to check more details about pmem from the dumpfile.
>>>
>>> ### Make pmem memmap dump support ###
>>> ---
>>> Our goal is that whether metadata is stored on mem or pmem, its metadata can be dumped and then the
>>> crash-utilities can read more details about the pmem. Of course, this feature can be enabled/disabled.
>>>
>>> First, based on our previous investigation, according to the location of metadata and the scope of
>>> dump, we can divide it into the following four cases: A, B, C, D.
>>> It should be noted that although we mentioned case A&B below, we do not want these two cases to be
>>> part of this feature, because dumping the entire pmem will consume a lot of space, and more importantly,
>>> it may contain user sensitive data.
>>>
>>> +-------------+----------+------------+
>>> |\+--------+\     metadata location   |
>>> |            ++-----------------------+
>>> | dump scope  |  mem     |   PMEM     |
>>> +-------------+----------+------------+
>>> | entire pmem |     A    |     B      |
>>> +-------------+----------+------------+
>>> | metadata    |     C    |     D      |
>>> +-------------+----------+------------+
>>>
>>> Case A&B: unsupported
>>> - Only the regions listed in PT_LOAD in vmcore are dumpable. This can be resolved by adding the pmem
>>> region into vmcore's PT_LOADs in kexec-tools.
>>> - For makedumpfile which will assume that all page objects of the entire region described in PT_LOADs
>>> are readable, and then skips/excludes the specific page according to its attributes. But in the case
>>> of pmem, 1st kernel only allocates page objects for the namespaces of pmem, so makedumpfile will throw
>>> errors[2] when specific -d options are specified.
>>> Accordingly, we should make makedumpfile to ignore these errors if it's pmem region.
>>>
>>> Because these above cases are not in our goal, we must consider how to prevent the data part of pmem
>>> from reading by the dump application(makedumpfile).
>>>
>>> Case C: native supported
>>> metadata is stored in mem, and the entire mem/ram is dumpable.
>>>
>>> Case D: unsupported && need your input
>>> To support this situation, the makedumpfile needs to know the location of metadata for each pmem
>>> namespace and the address and size of metadata in the pmem [start, end)
>>>
>>> We have thought of a few possible options:
>>>
>>> 1) In the 2nd kernel, with the help of the information from /sys/bus/nd/devices/{namespaceX.Y, daxX.Y, pfnX.Y}
>>> exported by pmem drivers, makedumpfile is able to calculate the address and size of metadata
>>> 2) In the 1st kernel, add a new symbol to the vmcore. The symbol is associated with the layout of
>>> each namespace. The makedumpfile reads the symbol and figures out the address and size of the metadata.
>>
>> Hi Zhijian,
>>
>> sorry, probably I don't understand enough, but do these mean that
>>     1. /proc/vmcore exports pmem regions with PT_LOADs, which contain
>>        unreadable ones, and
>>     2. makedumpfile gets to know the readable regions somehow?
> 
> Kazu,
> 
> Generally, only metadata of pmem is readable by crash-utilities, because metadata contains its own memmap(page array).
> The rest part of pmem which could be used as a block device(DAX filesystem) or other purpose, so it's not much helpful
> for the troubleshooting.
> 
> In my understanding, PT_LOADs is part of ELF format, it complies with what it's.
> In my current thoughts,
> 1. crash-tool will export the entire pmem region to /proc/vmcore. makedumpfile/cp etc commands can read the entire
> pmem region directly.
> 2. export the namespace layout to vmcore as a symbol, then dumping applications(makedumpfile) can figure out where
> the metadata is, and read metadata only.

Ah got it, Thanks!

My understanding is that makedumpfile/cp will be able to read the entire
pmem, but with some makedumpfile -d option values it cannot get the
physical address of the struct page for data pages and throws an error.  So
you think there will be a need to export the ranges of the allocated metadata.

Thanks,
Kazu

> 
> Not sure whether the reply is helpful, if you have any other questions, feel free to let me know. :)
> 
> 
> Thanks
> Zhijian
> 
>>
>> Then /proc/vmcore with pmem cannot be captured by other commands,
>> e.g. cp command?
>>
>> Thanks,
>> Kazu
>>
>>> 3) others ?
>>>
>>> But then we found that we have always ignored a user case, that is, the user could save the dumpfile
>>> to the pmem. Neither of these two options can solve this problem, because the pmem drivers will
>>> re-initialize the metadata during the pmem drivers loading process, which leads to the metadata
>>> we dumped is inconsistent with the metadata at the moment of the crash happening.
>>> Simply, can we just disable the pmem directly in 2nd kernel so that previous metadata will not be
>>> destroyed? But this operation will bring us inconvenience that 2nd kernel doesn’t allow user storing
>>> dumpfile on the filesystem/partition based on pmem.
>>>
>>> So here I hope you can provide some ideas about this feature/requirement and on the possible solution
>>> for the cases A&B&D mentioned above, it would be greatly appreciated.
>>>
>>> If I’m missing something, feel free to let me know. Any feedback & comment are very welcome.
>>>
>>>
>>> [1] Pmem region layout:
>>>       ^<--namespace0.0---->^<--namespace0.1------>^
>>>       |                    |                      |
>>>       +--+m----------------+--+m------------------+---------------------+-+a
>>>       |++|e                |++|e                  |                     |+|l
>>>       |++|t                |++|t                  |                     |+|i
>>>       |++|a                |++|a                  |                     |+|g
>>>       |++|d  namespace0.0  |++|d  namespace0.1    |     un-allocated    |+|n
>>>       |++|a    fsdax       |++|a     devdax       |                     |+|m
>>>       |++|t                |++|t                  |                     |+|e
>>>       +--+a----------------+--+a------------------+---------------------+-+n
>>>       |                                                                   |t
>>>       v<-----------------------pmem region------------------------------->v
>>>
>>> [2] https://lore.kernel.org/linux-mm/70F971CF-1A96-4D87-B70C-B971C2A1747C@roc.cs.umass.edu/T/
>>>
>>>
>>> Thanks
>>> Zhijian

^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: [RFC][nvdimm][crash] pmem memmap dump support
  2023-02-23  6:24 ` lizhijian
@ 2023-03-17  6:12   ` Dan Williams
  -1 siblings, 0 replies; 24+ messages in thread
From: Dan Williams @ 2023-03-17  6:12 UTC (permalink / raw)
  To: lizhijian, kexec, nvdimm, linux-mm
  Cc: Baoquan He, vgoyal, dyoung, vishal.l.verma, dan.j.williams,
	dave.jiang, horms, k-hagio-ab, akpm, Yasunori Gotou (Fujitsu),
	yangx.jy, ruansy.fnst

lizhijian@fujitsu.com wrote:
[..]
> Case D: unsupported && need your input To support this situation, the
> makedumpfile needs to know the location of metadata for each pmem
> namespace and the address and size of metadata in the pmem [start,
> end)

My first reaction is that you should copy what the ndctl utility does
when it needs to manipulate or interrogate the metadata space.

For example, see namespace_rw_infoblock():

https://github.com/pmem/ndctl/blob/main/ndctl/namespace.c#L2022

That facility uses the force_raw attribute
("/sys/bus/nd/devices/namespaceX.Y/force_raw") to arrange for the
namespace to initialize without considering any pre-existing metadata
*and* without overwriting it. In that mode makedumpfile can walk the
namespaces and retrieve the metadata written by the previous kernel.

The module to block, so that makedumpfile can access the namespace in raw
mode, is the nd_pmem module, or, if it is built in, the
nd_pmem_driver_init() initcall.
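
A minimal illustration of that flow, as I read it (the function name and
the namespace0.0 path are just placeholders, and error handling is
omitted): the capture environment sets force_raw before the nd_pmem driver
binds, then lets it bind, so the namespace comes up raw with the previous
kernel's metadata intact.

    /* Illustrative sketch only: put an example namespace into raw mode. */
    #include <fcntl.h>
    #include <unistd.h>

    static void force_raw_example(void)
    {
            int fd = open("/sys/bus/nd/devices/namespace0.0/force_raw", O_WRONLY);

            if (fd >= 0) {
                    write(fd, "1", 1);      /* enable raw mode */
                    close(fd);
            }
    }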

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][nvdimm][crash] pmem memmap dump support
  2023-03-17  6:12   ` Dan Williams
@ 2023-03-17  7:30     ` lizhijian
  -1 siblings, 0 replies; 24+ messages in thread
From: lizhijian @ 2023-03-17  7:30 UTC (permalink / raw)
  To: Dan Williams, kexec, nvdimm, linux-mm
  Cc: Baoquan He, vgoyal, dyoung, vishal.l.verma, dave.jiang, horms,
	k-hagio-ab, akpm, Yasunori Gotou (Fujitsu),
	yangx.jy, ruansy.fnst



On 17/03/2023 14:12, Dan Williams wrote:
> lizhijian@fujitsu.com wrote:
> [..]
>> Case D: unsupported && need your input To support this situation, the
>> makedumpfile needs to know the location of metadata for each pmem
>> namespace and the address and size of metadata in the pmem [start,
>> end)
> 
> My first reaction is that you should copy what the ndctl utility does
> when it needs to manipulate or interrogate the metadata space.
> 
> > For example, see namespace_rw_infoblock():
> > 
> https://github.com/pmem/ndctl/blob/main/ndctl/namespace.c#L2022
> 
> That facility uses the force_raw attribute
> ("/sys/bus/nd/devices/namespaceX.Y/force_raw") to arrange for the
> namespace to initalize without considering any pre-existing metdata
> *and* without overwriting it. In that mode makedumpfile can walk the
> namespaces and retrieve the metadata written by the previous kernel.

The dumping application (makedumpfile or cp) will/should read /proc/vmcore to construct the dumpfile,
so makedumpfile needs to know the *address* and *size/end* of the metadata in terms of the 1st kernel's address space.

I don't know much about namespace_rw_infoblock() yet, but it is also an option if we can get such information from it.

My current WIP proposal is to export a list linking all pmem namespaces to vmcore; with this, the kdump kernel doesn't need to
rely on the pmem driver.

Thanks
Zhijian

> 
> The module to block to allow makedumpfile to access the namespace in raw
> mode is the nd_pmem module, or if it is builtin the
> nd_pmem_driver_init() initcall.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][nvdimm][crash] pmem memmap dump support
  2023-03-17  7:30     ` lizhijian
@ 2023-03-17 15:19       ` Dan Williams
  -1 siblings, 0 replies; 24+ messages in thread
From: Dan Williams @ 2023-03-17 15:19 UTC (permalink / raw)
  To: lizhijian, Dan Williams, kexec, nvdimm, linux-mm
  Cc: Baoquan He, vgoyal, dyoung, vishal.l.verma, dave.jiang, horms,
	k-hagio-ab, akpm, Yasunori Gotou (Fujitsu),
	yangx.jy, ruansy.fnst

lizhijian@fujitsu.com wrote:
> 
> 
> On 17/03/2023 14:12, Dan Williams wrote:
> > lizhijian@fujitsu.com wrote:
> > [..]
> >> Case D: unsupported && need your input To support this situation, the
> >> makedumpfile needs to know the location of metadata for each pmem
> >> namespace and the address and size of metadata in the pmem [start,
> >> end)
> > 
> > My first reaction is that you should copy what the ndctl utility does
> > when it needs to manipulate or interrogate the metadata space.
> > 
> > For example, see namespace_rw_infoblock():
> > 
> > https://github.com/pmem/ndctl/blob/main/ndctl/namespace.c#L2022
> > 
> > That facility uses the force_raw attribute
> > ("/sys/bus/nd/devices/namespaceX.Y/force_raw") to arrange for the
> > namespace to initalize without considering any pre-existing metdata
> > *and* without overwriting it. In that mode makedumpfile can walk the
> > namespaces and retrieve the metadata written by the previous kernel.
> 
> For the dumping application(makedumpfile or cp), it will/should reads
> /proc/vmcore to construct the dumpfile, So makedumpfile need to know
> the *address* and *size/end* of metadata in the view of 1st kernel
> address space.

Another option, instead of passing the metadata layout into the crash
kernel, is to just parse the infoblock and calculate the boundaries of
userdata and metadata.
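
Roughly along these lines, where the struct and the helper are a
deliberately simplified, made-up stand-in for the kernel's struct
nd_pfn_sb (the real field layout, offsets and endianness handling would
have to be taken from the kernel, not from this sketch):

    /* Rough sketch: derive the metadata/userdata split of one fsdax
     * namespace from its pfn infoblock.  dataoff is the offset of user
     * data from the namespace base, so everything below it is metadata.
     */
    #include <stdint.h>

    struct pfn_infoblock {                  /* simplified, not the real layout */
            char     signature[16];         /* e.g. "NVDIMM_PFN_INFO"          */
            /* ... other fields elided ... */
            uint64_t dataoff;               /* little-endian in the real block */
    };

    static void meta_range(uint64_t ns_base, const struct pfn_infoblock *sb,
                           uint64_t *meta_start, uint64_t *meta_end)
    {
            *meta_start = ns_base;                   /* infoblock + memmap live here */
            *meta_end   = ns_base + sb->dataoff;     /* user data starts at dataoff  */
    }

makedumpfile would read the infoblock out of /proc/vmcore (or via the raw
namespace) and then dump only [meta_start, meta_end).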

> I haven't known much about namespace_rw_infoblock() , so it is also an
> option if we can know such information from it.
> 
> My current WIP propose is to export a list linking all pmem namespaces
> to vmcore, with this, the kdump kernel don't need to rely on the pmem
> driver.

Seems like more work to avoid using the pmem driver, as new
information-passing infrastructure needs to be built versus reusing what
is already there.

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2023-03-17 15:19 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-23  6:24 [RFC][nvdimm][crash] pmem memmap dump support lizhijian
2023-02-28 14:03 ` Baoquan He
2023-03-01  6:27   ` lizhijian
2023-03-01  8:17     ` Baoquan He
2023-03-03  2:27       ` lizhijian
2023-03-03  9:21         ` Baoquan He
2023-03-07  2:05 ` HAGIO KAZUHITO(萩尾 一仁)
2023-03-07  2:49   ` lizhijian
2023-03-07  8:31     ` HAGIO KAZUHITO(萩尾 一仁)
2023-03-17  6:12 ` Dan Williams
2023-03-17  7:30   ` lizhijian
2023-03-17 15:19     ` Dan Williams
