[RFC][nvdimm][crash] pmem memmap dump support

* [RFC][nvdimm][crash] pmem memmap dump support
@ 2023-02-23  6:24 ` lizhijian
  0 siblings, 0 replies; 24+ messages in thread
From: lizhijian @ 2023-02-23  6:24 UTC (permalink / raw)
  To: kexec, nvdimm, linux-mm
  Cc: Baoquan He, vgoyal, dyoung, vishal.l.verma, dan.j.williams,
	dave.jiang, horms, k-hagio-ab, akpm, Yasunori Gotou (Fujitsu),
	yangx.jy, ruansy.fnst

Hello folks,

This mail raises a pmem memmap dump requirement and possible solutions, but they are all still premature.
I really hope you can provide some feedback.

pmem memmap can also be called pmem metadata here.

### Background and motivate overview ###
---
Crash dump is an important feature for trouble shooting of kernel. It is the final way to chase what
happened at the kernel panic, slowdown, and so on. It is the most important tool for customer support.
However, a part of data on pmem is not included in crash dump, it may cause difficulty to analyze
trouble around pmem (especially Filesystem-DAX).

A pmem namespace in "fsdax" or "devdax" mode requires allocation of per-page metadata[1]. The allocation
can be drawn from either mem(system memory) or dev(pmem device), see `ndctl help create-namespace` for
more details. In fsdax, struct page array becomes very important, it is one of the key data to find
status of reverse map.

So, when metadata was stored in pmem, even pmem's per-page metadata will not be dumped. That means
troubleshooters are unable to check more details about pmem from the dumpfile.

### Make pmem memmap dump support ###
---
Our goal is that whether metadata is stored on mem or pmem, its metadata can be dumped and then the
crash-utilities can read more details about the pmem. Of course, this feature can be enabled/disabled.

First, based on our previous investigation, according to the location of metadata and the scope of
dump, we can divide it into the following four cases: A, B, C, D.
It should be noted that although we mentioned case A&B below, we do not want these two cases to be
part of this feature, because dumping the entire pmem will consume a lot of space, and more importantly,
it may contain user sensitive data.

+-------------+----------+------------+
|\+--------+\     metadata location   |
|            ++-----------------------+
| dump scope  |  mem     |   PMEM     |
+-------------+----------+------------+
| entire pmem |     A    |     B      |
+-------------+----------+------------+
| metadata    |     C    |     D      |
+-------------+----------+------------+

Case A&B: unsupported
- Only the regions listed in PT_LOAD in vmcore are dumpable. This can be resolved by adding the pmem
region into vmcore's PT_LOADs in kexec-tools.
- For makedumpfile which will assume that all page objects of the entire region described in PT_LOADs
are readable, and then skips/excludes the specific page according to its attributes. But in the case
of pmem, 1st kernel only allocates page objects for the namespaces of pmem, so makedumpfile will throw
errors[2] when specific -d options are specified.
Accordingly, we should make makedumpfile to ignore these errors if it's pmem region.

Because these above cases are not in our goal, we must consider how to prevent the data part of pmem
from reading by the dump application(makedumpfile).

Case C: native supported
metadata is stored in mem, and the entire mem/ram is dumpable.

Case D: unsupported && need your input
To support this situation, the makedumpfile needs to know the location of metadata for each pmem
namespace and the address and size of metadata in the pmem [start, end)

We have thought of a few possible options:

1) In the 2nd kernel, with the help of the information from /sys/bus/nd/devices/{namespaceX.Y, daxX.Y, pfnX.Y}
exported by pmem drivers, makedumpfile is able to calculate the address and size of metadata
2) In the 1st kernel, add a new symbol to the vmcore. The symbol is associated with the layout of
each namespace. The makedumpfile reads the symbol and figures out the address and size of the metadata.
3) others ?

But then we found that we have always ignored a user case, that is, the user could save the dumpfile
to the pmem. Neither of these two options can solve this problem, because the pmem drivers will
re-initialize the metadata during the pmem drivers loading process, which leads to the metadata
we dumped is inconsistent with the metadata at the moment of the crash happening.
Simply, can we just disable the pmem directly in 2nd kernel so that previous metadata will not be
destroyed? But this operation will bring us inconvenience that 2nd kernel doesn’t allow user storing
dumpfile on the filesystem/partition based on pmem.

So here I hope you can provide some ideas about this feature/requirement and on the possible solution
for the cases A&B&D mentioned above, it would be greatly appreciated.

If I’m missing something, feel free to let me know. Any feedback & comment are very welcome.

[1] Pmem region layout:
   ^<--namespace0.0---->^<--namespace0.1------>^
   |                    |                      |
   +--+m----------------+--+m------------------+---------------------+-+a
   |++|e                |++|e                  |                     |+|l
   |++|t                |++|t                  |                     |+|i
   |++|a                |++|a                  |                     |+|g
   |++|d  namespace0.0  |++|d  namespace0.1    |     un-allocated    |+|n
   |++|a    fsdax       |++|a     devdax       |                     |+|m
   |++|t                |++|t                  |                     |+|e
   +--+a----------------+--+a------------------+---------------------+-+n
   |                                                                   |t
   v<-----------------------pmem region------------------------------->v

[2] https://lore.kernel.org/linux-mm/70F971CF-1A96-4D87-B70C-B971C2A1747C@roc.cs.umass.edu/T/

Thanks
Zhijian

^ permalink raw reply	[flat|nested] 24+ messages in thread