dax alignment problem on arm64 (and other architectures)

* dax alignment problem on arm64 (and other architectures)
@ 2021-01-27 20:43 Pavel Tatashin
  2021-01-27 21:09 ` David Hildenbrand
From: Pavel Tatashin @ 2021-01-27 20:43 UTC
  To: linux-mm, LKML, Sasha Levin, Tyler Hicks, Andrew Morton,
	Dan Williams, David Hildenbrand, Michal Hocko, Oscar Salvador,
	Vlastimil Babka, Joonsoo Kim, Jason Gunthorpe, Marc Zyngier,
	Linux ARM, Will Deacon, James Morse, James Morris

This is something that Dan Williams and I discussed off the mailing
list some time ago, but I want to have a broader discussion about this
problem so that I can send out a fix that would be acceptable.

We have a 2G pmem device that is carved out of regular memory that we
use to pass data across reboots. After the machine is rebooted we
hotplug that memory back, so we do not lose 2G of system memory
(machine is small, only 8G of RAM total).

In order to hotplug pmem memory, it must first be converted to devdax.
Devdax has a label, 2M in size, that is placed at the beginning of the
pmem device memory, and this is what causes the problem.

The section size is the hotplug unit on Linux: whatever gets
hot-plugged or hot-removed must be section-size aligned. On x86 the
section size is 128M; on arm64 it is 1G (because arm64 supports 64K
pages, and 128M does not work with 64K pages). Because the first 2M
are subtracted from the pmem device to create devdax, the actual
hot-pluggable memory is not 1G/128M aligned. The whole first section
is skipped when the memory gets hot-plugged because of the 2M label,
so we lose 126M on x86 or 1022M on arm64 of memory that would
otherwise be hot-plugged.
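
The arithmetic above can be sketched as follows. This is only an
illustration, not kernel code: the helper names and the 4G base
address used in the usage note are assumptions made for the example;
the 2G device size, 2M label, and 128M/1G section sizes come from the
description above.

```c
#include <assert.h>
#include <stdint.h>

#define MiB (1024ULL * 1024ULL)

/* Round x up to a multiple of align (align must be a power of two). */
static uint64_t align_up(uint64_t x, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (x + align - 1) & ~(align - 1);
}

/* Round x down to a multiple of align (align must be a power of two). */
static uint64_t align_down(uint64_t x, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return x & ~(align - 1);
}

/*
 * Hypothetical helper: bytes of the data area [base + label, base + size)
 * that are not covered by whole sections and so cannot be hot-plugged.
 */
static uint64_t unpluggable_bytes(uint64_t base, uint64_t size,
				  uint64_t label, uint64_t section)
{
	uint64_t start = base + label;
	uint64_t end = base + size;
	uint64_t first = align_up(start, section);
	uint64_t last = align_down(end, section);

	if (first >= last)
		return end - start;	/* no whole section fits at all */
	return (first - start) + (end - last);
}
```

With an assumed base of 4096 * MiB, a size of 2048 * MiB, and a label
of 2 * MiB, unpluggable_bytes() returns 126 * MiB for a 128M section
and 1022 * MiB for a 1G section, matching the numbers above.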

As a workaround, so that we do not lose 1022M out of 8G of memory on
arm64, we have reduced the section size to 128M using this patch [1].
This way we lose 126M (which I still hate!).

I would like to get rid of this workaround. First, because I would
like us to switch to 64K pages to gain performance, and second, so
that we do not depend on an unofficial patch which has already given
us some headaches with kdump support.

Here are some solutions that I think we can do:

1. Instead of carving the memory at a 1G-aligned address, do it at a
1G - 2M address; this way, when devdax is created, it is perfectly 1G
aligned. On arm64 this causes a panic because there is a 2M hole in
memory. Even if the panic is fixed, I do not think this is a proper
fix; it is simply a workaround for the underlying problem.

2. Dan Williams introduced subsections [2]. They, however, do not work
with devdax or with hot-plugging in general. Those patches take care
of the __add_pages() side of things, but not add_memory(). Also, it is
unclear what kind of user interface changes need to be made in order
to enable subsection features to online/offline pages.

3. Allow devdax to be hot-plugged together with the label, but teach
the kernel not to touch the label (i.e. not to allocate its memory).
IMO, this is a kind of ugly solution, because when devdax is
hot-plugged it is not even aware of the label size. But perhaps that
can be changed.

4. Other ideas? (move dax label to the end? a special case without a
label? label outside of data?)
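
To make option 1 concrete, here is a minimal sketch of the address
arithmetic only. The helper names and the example boundary are
assumptions made for illustration; this does not model the panic or
the hole in memory mentioned above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MiB (1024ULL * 1024ULL)

/* True if addr is a multiple of section (section must be a power of two). */
static bool section_aligned(uint64_t addr, uint64_t section)
{
	assert(section != 0 && (section & (section - 1)) == 0);
	return (addr & (section - 1)) == 0;
}

/* Option 1: start the carve-out one label below a section boundary. */
static uint64_t carve_base(uint64_t boundary, uint64_t label)
{
	return boundary - label;
}

/* First byte of the devdax data, right after the label. */
static uint64_t data_start(uint64_t base, uint64_t label)
{
	return base + label;
}
```

For an assumed 1G-aligned boundary at 4096 * MiB and a 2 * MiB label,
carve_base() gives 4094 * MiB, which is not section aligned, while
data_start() of that base lands back on the 1G boundary.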

Thank you,
Pasha

[1] https://lore.kernel.org/lkml/20190423203843.2898-1-pasha.tatashin@soleen.com
[2] https://lore.kernel.org/lkml/156092349300.979959.17603710711957735135.stgit@dwillia2-desk3.amr.corp.intel.com
