Re: dax alignment problem on arm64 (and other achitectures)

From: David Hildenbrand <david@redhat.com>
To: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: linux-mm <linux-mm@kvack.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Sasha Levin <sashal@kernel.org>,
	Tyler Hicks <tyhicks@linux.microsoft.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Dan Williams <dan.j.williams@intel.com>,
	Michal Hocko <mhocko@suse.com>,
	Oscar Salvador <osalvador@suse.de>,
	Vlastimil Babka <vbabka@suse.cz>,
	Joonsoo Kim <iamjoonsoo.kim@lge.com>,
	Jason Gunthorpe <jgg@ziepe.ca>, Marc Zyngier <maz@kernel.org>,
	Linux ARM <linux-arm-kernel@lists.infradead.org>,
	Will Deacon <will.deacon@arm.com>,
	James Morse <james.morse@arm.com>,
	James Morris <jmorris@namei.org>
Subject: Re: dax alignment problem on arm64 (and other achitectures)
Date: Thu, 28 Jan 2021 16:03:07 +0100	[thread overview]
Message-ID: <db692fcd-40e8-9c2b-d63b-9803f4bf9d5e@redhat.com> (raw)
In-Reply-To: <CA+CK2bCjD7PujEwWMT32p4e6x6hZ-f5QOKXir10mT8RfijvnUA@mail.gmail.com>

>> One issue usually is that often firmware can allocate from available
>> system RAM and/or modify/initialize it. I assume you're running some
>> custom firmware :)
> 
> We have a special firmware that does not touch the last 2G of physical
> memory for its allocations :)
> 

Fancy :)

[...]

>> Personally, I think the future is 4k, especially for smaller machines.
>> (also, imagine right now how many 512MB THP you can actually use in your
>> 8GB VM ..., simply not suitable for small machines).
> 
> Um, this is not really about 512THP. Yes, this is smaller machine, but
> performance is very important to us. Boot budget for the kernel is
> under half a second. With 64K we save 0.2s  0.35s vs 0.55s. This is
> because fewer struct pages need to be initialized. Also, fewer TLB
> misses, and 3-level page tables add up as performance benefits. >
> For larger servers 64K pages make total sense: Less memory is wasted as metdata.

Yes, indeed, for very large servers it might make sense in that regard. 
However, once we can eventually free vmemmap of hugetlbfs things could 
change; assuming user space will be consuming huge pages (which large 
machines better be doing ... databases, hypervisors ... ).

Also, some hypervisors try allocating the memmap completely ... but I 
consider that rather a special case.

Personally, I consider being able to use THP/huge pages more important 
than having 64k base pages and saving some TLB space there. Also, with 
64k you have other drawbacks: for example, each stack, each TLS for 
threads in applications suddenly consumes 16 times more memory as "minimum".

Optimizing boot time/memmap initialization further is certainly an 
interesting topic.

Anyhow, you know your use case best, just sharing my thoughts :)

[...]

>>>
>>> Right, but I do not think it is possible to do for dax devices (as of
>>> right now). I assume, it contains information about what kind of
>>> device it is: devdax, fsdax, sector, uuid etc.
>>> See [1] namespaces tabel. It contains summary of pmem devices types,
>>> and which of them have label (all except for raw).
>>
>> Interesting, I wonder if the label is really required to get this
>> special use case running. I mean, all you want is to have dax/kmem
>> expose the whole thing as system RAM. You don't want to lose even 2MB if
>> it's just for the sake of unnecessary metadata - this is not a real
>> device, it's "fake" already.
> 
> Hm, would not it essentially  mean allowing memory hot-plug for raw
> pmem devices? Something like create mmap, and hot-add raw pmem?

Theoretically yes, but I have no idea if that would make sense for real 
"raw pmem" as well. Hope some of the pmem/nvdimm experts can clarify 
what's possible and what's not :)

-- 
Thanks,

David / dhildenb