From: Pavel Tatashin <pasha.tatashin@soleen.com>
To: David Hildenbrand <david@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>,
linux-mm <linux-mm@kvack.org>,
LKML <linux-kernel@vger.kernel.org>,
Sasha Levin <sashal@kernel.org>,
Tyler Hicks <tyhicks@linux.microsoft.com>,
Andrew Morton <akpm@linux-foundation.org>,
Dan Williams <dan.j.williams@intel.com>,
Michal Hocko <mhocko@suse.com>,
Oscar Salvador <osalvador@suse.de>,
Vlastimil Babka <vbabka@suse.cz>,
Joonsoo Kim <iamjoonsoo.kim@lge.com>,
Jason Gunthorpe <jgg@ziepe.ca>, Marc Zyngier <maz@kernel.org>,
Linux ARM <linux-arm-kernel@lists.infradead.org>,
Will Deacon <will.deacon@arm.com>,
James Morse <james.morse@arm.com>,
James Morris <jmorris@namei.org>
Subject: Re: dax alignment problem on arm64 (and other achitectures)
Date: Fri, 29 Jan 2021 11:24:21 -0500
Message-ID: <CA+CK2bDJ3hrWoE91L2wpAk+Yu0_=GtYw=4gLDDD7mxs321b_aA@mail.gmail.com>
In-Reply-To: <92912784-f3a3-b5a5-2d45-4c86ae26315f@redhat.com>
On Fri, Jan 29, 2021 at 8:19 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 29.01.21 03:06, Pavel Tatashin wrote:
> >>> Might be related to the broken custom pfn_valid() implementation for
> >>> ZONE_DEVICE.
> >>>
> >>> https://lkml.kernel.org/r/1608621144-4001-1-git-send-email-anshuman.khandual@arm.com
> >>>
> >>> And essentially ignoring sub-section data in there for now as well (but
> >>> might not be that relevant yet). In addition, this might also be related to
> >>>
> >>> https://lkml.kernel.org/r/161058499000.1840162.702316708443239771.stgit@dwillia2-desk3.amr.corp.intel.com
> >>
> >> I will check it, and see what I find. I saw that panic almost a year
> >> ago, things might have changed since then.
> >
> > Hi David,
> >
> > There is no panic anymore, but I also can't offset by 2M anymore, the
> > minimum that works now is 16M, and if alignment is less than 16M
> > creating devdax device fails.
>
> I wonder why we get such different namespace sizes? Where do the
> differences come from? This looks very weird.
>
> >
> > So, I tried the new ARM64 patch that reduces section sizes, and two
> > alignments for pmem: regular 2G alignment, and 2G+16M alignment.
> > (subtracted 16M from the bottom)
> >
> > ***** 4K page, 6G RAM, 2G PRAM *****
> > BOOT:
> > 40000000-1bfffffff : System RAM
> > 1c0000000-23fffffff : namespace0.0
> > DEVDAX:
> > 40000000-1bfffffff : System RAM
> > 1c0000000-1c21fffff : namespace0.0
> > 1c2200000-23fffffff : dax0.0
> > HOTPLUG:
> > 40000000-1bfffffff : System RAM
> > 1c0000000-1c21fffff : namespace0.0
> > 1c8000000-23fffffff : dax0.0
> > 1c8000000-23fffffff : System RAM (kmem) 128M Wasted (Expected)
>
> The namespace spans 34MB??
>
> >
> > ***** 4K page, 6G-16M RAM, 2G+16M PRAM *****
> > BOOT:
> > 40000000-1beffffff : System RAM
> > 1bf000000-23fffffff : namespace0.0
> > DEVDAX:
> > 40000000-1beffffff : System RAM
> > 1bf000000-1c11fffff : namespace0.0
> > 1c1200000-23fffffff : dax0.0
> > HOTPLUG:
> > 40000000-1beffffff : System RAM
> > 1bf000000-1c11fffff : namespace0.0
> > 1c8000000-23fffffff : dax0.0
> > 1c8000000-23fffffff : System RAM (kmem) 144M Wasted (????)
>
> The namespace spans 34MB??
Right, this seems like a bug.
>
> >
> > ***** 64K page, 6G RAM, 2G PRAM *****
> > BOOT:
> > 40000000-1bfffffff : System RAM
> > 1c0000000-23fffffff : namespace0.0
> > DEVDAX:
> > 40000000-1bfffffff : System RAM
> > 1c0000000-1dfffffff : namespace0.0
> > 1e0000000-23fffffff : dax0.0
> > HOTPLUG:
> > 40000000-1bfffffff : System RAM
> > 1c0000000-1dfffffff : namespace0.0
>
> The namespace spans 512MB ?!? What?
This is because the section size is 512M with 64K pages.
>
> > 1e0000000-23fffffff : dax0.0
> > 1e0000000-23fffffff : System RAM (kmem) 512M Wasted (Expected)
> >
> > ***** 64K page, 6G-16M RAM, 2G+16M PRAM *****
> > BOOT:
> > 40000000-1beffffff : System RAM
> > 1bf000000-23fffffff : namespace0.0
> > DEVDAX:
> > 40000000-1beffffff : System RAM
> > 1bf000000-1bf3fffff : namespace0.0
> > 1bf400000-23fffffff : dax0.0
> > HOTPLUG:
> > 40000000-1beffffff : System RAM
> > 1bf000000-1bf3fffff : namespace0.0
>
> The namespace now consumes 4MB ?!?
>
> > 1c0000000-23fffffff : dax0.0
> > 1c0000000-23fffffff : System RAM (kmem) 16M Wasted (Optimal)
>
> Good :) I guess more optimal would be 2MB/0MB :)
Agreed, but for a 16M offset this is optimal, because 16M is smaller
than the section size.
>
> >
> > In all three cases only System RAM, namespace0.0, and dax0.0 were
> > printed from /proc/iomem.
> > BOOT: content of iomem right after boot
> > DEVDAX: content of iomem after devdax is created:
> > ndctl create-namespace --mode devdax -e namespace0.0
> > HOTPLUG: content of iomem after dax0.0 is hotplugged:
> > echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
> > echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
> >
> >
> > The most surprising part is why, with 4K pages and a 16M offset, 144M
> > is wasted. For whatever reason, when devdax is created, 34M is wasted
> > on the label? Something is wrong here. However, I am happy with the
> > 64K pages result, where only 16M is wasted. Of course, optimally we
> > should not be wasting any memory here, but it is still much better
> > than what we have now.
>
> Definitely, but we should try figuring out what's going on here. I
> assume on x86-64 it behaves differently?
Yes, we should root-cause this. I strongly suspect that an alignment
miscalculation somewhere causes this memory waste with the 16M offset.
I am also not sure why the 2M label size was increased, and why 16M is
now an alignment requirement.
I tested on x86 and got pretty much the same results as on ARM64: a 2M
offset is no longer allowed (16M is the minimum), and even with a 16M
offset, 144M is wasted. Here is the full QEMU command if anyone wants
to reproduce it:
KERNEL_PARAM='console=ttyS0 ip=dhcp'
KERNEL_PARAM+=' memmap=2G!8G'
#KERNEL_PARAM+=' memmap=2064M!8176M'
qemu-system-x86_64 \
    -m 8G -smp 1 \
    -machine q35 \
    -nographic \
    -enable-kvm \
    -kernel pmem/native/arch/x86/boot/bzImage \
    -initrd ../poky/build/tmp/deploy/images/qemux86-64/core-image-minimal-qemux86-64.cpio.gz \
    -chardev stdio,id=console,signal=off,mux=on \
    -mon chardev=console \
    -serial chardev:console \
    -netdev user,hostfwd=tcp::5000-:22,id=netdev0 \
    -device virtio-net-pci,netdev=netdev0 \
    -append "$KERNEL_PARAM"
Also, I am using the current master branch tip of ndctl:
root@qemux86-64:~# ndctl --version
71.2.gea014c0
***** 4K page, 6G RAM, 2G PRAM: kernel parameter memmap=2G!8G *****
BOOT:
100000000-1ffffffff : System RAM
200000000-27fffffff : Persistent Memory (legacy)
200000000-27fffffff : namespace0.0
DEVDAX:
100000000-1ffffffff : System RAM
200000000-27fffffff : Persistent Memory (legacy)
200000000-2021fffff : namespace0.0
202200000-27fffffff : dax0.0
HOTPLUG:
100000000-1ffffffff : System RAM
200000000-27fffffff : Persistent Memory (legacy)
200000000-2021fffff : namespace0.0
208000000-27fffffff : dax0.0
208000000-27fffffff : System RAM (kmem) (128M Wasted)
***** 4K page, 6G-16M RAM, 2G+16M PRAM: kernel parameter memmap=2064M!8176M *****
BOOT:
100000000-1feffffff : System RAM
1ff000000-27fffffff : Persistent Memory (legacy)
1ff000000-27fffffff : namespace0.0
DEVDAX:
100000000-1feffffff : System RAM
1ff000000-27fffffff : Persistent Memory (legacy)
1ff000000-2011fffff : namespace0.0
201200000-27fffffff : dax0.0
HOTPLUG:
100000000-1feffffff : System RAM
1ff000000-27fffffff : Persistent Memory (legacy)
1ff000000-2011fffff : namespace0.0
208000000-27fffffff : dax0.0
208000000-27fffffff : System RAM (kmem) (144M Wasted)
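For reference, the wasted amounts in the listings above are consistent with
simple alignment arithmetic: the hotplugged range appears to begin at the first
block-aligned address at or after the start of dax0.0, so everything from the
start of the pmem region up to that address is lost. A minimal Python sketch of
that calculation follows. It assumes a 128M hotplug block with 4K pages and
512M with 64K pages; the helper names align_up and wasted_bytes are mine and
are not kernel functions.

```python
MiB = 1 << 20

def align_up(addr, alignment):
    """Round addr up to the next multiple of alignment (a power of two)."""
    return (addr + alignment - 1) & ~(alignment - 1)

def wasted_bytes(pmem_start, dax_start, block_size):
    """Bytes from the start of the pmem region to the first address that
    can be hotplugged, assuming hotplug starts at a block_size boundary."""
    return align_up(dax_start, block_size) - pmem_start

# (label, pmem region start, dax0.0 start after DEVDAX, assumed block size)
cases = [
    ("arm64 4K,  2G",      0x1c0000000, 0x1c2200000, 128 * MiB),
    ("arm64 4K,  2G+16M",  0x1bf000000, 0x1c1200000, 128 * MiB),
    ("arm64 64K, 2G",      0x1c0000000, 0x1e0000000, 512 * MiB),
    ("arm64 64K, 2G+16M",  0x1bf000000, 0x1bf400000, 512 * MiB),
    ("x86   4K,  2G",      0x200000000, 0x202200000, 128 * MiB),
    ("x86   4K,  2G+16M",  0x1ff000000, 0x201200000, 128 * MiB),
]

for label, pmem_start, dax_start, block in cases:
    waste = wasted_bytes(pmem_start, dax_start, block)
    print(f"{label}: {waste // MiB}M wasted")
```

Under these assumptions the 144M figure is just the 16M offset plus a full 128M
block, because the 34M namespace metadata pushes the start of dax0.0 past the
next block boundary.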
The least amount of wasted memory I can get on x86 with this
experiment is with an offset that is larger than 34M and 16M-aligned:
48M: memmap=2096M!8144M
root@qemux86-64:~# cat /proc/iomem | grep 'dax\|namespace\|System\|Pers'
100000000-1fcffffff : System RAM
1fd000000-27fffffff : Persistent Memory (legacy)
1fd000000-1ff1fffff : namespace0.0
200000000-27fffffff : dax0.0
200000000-27fffffff : System RAM (kmem) (48M Wasted)
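The memmap= values used in these runs follow directly from the offset: the
pmem region grows by the offset, and its start moves down from the 8G boundary
by the same amount. A small Python sketch of that arithmetic is below. The
helper name memmap_param is hypothetical, and the 2G size and 8G base are
simply the values from this experiment.

```python
MiB = 1 << 20

def memmap_param(offset_mib, size_mib=2048, base_mib=8192):
    """Build a memmap=<size>!<start> string for a pmem region that ends at
    base_mib + size_mib and starts offset_mib below base_mib."""
    size = size_mib + offset_mib
    start = base_mib - offset_mib
    return f"memmap={size}M!{start}M"

for offset in (0, 16, 48):
    start_addr = (8192 - offset) * MiB
    print(f"offset {offset:>2}M: {memmap_param(offset)} "
          f"(pmem starts at {start_addr:#x})")
```

With an offset of 0 this yields memmap=2048M!8192M, which is the same region as
memmap=2G!8G.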
Pasha
>
> Thanks
>
>
> --
> Thanks,
>
> David / dhildenb
>