From: David Hildenbrand
Organization: Red Hat GmbH
To: Pavel Tatashin, linux-mm, LKML, Sasha Levin, Tyler Hicks, Andrew Morton, Dan Williams, Michal Hocko, Oscar Salvador, Vlastimil Babka, Joonsoo Kim, Jason Gunthorpe, Marc Zyngier, Linux ARM, Will Deacon, James Morse, James Morris
Subject: Re: dax alignment problem on arm64 (and other achitectures)
Date: Wed, 27 Jan 2021 22:09:12 +0100
Message-ID: <8c2b75fe-a3e5-8eff-7f37-5d23c7ad9742@redhat.com>

On 27.01.21 21:43, Pavel Tatashin wrote:
> This is something that Dan Williams and I discussed off the mailing
> list sometime ago, but I want to have a broader discussion about this
> problem so I could send out a fix that would be acceptable.
>
> We have a 2G pmem device that is carved out of regular memory that we
> use to pass data across reboots. After the machine is rebooted we

Ordinary reboots or kexec-style reboots? I assume the latter, because
otherwise there is no guarantee about persistence, right?

I remember for kexec-style reboots there is a different approach (using
tmpfs) on the list.

> hotplug that memory back, so we do not lose 2G of system memory
> (machine is small, only 8G of RAM total).
>
> In order to hotplug pmem memory it first must be converted to devdax.
> Devdax has a label 2M in size that is placed at the beginning of the
> pmem device memory which brings the problem.
>
> The section size is a hotplugging unit on Linux. Whatever gets
> hot-plugged or hot-removed must be section size aligned. On x86
> section size is 128M on arm64 it is 1G (because arm64 supports 64K
> pages, and 128M does not work with 64K pages). Because the first 2M

Note that it's soon 128M with 4k and 16k base pages and 512MB with 64k.
The arm64 patch for that is already queued.

> are subtracted from the pmem device to create devdax, that actual
> hot-pluggable memory is not 1G/128M aligned, and instead we lose 126M
> on x86 or 1022M on arm64 of memory that is getting hot-plugged, the
> whole first section is skipped when memory gets hot plugged because of
> 2M label.
>
> As a workaround, so we do not lose 1022M out of 8G of memory on arm64
> we have section size reduced to 128M. We are using this patch [1].
> This way we are losing 126M (which I still hate!)
>
> I would like to get rid of this workaround. First, because I would
> like us to switch to 64K pages to gain performance, and second so we
> do not depend on an unofficial patch which already has given us some
> headache with kdump support.

I'd want to see 128M sections on arm64 with 64k base pages. "How?" you
might ask. One idea would be to switch from 512M THP to 2MB THP (using
cont pages), and instead implement 512MB gigantic pages. Then we can
reduce pageblock_order / MAX_ORDER - 1 and no longer have the section
limitations. Stuff for the future, though (if even ever).

>
> Here are some solutions that I think we can do:
>
> 1. Instead of carving the memory at 1G aligned address, do it at 1G -
> 2M address, this way when devdax is created it is perfectly 1G
> aligned. On ARM64 it causes a panic because there is a 2M hole in
> memory. Even if panic is fixed, I do not think this is a proper fix.
> This is simply a workaround to the underlying problem.

I remember arm64 already has to deal with all different kinds of memory
holes (including huge ones). I don't think this should be a fundamental
issue.

I think it might be a reasonable thing to do for such a special use
case. Does it work on x86-64?

>
> 2. Dan Williams introduced subsections [2]. They, however do not work
> with devdax, and hot-plugging in general. Those patches take care of
> __add_pages() side of things, and not add_memory().
> Also, it is unclear what kind of user interface changes need to be made
> in order to enable subsection features to online/offline pages.

I am absolutely no fan of teaching add_memory() and friends in general
about sub-sections.

>
> 3. Allow to hot plug daxdev together with the label, but teach the
> kernel not to touch label (i.e. allocate its memory). IMO, kind of
> ugly solution, because when devdax is hot-plugged it is not even aware
> of label size. But, perhaps that can be changed.

I mean, we could teach add_memory() to "skip the first X pages" when
onlining/offlining, not exposing them to the buddy. Something similar we
already do with Oscar's vmemmap-on-memory series.

But I guess the issue is that the memmap for the label etc. is already
allocated? Is the label memremapped ZONE_DEVICE memory or what is it? Is
the label exposed in the resource tree?

In case "it's just untouched/unexposed memory", it's fairly simple. In
case the label is exposed as ZONE_DEVICE already, it's more of an issue
and might require further tweaks.

>
> 4. Other ideas? (move dax label to the end? a special case without a
> label? label outside of data?)

What does the label include in your example? Sorry, I have no idea about
devdax labels.

I read "ndctl-create-namespace" - "--no-autolabel: Manage labels for
legacy NVDIMMs that do not support labels". So I assume there is at
least some theoretical way to not have a label on the memory?

>
> Thank you,
> Pasha
>
> [1] https://lore.kernel.org/lkml/20190423203843.2898-1-pasha.tatashin@soleen.com
> [2] https://lore.kernel.org/lkml/156092349300.979959.17603710711957735135.stgit@dwillia2-desk3.amr.corp.intel.com
>

-- 
Thanks,

David / dhildenb
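
The 126M and 1022M figures quoted in the thread follow from rounding
the post-label range to whole sections. A minimal sketch of that
arithmetic, assuming the pmem region starts section-aligned (the 4G base
is an arbitrary choice) and that only whole sections can be hotplugged.
The function name and its parameters are hypothetical and are not kernel
interfaces:

```python
MiB = 1 << 20


def hotplug_loss(start, size, label, section):
    """Return (usable, lost) bytes when only whole sections can be hotplugged.

    start   -- physical start of the pmem region (assumed value)
    size    -- size of the pmem region
    label   -- bytes reserved at the start of the region for the label
    section -- memory section size, i.e. the hotplug granularity
    """
    data_start = start + label
    end = start + size
    # Round the data start up and the region end down to section boundaries.
    first = -(-data_start // section) * section
    last = (end // section) * section
    usable = max(0, last - first)
    lost = (end - data_start) - usable
    return usable, lost


if __name__ == "__main__":
    base = 4096 * MiB  # assumed section-aligned base address
    for section in (128 * MiB, 1024 * MiB):
        usable, lost = hotplug_loss(base, 2048 * MiB, 2 * MiB, section)
        print(f"section {section // MiB}M: usable {usable // MiB}M, lost {lost // MiB}M")
```

With a 2G region and a 2M label this gives a loss of 126M for 128M
sections and 1022M for 1G sections, matching the numbers above.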