From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-10.0 required=3.0 tests=HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI, SIGNED_OFF_BY,SPF_HELO_NONE,SPF_PASS,UNPARSEABLE_RELAY,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 4B85FC433DF for ; Thu, 25 Jun 2020 22:35:42 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 11497206BE for ; Thu, 25 Jun 2020 22:35:41 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 11497206BE Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=linux.alibaba.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 8DE126B0003; Thu, 25 Jun 2020 18:35:41 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 890376B0005; Thu, 25 Jun 2020 18:35:41 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 7A46D6B0006; Thu, 25 Jun 2020 18:35:41 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0072.hostedemail.com [216.40.44.72]) by kanga.kvack.org (Postfix) with ESMTP id 607A16B0003 for ; Thu, 25 Jun 2020 18:35:41 -0400 (EDT) Received: from smtpin24.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay05.hostedemail.com (Postfix) with ESMTP id F2492181ABE86 for ; Thu, 25 Jun 2020 22:35:40 +0000 (UTC) X-FDA: 76969192440.24.wing41_260f78726e50 Received: from filter.hostedemail.com (10.5.16.251.rfc1918.com [10.5.16.251]) by smtpin24.hostedemail.com (Postfix) with ESMTP id CFA4C1A4A0 for ; Thu, 25 Jun 2020 22:35:40 +0000 (UTC) X-HE-Tag: wing41_260f78726e50 X-Filterd-Recvd-Size: 6336 Received: from out30-56.freemail.mail.aliyun.com (out30-56.freemail.mail.aliyun.com [115.124.30.56]) by imf33.hostedemail.com (Postfix) with ESMTP for ; Thu, 25 Jun 2020 22:35:39 +0000 (UTC) X-Alimail-AntiSpam:AC=PASS;BC=-1|-1;BR=01201311R761e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=e01e04407;MF=richard.weiyang@linux.alibaba.com;NM=1;PH=DS;RN=7;SR=0;TI=SMTPD_---0U0iFBoh_1593124536; Received: from localhost(mailfrom:richard.weiyang@linux.alibaba.com fp:SMTPD_---0U0iFBoh_1593124536) by smtp.aliyun-inc.com(127.0.0.1); Fri, 26 Jun 2020 06:35:36 +0800 From: Wei Yang To: akpm@linux-foundation.org, osalvador@suse.de, dan.j.williams@intel.com Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, david@redhat.com, Wei Yang Subject: [Patch v2] mm/sparse: never partially remove memmap for early section Date: Fri, 26 Jun 2020 06:35:34 +0800 Message-Id: <20200625223534.18024-1-richard.weiyang@linux.alibaba.com> X-Mailer: git-send-email 2.20.1 (Apple Git-117) MIME-Version: 1.0 X-Rspamd-Queue-Id: CFA4C1A4A0 X-Spamd-Result: default: False [0.00 / 100.00] X-Rspamd-Server: rspam05 Content-Transfer-Encoding: quoted-printable X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: For early sections, its memmap is handled specially even sub-section is enabled. The memmap could only be populated as a whole. Quoted from the comment of section_activate(): * The early init code does not consider partially populated * initial sections, it simply assumes that memory will never be * referenced. If we hot-add memory into such a section then we * do not need to populate the memmap and can simply reuse what * is already there. While current section_deactivate() breaks this rule. When hot-remove a sub-section, section_deactivate() would depopulate its memmap. The consequence is if we hot-add this subsection again, its memmap never get proper populated. We can reproduce the case by following steps: 1. Hacking qemu to allow sub-section early section diff --git a/hw/i386/pc.c b/hw/i386/pc.c index 51b3050d01..c6a78d83c0 100644 --- a/hw/i386/pc.c +++ b/hw/i386/pc.c @@ -1010,7 +1010,7 @@ void pc_memory_init(PCMachineState *pcms, } machine->device_memory->base =3D - ROUND_UP(0x100000000ULL + x86ms->above_4g_mem_size, 1 * G= iB); + 0x100000000ULL + x86ms->above_4g_mem_size; if (pcmc->enforce_aligned_dimm) { /* size device region assuming 1G page max alignment per = slot */ 2. Bootup qemu with PSE disabled and a sub-section aligned memory size Part of the qemu command would look like this: sudo x86_64-softmmu/qemu-system-x86_64 \ --enable-kvm -cpu host,pse=3Doff \ -m 4160M,maxmem=3D20G,slots=3D1 \ -smp sockets=3D2,cores=3D16 \ -numa node,nodeid=3D0,cpus=3D0-1 -numa node,nodeid=3D1,cpus=3D2-3 = \ -machine pc,nvdimm \ -nographic \ -object memory-backend-ram,id=3Dmem0,size=3D8G \ -device nvdimm,id=3Dvm0,memdev=3Dmem0,node=3D0,addr=3D0x144000000,= label-size=3D128k 3. Re-config a pmem device with sub-section size in guest ndctl create-namespace --force --reconfig=3Dnamespace0.0 --mode=3Ddevd= ax --size=3D16M Then you would see the following call trace: pmem0: detected capacity change from 0 to 16777216 BUG: unable to handle page fault for address: ffffec73c51000b4 #PF: supervisor write access in kernel mode #PF: error_code(0x0002) - not-present page PGD 81ff8067 P4D 81ff8067 PUD 81ff7067 PMD 1437cb067 PTE 0 Oops: 0002 [#1] SMP NOPTI CPU: 16 PID: 1348 Comm: ndctl Kdump: loaded Tainted: G W = 5.8.0-rc2+ #24 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0= -0-gf21b5a4aeb02-prebuilt.qemu.4 RIP: 0010:memmap_init_zone+0x154/0x1c2 Code: 77 16 f6 40 10 02 74 10 48 03 48 08 48 89 cb 48 c1 eb 0c e9 3a f= f ff ff 48 89 df 48 c1 e7 06 48f RSP: 0018:ffffbdc7011a39b0 EFLAGS: 00010282 RAX: ffffec73c5100088 RBX: 0000000000144002 RCX: 0000000000144000 RDX: 0000000000000004 RSI: 007ffe0000000000 RDI: ffffec73c5100080 RBP: 027ffe0000000000 R08: 0000000000000001 R09: ffff9f8d38f6d708 R10: ffffec73c0000000 R11: 0000000000000000 R12: 0000000000000004 R13: 0000000000000001 R14: 0000000000144200 R15: 0000000000000000 FS: 00007efe6b65d780(0000) GS:ffff9f8d3f780000(0000) knlGS:0000000000= 000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffec73c51000b4 CR3: 000000007d718000 CR4: 0000000000340ee0 Call Trace: move_pfn_range_to_zone+0x128/0x150 memremap_pages+0x4e4/0x5a0 devm_memremap_pages+0x1e/0x60 dev_dax_probe+0x69/0x160 [device_dax] really_probe+0x298/0x3c0 driver_probe_device+0xe1/0x150 ? driver_allows_async_probing+0x50/0x50 bus_for_each_drv+0x7e/0xc0 __device_attach+0xdf/0x160 bus_probe_device+0x8e/0xa0 device_add+0x3b9/0x740 __devm_create_dev_dax+0x127/0x1c0 __dax_pmem_probe+0x1f2/0x219 [dax_pmem_core] dax_pmem_probe+0xc/0x1b [dax_pmem] nvdimm_bus_probe+0x69/0x1c0 [libnvdimm] really_probe+0x147/0x3c0 driver_probe_device+0xe1/0x150 device_driver_attach+0x53/0x60 bind_store+0xd1/0x110 kernfs_fop_write+0xce/0x1b0 vfs_write+0xb6/0x1a0 ksys_write+0x5f/0xe0 do_syscall_64+0x4d/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug") Signed-off-by: Wei Yang Acked-by: David Hildenbrand --- mm/sparse.c | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/mm/sparse.c b/mm/sparse.c index b2b9a3e34696..a06085738295 100644 --- a/mm/sparse.c +++ b/mm/sparse.c @@ -825,10 +825,14 @@ static void section_deactivate(unsigned long pfn, u= nsigned long nr_pages, ms->section_mem_map &=3D ~SECTION_HAS_MEM_MAP; } =20 - if (section_is_early && memmap) - free_map_bootmem(memmap); - else + /* + * The memmap of early sections is always fully populated. See + * section_activate() and pfn_valid() . + */ + if (!section_is_early) depopulate_section_memmap(pfn, nr_pages, altmap); + else if (memmap) + free_map_bootmem(memmap); =20 if (empty) ms->section_mem_map =3D (unsigned long)NULL; --=20 2.20.1 (Apple Git-117)