From: Mike Rapoport <rppt@linux.ibm.com>
To: "Qian Cai (QUIC)" <quic_qiancai@quicinc.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	David Hildenbrand <david@redhat.com>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Anshuman Khandual <anshuman.khandual@arm.com>,
	Ard Biesheuvel <ardb@kernel.org>,
	Linux Memory Management List <linux-mm@kvack.org>,
	Will Deacon <will@kernel.org>, Marc Zyngier <maz@kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Linux ARM <linux-arm-kernel@lists.infradead.org>
Subject: Re: Arm64 crash while reading memory sysfs
Date: Wed, 26 May 2021 20:24:44 +0300
Message-ID: <YK6EXNZHY1xt7Kjs@linux.ibm.com>
In-Reply-To: <CY4PR0201MB35539FF5EE729283C4241F5A8E249@CY4PR0201MB3553.namprd02.prod.outlook.com>

On Wed, May 26, 2021 at 12:09:14PM +0000, Qian Cai (QUIC) wrote:
> > 
> > On Tue, May 25, 2021 at 03:25:59PM +0000, Qian Cai (QUIC) wrote:
> > > Reverting the patchset "arm64: drop pfn_valid_within() and simplify pfn_valid()" [1] from today's linux-next fixed a crash while
> > reading files under /sys/devices/system/memory.

Does the issue persist if you only revert the latest patch in the series?
In next-20210525 it would be commit 
89fb47db72f2 ("arm64-drop-pfn_valid_within-and-simplify-pfn_valid-fix")
and commit
dfe215e9bac2 ("arm64: drop pfn_valid_within() and simplify pfn_valid()").

> > Can you please send the beginning of the boot log, up to the
> > 	 "Memory: xK/yK available ..."
> > line?
> 
> [    0.000000] NUMA: Failed to initialise from firmware
> [    0.000000] NUMA: Faking a node at [mem 0x0000000090000000-0x0000009fffffffff]
> [    0.000000] NUMA: NODE_DATA [mem 0x9ffefbabc0-0x9ffefbffff]
> [    0.000000] Zone ranges:
> [    0.000000]   Normal   [mem 0x0000000090000000-0x0000009fffffffff]
> [    0.000000] Movable zone start for each node
> [    0.000000] Early memory node ranges
> [    0.000000]   node   0: [mem 0x0000000090000000-0x0000000091ffffff]
> [    0.000000]   node   0: [mem 0x0000000092000000-0x00000000928fffff]
> [    0.000000]   node   0: [mem 0x0000000092900000-0x00000000fffbffff]
> [    0.000000]   node   0: [mem 0x00000000fffc0000-0x00000000ffffffff]
> [    0.000000]   node   0: [mem 0x0000000880000000-0x0000000fffffffff]
> [    0.000000]   node   0: [mem 0x0000008800000000-0x0000009ff5aeffff]
> [    0.000000]   node   0: [mem 0x0000009ff5af0000-0x0000009ff5b2ffff]
> [    0.000000]   node   0: [mem 0x0000009ff5b30000-0x0000009ff5baffff]
> [    0.000000]   node   0: [mem 0x0000009ff5bb0000-0x0000009ff7deffff]
> [    0.000000]   node   0: [mem 0x0000009ff7df0000-0x0000009ff7e5ffff]
> [    0.000000]   node   0: [mem 0x0000009ff7e60000-0x0000009ff7ffffff]
> [    0.000000]   node   0: [mem 0x0000009ff8000000-0x0000009fffffffff]
> [    0.000000] Initmem setup node 0 [mem 0x0000000090000000-0x0000009fffffffff]
> [    0.000000] mem auto-init: stack:off, heap alloc:on, heap free:off
> [    0.000000] Memory: 777216K/133955584K available (17920K kernel code, 118786K rwdata, 4416K rodata, 6080K init, 67276K bss, 17379072K reserved, 0K cma-reserved)

The available and reserved sizes look weird. Can you post the log with
memblock=debug and mminit_loglevel=4 added to the kernel command line?
 
> > > [1] https://lore.kernel.org/kvmarm/20210511100550.28178-1-rppt@kernel.org/
> > >
> > > [  247.669668][ T1443] kernel BUG at include/linux/mm.h:1383!
> > > [  247.675987][ T1443] Internal error: Oops - BUG: 0 [#1] SMP
> > > [  247.681472][ T1443] Modules linked in: loop processor efivarfs ip_tables x_tables ext4 mbcache jbd2 dm_mod igb i2c_algo_bit
> > nvme mlx5_core i2c_core nvme_core firmware_class
> > > [  247.696894][ T1443] CPU: 15 PID: 1443 Comm: ranbug Not tainted 5.13.0-rc3-next-20210524+ #11
> > > [  247.705326][ T1443] Hardware name: MiTAC RAPTOR EV-883832-X3-0001/RAPTOR, BIOS 1.6 06/28/2020
> > > [  247.713842][ T1443] pstate: 60000005 (nZCv daif -PAN -UAO -TCO BTYPE=--)
> > > [  247.720536][ T1443] pc : test_pages_in_a_zone+0x23c/0x300
> > > [  247.725935][ T1443] lr : test_pages_in_a_zone+0x23c/0x300

Do we know what PFN triggers it? Can you please run with this patch:

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 70620d0dd923..b9d1dd0dae5f 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1443,6 +1443,12 @@ struct zone *test_pages_in_a_zone(unsigned long start_pfn,
 				i++;
 			if (i == MAX_ORDER_NR_PAGES || pfn + i >= end_pfn)
 				continue;
+
+			if (!pfn_valid(pfn))
+				pr_info("%s: pfn %lx is not valid\n", __func__, pfn);
+			else if (PagePoisoned(pfn_to_page(pfn)))
+				dump_page(pfn_to_page(pfn), "");
+
 			/* Check if we got outside of the zone */
 			if (zone && !zone_spans_pfn(zone, pfn + i))
 				return NULL;


-- 
Sincerely yours,
Mike.
