linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: John Garry <john.garry@huawei.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: "devicetree@vger.kernel.org" <devicetree@vger.kernel.org>,
	"Anshuman Khandual" <anshuman.khandual@arm.com>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Will Deacon <will.deacon@arm.com>, <linux-kernel@vger.kernel.org>,
	Linuxarm <linuxarm@huawei.com>, Rob Herring <robh+dt@kernel.org>,
	Frank Rowand <frowand.list@gmail.com>,
	Ingo Molnar <mingo@redhat.com>,
	"linux-arm-kernel@lists.infradead.org" 
	<linux-arm-kernel@lists.infradead.org>,
	<suravee.suthikulpanit@amd.com>
Subject: Re: Crash report: Broken NUMA distance map causes crash on arm64 system
Date: Fri, 2 Nov 2018 10:10:48 +0000	[thread overview]
Message-ID: <d6d7485a-a7c9-5327-fb47-f279d260ec44@huawei.com> (raw)
In-Reply-To: <20181102093904.GJ3178@hirez.programming.kicks-ass.net>

>>
>> static void free_sched_groups(struct sched_group *sg, int free_sgc)
>> {
>> ...
>>     do {
>>         tmp = sg->next;
>>
>>         if (free_sgc && atomic_dec_and_test(&sg->sgc->ref))***
>>             kfree(sg->sgc);
>>
>> ...
>> }
>>
>> *** crash occurs when free_sgc is non-zero and sg->sgc is NULL
>
> Yeah, turns out to be random memory corruption; I've had the crash in a
> number of weird places; also GCC version dependent.
>
> KASAN is awesome and pinpointed the problem though.
>
>> And, as I mentioned earlier, I bisected this problem to 58d5af59d55b.
>
> You mean:
>
>   051f3ca02e46 ("sched/topology: Introduce NUMA identity node sched domain")
>
> right? and yes indeed! The below fixes my reproducer:
>

Yes, that's the one.

>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 9d74371e4aad..039578429c25 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1337,7 +1348,7 @@ void sched_init_numa(void)
>  	int level = 0;
>  	int i, j, k;
>
> -	sched_domains_numa_distance = kzalloc(sizeof(int) * nr_node_ids, GFP_KERNEL);
> +	sched_domains_numa_distance = kzalloc(sizeof(int) * (nr_node_ids + 1), GFP_KERNEL);

Good find.

>  	if (!sched_domains_numa_distance)
>  		return;
>
>

So what about this:
 >>> I also note that if I apply the patch, below, to reject the invalid 
NUMA
 >>> distance, we're still getting a warning/error:
 >>>
 >>> [    7.144407] CPU: All CPU(s) started at EL2
 >>> [    7.148678] alternatives: patching kernel code
 >>> [    7.153557] ERROR: Node-0 not representative
 >>> [    7.153557]
 >>> [    7.159365]   10 15 20 25
 >>> [    7.162097]   15 10 25 30
 >>> [    7.164832]   20 25 10 15
 >>> [    7.167562]   25 30 15 10
 >>
 >> Yeah, that's an 'obviously' broken topology too.
 >>
 >
 > AFAICT, this conforms to ACPI spec SLIT rules, and the kernel SLIT
 > validation allows this also. So maybe we should shout louder here or
 > even mark the SLIT as invalid if totally broken.
 >

I plan to fix up OF map parsing to reject invalid distance maps.

However is this distance map so broken for the scheduler that it's 
better to reject this "valid" distance map also? If not, it may be nice 
for the user to know about it without having to enable scheduler debugging.

Thanks,
John


  reply	other threads:[~2018-11-02 10:11 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-10-23 10:30 Crash report: Broken NUMA distance map causes crash on arm64 system John Garry
2018-10-25 11:01 ` John Garry
2018-10-30  9:26 ` Peter Zijlstra
2018-10-30  9:55   ` John Garry
2018-10-30 15:35     ` John Garry
2018-10-31 20:46       ` Peter Zijlstra
2018-11-01 10:01         ` John Garry
2018-11-02  9:39           ` Peter Zijlstra
2018-11-02 10:10             ` John Garry [this message]
2018-11-02 10:50           ` Peter Zijlstra
2018-11-02 12:08             ` John Garry
2018-11-02 12:19               ` Peter Zijlstra
2018-11-07 18:42                 ` John Garry

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=d6d7485a-a7c9-5327-fb47-f279d260ec44@huawei.com \
    --to=john.garry@huawei.com \
    --cc=anshuman.khandual@arm.com \
    --cc=catalin.marinas@arm.com \
    --cc=devicetree@vger.kernel.org \
    --cc=frowand.list@gmail.com \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linuxarm@huawei.com \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=robh+dt@kernel.org \
    --cc=suravee.suthikulpanit@amd.com \
    --cc=will.deacon@arm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).