From: Heinrich Schuchardt <xypron.glpk@gmx.de>
To: Sean Anderson <seanga2@gmail.com>
Cc: Lukas Auer <lukas.auer@aisec.fraunhofer.de>,
	U-Boot Mailing List <u-boot@lists.denx.de>,
	Atish Patra <atishp@atishpatra.org>,
	Anup Patel <anup@brainfault.org>, Bin Meng <bmeng.cn@gmail.com>,
	Leo Liang <ycliang@andestech.com>, rick <rick@andestech.com>,
	Nikita Shubin <nikita.shubin@maquefel.me>,
	Rick Chen <rickchen36@gmail.com>
Subject: Re: RISCV: the mechanism of available_harts may cause other harts boot failure
Date: Mon, 5 Sep 2022 17:41:09 +0200	[thread overview]
Message-ID: <53ef4762-eb1d-043c-69de-a621eb3806d2@gmx.de> (raw)
In-Reply-To: <c9de327a-a19b-2a59-0c06-7c55bc854476@gmail.com>

On 9/5/22 17:30, Sean Anderson wrote:
> On 9/5/22 3:47 AM, Nikita Shubin wrote:
>> Hi Rick!
>>
>> On Mon, 5 Sep 2022 14:22:41 +0800
>> Rick Chen <rickchen36@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> When I free-run an SMP system, I once hit a failure case where some
>>> harts didn't boot to the kernel shell successfully.
>>> However, I can't reproduce it any more, even after many retries.
>>>
>>> But when I set a breakpoint while debugging with GDB, the failure
>>> is triggered every time.
>>
>> If a hart fails to register itself in available_harts before the
>> main hart reaches send_ipi_many:
>> https://elixir.bootlin.com/u-boot/v2022.10-rc3/source/arch/riscv/lib/smp.c#L50
>>
>> it will never exit secondary_hart_loop:
>> https://elixir.bootlin.com/u-boot/v2022.10-rc3/source/arch/riscv/cpu/start.S#L433
>> as no IPI will ever be sent to it.
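
To make the race concrete, here is a minimal sketch of the two sides
(rendered as C from memory; amo_or(), ipi_pending() and wfi() are
illustrative stand-ins, not the literal U-Boot code):

	/* Secondary hart, early in start.S: set its bit in the mask,
	 * then spin in secondary_hart_loop waiting for an IPI.
	 */
	amo_or(&gd->arch.available_harts, 1UL << hartid);
	while (!ipi_pending(hartid))
		wfi();

	/* Boot hart, send_ipi_many() in arch/riscv/lib/smp.c
	 * (abridged): only harts whose bit is already set are
	 * signalled. A hart that registers late never gets an IPI.
	 */
	for (reg = 0; reg < CONFIG_NR_CPUS; reg++) {
		if (!(gd->arch.available_harts & (1UL << reg)))
			continue;
		riscv_send_ipi(reg);
	}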

Can we call send_ipi_many() again when booting?
Do we need to call it before booting?
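
For illustration only, such a retry might look like the sketch below
(purely hypothetical placement; smp_call_function() is the existing
U-Boot wrapper around send_ipi_many(), but calling it again here is
an assumption, not existing code):

	/* Hypothetical: sweep the mask once more right before the
	 * hand-off, so a hart that registered late still receives
	 * the entry point. addr/arg0/arg1 as in the original call.
	 */
	smp_call_function(addr, arg0, arg1, 0);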

Best regards

Heinrich

>>
>> This might be exactly your case.
>
> When working on the IPI mechanism, I considered this possibility. However,
> there's really no way to know how long to wait. On normal systems, the boot
> hart is going to do a lot of work before calling send_ipi_many, and the
> other harts just have to make it through ~100 instructions. So I figured we
> would never run into this issue.
>
> We might not even need the mask... the only direct reason we might is
> OpenSBI, since spl_invoke_opensbi is the only caller that uses the wait
> parameter.
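
As a rough sketch of why the wait path needs the mask (from memory;
riscv_get_ipi() does exist in U-Boot, but the loop below is an
abridged rendering, not the literal source):

	/* send_ipi_many(..., wait): after signalling a hart, the
	 * boot hart polls until that hart has consumed its IPI.
	 * Without the mask it would poll hart IDs that never run
	 * this code and spin forever.
	 */
	if (wait) {
		do {
			ret = riscv_get_ipi(reg, &pending);
			if (ret)
				return ret;
		} while (pending);
	}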
>
>>> I think the available_harts mechanism does not guarantee that the
>>> SMP system will boot successfully.
>>> Maybe we should think of a better way to handle SMP booting, or
>>> just remove it?
>>
>> I haven't experienced any unexplained problems with hart_lottery or
>> available_harts_lock unless:
>>
>> 1) harts are started non-simultaneously
>> 2) SPL/U-Boot is in some kind of TCM, OCRAM, etc. that is not cleared
>> on reset, which leaves available_harts dirty
>
> XIP, of course, has this problem every time and simply doesn't use the
> mask. I remember thinking a lot about how to deal with this, but I never
> ended up sending a patch because I didn't have an XIP system.
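
For context, the registration is already compiled out for XIP as far
as I recall (sketched from memory, rendered as C; acquire()/release()
are illustrative stand-ins for the lock handling in start.S):

	/* XIP builds skip the mask entirely: it lives in writable
	 * RAM/TCM, which an XIP image cannot assume is cleared.
	 */
	#ifndef CONFIG_XIP
		acquire(&available_harts_lock);
		gd->arch.available_harts |= 1UL << hartid;
		release(&available_harts_lock);
	#endif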
>
> --Sean
>
>> 3) something is wrong with atomics
>>
>> Also, there might be something wrong with IPI send/receive.
>>
>>>
>>> Thread 8 hit Breakpoint 1, harts_early_init ()
>>>
>>> (gdb) c
>>> Continuing.
>>> [Switching to Thread 7]
>>>
>>> Thread 7 hit Breakpoint 1, harts_early_init ()
>>>
>>> (gdb)
>>> Continuing.
>>> [Switching to Thread 6]
>>>
>>> Thread 6 hit Breakpoint 1, harts_early_init ()
>>>
>>> (gdb)
>>> Continuing.
>>> [Switching to Thread 5]
>>>
>>> Thread 5 hit Breakpoint 1, harts_early_init ()
>>>
>>> (gdb)
>>> Continuing.
>>> [Switching to Thread 4]
>>>
>>> Thread 4 hit Breakpoint 1, harts_early_init ()
>>>
>>> (gdb)
>>> Continuing.
>>> [Switching to Thread 3]
>>>
>>> Thread 3 hit Breakpoint 1, harts_early_init ()
>>> (gdb)
>>> Continuing.
>>> [Switching to Thread 2]
>>>
>>> Thread 2 hit Breakpoint 1, harts_early_init ()
>>> (gdb)
>>> Continuing.
>>> [Switching to Thread 1]
>>>
>>> Thread 1 hit Breakpoint 1, harts_early_init ()
>>> (gdb)
>>> Continuing.
>>> [Switching to Thread 5]
>>>
>>>
>>> Thread 5 hit Breakpoint 3, 0x0000000001200000 in ?? ()
>>> (gdb) info threads
>>>    Id   Target Id         Frame
>>>    1    Thread 1 (hart 1) secondary_hart_loop () at arch/riscv/cpu/start.S:436
>>>    2    Thread 2 (hart 2) secondary_hart_loop () at arch/riscv/cpu/start.S:436
>>>    3    Thread 3 (hart 3) secondary_hart_loop () at arch/riscv/cpu/start.S:436
>>>    4    Thread 4 (hart 4) secondary_hart_loop () at arch/riscv/cpu/start.S:436
>>> * 5    Thread 5 (hart 5) 0x0000000001200000 in ?? ()
>>>    6    Thread 6 (hart 6) 0x000000000000b650 in ?? ()
>>>    7    Thread 7 (hart 7) 0x000000000000b650 in ?? ()
>>>    8    Thread 8 (hart 8) 0x0000000000005fa0 in ?? ()
>>> (gdb) c
>>> Continuing.
>>
>> Do all the "offline" harts remain in the SPL/U-Boot secondary_hart_loop?
>>
>>>
>>>
>>>
>>> [    0.175619] smp: Bringing up secondary CPUs ...
>>> [    1.230474] CPU1: failed to come online
>>> [    2.282349] CPU2: failed to come online
>>> [    3.334394] CPU3: failed to come online
>>> [    4.386783] CPU4: failed to come online
>>> [    4.427829] smp: Brought up 1 node, 4 CPUs
>>>
>>>
>>> /root # cat /proc/cpuinfo
>>> processor       : 0
>>> hart            : 4
>>> isa     : rv64i2p0m2p0a2p0c2p0xv5-1p1
>>> mmu             : sv39
>>>
>>> processor       : 5
>>> hart            : 5
>>> isa     : rv64i2p0m2p0a2p0c2p0xv5-1p1
>>> mmu             : sv39
>>>
>>> processor       : 6
>>> hart            : 6
>>> isa     : rv64i2p0m2p0a2p0c2p0xv5-1p1
>>> mmu             : sv39
>>>
>>> processor       : 7
>>> hart            : 7
>>> isa     : rv64i2p0m2p0a2p0c2p0xv5-1p1
>>> mmu             : sv39
>>>
>>> /root #
>>>
>>> Thanks,
>>> Rick
>>
>

