linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Marco Elver <elver@google.com>
To: Nick Desaulniers <ndesaulniers@google.com>, george.burgess.iv@gmail.com
Cc: Kan Liang <kan.liang@linux.intel.com>,
	clang-built-linux <clang-built-linux@googlegroups.com>,
	kasan-dev <kasan-dev@googlegroups.com>,
	kernel test robot <rong.a.chen@intel.com>,
	Peter Zijlstra <peterz@infradead.org>,
	LKML <linux-kernel@vger.kernel.org>, LKP <lkp@lists.01.org>,
	"Paul E. McKenney" <paulmck@kernel.org>,
	Dmitry Vyukov <dvyukov@google.com>,
	Alexander Potapenko <glider@google.com>,
	Andrey Konovalov <andreyknvl@google.com>
Subject: Re: [rcu] 2f08469563: BUG:kernel_reboot-without-warning_in_boot_stage
Date: Tue, 19 May 2020 20:32:39 +0200	[thread overview]
Message-ID: <CANpmjNOeXmD5E3O50Z3MjkiuCYaYOPyi+1rq=GZvEKwBvLR0Ug@mail.gmail.com> (raw)
In-Reply-To: <CANpmjNO0kDVW4uaLcOF95L3FKc8WjawJqXKQtYbCad+W2r=75g@mail.gmail.com>

On Tue, 19 May 2020 at 15:40, Marco Elver <elver@google.com> wrote:
>
> On Tue, 19 May 2020 at 12:16, Marco Elver <elver@google.com> wrote:
> >
> > On Mon, 18 May 2020 at 20:05, Marco Elver <elver@google.com> wrote:
> > >
> > > On Mon, 18 May 2020, 'Nick Desaulniers' via kasan-dev wrote:
> > >
> > > > On Mon, May 18, 2020 at 7:34 AM Marco Elver <elver@google.com> wrote:
> > > > >
> > > > > On Mon, 18 May 2020 at 14:44, Marco Elver <elver@google.com> wrote:
> > > > > >
> > > > > > [+Cc clang-built-linux FYI]
> > > > > >
> > > > > > On Mon, 18 May 2020 at 12:11, Marco Elver <elver@google.com> wrote:
> > > > > > >
> > > > > > > On Sun, 17 May 2020 at 05:47, Paul E. McKenney <paulmck@kernel.org> wrote:
> > > > > > > >
> > > > > > > > On Sun, May 17, 2020 at 09:17:32AM +0800, kernel test robot wrote:
> > > > > > > > > Greeting,
> > > > > > > > >
> > > > > > > > > FYI, we noticed the following commit (built with clang-11):
> > > > > > > > >
> > > > > > > > > commit: 2f08469563550d15cb08a60898d3549720600eee ("rcu: Mark rcu_state.ncpus to detect concurrent writes")
> > > > > > > > > https://git.kernel.org/cgit/linux/kernel/git/paulmck/linux-rcu.git dev.2020.05.14c
> > > > > > > > >
> > > > > > > > > in testcase: boot
> > > > > > > > >
> > > > > > > > > on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 8G
> > > > > > > > >
> > > > > > > > > caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > If you fix the issue, kindly add following tag
> > > > > > > > > Reported-by: kernel test robot <rong.a.chen@intel.com>
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > [    0.054943] BRK [0x05204000, 0x05204fff] PGTABLE
> > > > > > > > > [    0.061181] BRK [0x05205000, 0x05205fff] PGTABLE
> > > > > > > > > [    0.062403] BRK [0x05206000, 0x05206fff] PGTABLE
> > > > > > > > > [    0.065200] RAMDISK: [mem 0x7a247000-0x7fffffff]
> > > > > > > > > [    0.067344] ACPI: Early table checksum verification disabled
> > > > > > > > > BUG: kernel reboot-without-warning in boot stage
> > > > > > > >
> > > > > > > > I am having some difficulty believing that this commit is at fault given
> > > > > > > > that the .config does not list CONFIG_KCSAN=y, but CCing Marco Elver
> > > > > > > > for his thoughts.  Especially given that I have never built with clang-11.
> > > > > > > >
> > > > > > > > But this does invoke ASSERT_EXCLUSIVE_WRITER() in early boot from
> > > > > > > > rcu_init().  Might clang-11 have objections to early use of this macro?
> > > > > > >
> > > > > > > The macro is a noop without KCSAN. I think the bisection went wrong.
> > > > > > >
> > > > > > > I am able to reproduce a reboot-without-warning when building with
> > > > > > > Clang 11 and the provided config. I did a bisect, starting with v5.6
> > > > > > > (good), and found this:
> > > > > > > - Since v5.6, first bad commit is
> > > > > > > 20e2aa812620439d010a3f78ba4e05bc0b3e2861 (Merge tag
> > > > > > > 'perf-urgent-2020-04-12' of
> > > > > > > git://git.kernel.org/pub/scm/linux/kernel//git/tip/tip)
> > > > > > > - The actual commit that introduced the problem is
> > > > > > > 2b3b76b5ec67568da4bb475d3ce8a92ef494b5de (perf/x86/intel/uncore: Add
> > > > > > > Ice Lake server uncore support) -- reverting it fixes the problem.
> > > > >
> > > > > Some more clues:
> > > > >
> > > > > 1. I should have noticed that this uses CONFIG_KASAN=y.
> > > >
> > > > Thanks for the report, testing, and bisection.  I don't see any
> > > > smoking gun in the code.
> > > > https://godbolt.org/z/qbK26r
> > >
> > > My guess is data layout and maybe some interaction with KASAN. I also
> > > played around with leaving icx_mmio_uncores empty, meaning none of the
> > > data it refers to end up in the data section (presumably because
> > > optimized out), which resulted in making the bug disappear as well.
> > >
> > > > >
> > > > > 2. Something about function icx_uncore_mmio_init(). Making it a noop
> > > > > also makes the issue go away.
> > > > >
> > > > > 3. Leaving icx_uncore_mmio_init() a noop but removing the 'static'
> > > > > from icx_mmio_uncores also presents the issue. So this seems to be
> > > > > something about how/where icx_mmio_uncores is allocated.
> > > >
> > > > Can you share the disassembly of icx_uncore_mmio_init() in the given
> > > > configuration?
> > >
> > > ffffffff8102c097 <icx_uncore_mmio_init>:
> > > ffffffff8102c097:       e8 b4 52 bd 01          callq  ffffffff82c01350 <__fentry__>
> > > ffffffff8102c09c:       48 c7 c7 e0 55 c3 83    mov    $0xffffffff83c355e0,%rdi
> > > ffffffff8102c0a3:       e8 69 9a 3b 00          callq  ffffffff813e5b11 <__asan_store8>
> > > ffffffff8102c0a8:       48 c7 05 2d 95 c0 02    movq   $0xffffffff83c388e0,0x2c0952d(%rip)        # ffffffff83c355e0 <uncore_mmio_uncores>
> > > ffffffff8102c0af:       e0 88 c3 83
> > > ffffffff8102c0b3:       c3                      retq
> > >
> > > The problem still happens if we add a __no_sanitize_address (or even
> > > KASAN_SANITIZE := n) here. I think this function is a red herring: you
> > > can make this function be empty, but as long as icx_mmio_uncores and its
> > > dependencies are added to the data section somewhere, does the bug
> > > appear.
> >
> > I also tried to bisect Clang/LLVM, and found that
> > https://reviews.llvm.org/D78162 introduced the breaking change to
> > Clang/LLVM. Reverting that change results in a bootable kernel *with*
> > "perf/x86/intel/uncore: Add Ice Lake server uncore support" still
> > applied.
>
> I found that with Clang/LLVM change D78162, a bunch of memcpys are
> optimized into just a bunch of loads/stores. It may turn out that this
> is again a red herring, because the result is that more code is
> generated, affecting layout. So in the end, the Clang/LLVM bisection
> might just point at the first change that causes data layout to change
> in a way that triggers the bug.

This fixes the problem:
https://lkml.kernel.org/r/20200519182459.87166-1-elver@google.com

I suppose there are several things that happened that caused the above
bisected changes to trigger this. Hard to say how exactly the above
bisected changes caused this to manifest, because during early boot
(while uninitialized) KASAN may just randomly enter kasan_report()
before the branch (annotated with likely(), which is caught by the
branch tracer) prevents it from actually generating a report. However,
if it goes branch tracer -> KASAN -> branch tracers -> KASAN ..., then
we crash. If I had to guess some combination of different code gen,
different stack and/or data usage. So all the above bisected changes
(AFAIK) were red herrings. :-)

Thanks,
-- Marco

  reply	other threads:[~2020-05-19 18:32 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-05-17  1:17 [rcu] 2f08469563: BUG:kernel_reboot-without-warning_in_boot_stage kernel test robot
2020-05-17  3:47 ` Paul E. McKenney
2020-05-18 10:11   ` Marco Elver
2020-05-18 12:44     ` Marco Elver
2020-05-18 14:34       ` Marco Elver
2020-05-18 17:49         ` Nick Desaulniers
2020-05-18 18:05           ` Marco Elver
2020-05-19 10:16             ` Marco Elver
2020-05-19 13:40               ` Marco Elver
2020-05-19 18:32                 ` Marco Elver [this message]
2020-05-20 16:32                   ` Nick Desaulniers

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CANpmjNOeXmD5E3O50Z3MjkiuCYaYOPyi+1rq=GZvEKwBvLR0Ug@mail.gmail.com' \
    --to=elver@google.com \
    --cc=andreyknvl@google.com \
    --cc=clang-built-linux@googlegroups.com \
    --cc=dvyukov@google.com \
    --cc=george.burgess.iv@gmail.com \
    --cc=glider@google.com \
    --cc=kan.liang@linux.intel.com \
    --cc=kasan-dev@googlegroups.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=lkp@lists.01.org \
    --cc=ndesaulniers@google.com \
    --cc=paulmck@kernel.org \
    --cc=peterz@infradead.org \
    --cc=rong.a.chen@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).