From: Zhouyi Zhou <zhouzhouyi@gmail.com>
To: paulmck@kernel.org
Cc: linux-kernel@vger.kernel.org, rostedt@goodmis.org,
	akaher@vmware.com, shuah@kernel.org, rcu@vger.kernel.org
Subject: Re: [BUG] unable to handle page fault for address
Date: Thu, 20 Jul 2023 03:03:53 +0800
Message-ID: <CAABZP2xAi0FdEjmSV-DU_Pefk=Jya=XvL0OwDQspKg-jnq_fLQ@mail.gmail.com>
In-Reply-To: <60f88589-5f9f-4221-82f9-5c9c11fb5d95@paulmck-laptop>

On Thu, Jul 20, 2023 at 1:40 AM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> Hello!
>
> Just FYI for the moment, I hit a BUG in 10-hour overnight qemu/KVM-based
> rcutorture testing of the TASKS01, TASKS03, TREE03, and TREE07 scenarios
> on the mainline commit:
>
> ccff6d117d8d ("Merge tag 'perf-tools-fixes-for-v6.5-1-2023-07-18' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools")
>
> This commit was merged with the -rcu "dev" branch, which just yesterday
> passed overnight tests when based on v6.5-rc1.  A web search got me the
> following, which is not a very close match, but the best that there is:
>
> https://www.spinics.net/lists/bpf/msg91848.html
>
> Here is a representative splat out of 29 of them:
>
> [22544.494701] BUG: unable to handle page fault for address: fffffe0000af0038
> [22544.496012] #PF: supervisor read access in user mode
> [22544.496985] #PF: error_code(0x0000) - not-present page
> [22544.497970] IDT: 0xfffffe0000000000 (limit=0xfff) GDT: 0xfffffe355b3fd000 (limit=0x7f)
> [22544.499479] LDTR: NULL
> [22544.499959] TR: 0x40 -- base=0xfffffe355b3ff000 limit=0x4087
> [22544.501073] PGD 1ffd2067 P4D 1ffd2067 PUD 1124067 PMD 0
> [22544.502149] Oops: 0000 [#1] PREEMPT SMP PTI
> [22544.502967] CPU: 0 PID: 1 Comm: init Not tainted 6.5.0-rc2-00128-g68052f2f9e35 #5844
> [22544.504435] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
> [22544.506581] RIP: 0033:0x400181
> [22544.507256] Code: 00 45 31 d2 4c 8d 44 24 f0 48 c7 44 24 f8 00 00 00 00 0f 05 b8 60 00 00 00 48 8d 7c 24 e0 0f 05 41 89 c0 85 c0 75 c6 44 89 c0 <89> c2 0f af d0 ff c0 48 63 d2 48 89 15 06 01 20 00 3d a0 86 01 00
> [22544.510954] RSP: 002b:00007ffde5192288 EFLAGS: 00010283
> [22544.511963] RAX: 000000000000ebe9 RBX: 0000000000000000 RCX: 00000000004001a7
> [22544.513386] RDX: ffffffffd963c240 RSI: 0000000000000000 RDI: 00007ffde5192258
> [22544.514751] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
> [22544.516079] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
> [22544.517452] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> [22544.518818] FS:  0000000000000000 GS:  0000000000000000
> [22544.519864] Modules linked in:
> [22544.520653] CR2: fffffe0000af0038
> [22544.521401] ---[ end trace 0000000000000000 ]---
> [22544.522297] RIP: 0033:0x400181
> [22544.522887] RSP: 002b:00007ffde5192288 EFLAGS: 00010283
> [22544.523887] RAX: 000000000000ebe9 RBX: 0000000000000000 RCX: 00000000004001a7
> [22544.525257] RDX: ffffffffd963c240 RSI: 0000000000000000 RDI: 00007ffde5192258
> [22544.526623] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
> [22544.528016] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
> [22544.529439] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> [22544.530909] FS:  0000000000000000(0000) GS:ffff8d101f400000(0000) knlGS:0000000000000000
> [22544.532564] CS:  0033 DS: 0000 ES: 0000 CR0: 0000000080050033
> [22544.533760] CR2: fffffe0000af0038 CR3: 0000000002d8e000 CR4: 00000000000006f0
> [22544.535286] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [22544.536738] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [22544.538228] note: init[1] exited with irqs disabled
> [22544.540087] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009
> [22544.544806] Kernel Offset: 0x9600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>
> The failure rates for each scenario are as follows:
>
> TASKS01: 13 of 15
> TASKS03: 7 of 15
> TREE03: 4 of 15
> TREE07: 5 of 15
>
> In other words, for TASKS01, there were 15 10-hour runs, and 13 of them
> hit this bug.
>
> Except that these failures were correlated:
>
> o       TASKS01 and TASKS03 failures are in batch 17 running on
>         kerneltest053.05.atn6.
>
> o       TREE03 and one of the TREE07 failures are in batch 4 running
>         on kerneltest025.05.atn6.
>
> o       Five of the TREE07 failures are in batch 6 running on
>         kerneltest010.05.atn6.
>
> All of the failures in a given batch happened at about the same time,
> suggesting a reaction to some stimulus from the host system.  There
> was no console output from any of these host systems during the test,
> other than the usual complaint about qemu having an executable stack.
>
> Now, perhaps this is a hardware or configuration failure on those three
> systems.  But I have been using them for quite some time, and so why
> would three fail at the same time?
>
> Bisection will not be easy given that there were only three actual
> failures in ten hours over 20 batches.  If this is in fact a random
> kernel failure, there would be almost a 4% chance of a false positive
> on each bisection step, which gives about a 50% chance of bisecting to
> the wrong place if the usual 16 steps are required.
Hi Paul,

I am also very interested in tracking down this bug.
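By the way, your ~50% estimate checks out numerically: taking the per-step
false-positive chance as 4% and assuming the 16 steps are independent,

  $ awk 'BEGIN { print 1 - 0.96^16 }'
  0.479597

so roughly a coin flip that the bisection converges on the wrong commit.
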
A moment ago I invoked ./tools/testing/selftests/rcutorture/bin/torture.sh
on my 16-core Intel(R) Xeon(R) CPU E5-2660 server to see what happens.
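If that full sweep turns out to be too broad, a narrower run of just the
four failing scenarios would presumably be something along these lines
(the exact flag spellings here are from memory, so please treat them as
approximate rather than authoritative):

  tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus \
      --configs "TASKS01 TASKS03 TREE03 TREE07" --duration 10h
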
Thanx, Zhouyi
>
> I will give this another try this evening, Pacific Time.  In the
> meantime, FYI.
>
>                                                 Thanx, Paul
