All of lore.kernel.org
 help / color / mirror / Atom feed
From: Linus Torvalds <torvalds@linux-foundation.org>
To: Fengguang Wu <fengguang.wu@intel.com>,
	Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Network Development <netdev@vger.kernel.org>,
	"David S. Miller" <davem@davemloft.net>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	intel-wired-lan@lists.osuosl.org
Subject: Re: [vlan_device_event] BUG: unable to handle kernel paging request at 6b6b6ccf
Date: Wed, 8 Nov 2017 08:20:38 -0800	[thread overview]
Message-ID: <CA+55aFwK7935r325nmR-eQvanVCN3A+v3wEQHsW4NpSeBsybeA@mail.gmail.com> (raw)
In-Reply-To: <20171108094832.qxvkawpw2snpcbvh@wfg-t540p.sh.intel.com>

On Wed, Nov 8, 2017 at 1:48 AM, Fengguang Wu <fengguang.wu@intel.com> wrote:
>
> Now I got the faddr2line output. :)

Thank you, but this also shows that you then compress the output too
much for convenience.

> [  745.719623] BUG: unable to handle kernel paging request at 6b6b6f4f
> [  745.732871] IP: vlan_device_event at net/8021q/vlan.h:60

So this looks more legible than "vlan_device_event+0x7f5/0xa40" (or
whatever), but in fact it's not.

The reason faddr2line is great is because it gives inlining
information, so that you can see exactly which access it is, even if
it's inside some helper inline function (macros still obviously are
going to be problematic).

And you've cut the whole information down, to the point where there is
_less_ information than there used to be.

So the full faddr2line output is actually important, and should look
something like this:

    __schedule+0x315/0x840:
    rq_sched_info_arrive at kernel/sched/stats.h:13
     (inlined by) sched_info_arrive at kernel/sched/stats.h:99
     (inlined by) __sched_info_switch at kernel/sched/stats.h:151
     (inlined by) sched_info_switch at kernel/sched/stats.h:158
     (inlined by) prepare_task_switch at kernel/sched/core.c:2582
     (inlined by) context_switch at kernel/sched/core.c:2755
     (inlined by) __schedule at kernel/sched/core.c:3366

which is admittedly a fairly extreme case, but it shows how involved
things can be.

Note how in my example above the access itself is from a header file
(that inlined rq_sched_info_arrive() function), and I happened to pick
the

                rq->rq_sched_info.pcount++;

line. But Many inline functions are inlined from different points,
often many times in the same top-level function (think of the atomic
helper inlines, for example).

In your case, that net/8021q/vlan.h:60 information is literally not
helping, because it only shows what I could already figure out from
looking at the constants in the disassembly: the code comes from the

        vlan_group_for_each_dev(...) {
                ..

loop setup (it's the inline __vlan_group_get_device() function, and
the constants I could recognize are the VLAN_GROUP_ARRAY_PART_LEN
things.

But exactly like the constants didn't tell me *which* invocation of
that loop it was, your "net/8021q/vlan.h:60" doesn't tell me which one
it is.

There's at least _eight_ different "vlan_group_for_each_dev() {" loops
in vlan_device_event(), and maybe it's not all that meaningful which
of the eight it is in this case, in other cases it's critical.

And since you've actually replaced the "vlan_device_event+0x7f5/0xa40"
with that "net/8021q/vlan.h:60" entirely, all the offset information
that *could* maybe have been used to pick one case over another (but
likely not) is also gone.

That's why that inlining chain is important - so that we can see how
it got to that access in __vlan_group_get_device().

So please do keep _all_ of the faddr2line output, at least for the
"IP:" line (the stack traces might not be worth it).

Anyway, it does all mean that the use-after-free almost certainly is
that "array[]" access, which was the main suspect to begin with.

I'm adding Jeff Kirsher and the e1000 list to the cc, since apparently
the e1000 driver is involved. Of course, all the stack traces are in
generic networking code, so it is possibly a generic networking issue,
but it would probably be good to have a few people look at it.

But all of this code is ancient. So it's almost certainly not a new
bug, whatever it is.

But as a test-case for your faddr2line improvements, this is excellent!

                   Linus

WARNING: multiple messages have this Message-ID
From: Linus Torvalds <torvalds@linux-foundation.org>
To: intel-wired-lan@osuosl.org
Subject: [Intel-wired-lan] [vlan_device_event] BUG: unable to handle kernel paging request at 6b6b6ccf
Date: Wed, 8 Nov 2017 08:20:38 -0800	[thread overview]
Message-ID: <CA+55aFwK7935r325nmR-eQvanVCN3A+v3wEQHsW4NpSeBsybeA@mail.gmail.com> (raw)
In-Reply-To: <20171108094832.qxvkawpw2snpcbvh@wfg-t540p.sh.intel.com>

On Wed, Nov 8, 2017 at 1:48 AM, Fengguang Wu <fengguang.wu@intel.com> wrote:
>
> Now I got the faddr2line output. :)

Thank you, but this also shows that you then compress the output too
much for convenience.

> [  745.719623] BUG: unable to handle kernel paging request at 6b6b6f4f
> [  745.732871] IP: vlan_device_event at net/8021q/vlan.h:60

So this looks more legible than "vlan_device_event+0x7f5/0xa40" (or
whatever), but in fact it's not.

The reason faddr2line is great is because it gives inlining
information, so that you can see exactly which access it is, even if
it's inside some helper inline function (macros still obviously are
going to be problematic).

And you've cut the whole information down, to the point where there is
_less_ information than there used to be.

So the full faddr2line output is actually important, and should look
something like this:

    __schedule+0x315/0x840:
    rq_sched_info_arrive at kernel/sched/stats.h:13
     (inlined by) sched_info_arrive at kernel/sched/stats.h:99
     (inlined by) __sched_info_switch at kernel/sched/stats.h:151
     (inlined by) sched_info_switch at kernel/sched/stats.h:158
     (inlined by) prepare_task_switch at kernel/sched/core.c:2582
     (inlined by) context_switch at kernel/sched/core.c:2755
     (inlined by) __schedule at kernel/sched/core.c:3366

which is admittedly a fairly extreme case, but it shows how involved
things can be.

Note how in my example above the access itself is from a header file
(that inlined rq_sched_info_arrive() function), and I happened to pick
the

                rq->rq_sched_info.pcount++;

line. But Many inline functions are inlined from different points,
often many times in the same top-level function (think of the atomic
helper inlines, for example).

In your case, that net/8021q/vlan.h:60 information is literally not
helping, because it only shows what I could already figure out from
looking at the constants in the disassembly: the code comes from the

        vlan_group_for_each_dev(...) {
                ..

loop setup (it's the inline __vlan_group_get_device() function, and
the constants I could recognize are the VLAN_GROUP_ARRAY_PART_LEN
things.

But exactly like the constants didn't tell me *which* invocation of
that loop it was, your "net/8021q/vlan.h:60" doesn't tell me which one
it is.

There's at least _eight_ different "vlan_group_for_each_dev() {" loops
in vlan_device_event(), and maybe it's not all that meaningful which
of the eight it is in this case, in other cases it's critical.

And since you've actually replaced the "vlan_device_event+0x7f5/0xa40"
with that "net/8021q/vlan.h:60" entirely, all the offset information
that *could* maybe have been used to pick one case over another (but
likely not) is also gone.

That's why that inlining chain is important - so that we can see how
it got to that access in __vlan_group_get_device().

So please do keep _all_ of the faddr2line output, at least for the
"IP:" line (the stack traces might not be worth it).

Anyway, it does all mean that the use-after-free almost certainly is
that "array[]" access, which was the main suspect to begin with.

I'm adding Jeff Kirsher and the e1000 list to the cc, since apparently
the e1000 driver is involved. Of course, all the stack traces are in
generic networking code, so it is possibly a generic networking issue,
but it would probably be good to have a few people look at it.

But all of this code is ancient. So it's almost certainly not a new
bug, whatever it is.

But as a test-case for your faddr2line improvements, this is excellent!

                   Linus

  reply	other threads:[~2017-11-08 16:20 UTC|newest]

Thread overview: 58+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-11-07 10:21 [vlan_device_event] BUG: unable to handle kernel paging request at 6b6b6ccf Fengguang Wu
2017-11-07 16:25 ` Linus Torvalds
2017-11-07 16:46   ` Fengguang Wu
2017-11-08  9:48   ` Fengguang Wu
2017-11-08 16:20     ` Linus Torvalds [this message]
2017-11-08 16:20       ` [Intel-wired-lan] " Linus Torvalds
2017-11-08 17:12       ` Fengguang Wu
2017-11-08 17:12         ` [Intel-wired-lan] " Fengguang Wu
2017-11-08 17:18         ` Fengguang Wu
2017-11-08 17:18           ` [Intel-wired-lan] " Fengguang Wu
2017-11-08 18:05         ` Linus Torvalds
2017-11-08 18:05           ` [Intel-wired-lan] " Linus Torvalds
2017-11-08 18:36         ` Alexander Duyck
2017-11-08 18:36           ` [Intel-wired-lan] " Alexander Duyck
2017-11-09  3:12           ` Fengguang Wu
2017-11-09  3:12             ` [Intel-wired-lan] " Fengguang Wu
2017-11-09  4:09             ` Fengguang Wu
2017-11-09  4:09               ` [Intel-wired-lan] " Fengguang Wu
2017-11-09  7:22               ` Fengguang Wu
2017-11-09  7:22                 ` [Intel-wired-lan] " Fengguang Wu
2017-11-09  6:34             ` Cong Wang
2017-11-09  6:34               ` [Intel-wired-lan] " Cong Wang
2017-11-09  6:55               ` Fengguang Wu
2017-11-09  6:55                 ` [Intel-wired-lan] " Fengguang Wu
2017-11-09  7:43                 ` Fengguang Wu
2017-11-09  7:43                   ` [Intel-wired-lan] " Fengguang Wu
2017-11-09 15:51               ` Girish Moodalbail
2017-11-09 15:51                 ` [Intel-wired-lan] " Girish Moodalbail
2017-11-10  0:16                 ` Cong Wang
2017-11-10  0:16                   ` [Intel-wired-lan] " Cong Wang
2017-11-12 19:31         ` Linus Torvalds
2017-11-12 19:31           ` [Intel-wired-lan] " Linus Torvalds
2017-11-13  1:13           ` CONFIG_DEBUG_INFO_SPLIT impacts on faddr2line Fengguang Wu
2017-11-13  1:13             ` [Intel-wired-lan] " Fengguang Wu
2017-11-13  2:05             ` Zhang Rui
2017-11-13  2:05               ` [Intel-wired-lan] " Zhang Rui
2017-11-13  2:22               ` Fengguang Wu
2017-11-13  2:22                 ` [Intel-wired-lan] " Fengguang Wu
2017-11-13 18:52             ` Andi Kleen
2017-11-13 18:52               ` [Intel-wired-lan] " Andi Kleen
2017-11-13 19:14               ` Linus Torvalds
2017-11-13 19:14                 ` [Intel-wired-lan] " Linus Torvalds
2017-11-13 20:10                 ` Andi Kleen
2017-11-13 20:10                   ` [Intel-wired-lan] " Andi Kleen
2017-11-13 20:14                   ` H.J. Lu
2017-11-13 20:14                     ` [Intel-wired-lan] " H.J. Lu
2017-11-13 20:56                   ` Linus Torvalds
2017-11-13 20:56                     ` [Intel-wired-lan] " Linus Torvalds
2017-11-13 21:41                     ` Andi Kleen
2017-11-13 21:41                       ` [Intel-wired-lan] " Andi Kleen
2017-11-13 21:57                       ` Linus Torvalds
2017-11-13 21:57                         ` [Intel-wired-lan] " Linus Torvalds
2017-11-13 23:51                         ` Andi Kleen
2017-11-13 23:51                           ` [Intel-wired-lan] " Andi Kleen
2017-11-14  8:13               ` Fengguang Wu
2017-11-14  8:13                 ` [Intel-wired-lan] " Fengguang Wu
2017-11-09  2:43     ` [vlan_device_event] BUG: unable to handle kernel paging request at 6b6b6ccf Fengguang Wu
2017-11-09  6:48       ` Fengguang Wu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CA+55aFwK7935r325nmR-eQvanVCN3A+v3wEQHsW4NpSeBsybeA@mail.gmail.com \
    --to=torvalds@linux-foundation.org \
    --cc=davem@davemloft.net \
    --cc=fengguang.wu@intel.com \
    --cc=intel-wired-lan@lists.osuosl.org \
    --cc=jeffrey.t.kirsher@intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=netdev@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.