amd-gfx.lists.freedesktop.org archive mirror
 help / color / mirror / Atom feed
From: Greg KH <gregkh@linuxfoundation.org>
To: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com>
Cc: daniel.vetter@ffwll.ch, michel@daenzer.net,
	dri-devel@lists.freedesktop.org, ppaalanen@gmail.com,
	amd-gfx@lists.freedesktop.org, Daniel Vetter <daniel@ffwll.ch>,
	ckoenig.leichtzumerken@gmail.com, alexdeucher@gmail.com
Subject: Re: [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal
Date: Mon, 22 Jun 2020 18:45:51 +0200	[thread overview]
Message-ID: <20200622164551.GA112181@kroah.com> (raw)
In-Reply-To: <fdaebe5b-3930-66d6-4f62-3e59e515e3da@amd.com>

On Mon, Jun 22, 2020 at 12:07:25PM -0400, Andrey Grodzovsky wrote:
> 
> On 6/22/20 7:21 AM, Greg KH wrote:
> > On Mon, Jun 22, 2020 at 11:51:24AM +0200, Daniel Vetter wrote:
> > > On Sun, Jun 21, 2020 at 02:03:05AM -0400, Andrey Grodzovsky wrote:
> > > > Track sysfs files in a list so they all can be removed during pci remove
> > > > since otherwise their removal after that causes crash because parent
> > > > folder was already removed during pci remove.
> > Huh?  That should not happen, do you have a backtrace of that crash?
> 
> 
> 2 examples in the attached trace.

Odd, how did you trigger these?


> [  925.738225 <    0.188086>] BUG: kernel NULL pointer dereference, address: 0000000000000090
> [  925.738232 <    0.000007>] #PF: supervisor read access in kernel mode
> [  925.738236 <    0.000004>] #PF: error_code(0x0000) - not-present page
> [  925.738240 <    0.000004>] PGD 0 P4D 0 
> [  925.738245 <    0.000005>] Oops: 0000 [#1] SMP PTI
> [  925.738249 <    0.000004>] CPU: 7 PID: 2547 Comm: amdgpu_test Tainted: G        W  OE     5.5.0-rc7-dev-kfd+ #50
> [  925.738256 <    0.000007>] Hardware name: System manufacturer System Product Name/RAMPAGE IV FORMULA, BIOS 4804 12/30/2013
> [  925.738266 <    0.000010>] RIP: 0010:kernfs_find_ns+0x18/0x110
> [  925.738270 <    0.000004>] Code: a6 cf ff 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 41 57 41 56 49 89 f6 41 55 41 54 49 89 fd 55 53 49 89 d4 <0f> b7 af 90 00 00 00 8b 05 8f ee 6b 01 48 8b 5f 68 66 83 e5 20 41
> [  925.738282 <    0.000012>] RSP: 0018:ffffad6d0118fb00 EFLAGS: 00010246
> [  925.738287 <    0.000005>] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 2098a12076864b7e
> [  925.738292 <    0.000005>] RDX: 0000000000000000 RSI: ffffffffb6606b31 RDI: 0000000000000000
> [  925.738297 <    0.000005>] RBP: ffffffffb6606b31 R08: ffffffffb5379d10 R09: 0000000000000000
> [  925.738302 <    0.000005>] R10: ffffad6d0118fb38 R11: ffff9a75f64820a8 R12: 0000000000000000
> [  925.738307 <    0.000005>] R13: 0000000000000000 R14: ffffffffb6606b31 R15: ffff9a7612b06130
> [  925.738313 <    0.000006>] FS:  00007f3eca4e8700(0000) GS:ffff9a763dbc0000(0000) knlGS:0000000000000000
> [  925.738319 <    0.000006>] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  925.738323 <    0.000004>] CR2: 0000000000000090 CR3: 0000000035e5a005 CR4: 00000000000606e0
> [  925.738329 <    0.000006>] Call Trace:
> [  925.738334 <    0.000005>]  kernfs_find_and_get_ns+0x2e/0x50
> [  925.738339 <    0.000005>]  sysfs_remove_group+0x25/0x80
> [  925.738344 <    0.000005>]  sysfs_remove_groups+0x29/0x40
> [  925.738350 <    0.000006>]  free_msi_irqs+0xf5/0x190
> [  925.738354 <    0.000004>]  pci_disable_msi+0xe9/0x120

So the PCI core is trying to clean up attributes that it had registered,
which is fine.  But we can't seem to find the attributes?  Were they
already removed somewhere else?

that's odd.

> [  925.738406 <    0.000052>]  amdgpu_irq_fini+0xe3/0xf0 [amdgpu]
> [  925.738453 <    0.000047>]  tonga_ih_sw_fini+0xe/0x30 [amdgpu]
> [  925.738490 <    0.000037>]  amdgpu_device_fini_late+0x14b/0x440 [amdgpu]
> [  925.738529 <    0.000039>]  amdgpu_driver_release_kms+0x16/0x40 [amdgpu]
> [  925.738548 <    0.000019>]  drm_dev_put+0x5b/0x80 [drm]
> [  925.738558 <    0.000010>]  drm_release+0xc6/0xd0 [drm]
> [  925.738563 <    0.000005>]  __fput+0xc6/0x260
> [  925.738568 <    0.000005>]  task_work_run+0x79/0xb0
> [  925.738573 <    0.000005>]  do_exit+0x3d0/0xc60
> [  925.738578 <    0.000005>]  do_group_exit+0x47/0xb0
> [  925.738583 <    0.000005>]  get_signal+0x18b/0xc30
> [  925.738589 <    0.000006>]  do_signal+0x36/0x6a0
> [  925.738593 <    0.000004>]  ? force_sig_info_to_task+0xbc/0xd0
> [  925.738597 <    0.000004>]  ? signal_wake_up_state+0x15/0x30
> [  925.738603 <    0.000006>]  exit_to_usermode_loop+0x6f/0xc0
> [  925.738608 <    0.000005>]  prepare_exit_to_usermode+0xc7/0x110
> [  925.738613 <    0.000005>]  ret_from_intr+0x25/0x35
> [  925.738617 <    0.000004>] RIP: 0033:0x417369
> [  925.738621 <    0.000004>] Code: Bad RIP value.
> [  925.738625 <    0.000004>] RSP: 002b:00007ffdd6bf0900 EFLAGS: 00010246
> [  925.738629 <    0.000004>] RAX: 00007f3eca509000 RBX: 000000000000001e RCX: 00007f3ec95ba260
> [  925.738634 <    0.000005>] RDX: 00007f3ec9889790 RSI: 000000000000000a RDI: 0000000000000000
> [  925.738639 <    0.000005>] RBP: 00007ffdd6bf0990 R08: 00007f3ec9889780 R09: 00007f3eca4e8700
> [  925.738645 <    0.000006>] R10: 000000000000035c R11: 0000000000000246 R12: 00000000021c6170
> [  925.738650 <    0.000005>] R13: 00007ffdd6bf0c00 R14: 0000000000000000 R15: 0000000000000000
> 
> 
> 
> 
> [   40.880899 <    0.000004>] BUG: kernel NULL pointer dereference, address: 0000000000000090
> [   40.880906 <    0.000007>] #PF: supervisor read access in kernel mode
> [   40.880910 <    0.000004>] #PF: error_code(0x0000) - not-present page
> [   40.880915 <    0.000005>] PGD 0 P4D 0 
> [   40.880920 <    0.000005>] Oops: 0000 [#1] SMP PTI
> [   40.880924 <    0.000004>] CPU: 1 PID: 2526 Comm: amdgpu_test Tainted: G        W  OE     5.5.0-rc7-dev-kfd+ #50
> [   40.880932 <    0.000008>] Hardware name: System manufacturer System Product Name/RAMPAGE IV FORMULA, BIOS 4804 12/30/2013
> [   40.880941 <    0.000009>] RIP: 0010:kernfs_find_ns+0x18/0x110
> [   40.880945 <    0.000004>] Code: a6 cf ff 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 41 57 41 56 49 89 f6 41 55 41 54 49 89 fd 55 53 49 89 d4 <0f> b7 af 90 00 00 00 8b 05 8f ee 6b 01 48 8b 5f 68 66 83 e5 20 41
> [   40.880957 <    0.000012>] RSP: 0018:ffffaf3380467ba8 EFLAGS: 00010246
> [   40.880963 <    0.000006>] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 2098a12076864b7e
> [   40.880968 <    0.000005>] RDX: 0000000000000000 RSI: ffffffffc0678cfc RDI: 0000000000000000
> [   40.880973 <    0.000005>] RBP: ffffffffc0678cfc R08: ffffffffaa379d10 R09: 0000000000000000
> [   40.880979 <    0.000006>] R10: ffffaf3380467be0 R11: ffff93547615d128 R12: 0000000000000000
> [   40.880984 <    0.000005>] R13: 0000000000000000 R14: ffffffffc0678cfc R15: ffff93549be86130
> [   40.880990 <    0.000006>] FS:  00007fd9ecb10700(0000) GS:ffff9354bd840000(0000) knlGS:0000000000000000
> [   40.880996 <    0.000006>] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   40.881001 <    0.000005>] CR2: 0000000000000090 CR3: 0000000072866001 CR4: 00000000000606e0
> [   40.881006 <    0.000005>] Call Trace:
> [   40.881011 <    0.000005>]  kernfs_find_and_get_ns+0x2e/0x50
> [   40.881016 <    0.000005>]  sysfs_remove_group+0x25/0x80
> [   40.881055 <    0.000039>]  amdgpu_device_fini_late+0x3eb/0x440 [amdgpu]
> [   40.881095 <    0.000040>]  amdgpu_driver_release_kms+0x16/0x40 [amdgpu]

Here is this is your driver doing the same thing, removing attributes it
created.  But again they are not there.

So something went through and wiped the tree clean, which if I'm reading
this correctly, your patch would not solve as you would try to also
remove attributes that were already removed, right?

And 5.5-rc7 is a bit old (6 months and many thousands of changes ago),
does this still happen on a modern, released, kernel?

thanks,

greg k-h
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

  reply	other threads:[~2020-06-22 17:28 UTC|newest]

Thread overview: 97+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-06-21  6:03 [PATCH v2 0/8] RFC Support hot device unplug in amdgpu Andrey Grodzovsky
2020-06-21  6:03 ` [PATCH v2 1/8] drm: Add dummy page per device or GEM object Andrey Grodzovsky
2020-06-22  9:35   ` Daniel Vetter
2020-06-22 14:21     ` Pekka Paalanen
2020-06-22 14:24       ` Daniel Vetter
2020-06-22 14:28         ` Pekka Paalanen
2020-11-09 20:34     ` Andrey Grodzovsky
2020-11-15  6:39     ` Andrey Grodzovsky
2020-06-22 13:18   ` Christian König
2020-06-22 14:23     ` Daniel Vetter
2020-06-22 14:32     ` Andrey Grodzovsky
2020-06-22 17:45       ` Christian König
2020-06-22 17:50         ` Daniel Vetter
2020-11-09 20:53           ` Andrey Grodzovsky
2020-11-13 20:52           ` Andrey Grodzovsky
2020-11-14  8:41             ` Christian König
2020-11-14  9:51               ` Daniel Vetter
2020-11-14  9:57                 ` Daniel Vetter
2020-11-16  9:42                   ` Michel Dänzer
2020-11-15  6:34                 ` Andrey Grodzovsky
2020-11-16  9:48                   ` Christian König
2020-11-16 19:00                     ` Andrey Grodzovsky
2020-11-16 20:36                       ` Christian König
2020-11-16 20:42                         ` Andrey Grodzovsky
2020-11-19 10:01                           ` Christian König
2020-06-21  6:03 ` [PATCH v2 2/8] drm/ttm: Remap all page faults to per process dummy page Andrey Grodzovsky
2020-06-22  9:41   ` Daniel Vetter
2020-06-24  3:31     ` Andrey Grodzovsky
2020-06-24  7:19       ` Daniel Vetter
2020-11-10 17:41     ` Andrey Grodzovsky
2020-06-22 19:30   ` Christian König
2020-06-21  6:03 ` [PATCH v2 3/8] drm/ttm: Add unampping of the entire device address space Andrey Grodzovsky
2020-06-22  9:45   ` Daniel Vetter
2020-06-23  5:00     ` Andrey Grodzovsky
2020-06-23 10:25       ` Daniel Vetter
2020-06-23 12:55         ` Christian König
2020-06-22 19:37   ` Christian König
2020-06-22 19:47   ` Alex Deucher
2020-06-21  6:03 ` [PATCH v2 4/8] drm/amdgpu: Split amdgpu_device_fini into early and late Andrey Grodzovsky
2020-06-22  9:48   ` Daniel Vetter
2020-11-12  4:19     ` Andrey Grodzovsky
2020-11-12  9:29       ` Daniel Vetter
2020-06-21  6:03 ` [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal Andrey Grodzovsky
2020-06-22  9:51   ` Daniel Vetter
2020-06-22 11:21     ` Greg KH
2020-06-22 16:07       ` Andrey Grodzovsky
2020-06-22 16:45         ` Greg KH [this message]
2020-06-23  4:51           ` Andrey Grodzovsky
2020-06-23  6:05             ` Greg KH
2020-06-24  3:04               ` Andrey Grodzovsky
2020-06-24  6:11                 ` Greg KH
2020-06-25  1:52                   ` Andrey Grodzovsky
2020-11-10 17:54                   ` Andrey Grodzovsky
2020-11-10 17:59                     ` Greg KH
2020-11-11 15:13                       ` Andrey Grodzovsky
2020-11-11 15:34                         ` Greg KH
2020-11-11 15:45                           ` Andrey Grodzovsky
2020-11-11 16:06                             ` Greg KH
2020-11-11 16:34                               ` Andrey Grodzovsky
2020-12-02 15:48                           ` Andrey Grodzovsky
2020-12-02 17:34                             ` Greg KH
2020-12-02 18:02                               ` Andrey Grodzovsky
2020-12-02 18:20                                 ` Greg KH
2020-12-02 18:40                                   ` Andrey Grodzovsky
2020-06-22 13:19   ` Christian König
2020-06-21  6:03 ` [PATCH v2 6/8] drm/amdgpu: Unmap entire device address space on device remove Andrey Grodzovsky
2020-06-22  9:56   ` Daniel Vetter
2020-06-22 19:38   ` Christian König
2020-06-22 19:48     ` Alex Deucher
2020-06-23 10:22       ` Daniel Vetter
2020-06-23 13:16         ` Christian König
2020-06-24  3:12           ` Andrey Grodzovsky
2020-06-21  6:03 ` [PATCH v2 7/8] drm/amdgpu: Fix sdma code crash post device unplug Andrey Grodzovsky
2020-06-22  9:55   ` Daniel Vetter
2020-06-22 19:40   ` Christian König
2020-06-23  5:11     ` Andrey Grodzovsky
2020-06-23  7:14       ` Christian König
2020-06-21  6:03 ` [PATCH v2 8/8] drm/amdgpu: Prevent any job recoveries after device is unplugged Andrey Grodzovsky
2020-06-22  9:53   ` Daniel Vetter
2020-11-17 18:38     ` Andrey Grodzovsky
2020-11-17 18:52       ` Daniel Vetter
2020-11-17 19:18         ` Andrey Grodzovsky
2020-11-17 19:49           ` Daniel Vetter
2020-11-17 20:07             ` Andrey Grodzovsky
2020-11-18  7:39               ` Daniel Vetter
2020-11-18 12:01                 ` Christian König
2020-11-18 15:43                   ` Luben Tuikov
2020-11-18 16:20                   ` Andrey Grodzovsky
2020-11-19  7:55                     ` Christian König
2020-11-19 15:02                       ` Andrey Grodzovsky
2020-11-19 15:29                         ` Daniel Vetter
2020-11-19 21:24                           ` Andrey Grodzovsky
2020-11-18  0:46             ` Luben Tuikov
2020-06-22  9:46 ` [PATCH v2 0/8] RFC Support hot device unplug in amdgpu Daniel Vetter
2020-06-23  5:14   ` Andrey Grodzovsky
2020-06-23  9:04     ` Michel Dänzer
2020-06-24  3:21       ` Andrey Grodzovsky

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200622164551.GA112181@kroah.com \
    --to=gregkh@linuxfoundation.org \
    --cc=Andrey.Grodzovsky@amd.com \
    --cc=alexdeucher@gmail.com \
    --cc=amd-gfx@lists.freedesktop.org \
    --cc=ckoenig.leichtzumerken@gmail.com \
    --cc=daniel.vetter@ffwll.ch \
    --cc=daniel@ffwll.ch \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=michel@daenzer.net \
    --cc=ppaalanen@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).