* [Bug 150731] New: amdgpu: segfault on unbind in sysfs; card becomes nonresponsive
@ 2016-07-29 22:41 bugzilla-daemon
  2016-07-29 22:44 ` [Bug 150731] " bugzilla-daemon
                   ` (11 more replies)
  0 siblings, 12 replies; 13+ messages in thread
From: bugzilla-daemon @ 2016-07-29 22:41 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=150731

            Bug ID: 150731
           Summary: amdgpu: segfault on unbind in sysfs; card becomes
                    nonresponsive
           Product: Drivers
           Version: 2.5
    Kernel Version: 4.6.4
          Hardware: x86-64
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: Video(DRI - non Intel)
          Assignee: drivers_video-dri@kernel-bugs.osdl.org
          Reporter: JimiJames.Bove@gmail.com
        Regression: No

Full details here:
https://www.reddit.com/r/linux_gaming/comments/4udupx/nvidiaamd_support_questions/d5ovipc

Summary:
I'm using an R9 380. Others confirmed having this issue on the R9 285 and RX
480 (so, Tonga & Polaris 10 at least).

I can bind my video card to amdgpu, and that works. It crashes X, but when I
log back in, it's properly connected and everything.

However, if I try to unbind it, after waiting for a few seconds, I get a
segfault. Any subsequent attempts to do anything with that card in
sysfs--trying to unbind again, trying to bind to something else, etc.--will get
stuck forever, never segfaulting, because the card is not responding.

Removing the card (echo 1 > /sys/bus/pci/devices/0000:0X:00.0/remove) works,
but after a rescan (echo 1 > /sys/bus/pci/rescan), the card is no longer in
sysfs at all, as if it's been powered down. It can't be accessed by the system
in any way after that, until the computer reboots.
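
For reference, the unbind / remove / rescan sequence described above could be
sketched roughly as follows. This is only an illustration, not a tested
procedure for these cards: the function names and the SYSFS_ROOT variable are
hypothetical (SYSFS_ROOT exists only so the steps can be dry-run against a
scratch directory), and the device address has to be supplied by the caller.

```shell
#!/bin/bash
# Hypothetical helper functions; SYSFS_ROOT defaults to the real sysfs mount.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Unbind a PCI device from whatever driver currently owns it, if any.
pci_unbind() {
    local dev="$1"
    local drv="$SYSFS_ROOT/bus/pci/devices/$dev/driver"
    if [ -e "$drv" ]; then
        echo "$dev" > "$drv/unbind"
    fi
}

# Remove the device from the PCI tree, then ask the kernel to rescan the bus.
pci_remove_and_rescan() {
    local dev="$1"
    echo 1 > "$SYSFS_ROOT/bus/pci/devices/$dev/remove"
    echo 1 > "$SYSFS_ROOT/bus/pci/rescan"
}
```

Run as root, pci_unbind "$dev" corresponds to the step that segfaults here, and
pci_remove_and_rescan "$dev" to the step after which the card disappears.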

It may or may not be related to the "reset issues" bug:
http://vfio.blogspot.de/2015/04/progress-on-amd-front.html
https://lists.gnu.org/archive/html/qemu-devel/2015-04/msg03128.html
That bug officially only affects Hawaii and Bonaire, but Tonga cards (380, 285)
exhibit the same behavior even if it may not be for the same reason. Whether it
affects Polaris 10 (RX 480) is unknown. The RX 480 tester is currently finding
that out.

I also had this issue on 4.6.1, so it probably at least affects 4.6 in general.
Maybe all kernel versions that have amdgpu?

-- 
You are receiving this mail because:
You are watching the assignee of the bug.
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


* [Bug 150731] amdgpu: segfault on unbind in sysfs; card becomes nonresponsive
From: bugzilla-daemon @ 2016-07-29 22:44 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=150731

--- Comment #1 from Jimi <JimiJames.Bove@gmail.com> ---
Clarification: X doesn't crash at the moment the card is bound to amdgpu by
me; in fact, I never get the chance to bind it to amdgpu myself. X crashes when
I rescan PCI devices (after unbinding the card from whatever it was originally
bound to, which in my case is always vfio-pci because I pass it into a QEMU/KVM
virtual machine). When I log back in, the card has already been bound to
amdgpu automatically and successfully.


* [Bug 150731] amdgpu: segfault on unbind in sysfs; card becomes nonresponsive
From: bugzilla-daemon @ 2016-07-29 22:45 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=150731

--- Comment #2 from Jimi <JimiJames.Bove@gmail.com> ---
Another clarification: the behavior is the same if I don't bind the card to
amdgpu myself and let it be bound to amdgpu on boot, automatically, which is
how I usually test it.


* [Bug 150731] amdgpu: segfault on unbind in sysfs; card becomes nonresponsive
From: bugzilla-daemon @ 2016-08-09 16:50 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=150731

--- Comment #3 from Jimi <JimiJames.Bove@gmail.com> ---
I've now confirmed this issue on Fiji (R9 Fury) as well.


* [Bug 150731] amdgpu: segfault on unbind in sysfs; card becomes nonresponsive
From: bugzilla-daemon @ 2016-08-11 23:32 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=150731

--- Comment #4 from Jimi <JimiJames.Bove@gmail.com> ---
Created attachment 228411
  --> https://bugzilla.kernel.org/attachment.cgi?id=228411&action=edit
X crash log

Here are my Xorg logs for when I unbind from vfio-pci, remove, and rescan, and
X crashes and comes back with the card bound. The post-crash log is the
Xorg.0.log file, which just shows X loading a desktop that uses both cards
(although the AMD card has "Ignore" set to "true" since it's just meant for
running games with the DRI_PRIME variable), and the crash log is the
Xorg.0.log.old file, which captures the moment of the crash starting at time
[456.336].

You can see there aren't any errors in there. It seems to be just reconfiguring
the graphics because it noticed a new available card, and somehow that resulted
in me being booted back to the login screen. And according to all the tutorials
I've read on switching a card between vfio-pci and X, it shouldn't even be
doing this on its own. It should be waiting for me to bind the card to amdgpu
myself. Why is it doing it automatically and booting me out?


* [Bug 150731] amdgpu: segfault on unbind in sysfs; card becomes nonresponsive
From: bugzilla-daemon @ 2016-08-11 23:33 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=150731

--- Comment #5 from Jimi <JimiJames.Bove@gmail.com> ---
Created attachment 228421
  --> https://bugzilla.kernel.org/attachment.cgi?id=228421&action=edit
X post-crash log


* [Bug 150731] amdgpu: segfault on unbind in sysfs; card becomes nonresponsive
From: bugzilla-daemon @ 2016-08-11 23:46 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=150731

--- Comment #6 from Jimi <JimiJames.Bove@gmail.com> ---
Created attachment 228431
  --> https://bugzilla.kernel.org/attachment.cgi?id=228431&action=edit
dmesg log

And now I've tried unbinding it from amdgpu without X running at all, and of
course it didn't work, which confirms that this is a kernel bug. I've captured
the dmesg log, and as far as I can tell, the part of the log that pertains to
amdgpu is the stack trace starting at [1131.985756].


* [Bug 150731] amdgpu: segfault on unbind in sysfs; card becomes nonresponsive
From: bugzilla-daemon @ 2016-08-12  0:16 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=150731

--- Comment #7 from Jimi <JimiJames.Bove@gmail.com> ---
I should mention at this point, I think there are 2 different bugs going on.
One bug is making it impossible to unbind any cards from the driver, and
another bug is making X immediately bind itself to an amdgpu card the instant
it becomes available and crash. The former is definitely something wrong with
amdgpu in the kernel, but the latter could be X's fault--I don't know. Just in
case, I've filed a report for X, too:
https://bugs.freedesktop.org/show_bug.cgi?id=97313


* [Bug 150731] amdgpu: segfault on unbind in sysfs; card becomes nonresponsive
From: bugzilla-daemon @ 2016-08-12  1:09 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=150731

--- Comment #8 from Jimi <JimiJames.Bove@gmail.com> ---
Created attachment 228441
  --> https://bugzilla.kernel.org/attachment.cgi?id=228441&action=edit
dmesg log (amdgpu-pro)

And here's the dmesg log from testing this with amdgpu-pro (without X running),
with the crash starting at [137.003975].

amdgpu-pro exhibited almost exactly the same behavior. The only difference was
instead of getting a segfault after a few seconds, the terminal session that
unbound the card was immediately spammed with the dmesg stack trace in this
attached file.


* [Bug 150731] amdgpu: segfault on unbind in sysfs; card becomes nonresponsive
From: bugzilla-daemon @ 2016-08-16  9:21 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=150731

--- Comment #9 from Jimi <JimiJames.Bove@gmail.com> ---
Over at the X bug report, I've figured out that when X has AutoAddGPU turned
off, it doesn't crash (meaning that bug is not related to this bug). However,
the card is still automatically bound to amdgpu before I can bind it, which
means 2 things:
1. That's going to be a problem when I'm able to successfully unbinding my card
from amdgpu and will need to be able to respond PCI devices without it
auto-binding to amdgpu, because I'll want to bind it to vfio-pci.
2. That part of the X bug is actually its own bug, which makes sense, because
the card would immediately auto-bind to amdgpu if I tested things without X
running.

So, as far as this bug report is concerned, we actually have 2 bugs going on:
the card can't be unbound, and the card is automatically bound on a rescan,
stopping the user from having a choice in which driver it gets bound to. I
think these 2 issues just may be related. Maybe they even have the same cause?


* [Bug 150731] amdgpu: segfault on unbind in sysfs; card becomes nonresponsive
From: bugzilla-daemon @ 2016-08-16  9:23 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=150731

--- Comment #10 from Jimi <JimiJames.Bove@gmail.com> ---
Sorry, autocorrect typos. Thing #1 is supposed to say that it's going to be a
problem when I'm able to successfully unbind my card from amdgpu and will need
to be able to rescan PCI devices without it auto-binding to amdgpu, because
I'll want to bind it to vfio-pci.
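
One possible way to rescan without the card being auto-bound to amdgpu is
sketched below. It has not been tested on the hardware in this report, and it
assumes that the running kernel exposes the drivers_autoprobe, drivers_probe
and driver_override attributes under /sys/bus/pci and that the target driver
is already loaded. The function name and the SYSFS_ROOT variable are
hypothetical (SYSFS_ROOT only allows a dry run against a scratch directory).

```shell
#!/bin/bash
# Hypothetical helper; SYSFS_ROOT defaults to the real sysfs mount.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Rescan the PCI bus with automatic driver probing disabled, then bind the
# given device to the given driver (for example vfio-pci) explicitly.
rescan_and_bind() {
    local dev="$1" driver="$2"
    local bus="$SYSFS_ROOT/bus/pci"
    local prev
    prev=$(cat "$bus/drivers_autoprobe")   # remember the current setting
    echo 0 > "$bus/drivers_autoprobe"      # stop drivers from auto-claiming
    echo 1 > "$bus/rescan"                 # rediscover the removed device
    # Restrict the device to the chosen driver, then trigger a probe.
    echo "$driver" > "$bus/devices/$dev/driver_override"
    echo "$dev" > "$bus/drivers_probe"
    echo "$prev" > "$bus/drivers_autoprobe" # restore the previous setting
}
```

It would be called as root along the lines of rescan_and_bind "$dev" vfio-pci.
Whether this avoids the hang seen when unbinding from amdgpu is a separate
question that this sketch does not address.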


* [Bug 150731] amdgpu: segfault on unbind in sysfs; card becomes nonresponsive
From: bugzilla-daemon @ 2017-06-08 14:27 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=150731

Luke A. Guest (laguest@archeia.com) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |laguest@archeia.com

--- Comment #11 from Luke A. Guest (laguest@archeia.com) ---
I can confirm that the OS completely hangs when unbinding R9 380 (Tonga Pro)
with X running. Works fine with X off.

I have amdgpu and vfio-pci both in kernel, and used the following script to
unbind it:

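#!/bin/bash
# Rebind each PCI device given on the command line to vfio-pci.
for dev in "$@"; do
        vendor=$(cat "/sys/bus/pci/devices/$dev/vendor")
        device=$(cat "/sys/bus/pci/devices/$dev/device")
        # Unbind from the current driver, if one is bound.
        if [ -e "/sys/bus/pci/devices/$dev/driver" ]; then
                echo "$dev" > "/sys/bus/pci/devices/$dev/driver/unbind"
        fi
        # Register the vendor/device ID with vfio-pci so it claims the device.
        echo "$vendor $device" > /sys/bus/pci/drivers/vfio-pci/new_id
done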

lspci -nnk shows:

03:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Tonga PRO [Radeon R9 285/380] [1002:6939] (rev f1)
        Subsystem: PC Partner Limited / Sapphire Technology Radeon R9 380 Nitro 4G D5 [174b:e308]
        Kernel driver in use: vfio-pci
03:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Tonga HDMI Audio [Radeon R9 285/380] [1002:aad8]
        Subsystem: PC Partner Limited / Sapphire Technology Radeon R9 285/380 HDMI Audio [174b:aad8]
        Kernel driver in use: vfio-pci
        Kernel modules: snd_hda_intel
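
The same "Kernel driver in use" information can also be read straight from
sysfs, which may be handy inside a script. A minimal sketch, assuming the usual
layout where a bound device has a "driver" symlink in its sysfs directory; the
function name and the SYSFS_ROOT variable are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper; SYSFS_ROOT defaults to the real sysfs mount.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Print the name of the driver bound to a PCI device, or "none" if unbound.
pci_current_driver() {
    local link="$SYSFS_ROOT/bus/pci/devices/$1/driver"
    if [ -L "$link" ]; then
        # The symlink points at the driver directory; its last path
        # component is the driver name.
        basename "$(readlink "$link")"
    else
        echo "none"
    fi
}
```

For the output above, pci_current_driver 0000:03:00.0 would be expected to
print vfio-pci, assuming the card sits in PCI domain 0000.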


* [Bug 150731] amdgpu: segfault on unbind in sysfs; card becomes nonresponsive
From: bugzilla-daemon @ 2017-06-10 19:28 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=150731

--- Comment #12 from Jimi (JimiJames.Bove@gmail.com) ---
Turns out this bug has been getting ignored because of an extremely obscure
fact about terrible bug report website organization, that I'm sure has screwed
many other people in the past:
https://bugzilla.kernel.org/show_bug.cgi?id=195321#c5

Thankfully, someone else posted it in the right place recently:
https://bugs.freedesktop.org/show_bug.cgi?id=100399

Let's add our voices to that.
