* [Bug 198669] New: Driver crash at radeon_ring_backup+0xd3/0x140 [radeon]
@ 2018-02-04 17:39 bugzilla-daemon
  2018-02-04 17:41 ` [Bug 198669] " bugzilla-daemon
                   ` (17 more replies)
  0 siblings, 18 replies; 19+ messages in thread
From: bugzilla-daemon @ 2018-02-04 17:39 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=198669

            Bug ID: 198669
           Summary: Driver crash at radeon_ring_backup+0xd3/0x140 [radeon]
           Product: Drivers
           Version: 2.5
    Kernel Version: 4.13.0-32-generic x86_64
          Hardware: All
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: high
          Priority: P1
         Component: Video(DRI - non Intel)
          Assignee: drivers_video-dri@kernel-bugs.osdl.org
          Reporter: roger@beardandsandals.co.uk
        Regression: No

This is a resilience bug in the driver. When trying to recover from a GPU stall,
radeon_ring_backup causes a paging violation.

[  488.507091] BUG: unable to handle kernel paging request at ffffb406c1891ffc
[  488.507176] IP: radeon_ring_backup+0xd3/0x140 [radeon]

The GPU stall is caused by a hardware problem triggered by vibration. N.B. This
bug is not about the hardware problem. It is about the driver's resilience when
trying to recover from it.

It is very similar to bug #62721 reported 4 years ago. However, this is
occurring with the driver in the 4.13.0 kernel.

Here is the dmesg output.

[  139.457873] rfkill: input handler disabled
[  468.102340] radeon 0000:02:00.0: ring 0 stalled for more than 10256msec
[  468.102346] radeon 0000:02:00.0: GPU lockup (current fence id
0x0000000000001bdb last fence id 0x0000000000001bdc on ring 0)

... Similar lines removed

[  487.558156] radeon 0000:02:00.0: ring 0 stalled for more than 29712msec
[  487.558161] radeon 0000:02:00.0: GPU lockup (current fence id
0x0000000000001bdb last fence id 0x0000000000001bdc on ring 0)
[  488.070157] radeon 0000:02:00.0: ring 0 stalled for more than 30224msec
[  488.070162] radeon 0000:02:00.0: GPU lockup (current fence id
0x0000000000001bdb last fence id 0x0000000000001bdc on ring 0)
[  488.507091] BUG: unable to handle kernel paging request at ffffb406c1891ffc
[  488.507176] IP: radeon_ring_backup+0xd3/0x140 [radeon]
[  488.507195] PGD 236d37067 
[  488.507196] P4D 236d37067 
[  488.507207] PUD 0 

[  488.507234] Oops: 0000 [#1] SMP PTI
[  488.507248] Modules linked in: rfcomm bnep bonding binfmt_misc btusb btrtl
btbcm btintel intel_powerclamp joydev coretemp kvm_intel kvm input_leds
bluetooth ecdh_generic arc4 ath9k ath9k_common ath9k_hw ath mac80211 irqbypass
snd_seq_midi snd_seq_midi_event intel_cstate snd_hda_codec_realtek
snd_hda_codec_generic snd_hda_codec_hdmi snd_rawmidi cfg80211 snd_hda_intel
snd_hda_codec snd_hda_core snd_hwdep serio_raw snd_pcm snd_seq snd_seq_device
snd_timer snd lpc_ich shpchp i7core_edac mac_hid i5500_temp soundcore
tpm_infineon asus_atk0110 nfsd auth_rpcgss nfs_acl lockd grace sunrpc
parport_pc ppdev lp parport ip_tables x_tables autofs4 amdkfd amd_iommu_v2
radeon i2c_algo_bit ttm drm_kms_helper hid_generic syscopyarea uas sysfillrect
usbhid sysimgblt firewire_ohci fb_sys_fops usb_storage pata_acpi hid
[  488.507500]  psmouse firewire_core r8169 drm crc_itu_t mii
[  488.507523] CPU: 7 PID: 2073 Comm: gnome-shell Tainted: G          I    
4.13.0-32-generic #35-Ubuntu
[  488.507554] Hardware name: System manufacturer System Product Name/P6T SE,
BIOS 0403    05/19/2009
[  488.507584] task: ffff9e0cb6191600 task.stack: ffffb402c3724000
[  488.507619] RIP: 0010:radeon_ring_backup+0xd3/0x140 [radeon]
[  488.507639] RSP: 0018:ffffb402c3727c00 EFLAGS: 00010246
[  488.507658] RAX: ffff9e0c6f300000 RBX: 0000000000037ba1 RCX:
0000000000000000
[  488.507682] RDX: 0000000000000000 RSI: ffffb406c1891ffc RDI:
00000000000dee84
[  488.507707] RBP: ffffb402c3727c28 R08: 00000000000269a8 R09:
00000000000b2c44
[  488.507731] R10: ffffdc5046bd0000 R11: ffff9e0cfffd1d00 R12:
ffffb402c3727c68
[  488.507756] R13: ffff9e0ceafc9538 R14: ffff9e0ceafc9558 R15:
00000000ffffffff
[  488.507780] FS:  00007fdc82a60ac0(0000) GS:ffff9e0cf73c0000(0000)
knlGS:0000000000000000
[  488.507808] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  488.507828] CR2: ffffb406c1891ffc CR3: 00000001f632a000 CR4:
00000000000006e0
[  488.507853] Call Trace:
[  488.507875]  radeon_gpu_reset+0xc0/0x330 [radeon]
[  488.507895]  ? dma_fence_wait_timeout+0x38/0xf0
[  488.507912]  ? reservation_object_wait_timeout_rcu+0x14f/0x2d0
[  488.507946]  radeon_gem_handle_lockup.part.4+0xe/0x20 [radeon]
[  488.507979]  radeon_gem_wait_idle_ioctl+0x9c/0x100 [radeon]
[  488.508012]  ? radeon_gem_busy_ioctl+0x80/0x80 [radeon]
[  488.508040]  drm_ioctl_kernel+0x5d/0xb0 [drm]
[  488.508063]  drm_ioctl+0x31b/0x3d0 [drm]
[  488.508091]  ? radeon_gem_busy_ioctl+0x80/0x80 [radeon]
[  488.508111]  ? futex_wake+0x8f/0x180
[  488.508134]  radeon_drm_ioctl+0x4f/0x90 [radeon]
[  488.508153]  do_vfs_ioctl+0xa5/0x610
[  488.509723]  ? entry_SYSCALL_64_after_hwframe+0x118/0x168
[  488.511292]  ? entry_SYSCALL_64_after_hwframe+0x111/0x168
[  488.512853]  ? entry_SYSCALL_64_after_hwframe+0x10a/0x168
[  488.514405]  ? entry_SYSCALL_64_after_hwframe+0x103/0x168
[  488.515956]  ? entry_SYSCALL_64_after_hwframe+0xfc/0x168
[  488.517499]  ? entry_SYSCALL_64_after_hwframe+0xf5/0x168
[  488.519040]  ? entry_SYSCALL_64_after_hwframe+0xee/0x168
[  488.520572]  ? entry_SYSCALL_64_after_hwframe+0xe7/0x168
[  488.522095]  ? entry_SYSCALL_64_after_hwframe+0xe0/0x168
[  488.523598]  SyS_ioctl+0x79/0x90
[  488.525088]  ? entry_SYSCALL_64_after_hwframe+0xa1/0x168
[  488.526579]  entry_SYSCALL_64_fastpath+0x33/0xa3
[  488.528065] RIP: 0033:0x7fdc7fb65ef7
[  488.529553] RSP: 002b:00007ffd23064578 EFLAGS: 00000246 ORIG_RAX:
0000000000000010
[  488.531081] RAX: ffffffffffffffda RBX: 00007ffd230645c0 RCX:
00007fdc7fb65ef7
[  488.532581] RDX: 00007ffd230645c0 RSI: 0000000040086464 RDI:
000000000000000c
[  488.534078] RBP: 00007ffd230645c0 R08: 0000000000000000 R09:
0000000800000000
[  488.535556] R10: 00007ffd230645d0 R11: 0000000000000246 R12:
0000000040086464
[  488.537031] R13: 000000000000000c R14: 00007ffd230646e8 R15:
000055d8a7b62200
[  488.538511] Code: 48 85 c0 49 89 04 24 74 62 8d 53 ff 48 8d 3c 95 04 00 00
00 31 d2 eb 04 49 8b 04 24 49 8b 76 08 41 8d 4f 01 45 89 ff 4a 8d 34 be <8b> 36
89 34 10 41 23 4e 54 48 83 c2 04 48 39 d7 41 89 cf 75 d8 
[  488.541900] RIP: radeon_ring_backup+0xd3/0x140 [radeon] RSP:
ffffb402c3727c00
[  488.543656] CR2: ffffb406c1891ffc
[  488.552481] ---[ end trace e6e07e03d7738a24 ]---

Various versions of this crash seem to have been reported over the last few
years, but none has been successfully closed.

For further diagnostics see
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1746232

Roger

-- 
You are receiving this mail because:
You are watching the assignee of the bug.
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

* [Bug 198669] Driver crash at radeon_ring_backup+0xd3/0x140 [radeon]
  2018-02-04 17:39 [Bug 198669] New: Driver crash at radeon_ring_backup+0xd3/0x140 [radeon] bugzilla-daemon
@ 2018-02-04 17:41 ` bugzilla-daemon
  2018-02-04 18:26 ` bugzilla-daemon
                   ` (16 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: bugzilla-daemon @ 2018-02-04 17:41 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=198669

roger@beardandsandals.co.uk (roger@beardandsandals.co.uk) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                URL|                            |https://bugs.launchpad.net/
                   |                            |ubuntu/+source/linux/+bug/1
                   |                            |746232

* [Bug 198669] Driver crash at radeon_ring_backup+0xd3/0x140 [radeon]
  2018-02-04 17:39 [Bug 198669] New: Driver crash at radeon_ring_backup+0xd3/0x140 [radeon] bugzilla-daemon
  2018-02-04 17:41 ` [Bug 198669] " bugzilla-daemon
@ 2018-02-04 18:26 ` bugzilla-daemon
  2018-02-04 20:55 ` bugzilla-daemon
                   ` (15 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: bugzilla-daemon @ 2018-02-04 18:26 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=198669

Christian König (christian.koenig@amd.com) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |christian.koenig@amd.com

--- Comment #1 from Christian König (christian.koenig@amd.com) ---
What does radeon_ring_backup+0xd3 resolve to on your system?

* [Bug 198669] Driver crash at radeon_ring_backup+0xd3/0x140 [radeon]
  2018-02-04 17:39 [Bug 198669] New: Driver crash at radeon_ring_backup+0xd3/0x140 [radeon] bugzilla-daemon
  2018-02-04 17:41 ` [Bug 198669] " bugzilla-daemon
  2018-02-04 18:26 ` bugzilla-daemon
@ 2018-02-04 20:55 ` bugzilla-daemon
  2018-02-05 12:16 ` bugzilla-daemon
                   ` (14 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: bugzilla-daemon @ 2018-02-04 20:55 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=198669

--- Comment #2 from roger@beardandsandals.co.uk (roger@beardandsandals.co.uk) ---
Looking at the debug files.

radeon_ring_backup resolves to 0x33430 so +0xd3 is 0x33503.

The line info gives this

radeon_ring.c                                323             0x334f4
radeon_ring.c                                324             0x33508
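
The lookup above can be reproduced with a short sketch. This is Python for
illustration only: the helper name line_for_address is hypothetical, and the
two-entry table holds just the line-info rows quoted in this comment. The
offset 0xd3 is the one reported in the oops.

```python
# Values quoted in the comment and in the oops.
SYMBOL_BASE = 0x33430   # start of radeon_ring_backup in the debug file
FAULT_OFFSET = 0xD3     # from "radeon_ring_backup+0xd3/0x140"

# (start address, source line) pairs from the line info, sorted by address.
LINE_TABLE = [(0x334F4, 323), (0x33508, 324)]


def line_for_address(addr, table):
    """Return the source line of the last entry whose start is <= addr."""
    line = None
    for start, src_line in table:
        if start <= addr:
            line = src_line
        else:
            break
    return line


fault_addr = SYMBOL_BASE + FAULT_OFFSET
print(hex(fault_addr), line_for_address(fault_addr, LINE_TABLE))
```

Under these numbers the faulting instruction falls in the range that starts at
0x334f4, i.e. radeon_ring.c line 323.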

* [Bug 198669] Driver crash at radeon_ring_backup+0xd3/0x140 [radeon]
  2018-02-04 17:39 [Bug 198669] New: Driver crash at radeon_ring_backup+0xd3/0x140 [radeon] bugzilla-daemon
                   ` (2 preceding siblings ...)
  2018-02-04 20:55 ` bugzilla-daemon
@ 2018-02-05 12:16 ` bugzilla-daemon
  2018-02-05 22:03 ` bugzilla-daemon
                   ` (13 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: bugzilla-daemon @ 2018-02-05 12:16 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=198669

--- Comment #3 from Christian König (christian.koenig@amd.com) ---
Created attachment 274001
  --> https://bugzilla.kernel.org/attachment.cgi?id=274001&action=edit
Possible fix

The attached patch is a shot in the dark, but please give it a try.

* [Bug 198669] Driver crash at radeon_ring_backup+0xd3/0x140 [radeon]
  2018-02-04 17:39 [Bug 198669] New: Driver crash at radeon_ring_backup+0xd3/0x140 [radeon] bugzilla-daemon
                   ` (3 preceding siblings ...)
  2018-02-05 12:16 ` bugzilla-daemon
@ 2018-02-05 22:03 ` bugzilla-daemon
  2018-02-06 14:05 ` bugzilla-daemon
                   ` (12 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: bugzilla-daemon @ 2018-02-05 22:03 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=198669

--- Comment #4 from roger@beardandsandals.co.uk (roger@beardandsandals.co.uk) ---
Well, it moved the problem. It crashed somewhere else in the driver with some
message about scratch. Sorry, I cannot tell you what it was because I screwed up
the save of the kernel message buffer, and now I cannot get the thing to
glitch again. My normal method of stamping on the floor next to the system is
not working. I think I might have overdone it and now the thing has bedded in.
Going to leave it powered off overnight and try again in the morning.

* [Bug 198669] Driver crash at radeon_ring_backup+0xd3/0x140 [radeon]
  2018-02-04 17:39 [Bug 198669] New: Driver crash at radeon_ring_backup+0xd3/0x140 [radeon] bugzilla-daemon
                   ` (4 preceding siblings ...)
  2018-02-05 22:03 ` bugzilla-daemon
@ 2018-02-06 14:05 ` bugzilla-daemon
  2018-02-06 14:12 ` bugzilla-daemon
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: bugzilla-daemon @ 2018-02-06 14:05 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=198669

--- Comment #5 from roger@beardandsandals.co.uk (roger@beardandsandals.co.uk) ---
My best guess is the error came from 

r600.c:2848:            DRM_ERROR("radeon: ring %d test failed (scratch(0x%04X)=0x%08X)\n",


I cannot reproduce the mechanical hardware failure. I don't want to clobber the
system any harder and risk damaging a disk.

I assume this is being called from the GPU reset path.

* [Bug 198669] Driver crash at radeon_ring_backup+0xd3/0x140 [radeon]
  2018-02-04 17:39 [Bug 198669] New: Driver crash at radeon_ring_backup+0xd3/0x140 [radeon] bugzilla-daemon
                   ` (5 preceding siblings ...)
  2018-02-06 14:05 ` bugzilla-daemon
@ 2018-02-06 14:12 ` bugzilla-daemon
  2018-02-06 15:19 ` bugzilla-daemon
                   ` (10 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: bugzilla-daemon @ 2018-02-06 14:12 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=198669

--- Comment #6 from Christian König (christian.koenig@amd.com) ---
Well the issue is triggered by the driver reading nonsense values from the
hardware.

E.g. we ask the hardware what the last good position on a 16k ring buffer is
and get 0xffffffff as result (or something like this) which obviously can't be
correct.

My patch mitigated that by clamping the value to a valid range, but if you read
nonsense values from the hardware because the hardware has a loose connection
and acts strangely under vibration, then I basically can't guarantee anything.
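
A minimal sketch of the idea of forcing a read-back value into range, written
in Python rather than driver C. It is not the attached patch. It assumes the
"16k ring buffer" is 16 KiB of 32-bit entries, and it masks the value, which is
one way to bring it into range; whether the patch masks or clamps is not stated
in this thread. The names force_into_range and entries_to_back_up are
hypothetical.

```python
RING_BYTES = 16 * 1024            # assumed: "16k ring buffer" read as 16 KiB
ENTRY_BYTES = 4                   # assumed: 32-bit ring entries
RING_ENTRIES = RING_BYTES // ENTRY_BYTES
PTR_MASK = RING_ENTRIES - 1       # works because RING_ENTRIES is a power of two


def force_into_range(raw_ptr):
    """Map any value read back from the hardware into [0, RING_ENTRIES)."""
    return raw_ptr & PTR_MASK


def entries_to_back_up(write_ptr, raw_read_ptr):
    """Count entries from the read pointer up to the write pointer, with wrap."""
    read_ptr = force_into_range(raw_read_ptr)
    return (write_ptr + RING_ENTRIES - read_ptr) & PTR_MASK


bogus = 0xFFFFFFFF                # the kind of nonsense value described above
print(force_into_range(bogus))
print(entries_to_back_up(100, bogus))
```

The index is then safe to use, but the position it names is still arbitrary,
so this only avoids the out-of-range access and does not recover real state.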

* [Bug 198669] Driver crash at radeon_ring_backup+0xd3/0x140 [radeon]
  2018-02-04 17:39 [Bug 198669] New: Driver crash at radeon_ring_backup+0xd3/0x140 [radeon] bugzilla-daemon
                   ` (6 preceding siblings ...)
  2018-02-06 14:12 ` bugzilla-daemon
@ 2018-02-06 15:19 ` bugzilla-daemon
  2018-02-06 15:53 ` bugzilla-daemon
                   ` (9 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: bugzilla-daemon @ 2018-02-06 15:19 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=198669

--- Comment #7 from roger@beardandsandals.co.uk (roger@beardandsandals.co.uk) ---
The original point I made in the bug report was that this bug is not about the
mechanical hardware glitch. It is about the driver being in what is obviously a
failure mode and attempting a recovery that fails and leaves the system in an
unusable state. The error recovery paths of any driver should be its most
resilient components, especially when the driver is controlling a part of the
primary user interface.

To pose another question: why, when the driver has the information to tell it
that the GPU is irrevocably stalled, does it attempt a soft restart and leave
the system in an unusable state?

* [Bug 198669] Driver crash at radeon_ring_backup+0xd3/0x140 [radeon]
  2018-02-04 17:39 [Bug 198669] New: Driver crash at radeon_ring_backup+0xd3/0x140 [radeon] bugzilla-daemon
                   ` (7 preceding siblings ...)
  2018-02-06 15:19 ` bugzilla-daemon
@ 2018-02-06 15:53 ` bugzilla-daemon
  2018-02-06 21:39 ` bugzilla-daemon
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: bugzilla-daemon @ 2018-02-06 15:53 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=198669

--- Comment #8 from Christian König (christian.koenig@amd.com) ---
(In reply to roger@beardandsandals.co.uk from comment #7)
> The original point I made in the bug report was that this bug is not about
> the mechanical hardware glitch. It is about the driver being in what is
> obviously a failure mode and attempting a recovery that fails and leaves the
> system in an unusable state.

You are missing the point. The driver fails to recover because the hardware is
buggy and not because there is any problem with the recovery routine.

In other words we read back an impossible value from the hardware and that is
why the system is failing.

I mean, I can handle this impossible value at this code location, but as you
actually figured out by yourself, it then fails at the next best location.

There are simply hundreds or even thousands of locations where the assumption
is that the hardware works correctly and we don't handle the case of getting
nonsense values.
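
As a worked illustration of how an impossible value turns into the reported
fault, the sketch below redoes the address arithmetic from the oops in Python.
It is an interpretation of the register dump, not something stated by anyone in
this thread: it assumes R15 held the ring index and that ring entries are 4
bytes wide.

```python
CR2 = 0xFFFFB406C1891FFC   # faulting address from the oops
R15 = 0x00000000FFFFFFFF   # register value from the oops
ENTRY_BYTES = 4            # assumed: 32-bit ring entries
PAGE_SIZE = 4096

# If R15 was the index, the array would start this far below the fault.
implied_base = CR2 - R15 * ENTRY_BYTES
distance_gib = (CR2 - implied_base) / 2**30

print(hex(implied_base))
print(implied_base % PAGE_SIZE == 0)
print(round(distance_gib, 2))
```

Under that assumption the implied base is page aligned and the access lands
roughly 16 GiB past it, far outside any ring buffer.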

* [Bug 198669] Driver crash at radeon_ring_backup+0xd3/0x140 [radeon]
  2018-02-04 17:39 [Bug 198669] New: Driver crash at radeon_ring_backup+0xd3/0x140 [radeon] bugzilla-daemon
                   ` (8 preceding siblings ...)
  2018-02-06 15:53 ` bugzilla-daemon
@ 2018-02-06 21:39 ` bugzilla-daemon
  2018-02-07  8:22 ` bugzilla-daemon
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: bugzilla-daemon @ 2018-02-06 21:39 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=198669

--- Comment #9 from roger@beardandsandals.co.uk (roger@beardandsandals.co.uk) ---
I think we have to agree to differ on this one. You seem to be focussing on the
software interface between the GPU and the driver.

What follows is my personal opinion.

The most likely cause of this kind of mechanical issue is the signal path
between the video interface hardware and the outside world, either a dry joint
or a mechanical fault in the cable or cable connectors. I can only reiterate
what I said in my previous post. The driver has sufficient information to
determine that a hard failure has occurred, and that failure is probably not in
the GPU itself. I would like to see the driver doing a hard reset of the card
with rigorous error checking. If it cannot reset the GPU in graphical mode it
should try to set the display hardware into a basic console mode.

* [Bug 198669] Driver crash at radeon_ring_backup+0xd3/0x140 [radeon]
  2018-02-04 17:39 [Bug 198669] New: Driver crash at radeon_ring_backup+0xd3/0x140 [radeon] bugzilla-daemon
                   ` (9 preceding siblings ...)
  2018-02-06 21:39 ` bugzilla-daemon
@ 2018-02-07  8:22 ` bugzilla-daemon
  2018-02-07  9:12 ` bugzilla-daemon
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: bugzilla-daemon @ 2018-02-07  8:22 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=198669

--- Comment #10 from Christian König (christian.koenig@amd.com) ---
(In reply to roger@beardandsandals.co.uk from comment #9)
> The most likely cause of this kind of mechanical issue is the signal path
> between the video interface hardware and the outside world, either a dry
> joint or a mechanical fault in the cable or cable connectors.

That is something I absolutely agree with.

> The driver has sufficient
> information to determine that a hard failure has occurred, and that failure
> is probably not in the GPU itself. I would like to see the driver doing a
> hard reset of the card with rigorous error checking. If it cannot reset the
> GPU in graphical mode it should try to set the display hardware into a basic
> console mode.

And that is the part you don't seem to understand. The driver is trying exactly
what you are describing.

We detect a problem because of a timeout, i.e. the hardware doesn't respond
within a given time frame to commands we send to it.

What we do then is query the hardware for how far execution had proceeded, and
the hardware answered with a nonsense value. In other words, bits are set in
the response which should never be set.

This is a clear indicator that the PCIe transaction for the register read
aborted because the device doesn't respond any more.

The most likely cause of that is that the bus interface in the ASIC locked up
because of an electrical problem (I think the ESD protection kicked in) and the
only way to get out of that is a hard reset of the system.

What we can try to do is prevent further failures like the crash you
described by checking the values read from the hardware. This way you can at
least access the box over the network or blindly shut it down with keyboard
shortcuts.
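
A minimal sketch of what "checking the values read from the hardware" could
look like, in Python for illustration. The function name
looks_like_aborted_read and the example mask are hypothetical. It relies on two
things: a register read from a device that no longer responds commonly comes
back as all ones, and a valid value should have no bits set outside a known
mask.

```python
ALL_ONES = 0xFFFFFFFF


def looks_like_aborted_read(value, valid_mask):
    """True if a 32-bit read-back value cannot be a legitimate answer."""
    if value == ALL_ONES:
        return True                      # typical result of a failed read
    stray_bits = value & ~valid_mask & ALL_ONES
    return stray_bits != 0               # bits set that should never be set


RPTR_MASK = 0xFFF                        # hypothetical: 4096-entry ring index

for raw in (0x010, 0x1000, 0xFFFFFFFF):
    print(hex(raw), looks_like_aborted_read(raw, RPTR_MASK))
```

A caller that sees True here would skip the ring backup and report the device
as lost, rather than indexing memory with the value.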

* [Bug 198669] Driver crash at radeon_ring_backup+0xd3/0x140 [radeon]
  2018-02-04 17:39 [Bug 198669] New: Driver crash at radeon_ring_backup+0xd3/0x140 [radeon] bugzilla-daemon
                   ` (10 preceding siblings ...)
  2018-02-07  8:22 ` bugzilla-daemon
@ 2018-02-07  9:12 ` bugzilla-daemon
  2018-02-07  9:16 ` bugzilla-daemon
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: bugzilla-daemon @ 2018-02-07  9:12 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=198669

--- Comment #11 from roger@beardandsandals.co.uk (roger@beardandsandals.co.uk) ---
Yes, I take your point. I was speculating on insufficient information. My
apologies.

The solution you propose is essentially what I have already been doing. The
logging in over a network already works with the unpatched driver. I have not
had any luck with keyboard shortcuts. It looks like xwayland/xserver does not
know that a problem has occurred and has still got hold of the keyboard and
mouse. This is an obscure problem and probably not worth spending much time on.
Especially as I no longer seem to be able to reproduce it!


Thank you for your patience.

* [Bug 198669] Driver crash at radeon_ring_backup+0xd3/0x140 [radeon]
  2018-02-04 17:39 [Bug 198669] New: Driver crash at radeon_ring_backup+0xd3/0x140 [radeon] bugzilla-daemon
                   ` (11 preceding siblings ...)
  2018-02-07  9:12 ` bugzilla-daemon
@ 2018-02-07  9:16 ` bugzilla-daemon
  2018-02-07 12:45 ` bugzilla-daemon
                   ` (4 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: bugzilla-daemon @ 2018-02-07  9:16 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=198669

--- Comment #12 from roger@beardandsandals.co.uk (roger@beardandsandals.co.uk) ---
Yes, I take your point. I was speculating on insufficient information. My 
apologies. The solution you propose sounds great.

Thank you for your patience.

* [Bug 198669] Driver crash at radeon_ring_backup+0xd3/0x140 [radeon]
  2018-02-04 17:39 [Bug 198669] New: Driver crash at radeon_ring_backup+0xd3/0x140 [radeon] bugzilla-daemon
                   ` (12 preceding siblings ...)
  2018-02-07  9:16 ` bugzilla-daemon
@ 2018-02-07 12:45 ` bugzilla-daemon
  2018-12-03  3:55 ` bugzilla-daemon
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: bugzilla-daemon @ 2018-02-07 12:45 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=198669

--- Comment #13 from roger@beardandsandals.co.uk (roger@beardandsandals.co.uk) ---
You can ignore comment 11. I thought the email reply had not worked, so I
posted a revised version directly. Comment 10 is the correct one.

* [Bug 198669] Driver crash at radeon_ring_backup+0xd3/0x140 [radeon]
  2018-02-04 17:39 [Bug 198669] New: Driver crash at radeon_ring_backup+0xd3/0x140 [radeon] bugzilla-daemon
                   ` (13 preceding siblings ...)
  2018-02-07 12:45 ` bugzilla-daemon
@ 2018-12-03  3:55 ` bugzilla-daemon
  2018-12-03  8:14 ` bugzilla-daemon
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: bugzilla-daemon @ 2018-12-03  3:55 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=198669

Dave Airlie (airlied@linux.ie) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |airlied@linux.ie

--- Comment #14 from Dave Airlie (airlied@linux.ie) ---
Should we at least push this patch to improve resilience a little?

* [Bug 198669] Driver crash at radeon_ring_backup+0xd3/0x140 [radeon]
  2018-02-04 17:39 [Bug 198669] New: Driver crash at radeon_ring_backup+0xd3/0x140 [radeon] bugzilla-daemon
                   ` (14 preceding siblings ...)
  2018-12-03  3:55 ` bugzilla-daemon
@ 2018-12-03  8:14 ` bugzilla-daemon
  2018-12-03  8:18 ` bugzilla-daemon
  2018-12-03 11:12 ` bugzilla-daemon
  17 siblings, 0 replies; 19+ messages in thread
From: bugzilla-daemon @ 2018-12-03  8:14 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=198669

--- Comment #15 from roger@beardandsandals.co.uk (roger@beardandsandals.co.uk) ---
For information: I eventually tracked the hardware fault to bad solder flow in
the area of the DVI-D socket. I still stick by my original comments about
usability. To me, a recovery process that leaves 99.9% of end users clueless
about how to safely restart their system is not a good outcome from an end user
perspective. This is my last word on this topic.

* [Bug 198669] Driver crash at radeon_ring_backup+0xd3/0x140 [radeon]
  2018-02-04 17:39 [Bug 198669] New: Driver crash at radeon_ring_backup+0xd3/0x140 [radeon] bugzilla-daemon
                   ` (15 preceding siblings ...)
  2018-12-03  8:14 ` bugzilla-daemon
@ 2018-12-03  8:18 ` bugzilla-daemon
  2018-12-03 11:12 ` bugzilla-daemon
  17 siblings, 0 replies; 19+ messages in thread
From: bugzilla-daemon @ 2018-12-03  8:18 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=198669

roger@beardandsandals.co.uk (roger@beardandsandals.co.uk) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |OBSOLETE

* [Bug 198669] Driver crash at radeon_ring_backup+0xd3/0x140 [radeon]
  2018-02-04 17:39 [Bug 198669] New: Driver crash at radeon_ring_backup+0xd3/0x140 [radeon] bugzilla-daemon
                   ` (16 preceding siblings ...)
  2018-12-03  8:18 ` bugzilla-daemon
@ 2018-12-03 11:12 ` bugzilla-daemon
  17 siblings, 0 replies; 19+ messages in thread
From: bugzilla-daemon @ 2018-12-03 11:12 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=198669

--- Comment #16 from Christian König (christian.koenig@amd.com) ---
(In reply to Dave Airlie from comment #14)
> Should we at least push this patch to improve resilience a little?

We could, but I don't see much value in that. We would need to code the
software in a way which also works if the hardware is damaged.

That is possible, but I grepped a bit over the source and in this particular
case we would need to manually audit 2201 register accesses so that they also
work when the hardware suddenly goes up in flames.

That is totally unrealistic, and just fixing this one case doesn't give us
much.

