linux-kernel.vger.kernel.org archive mirror
* [3.3-rc1]radeon 0000:07:00.0: GPU lockup CP stall for more than 10000msec
@ 2012-01-21 19:03 Torsten Kaiser
  2012-01-23 16:57 ` Jerome Glisse
  0 siblings, 1 reply; 5+ messages in thread
From: Torsten Kaiser @ 2012-01-21 19:03 UTC (permalink / raw)
  To: linux-kernel; +Cc: Dave Airlie, Alex Deucher, dri-devel

After updating to kernel 3.3-rc1 I have experienced a lockup of my GPU.
I left my KDE desktop running until the screensaver turned off the
monitors. But on key presses it would not turn back on. Ctrl+Alt+F1 to
switch to another virtual console also did not work.
Alt+SysRq magic still worked, so I was able to force the syslog to
disk and restart the system.

From the log:
Jan 21 19:30:01 thoregon cron[3960]: (root) CMD (test -x
/usr/sbin/run-crons && /usr/sbin/run-crons)
Jan 21 19:39:41 thoregon kernel: [ 6364.620131] radeon 0000:07:00.0:
GPU lockup CP stall for more than 10000msec
Jan 21 19:39:41 thoregon kernel: [ 6364.620139] GPU lockup (waiting
for 0x0003F1F2 last fence id 0x0003F1F1)
Jan 21 19:39:41 thoregon kernel: [ 6364.636341] radeon 0000:07:00.0:
GPU softreset
Jan 21 19:39:41 thoregon kernel: [ 6364.636348] radeon 0000:07:00.0:
R_008010_GRBM_STATUS=0xA0003028
Jan 21 19:39:41 thoregon kernel: [ 6364.636354] radeon 0000:07:00.0:
R_008014_GRBM_STATUS2=0x00000002
Jan 21 19:39:41 thoregon kernel: [ 6364.636359] radeon 0000:07:00.0:
R_000E50_SRBM_STATUS=0x200000C0
Jan 21 19:39:41 thoregon kernel: [ 6364.636370] radeon 0000:07:00.0:
R_008020_GRBM_SOFT_RESET=0x00007FEE
Jan 21 19:39:41 thoregon kernel: [ 6364.651219] radeon 0000:07:00.0:
R_008020_GRBM_SOFT_RESET=0x00000001
Jan 21 19:39:41 thoregon kernel: [ 6364.667212] radeon 0000:07:00.0:
R_008010_GRBM_STATUS=0x00003028
Jan 21 19:39:41 thoregon kernel: [ 6364.667217] radeon 0000:07:00.0:
R_008014_GRBM_STATUS2=0x00000002
Jan 21 19:39:41 thoregon kernel: [ 6364.667223] radeon 0000:07:00.0:
R_000E50_SRBM_STATUS=0x200000C0
Jan 21 19:39:41 thoregon kernel: [ 6364.668226] radeon 0000:07:00.0:
GPU reset succeed
Jan 21 19:39:41 thoregon kernel: [ 6364.673142] [drm] PCIE GART of
512M enabled (table at 0x0000000000040000).
Jan 21 19:39:41 thoregon kernel: [ 6364.673177] radeon 0000:07:00.0: WB enabled
Jan 21 19:39:41 thoregon kernel: [ 6364.673184] [drm] fence driver on
ring 0 use gpu addr 0x20000c00 and cpu addr 0xffff880328636c00
Jan 21 19:39:41 thoregon kernel: [ 6364.719445] [drm] ring test on 0
succeeded in 1 usecs
Jan 21 19:40:01 thoregon cron[3975]: (root) CMD (test -x
/usr/sbin/run-crons && /usr/sbin/run-crons)
Jan 21 19:43:37 thoregon kernel: [ 6600.390150] INFO: task X:3098
blocked for more than 120 seconds.
Jan 21 19:43:37 thoregon kernel: [ 6600.390157] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 21 19:43:37 thoregon kernel: [ 6600.390163] X               D
ffff880337d50a00     0  3098   3077 0x00400000
Jan 21 19:43:37 thoregon kernel: [ 6600.390174]  ffff88031df15080
0000000000000086 ffff8802f5087300 0000000000010a00
Jan 21 19:43:37 thoregon kernel: [ 6600.390185]  ffff88031bf79fd8
0000000000010a00 ffff88031bf78000 ffff88031bf79fd8
Jan 21 19:43:37 thoregon kernel: [ 6600.390194]  0000000000010a00
ffff88031df15080 0000000000010a00 0000000000010a00
Jan 21 19:43:37 thoregon kernel: [ 6600.390203] Call Trace:
Jan 21 19:43:37 thoregon kernel: [ 6600.390219]  [<ffffffff815eee58>]
? __mutex_lock_slowpath+0xc8/0x140
Jan 21 19:43:37 thoregon kernel: [ 6600.390230]  [<ffffffff815eeb4a>]
? mutex_lock+0x1a/0x40
Jan 21 19:43:37 thoregon kernel: [ 6600.390239]  [<ffffffff81352be2>]
? radeon_ib_get+0x52/0x230
Jan 21 19:43:37 thoregon kernel: [ 6600.390249]  [<ffffffff8136e86a>]
? r600_ib_test+0x5a/0x300
Jan 21 19:43:37 thoregon kernel: [ 6600.390258]  [<ffffffff8137246e>]
? rv770_startup+0xf7e/0x1590
Jan 21 19:43:37 thoregon kernel: [ 6600.390267]  [<ffffffff81372d5c>]
? rv770_resume+0x2c/0x90
Jan 21 19:43:37 thoregon kernel: [ 6600.390275]  [<ffffffff8132bd8e>]
? radeon_gpu_reset+0x11e/0x160
Jan 21 19:43:37 thoregon kernel: [ 6600.390284]  [<ffffffff8133ef43>]
? radeon_fence_wait+0x363/0x3b0
Jan 21 19:43:37 thoregon kernel: [ 6600.390293]  [<ffffffff8104f340>]
? wake_up_bit+0x40/0x40
Jan 21 19:43:37 thoregon kernel: [ 6600.390301]  [<ffffffff81352d77>]
? radeon_ib_get+0x1e7/0x230
Jan 21 19:43:37 thoregon kernel: [ 6600.390310]  [<ffffffff81354b4a>]
? radeon_cs_ioctl+0x27a/0x4d0
Jan 21 19:43:37 thoregon kernel: [ 6600.390319]  [<ffffffff812f42d4>]
? drm_ioctl+0x3e4/0x490
Jan 21 19:43:37 thoregon kernel: [ 6600.390327]  [<ffffffff813548d0>]
? radeon_cs_finish_pages+0xa0/0xa0
Jan 21 19:43:37 thoregon kernel: [ 6600.390336]  [<ffffffff81024769>]
? do_page_fault+0x199/0x420
Jan 21 19:43:37 thoregon kernel: [ 6600.390344]  [<ffffffff810af30c>]
? mmap_region+0x1dc/0x570
Jan 21 19:43:37 thoregon kernel: [ 6600.390352]  [<ffffffff810de446>]
? do_vfs_ioctl+0x96/0x4e0
Jan 21 19:43:37 thoregon kernel: [ 6600.390359]  [<ffffffff815efd0c>]
? __schedule+0x28c/0x630
Jan 21 19:43:37 thoregon kernel: [ 6600.390366]  [<ffffffff810de8d9>]
? sys_ioctl+0x49/0x90
Jan 21 19:43:37 thoregon kernel: [ 6600.390375]  [<ffffffff815f16e2>]
? system_call_fastpath+0x16/0x1b
Jan 21 19:45:08 thoregon kernel: [ 6691.864440] SysRq : Emergency Sync
Jan 21 19:45:08 thoregon kernel: [ 6691.864838] Emergency Sync complete
Jan 21 19:45:14 thoregon kernel: [ 6697.476112] SysRq : Emergency Remount R/O
Jan 21 19:46:33 thoregon kernel: [    0.000000] Linux version
3.3.0-rc1 (root@thoregon) (gcc version 4.5.3 (Gentoo 4.5.3-r2 p1.0,
pie-0.4.6) ) #1 SMP Fri Jan 20 09:54:26 CET 2012

I did not have any trouble with 3.2 or earlier kernels, so this looks
like a regression in 3.3-rc1.

Info from my card:
thoregon ~ # lspci -vvs 07:00.0
07:00.0 VGA compatible controller: Advanced Micro Devices [AMD] nee
ATI RV730 PRO [Radeon HD 4650] (prog-if 00 [VGA controller])
        Subsystem: Hightech Information System Ltd. Device 2269
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR+ FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 78
        Region 0: Memory at d0000000 (64-bit, prefetchable) [size=256M]
        Region 2: Memory at fe9e0000 (64-bit, non-prefetchable) [size=64K]
        Region 4: I/O ports at e000 [size=256]
        Expansion ROM at fe9c0000 [disabled] [size=128K]
        Capabilities: [50] Power Management version 3
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA
PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [58] Express (v2) Legacy Endpoint, MSI 00
                DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s
<4us, L1 unlimited
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal-
Unsupported-
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 128 bytes, MaxReadReq 128 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq-
AuxPwr- TransPend-
                LnkCap: Port #0, Speed 2.5GT/s, Width x16, ASPM L0s
L1, Latency L0 <64ns, L1 <1us
                        ClockPM- Surprise- LLActRep- BwNot-
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x16, TrErr- Train-
SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Not Supported, TimeoutDis-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
                LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance-
SpeedDis-, Selectable De-emphasis: -6dB
                         Transmit Margin: Normal Operating Range,
EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB,
EqualizationComplete-, EqualizationPhase1-
                         EqualizationPhase2-, EqualizationPhase3-,
LinkEqualizationRequest-
        Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
                Address: 00000000fee3f00c  Data: 4189
        Capabilities: [100 v1] Vendor Specific Information: ID=0001
Rev=1 Len=010 <?>
        Kernel driver in use: radeon

Please ask if you need any other information; I will try to provide it.

Torsten


* Re: [3.3-rc1]radeon 0000:07:00.0: GPU lockup CP stall for more than 10000msec
  2012-01-21 19:03 [3.3-rc1]radeon 0000:07:00.0: GPU lockup CP stall for more than 10000msec Torsten Kaiser
@ 2012-01-23 16:57 ` Jerome Glisse
  2012-01-23 18:01   ` Torsten Kaiser
  0 siblings, 1 reply; 5+ messages in thread
From: Jerome Glisse @ 2012-01-23 16:57 UTC (permalink / raw)
  To: Torsten Kaiser; +Cc: linux-kernel, Alex Deucher, Dave Airlie, dri-devel

[-- Attachment #1: Type: text/plain, Size: 572 bytes --]

On Sat, Jan 21, 2012 at 08:03:37PM +0100, Torsten Kaiser wrote:
> After updating to kernel 3.3-rc1 I have experienced a lockup of my GPU.
> I left my KDE desktop running until the screensaver turned off the
> monitors. But on key presses it would not turn back on. Ctrl+Alt+F1 to
> switch to another virtual console also did not work.
> Alt+SysRq magic still worked, so I was able to force the syslog to
> disk and restart the system.
> 

Can you test whether the attached patch helps in your case?

Of course it would be best if we did not lock up in the first place.

Cheers,
Jerome

[-- Attachment #2: 0001-drm-radeon-avoid-deadlock-if-GPU-lockup-is-detected-.patch --]
[-- Type: text/plain, Size: 7146 bytes --]

From 84e7f3d46d2a4ac226343e195b806820accdf2fe Mon Sep 17 00:00:00 2001
From: Jerome Glisse <jglisse@redhat.com>
Date: Mon, 23 Jan 2012 11:52:15 -0500
Subject: [PATCH] drm/radeon: avoid deadlock if GPU lockup is detected in
 ib_pool_get

If a GPU lockup is detected in ib_pool_get, we are holding the ib_pool
mutex that will be needed by the GPU reset code. As the ib_pool code is
safe to re-enter from the GPU reset code, we should not block when
trying to take the ib_pool lock on behalf of the same userspace
caller; thus use the radeon_mutex_lock helper.

Signed-off-by: Jerome Glisse <jglisse@redhat.com>
---
 drivers/gpu/drm/radeon/radeon.h        |   84 ++++++++++++++++----------------
 drivers/gpu/drm/radeon/radeon_device.c |    2 +-
 drivers/gpu/drm/radeon/radeon_ring.c   |   24 +++++-----
 3 files changed, 55 insertions(+), 55 deletions(-)

diff --git a/drivers/gpu/drm/radeon/radeon.h b/drivers/gpu/drm/radeon/radeon.h
index 73e05cb..1668ec1 100644
--- a/drivers/gpu/drm/radeon/radeon.h
+++ b/drivers/gpu/drm/radeon/radeon.h
@@ -157,6 +157,47 @@ bool radeon_get_bios(struct radeon_device *rdev);
 
 
 /*
+ * Mutex which allows recursive locking from the same process.
+ */
+struct radeon_mutex {
+	struct mutex		mutex;
+	struct task_struct	*owner;
+	int			level;
+};
+
+static inline void radeon_mutex_init(struct radeon_mutex *mutex)
+{
+	mutex_init(&mutex->mutex);
+	mutex->owner = NULL;
+	mutex->level = 0;
+}
+
+static inline void radeon_mutex_lock(struct radeon_mutex *mutex)
+{
+	if (mutex_trylock(&mutex->mutex)) {
+		/* The mutex was unlocked before, so it's ours now */
+		mutex->owner = current;
+	} else if (mutex->owner != current) {
+		/* Another process locked the mutex, take it */
+		mutex_lock(&mutex->mutex);
+		mutex->owner = current;
+	}
+	/* Otherwise the mutex was already locked by this process */
+
+	mutex->level++;
+}
+
+static inline void radeon_mutex_unlock(struct radeon_mutex *mutex)
+{
+	if (--mutex->level > 0)
+		return;
+
+	mutex->owner = NULL;
+	mutex_unlock(&mutex->mutex);
+}
+
+
+/*
  * Dummy page
  */
 struct radeon_dummy_page {
@@ -598,7 +639,7 @@ struct radeon_ib {
  * mutex protects scheduled_ibs, ready, alloc_bm
  */
 struct radeon_ib_pool {
-	struct mutex			mutex;
+	struct radeon_mutex		mutex;
 	struct radeon_sa_manager	sa_manager;
 	struct radeon_ib		ibs[RADEON_IB_POOL_SIZE];
 	bool				ready;
@@ -1355,47 +1396,6 @@ struct r600_vram_scratch {
 
 
 /*
- * Mutex which allows recursive locking from the same process.
- */
-struct radeon_mutex {
-	struct mutex		mutex;
-	struct task_struct	*owner;
-	int			level;
-};
-
-static inline void radeon_mutex_init(struct radeon_mutex *mutex)
-{
-	mutex_init(&mutex->mutex);
-	mutex->owner = NULL;
-	mutex->level = 0;
-}
-
-static inline void radeon_mutex_lock(struct radeon_mutex *mutex)
-{
-	if (mutex_trylock(&mutex->mutex)) {
-		/* The mutex was unlocked before, so it's ours now */
-		mutex->owner = current;
-	} else if (mutex->owner != current) {
-		/* Another process locked the mutex, take it */
-		mutex_lock(&mutex->mutex);
-		mutex->owner = current;
-	}
-	/* Otherwise the mutex was already locked by this process */
-
-	mutex->level++;
-}
-
-static inline void radeon_mutex_unlock(struct radeon_mutex *mutex)
-{
-	if (--mutex->level > 0)
-		return;
-
-	mutex->owner = NULL;
-	mutex_unlock(&mutex->mutex);
-}
-
-
-/*
  * Core structure, functions and helpers.
  */
 typedef uint32_t (*radeon_rreg_t)(struct radeon_device*, uint32_t);
diff --git a/drivers/gpu/drm/radeon/radeon_device.c b/drivers/gpu/drm/radeon/radeon_device.c
index 0afb13b..df0c4c9 100644
--- a/drivers/gpu/drm/radeon/radeon_device.c
+++ b/drivers/gpu/drm/radeon/radeon_device.c
@@ -720,7 +720,7 @@ int radeon_device_init(struct radeon_device *rdev,
 	/* mutex initialization are all done here so we
 	 * can recall function without having locking issues */
 	radeon_mutex_init(&rdev->cs_mutex);
-	mutex_init(&rdev->ib_pool.mutex);
+	radeon_mutex_init(&rdev->ib_pool.mutex);
 	for (i = 0; i < RADEON_NUM_RINGS; ++i)
 		mutex_init(&rdev->ring[i].mutex);
 	mutex_init(&rdev->dc_hw_i2c_mutex);
diff --git a/drivers/gpu/drm/radeon/radeon_ring.c b/drivers/gpu/drm/radeon/radeon_ring.c
index e8bc709..9af5317 100644
--- a/drivers/gpu/drm/radeon/radeon_ring.c
+++ b/drivers/gpu/drm/radeon/radeon_ring.c
@@ -109,12 +109,12 @@ int radeon_ib_get(struct radeon_device *rdev, int ring,
 		return r;
 	}
 
-	mutex_lock(&rdev->ib_pool.mutex);
+	radeon_mutex_lock(&rdev->ib_pool.mutex);
 	idx = rdev->ib_pool.head_id;
 retry:
 	if (cretry > 5) {
 		dev_err(rdev->dev, "failed to get an ib after 5 retry\n");
-		mutex_unlock(&rdev->ib_pool.mutex);
+		radeon_mutex_unlock(&rdev->ib_pool.mutex);
 		radeon_fence_unref(&fence);
 		return -ENOMEM;
 	}
@@ -139,7 +139,7 @@ retry:
 				 */
 				rdev->ib_pool.head_id = (1 + idx);
 				rdev->ib_pool.head_id &= (RADEON_IB_POOL_SIZE - 1);
-				mutex_unlock(&rdev->ib_pool.mutex);
+				radeon_mutex_unlock(&rdev->ib_pool.mutex);
 				return 0;
 			}
 		}
@@ -158,7 +158,7 @@ retry:
 		}
 		idx = (idx + 1) & (RADEON_IB_POOL_SIZE - 1);
 	}
-	mutex_unlock(&rdev->ib_pool.mutex);
+	radeon_mutex_unlock(&rdev->ib_pool.mutex);
 	radeon_fence_unref(&fence);
 	return r;
 }
@@ -171,12 +171,12 @@ void radeon_ib_free(struct radeon_device *rdev, struct radeon_ib **ib)
 	if (tmp == NULL) {
 		return;
 	}
-	mutex_lock(&rdev->ib_pool.mutex);
+	radeon_mutex_lock(&rdev->ib_pool.mutex);
 	if (tmp->fence && !tmp->fence->emitted) {
 		radeon_sa_bo_free(rdev, &tmp->sa_bo);
 		radeon_fence_unref(&tmp->fence);
 	}
-	mutex_unlock(&rdev->ib_pool.mutex);
+	radeon_mutex_unlock(&rdev->ib_pool.mutex);
 }
 
 int radeon_ib_schedule(struct radeon_device *rdev, struct radeon_ib *ib)
@@ -206,9 +206,9 @@ int radeon_ib_pool_init(struct radeon_device *rdev)
 {
 	int i, r;
 
-	mutex_lock(&rdev->ib_pool.mutex);
+	radeon_mutex_lock(&rdev->ib_pool.mutex);
 	if (rdev->ib_pool.ready) {
-		mutex_unlock(&rdev->ib_pool.mutex);
+		radeon_mutex_unlock(&rdev->ib_pool.mutex);
 		return 0;
 	}
 
@@ -216,7 +216,7 @@ int radeon_ib_pool_init(struct radeon_device *rdev)
 				      RADEON_IB_POOL_SIZE*64*1024,
 				      RADEON_GEM_DOMAIN_GTT);
 	if (r) {
-		mutex_unlock(&rdev->ib_pool.mutex);
+		radeon_mutex_unlock(&rdev->ib_pool.mutex);
 		return r;
 	}
 
@@ -236,7 +236,7 @@ int radeon_ib_pool_init(struct radeon_device *rdev)
 	if (radeon_debugfs_ring_init(rdev)) {
 		DRM_ERROR("Failed to register debugfs file for rings !\n");
 	}
-	mutex_unlock(&rdev->ib_pool.mutex);
+	radeon_mutex_unlock(&rdev->ib_pool.mutex);
 	return 0;
 }
 
@@ -244,7 +244,7 @@ void radeon_ib_pool_fini(struct radeon_device *rdev)
 {
 	unsigned i;
 
-	mutex_lock(&rdev->ib_pool.mutex);
+	radeon_mutex_lock(&rdev->ib_pool.mutex);
 	if (rdev->ib_pool.ready) {
 		for (i = 0; i < RADEON_IB_POOL_SIZE; i++) {
 			radeon_sa_bo_free(rdev, &rdev->ib_pool.ibs[i].sa_bo);
@@ -253,7 +253,7 @@ void radeon_ib_pool_fini(struct radeon_device *rdev)
 		radeon_sa_bo_manager_fini(rdev, &rdev->ib_pool.sa_manager);
 		rdev->ib_pool.ready = false;
 	}
-	mutex_unlock(&rdev->ib_pool.mutex);
+	radeon_mutex_unlock(&rdev->ib_pool.mutex);
 }
 
 int radeon_ib_pool_start(struct radeon_device *rdev)
-- 
1.7.6.4
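
Note on the patch above: the idea behind radeon_mutex (remember the
owner and a nesting level, so the same task can take the lock again
without blocking) can be tried out in userspace. What follows is only a
minimal sketch under stated assumptions, not the driver code: it uses
POSIX threads in place of the kernel mutex API, and the names rec_mutex,
self_token and nested_depth are hypothetical and do not appear in the
patch. The address of a thread-local variable stands in for the kernel's
`current`.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's `current`: the address of this
 * per-thread variable is unique to the calling thread. */
static _Thread_local char self_token;

struct rec_mutex {
	pthread_mutex_t mutex;
	_Atomic(void *) owner;	/* NULL while unlocked */
	int level;		/* nesting depth, only touched by the owner */
};

static void rec_mutex_init(struct rec_mutex *m)
{
	pthread_mutex_init(&m->mutex, NULL);
	atomic_init(&m->owner, NULL);
	m->level = 0;
}

static void rec_mutex_lock(struct rec_mutex *m)
{
	if (pthread_mutex_trylock(&m->mutex) == 0) {
		/* The mutex was unlocked before, so it is ours now */
		atomic_store(&m->owner, (void *)&self_token);
	} else if (atomic_load(&m->owner) != (void *)&self_token) {
		/* Another thread holds the mutex, wait for it */
		pthread_mutex_lock(&m->mutex);
		atomic_store(&m->owner, (void *)&self_token);
	}
	/* Otherwise this thread already holds the mutex */

	m->level++;
}

static void rec_mutex_unlock(struct rec_mutex *m)
{
	if (--m->level > 0)
		return;

	atomic_store(&m->owner, NULL);
	pthread_mutex_unlock(&m->mutex);
}

/* Mimics the shape of the reported trace: an outer call holds the lock
 * and a nested call on the same thread takes it again. */
static int nested_depth(struct rec_mutex *m)
{
	int depth, rc;

	rec_mutex_lock(m);	/* outer caller */
	rec_mutex_lock(m);	/* re-entry from the same thread */
	depth = m->level;

	/* The underlying plain mutex is busy at this point, so a plain
	 * blocking lock from this thread would never return. */
	rc = pthread_mutex_trylock(&m->mutex);
	assert(rc == EBUSY);
	(void)rc;

	rec_mutex_unlock(m);
	rec_mutex_unlock(m);
	return depth;
}
```

Calling nested_depth() on a freshly initialised rec_mutex returns the
nesting level seen by the inner caller and leaves the lock released
afterwards. The owner field is read without holding the mutex, which is
why the sketch makes it atomic; only the owning thread can ever observe
its own token there.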



* Re: [3.3-rc1]radeon 0000:07:00.0: GPU lockup CP stall for more than 10000msec
  2012-01-23 16:57 ` Jerome Glisse
@ 2012-01-23 18:01   ` Torsten Kaiser
  2012-01-24  7:34     ` Torsten Kaiser
  0 siblings, 1 reply; 5+ messages in thread
From: Torsten Kaiser @ 2012-01-23 18:01 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: linux-kernel, Alex Deucher, Dave Airlie, dri-devel

On Mon, Jan 23, 2012 at 5:57 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
> On Sat, Jan 21, 2012 at 08:03:37PM +0100, Torsten Kaiser wrote:
>> After updating to kernel 3.3-rc1 I have experienced a lockup of my GPU.
>> I left my KDE desktop running until the screensaver turned off the
>> monitors. But on key presses it would not turn back on. Ctrl+Alt+F1 to
>> switch to another virtual console also did not work.
>> Alt+SysRq magic still worked, so I was able to force the syslog to
>> disk and restart the system.
>>
>
> Can you test if attached patch help your case ?

Patch is installed, but I can't reproduce the hang on demand.
It did happen a second time yesterday while letting the screensaver
kick in, but only at around the third or fourth try. Just using "xset
dpms force standby/suspend/off" did not trigger it.

> Of course it would be best if we did not lockup in the first place.

Not sure if this is important: I also upgraded to mesa 8.0-rc1 before
the first hang, but after switching back to 3.2 while still using mesa
8.0 I did not have any problems.
Except for the KDE desktop effects there should not have been any OpenGL
programs running.
The screen saver itself is just turning the screens off via the KDE
power profile.

I will report again once I have succeeded in triggering the GPU lockup again...

Torsten


* Re: [3.3-rc1]radeon 0000:07:00.0: GPU lockup CP stall for more than 10000msec
  2012-01-23 18:01   ` Torsten Kaiser
@ 2012-01-24  7:34     ` Torsten Kaiser
  2012-01-28 10:20       ` Torsten Kaiser
  0 siblings, 1 reply; 5+ messages in thread
From: Torsten Kaiser @ 2012-01-24  7:34 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: linux-kernel, Alex Deucher, Dave Airlie, dri-devel

On Mon, Jan 23, 2012 at 7:01 PM, Torsten Kaiser
<just.for.lkml@googlemail.com> wrote:
> On Mon, Jan 23, 2012 at 5:57 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
>> On Sat, Jan 21, 2012 at 08:03:37PM +0100, Torsten Kaiser wrote:
>>> After updating to kernel 3.3-rc1 I have experienced a lockup of my GPU.
>>> I left my KDE desktop running until the screensaver turned off the
>>> monitors. But on key presses it would not turn back on. Ctrl+Alt+F1 to
>>> switch to another virtual console also did not work.
>>> Alt+SysRq magic still worked, so I was able to force the syslog to
>>> disk and restart the system.
>>>
>>
>> Can you test if attached patch help your case ?
>
> Patch is installed, but I can't reproduce the hang on demand.
> It did happen a second time yesterday while letting the screensaver
> kick in, but only at around the third or fourth try. Just using "xset
> dpms force standby/suspend/off" did not trigger it.

I think the patch did what it was intended to do, but it did not really help.
While the GPU reset did seem to work, X still got stuck and was not
able to turn the monitors back on.

From the log:
The GPU lockup happened while the system was idle:
Jan 23 23:53:54 thoregon kernel: [17121.080129] radeon 0000:07:00.0:
GPU lockup CP stall for more than 10000msec
Jan 23 23:53:54 thoregon kernel: [17121.080137] GPU lockup (waiting
for 0x002080B7 last fence id 0x002080B6)
Jan 23 23:53:54 thoregon kernel: [17121.096334] radeon 0000:07:00.0:
GPU softreset
Jan 23 23:53:54 thoregon kernel: [17121.096341] radeon 0000:07:00.0:
R_008010_GRBM_STATUS=0xA0003028
Jan 23 23:53:54 thoregon kernel: [17121.096346] radeon 0000:07:00.0:
R_008014_GRBM_STATUS2=0x00000002
Jan 23 23:53:54 thoregon kernel: [17121.096351] radeon 0000:07:00.0:
R_000E50_SRBM_STATUS=0x200000C0
Jan 23 23:53:54 thoregon kernel: [17121.096362] radeon 0000:07:00.0:
R_008020_GRBM_SOFT_RESET=0x00007FEE
Jan 23 23:53:54 thoregon kernel: [17121.111386] radeon 0000:07:00.0:
R_008020_GRBM_SOFT_RESET=0x00000001
Jan 23 23:53:54 thoregon kernel: [17121.127378] radeon 0000:07:00.0:
R_008010_GRBM_STATUS=0x00003028
Jan 23 23:53:54 thoregon kernel: [17121.127384] radeon 0000:07:00.0:
R_008014_GRBM_STATUS2=0x00000002
Jan 23 23:53:54 thoregon kernel: [17121.127390] radeon 0000:07:00.0:
R_000E50_SRBM_STATUS=0x200000C0
Jan 23 23:53:54 thoregon kernel: [17121.128393] radeon 0000:07:00.0:
GPU reset succeed
Jan 23 23:53:54 thoregon kernel: [17121.133330] [drm] PCIE GART of
512M enabled (table at 0x0000000000040000).
Jan 23 23:53:54 thoregon kernel: [17121.133364] radeon 0000:07:00.0: WB enabled
Jan 23 23:53:54 thoregon kernel: [17121.133370] [drm] fence driver on
ring 0 use gpu addr 0x20000c00 and cpu addr 0xffff8803286e5c00
Jan 23 23:53:54 thoregon kernel: [17121.179627] [drm] ring test on 0
succeeded in 1 usecs
Jan 23 23:53:54 thoregon kernel: [17121.179653] [drm] ib test on ring
0 succeeded in 1 usecs

There were no messages about X getting stuck ("blocked for more than
120 seconds"), but after trying to access the system and failing,
SysRq+W reported this:
Jan 24 08:08:20 thoregon kernel: [46786.741180] SysRq : Show Blocked State
Jan 24 08:08:20 thoregon kernel: [46786.741190]   task
       PC stack   pid father
Jan 24 08:08:20 thoregon kernel: [46786.741270] X               D
ffff880337d50a00     0  3047   3026 0x00400004
Jan 24 08:08:20 thoregon kernel: [46786.741281]  ffff880327eacac0
0000000000000086 ffff880327d52e00 0000000000010a00
Jan 24 08:08:20 thoregon kernel: [46786.741292]  ffff88031be9bfd8
0000000000010a00 ffff88031be9a000 ffff88031be9bfd8
Jan 24 08:08:20 thoregon kernel: [46786.741301]  0000000000010a00
ffff880327eacac0 0000000000010a00 0000000000010a00
Jan 24 08:08:20 thoregon kernel: [46786.741310] Call Trace:
Jan 24 08:08:20 thoregon kernel: [46786.741326]  [<ffffffff815ee9f7>]
? schedule_timeout+0x157/0x220
Jan 24 08:08:20 thoregon kernel: [46786.741336]  [<ffffffff8103fbd0>]
? run_timer_softirq+0x240/0x240
Jan 24 08:08:20 thoregon kernel: [46786.741346]  [<ffffffff8133ee39>]
? radeon_fence_wait+0x239/0x3b0
Jan 24 08:08:20 thoregon kernel: [46786.741356]  [<ffffffff8104f340>]
? wake_up_bit+0x40/0x40
Jan 24 08:08:20 thoregon kernel: [46786.741364]  [<ffffffff81352e07>]
? radeon_ib_get+0x257/0x2e0
Jan 24 08:08:20 thoregon kernel: [46786.741372]  [<ffffffff81354d7a>]
? radeon_cs_ioctl+0x27a/0x4d0
Jan 24 08:08:20 thoregon kernel: [46786.741381]  [<ffffffff812f42d4>]
? drm_ioctl+0x3e4/0x490
Jan 24 08:08:20 thoregon kernel: [46786.741389]  [<ffffffff81354b00>]
? radeon_cs_finish_pages+0xa0/0xa0
Jan 24 08:08:20 thoregon kernel: [46786.741398]  [<ffffffff81024769>]
? do_page_fault+0x199/0x420
Jan 24 08:08:20 thoregon kernel: [46786.741406]  [<ffffffff810af30c>]
? mmap_region+0x1dc/0x570
Jan 24 08:08:20 thoregon kernel: [46786.741414]  [<ffffffff810de446>]
? do_vfs_ioctl+0x96/0x4e0
Jan 24 08:08:20 thoregon kernel: [46786.741422]  [<ffffffff810de8d9>]
? sys_ioctl+0x49/0x90
Jan 24 08:08:20 thoregon kernel: [46786.741430]  [<ffffffff815f1922>]
? system_call_fastpath+0x16/0x1b

I searched my logs for more GPU lockups after noticing that this also
happened with 3.2.
The first lockup in my logs occurred on Nov 4 under 3.1. But until
3.3-rc1 X was always able to resume normal operation.

My best guess for the cause of the GPU lockups is the upgrade
from xf86-video-ati-6.14.2 to 6.14.3, but 3.3-rc1 seems to have an
independent bug that prevents X from recovering from a GPU lockup/reset.

>> Of course it would be best if we did not lockup in the first place.
>
> Not sure if this is important: I also upgraded to mesa 8.0-rc1 before
> the first hang, but after switching back to 3.2 but still using mesa
> 8.0 I did not have any problems.
> Except the KDE desktop effects there should not have been any OpenGL
> programs running.
> The screen saver itself is just turning the screens off via the KDE
> power profile.
>
> I will report again, when I succeeded in triggering the GPU lockup again...
>
> Torsten


* Re: [3.3-rc1]radeon 0000:07:00.0: GPU lockup CP stall for more than 10000msec
  2012-01-24  7:34     ` Torsten Kaiser
@ 2012-01-28 10:20       ` Torsten Kaiser
  0 siblings, 0 replies; 5+ messages in thread
From: Torsten Kaiser @ 2012-01-28 10:20 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: linux-kernel, Alex Deucher, Dave Airlie, dri-devel

On Tue, Jan 24, 2012 at 8:34 AM, Torsten Kaiser
<just.for.lkml@googlemail.com> wrote:
> On Mon, Jan 23, 2012 at 7:01 PM, Torsten Kaiser
> <just.for.lkml@googlemail.com> wrote:
>> On Mon, Jan 23, 2012 at 5:57 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
>>> On Sat, Jan 21, 2012 at 08:03:37PM +0100, Torsten Kaiser wrote:
>>>> After updating to kernel 3.3-rc1 I have experienced a lockup of my GPU.
>>>> I left my KDE desktop running until the screensaver turned off the
>>>> monitors. But on key presses it would not turn back on. Ctrl+Alt+F1 to
>>>> switch to another virtual console also did not work.
>>>> Alt+SysRq magic still worked, so I was able to force the syslog to
>>>> disk and restart the system.
>>>>
>>>
>>> Can you test if attached patch help your case ?
>>
>> Patch is installed, but I can't reproduce the hang on demand.
>> It did happen a second time yesterday while letting the screensaver
>> kick in, but only at around the third or fourth try. Just using "xset
>> dpms force standby/suspend/off" did not trigger it.
>
> I think the patch did what it was intended to do, but it did not really help.
> While the GPU reset did seem to work, X still got stuck and was not
> able to turn the monitors back on.
>
> From the log:
> The GPU lockup happened while the system was idle:
> Jan 23 23:53:54 thoregon kernel: [17121.080129] radeon 0000:07:00.0:
> GPU lockup CP stall for more than 10000msec
> Jan 23 23:53:54 thoregon kernel: [17121.080137] GPU lockup (waiting
> for 0x002080B7 last fence id 0x002080B6)
> Jan 23 23:53:54 thoregon kernel: [17121.096334] radeon 0000:07:00.0:
> GPU softreset
> Jan 23 23:53:54 thoregon kernel: [17121.096341] radeon 0000:07:00.0:
> R_008010_GRBM_STATUS=0xA0003028
> Jan 23 23:53:54 thoregon kernel: [17121.096346] radeon 0000:07:00.0:
> R_008014_GRBM_STATUS2=0x00000002
> Jan 23 23:53:54 thoregon kernel: [17121.096351] radeon 0000:07:00.0:
> R_000E50_SRBM_STATUS=0x200000C0
> Jan 23 23:53:54 thoregon kernel: [17121.096362] radeon 0000:07:00.0:
> R_008020_GRBM_SOFT_RESET=0x00007FEE
> Jan 23 23:53:54 thoregon kernel: [17121.111386] radeon 0000:07:00.0:
> R_008020_GRBM_SOFT_RESET=0x00000001
> Jan 23 23:53:54 thoregon kernel: [17121.127378] radeon 0000:07:00.0:
> R_008010_GRBM_STATUS=0x00003028
> Jan 23 23:53:54 thoregon kernel: [17121.127384] radeon 0000:07:00.0:
> R_008014_GRBM_STATUS2=0x00000002
> Jan 23 23:53:54 thoregon kernel: [17121.127390] radeon 0000:07:00.0:
> R_000E50_SRBM_STATUS=0x200000C0
> Jan 23 23:53:54 thoregon kernel: [17121.128393] radeon 0000:07:00.0:
> GPU reset succeed
> Jan 23 23:53:54 thoregon kernel: [17121.133330] [drm] PCIE GART of
> 512M enabled (table at 0x0000000000040000).
> Jan 23 23:53:54 thoregon kernel: [17121.133364] radeon 0000:07:00.0: WB enabled
> Jan 23 23:53:54 thoregon kernel: [17121.133370] [drm] fence driver on
> ring 0 use gpu addr 0x20000c00 and cpu addr 0xffff8803286e5c00
> Jan 23 23:53:54 thoregon kernel: [17121.179627] [drm] ring test on 0
> succeeded in 1 usecs
> Jan 23 23:53:54 thoregon kernel: [17121.179653] [drm] ib test on ring
> 0 succeeded in 1 usecs

I found the commit (in xf86-video-ati) that causes the lockups and
filed a bug at the xorg bugzilla about it:
https://bugs.freedesktop.org/show_bug.cgi?id=45329

But that still leaves the regression in 3.3-rc1: even with Jerome's
patch the X server is no longer able to recover from the lockup, as
shown by the SysRq+W trace below.

> There were no messages about X getting stuck ("blocked for more than
> 120 seconds"), but after trying to access the system and failing,
> SysRq+W reported this:
> Jan 24 08:08:20 thoregon kernel: [46786.741180] SysRq : Show Blocked State
> Jan 24 08:08:20 thoregon kernel: [46786.741190]   task
>       PC stack   pid father
> Jan 24 08:08:20 thoregon kernel: [46786.741270] X               D
> ffff880337d50a00     0  3047   3026 0x00400004
> Jan 24 08:08:20 thoregon kernel: [46786.741281]  ffff880327eacac0
> 0000000000000086 ffff880327d52e00 0000000000010a00
> Jan 24 08:08:20 thoregon kernel: [46786.741292]  ffff88031be9bfd8
> 0000000000010a00 ffff88031be9a000 ffff88031be9bfd8
> Jan 24 08:08:20 thoregon kernel: [46786.741301]  0000000000010a00
> ffff880327eacac0 0000000000010a00 0000000000010a00
> Jan 24 08:08:20 thoregon kernel: [46786.741310] Call Trace:
> Jan 24 08:08:20 thoregon kernel: [46786.741326]  [<ffffffff815ee9f7>]
> ? schedule_timeout+0x157/0x220
> Jan 24 08:08:20 thoregon kernel: [46786.741336]  [<ffffffff8103fbd0>]
> ? run_timer_softirq+0x240/0x240
> Jan 24 08:08:20 thoregon kernel: [46786.741346]  [<ffffffff8133ee39>]
> ? radeon_fence_wait+0x239/0x3b0
> Jan 24 08:08:20 thoregon kernel: [46786.741356]  [<ffffffff8104f340>]
> ? wake_up_bit+0x40/0x40
> Jan 24 08:08:20 thoregon kernel: [46786.741364]  [<ffffffff81352e07>]
> ? radeon_ib_get+0x257/0x2e0
> Jan 24 08:08:20 thoregon kernel: [46786.741372]  [<ffffffff81354d7a>]
> ? radeon_cs_ioctl+0x27a/0x4d0
> Jan 24 08:08:20 thoregon kernel: [46786.741381]  [<ffffffff812f42d4>]
> ? drm_ioctl+0x3e4/0x490
> Jan 24 08:08:20 thoregon kernel: [46786.741389]  [<ffffffff81354b00>]
> ? radeon_cs_finish_pages+0xa0/0xa0
> Jan 24 08:08:20 thoregon kernel: [46786.741398]  [<ffffffff81024769>]
> ? do_page_fault+0x199/0x420
> Jan 24 08:08:20 thoregon kernel: [46786.741406]  [<ffffffff810af30c>]
> ? mmap_region+0x1dc/0x570
> Jan 24 08:08:20 thoregon kernel: [46786.741414]  [<ffffffff810de446>]
> ? do_vfs_ioctl+0x96/0x4e0
> Jan 24 08:08:20 thoregon kernel: [46786.741422]  [<ffffffff810de8d9>]
> ? sys_ioctl+0x49/0x90
> Jan 24 08:08:20 thoregon kernel: [46786.741430]  [<ffffffff815f1922>]
> ? system_call_fastpath+0x16/0x1b
>
> I searched my logs for more GPU lockups after noticing that this also
> happened with 3.2.
> The first lockup in my logs occurred on Nov 4 under 3.1. But until
> 3.3-rc1 X was always able to resume normal operation.
>
> My best guess for the cause of the GPU lockups is the upgrade
> from xf86-video-ati-6.14.2 to 6.14.3, but 3.3-rc1 seems to have an
> independent bug that prevents X from recovering from a GPU lockup/reset.
>
>>> Of course it would be best if we did not lockup in the first place.
>>
>> Not sure if this is important: I also upgraded to mesa 8.0-rc1 before
>> the first hang, but after switching back to 3.2 but still using mesa
>> 8.0 I did not have any problems.
>> Except the KDE desktop effects there should not have been any OpenGL
>> programs running.
>> The screen saver itself is just turning the screens off via the KDE
>> power profile.
>>
>> I will report again, when I succeeded in triggering the GPU lockup again...
>>
>> Torsten
