From: bugzilla-daemon@freedesktop.org
To: dri-devel@lists.freedesktop.org
Subject: [Bug 111551] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout
Date: Tue, 03 Sep 2019 13:40:26 +0000

https://bugs.freedesktop.org/show_bug.cgi?id=111551
Bug ID:      111551
Summary:     [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout
Product:     DRI
Version:     XOrg git
Hardware:    ARM
OS:          Linux (All)
Status:      NEW
Severity:    major
Priority:    not set
Component:   DRM/AMDgpu
Assignee:    dri-devel@lists.freedesktop.org
Reporter:    78666679@qq.com

The amdgpu DRM driver (Polaris10, Radeon Pro WX 5100) sometimes reports:

[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=24423862, emitted seq=24423865

and many threads get stuck in the uninterruptible "disk sleep" (D) state.


kernel version: 4.19.36

mesa: 18.3.6


You are receiving this mail because:
  • You are the assignee for the bug.

From: bugzilla-daemon@freedesktop.org
To: dri-devel@lists.freedesktop.org
Subject: [Bug 111551] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout
Date: Tue, 03 Sep 2019 13:42:45 +0000

yanhua <78666679@qq.com> changed bug 111551
What    |Removed |Added
--------+--------+----------------
CC      |        |78666679@qq.com

Comment #1 on bug 111551 from yanhua <78666679@qq.com>
Created attachment 145253 --> https://bugs.freedesktop.org/attachment.cgi?id=145253&action=edit
dmesg output

Running "grep drm dmesg.txt" shows the sdma1 ring timeouts.
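
For reference, the filtering step described above can be sketched in Python. This is a hypothetical helper, not a tool from this thread: it assumes the timeout lines follow the exact format quoted in the original report, and it reads the gap between "emitted seq" and "signaled seq" as the number of fences that had been emitted on the ring but not yet signaled when the timeout fired.

```python
import re

# Assumed format, taken from the line quoted in the original report:
# [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=24423862, emitted seq=24423865
TIMEOUT_RE = re.compile(
    r"amdgpu_job_timedout.*ring (\S+) timeout, "
    r"signaled seq=(\d+), emitted seq=(\d+)"
)


def find_ring_timeouts(lines):
    """Return (ring, signaled, emitted, pending) for every matching log line."""
    results = []
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m:
            ring = m.group(1)
            signaled, emitted = int(m.group(2)), int(m.group(3))
            # Fences emitted on the ring that had not signaled yet.
            results.append((ring, signaled, emitted, emitted - signaled))
    return results


def scan_dmesg_file(path):
    """Hypothetical helper: scan a saved log such as dmesg-full.txt."""
    with open(path, encoding="utf-8", errors="replace") as f:
        return find_ring_timeouts(f)


sample = ("[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, "
          "signaled seq=24423862, emitted seq=24423865")
print(find_ring_timeouts([sample]))
# [('sdma1', 24423862, 24423865, 3)]
```

On the line from the original report this gives a gap of 3 on sdma1. Using re.search rather than re.match lets the pattern skip any dmesg timestamp prefix.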


From: bugzilla-daemon@freedesktop.org
To: dri-devel@lists.freedesktop.org
Subject: [Bug 111551] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout
Date: Wed, 04 Sep 2019 05:14:56 +0000

Comment #2 on bug 111551 from yanhua <78666679@qq.com>
Created attachment 145260 --> https://bugs.freedesktop.org/attachment.cgi?id=145260&action=edit
Some messages in the previous dmesg.txt had been overwritten. dmesg-full.txt contains more information.


From: bugzilla-daemon@freedesktop.org
To: dri-devel@lists.freedesktop.org
Subject: [Bug 111551] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout
Date: Wed, 04 Sep 2019 11:45:27 +0000

Comment #3 on bug 111551 from Christian König
As far as I can see, this is a really large box with multiple GPUs installed.

The SDMA rarely locks up, especially not while executing page table updates. So
there is most likely something wrong with the hardware here.

Are you sure that the power supply is large enough for that system?

What system/platform is that? Could this be a coherency problem?


From: bugzilla-daemon@freedesktop.org
To: dri-devel@lists.freedesktop.org
Subject: [Bug 111551] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout
Date: Wed, 04 Sep 2019 12:26:49 +0000

Comment #4 on bug 111551 from yanhua <78666679@qq.com>
I asked the hardware team. They have tested it and are sure there is no
power supply problem.


The system is arm64 with 64 cores, and there are three amdgpu cards on the
board.


GFX, SDMA and VCE ring timeouts all occur, though rarely. When a ring timeout
occurs, we can use the AMD-supplied tool umr to read the chip registers. Can we
determine the real cause from the register values?

Regarding the coherency problem you mentioned: I think that if that were the
cause, the problem would occur more frequently. But I'm not sure.


From: bugzilla-daemon@freedesktop.org
To: dri-devel@lists.freedesktop.org
Subject: [Bug 111551] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout
Date: Wed, 04 Sep 2019 12:35:52 +0000

Christian König changed bug 111551
What       |Removed |Added
-----------+--------+---------
Resolution |---     |INVALID
Status     |NEW     |RESOLVED

Comment #5 on bug 111551 from Christian König
amdgpu was known not to work on arm64 until very recently.

So it is no surprise that this isn't working. Please switch to a newer
kernel and re-test.

Apart from that there isn't much we can do about it.


From: bugzilla-daemon@freedesktop.org
To: dri-devel@lists.freedesktop.org
Subject: [Bug 111551] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout
Date: Wed, 04 Sep 2019 12:50:15 +0000

Comment #6 on bug 111551 from yanhua <78666679@qq.com>
As far as I know, arm64 does not support WC (write-combined) memory, and we
have already turned off the WC flag, as newer kernel versions do.

