From mboxrd@z Thu Jan 1 00:00:00 1970
From: bugzilla-daemon@freedesktop.org
Subject: [Bug 106671] Frequent lock ups for AMD RX 550 graphics card
Date: Mon, 28 May 2018 08:47:31 +0000
Message-ID:
What       | Removed                     | Added
-----------+-----------------------------+--------------------------------
Component  | Driver/AMDgpu               | Drivers/Gallium/radeonsi
Assignee   | xorg-driver-ati@lists.x.org | dri-devel@lists.freedesktop.org
Version    | unspecified                 | 18.0
QA Contact | xorg-team@lists.x.org       | dri-devel@lists.freedesktop.org
Product    | xorg                        | Mesa
Please attach the corresponding full Xorg log and dmesg output.
This is most likely between Mesa and the kernel; xf86-video-amdgpu doesn't
contain any GPU specific rendering code which could cause hangs. I'd recommend
trying latest upstream versions of Mesa (18.1) and the kernel, and if it still
happens, also try getting the current microcode files from
https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/amdgpu
Created attachment 139816 [details]
X log file as requested
Created attachment 139817 [details]
dmesg output as requested
Hi Michel:
I have added your requested attachments. If there are other data you need or other tests I can run, let me know.
Meanwhile, all else seems well with this new computer (e.g., the lock ups are gone under my normal KDE desktop use since I bypassed using this card 3 days ago by displaying my desktop on an X server running on a different computer). But that is only a temporary workaround (another person needs to use that other computer's display and keyboard/mouse). Therefore, I need the RX 550 to work reliably on my new computer, which is why I will be following your recommendations with regard to trying the latest kernel, mesa, and (if all else fails) firmware. But building kernel and mesa is going to take me considerable time for the reasons I mentioned in my original post.
Hi Michel:
Since the lock ups occurred during ordinary (KDE) desktop use when I wasn't running 3D games, I ignored mesa upgrades and instead concentrated first on trying a new kernel version (from 4.16.5 to 4.16.12, because 4.16.12 had conveniently just been propagated from Debian Sid to Buster). So far it appears that upgrade makes a large improvement for the RX 550. Previously, for 4.16.5, the uptimes before a lock up occurred ranged from 7 hours to 2 days, but right now, with heavy desktop use and a substantial number of runs of the 3D game, I haven't experienced a single lock up with 4.16.12, with current uptime since I booted 4.16.12 approaching 3 days.
You may well conclude "problem already solved", but I normally run my computer 24/7 with reboots only when absolutely necessary. Therefore I would like to keep this bug report open for a while just to report the maximum uptimes (hopefully at least several months) I can achieve with this graphics card.
Edit of previous comment: of the 3D game, -> of the 3D game, foobillard,
Please remove the resolution of this bug as FIXED.
The reason for this request is that subsequent kernel-4.16.x use after the initial success I reported continued to show lock ups whenever this graphics card was used. Yesterday, I tried kernel-4.17.17-1 from Debian Buster (the first time I had tried any kernel-4.17.x version) in great anticipation that these kernel lockups would be fixed (since kernel-4.17.x apparently contains lots of AMD graphics fixes). But when I used this graphics card for ordinary direct desktop use (as opposed to accessing my desktop on the new computer via an X-terminal, which is so far the only stable way I can use my new computer), I got a lockup within a half hour or so, followed by one roughly 8 hours later. For what it is worth, I had also installed mesa-18.1.6-1 and version 20180518-1 of the firmware-amd-graphics package from Debian Buster before performing this failing experiment.
So it appears the substantial number of AMD graphics fixes in kernel-4.17.x and mesa-18.1.y and installation of the relatively recent (from May) Debian Buster firmware-amd-graphics package are not sufficient to stabilize use of this AMD RX 550 graphics card. That is a big disappointment, since this card should no longer be considered cutting-edge hardware (i.e., it was first offered for sale at least 16 months ago), and this delay in fixing it cannot be attributed to non-cooperation from AMD, since they appear to have a good open-source record.
Because of these on-going issues with direct use of this card, I am going back to using the X-terminal method with this kernel, which experience with kernel-4.16.x shows is much more stable since it avoids using this graphics card completely (except for the direct display of the Linux console login prompt).
I plan to again try the experiment of attempting to use this card directly when kernel-4.18.x is promoted to Buster. But meanwhile, if you have any other suggestions I could try, please let me know.
Created attachment 141451 [details]
tarball containing kern.log, syslog, and dmesg output
We (there are two of us using this machine) just got yet another kernel lockup (no remote access possible with ssh, direct keyboard not working), but this is a case when we were remotely accessing this box with an X-terminal. In other words, the only use of the RX 550 was to display the command-line login prompt for the Linux console on the directly attached monitor until the lockup, when it displayed the following message (roughly 15 times in the half hour before I got out of the lockup by pushing the reset button):
    watchdog: BUG: soft lockup - CPU#12 stuck for 22s! [firefox-esr:29266]
(At the time we were both browsing different sites with firefox, with one of those firefox instances running a couple of days, and as a security measure we both restrict the use of javascript with the noscript extension to firefox.)
I have attached a tarball containing log files (kern.log and syslog) that contain the lockup information (including the above message) as well as information about the fresh boot afterwards. (For what it is worth, that tarball also includes dmesg output, which appears to contain information only about the fresh boot.)
For this minimal use case for the RX 550, the Linux kernel lasted 6 days before the lockup, which is much better than the direct use case where the lockups can occur as soon as a half hour after a fresh boot. So the current lockup could be due to an entirely different bug than in the lockups I have encountered for the direct use case. But, of course, minimal use is not zero use, so currently I ascribe both the present remote-use lockup and the previous direct-use lockups to some incompatibility between the RX 550 and the Debian Testing graphics stack.
That stack currently includes the following component versions:
    linux-image-4.17.0-3-amd64  4.17.17-1
    firmware-amd-graphics       20180518-1
    libdrm-amdgpu1:amd64        2.4.93-1
    libglapi-mesa:amd64         18.1.6-1
    xserver-xorg-video-amdgpu   18.0.1-1+b1
Please let me know if there are any other data you need or any experiments you would like me to try. In any case I plan to continue with remote use of this box while reporting lockup incidents as they occur. But I also plan to try direct use again whenever one of the components of the above stack gets significantly upgraded for Debian Testing.
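For anyone going through the attached kern.log or syslog, here is a minimal sketch (Python, not part of the original report) of how the repeated soft-lockup messages could be pulled out and counted. The regular expression is modelled only on the single message quoted above, and the name `find_soft_lockups` is hypothetical; other kernels may format the line differently.

```python
import re

# Modelled on the one message quoted in this report:
#   watchdog: BUG: soft lockup - CPU#12 stuck for 22s! [firefox-esr:29266]
SOFT_LOCKUP = re.compile(
    r"watchdog: BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<task>.+):(?P<pid>\d+)\]"
)


def find_soft_lockups(lines):
    """Return one dict per soft-lockup report found in an iterable of log lines."""
    hits = []
    for line in lines:
        match = SOFT_LOCKUP.search(line)
        if match:
            hits.append({
                "cpu": int(match.group("cpu")),
                "seconds": int(match.group("secs")),
                "task": match.group("task"),
                "pid": int(match.group("pid")),
            })
    return hits
```

It could be fed an open log file directly, e.g. `find_soft_lockups(open("kern.log", errors="replace"))`, where the path is whatever the tarball was unpacked to.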
(In reply to Alan W. Irwin from comment #9)
> So the current lockup could be due to an entirely different bug than in the
> lockups I have encountered for the direct use case.
Yeah, that looks like an RCU or other core kernel issue, not directly related to the graphics drivers (which as you say, aren't really being used in this case).
Does idle=nomwait on the kernel command line help for any of these issues, by any chance?
It's also worth making sure the motherboard BIOS is up to date.
Thanks for that idle=nomwait suggestion, which I have now just tried (verified by
    irwin@merlin> cat /proc/cmdline
    BOOT_IMAGE=/boot/vmlinuz-4.17.0-3-amd64 root=UUID=1e45a1ee-a5d6-4327-9a7b-2663ffc0b157 ro rootwait quiet idle=nomwait
) and I now indeed have a stable result. However, that is currently just for the last 5 minutes in remote access mode. :-) So we will see how this goes for, say, the next two weeks, to see if I can beat my last 4.17.17 remote access uptime record of 6 days.
With regard to your MB BIOS update suggestion, I am going to hold back on that for a while since the techs from a local computer company that assembled my box in May felt such updates were dangerous and therefore a last resort. And that is also the consistent advice I have gotten for the other 3 Linux boxes I have had assembled for me since I started using Linux in 1996. Of course, this year may be a special case with all the Meltdown (although not for this AMD hardware) and many variants of SPECTRE out there, so I do plan to update the BIOS within the next couple of months on the assumption that the SPECTRE BIOS mitigations recommended by AMD to ASUS for this hardware (PRIME B350+ MB with AMD Ryzen 7 1700 CPU, 64GB RAM, and ASUS RX 550 graphics card) will have matured by then.
But before I implement that planned BIOS update, I am hoping that the current cutting-edge Linux graphics stack (which according to a senior Phoronix poster works well for the RX 560) will also give me stable direct-display results for the RX 550 once that version of the graphics stack propagates to Debian Testing. I estimate that propagation time will be a couple of more months based on how quickly elements of the cutting-edge Linux graphics stack such as the kernel have propagated in the past from upstream to Debian Testing.
In sum, it is a waiting game now to see if your idle=nomwait suggestion restores the complete Linux stability I was used to with my old box (for Debian Oldstable = Jessie) for at least the remote display case, and if that stability is obviously much better (i.e., at least a couple of weeks uptime with no lockups) then I will try the direct display case again with idle=nomwait to see if it makes that case stable as well.
Thanks, Michel, for your on-going helpful suggestions for dealing with this troubling instability issue (these troubling instability issues?) for my new Linux box.
Alan
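Since each test in this report is verified by eyeballing /proc/cmdline, a minimal sketch (Python, not part of the original report) of the same check as a script may be useful. The function names are hypothetical; the only assumption about the system is that /proc/cmdline holds the parameters of the running kernel, as shown in the comment above.

```python
def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {parameter: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        # Split on the first '=' only, so root=UUID=... keeps its full value.
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def has_idle_nomwait(cmdline: str) -> bool:
    """Return True when idle=nomwait is present on the given command line."""
    return parse_cmdline(cmdline).get("idle") == "nomwait"


def running_kernel_has_idle_nomwait(path: str = "/proc/cmdline") -> bool:
    """Check the command line of the currently running kernel."""
    with open(path) as handle:
        return has_idle_nomwait(handle.read())
```

Calling `running_kernel_has_idle_nomwait()` after each reboot would confirm that the parameter survived a kernel upgrade before a new uptime test is started.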
(In reply to Michel Dänzer from comment #10)
> [T]hat looks like an RCU or other core kernel issue, not directly
> related to the graphics drivers.
Hi Michel:
If so, should I report that probable non-graphics kernel bug (with my crash-report tarball) elsewhere? Or do you suggest I just forget it until I see what are the remote graphics results of idle=nomwait over the course of the next couple of weeks AND (if that is a success) the direct graphics results of idle=nomwait for a couple of more weeks after that?
Well, after 1.5 (successful) days with the remote graphics experiment, I decided instead it made more sense to go after the quicker-acting instability that I have previously experienced in direct graphics mode. So just now I have started a direct graphics experiment after a Debian Testing upgrade which included the following firmware and mesa changes:
    firmware-amd-graphics updated "(20180825+dfsg-1) over (20180518-1)"
    mesa updated "(18.1.7-1) over (18.1.6-1)"
In addition, for this experiment I installed the amd64-microcode package that contains "microcode patches for all AMD AMD64 processors".
Also, as part of this experiment I have continued with the idle=nomwait kernel parameter as verified by
    irwin@merlin> cat /proc/cmdline
    BOOT_IMAGE=/boot/vmlinuz-4.17.0-3-amd64 root=UUID=1e45a1ee-a5d6-4327-9a7b-2663ffc0b157 ro rootwait quiet idle=nomwait
N.B. those kernel parameters do not include any amdgpu-related parameters. Do you recommend any such parameters for the RX 550, such as amdgpu.dc=1, which is sometimes recommended for older versions of AMD new-generation graphics hardware?
Created attachment 141479 [details]
compressed dmesg output from current direct graphics experiment
Created attachment 141567 [details]
log files from latest lockup
I was beginning to have some hope that the latest direct access experiment would prove to be stable. However, just now it locked up again after almost 7 days. So the stability is substantially improved compared to before, and my guess is that improvement is due to installation of the amd64-microcode package from Debian Buster for this latest experiment.
However, this is still disappointing stability because typically for truly stable systems I achieve up times of 30 days or longer, with the only limit on uptime being how often I have to reboot due to kernel upgrades.
I have attached a crash report tarball containing dmesg output as well as various log files that captured all log activity before the lockup and the boot afterward. I don't see anything concerning the crash in those log files, but I may be missing something since I am no expert, so I would appreciate it if you took a look.
I have restarted exactly the same direct graphics access test again (with the same versions of graphics stack packages and your recommended idle=nomwait kernel parameter) in hopes that the kernel will last longer this time before the lockup and/or I catch more details of the lockup when it occurs. If you would prefer me to try a different variant of this test, please let me know.
I terminated the last test immediately because it turns out a new kernel (Linux merlin 4.18.0-1-amd64 #1 SMP Debian 4.18.6-1 (2018-09-06) x86_64 GNU/Linux) has propagated from Debian Unstable to Debian Testing = Buster, so I will use that kernel for my new test. On boot with this new kernel the usual blast of random color on the Linux console displayed by the RX 550 that I am used to for all previous kernel versions is now gone. So that is a positive step in the right direction, and I hope that means the Debian Buster graphics stack is finally completely stable for the RX 550, but I will test that hypothesis with this latest test.
The latest Debian Buster graphics stack versions for this direct graphics kernel stability test for the RX 550 are as follows:
    linux-image-4.18.0-1-amd64  4.18.6-1
    amd64-microcode             3.20180524.1
    firmware-amd-graphics       20180825+dfsg-1
    libdrm-amdgpu1:amd64        2.4.94-1
    libglapi-mesa:amd64         18.1.7-1
    xserver-xorg-video-amdgpu   18.0.1-1+b1
Here are my kernel parameters, which include the suggested idle=nomwait:
    irwin@merlin> cat /proc/cmdline
    BOOT_IMAGE=/boot/vmlinuz-4.18.0-1-amd64 root=UUID=1e45a1ee-a5d6-4327-9a7b-2663ffc0b157 ro rootwait quiet idle=nomwait
Created attachment 141706 [details]
tarball containing log information concerning latest lockup
Despite a new kernel, this instability issue has continued. Kernel 4.18.6 locked up after 8+ days of up time on our principal computer that has the RX 550 graphics card installed. (I will refer to this computer as the "new" computer, our other working Linux computer that is used to display X results from the new computer as the X-terminal, and our old principal computer (powered down permanently now) as the "old" computer.) The lockup of the new computer occurred some time in the early morning and (since two users use this machine at one time) with one inactive XFCE desktop being displayed on our X-terminal and one inactive XFCE desktop being displayed directly on the new computer. The only symptom of the lockup I could spot in the log files was a burst of null bytes in each log file. For what it is worth, that symptom is new. See the attached crash_report_20180923.tar.gz for the log file and dmesg details.
This result of 8+ days of up time for direct graphics desktop use of the new computer is slightly better than the almost 7 days of up time achieved for the previous similar test for kernel 4.17.17. Although the present up time result at least encourages further testing with kernel 4.18.x, this is only one test, and the next test might give a substantially shorter or longer up time. In any case this result is still far from ideal, since such lockups never occurred on the old computer that this new computer replaced and also do not currently occur for the X-terminal. That is, on the old principal box up times exceeding 30 days have been common, and similarly on the X-terminal, and the only reason I rebooted in those cases was power interruptions or the installation of a new kernel. For the present case of the new box, the lockups mean the only recovery possible is to hit the reset button, with all that implies about journal recovery and potential file deletion for files that are in inconsistent shape due to the lockup.
For what it is worth, the lockup symptoms this time were a bit different than before. The new computer had a frozen display (rather than blank before), and frozen mouse and keyboard (as before). The X-terminal used to remotely access a desktop running on the new computer had a frozen display (rather than blanked) with working keyboard (and maybe mouse, but I didn't record that), so I could exit the local X and get to the Linux console, where ping to the new computer actually worked (as opposed to ping not working at all for the previous lockup). So because networking was working, ssh to the new computer didn't time out. However, it ran for 20+ minutes with no sign of a login, so the net result was the same as for previous lockups; there was no way to log in to the new computer from another computer to shut down the new computer normally, so the only method of shutting it down was to hit the reset button.
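The burst of null bytes mentioned in this comment is easy to miss when paging through a log by hand. Below is a minimal sketch (Python, not part of the original report) that reports where such runs occur in a file's raw bytes. The name `find_null_runs` and the threshold of 16 bytes are arbitrary assumptions, not anything measured from the attached logs.

```python
def find_null_runs(data: bytes, min_length: int = 16):
    """Return (offset, length) for every run of NUL bytes at least min_length long."""
    runs = []
    start = None  # offset where the current run of NUL bytes began, if any
    for index, byte in enumerate(data):
        if byte == 0:
            if start is None:
                start = index
        elif start is not None:
            if index - start >= min_length:
                runs.append((start, index - start))
            start = None
    # A run that reaches the end of the data has no terminating byte.
    if start is not None and len(data) - start >= min_length:
        runs.append((start, len(data) - start))
    return runs
```

Reading a log with `open(path, "rb").read()` and passing the result in gives the byte offsets of each burst, which can then be compared against the timestamps of the last lines written before it.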
I started a new stability test as of 2018-09-23 15:34:19 right after a Debian Buster dist-upgrade. The graphics stack versions for this test are as follows:
    ii  amd64-microcode             3.20180524.1     amd64  Processor microcode firmware for AMD CPUs
    ii  firmware-amd-graphics       20180825+dfsg-1  all    Binary firmware for AMD/ATI graphics chips
    ii  libdrm-amdgpu1:amd64        2.4.94-1         amd64  Userspace interface to amdgpu-specific kernel DRM services -- runtime
    ii  libglapi-mesa:amd64         18.1.7-1         amd64  free implementation of the GL API -- shared library
    ii  linux-image-4.18.0-1-amd64  4.18.6-1         amd64  Linux 4.18 for 64-bit PCs
    ii  xserver-xorg-video-amdgpu   18.1.0-1         amd64  X.Org X server -- AMDGPU display driver
That is, these versions are identical to the previous test other than the (substantial) update of the AMDGPU display driver from version 18.0.1-1+b1 to version 18.1.0-1.
The kernel parameters were the same as the previous test, e.g.,
    BOOT_IMAGE=/boot/vmlinuz-4.18.0-1-amd64 root=UUID=1e45a1ee-a5d6-4327-9a7b-2663ffc0b157 ro rootwait quiet idle=nomwait
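Because each test differs from the previous one by only a package or two, a minimal sketch (Python, not part of the original report) of how two such listings could be compared automatically follows. It assumes the listing is in the whitespace-separated `dpkg -l` layout shown above (status, name, version, architecture, description); the function names are hypothetical.

```python
def parse_dpkg_lines(text: str) -> dict:
    """Map package name to version for each installed ('ii') line of a dpkg -l listing."""
    versions = {}
    for line in text.splitlines():
        # Columns: status, name, version, architecture, free-text description.
        fields = line.split(None, 4)
        if len(fields) >= 3 and fields[0] == "ii":
            versions[fields[1]] = fields[2]
    return versions


def changed_packages(old: dict, new: dict) -> dict:
    """Return {package: (old_version, new_version)} for packages whose version differs."""
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(set(old) | set(new))
        if old.get(name) != new.get(name)
    }
```

Saving the listing at the start of every test and diffing it against the previous one would make it explicit which component changed between a long uptime and a short one.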
Created attachment 141724 [details]
tarball containing kern.log, syslog, and dmesg output
This last stability test lasted only 17.5 hours before the lockup. See the latest attached tarball for the relevant log files (which capture everything during this short up time) and dmesg output. As far as I can tell there is nothing in those log files relevant to the lockup, e.g., no burst of null ASCII characters like what occurred in the log files for the previous experiment. There are some segfaults associated with a cron task I have configured every morning starting at 4:32, but those always occur for that task (which is a complete build and test of CMake), so I don't think they are relevant.
The actual lockup today happened with one inactive desktop running on the X-terminal and one active desktop running on the new box. (I was editing a file with Emacs.) Also, the symptoms of this lockup were more severe, i.e., ping did not work from the X-terminal to the new box. But as always there was no way to shut down the new box properly, so I had to do that with the reset button.
Since I bought the new box in May, remote access from an X-terminal has only locked up twice (one of those detailed here), and after a relatively long period of time. So tests where the X-terminal is the only way to access the new box seem in general much more stable than direct use (as in the present case, with such a short time before the lockup). And I haven't tried sole use of the X-terminal for a while now, and that may be completely stable with the new kernel. So my conclusion remains that the problem is associated with the Debian Buster graphics stack (and likely also the very latest graphics stack, if someone will do some up time tests for modern AMD graphics cards for that stack) used to display and control the RX 550 card on the new box.
I have now started a new test (as of 9:08:19 today) with all graphics stack versions and kernel parameters the same as for the previous test, in hopes that when the inevitable lockup comes the log files will be more informative.
Please let me know if you have some other experiment you would like me to try.
Created attachment 141872 [details]
tarball containing daemon.log, messages, kern.log, syslog, and dmesg output
The previously described uptime test lasted (until the lockup this morning) for
9+ days, but the log files included nothing that seemed relevant. The next
uptime test that started this morning for exactly the same graphics stack and
kernel parameters lasted only 7 hours until a lockup, and this time the
(attached) log files caught substantial error messages before the crash.
@Michel Dänzer: Could you please take a look at this one to see whether there
is some clue in the kernel error messages concerning the source of this
instability?