From mboxrd@z Thu Jan 1 00:00:00 1970
From: bugzilla-daemon@freedesktop.org
Subject: [Bug 100306] System randomly freezes or crashes to the login screen,
glitches until rebooted
Date: Wed, 22 Mar 2017 02:16:27 +0000
Message-ID:
What | Removed | Added |
---|---|---|
Version | unspecified | 17.0 |
QA Contact | xorg-team@lists.x.org | dri-devel@lists.freedesktop.org |
Assignee | xorg-driver-ati@lists.x.org | dri-devel@lists.freedesktop.org |
Product | xorg | Mesa |
Component | Driver/Radeon | Drivers/Gallium/radeonsi |
Looks like GPU lockups, which are most likely due to an issue in Mesa/LLVM/kernel. Reassigning to Mesa for now.
What | Removed | Added |
---|---|---|
Status | NEW | RESOLVED |
Resolution | --- | FIXED |
I'm happy to announce that since xorg-x11-server 1.19.3 and/or xf86-video-ati 7.9.0 the issue appears to have been remedied. I now have over 5 days of uptime, with not a single GPU freeze caused by the desktop. If by chance the problem returns, I'll definitely post an update and let everyone know. It would be very appreciated if in the future, the Xorg team and driver developers could please consider more in-depth testing for GPU lockups, to prevent this sort of thing from repeating.
I imagine this issue can be marked as resolved. I will reopen it if I see the problem happening again. Feel free to otherwise reopen it if anyone believes there's still something to investigate.
What | Removed | Added |
---|---|---|
Status | RESOLVED | REOPENED |
Resolution | FIXED | --- |
To my stupefaction, the exact same issue has been re-introduced and is happening again. After another openSUSE Tumbleweed snapshot and several package updates, I believe approximately 2 days ago, the freeze occurs once more. I can however confirm that for approximately 8 days before that, the problem went away as I had one week of uninterrupted uptime. Can someone please analyze what changes fit this time pattern?
(In reply to MirceaKitsune from comment #11)
> I can however confirm that for approximately 8 days before that, the
> problem went away as I had one week of uninterrupted uptime. Can someone
> please analyze what changes fit this time pattern?
Not without more information about what changed on your system when the problem apparently stopped and started again (or alternatively, a magic crystal ball ;).
(In reply to Michel Dänzer from comment #12)
Like I said, I use openSUSE Tumbleweed and its latest official system packages. openSUSE has a lot of useful web tools, so I assume a package history should exist somewhere? If not, can anyone tell me how to dump my package installation history over the last month from zypper, so I can post a new log?
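On the question of dumping a month of package history: a minimal sketch, assuming libzypp keeps its transaction log at /var/log/zypp/history as pipe-separated lines that begin with a timestamp, an action, a package name, and a version. The path, the field order, and the helper names `recent_package_changes` and `read_history` are assumptions for illustration, not something confirmed in this thread.

```python
from datetime import datetime


def recent_package_changes(lines, since):
    """Yield (timestamp, action, package, version) for install/remove
    entries dated at or after `since`.

    Assumed line layout (not confirmed here):
    YYYY-MM-DD HH:MM:SS|action|package|version|...
    """
    for line in lines:
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("|")
        if len(fields) < 4:
            continue
        action = fields[1].strip()
        if action not in ("install", "remove"):
            continue
        try:
            stamp = datetime.strptime(fields[0], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            # Not a dated entry in the assumed layout; ignore it.
            continue
        if stamp >= since:
            yield stamp, action, fields[2], fields[3]


def read_history(since, path="/var/log/zypp/history"):
    """Read the (assumed) libzypp history file and return matching entries."""
    with open(path, encoding="utf-8", errors="replace") as log:
        return list(recent_package_changes(log, since))
```

Calling `read_history(datetime.now() - timedelta(days=30))` (with `timedelta` imported from `datetime`) would list the last month of installs and removals, which could then be lined up against the dates the freeze went away and came back.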
This is the upstream bug tracker. For openSUSE related help, talk to openSUSE folks.
(In reply to Michel Dänzer from comment #14)
Yes, but it's due to an issue in one of the system components. I'm talking about it there for the packaging related aspect, but figured this is helpful for debugging the issue itself.
KDE is a slow and buggy desktop, and it does not even have freely configurable menus like the Xfce Whisker menu. Test with Xfce and lightdm.
Very important note: While investigating a completely unrelated bug, I remembered that KWin was configured to use egl over glx on my machine. I believe glx is the old render architecture, whereas egl is the new renderer which uses OpenGL and is visibly a lot faster.
Considering that desktop activity was always the cause of the freezes in some form, I have a strong suspicion that this might have something to do with the GPU freezes as well! It's a very likely candidate because egl involves experimental OpenGL rendering, and since it's not enabled by default that would also explain why few other people are able to reproduce the GPU lockups. I have just now switched back to glx and therefore haven't had the time to confirm this, but I'm willing to bet it just might make the problem go away... I will immediately post an update if and when I'm proven wrong obviously.
If anyone else wants to test this theory and help in reproducing the system freeze, consider switching to egl rendering. Obviously this means your machine might start freezing as well, so only do this if there's no risk of data loss or major annoyance. The switch is done by opening ~/.config/kwinrc in a text editor and changing:
GLPlatformInterface=egl
to
GLPlatformInterface=glx
Note that if you're using an Aurorae theme, you might experience the bug I mentioned above, which involves KWin no longer rendering window decorations. Here are the reports for that in case anyone is curious:
https://bugzilla.opensuse.org/show_bug.cgi?id=1033598
https://bugs.kde.org/show_bug.cgi?id=378663
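For anyone who would rather script the kwinrc change than edit it by hand, here is a rough sketch. It only rewrites an existing `GLPlatformInterface=` line in the text it is given; the helper name `set_gl_platform_interface` is hypothetical, and the sketch assumes the key is already present in the file.

```python
import re


def set_gl_platform_interface(config_text, backend):
    """Return config_text with GLPlatformInterface set to `backend`.

    Only an existing GLPlatformInterface line is rewritten; everything
    else in the file content is left untouched.
    """
    if backend not in ("glx", "egl"):
        raise ValueError("backend must be 'glx' or 'egl'")
    pattern = re.compile(r"^GLPlatformInterface=.*$", re.MULTILINE)
    new_text, count = pattern.subn("GLPlatformInterface=" + backend, config_text)
    if count == 0:
        # The key is absent; adding it would require knowing which
        # config group it belongs to, which this sketch does not assume.
        raise KeyError("GLPlatformInterface not found in config text")
    return new_text
```

To apply it, read ~/.config/kwinrc into a string, keep a backup copy, pass the text through the function, and write the result back. KWin most likely has to be restarted (or the session logged out and in) before the new value is picked up.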
Nope, still happens in glx as well. Switching away from egl rendering does not make the freezes go away.
llvm 4.0.0 is now in openSUSE Tumbleweed: I have performed a 'zypper dup', installed it, and restarted. Now it's time to see if this really makes the freeze go away.
Please allow me to keep this issue open for another month to make sure it's truly gone: If through some stretch of imagination I see the problem again, I'll immediately post a reply here and let everyone know! I need to take at least several days to have no doubt it has been solved, due to how far its probability seems to have ranged. Thank you.
The same freeze happened again today, after a two week period of not seeing the problem any more. This is reaching the point where it's becoming outright ridiculous: I've used openSUSE for years, and have never seen such a thing spanning over such a long time period.
Unfortunately I can't risk breaking my system by installing custom versions of llvm or the system wide Mesa. I can only hope the developers know what to bisect to find out where this fault lies, based on the logs I've attached here. Let me know if there is more info I could somehow help with.
What | Removed | Added |
---|---|---|
Priority | high | highest |
Severity | major | critical |
Once again, I'm dealing with at least one system crash per day. The latest one happens even after upgrading to the 4.11.0 Kernel, meaning the error was ported to it as well.
Why are the developers so incapable this time? This has been happening for nearly 3 darn months! The problem has been fixed twice, and each time it's returned after over 2 weeks. Can someone please explain what the heck is going on here? At this stage, it feels like someone is actively developing and updating this freeze against the latest system components... I don't even see how it could survive through this many package updates by chance alone, it is ridiculous.
And please don't lecture me on how this is free software, and I should only be complaining if I was actually paying the developers: There is a limit beyond which an important piece of software, be it a free Linux distribution or component, can break and stay broken. To literally be unable to keep a system running without the image suddenly freezing and the monitor shutting down every single day, for over a quarter of a year... that goes far beyond that limit.
I'm sorry for the outburst, but at this stage I think something needed to be said. I did not expect something like this to get dragged so far, and that I'd be unable to keep my system running for months. I'm going to bump the severity of this issue again, in hopes that someone can please take a look at it so I can run my system normally again! Thank you.
What | Removed | Added |
---|---|---|
CC | | daniel@fooishbar.org |
(In reply to MirceaKitsune from comment #21)
> Why are the developers so incapable this time? This has been happening for
> nearly 3 darn months! The problem has been fixed twice, and each time it's
> returned after over 2 weeks. Can someone please explain what the heck is
> going on here? At this stage, it feels like someone is actively developing
> and updating this freeze against the latest system components... I don't
> even see how it could survive through this many package updates by chance
> alone, it is ridiculous.
You may be surprised to hear that the code is not an explicit 'if (rand()) hang_gpu();' call.
> And please don't lecture me on how this is free software, and I should only
> be complaining if I was actually paying the developers:
Regardless, insulting people is rarely the best way to get them to do things. Everyone has more problems to solve than hours in the day, and the guy calling people 'incapable' does not tend to work his way to the top of the list.
> I'm sorry for the outburst, but at this stage I think something needed to be
> said.
No, it really did not. I understand your frustration, and I'm sorry to hear it, though as the Bugzilla footer notes, the freedesktop.org Code of Conduct applies here: https://www.freedesktop.org/wiki/CodeOfConduct/
If you cannot keep your behaviour civil in future, your access to this bug tracker will be revoked. Thanks for your understanding, and I do hope your bug gets resolved.
(In reply to Daniel Stone from comment #22)
> (In reply to MirceaKitsune from comment #21)
>
> You may be surprised to hear that the code is not an explicit 'if (rand())
> hang_gpu();' call.
>
> Regardless, insulting people is rarely the best way to get them to do
> things. Everyone has more problems to solve than hours in the day, and the
> guy calling people 'incapable' does not tend to work his way to the top of
> the list.
>
> No, it really did not. I understand your frustration, and I'm sorry to hear
> it, though as the Bugzilla footer notes, the freedesktop.org Code of Conduct
> applies here:
> https://www.freedesktop.org/wiki/CodeOfConduct/
>
> If you cannot keep your behaviour civil in future, your access to this bug
> tracker will be revoked. Thanks for your understanding, and I do hope your
> bug gets resolved.
Alright. Can't argue that and I'll keep your words in mind... also I wasn't looking to insult anyone per se. Please understand this sort of thing is something I've never dealt with in nearly 5 years of using Linux, and I understand neither how it happens nor why nothing is being done: Not only is it the most severe type of issue, and that it's been there for months, but it's literally coming and going on a weekly schedule. I'd care less if it was a minor issue, but I literally can't perform daily activities properly because my monitor simply shuts itself down randomly for no reason! I don't have Windows and wouldn't want to use it again, nor can I downgrade to a version as old as before the time I can assume the issue started... what am I to do?
I'll try to be more calm, but I believe this sort of thing needs to have some solution. There are millions of people using these drivers, I don't think it's a normal situation that they can't use their machine safely and no one even knows what's triggering it... if this happened in Windows it would probably make headlines.
Anyway I don't want to take this off-topic. I'll wait for more answers and post new updates as I see changes. Sorry again for earlier.
More info after the latest crash: In ~/.config/kwinrc I tried setting GLCore=false and GLColorCorrection=false, however neither of these seems to affect the problem although I found them suspicious. I wonder if other such settings, such as vsync (tearing prevention) could make a difference... there are many combinations to try, and I'm not sure which also affect the system when compositing is off.
When the system doesn't completely freeze, the behavior of the problem can be very strange at times; Last night after the desktop froze, I quickly hit Control + Alt + Backspace several times to kill X11... the system went into a console, which started flashing continuously together with the HDD led. I couldn't do anything on my keyboard and mouse... but once I pressed the power button it stopped and the machine shut itself down quickly and cleanly, meaning it still received the power off signal and managed to recover from that.
Can you narrow down what component update caused the issue?
(In reply to Alex Deucher from comment #25)
> Can you narrow down what component update caused the issue?
openSUSE Tumbleweed upgraded from Mesa 17.0.5 to 17.1.0 yesterday. I'm still in the process of seeing whether this affects the issue and how... so far no freezes, and over 1 day of uptime which should at least mean it's rarer.
I do not believe it's the Kernel: The same freeze started with Kernel 4.9.0, persisted in 4.10.0, and even after 4.11.0 it happened again. A kernel issue would have been impossible not to notice during months of development.
Other components aren't updated as frequently or as easy for me to track, and with the issue happening once every day there's no way to test it on demand. As such I don't currently have more info on what the source could possibly be.
In weekly news, it appears a recent openSUSE Tumbleweed snapshot (which among other changes upgraded Mesa 17.0.5 to 17.1.0) made the issue a lot rarer for the time being: Until this snapshot it was even happening twice a day, hence why it was starting to drive me nuts... now I only seem to get this freeze every 2-3 days of uptime, which is so much more bearable! I worry whether the next snapshot will make the issue more frequent again rather than better... clearly it's one of the important system packages that's messing with it, but I still have no idea which and how.
After another two weeks of absence, the issue was apparently reimplemented on top of Kernel 4.11.3 + Mesa 17.1.1 + Plasma 5.10.0, likely sometime during the last few days. The behavior is once again identical, with alt-tab switching or desktop effects causing everything but the mouse pointer to freeze, then after 10 seconds the monitor shuts down. Other unrelated GPU crashes (such as those caused by some games) behave by the classic model, where the entire system simply freezes in place at once... that's a very different result from this freeze, and likely confirms this is a different type of crash.
At this point I have almost no doubt this is an attack that's being deliberately programmed, and manually reimplemented on top of new drivers once it gets fixed. The cycle seems to be that a kernel or driver update resolves the issue, then the creators of the crash require about two weeks to patch it and reimplement the exact same functionality. This is the 4th time the story repeats.
I tried steering away from this possibility until I was sure, as I didn't want it triggering any unnecessary arguments... if this is an attack then investigating it as such might help in finding its source more quickly. There's simply no way something this precise could happen by itself for nearly half a year, always coming back with the exact same effects after a period of absence... all despite radical changes to nearly every driver and system component, which would have no doubt altered the behavior of the initial problem in some form. Therefore I hope everyone can see why I'm now going with this theory and greatly considering the option of malicious intent.
I have no idea how the virus (?) could be updated on my computer, as it's likely not through the package update system directly. However I suspect it's using a constant series of vulnerabilities in one or more system components, which should be fixed by the developers if they exist. I would appreciate any ideas on both how the malicious code might be inserted into the computer, as well as finding the vulnerability within radeon / Mesa / X11 / etc that it exploits. Please let me know what your thoughts are!
An update: The latest reimplementation of the freeze appears to be worse than all others. My system can now be taken down after only 4 hours of uptime! This is a huge difference from all previous versions of the crash, which required that the system had at least been running for a day before it could be crashed.
Created attachment 132030 [details]
Photo of the corrupt image on the screen
I have discovered some very important details today. Everyone following up on the report, please see this comment!
Recently I realized that a useful test would be to jump into a different run level once I notice the crash, in order to see how the system behaves there. A few minutes ago another freeze took place, so I instantly hit Control + Alt + F1 to go to a console. What I noticed was pretty remarkable and sheds light on a few aspects:
I could keep typing in the console for nearly 10 seconds, but after that the exact same behavior still took place (monitor turned itself on and off two times then the image froze). This time however I was able to toggle the Num Lock led a minute after the crash, while also seeing the HDD led still working; That means this is not (always) a total system freeze such as a Kernel panic... instead it appears to be the image output corrupting and staying that way, freezing only specific components with it (I was still unable to issue a blind reboot command for instance). To put everything into an approximate timeline, this is what happened:
00 seconds in: The crash occurs.
02 seconds in: I notice and instantly hit Control + Alt + F1.
05 seconds in: I'm taken to a console where everything works fine: I see the blinking cursor, can write my login and password, etc.
12 seconds in: Suddenly the monitor turns off and back on several times, then the image remains frozen in place.
This time however, the screen did not remain turned off or black. Instead it stayed stuck in a state showing corrupt lines and rectangles of random colors. I took a photo of my screen with my smartphone, which I attached to this issue.
Created attachment 132448 [details]
Screenshot of "top"
Lots of important new information on this freeze, which was of course ported to the latest openSUSE Tumbleweed system packages and still works:
First and foremost, the problem does not happen in every session, and this is not always influenced by updates! During an interval in which I installed absolutely no relevant package changes, the following has happened: The freeze occurred after about just 8 hours of uptime... after that I restarted the machine, but then I had 4 days of uptime with no freeze! This leads me to believe that certain applications or system actions prepare the system with a "time bomb", which then causes switching between windows or desktops to produce the freeze... however I have no way to know what mines the system and what doesn't yet, as I use too many applications at once to figure out which might be responsible.
Anyway another crash happened today. Once more I quickly hit Control + Alt + F1 to switch to a different runlevel; This caused the image to become corrupted on the monitor, however the system remained responsive and didn't actually freeze. So I went to my mother's computer and logged in via SSH, which indeed still worked. I was able to issue a reboot command, which caused the image to briefly unfreeze as the monitor turned on and off a few more times... I could see a few KDE error messages about applications crashing, before the system actually went ahead and rebooted successfully! However this is only possible if I switch to a console quickly enough when noticing the freeze start to happen, if not the whole machine freezes and not even SSH responds from other devices!
While I was in SSH, I decided to run "top" and take a screenshot of my processes (while the computer was frozen and with corrupt image stuck on the screen). I can't tell if anything is out of the ordinary such as a memory leak, but I'm attaching a screenshot of it here.