dri-devel.lists.freedesktop.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] drm/atomic: Convert a warning to dbg atomic printk
@ 2022-04-04 18:21 Zack Rusin
  2022-04-05 17:09 ` Michel Dänzer
  0 siblings, 1 reply; 6+ messages in thread
From: Zack Rusin @ 2022-04-04 18:21 UTC (permalink / raw)
  To: dri-devel
  Cc: Thomas Zimmermann, David Airlie, krastevm, mdaenzer, mombasawalam

From: Zack Rusin <zackr@vmware.com>

By default each flip times out after 0.1 sec and a warning about the
time out is added to the kernel log. The warning is harmless because
there's another flip coming right after but it can quickly fill up
the log, e.g. missing 2 flips every second over a 24 hour span will
add about 172 thousand lines to the log.
Skip printing it by default and instead add it to the atomic drm debug
category. It can be trivially enabled via drm.debug=0x10 and it avoids
filling up logs by default.

This is particularly important on vm's where presentation might be CPU
bound and running multiple vm's or under a CPU heavy load can easily
lead to missing mutliple flips and any actual errors are often blamed
on this due to it being so prevalent in the logs.
Fixes a bunch of bugs reported to RedHat (1792515, 1884401, 1896139
and 1884401) and SUSE (210123) where people complain about having
their logs spammed by this message and blame various errors on it.

Signed-off-by: Zack Rusin <zackr@vmware.com>
---
 drivers/gpu/drm/drm_atomic_helper.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/drm_atomic_helper.c b/drivers/gpu/drm/drm_atomic_helper.c
index 9603193d2fa1..e88b4d83fd5c 100644
--- a/drivers/gpu/drm/drm_atomic_helper.c
+++ b/drivers/gpu/drm/drm_atomic_helper.c
@@ -1566,8 +1566,8 @@ void drm_atomic_helper_wait_for_flip_done(struct drm_device *dev,
 
 		ret = wait_for_completion_timeout(&commit->flip_done, 10 * HZ);
 		if (ret == 0)
-			drm_err(dev, "[CRTC:%d:%s] flip_done timed out\n",
-				crtc->base.id, crtc->name);
+			drm_dbg_atomic(dev, "[CRTC:%d:%s] flip_done timed out\n",
+				       crtc->base.id, crtc->name);
 	}
 
 	if (old_state->fake_commit)
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] drm/atomic: Convert a warning to dbg atomic printk
  2022-04-04 18:21 [PATCH] drm/atomic: Convert a warning to dbg atomic printk Zack Rusin
@ 2022-04-05 17:09 ` Michel Dänzer
  2022-04-05 22:47   ` Zack Rusin
  0 siblings, 1 reply; 6+ messages in thread
From: Michel Dänzer @ 2022-04-05 17:09 UTC (permalink / raw)
  To: Zack Rusin
  Cc: David Airlie, krastevm, dri-devel, mombasawalam, Thomas Zimmermann

On 2022-04-04 20:21, Zack Rusin wrote:
> From: Zack Rusin <zackr@vmware.com>
> 
> By default each flip times out after 0.1 sec

10 * HZ is 10 seconds, not 0.1.

> and a warning about the time out is added to the kernel log. The warning is
> harmless because there's another flip coming right after but it can quickly fill up
> the log, e.g. missing 2 flips every second over a 24 hour span will
> add about 172 thousand lines to the log.

As we discussed before, while this might be true for the vmwgfx driver, for other drivers this message indicates that either the GPU hung, or something else went wrong spectacularly. As such, I think we do want to see these messages by default for other drivers at least.

I suspect this just papers over the real issue even with vmwgfx though.


-- 
Earthling Michel Dänzer            |                  https://redhat.com
Libre software enthusiast          |         Mesa and Xwayland developer

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] drm/atomic: Convert a warning to dbg atomic printk
  2022-04-05 17:09 ` Michel Dänzer
@ 2022-04-05 22:47   ` Zack Rusin
  2022-04-06  8:30     ` Michel Dänzer
  0 siblings, 1 reply; 6+ messages in thread
From: Zack Rusin @ 2022-04-05 22:47 UTC (permalink / raw)
  To: michel.daenzer
  Cc: airlied, Martin Krastev, Maaz Mombasawala, dri-devel, tzimmermann

On Tue, 2022-04-05 at 19:09 +0200, Michel Dänzer wrote:
> ⚠ External Email
> 
> On 2022-04-04 20:21, Zack Rusin wrote:
> > From: Zack Rusin <zackr@vmware.com>
> > 
> > By default each flip times out after 0.1 sec
> 
> 10 * HZ is 10 seconds, not 0.1.

Yea, sorry, this entire commit message is not correct. I've sent out a
very old diff with a log the best I could remember. I recall our
conversation now and iirc we said that maybe an interface through drm
atomic code to enable/disable this error is the way to go but after
looking at this again I'm not sure. More below.

> > and a warning about the time out is added to the kernel log. The
> > warning is
> > harmless because there's another flip coming right after but it can
> > quickly fill up
> > the log, e.g. missing 2 flips every second over a 24 hour span will
> > add about 172 thousand lines to the log.
> 
> As we discussed before, while this might be true for the vmwgfx driver,
> for other drivers this message indicates that either the GPU hung, or
> something else went wrong spectacularly. As such, I think we do want to
> see these messages by default for other drivers at least.

I'm not going to argue for or against that but I am curious what's the
point of the message. The message is basically saying "something could
possibly have went very wrong". OK, what's next? Especially if there's
no visible problems and it's not reproducible. Even if it would be
reproducible there's nothing actionable from the message itself. If the
system has no output connected and no users are currently logged in and
we missed a flip, does it matter?

> I suspect this just papers over the real issue even with vmwgfx though.

This goes back to the above. I think you, me and Dave looked at the
logs from those bugs and none of us seem to know what do do about it.
Lacking some other error messages there seems to be nothing this error
adds with vmwgfx or without (we have been adding more logging to vmwgfx
so hopefully with newer kernels we could get some actionable errors but
that's orthogonal to this).

z


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] drm/atomic: Convert a warning to dbg atomic printk
  2022-04-05 22:47   ` Zack Rusin
@ 2022-04-06  8:30     ` Michel Dänzer
  2022-04-06 14:10       ` Zack Rusin
  0 siblings, 1 reply; 6+ messages in thread
From: Michel Dänzer @ 2022-04-06  8:30 UTC (permalink / raw)
  To: Zack Rusin
  Cc: airlied, Martin Krastev, dri-devel, Maaz Mombasawala, tzimmermann

On 2022-04-06 00:47, Zack Rusin wrote:
> On Tue, 2022-04-05 at 19:09 +0200, Michel Dänzer wrote:
>> On 2022-04-04 20:21, Zack Rusin wrote:
>>> From: Zack Rusin <zackr@vmware.com>
>>>
>>> By default each flip times out after 0.1 sec
>>
>> 10 * HZ is 10 seconds, not 0.1.
> 
> Yea, sorry, this entire commit message is not correct. I've sent out a
> very old diff with a log the best I could remember. I recall our
> conversation now and iirc we said that maybe an interface through drm
> atomic code to enable/disable this error is the way to go but after
> looking at this again I'm not sure. More below.
> 
>>> and a warning about the time out is added to the kernel log. The
>>> warning is
>>> harmless because there's another flip coming right after but it can
>>> quickly fill up
>>> the log, e.g. missing 2 flips every second over a 24 hour span will
>>> add about 172 thousand lines to the log.
>>
>> As we discussed before, while this might be true for the vmwgfx driver,
>> for other drivers this message indicates that either the GPU hung, or
>> something else went wrong spectacularly. As such, I think we do want to
>> see these messages by default for other drivers at least.
> 
> I'm not going to argue for or against that but I am curious what's the
> point of the message. The message is basically saying "something could
> possibly have went very wrong". OK, what's next? Especially if there's
> no visible problems and it's not reproducible. Even if it would be
> reproducible there's nothing actionable from the message itself. If the
> system has no output connected and no users are currently logged in and
> we missed a flip, does it matter?

I don't think waiting for a 10 second timeout is the appropriate behaviour in that case. While a KMS CRTC is enabled, the driver needs to make it work, in the worst case via a timer which ticks at the CRTC refresh rate.


>> I suspect this just papers over the real issue even with vmwgfx though.
> 
> This goes back to the above. I think you, me and Dave looked at the
> logs from those bugs and none of us seem to know what do do about it.
> Lacking some other error messages there seems to be nothing this error
> adds with vmwgfx or without (we have been adding more logging to vmwgfx
> so hopefully with newer kernels we could get some actionable errors but
> that's orthogonal to this).

The error means that either:

* A flip actually didn't complete in 10 seconds.
* There's some kind of time tracking issue which results in the timer firing after less than 10 seconds (of the system actually running).

Either way, it's an issue which should be fixed rather than just swept under the rug.


-- 
Earthling Michel Dänzer            |                  https://redhat.com
Libre software enthusiast          |         Mesa and Xwayland developer

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] drm/atomic: Convert a warning to dbg atomic printk
  2022-04-06  8:30     ` Michel Dänzer
@ 2022-04-06 14:10       ` Zack Rusin
  2022-04-07  8:34         ` Michel Dänzer
  0 siblings, 1 reply; 6+ messages in thread
From: Zack Rusin @ 2022-04-06 14:10 UTC (permalink / raw)
  To: michel.daenzer
  Cc: airlied, Martin Krastev, Maaz Mombasawala, dri-devel, tzimmermann

On Wed, 2022-04-06 at 10:30 +0200, Michel Dänzer wrote:
> ⚠ External Email
> 
> On 2022-04-06 00:47, Zack Rusin wrote:
> > On Tue, 2022-04-05 at 19:09 +0200, Michel Dänzer wrote:
> > > On 2022-04-04 20:21, Zack Rusin wrote:
> > > > From: Zack Rusin <zackr@vmware.com>
> > > > 
> > > > By default each flip times out after 0.1 sec
> > > 
> > > 10 * HZ is 10 seconds, not 0.1.
> > 
> > Yea, sorry, this entire commit message is not correct. I've sent
> > out a
> > very old diff with a log the best I could remember. I recall our
> > conversation now and iirc we said that maybe an interface through
> > drm
> > atomic code to enable/disable this error is the way to go but after
> > looking at this again I'm not sure. More below.
> > 
> > > > and a warning about the time out is added to the kernel log.
> > > > The
> > > > warning is
> > > > harmless because there's another flip coming right after but it
> > > > can
> > > > quickly fill up
> > > > the log, e.g. missing 2 flips every second over a 24 hour span
> > > > will
> > > > add about 172 thousand lines to the log.
> > > 
> > > As we discussed before, while this might be true for the vmwgfx
> > > driver,
> > > for other drivers this message indicates that either the GPU
> > > hung, or
> > > something else went wrong spectacularly. As such, I think we do
> > > want to
> > > see these messages by default for other drivers at least.
> > 
> > I'm not going to argue for or against that but I am curious what's
> > the
> > point of the message. The message is basically saying "something
> > could
> > possibly have went very wrong". OK, what's next? Especially if
> > there's
> > no visible problems and it's not reproducible. Even if it would be
> > reproducible there's nothing actionable from the message itself. If
> > the
> > system has no output connected and no users are currently logged in
> > and
> > we missed a flip, does it matter?
> 
> I don't think waiting for a 10 second timeout is the appropriate
> behaviour in that case. While a KMS CRTC is enabled, the driver needs
> to make it work, in the worst case via a timer which ticks at the
> CRTC refresh rate.

What if it isn't a driver bug and it just so happens that vmwgfx is the
only one that's running on some specific system without a display long
enough? Without being able to reproduce how would you know?

> 
> The error means that either:
> 
> * A flip actually didn't complete in 10 seconds.
> * There's some kind of time tracking issue which results in the timer
> firing after less than 10 seconds (of the system actually running).
> 
> Either way, it's an issue which should be fixed rather than just
> swept under the rug.

Without being able to reproduce or really confirm that it's an actual
problem in vmwgfx I'm going to leave this to you then. I'd be happy to
look at it again or help out in any way I can if you find any problems
we can do anything about.

z


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] drm/atomic: Convert a warning to dbg atomic printk
  2022-04-06 14:10       ` Zack Rusin
@ 2022-04-07  8:34         ` Michel Dänzer
  0 siblings, 0 replies; 6+ messages in thread
From: Michel Dänzer @ 2022-04-07  8:34 UTC (permalink / raw)
  To: Zack Rusin
  Cc: airlied, Martin Krastev, dri-devel, Maaz Mombasawala, tzimmermann

On 2022-04-06 16:10, Zack Rusin wrote:
> On Wed, 2022-04-06 at 10:30 +0200, Michel Dänzer wrote:
>> On 2022-04-06 00:47, Zack Rusin wrote:
>>> On Tue, 2022-04-05 at 19:09 +0200, Michel Dänzer wrote:
>>>> On 2022-04-04 20:21, Zack Rusin wrote:
>>>>> From: Zack Rusin <zackr@vmware.com>
>>>>>
>>>>> By default each flip times out after 0.1 sec
>>>>
>>>> 10 * HZ is 10 seconds, not 0.1.
>>>
>>> Yea, sorry, this entire commit message is not correct. I've sent
>>> out a
>>> very old diff with a log the best I could remember. I recall our
>>> conversation now and iirc we said that maybe an interface through
>>> drm
>>> atomic code to enable/disable this error is the way to go but after
>>> looking at this again I'm not sure. More below.
>>>
>>>>> and a warning about the time out is added to the kernel log.
>>>>> The
>>>>> warning is
>>>>> harmless because there's another flip coming right after but it
>>>>> can
>>>>> quickly fill up
>>>>> the log, e.g. missing 2 flips every second over a 24 hour span
>>>>> will
>>>>> add about 172 thousand lines to the log.
>>>>
>>>> As we discussed before, while this might be true for the vmwgfx
>>>> driver,
>>>> for other drivers this message indicates that either the GPU
>>>> hung, or
>>>> something else went wrong spectacularly. As such, I think we do
>>>> want to
>>>> see these messages by default for other drivers at least.
>>>
>>> I'm not going to argue for or against that but I am curious what's
>>> the
>>> point of the message. The message is basically saying "something
>>> could
>>> possibly have went very wrong". OK, what's next? Especially if
>>> there's
>>> no visible problems and it's not reproducible. Even if it would be
>>> reproducible there's nothing actionable from the message itself. If
>>> the
>>> system has no output connected and no users are currently logged in
>>> and
>>> we missed a flip, does it matter?
>>
>> I don't think waiting for a 10 second timeout is the appropriate
>> behaviour in that case. While a KMS CRTC is enabled, the driver needs
>> to make it work, in the worst case via a timer which ticks at the
>> CRTC refresh rate.
> 
> What if it isn't a driver bug

Does it make a difference? It's still a bug.

> and it just so happens that vmwgfx is the only one that's running on some specific system
> without a display long enough?

FWIW, other drivers have hit this same error before as well, but not just because no outputs were enabled. If it's really related to that, a vmwgfx bug seems most likely.


>> The error means that either:
>>
>> * A flip actually didn't complete in 10 seconds.
>> * There's some kind of time tracking issue which results in the timer
>> firing after less than 10 seconds (of the system actually running).
>>
>> Either way, it's an issue which should be fixed rather than just
>> swept under the rug.
> 
> Without being able to reproduce or really confirm that it's an actual
> problem in vmwgfx I'm going to leave this to you then.

Nice try. :) This is for vmwgfx developers.


-- 
Earthling Michel Dänzer            |                  https://redhat.com
Libre software enthusiast          |         Mesa and Xwayland developer

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2022-04-07  8:34 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-04 18:21 [PATCH] drm/atomic: Convert a warning to dbg atomic printk Zack Rusin
2022-04-05 17:09 ` Michel Dänzer
2022-04-05 22:47   ` Zack Rusin
2022-04-06  8:30     ` Michel Dänzer
2022-04-06 14:10       ` Zack Rusin
2022-04-07  8:34         ` Michel Dänzer

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).