linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* regression with mainline kernel
@ 2021-11-11 14:03 Sudip Mukherjee
  2021-11-11 20:51 ` Sudip Mukherjee
  0 siblings, 1 reply; 6+ messages in thread
From: Sudip Mukherjee @ 2021-11-11 14:03 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Greg Kroah-Hartman, Stable, linux-kernel

Hi Linus,

My testing has been failing for the last few days. Last good test was
with 6f2b76a4a384 and I started seeing the failure with ce840177930f5
where boot timeout.

Last good test - https://openqa.qa.codethink.co.uk/tests/323
Failing test - https://openqa.qa.codethink.co.uk/tests/335

Saw a similar issue with 5.10.79-rc1 today and bisect showed the
problem with 8615ff6dd1ac but that was already in the last good test I
had.
I will do a bisect tonight and let you know the result.


-- 
Regards
Sudip

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: regression with mainline kernel
  2021-11-11 14:03 regression with mainline kernel Sudip Mukherjee
@ 2021-11-11 20:51 ` Sudip Mukherjee
  2021-11-13 11:06   ` Sudip Mukherjee
  2021-11-13 17:06   ` Linus Torvalds
  0 siblings, 2 replies; 6+ messages in thread
From: Sudip Mukherjee @ 2021-11-11 20:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, Gurchetan Singh, Nicholas Verne, Gerd Hoffmann,
	dri-devel, open list:VIRTIO GPU DRIVER, Daniel Vetter

Hi Linus,

On Thu, Nov 11, 2021 at 2:03 PM Sudip Mukherjee
<sudipm.mukherjee@gmail.com> wrote:
>
> Hi Linus,
>
> My testing has been failing for the last few days. Last good test was
> with 6f2b76a4a384 and I started seeing the failure with ce840177930f5
> where boot timeout.
>
> Last good test - https://openqa.qa.codethink.co.uk/tests/323
> Failing test - https://openqa.qa.codethink.co.uk/tests/335
>
> Saw a similar issue with 5.10.79-rc1 today and bisect showed the
> problem with 8615ff6dd1ac but that was already in the last good test I
> had.

Did a bisect and this is a separate issue than the one we saw in 5.10.79-rc1.

The bisect log:
# bad: [ce840177930f591a181f55515fc6ac9e1f56b84a] Merge tag
'defconfig-5.16' of
git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
# good: [6f2b76a4a384e05ac8d3349831f29dff5de1e1e2] Merge tag
'Smack-for-5.16' of https://github.com/cschaufler/smack-next
git bisect start 'ce840177930f5' '6f2b76a4a384e05ac8d3349831f29dff5de1e1e2'
# good: [a64a325bf6313aa5cde7ecd691927e92892d1b7f] Merge tag
'afs-next-20211102' of
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
git bisect good a64a325bf6313aa5cde7ecd691927e92892d1b7f
# bad: [dcd68326d29b62f3039e4f4d23d3e38f24d37360] Merge tag
'devicetree-for-5.16' of
git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux
git bisect bad dcd68326d29b62f3039e4f4d23d3e38f24d37360
# bad: [c7c774fe09389fc806bbe4b487c18e45f576c1ae] Merge tag
'drm-intel-next-2021-10-04' of
git://anongit.freedesktop.org/drm/drm-intel into drm-next
git bisect bad c7c774fe09389fc806bbe4b487c18e45f576c1ae
# good: [8017ecb11ebbcdfcbdff14c5edbdf1efc14991f4] drm/amd/display:
Added root clock optimization flags
git bisect good 8017ecb11ebbcdfcbdff14c5edbdf1efc14991f4
# good: [8a1ec3f3275479292613273a7be2ac87f2a7f6e6] drm/i915: Remove
DP_PORT_EN stuff from link training code
git bisect good 8a1ec3f3275479292613273a7be2ac87f2a7f6e6
# bad: [9962601ca5719050906915c3c33a63744ac7b15c] drm/bridge:
dw-hdmi-cec: Make use of the helper function
devm_add_action_or_reset()
git bisect bad 9962601ca5719050906915c3c33a63744ac7b15c
# bad: [606b102876e3741851dfb09d53f3ee57f650a52c] drm: fb_helper: fix
CONFIG_FB dependency
git bisect bad 606b102876e3741851dfb09d53f3ee57f650a52c
# good: [c43da06c24a485308e80d709737b446e8cad175d] dt-bindings:
drm/panel: boe-tv101wum-nl6: Support enabling a 3.3V rail
git bisect good c43da06c24a485308e80d709737b446e8cad175d
# good: [8d6b006e1f51c99016aa39ca9e03947cbdd024e3] drm/virtio:
implement context init: handle VIRTGPU_CONTEXT_PARAM_POLL_RINGS_MASK
git bisect good 8d6b006e1f51c99016aa39ca9e03947cbdd024e3
# bad: [d0f5d790ae863079025398015eb59347b01db455] drm/ttm: remove
TTM_PAGE_FLAG_NO_RETRY
git bisect bad d0f5d790ae863079025398015eb59347b01db455
# bad: [f5d28856b89baab4232a9f841e565763fcebcdf9] drm/ttm: stop
calling tt_swapin in vm_access
git bisect bad f5d28856b89baab4232a9f841e565763fcebcdf9
# bad: [78aa20fa4381623cf59a85d053486f98784ca3a0] drm/virtio:
implement context init: advertise feature to userspace
git bisect bad 78aa20fa4381623cf59a85d053486f98784ca3a0
# bad: [cd7f5ca33585918febe5e2f6dc090a21cfa775b0] drm/virtio:
implement context init: add virtio_gpu_fence_event
git bisect bad cd7f5ca33585918febe5e2f6dc090a21cfa775b0
# first bad commit: [cd7f5ca33585918febe5e2f6dc090a21cfa775b0]
drm/virtio: implement context init: add virtio_gpu_fence_event

And, indeed reverting cd7f5ca33585 on top of debe436e77c7 has fixed
the problem I was seeing on my qemu test of x86_64. The qemu image is
based on Ubuntu.

Will be happy to test any fix or more debugging if needed.


-- 
Regards
Sudip

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: regression with mainline kernel
  2021-11-11 20:51 ` Sudip Mukherjee
@ 2021-11-13 11:06   ` Sudip Mukherjee
  2021-11-13 17:06   ` Linus Torvalds
  1 sibling, 0 replies; 6+ messages in thread
From: Sudip Mukherjee @ 2021-11-13 11:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, Gurchetan Singh, Nicholas Verne, Gerd Hoffmann,
	dri-devel, open list:VIRTIO GPU DRIVER, Daniel Vetter

On Thu, Nov 11, 2021 at 8:51 PM Sudip Mukherjee
<sudipm.mukherjee@gmail.com> wrote:
>
> Hi Linus,
>
> On Thu, Nov 11, 2021 at 2:03 PM Sudip Mukherjee
> <sudipm.mukherjee@gmail.com> wrote:
> >
> > Hi Linus,
> >
> > My testing has been failing for the last few days. Last good test was
> > with 6f2b76a4a384 and I started seeing the failure with ce840177930f5
> > where boot timeout.

Last night's test on 66f4beaa6c1d worked fine. So I guess this has now
been fixed.
Thanks.


-- 
Regards
Sudip

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: regression with mainline kernel
  2021-11-11 20:51 ` Sudip Mukherjee
  2021-11-13 11:06   ` Sudip Mukherjee
@ 2021-11-13 17:06   ` Linus Torvalds
  2021-11-13 21:34     ` Sudip Mukherjee
  2021-11-15 14:24     ` Daniel Vetter
  1 sibling, 2 replies; 6+ messages in thread
From: Linus Torvalds @ 2021-11-13 17:06 UTC (permalink / raw)
  To: Sudip Mukherjee
  Cc: linux-kernel, Gurchetan Singh, Nicholas Verne, Gerd Hoffmann,
	dri-devel, open list:VIRTIO GPU DRIVER, Daniel Vetter

[ Hmm. This email was marked as spam for me. I see no obvious reason
for it being marked as spam, but it happens.. ]

On Thu, Nov 11, 2021 at 12:52 PM Sudip Mukherjee
<sudipm.mukherjee@gmail.com> wrote:
>
> # first bad commit: [cd7f5ca33585918febe5e2f6dc090a21cfa775b0]
> drm/virtio: implement context init: add virtio_gpu_fence_event

Hmm. Judging from your automated screenshots, the login never appears.

> And, indeed reverting cd7f5ca33585 on top of debe436e77c7 has fixed
> the problem I was seeing on my qemu test of x86_64. The qemu image is
> based on Ubuntu.

Presumably either that commit is somehow buggy in itself - or it does
exactly what it means to do, and the new poll() semantics just
confuses the heck out of the X server (or wayland or whatever).

And honestly, if I read that thing correctly, the patch is entirely
broken. The new poll function (virtio_gpu_poll()) will unconditionally
remove the first event from the event list, and then report "Yeah, I
had events".

This is completely bogus for a few reasons:

 - poll() really should be idempotent, because the poll function gets
called multiple times

 - it doesn't even seem to check that the event that it removes is the
new VIRTGPU_EVENT_FENCE_SIGNALED_INTERNAL kind of event, so it will
unconditionally just remove random events.

 - it does seem to check the "vfpriv->ring_idx_mask" and do the old
thing if that is zero, but I see absolutely no reason for that (and
that check itself has caused problems, see below)

Honestly, my reaction to this all is that that commit is fundamentally
broken and probably should be reverted regardless as "this commit does
bad things".

HOWEVER - it has had a fix for a NULL pointer dereference in the
meantime - can you check whether the current top of tree happens to
work for you? Maybe your problem isn't due to "that commit does
unnatural things", but simply due to the bug fixed in d89c0c8322ec
("drm/virtio: Fix NULL dereference error in virtio_gpu_poll").

And if it's still broken with that commit, I'll happily revert it and
people need to go back to the drawing board.

In fact, I would really suggest that people look at that
virtio_gpu_poll() function regardless. That odd "let's unconditionally
just drop events in the poll function is really REALLY broken
behavior.

              Linus

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: regression with mainline kernel
  2021-11-13 17:06   ` Linus Torvalds
@ 2021-11-13 21:34     ` Sudip Mukherjee
  2021-11-15 14:24     ` Daniel Vetter
  1 sibling, 0 replies; 6+ messages in thread
From: Sudip Mukherjee @ 2021-11-13 21:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, Gurchetan Singh, Nicholas Verne, Gerd Hoffmann,
	dri-devel, open list:VIRTIO GPU DRIVER, Daniel Vetter

Hi Linus,

On Sat, Nov 13, 2021 at 5:07 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> [ Hmm. This email was marked as spam for me. I see no obvious reason
> for it being marked as spam, but it happens.. ]
>
> On Thu, Nov 11, 2021 at 12:52 PM Sudip Mukherjee
> <sudipm.mukherjee@gmail.com> wrote:
> >
> > # first bad commit: [cd7f5ca33585918febe5e2f6dc090a21cfa775b0]
> > drm/virtio: implement context init: add virtio_gpu_fence_event
>
> Hmm. Judging from your automated screenshots, the login never appears.
>

<snip>

>
> HOWEVER - it has had a fix for a NULL pointer dereference in the
> meantime - can you check whether the current top of tree happens to
> work for you? Maybe your problem isn't due to "that commit does
> unnatural things", but simply due to the bug fixed in d89c0c8322ec
> ("drm/virtio: Fix NULL dereference error in virtio_gpu_poll").
>
> And if it's still broken with that commit, I'll happily revert it and
> people need to go back to the drawing board.

I sent another mail yesterday which is now at
https://lore.kernel.org/lkml/CADVatmOOzCxAgLhCu1tTz=44sgRDXds5-oMZ3V0w4f5kLCLKrw@mail.gmail.com/
I will just pase that here for you.

Last night's test on 66f4beaa6c1d worked fine. So I guess this has now
been fixed.

I have not done a bisect to see what has fixed it, but looking at the
log I think it will be that NULL pointer fix.


-- 
Regards
Sudip

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: regression with mainline kernel
  2021-11-13 17:06   ` Linus Torvalds
  2021-11-13 21:34     ` Sudip Mukherjee
@ 2021-11-15 14:24     ` Daniel Vetter
  1 sibling, 0 replies; 6+ messages in thread
From: Daniel Vetter @ 2021-11-15 14:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Sudip Mukherjee, linux-kernel, Gurchetan Singh, Nicholas Verne,
	Gerd Hoffmann, dri-devel, open list:VIRTIO GPU DRIVER,
	Daniel Vetter

On Sat, Nov 13, 2021 at 09:06:57AM -0800, Linus Torvalds wrote:
> [ Hmm. This email was marked as spam for me. I see no obvious reason
> for it being marked as spam, but it happens.. ]
> 
> On Thu, Nov 11, 2021 at 12:52 PM Sudip Mukherjee
> <sudipm.mukherjee@gmail.com> wrote:
> >
> > # first bad commit: [cd7f5ca33585918febe5e2f6dc090a21cfa775b0]
> > drm/virtio: implement context init: add virtio_gpu_fence_event
> 
> Hmm. Judging from your automated screenshots, the login never appears.

Greg also reported a regression, plus I looked at the commit and had
questions too.

> > And, indeed reverting cd7f5ca33585 on top of debe436e77c7 has fixed
> > the problem I was seeing on my qemu test of x86_64. The qemu image is
> > based on Ubuntu.
> 
> Presumably either that commit is somehow buggy in itself - or it does
> exactly what it means to do, and the new poll() semantics just
> confuses the heck out of the X server (or wayland or whatever).
> 
> And honestly, if I read that thing correctly, the patch is entirely
> broken. The new poll function (virtio_gpu_poll()) will unconditionally
> remove the first event from the event list, and then report "Yeah, I
> had events".
> 
> This is completely bogus for a few reasons:
> 
>  - poll() really should be idempotent, because the poll function gets
> called multiple times
> 
>  - it doesn't even seem to check that the event that it removes is the
> new VIRTGPU_EVENT_FENCE_SIGNALED_INTERNAL kind of event, so it will
> unconditionally just remove random events.
> 
>  - it does seem to check the "vfpriv->ring_idx_mask" and do the old
> thing if that is zero, but I see absolutely no reason for that (and
> that check itself has caused problems, see below)
> 
> Honestly, my reaction to this all is that that commit is fundamentally
> broken and probably should be reverted regardless as "this commit does
> bad things".
> 
> HOWEVER - it has had a fix for a NULL pointer dereference in the
> meantime - can you check whether the current top of tree happens to
> work for you? Maybe your problem isn't due to "that commit does
> unnatural things", but simply due to the bug fixed in d89c0c8322ec
> ("drm/virtio: Fix NULL dereference error in virtio_gpu_poll").
> 
> And if it's still broken with that commit, I'll happily revert it and
> people need to go back to the drawing board.
> 
> In fact, I would really suggest that people look at that
> virtio_gpu_poll() function regardless. That odd "let's unconditionally
> just drop events in the poll function is really REALLY broken
> behavior.

Tbh I haven't looked at the poll implementation, but it's fishy on
principles as in drm drivers shouldn't reinvent them. The commit message
cites vmwgfx as example for a private driver drm_event, but that works
perfectly fine with standard drm_poll (and is meant to work perfectly fine
with standard drm_poll).

So if it's buggy on top of questionable I think revert might be the right
choice irrespective of whether there's some fixes in-flight.

So if you end up pushing that revert:

References: https://lore.kernel.org/dri-devel/YZJrutLaiwozLfSw@phenom.ffwll.local/
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>

Plus credit Greg too and all that ofc.

But lets wait a bit for virtio folks to chime in, it's only Monday :-)

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2021-11-15 14:25 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-11 14:03 regression with mainline kernel Sudip Mukherjee
2021-11-11 20:51 ` Sudip Mukherjee
2021-11-13 11:06   ` Sudip Mukherjee
2021-11-13 17:06   ` Linus Torvalds
2021-11-13 21:34     ` Sudip Mukherjee
2021-11-15 14:24     ` Daniel Vetter

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).