lore.kernel.org
* Re: [OE-core] Summary of the remaining 6.5 kernel serial issue (and 6.5 summary)
       [not found] <178BF2895FF685E6.5378@lists.openembedded.org>
@ 2023-10-08  8:23 ` Richard Purdie
  2023-10-09 10:05     ` Mikko Rapeli
       [not found] ` <178C1443D3C8D91F.5378@lists.openembedded.org>
  1 sibling, 1 reply; 25+ messages in thread
From: Richard Purdie @ 2023-10-08  8:23 UTC (permalink / raw)
  To: openembedded-core
  Cc: Paul Gortmaker, Bruce Ashfield, Mikko Rapeli, Randy MacLeod

On Sat, 2023-10-07 at 23:05 +0100, Richard Purdie via
lists.openembedded.org wrote:
> I thought I'd summarise where things are at with the 6.5 kernel.
> 
> We've fixed:
> * the ARM LTP OOM lockup (kernel patch)
> * the locale ARM selftest failure which was OOM due to silly buffer 
>   allocations in 6.5 (kernel commandline)
> * the ARM jitterentropy errors (kernel patch)
> * the cryptodev build failures (recipe updated)
> 
> We've also:
> * disabled the strace tests that fail with 6.5.
> * made sure the serial ports and getty counts match
> * added ttyrun which wraps serial consoles and avoids hacks
> * made the qemurunner logging save all the port logs
> * made the qemurunner write the binary data it is sent verbatim
> * made sure to use nodelay on qemu's tcpserial
> 
> This leaves an annoying serial console problem where ttyS1 never has
> the getty login prompt appear.
> 
> What we know:
> 
> * We've only seen this on x86 more recently (yesterday/today) but have
> seen it on ARM in the days before that.
> 
> * It affects both sysvinit and systemd images.
> 
> * Systemd does print that it started a getty on ttyS0 and ttyS1 when
> the failure occurs
> 
> * There is a getty running according to "ps" when the failure occurs
> 
> * There are only ever one or three characters received on ttyS1 in the
> failure case (0x0d and 0x0a chars, i.e. CR and LF)
> 
> * It can't be any kind of utf-8 conversion issue since the login prompt
> isn't visible in the binary log
> 
> * the kernel boot logs do show the serial port created with the same
> ioport and irq on x86.
> 
> Previously we did see some logs with timing issues on the ttyS0 port
> but the nodelay parameter may have helped with that.
> 
> There are debug patches in master-next against qemurunner which try and
> poke around to gather more debug when things fail using ttyS0.
> 
> The best failure log we have is now this one:
> 
> https://autobuilder.yoctoproject.org/typhoon/#/builders/79/builds/5874/steps/14/logs/stdio
> 
> where I've saved the logs:
> 
> https://autobuilder.yocto.io/pub/failed-builds-data/6.5%20kernel/j/qemu_boot_log.20231007084853
> and
> https://autobuilder.yocto.io/pub/failed-builds-data/6.5%20kernel/j/qemu_boot_log.20231007084853.2
> 
> You can see ttyS1 times out after 1000 seconds and the port only has a
> single byte (in the .2 file). The other log shows ps output showing the
> getty running for ttyS1.
> 
> Ideas welcome on where to go from here.
> 
> I've tweaked master-next to keep reading the ttyS1 port after we poke
> it from ttyS0 to see if that reveals anything next time it fails (build
> running).

Testing overnight with the new debug yielded:

https://autobuilder.yoctoproject.org/typhoon/#/builders/87/builds/5895/steps/14/logs/stdio

The interesting bit being:

"""
WARNING: core-image-full-cmdline-1.0-r0 do_testimage: Extra read data: 
Poky (Yocto Project Reference Distro) 4.2+snapshot-
7cb4ffbd8380b0509d7fac9191095379af321686 qemux86-64 ttyS1

qemux86-64 login: helloA

Poky (Yocto Project Reference Distro) 4.2+snapshot-
7cb4ffbd8380b0509d7fac9191095379af321686 qemux86-64 ttyS1
qemux86-64 login: 

"""

i.e. the getty prompt didn't appear within 1000s, but sometime during
shutdown the original prompt, the "helloA" and the new getty prompt did.

So the data *is* there but stuck in a buffer somehow. Kernel or qemu
side, I don't know.

Cheers,

Richard






^ permalink raw reply	[flat|nested] 25+ messages in thread

* Kernel 6.5 ttyS1 hang with qemu (was Re: [OE-core] Summary of the remaining 6.5 kernel serial issue (and 6.5 summary)
  2023-10-08  8:23 ` [OE-core] Summary of the remaining 6.5 kernel serial issue (and 6.5 summary) Richard Purdie
@ 2023-10-09 10:05     ` Mikko Rapeli
  0 siblings, 0 replies; 25+ messages in thread
From: Mikko Rapeli @ 2023-10-09 10:05 UTC (permalink / raw)
  To: linux-serial, Greg Kroah-Hartman, Richard Purdie
  Cc: openembedded-core, Paul Gortmaker, Bruce Ashfield, Randy MacLeod

Hi linux-serial and Greg,

Yocto Linux distro maintainer Richard Purdie is seeing a regression or behavior
change after updating the kernel from 6.4 to 6.5. Yocto runs a lot of automated tests with qemu,
where a python test framework configures and spawns qemu (version 8.1) with two
serial ports and boots a newly built kernel and rootfs there. The python test framework
tries to detect the getty login prompt on ttyS1, while boot time logging
from qemu, kernel, userspace etc happens on the serial console ttyS0. This has worked
reliably for years and across multiple kernel versions. Now, with the 6.5 kernel, there are
suddenly frequent failures where ttyS1 doesn't show the login prompt to the python code
which listens on the qemu socket for the target system's ttyS1. The target system ttyS0 is working,
and boot time logs and the getty prompt are visible there. The ttyS1 logs from the test framework show
that only two characters, CR and LF, make it through. The same happens on an x86_64 host
running qemu machines for x86_64 and arm64 targets.
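For reference, the harness side of this setup is essentially one TCP listener per emulated serial port. A minimal sketch of accepting such a connection (illustrative only, not the actual qemurunner code; TCP_NODELAY is the socket-level counterpart of the "nodelay" tweak mentioned elsewhere in this thread):

```python
import socket

def accept_serial_connection(listener: socket.socket) -> socket.socket:
    """Accept one qemu tcp-serial connection for a test harness."""
    conn, _addr = listener.accept()
    # TCP_NODELAY sends each small write immediately instead of letting
    # the kernel coalesce it, which matters for byte-at-a-time console
    # traffic arriving from the guest.
    conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    conn.setblocking(False)  # the harness polls rather than stalls
    return conn
```

The qemu side would then be pointed at this listener's port via its `-serial tcp:...` option.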

The kernel boot logs from the target system with 6.5.5 don't show any errors, the getty
processes are started correctly, and they don't see any errors either. Data from the
target system's userspace is being written, but sometimes seems to hang inside kernel
or qemu serial port buffers, and only on ttyS1, while ttyS0 works reliably all
the time.

Example target logs with 6.5.5 via ttyS0:

https://autobuilder.yocto.io/pub/failed-builds-data/6.5%20kernel/j/qemu_boot_log.20231007084853
or https://pastebin.com/raw/jRRa2CwW

ttyS1 logs from the same run:
https://autobuilder.yocto.io/pub/failed-builds-data/6.5%20kernel/j/qemu_boot_log.20231007084853.2
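A raw capture like the .2 file above can be checked quickly for the CR/LF-only symptom; a small sketch (the function name and fields are illustrative):

```python
def summarise_port_log(data: bytes) -> dict:
    """Summarise a raw serial-port capture saved verbatim by the test
    framework: size, whether only CR/LF bytes arrived, and whether any
    part of the login banner made it through."""
    return {
        "length": len(data),
        "only_crlf": set(data) <= {0x0D, 0x0A},  # 0x0d CR, 0x0a LF
        "has_login_prompt": b"login:" in data,
    }
```

For the failing runs described here, this would report `only_crlf` as True and `has_login_prompt` as False for the ttyS1 capture.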

Kernel config:
https://pastebin.com/raw/Lv9Hfeuh

Do you have any ideas or hints how to debug or possibly fix this?

The tty layer has seen quite a few major changes between 6.4.14 and 6.5.5, and we're not able
to pinpoint or bisect the issue, since it is so annoyingly rare.

Some more details from Richard below.

Cheers,

-Mikko

On Sun, Oct 08, 2023 at 09:23:57AM +0100, Richard Purdie wrote:
> On Sat, 2023-10-07 at 23:05 +0100, Richard Purdie via
> lists.openembedded.org wrote:
> > I thought I'd summarise where things are at with the 6.5 kernel.
> > 
> > We've fixed:
> > * the ARM LTP OOM lockup (kernel patch)
> > * the locale ARM selftest failure which was OOM due to silly buffer 
> >   allocations in 6.5 (kernel commandline)
> > * the ARM jitterentropy errors (kernel patch)
> > * the cryptodev build failures (recipe updated)
> > 
> > We've also:
> > * disabled the strace tests that fail with 6.5.
> > * made sure the serial ports and getty counts match
> > * added ttyrun which wraps serial consoles and avoids hacks
> > * made the qemurunner logging save all the port logs
> > * made the qemurunner write the binary data it is sent verbatim
> > * made sure to use nodelay on qemu's tcpserial
> > 
> > This leaves an annoying serial console problem where ttyS1 never has
> > the getty login prompt appear.
> > 
> > What we know:
> > 
> > * We've only seen this on x86 more recently (yesterday/today) but have
> > seen it on ARM in the days before that.
> > 
> > * It affects both sysvinit and systemd images.
> > 
> > * Systemd does print that it started a getty on ttyS0 and ttyS1 when
> > the failure occurs
> > 
> > * There is a getty running according to "ps" when the failure occurs
> > 
> > * There are only ever one or three characters received to ttyS1 in the
> > failure case (0x0d and 0x0a chars, i.e. CR and LF)
> > 
> > * It can't be any kind of utf-8 conversion issue since the login prompt
> > isn't visible in the binary log
> > 
> > * the kernel boot logs do show the serial port created with the same
> > ioport and irq on x86.
> > 
> > Previously we did see some logs with timing issues on the ttyS0 port
> > but the nodelay parameter may have helped with that.
> > 
> > There are debug patches in master-next against qemurunner which try and
> > poke around to gather more debug when things fail using ttyS0.
> > 
> > The best failure log we have is now this one:
> > 
> > https://autobuilder.yoctoproject.org/typhoon/#/builders/79/builds/5874/steps/14/logs/stdio
> > 
> > where I've saved the logs:
> > 
> > https://autobuilder.yocto.io/pub/failed-builds-data/6.5%20kernel/j/qemu_boot_log.20231007084853
> > and
> > https://autobuilder.yocto.io/pub/failed-builds-data/6.5%20kernel/j/qemu_boot_log.20231007084853.2
> > 
> > You can see ttyS1 times out after 1000 seconds and the port only has a
> > single byte (in the .2 file). The other log shows ps output showing the
> > getty running for ttyS1.
> > 
> > Ideas welcome on where from here. 
> > 
> > I've tweaked master-next to keep reading the ttyS1 port after we poke
> > it from ttyS0 to see if that reveals anything next time it fails (build
> > running).
> 
> Testing overnight with the new debug yielded:
> 
> https://autobuilder.yoctoproject.org/typhoon/#/builders/87/builds/5895/steps/14/logs/stdio
> 
> The interesting bit being:
> 
> """
> WARNING: core-image-full-cmdline-1.0-r0 do_testimage: Extra read data: 
> Poky (Yocto Project Reference Distro) 4.2+snapshot-
> 7cb4ffbd8380b0509d7fac9191095379af321686 qemux86-64 ttyS1
> 
> qemux86-64 login: helloA
> 
> Poky (Yocto Project Reference Distro) 4.2+snapshot-
> 7cb4ffbd8380b0509d7fac9191095379af321686 qemux86-64 ttyS1
> qemux86-64 login: 
> 
> """
> 
> i.e. the getty didn't appear in 1000s but sometime in shutdown the
> original prompt, the "helloA" and the new getty prompt did.
> 
> So the data *is* there but stuck in a buffer somehow. Kernel or qemu
> side, I don't know.
> 
> Cheers,
> 
> Richard
> 
> 
> 
> 

* Re: [OE-core] Summary of the remaining 6.5 kernel serial issue (and 6.5 summary)
       [not found] ` <178C1443D3C8D91F.5378@lists.openembedded.org>
@ 2023-10-09 12:30   ` Richard Purdie
       [not found]   ` <178C704E48D36FAB.24994@lists.openembedded.org>
  1 sibling, 0 replies; 25+ messages in thread
From: Richard Purdie @ 2023-10-09 12:30 UTC (permalink / raw)
  To: openembedded-core
  Cc: Paul Gortmaker, Bruce Ashfield, Mikko Rapeli, Randy MacLeod

On Sun, 2023-10-08 at 09:23 +0100, Richard Purdie via
lists.openembedded.org wrote:
> On Sat, 2023-10-07 at 23:05 +0100, Richard Purdie via
> lists.openembedded.org wrote:
> > I thought I'd summarise where things are at with the 6.5 kernel.
> > 
> > We've fixed:
> > * the ARM LTP OOM lockup (kernel patch)
> > * the locale ARM selftest failure which was OOM due to silly buffer 
> >   allocations in 6.5 (kernel commandline)
> > * the ARM jitterentropy errors (kernel patch)
> > * the cryptodev build failures (recipe updated)
> > 
> > We've also:
> > * disabled the strace tests that fail with 6.5.
> > * made sure the serial ports and getty counts match
> > * added ttyrun which wraps serial consoles and avoids hacks
> > * made the qemurunner logging save all the port logs
> > * made the qemurunner write the binary data it is sent verbatim
> > * made sure to use nodelay on qemu's tcpserial
> > 
> > This leaves an annoying serial console problem where ttyS1 never has
> > the getty login prompt appear.
> > 
> > What we know:
> > 
> > * We've only seen this on x86 more recently (yesterday/today) but have
> > seen it on ARM in the days before that.
> > 
> > * It affects both sysvinit and systemd images.
> > 
> > * Systemd does print that it started a getty on ttyS0 and ttyS1 when
> > the failure occurs
> > 
> > * There is a getty running according to "ps" when the failure occurs
> > 
> > * There are only ever one or three characters received to ttyS1 in the
> > failure case (0x0d and 0x0a chars, i.e. CR and LF)
> > 
> > * It can't be any kind of utf-8 conversion issue since the login prompt
> > isn't visible in the binary log
> > 
> > * the kernel boot logs do show the serial port created with the same
> > ioport and irq on x86.
> > 
> > Previously we did see some logs with timing issues on the ttyS0 port
> > but the nodelay parameter may have helped with that.
> > 
> > There are debug patches in master-next against qemurunner which try and
> > poke around to gather more debug when things fail using ttyS0.
> > 
> > The best failure log we have is now this one:
> > 
> > https://autobuilder.yoctoproject.org/typhoon/#/builders/79/builds/5874/steps/14/logs/stdio
> > 
> > where I've saved the logs:
> > 
> > https://autobuilder.yocto.io/pub/failed-builds-data/6.5%20kernel/j/qemu_boot_log.20231007084853
> > and
> > https://autobuilder.yocto.io/pub/failed-builds-data/6.5%20kernel/j/qemu_boot_log.20231007084853.2
> > 
> > You can see ttyS1 times out after 1000 seconds and the port only has a
> > single byte (in the .2 file). The other log shows ps output showing the
> > getty running for ttyS1.
> > 
> > Ideas welcome on where from here. 
> > 
> > I've tweaked master-next to keep reading the ttyS1 port after we poke
> > it from ttyS0 to see if that reveals anything next time it fails (build
> > running).
> 
> Testing overnight with the new debug yielded:
> 
> https://autobuilder.yoctoproject.org/typhoon/#/builders/87/builds/5895/steps/14/logs/stdio
> 
> The interesting bit being:
> 
> """
> WARNING: core-image-full-cmdline-1.0-r0 do_testimage: Extra read data: 
> Poky (Yocto Project Reference Distro) 4.2+snapshot-
> 7cb4ffbd8380b0509d7fac9191095379af321686 qemux86-64 ttyS1
> 
> qemux86-64 login: helloA
> 
> Poky (Yocto Project Reference Distro) 4.2+snapshot-
> 7cb4ffbd8380b0509d7fac9191095379af321686 qemux86-64 ttyS1
> qemux86-64 login: 
> 
> """
> 
> i.e. the getty didn't appear in 1000s but sometime in shutdown the
> original prompt, the "helloA" and the new getty prompt did.
> 
> So the data *is* there but stuck in a buffer somehow. Kernel or qemu
> side, I don't know.

To update, the latest debug is:

https://autobuilder.yoctoproject.org/typhoon/#/builders/80/builds/5836/steps/14/logs/stdio

which shows that it is enough to echo "helloA" to /dev/ttyS1 to cause
the data to be flushed; we don't need to restart the getty. It also
shows there can be other system console messages "stuck" in the queue.

I did also test forcing a read onto the socket but that just triggers a
BlockingIOError exception and doesn't help.

I'm going to try writing "\n\n" to the port after 120s if the getty
hasn't appeared, see if that helps.
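The planned poke could look roughly like this on the harness side (a sketch against a plain socket, not the actual qemurunner change; the 120s threshold and the "\n\n" write come from the paragraph above):

```python
import select
import socket

def nudge_and_drain(conn: socket.socket, timeout: float = 120.0) -> bytes:
    """Wait for getty output; if nothing arrives within `timeout`,
    write two newlines to the port to provoke the stuck data, then
    drain whatever becomes readable."""
    received = b""
    if not select.select([conn], [], [], timeout)[0]:
        conn.sendall(b"\n\n")  # the nudge that appears to flush the buffer
    while select.select([conn], [], [], 2.0)[0]:
        try:
            chunk = conn.recv(4096)
        except BlockingIOError:
            # a non-blocking read on an empty socket raises instead of
            # returning data (the earlier forced read hit exactly this)
            continue
        if not chunk:
            break  # peer closed
        received += chunk
    return received
```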

Cheers,

Richard







* Re: [OE-core] Summary of the remaining 6.5 kernel serial issue (and 6.5 summary)
       [not found]   ` <178C704E48D36FAB.24994@lists.openembedded.org>
@ 2023-10-10 20:37     ` Richard Purdie
       [not found]     ` <178CD9772ED4DEB5.9719@lists.openembedded.org>
  1 sibling, 0 replies; 25+ messages in thread
From: Richard Purdie @ 2023-10-10 20:37 UTC (permalink / raw)
  To: openembedded-core
  Cc: Paul Gortmaker, Bruce Ashfield, Mikko Rapeli, Randy MacLeod

I've been asked how to reproduce this setup more easily outside of
OEQA. I hacked this up:

https://www.rpsys.net/wp/rp/simpleqemu.tgz

extract it, hack the paths in commands.py to point at a qemux86-64
build of core-image-minimal, and then "./commands" should boot a qemu
image using the system's qemu binary as it would during the QA tests,
setting up the two serial ports as it would during tests.

Obviously I've just brutally hacked this out of the QA code, so there
are plenty of sharp edges. It should allow the serial setup to be
replicated and experimented with more easily, though, without bitbake
involved. It could be cut down a lot more too.

Cheers,

Richard



* Re: [OE-core] Summary of the remaining 6.5 kernel serial issue (and 6.5 summary)
       [not found]     ` <178CD9772ED4DEB5.9719@lists.openembedded.org>
@ 2023-10-10 20:53       ` Richard Purdie
  2023-10-11  4:49         ` Paul Gortmaker
  0 siblings, 1 reply; 25+ messages in thread
From: Richard Purdie @ 2023-10-10 20:53 UTC (permalink / raw)
  To: openembedded-core
  Cc: Paul Gortmaker, Bruce Ashfield, Mikko Rapeli, Randy MacLeod

On Tue, 2023-10-10 at 21:37 +0100, Richard Purdie via
lists.openembedded.org wrote:
> I've been asked how to reproduce this setup more easily outside of
> OEQA. I hacked this up:
> 
> https://www.rpsys.net/wp/rp/simpleqemu.tgz
> 
> extract it, hack the paths in commands.py to point at a qemux86-64
> build of core-image-minimal and then "./commands" should boot an qemu
> image using the system's qemu binary as it would during the QA tests,
> setting up the two serial ports as it would during tests.
> 
> Obviously I've just brutally hacked this out of the QA code so there
> are plenty of sharp edges. It should allow the serial setup to be
> replicated and experimented with more easily though without bitbake
> involved. It could be cut down a lot more too.

and set runqemuparams = 'nographic' in commands.py if you can't cope
with graphics :).

Cheers,

Richard



* Re: [OE-core] Summary of the remaining 6.5 kernel serial issue (and 6.5 summary)
  2023-10-10 20:53       ` Richard Purdie
@ 2023-10-11  4:49         ` Paul Gortmaker
  2023-10-13 12:00           ` Richard Purdie
  2023-10-13 15:15           ` Paul Gortmaker
  0 siblings, 2 replies; 25+ messages in thread
From: Paul Gortmaker @ 2023-10-11  4:49 UTC (permalink / raw)
  To: richard.purdie
  Cc: openembedded-core, Paul Gortmaker, Bruce Ashfield, Mikko Rapeli,
	Randy MacLeod

[Re: [OE-core] Summary of the remaining 6.5 kernel serial issue (and 6.5 summary)] On 10/10/2023 (Tue 21:53) Richard Purdie via lists.openembedded.org wrote:

> On Tue, 2023-10-10 at 21:37 +0100, Richard Purdie via
> lists.openembedded.org wrote:
> > I've been asked how to reproduce this setup more easily outside of
> > OEQA. I hacked this up:
> > 
> > https://www.rpsys.net/wp/rp/simpleqemu.tgz
> > 
> > extract it, hack the paths in commands.py to point at a qemux86-64
> > build of core-image-minimal and then "./commands" should boot an qemu
> > image using the system's qemu binary as it would during the QA tests,
> > setting up the two serial ports as it would during tests.
> > 
> > Obviously I've just brutally hacked this out of the QA code so there
> > are plenty of sharp edges. It should allow the serial setup to be
> > replicated and experimented with more easily though without bitbake
> > involved. It could be cut down a lot more too.
> 
> and set runqemuparams = 'nographic' in commands.py if you can't cope
> with graphics :).

So I think we are onto something here.  Let me rewind a bit to provide
some context for those who didn't follow the IRC discussion.

The OE/QA is running qemu with the "-serial tcp:<IP>:<port>" option and
binds to that directly through python.  In theory, one should also be
able to use netcat/socat (or even telnet) to bind to the qemu TCP via
manual command line testing without the autobuilder/OEQA python.

But I couldn't get that to work without getting "broken" echo behaviour
and similar tty/terminal odd characters displayed.  As that was also
observed on v6.1 and mickledore, I didn't want to go off on a wild goose
chase after something that probably has nothing to do with the issue at
hand here.

Hence leading to what Richard created above.  The basic premise is that,
similar to the AB output, we are looking for log files ending in ".2" --
those from ttyS1 -- that are smaller than normal, indicating that the
getty wasn't able to get the login banner out successfully.  Now the results.

After 300 runs on mickledore with v6.1, all the .2 files were the same
size - 477 bytes.

After 155 runs on today's master with v6.5 kernel, I got two runs where
instead of 477 bytes, they were 345 bytes.

dot-two$ls -l1|wc -l
155
dot-two$ls -l1| grep -v 477
total 616
-rw-r--r-- 1 pgortmak users 345 Oct 10 17:54 qemu_boot_log.20231010175351.2
-rw-r--r-- 1 pgortmak users 345 Oct 10 18:43 qemu_boot_log.20231010184307.2

The difference?  The "short" files ended with "Starting crond: OK" and
were missing the login banner:
   --------
   Poky (Yocto Project Reference Distro) 4.2+snapshot-0fb1ff0ffb527a6364cef7159c0d9dcb879e8dd9 qemux86-64 ttyS1

   qemux86-64 login:
   --------
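The scan for short .2 files is easy to automate; a sketch (the 477-byte "good" size is specific to this image and banner, so it is a parameter here):

```python
from pathlib import Path

def short_port_logs(logdir: str, good_size: int = 477) -> list[Path]:
    """Return ttyS1 capture files (*.2) smaller than the known-good
    size, i.e. runs where the login banner likely never made it out."""
    return sorted(
        p for p in Path(logdir).glob("qemu_boot_log.*.2")
        if p.stat().st_size < good_size
    )
```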

Similarly, for the ttyS0 logs, if we strip off the printk.time prefixes
thus allowing us to diff the corresponding "outlier" case against any
one of the 153 "normal" cases, we see that this line is missing:

Last login: Tue Oct 10 22:40:31 +0000 2023 on /dev/ttyS1.

...which makes sense, since the login never appeared on ttyS1 for OEQA
to interact with and login to.

(interesting that otherwise, ttyS1 always "wins" the race to update
wtmp/utmp over ttyS0)

Anyway, that works out to a dismal reproduction rate of approximately
1.5% by my math.  Ugh.  But at least it isn't coupled to a full build!

Finally, using an identical build on master but with PREF VERSION of
v6.4 kernel, I ran 300+ instances and got zero instances of short .2
ttyS1 log files that were missing the getty output.

Note that these runs were using the default qemu v4.x that ships with
ubuntu-20.04 ; I've not (yet?) run similar tests with our own more
up-to-date qemu build from the sysroot.  Maybe I don't need to?

Possible conclusions:

The reproducibility seems pretty low, which seems to match what I've
heard with respect to the AB results.  It also seems confined to the
v6.5.x kernel, based on these limited results - matching the AB results.

So maybe we have a reproducer that is independent of OEQA and the
backing of a full build to get there?  I'd like to think so, but I'm
going to run more v6.5+master tests overnight to try and get better
stats - a +/- of just 1 changes my current reproduce # by a large
percentage!


Next steps:

Each run takes approximately 20s, and with a 1.5% success factor, a
minimum of 150 clean runs is needed before one can say "OK, *maybe* this
kernel currently under test is OK."  That translates to over an hour to
get a data point you have some faith in as being "good".  [Still a heck
of a lot better than doing 150 *builds* though!]
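That rule of thumb can be put on slightly firmer footing: if a bad kernel fails each run independently with probability p, then n clean runs still miss it with probability (1 - p)^n. A small sketch (the 1.5% figure is the estimate from above):

```python
import math

def runs_for_confidence(p_fail: float, confidence: float) -> int:
    """Clean runs needed so the chance of having missed a bad kernel
    drops to at most 1 - confidence, assuming independent runs."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_fail))

def miss_probability(p_fail: float, n_runs: int) -> float:
    """Chance that n clean runs came from a bad kernel anyway."""
    return (1.0 - p_fail) ** n_runs
```

With p = 0.015, 150 clean runs still miss a bad kernel about 10% of the time, and roughly 305 clean runs are needed for 99% confidence in a "good" verdict.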

On the other hand, just one instance of missing login on ttyS1 is
concretely "bad" and hence the test run can be stopped as soon as one
instance of that is seen.

This is the "good is really maybe perhaps good, but bad is for sure bad"
type of bisect that I love to hate.

Since the data points are "expensive", I probably won't just attempt a
mindless brute force bisect.  I'd probably do something more like...

--ensure modules (or lack of) don't impact the results, so I don't have
  to rebundle rootfs images.

--maybe see if I can build with host gcc vs. using devshell/sysroot gcc
  and still see the issue.  Although really, kernel build time optimization
  won't be the bottleneck in this particular case, so not worthwhile?

--check vanilla v6.5 (i.e. v6.5.0 -- no linux-stable content) so I know
  whether to dig around in v6.5.0 --> v6.5.5, or dig around in the
  much much larger v6.4 --> v6.5 additions.  I expect it to be the
  latter, i.e. digging through mainline, but who knows.

--assuming mainline introduced it, instead of a brute force bisect, I'd
  look for merges by Linus from gregKH pull requests for the tty and
  serial subsystems in the merge window (v6.4 ---> v6.5-rc1) and test
  both sides of that/those merge(s).

--if things haven't completely fallen apart by that point, I'd be left
  with a relatively small set of subsystem specific commits from a single
  pull request which then could be bisected.
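As a sketch of the merge-hunting step (commands assume a mainline clone; the pathspec is a guess at where the relevant changes live):

```shell
# list merge commits in the 6.5 merge window that touch the tty/serial
# code -- candidates for the gregKH pull request(s) to test around
git log --merges --oneline v6.4..v6.5-rc1 -- drivers/tty

# for a suspect merge M, M^1 is mainline just before the pull and M is
# just after, giving the two sides to test:
#   git checkout M^1    # before the pull
#   git checkout M      # after the pull
```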

Sometimes in the past, when I've got down to that small set of commits,
the culprit almost jumps out at you when scanning over the shortlogs.

Anyway, that is where my thoughts are at.  But of course we don't even
yet know if this "test" is even accurately reflecting the OEQA/AB issue!
So nobody should get their hopes up too early.

As a side note, and something that kind of was briefly discussed on IRC,
it seems to make sense to someday informally "support" the idea of
developers being able to use a chunk of the OEQA like this in a
stand-alone mode - since we always seem to be chasing issues that only
show up in AB output - with people struggling to reproduce elsewhere.

Paul.
--



* Re: [OE-core] Summary of the remaining 6.5 kernel serial issue (and 6.5 summary)
  2023-10-11  4:49         ` Paul Gortmaker
@ 2023-10-13 12:00           ` Richard Purdie
  2023-10-13 15:15           ` Paul Gortmaker
  1 sibling, 0 replies; 25+ messages in thread
From: Richard Purdie @ 2023-10-13 12:00 UTC (permalink / raw)
  To: paul.gortmaker
  Cc: openembedded-core, Bruce Ashfield, Mikko Rapeli, Randy MacLeod

On Wed, 2023-10-11 at 00:49 -0400, Paul Gortmaker via
lists.openembedded.org wrote:
> [Re: [OE-core] Summary of the remaining 6.5 kernel serial issue (and 6.5 summary)] On 10/10/2023 (Tue 21:53) Richard Purdie via lists.openembedded.org wrote:
> 
> > On Tue, 2023-10-10 at 21:37 +0100, Richard Purdie via
> > lists.openembedded.org wrote:
> > > I've been asked how to reproduce this setup more easily outside of
> > > OEQA. I hacked this up:
> > > 
> > > https://www.rpsys.net/wp/rp/simpleqemu.tgz
> > > 
> > > extract it, hack the paths in commands.py to point at a qemux86-64
> > > build of core-image-minimal and then "./commands" should boot an qemu
> > > image using the system's qemu binary as it would during the QA tests,
> > > setting up the two serial ports as it would during tests.
> > > 
> > > Obviously I've just brutally hacked this out of the QA code so there
> > > are plenty of sharp edges. It should allow the serial setup to be
> > > replicated and experimented with more easily though without bitbake
> > > involved. It could be cut down a lot more too.
> > 
> > and set runqemuparams = 'nographic' in commands.py if you can't cope
> > with graphics :).
> 
> So I think we are onto something here.  Let me rewind to provide a bit
> of context for those who didn't follow the IRC discussion.
> 
> The OE/QA is running qemu with the "-serial tcp:<IP>:<port>" option and
> binds to that directly through python.  In theory, one should also be
> able to use netcat/socat (or even telnet) to bind to the qemu TCP via
> manual command line testing without the autobuilder/OEQA python.
> 
> But I couldn't get that to work without getting "broken" echo behaviour
> and similarly odd tty/terminal characters displayed.  As that was also
> observed on v6.1 and mickledore, I didn't want to go off on a wild goose
> chase after something that probably has nothing to do with the issue at
> hand here.
> 
> That led to what Richard created above.  The basic premise is that,
> similar to the AB output, we are looking for log files ending in ".2"
> (i.e. from ttyS1) that are smaller than normal, indicating that a
> getty wasn't able to get the login banner out successfully.  Now the results.
> 
> After 300 runs on mickledore with v6.1, all the .2 files were the same
> size - 477 bytes.
> 
> After 155 runs on today's master with v6.5 kernel, I got two runs where
> instead of 477 bytes, they were 345 bytes.
> 
> dot-two$ls -l1|wc -l
> 155
> dot-two$ls -l1| grep -v 477
> total 616
> -rw-r--r-- 1 pgortmak users 345 Oct 10 17:54 qemu_boot_log.20231010175351.2
> -rw-r--r-- 1 pgortmak users 345 Oct 10 18:43 qemu_boot_log.20231010184307.2
> 
> The difference?  The "short" files ended with "Starting crond: OK" and
> were missing the login banner:
>    --------
>    Poky (Yocto Project Reference Distro) 4.2+snapshot-0fb1ff0ffb527a6364cef7159c0d9dcb879e8dd9 qemux86-64 ttyS1
> 
>    qemux86-64 login:
>    --------
> 
> Similarly, for the ttyS0 logs, if we strip off the printk.time prefixes
> thus allowing us to diff the corresponding "outlier" case against any
> one of the 153 "normal" cases, we see that this line is missing:
> 
> Last login: Tue Oct 10 22:40:31 +0000 2023 on /dev/ttyS1.
> 
> ...which makes sense, since the login never appeared on ttyS1 for OEQA
> to interact with and login to.
> 
> (interesting that otherwise, ttyS1 always "wins" the race to update
> wtmp/utmp over ttyS0)
> 
> Anyway, that works out to a dismal reproduction rate of roughly 1.5%
> by my math.  Ugh.  But at least it isn't coupled to a full build!
> 
> Finally, using an identical build on master but with PREF VERSION of
> v6.4 kernel, I ran 300+ instances and got zero instances of short .2
> ttyS1 log files that were missing the getty output.
> 
> Note that these runs were using the default qemu v4.x that ships with
> ubuntu-20.04 ; I've not (yet?) run similar tests with our own more
> up-to-date qemu build from the sysroot.  Maybe I don't need to?
> 
> Possible conclusions:
> 
> The reproducibility seems pretty low, which seems to match what I've
> heard with respect to the AB results.  It also seems confined to the
> v6.5.x kernel, based on these limited results - matching the AB results.
> 
> So maybe we have a reproducer that is independent of OEQA and the
> backing of a full build to get there?  I'd like to think so, but I'm
> going to run more v6.5+master tests overnight to try and get better
> stats - a +/- of just 1 changes my current reproduce # by a large
> percentage!
> 
> 
> Next steps:
> 
> Each run takes approximately 20s, and with a 1.5% success factor, a
> minimum of 150 clean runs is needed before one can say "OK, *maybe* this
> kernel currently under test is OK".  That translates to over an hour to
> get a data point you have some faith in as being "good".  [Still a heck
> of a lot better than doing 150 *builds* though!]
> 
> On the other hand, just one instance of missing login on ttyS1 is
> concretely "bad" and hence the test run can be stopped as soon as one
> instance of that is seen.
> 
> This is the "good is really only maybe good, but bad is for sure bad" type
> of bisect that I love to hate.
> 
> Since the data points are "expensive", I probably won't just attempt a
> mindless brute force bisect.  I'd probably do something more like...
> 
> --ensure modules (or lack of) don't impact the results, so I don't have
>   to rebundle rootfs images.
> 
> --maybe see if I can build with host gcc vs. using devshell/sysroot gcc
>   and still see the issue.  Although really, kernel build time optimization
>   won't be the bottleneck in this particular case, so not worthwhile?
> 
> --check vanilla v6.5 (i.e. v6.5.0 -- no linux-stable content) so I know
>   whether to dig around in v6.5.0 --> v6.5.5, or dig around in the
>   much much larger v6.4 --> v6.5 additions.  I expect it to be the
>   latter, i.e. digging through mainline, but who knows.
> 
> --assuming mainline introduced it, instead of a brute force bisect, I'd
>   look for merges by Linus from gregKH pull requests for the tty and
>   serial subsystems in the merge window (v6.4 ---> v6.5-rc1) and test
>   both sides of that/those merge(s).
> 
> --if things haven't completely fallen apart by that point, I'd be left
>   with a relatively small set of subsystem specific commits from a single
>   pull request which then could be bisected.
> 
> Sometimes in the past, when I've got down to that small set of commits,
> the culprit almost jumps out at you when scanning over the shortlogs.
> 
> Anyway, that is where my thoughts are at.  But of course we don't even
> yet know if this "test" is even accurately reflecting the OEQA/AB issue!
> So nobody should get their hopes up too early.
> 
> As a side note, and something that kind of was briefly discussed on IRC,
> it seems to make sense to someday informally "support" the idea of
> developers being able to use a chunk of the OEQA like this in a
> stand-alone mode - since we always seem to be chasing issues that only
> show up in AB output - with people struggling to reproduce elsewhere.

You've been doing some great work on this, thanks!

It is worth noting that for the purposes of experimentation, cutting
the 1000s timeout in the script to something more reasonable like 60s
is fine. If the thing fails, it hangs and you know about it quickly.

I did have some experiments of my own which I thought I'd share.

I hacked the script to append sys.argv[1] to the log filename, then
experimented with a command like:

$ for i in `seq 1 88`; do ./commands.py $i & done

which launches 88 qemus in parallel. When you do this and watch the log
file sizes, it is clear that several "hang", for me at different points:

$ ls *.2 -la | cut -d ' ' -f 5 | sort | uniq -c
      1 134
      1 249
      1 251
      2 254
      1 255
     51 273

I think a few also crash due to races, since that total clearly isn't 88,
but I'm not worrying too much about that.
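For scanning a directory of results in bulk, something like the helper below can replace eyeballing the `ls` output (a sketch; the `qemu_boot_log.*.2` naming matches the harness above, and the 273-byte "good" size is the one from these runs - adjust for your image):

```python
from collections import Counter
from pathlib import Path

def histogram_ttyS1_logs(directory=".", good_size=273):
    """Histogram the sizes of the ttyS1 (.2) boot logs and return the
    histogram plus the names of any "short" logs missing the login banner."""
    sizes = Counter()
    short = []
    for path in sorted(Path(directory).glob("qemu_boot_log.*.2")):
        size = path.stat().st_size
        sizes[size] += 1
        if size < good_size:
            short.append(path.name)
    return sizes, short
```

Any run listed in `short` is a candidate "hang" worth diffing against a normal log.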

I also came up with the idea of running "cat /proc/tty/driver/serial"
before and after the "echo helloA > /dev/ttyS1\n". This was interesting
as it resulted in:

root@qemux86-64:~# cat /proc/tty/driver/serial
serinfo:1.0 driver revision:
0: uart:16550A port:000003F8 irq:4 tx:418 rx:43 RTS|CTS|DTR|DSR|CD
1: uart:16550A port:000002F8 irq:3 tx:249 rx:0 RTS|CTS|DTR|DSR|CD
2: uart:unknown port:000003E8 irq:4
3: uart:unknown port:000002E8 irq:3
root@qemux86-64:~# echo helloA > /dev/ttyS1
root@qemux86-64:~# echo helloB > /dev/ttyS0
helloB
root@qemux86-64:~# cat /proc/tty/driver/serial
serinfo:1.0 driver revision:
0: uart:16550A port:000003F8 irq:4 tx:803 rx:121 RTS|CTS|DTR|DSR|CD
1: uart:16550A port:000002F8 irq:3 tx:281 rx:0 RTS|CTS|DTR|DSR|CD
2: uart:unknown port:000003E8 irq:4
3: uart:unknown port:000002E8 irq:3

The .2 logfile in this specific case was 249 bytes:

-rw-rw-r--  1 richard richard      249 Oct 13 12:50 qemu_boot_log.2023101312503368.2

which corresponds to the "tx:249". The last line was:

"""
Poky (Yocto Project Reference Distro) 4.2+snapshot-5b7479692b49154f4c299496974e7c81892ae336 qemux86-64 /dev/ttyS1
"""
i.e. the login prompt wasn't shown.

After the echo of "helloA\n" (8 bytes once the tty's ONLCR processing
turns \n into \r\n) we see the tx count increase to 281, so the login
*and* the hello are sent: the expected 273 + 8 == 281.

So what this seems to prove is that the kernel is holding on to the data,
and our qemu processing side likely isn't at fault, as the kernel never
transmits it.
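The byte accounting can also be checked mechanically; a sketch (not part of the harness) that parses the `cat /proc/tty/driver/serial` output shown above:

```python
import re

def parse_tx_counts(serinfo: str) -> dict:
    """Extract {port_number: tx_byte_count} from /proc/tty/driver/serial
    output. Lines without a tx: field (uart:unknown ports) are skipped."""
    counts = {}
    for line in serinfo.splitlines():
        m = re.match(r"(\d+): uart:\S+ port:\S+ irq:\d+ tx:(\d+)", line)
        if m:
            counts[int(m.group(1))] = int(m.group(2))
    return counts

before = """\
serinfo:1.0 driver revision:
0: uart:16550A port:000003F8 irq:4 tx:418 rx:43 RTS|CTS|DTR|DSR|CD
1: uart:16550A port:000002F8 irq:3 tx:249 rx:0 RTS|CTS|DTR|DSR|CD
2: uart:unknown port:000003E8 irq:4
3: uart:unknown port:000002E8 irq:3"""
after = """\
serinfo:1.0 driver revision:
0: uart:16550A port:000003F8 irq:4 tx:803 rx:121 RTS|CTS|DTR|DSR|CD
1: uart:16550A port:000002F8 irq:3 tx:281 rx:0 RTS|CTS|DTR|DSR|CD
2: uart:unknown port:000003E8 irq:4
3: uart:unknown port:000002E8 irq:3"""

# ttyS1 stalled at 249 bytes; the full banner is 273 and the echo adds 8,
# so a flush after the echo should land on exactly 273 + 8 = 281.
assert parse_tx_counts(before)[1] == 249
assert parse_tx_counts(after)[1] == 281 == 273 + 8
```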

Cheers,

Richard

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [OE-core] Summary of the remaining 6.5 kernel serial issue (and 6.5 summary)
  2023-10-11  4:49         ` Paul Gortmaker
  2023-10-13 12:00           ` Richard Purdie
@ 2023-10-13 15:15           ` Paul Gortmaker
       [not found]             ` <178C6861D9B097E8.24994@lists.openembedded.org>
  1 sibling, 1 reply; 25+ messages in thread
From: Paul Gortmaker @ 2023-10-13 15:15 UTC (permalink / raw)
  To: richard.purdie
  Cc: openembedded-core, Paul Gortmaker, Bruce Ashfield, Mikko Rapeli,
	Randy MacLeod

So, here is an update for those not following on IRC:

[Re: [OE-core] Summary of the remaining 6.5 kernel serial issue (and 6.5 summary)] On 11/10/2023 (Wed 00:50) Paul Gortmaker wrote:

[snip details of getting RP's chunk of AB python working as a 1% reproducer]

> Next steps:
> 
> Each run takes approximately 20s, and with a 1.5% success factor, a
> minimum of 150 clean runs is needed before one can say "OK, *maybe* this
> kernel currently under test is OK".  That translates to over an hour to
> get a data point you have some faith in as being "good".  [Still a heck
> of a lot better than doing 150 *builds* though!]
> 
> On the other hand, just one instance of missing login on ttyS1 is
> concretely "bad" and hence the test run can be stopped as soon as one
> instance of that is seen.
> 
> This is the "good is really only maybe good, but bad is for sure bad" type
> of bisect that I love to hate.
> 
> Since the data points are "expensive", I probably won't just attempt a
> mindless brute force bisect.  I'd probably do something more like...
> 
> --ensure modules (or lack of) don't impact the results, so I don't have
>   to rebundle rootfs images.

This turned out to be true - didn't need modules.

> --maybe see if I can build with host gcc vs. using devshell/sysroot gcc
>   and still see the issue.  Although really, kernel build time optimization
>   won't be the bottleneck in this particular case, so not worthwhile?

I didn't bother with the above, since build time was not significant here.

> --check vanilla v6.5 (i.e. v6.5.0 -- no linux-stable content) so I know
>   whether to dig around in v6.5.0 --> v6.5.5, or dig around in the
>   much much larger v6.4 --> v6.5 additions.  I expect it to be the
>   latter, i.e. digging through mainline, but who knows.

As expected, the issue reproduces on vanilla v6.5.0, so linux-stable isn't to blame here.

> --assuming mainline introduced it, instead of a brute force bisect, I'd
>   look for merges by Linus from gregKH pull requests for the tty and
>   serial subsystems in the merge window (v6.4 ---> v6.5-rc1) and test
>   both sides of that/those merge(s).

This also largely went according to plan. I started with the tty merge:

v6.5-rc1~74:
868a9fd94807 Merge tag 'tty-6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty

and did burrow down into the 60-odd commits within.  One complication
was that there were interim mainline merges into the dev stream:

$ git log --oneline --merges v6.5-rc1~74^2  ^v6.5-rc1~74^1
2076b2a4a6b7 Merge 6.4-rc5 into tty-next
0e4daea31d83 Merge 6.4-rc3 into tty-next

...which usually indicate a dependency on work from another subsystem.
Fortunately the test pass/fail boundary didn't land on one of those.

> --if things haven't completely fallen apart by that point, I'd be left
>   with a relatively small set of subsystem specific commits from a single
>   pull request which then could be bisected.

We did somehow manage to get here with just 60 commits to consider.

> 
> Sometimes in the past, when I've got down to that small set of commits,
> the culprit almost jumps out at you when scanning over the shortlogs.

While I didn't immediately spot the commit via the shortlog, extensive
test iterations did give us confidence in pinpointing a single commit:

84a9582fd203 serial: core: Start managing serial controllers to enable runtime PM       *******         <---- FAIL (33/1436)
ae62c49c0cef serial: 8250: omap: convert to modern PM ops                                               <---- PASS (0/1593)

> Anyway, that is where my thoughts are at.  But of course we don't even
> yet know if this "test" is even accurately reflecting the OEQA/AB issue!
> So nobody should get their hopes up too early.

I think we can now say with a solid degree of confidence that the PM
commit is responsible for the ttyS1 getty not flushing the login banner
and prompt.  That is why I did over 1000 runs on both the PASS and FAIL
commit boundaries.
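As a rough sanity check on those numbers (my own back-of-envelope math, not from the test harness): treating each run as an independent trial, the 0/1593 result at ae62c49c0cef is wildly inconsistent with the ~2.3% failure rate seen one commit later:

```python
def p_all_clean(fail_rate: float, runs: int) -> float:
    """Probability of observing zero failures in `runs` independent trials
    if the true per-run failure rate is `fail_rate`."""
    return (1.0 - fail_rate) ** runs

# Observed rate at the FAIL commit 84a9582fd203: 33 failures in 1436 runs.
rate = 33 / 1436                      # ~2.3% per run
# If ae62c49c0cef failed at that same rate, 1593 consecutive clean runs
# would be essentially impossible:
p = p_all_clean(rate, 1593)
```

`p` works out to below 1e-15, which is why the run counts above give real confidence in the boundary.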

Unfortunately, there is a lot in that commit; there is also a lot in the
discussion and the twelve iterations before acceptance that I've not yet
looked at:

https://lore.kernel.org/linux-serial/1431f5b4-fb39-483b-9314-ed2b7c118c11@gmail.com/T/#t

My "next steps" at this point would be to break down that relatively
large commit into multiple "sub-commits" in order to use the AB test
harness portion to narrow down which component of the change is the root
cause of the ttyS1 getty stall.

I won't get to that until Saturday at the earliest - going AFK after
sending this.  However, Mikko already opened a dialog with the
linux-serial folks a few days ago:

https://lore.kernel.org/all/ZSPQY6UYg21Z0PnN@nuoska/

Mikko -- if you want to take our pseudo-bisect information (below) and
continue that thread, sharing what we've learned since, and also loop in
Tony (author of v6.4-rc3-31-g84a9582fd203) and others on the CC there,
that might be a good next step to do now, while I am AFK.

A quick technical note on testing - all my kernels have been built in
the same devshell instance, so same path, same compiler, same ".config",
same "test harness" from RP, same qemu, same underlying machine.  Both
Richard and I have also found that load on the host machine running
the qemu binary (ubuntu 20.04 default in my case) influences the rate of
incidence - an otherwise idle machine is less likely to show the issue
than a machine that has other things going on - indicating that a qemu
scheduling component influences the incidence rate.

Paul.
--

Note the asterisks "*****" marking the suspect commit.

v6.5.7													<---- FAIL (2/188)
v6.5.5													<---- FAIL (13/1039)
v6.5.0													<---- FAIL (6/422)
v6.5-rc1												<---- FAIL (2/230)

v6.5-rc1~74:
868a9fd94807 Merge tag 'tty-6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty

$git log --oneline v6.5-rc1~74^2  ^v6.5-rc1~74^1
e534756c9412 tty_audit: make data of tty_audit_log() const						<---- FAIL (3/266)
e64ed44bce43 tty_audit: make tty pointers in exposed functions const
3e540a7bad85 tty_audit: make icanon a bool
94f94810efc4 tty_audit: invert the condition in tty_audit_log()
e1488513abee tty_audit: use kzalloc() in tty_audit_buf_alloc()
777e456852f7 tty_audit: use TASK_COMM_LEN for task comm
a82d62f70854 Revert "8250: add support for ASIX devices with a FIFO bug"
27a826837ec9 serial: atmel: don't enable IRQs prematurely
930cbf92db01 tty: serial: Add Nuvoton ma35d1 serial driver support
e0edfdc15863 tty: serial: fsl_lpuart: add earlycon for imx8ulp platform
639949a7031e tty: serial: imx: fix rs485 rx after tx
e8cc334847db selftests: tty: add selftest for tty timestamp updates
360c11e2258c tty: tty_io: update timestamps on all device nodes
4903fde8047a tty: fix hang on tty device with no_room set
cef09673c35b serial: core: fix -EPROBE_DEFER handling in init
20a41a62618d serial: 8250_omap: Use force_suspend and resume for system suspend
f3710f5e9e1a tty: serial: samsung_tty: Use abs() to simplify some code
832e231cff47 tty: serial: samsung_tty: Fix a memory leak in s3c24xx_serial_getclk() when iterating clk
a9c09546e903 tty: serial: samsung_tty: Fix a memory leak in s3c24xx_serial_getclk() in case of error
d9f59caf94a9 serial: 8250: Apply FSL workarounds also without SERIAL_8250_CONSOLE
1eea99c04555 powerpc/legacy_serial: Handle SERIAL_8250_FSL=n build failures
52861a3be8d6 serial: core: don't kfree device managed data
b6c7ff2693dd serial: 8250_mtk: Simplify clock sequencing and runtime PM
643662d12bc2 serial: st-asc: fix typo in property name
66eff0ef528b powerpc/legacy_serial: Warn about 8250 devices operated without active FSL workarounds
2076b2a4a6b7 Merge 6.4-rc5 into tty-next								<---- FAIL (2/145)
49c80922ff81 serial: core: Fix error handling for serial_core_ctrl_device_add()				<---- FAIL (6/992)
539914240a01 serial: core: Fix probing serial_base_bus devices						<---- FAIL (3/145)
d0a396083e91 serial: core: Don't drop port_mutex in serial_core_remove_one_port
84a9582fd203 serial: core: Start managing serial controllers to enable runtime PM	*******		<---- FAIL (33/1436)
ae62c49c0cef serial: 8250: omap: convert to modern PM ops						<---- PASS (0/1593)
c53aab207622 serial: 8250: omap: Move uart_write() inside PM section
d2d4bd217ccd serial: 8250-fsl: Expand description of the MPC83xx UART's misbehaviour
1b997aef4f54 serial: Indicate fintek option may also be required for RS232 support
20ec397d694b tty: serial: fsl_lpuart: Check the return value of dmaengine_tx_status			<---- PASS (0/316)
0d07703be74f serial: Switch i2c drivers back to use .probe()
d0b309a5d3f4 serial: 8250: synchronize and annotate UART_IER access
25614735a647 serial: 8250: lock port for UART_IER access in omap8250_irq()
8b45503776b6 serial: 8250: lock port for omap8250_restore_regs()
87660fb4041f serial: 8250: lock port for rx_dma() callback
51e45fba14bf serial: core: lock port for start_rx() in uart_resume_port()
ca73a892c5be serial: 8250: lock port for stop_rx() in omap8250_irq()
abcb0cf1f5b2 serial: core: lock port for stop_rx() in uart_suspend_port()
b1207d86169d serial: 8250: lock port in startup() callbacks
0e4daea31d83 Merge 6.4-rc3 into tty-next
d5b3d02d0b10 serial: Make uart_remove_one_port() return void
6bd6cd29c924 serial: stm32: Ignore return value of uart_remove_one_port() in .remove()
72fc578b220c serial: 8250_rt288x: Remove unnecessary UART_REG_UNMAPPED
33e3b0eb7550 serial: 8250_rt288x: Name non-standard divisor latch reg
b334214ea08d serial: 8250: RT288x/Au1xxx code away from core
30c61f53fdf2 serial: 8250: Add dl_read/write, bugs and mapsize into plat_serial8250_port
98658ae8f392 serial: 8250: Document uart_8250_port's ->dl_read/write()
b245aa0cc583 serial: 8250: Change dl_read/write to handle value as u32
9d86719f8769 serial: 8250: Allow using ports higher than SERIAL_8250_RUNTIME_UARTS
db86bb6ed497 serial: 8250: omap: Shut down on remove for console uart
fef4f600319e serial: 8250: omap: Fix life cycle issues for interrupt handlers
398cecc24846 serial: 8250: omap: Fix imprecise external abort for omap_8250_pm()
b9ab22c2bc86 serial: 8250: omap: Fix freeing of resources on failed register
6d8c1fca0a6e serial: pl011: set UART011_CR_RXE in pl011_set_termios after port shutdown
cf9aa72d2f91 tty: serial: fsl_lpuart: optimize the timer based EOP logic

Baseline of tty queue for v6.5 was:
$git show --oneline --stat cf9aa72d2f91~
ac9a78681b92 (tag: v6.4-rc1) Linux 6.4-rc1
 Makefile | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Have also 1000+ runs on Yocto's v6.4.16 based kernel with no fails.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Kernel 6.5 ttyS1 hang with qemu (was Re: [OE-core] Summary of the remaining 6.5 kernel serial issue (and 6.5 summary)
  2023-10-09 10:05     ` Mikko Rapeli
  (?)
@ 2023-10-14  9:41     ` Richard Purdie
  2023-10-14 11:13       ` Richard Purdie
       [not found]       ` <178DF50519C11C84.8679@lists.openembedded.org>
  -1 siblings, 2 replies; 25+ messages in thread
From: Richard Purdie @ 2023-10-14  9:41 UTC (permalink / raw)
  To: Mikko Rapeli, linux-serial, Greg Kroah-Hartman, Tony Lindgren
  Cc: openembedded-core, Bruce Ashfield, Randy MacLeod, Paul Gortmaker

On Mon, 2023-10-09 at 13:05 +0300, Mikko Rapeli wrote:
> Yocto Linux distro maintainer Richard Purdie is seeing a regression or behavior
> change after updating kernel from 6.4 to 6.5. Yocto runs a lot of automated tests with qemu
> where a python test framework configures and spawns qemu (version 8.1) with two
> serial ports and boots a newly built kernel and rootfs there. The python test framework
> tries to detect getty login prompt from ttyS1 while boot time logging
> from qemu, kernel, userspace etc happens on serial console ttyS0. This has worked
> reliably for years and across multiple kernel versions. Now with 6.5 kernel there are
> suddenly frequent failures where ttyS1 doesn't show the login prompt in the python code
> which listens on the qemu socket from target system ttyS1. The target system ttyS0 is working
> and boot time logs and getty prompt are visible there. The ttyS1 logs from test framework show
> that only two characters, CR and LF, make it through. The same happens on an x86_64 host
> running qemu machines for x86_64 and arm64 targets.
> 
> The kernel boot logs from target system with 6.5.5 don't show any errors and getty
> processes are started correctly and they don't see any errors either. Data from
> target system userspace is being written, but seems to sometimes hang inside kernel
> or qemu serial port buffers, and only on the ttyS1 while ttyS0 works reliably all
> the time.
> 
> Example target logs with 6.5.5 via ttyS0:
> 
> https://autobuilder.yocto.io/pub/failed-builds-data/6.5%20kernel/j/qemu_boot_log.20231007084853
> or https://pastebin.com/raw/jRRa2CwW
> 
> ttyS1 logs from the same run:
> https://autobuilder.yocto.io/pub/failed-builds-data/6.5%20kernel/j/qemu_boot_log.20231007084853.2
> 
> Kernel config:
> https://pastebin.com/raw/Lv9Hfeuh
> 
> Do you have any ideas or hints how to debug or possibly fix this?
> 
> The tty layer has seen quite a few major changes between 6.4.14 and 6.5.5 and we're not able
> to pinpoint or bisect the issue, since it is so annoyingly rare.
> 
> Some more details from Richard below.

Brief summary:

We're seeing an issue on x86_64 with 6.5.X where data appears to be
left in the transmit buffer of the second serial port (ttyS1) and not
sent until we trigger it with manual intervention.

Paul Gortmaker did some painful bisection over a few days down to:

serial: core: Start managing serial controllers to enable runtime PM
https://lore.kernel.org/linux-serial/1431f5b4-fb39-483b-9314-ed2b7c118c11@gmail.com/T/#t

More details:

We run very simple plain images under qemu with two serial ports
configured. These are x86_64 images with two 16550A ports with the
standard x86 port addresses which appear as ttyS0 and ttyS1, nothing
special. We run a console and getty on ttyS0 and a getty on ttyS1.
After boot, we wait for a getty to appear on ttyS1, login to it and run
commands.

With 6.5.5 (and latest 6.5.X head as of yesterday), we see an issue say
1% of the time where the getty never appears on ttyS1 even after our
timeout of 1000s.

When this happens, code we've added logs in to the ttyS0 getty and runs
debug commands. We've been able to confirm the getty is running and the
init system doesn't matter (happens with sysvinit and systemd). The
most interesting debug I've seen is this:

root@qemux86-64:~# cat /proc/tty/driver/serial
serinfo:1.0 driver revision:
0: uart:16550A port:000003F8 irq:4 tx:418 rx:43 RTS|CTS|DTR|DSR|CD
1: uart:16550A port:000002F8 irq:3 tx:249 rx:0 RTS|CTS|DTR|DSR|CD
2: uart:unknown port:000003E8 irq:4
3: uart:unknown port:000002E8 irq:3
root@qemux86-64:~# echo helloA > /dev/ttyS1
root@qemux86-64:~# echo helloB > /dev/ttyS0
helloB
root@qemux86-64:~# cat /proc/tty/driver/serial
serinfo:1.0 driver revision:
0: uart:16550A port:000003F8 irq:4 tx:803 rx:121 RTS|CTS|DTR|DSR|CD
1: uart:16550A port:000002F8 irq:3 tx:281 rx:0 RTS|CTS|DTR|DSR|CD
2: uart:unknown port:000003E8 irq:4
3: uart:unknown port:000002E8 irq:3

This is being run after the getty didn't appear for 60s on ttyS1 so
we've logged into ttyS0 and run these commands. We've seen that if it
doesn't appear after 60s, it won't appear after 1000s either.

The tx:249 is interesting as it should be tx:273, 273 being the number
of bytes our successful serial getty prompt has. Once we echo something
to the port (8 bytes), tx: jumps to 281, so it suddenly found our
missing login prompt. This is confirmed with the data appearing on the
port after the echo.

I did try disabling the autosuspend code in the commit above but it
made no difference.

Reproducing this is a pain. I thought I had a local setup doing it
fairly frequently, then spent an hour this morning without it happening
once.

I'm hoping someone might recognise what is going on here? I'm happy to
try and add debug or find further information if that would help narrow
it down.

Cheers,

Richard


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Kernel 6.5 ttyS1 hang with qemu (was Re: [OE-core] Summary of the remaining 6.5 kernel serial issue (and 6.5 summary)
  2023-10-14  9:41     ` Richard Purdie
@ 2023-10-14 11:13       ` Richard Purdie
       [not found]       ` <178DF50519C11C84.8679@lists.openembedded.org>
  1 sibling, 0 replies; 25+ messages in thread
From: Richard Purdie @ 2023-10-14 11:13 UTC (permalink / raw)
  To: Mikko Rapeli, linux-serial, Greg Kroah-Hartman, Tony Lindgren
  Cc: openembedded-core, Bruce Ashfield, Randy MacLeod, Paul Gortmaker

On Sat, 2023-10-14 at 10:41 +0100, Richard Purdie wrote:
> Brief summary:
> 
> We're seeing an issue on x86_64 with 6.5.X where data appears to be
> left in the transmission buffer and not sent to the port on the second
> serial port (ttyS1) until we trigger it with intervention.
> 
> Paul Gortmaker did some painful bisection over a few days down to:
> 
> serial: core: Start managing serial controllers to enable runtime PM
> https://lore.kernel.org/linux-serial/1431f5b4-fb39-483b-9314-ed2b7c118c11@gmail.com/T/#t

Having poked around a bit and knowing nothing about any of this, should
this bit of new code added in the above commit to __uart_start() in
serial_core.c:

	/*
	 * Start TX if enabled, and kick runtime PM. If the device is not
	 * enabled, serial_port_runtime_resume() calls start_tx() again
	 * after enabling the device.
	 */
	if (pm_runtime_active(&port_dev->dev))
		port->ops->start_tx(port);


actually be something like:


	if (pm_runtime_active(&port_dev->dev) || !pm_runtime_enabled(&port_dev->dev))
		port->ops->start_tx(port);


since there are uarts that don't enable runtime PM?

I notice that the 16550A I'm using doesn't set UART_CAP_RPM, and since we
have data left in the xmit buffer (I managed to confirm that), it is as
if during init there is a race between the serial probing and the
getty putting data in the buffer? If tx weren't started, that would
explain things...

Keep in mind I know nothing about pm_runtime or serial/uarts so this is
guesswork on my part.
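To make the suspected race easier to picture, here is a toy model (plain Python, purely illustrative - the class and names are made up, this is not kernel code) of a port whose start_tx is gated on runtime-PM state:

```python
class ToyUartPort:
    """Toy model of the suspected __uart_start() race: writes that arrive
    before the port device is runtime-PM "active" are buffered but never
    kicked out, and only flush when a later write finds the port active."""
    def __init__(self):
        self.pm_active = False   # serial core device not yet resumed
        self.buffer = b""
        self.tx = 0              # bytes actually sent to the "hardware"

    def write(self, data: bytes):
        self.buffer += data
        # The gated start_tx from the suspect commit: skipped if not active.
        if self.pm_active:
            self.tx += len(self.buffer)
            self.buffer = b""

port = ToyUartPort()
port.write(b"login banner...")   # getty writes while probe is still racing
stuck = port.tx                  # nothing transmitted: data is "hogged"

port.pm_active = True            # device eventually becomes active...
port.write(b"helloA\r\n")        # ...but only a later write flushes it
```

A later write while active flushes everything at once, matching the tx:249 to tx:281 jump seen after the echo.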

Cheers,

Richard


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Kernel 6.5 ttyS1 hang with qemu (was Re: [OE-core] Summary of the remaining 6.5 kernel serial issue (and 6.5 summary)
       [not found]       ` <178DF50519C11C84.8679@lists.openembedded.org>
@ 2023-10-15 12:29         ` Richard Purdie
  2023-10-15 15:31           ` Greg Kroah-Hartman
  0 siblings, 1 reply; 25+ messages in thread
From: Richard Purdie @ 2023-10-15 12:29 UTC (permalink / raw)
  To: Mikko Rapeli, linux-serial, Greg Kroah-Hartman, Tony Lindgren
  Cc: openembedded-core, Bruce Ashfield, Randy MacLeod, Paul Gortmaker

On Sat, 2023-10-14 at 12:13 +0100, Richard Purdie via
lists.openembedded.org wrote:
> On Sat, 2023-10-14 at 10:41 +0100, Richard Purdie wrote:
> > Brief summary:
> > 
> > We're seeing an issue on x86_64 with 6.5.X where data appears to be
> > left in the transmission buffer and not sent to the port on the second
> > serial port (ttyS1) until we trigger it with intervention.
> > 
> > Paul Gortmaker did some painful bisection over a few days down to:
> > 
> > serial: core: Start managing serial controllers to enable runtime PM
> > https://lore.kernel.org/linux-serial/1431f5b4-fb39-483b-9314-ed2b7c118c11@gmail.com/T/#t
> 
> Having poked around a bit and knowing nothing about any of this, should
> this bit of new code added in the above commit to __uart_start() in
> serial_core.c:
> 
> 	/*
> 	 * Start TX if enabled, and kick runtime PM. If the device is not
> 	 * enabled, serial_port_runtime_resume() calls start_tx() again
> 	 * after enabling the device.
> 	 */
> 	if (pm_runtime_active(&port_dev->dev))
> 		port->ops->start_tx(port);
> 
> 
> actually be something like:
> 
> 
> 	if (pm_runtime_active(&port_dev->dev) || !pm_runtime_enabled(&port_dev->dev))
> 		port->ops->start_tx(port);
> 
> 
> since there are uarts that don't enable runtime PM?
> 
> I notice that 16550A I'm using doesn't set UART_CAP_RPM and since we
> have data left in the xmit buffer (I managed to confirm that), it is as
> if during init, there is a race between the serial probing and the
> getty putting data in the buffer? If it weren't statrted, that would
> explain things...

The above change didn't work, but what does appear to make a difference
is having this code call start_tx unconditionally, which is what it did
prior to the patch. That does cause a "wake" when there might not be any
data, but the code handles that gracefully.

I therefore suspect this is where the issue lies; the question is what
the right conditions for calling start_tx are.

I'll keep testing that, as the intermittent nature makes it hard to know
whether any change helps or not.

Cheers,

Richard

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Kernel 6.5 ttyS1 hang with qemu (was Re: [OE-core] Summary of the remaining 6.5 kernel serial issue (and 6.5 summary)
  2023-10-15 12:29         ` Richard Purdie
@ 2023-10-15 15:31           ` Greg Kroah-Hartman
  2023-10-15 21:30             ` Richard Purdie
  0 siblings, 1 reply; 25+ messages in thread
From: Greg Kroah-Hartman @ 2023-10-15 15:31 UTC (permalink / raw)
  To: Richard Purdie
  Cc: Mikko Rapeli, linux-serial, Tony Lindgren, openembedded-core,
	Bruce Ashfield, Randy MacLeod, Paul Gortmaker

On Sun, Oct 15, 2023 at 01:29:48PM +0100, Richard Purdie wrote:
> On Sat, 2023-10-14 at 12:13 +0100, Richard Purdie via
> lists.openembedded.org wrote:
> > On Sat, 2023-10-14 at 10:41 +0100, Richard Purdie wrote:
> > > Brief summary:
> > > 
> > > We're seeing an issue on x86_64 with 6.5.X where data appears to be
> > > left in the transmission buffer and not sent to the port on the second
> > > serial port (ttyS1) until we trigger it with intervention.
> > > 
> > > Paul Gortmaker did some painful bisection over a few days down to:
> > > 
> > > serial: core: Start managing serial controllers to enable runtime PM
> > > https://lore.kernel.org/linux-serial/1431f5b4-fb39-483b-9314-ed2b7c118c11@gmail.com/T/#t
> > 
> > Having poked around a bit and knowing nothing about any of this, should
> > this bit of new code added in the above commit to __uart_start() in
> > serial_core.c:
> > 
> > 	/*
> > 	 * Start TX if enabled, and kick runtime PM. If the device is not
> > 	 * enabled, serial_port_runtime_resume() calls start_tx() again
> > 	 * after enabling the device.
> > 	 */
> > 	if (pm_runtime_active(&port_dev->dev))
> > 		port->ops->start_tx(port);
> > 
> > 
> > actually be something like:
> > 
> > 
> > 	if (pm_runtime_active(&port_dev->dev) || !pm_runtime_enabled(&port_dev->dev))
> > 		port->ops->start_tx(port);
> > 
> > 
> > since there are uarts that don't enable runtime PM?
> > 
> > I notice that the 16550A I'm using doesn't set UART_CAP_RPM, and since we
> > have data left in the xmit buffer (I managed to confirm that), it is as
> > if during init there is a race between the serial probing and the
> > getty putting data in the buffer? If it weren't started, that would
> > explain things...
> 
> The above change didn't work but what does appear to be making a
> difference is making this code call start_tx unconditionally which is
> what it did prior to the patch. That does cause a "wake" when there
> might not be any data but the code handles that gracefully.
> 
> I therefore suspect this is where the issue is; the question is
> what the right conditions for calling start_tx are.
> 
> I'll keep going with testing of that, as the intermittent nature makes
> it hard to know whether any change helps or not.

Can you try the patch below?  I just sent it to Linus and it's from Tony
to resolve some other pm issues with the serial port code.

thanks,

greg k-h

From 81a61051e0ce5fd7e09225c0d5985da08c7954a7 Mon Sep 17 00:00:00 2001
From: Tony Lindgren <tony@atomide.com>
Date: Thu, 5 Oct 2023 10:56:42 +0300
Subject: [PATCH 3/4] serial: core: Fix checks for tx runtime PM state

Maximilian reported that surface_serial_hub serdev tx does not work during
system suspend. During system suspend, runtime PM gets disabled in
__device_suspend_late(), and tx is unable to wake-up the serial core port
device that we use to check if tx is safe to start. Johan summarized the
regression noting that serdev tx no longer always works as earlier when the
serdev device is runtime PM active.

The serdev device and the serial core controller devices are siblings of
the serial port hardware device. The runtime PM usage count from serdev
device does not propagate to the serial core device siblings, it only
propagates to the parent.

In addition to the tx issue for suspend, testing for the serial core port
device can cause an unnecessary delay in enabling tx while waiting for the
serial core port device to wake-up. The serial core port device wake-up is
only needed to flush pending tx when the serial port hardware device was
in runtime PM suspended state.

To fix the regression, we need to check the runtime PM state of the parent
serial port hardware device for tx instead of the serial core port device.

As the serial port device drivers may or may not implement runtime PM, we
need to also add a check for pm_runtime_enabled().

Reported-by: Maximilian Luz <luzmaximilian@gmail.com>
Cc: stable <stable@kernel.org>
Fixes: 84a9582fd203 ("serial: core: Start managing serial controllers to enable runtime PM")
Signed-off-by: Tony Lindgren <tony@atomide.com>
Tested-by: Maximilian Luz <luzmaximilian@gmail.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/r/20231005075644.25936-1-tony@atomide.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 drivers/tty/serial/serial_core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/tty/serial/serial_core.c b/drivers/tty/serial/serial_core.c
index ca26a8aef2cb..d5ba6e90bd95 100644
--- a/drivers/tty/serial/serial_core.c
+++ b/drivers/tty/serial/serial_core.c
@@ -156,7 +156,7 @@ static void __uart_start(struct uart_state *state)
 	 * enabled, serial_port_runtime_resume() calls start_tx() again
 	 * after enabling the device.
 	 */
-	if (pm_runtime_active(&port_dev->dev))
+	if (!pm_runtime_enabled(port->dev) || pm_runtime_active(port->dev))
 		port->ops->start_tx(port);
 	pm_runtime_mark_last_busy(&port_dev->dev);
 	pm_runtime_put_autosuspend(&port_dev->dev);
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: Kernel 6.5 ttyS1 hang with qemu (was Re: [OE-core] Summary of the remaining 6.5 kernel serial issue (and 6.5 summary)
  2023-10-15 15:31           ` Greg Kroah-Hartman
@ 2023-10-15 21:30             ` Richard Purdie
  2023-10-16  6:35               ` Tony Lindgren
  0 siblings, 1 reply; 25+ messages in thread
From: Richard Purdie @ 2023-10-15 21:30 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Mikko Rapeli, linux-serial, Tony Lindgren, openembedded-core,
	Bruce Ashfield, Randy MacLeod, Paul Gortmaker

On Sun, 2023-10-15 at 17:31 +0200, Greg Kroah-Hartman wrote:
> On Sun, Oct 15, 2023 at 01:29:48PM +0100, Richard Purdie wrote:
> > On Sat, 2023-10-14 at 12:13 +0100, Richard Purdie via
> > lists.openembedded.org wrote:
> > > On Sat, 2023-10-14 at 10:41 +0100, Richard Purdie wrote:
> > > > Brief summary:
> > > > 
> > > > We're seeing an issue on x86_64 with 6.5.X where data appears to be
> > > > left in the transmission buffer and not sent to the port on the second
> > > > serial port (ttyS1) until we trigger it with intervention.
> > > > 
> > > > Paul Gortmaker did some painful bisection over a few days down to:
> > > > 
> > > > serial: core: Start managing serial controllers to enable runtime PM
> > > > https://lore.kernel.org/linux-serial/1431f5b4-fb39-483b-9314-ed2b7c118c11@gmail.com/T/#t
> > > 
> > > Having poked around a bit and knowing nothing about any of this, should
> > > this bit of new code added in the above commit to __uart_start() in
> > > serial_core.c:
> > > 
> > > 	/*
> > > 	 * Start TX if enabled, and kick runtime PM. If the device is not
> > > 	 * enabled, serial_port_runtime_resume() calls start_tx()
> > > again
> > > 	 * after enabling the device.
> > > 	 */
> > > 	if (pm_runtime_active(&port_dev->dev))
> > > 		port->ops->start_tx(port);
> > > 
> > > 
> > > actually be something like:
> > > 
> > > 
> > > 	if (pm_runtime_active(&port_dev->dev) || !pm_runtime_enabled(&port_dev->dev))
> > > 		port->ops->start_tx(port);
> > > 
> > > 
> > > since there are uarts that don't enable runtime PM?
> > > 
> > > I notice that the 16550A I'm using doesn't set UART_CAP_RPM, and since we
> > > have data left in the xmit buffer (I managed to confirm that), it is as
> > > if during init there is a race between the serial probing and the
> > > getty putting data in the buffer? If it weren't started, that would
> > > explain things...
> > 
> > The above change didn't work but what does appear to be making a
> > difference is making this code call start_tx unconditionally which is
> > what it did prior to the patch. That does cause a "wake" when there
> > might not be any data but the code handles that gracefully.
> > 
> > I therefore suspect this is where the issue is; the question is
> > what the right conditions for calling start_tx are.
> > 
> > I'll keep going with testing of that, as the intermittent nature makes
> > it hard to know whether any change helps or not.
> 
> Can you try the patch below?  I just sent it to Linus and it's from Tony
> to resolve some other pm issues with the serial port code.

Thanks for the pointer to this. I've put it through some testing and
have had one failure so far, so I suspect this isn't enough, unfortunately.

FWIW I was looping the testing on the complete removal of the
conditions and didn't see any failures with that.

Cheers,

Richard

> 
> thanks,
> 
> greg k-h
> 
> From 81a61051e0ce5fd7e09225c0d5985da08c7954a7 Mon Sep 17 00:00:00 2001
> From: Tony Lindgren <tony@atomide.com>
> Date: Thu, 5 Oct 2023 10:56:42 +0300
> Subject: [PATCH 3/4] serial: core: Fix checks for tx runtime PM state
> 
> Maximilian reported that surface_serial_hub serdev tx does not work during
> system suspend. During system suspend, runtime PM gets disabled in
> __device_suspend_late(), and tx is unable to wake-up the serial core port
> device that we use to check if tx is safe to start. Johan summarized the
> regression noting that serdev tx no longer always works as earlier when the
> serdev device is runtime PM active.
> 
> The serdev device and the serial core controller devices are siblings of
> the serial port hardware device. The runtime PM usage count from serdev
> device does not propagate to the serial core device siblings, it only
> propagates to the parent.
> 
> In addition to the tx issue for suspend, testing for the serial core port
> device can cause an unnecessary delay in enabling tx while waiting for the
> serial core port device to wake-up. The serial core port device wake-up is
> only needed to flush pending tx when the serial port hardware device was
> in runtime PM suspended state.
> 
> To fix the regression, we need to check the runtime PM state of the parent
> serial port hardware device for tx instead of the serial core port device.
> 
> As the serial port device drivers may or may not implement runtime PM, we
> need to also add a check for pm_runtime_enabled().
> 
> Reported-by: Maximilian Luz <luzmaximilian@gmail.com>
> Cc: stable <stable@kernel.org>
> Fixes: 84a9582fd203 ("serial: core: Start managing serial controllers to enable runtime PM")
> Signed-off-by: Tony Lindgren <tony@atomide.com>
> Tested-by: Maximilian Luz <luzmaximilian@gmail.com>
> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
> Link: https://lore.kernel.org/r/20231005075644.25936-1-tony@atomide.com
> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> ---
>  drivers/tty/serial/serial_core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/tty/serial/serial_core.c b/drivers/tty/serial/serial_core.c
> index ca26a8aef2cb..d5ba6e90bd95 100644
> --- a/drivers/tty/serial/serial_core.c
> +++ b/drivers/tty/serial/serial_core.c
> @@ -156,7 +156,7 @@ static void __uart_start(struct uart_state *state)
>  	 * enabled, serial_port_runtime_resume() calls start_tx() again
>  	 * after enabling the device.
>  	 */
> -	if (pm_runtime_active(&port_dev->dev))
> +	if (!pm_runtime_enabled(port->dev) || pm_runtime_active(port->dev))
>  		port->ops->start_tx(port);
>  	pm_runtime_mark_last_busy(&port_dev->dev);
>  	pm_runtime_put_autosuspend(&port_dev->dev);


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Kernel 6.5 ttyS1 hang with qemu (was Re: [OE-core] Summary of the remaining 6.5 kernel serial issue (and 6.5 summary)
       [not found]             ` <178C6861D9B097E8.24994@lists.openembedded.org>
@ 2023-10-16  5:59                 ` Mikko Rapeli
  0 siblings, 0 replies; 25+ messages in thread
From: Mikko Rapeli @ 2023-10-16  5:59 UTC (permalink / raw)
  To: linux-serial
  Cc: Paul Gortmaker, Greg Kroah-Hartman, Richard Purdie,
	openembedded-core, Paul Gortmaker, Bruce Ashfield, Randy MacLeod

Hi,

Thanks a lot Paul! To update linux-serial: bisection shows that

84a9582fd203 serial: core: Start managing serial controllers to enable runtime PM

caused this regression in ttyS1 functionality with qemu.

Lots more details below.

Cheers,

-Mikko

On Fri, Oct 13, 2023 at 11:15:35AM -0400, Paul Gortmaker wrote:
> So, here is an update for those not following on IRC:
> 
> [Re: [OE-core] Summary of the remaining 6.5 kernel serial issue (and 6.5 summary)] On 11/10/2023 (Wed 00:50) Paul Gortmaker wrote:
> 
> [snip details of getting RP's chunk of AB python working as a 1% reproducer]
> 
> > Next steps:
> > 
> > Each run takes approximately 20s, and with a 1.5% success factor, a
> > minimum of 150 clean runs is needed before one can say "OK, *maybe* this
> > kernel currently under test is OK.  That translates to over an hour to
> > get a data point you have some faith in as being "good".  [Still a heck
> > of a lot better than doing 150 *builds* though!]
> > 
> > On the other hand, just one instance of missing login on ttyS1 is
> > concretely "bad" and hence the test run can be stopped as soon as one
> > instance of that is seen.
> > 
> > This "good is really maybe perhaps good, but bad is for sure bad" type
> > of bisects that I love to hate.
> > 
> > Since the data points are "expensive", I probably won't just attempt a
> > mindless brute force bisect.  I'd probably do something more like...
> > 
> > --ensure modules (or lack of) don't impact the results, so I don't have
> >   to rebundle rootfs images.
> 
> This turned out to be true - didn't need modules.
> 
> > --maybe see if I can build with host gcc vs. using devshell/sysroot gcc
> >   and still see the issue.  Although really, kernel build time optimization
> >   won't be the bottleneck in this particular case, so not worthwhile?
> 
> I didn't bother with the above, since build time was not significant here.
> 
> > --check vanilla v6.5 (i.e. v6.5.0 -- no linux-stable content) so I know
> >   whether to dig around in v6.5.0 --> v6.5.5, or dig around in the
> >   much much larger v6.4 --> v6.5 additions.  I expect it to be the
> >   latter, i.e. digging through mainline, but who knows.
> 
> As expected, it followed v6.5.0 and linux-stable isn't to blame here.
> 
> > --assuming mainline introduced it, instead of a brute force bisect, I'd
> >   look for merges by Linus from gregKH pull requests for the tty and
> >   serial subsystems in the merge window (v6.4 ---> v6.5-rc1) and test
> >   both sides of that/those merge(s).
> 
> This also largely went according to plan. I started with the tty merge:
> 
> v6.5-rc1~74:
> 868a9fd94807 Merge tag 'tty-6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
> 
> and did burrow down into the 60-odd commits within.  One complication
> was that there were interim mainline merges into the dev stream:
> 
> $ git log --oneline --merges v6.5-rc1~74^2  ^v6.5-rc1~74^1
> 2076b2a4a6b7 Merge 6.4-rc5 into tty-next
> 0e4daea31d83 Merge 6.4-rc3 into tty-next
> 
> ..which usually indicate a dependency on work from another subsystem.
> Fortunately the test pass/fail boundary didn't land on one of those.
> 
> > --if things haven't completely fallen apart by that point, I'd be left
> >   with a relatively small set of subsystem specific commits from a single
> >   pull request which then could be bisected.
> 
> We did somehow manage to get here with just 60 commits to consider.
> 
> > 
> > Sometimes in the past, when I've got down to that small set of commits,
> > the culprit almost jumps out at you when scanning over the shortlogs.
> 
> While I didn't immediately spot the commit via the shortlog, extensive
> test iterations did give us confidence in pinpointing a single commit:
> 
> 84a9582fd203 serial: core: Start managing serial controllers to enable runtime PM       *******         <---- FAIL (33/1436)
> ae62c49c0cef serial: 8250: omap: convert to modern PM ops                                               <---- PASS (0/1593)
> 
> > Anyway, that is where my thoughts are at.  But of course we don't even
> > yet know if this "test" is even accurately reflecting the OEQA/AB issue!
> > So nobody should get their hopes up too early.
> 
> I think now, we can say with a solid degree of confidence that the PM
> commit gets to own the responsibility of causing the ttyS1 getty not
> flushing the login banner and prompt.  Hence why I did over 1000 runs on
> both the PASS and FAIL commit boundary.
> 
> Unfortunately, there is a lot in that commit; also a lot in the
> discussion and the twelve iterations before acceptance that I've not
> even looked at yet:
> 
> https://lore.kernel.org/linux-serial/1431f5b4-fb39-483b-9314-ed2b7c118c11@gmail.com/T/#t
> 
> My "next steps" at this point would be to attempt to break down
> that relatively large commit into multiple "sub-commits" in order to use
> the AB test harness portion to better get an idea of which change
> component is the root cause of the ttyS1 getty stall/seizure.
> 
> I won't get to that until Saturday at the earliest - going AFK after
> sending this.  However, Mikko already opened a dialog with the
> linux-serial folks a few days ago:
> 
> https://lore.kernel.org/all/ZSPQY6UYg21Z0PnN@nuoska/
> 
> Mikko -- if you want to take our pseudo bisect information (below) and
> continue that thread; sharing what we've learned since, and also loop in
> Tony (author of v6.4-rc3-31-g84a9582fd203) and others on the CC there,
> that might be a good next step to do now, while I am AFK.
> 
> A quick technical note on testing - all my kernels have been built in
> the same devshell instance, so same path, same compiler, same ".config",
> same "test harness" from RP, same qemu, same underlying machine.  Both
> Richard and I have also found that loading on the host machine running
> the qemu binary (ubu 20.04 default in my case) influences the rate of
> incidence - an otherwise idle machine is less likely to show the issue
> vs a machine that has other things going on -- indicating a qemu
> scheduling component influences the incident rate.
> 
> Paul.
> --
> 
> Note asterisk "*******" marking the suspect commit.
> 
> v6.5.7													<---- FAIL (2/188)
> v6.5.5													<---- FAIL (13/1039)
> v6.5.0													<---- FAIL (6/422)
> v6.5-rc1												<---- FAIL (2/230)
> 
> v6.5-rc1~74:
> 868a9fd94807 Merge tag 'tty-6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
> 
> $git log --oneline v6.5-rc1~74^2  ^v6.5-rc1~74^1
> e534756c9412 tty_audit: make data of tty_audit_log() const						<---- FAIL (3/266)
> e64ed44bce43 tty_audit: make tty pointers in exposed functions const
> 3e540a7bad85 tty_audit: make icanon a bool
> 94f94810efc4 tty_audit: invert the condition in tty_audit_log()
> e1488513abee tty_audit: use kzalloc() in tty_audit_buf_alloc()
> 777e456852f7 tty_audit: use TASK_COMM_LEN for task comm
> a82d62f70854 Revert "8250: add support for ASIX devices with a FIFO bug"
> 27a826837ec9 serial: atmel: don't enable IRQs prematurely
> 930cbf92db01 tty: serial: Add Nuvoton ma35d1 serial driver support
> e0edfdc15863 tty: serial: fsl_lpuart: add earlycon for imx8ulp platform
> 639949a7031e tty: serial: imx: fix rs485 rx after tx
> e8cc334847db selftests: tty: add selftest for tty timestamp updates
> 360c11e2258c tty: tty_io: update timestamps on all device nodes
> 4903fde8047a tty: fix hang on tty device with no_room set
> cef09673c35b serial: core: fix -EPROBE_DEFER handling in init
> 20a41a62618d serial: 8250_omap: Use force_suspend and resume for system suspend
> f3710f5e9e1a tty: serial: samsung_tty: Use abs() to simplify some code
> 832e231cff47 tty: serial: samsung_tty: Fix a memory leak in s3c24xx_serial_getclk() when iterating clk
> a9c09546e903 tty: serial: samsung_tty: Fix a memory leak in s3c24xx_serial_getclk() in case of error
> d9f59caf94a9 serial: 8250: Apply FSL workarounds also without SERIAL_8250_CONSOLE
> 1eea99c04555 powerpc/legacy_serial: Handle SERIAL_8250_FSL=n build failures
> 52861a3be8d6 serial: core: don't kfree device managed data
> b6c7ff2693dd serial: 8250_mtk: Simplify clock sequencing and runtime PM
> 643662d12bc2 serial: st-asc: fix typo in property name
> 66eff0ef528b powerpc/legacy_serial: Warn about 8250 devices operated without active FSL workarounds
> 2076b2a4a6b7 Merge 6.4-rc5 into tty-next								<---- FAIL (2/145)
> 49c80922ff81 serial: core: Fix error handling for serial_core_ctrl_device_add()				<---- FAIL (6/992)
> 539914240a01 serial: core: Fix probing serial_base_bus devices						<---- FAIL (3/145)
> d0a396083e91 serial: core: Don't drop port_mutex in serial_core_remove_one_port
> 84a9582fd203 serial: core: Start managing serial controllers to enable runtime PM	*******		<---- FAIL (33/1436)
> ae62c49c0cef serial: 8250: omap: convert to modern PM ops						<---- PASS (0/1593)
> c53aab207622 serial: 8250: omap: Move uart_write() inside PM section
> d2d4bd217ccd serial: 8250-fsl: Expand description of the MPC83xx UART's misbehaviour
> 1b997aef4f54 serial: Indicate fintek option may also be required for RS232 support
> 20ec397d694b tty: serial: fsl_lpuart: Check the return value of dmaengine_tx_status			<---- PASS (0/316)
> 0d07703be74f serial: Switch i2c drivers back to use .probe()
> d0b309a5d3f4 serial: 8250: synchronize and annotate UART_IER access
> 25614735a647 serial: 8250: lock port for UART_IER access in omap8250_irq()
> 8b45503776b6 serial: 8250: lock port for omap8250_restore_regs()
> 87660fb4041f serial: 8250: lock port for rx_dma() callback
> 51e45fba14bf serial: core: lock port for start_rx() in uart_resume_port()
> ca73a892c5be serial: 8250: lock port for stop_rx() in omap8250_irq()
> abcb0cf1f5b2 serial: core: lock port for stop_rx() in uart_suspend_port()
> b1207d86169d serial: 8250: lock port in startup() callbacks
> 0e4daea31d83 Merge 6.4-rc3 into tty-next
> d5b3d02d0b10 serial: Make uart_remove_one_port() return void
> 6bd6cd29c924 serial: stm32: Ignore return value of uart_remove_one_port() in .remove()
> 72fc578b220c serial: 8250_rt288x: Remove unnecessary UART_REG_UNMAPPED
> 33e3b0eb7550 serial: 8250_rt288x: Name non-standard divisor latch reg
> b334214ea08d serial: 8250: RT288x/Au1xxx code away from core
> 30c61f53fdf2 serial: 8250: Add dl_read/write, bugs and mapsize into plat_serial8250_port
> 98658ae8f392 serial: 8250: Document uart_8250_port's ->dl_read/write()
> b245aa0cc583 serial: 8250: Change dl_read/write to handle value as u32
> 9d86719f8769 serial: 8250: Allow using ports higher than SERIAL_8250_RUNTIME_UARTS
> db86bb6ed497 serial: 8250: omap: Shut down on remove for console uart
> fef4f600319e serial: 8250: omap: Fix life cycle issues for interrupt handlers
> 398cecc24846 serial: 8250: omap: Fix imprecise external abort for omap_8250_pm()
> b9ab22c2bc86 serial: 8250: omap: Fix freeing of resources on failed register
> 6d8c1fca0a6e serial: pl011: set UART011_CR_RXE in pl011_set_termios after port shutdown
> cf9aa72d2f91 tty: serial: fsl_lpuart: optimize the timer based EOP logic
> 
> Baseline of tty queue for v6.5 was:
> $git show --oneline --stat cf9aa72d2f91~
> ac9a78681b92 (tag: v6.4-rc1) Linux 6.4-rc1
>  Makefile | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> Have also 1000+ runs on Yocto's v6.4.16 based kernel with no fails.

On Mon, Oct 09, 2023 at 01:05:23PM +0300, Mikko Rapeli via lists.openembedded.org wrote:
> Hi linux-serial and Greg,
> 
> Yocto Linux distro maintainer Richard Purdie is seeing a regression or behavior
> change after updating kernel from 6.4 to 6.5. Yocto runs a lot of automated tests with qemu
> where a python test framework configures and spawns qemu (version 8.1) with two
> serial ports and boots a newly built kernel and rootfs there. The python test framework
> tries to detect getty login prompt from ttyS1 while boot time logging
> from qemu, kernel, userspace etc happens on serial console ttyS0. This has worked
> reliably for years and across multiple kernel versions. Now with the 6.5 kernel there are
> suddenly frequent failures where ttyS1 doesn't show the login prompt to the python code
> which listens on the qemu socket for the target system's ttyS1. The target system's ttyS0
> is working, and boot time logs and the getty prompt are visible there. The ttyS1 logs from
> the test framework show that only two characters, CR and LF, make it through. The same
> happens on an x86_64 host running qemu machines for x86_64 and arm64 targets.
> 
> The kernel boot logs from target system with 6.5.5 don't show any errors and getty
> processes are started correctly and they don't see any errors either. Data from
> target system userspace is being written, but seems to sometimes hang inside kernel
> or qemu serial port buffers, and only on ttyS1, while ttyS0 works reliably all
> the time.
> 
> Example target logs with 6.5.5 via ttyS0:
> 
> https://autobuilder.yocto.io/pub/failed-builds-data/6.5%20kernel/j/qemu_boot_log.20231007084853
> or https://pastebin.com/raw/jRRa2CwW
> 
> ttyS1 logs from the same run:
> https://autobuilder.yocto.io/pub/failed-builds-data/6.5%20kernel/j/qemu_boot_log.20231007084853.2
> 
> Kernel config:
> https://pastebin.com/raw/Lv9Hfeuh
> 
> Do you have any ideas or hints how to debug or possibly fix this?
> 
> The tty layer has seen quite a few major changes between 6.4.14 and 6.5.5 and we're not able
> to pinpoint or bisect the issue, since it is so annoyingly rare.
> 
> Some more details from Richard below.
> 
> Cheers,
> 
> -Mikko
> 
> On Sun, Oct 08, 2023 at 09:23:57AM +0100, Richard Purdie wrote:
> > On Sat, 2023-10-07 at 23:05 +0100, Richard Purdie via
> > lists.openembedded.org wrote:
> > > I thought I'd summarise where things are at with the 6.5 kernel.
> > > 
> > > We've fixed:
> > > * the ARM LTP OOM lockup (kernel patch)
> > > * the locale ARM selftest failure which was OOM due to silly buffer 
> > >   allocations in 6.5 (kernel commandline)
> > > * the ARM jitterentropy errors (kernel patch)
> > > * the cryptodev build failures (recipe updated)
> > > 
> > > We've also:
> > > * disabled the strace tests that fail with 6.5.
> > > * made sure the serial ports and getty counts match
> > > * added ttyrun which wraps serial consoles and avoids hacks
> > > * made the qemurunner logging save all the port logs
> > > * made the qemurunner write the binary data it is sent verbatim
> > > * made sure to use nodelay on qemu's tcpserial
> > > 
> > > This leaves an annoying serial console problem where ttyS1 never has
> > > the getty login prompt appear.
> > > 
> > > What we know:
> > > 
> > > * We've only seen this on x86 more recently (yesterday/today) but have
> > > seen it on ARM in the days before that.
> > > 
> > > * It affects both sysvinit and systemd images.
> > > 
> > > * Systemd does print that it started a getty on ttyS0 and ttyS1 when
> > > the failure occurs
> > > 
> > > * There is a getty running according to "ps" when the failure occurs
> > > 
> > > * There are only ever one or three characters received to ttyS1 in the
> > > failure case (0x0d and 0x0a chars, i.e. CR and LF)
> > > 
> > > * It can't be any kind of utf-8 conversion issue since the login prompt
> > > isn't visible in the binary log
> > > 
> > > * the kernel boot logs do show the serial port created with the same
> > > ioport and irq on x86.
> > > 
> > > Previously we did see some logs with timing issues on the ttyS0 port
> > > but the nodelay parameter may have helped with that.
> > > 
> > > There are debug patches in master-next against qemurunner which try and
> > > poke around to gather more debug when things fail using ttyS0.
> > > 
> > > The best failure log we have is now this one:
> > > 
> > > https://autobuilder.yoctoproject.org/typhoon/#/builders/79/builds/5874/steps/14/logs/stdio
> > > 
> > > where I've saved the logs:
> > > 
> > > https://autobuilder.yocto.io/pub/failed-builds-data/6.5%20kernel/j/qemu_boot_log.20231007084853
> > > and
> > > https://autobuilder.yocto.io/pub/failed-builds-data/6.5%20kernel/j/qemu_boot_log.20231007084853.2
> > > 
> > > You can see ttyS1 times out after 1000 seconds and the port only has a
> > > single byte (in the .2 file). The other log shows ps output showing the
> > > getty running for ttyS1.
> > > 
> > > Ideas welcome on where from here. 
> > > 
> > > I've tweaked master-next to keep reading the ttyS1 port after we poke
> > > it from ttyS0 to see if that reveals anything next time it fails (build
> > > running).
> > 
> > Testing overnight with the new debug yielded:
> > 
> > https://autobuilder.yoctoproject.org/typhoon/#/builders/87/builds/5895/steps/14/logs/stdio
> > 
> > The interesting bit being:
> > 
> > """
> > WARNING: core-image-full-cmdline-1.0-r0 do_testimage: Extra read data: 
> > Poky (Yocto Project Reference Distro) 4.2+snapshot-
> > 7cb4ffbd8380b0509d7fac9191095379af321686 qemux86-64 ttyS1
> > 
> > qemux86-64 login: helloA
> > 
> > Poky (Yocto Project Reference Distro) 4.2+snapshot-
> > 7cb4ffbd8380b0509d7fac9191095379af321686 qemux86-64 ttyS1
> > qemux86-64 login: 
> > 
> > """
> > 
> > i.e. the getty didn't appear in 1000s but sometime in shutdown the
> > original prompt, the "helloA" and the new getty prompt did.
> > 
> > So the data *is* there but stuck in a buffer somehow. Kernel or qemu
> > side, I don't know.
> > 
> > Cheers,
> > 
> > Richard
> > 
> > 
> > 
> > 



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Kernel 6.5 ttyS1 hang with qemu (was Re: [OE-core] Summary of the remaining 6.5 kernel serial issue (and 6.5 summary)
@ 2023-10-16  5:59                 ` Mikko Rapeli
  0 siblings, 0 replies; 25+ messages in thread
From: Mikko Rapeli @ 2023-10-16  5:59 UTC (permalink / raw)
  To: linux-serial
  Cc: Paul Gortmaker, Greg Kroah-Hartman, Richard Purdie,
	openembedded-core, Paul Gortmaker, Bruce Ashfield, Randy MacLeod

Hi,

Thanks a lot Paul! Updating linux-serial that bisection shows

84a9582fd203 serial: core: Start managing serial controllers to enable runtime PM

caused this regression in ttyS1 functionality with qemu.

Lots of more details below.

Cheers,

-Mikko

On Fri, Oct 13, 2023 at 11:15:35AM -0400, Paul Gortmaker wrote:
> So, here is an update for those not following on IRC:
> 
> [Re: [OE-core] Summary of the remaining 6.5 kernel serial issue (and 6.5 summary)] On 11/10/2023 (Wed 00:50) Paul Gortmaker wrote:
> 
> [snip details of getting RP's chunk of AB python working as a 1% reproducer]
> 
> > Next steps:
> > 
> > Each run takes approximately 20s, and with a 1.5% success factor, a
> > minimum of 150 clean runs is needed before one can say "OK, *maybe* this
> > kernel currently under test is OK.  That translates to over an hour to
> > get a data point you have some faith in as being "good".  [Still a heck
> > of a lot better than doing 150 *builds* though!]
> > 
> > On the other hand, just one instance of missing login on ttyS1 is
> > concretely "bad" and hence the test run can be stopped as soon as one
> > instance of that is seen.
> > 
> > This "good is really maybe perhaps good, but bad is for sure bad" type
> > of bisects that I love to hate.
> > 
> > Since the data points are "expensive", I probably won't just attempt a
> > mindless brute force bisect.  I'd probably do something more like...
> > 
> > --ensure modules (or lack of) don't impact the results, so I don't have
> >   to rebundle rootfs images.
> 
> This turned out to be true - didn't need modules.
> 
> > --maybe see if I can build with host gcc vs. using devshell/sysroot gcc
> >   and still see the issue.  Although really, kernel build time optimization
> >   won't be the bottleneck in this particular case, so not worthwhile?
> 
> I didn't bother with the above, since build time was not significant here.
> 
> > --check vanilla v6.5 (i.e. v6.5.0 -- no linux-stable content) so I know
> >   whether to dig around in v6.5.0 --> v6.5.5, or dig around in the
> >   much much larger v6.4 --> v6.5 additions.  I expect it to be the
> >   latter, i.e. digging through mainline, but who knows.
> 
> As expected, it followed v6.5.0 and linux-stable isn't to blame here.
> 
> > --assuming mainline introduced it, instead of a brute force bisect, I'd
> >   look for merges by Linus from gregKH pull requests for the tty and
> >   serial subsystems in the merge window (v6.4 ---> v6.5-rc1) and test
> >   both sides of that/those merge(s).
> 
> This also largely went according to plan. I started with the tty merge:
> 
> v6.5-rc1~74:
> 868a9fd94807 Merge tag 'tty-6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
> 
> and did burrow down into the 60-odd commits within.  One complication
> was that there were interim mainline merges into the dev stream:
> 
> $ git log --oneline --merges v6.5-rc1~74^2  ^v6.5-rc1~74^1
> 2076b2a4a6b7 Merge 6.4-rc5 into tty-next
> 0e4daea31d83 Merge 6.4-rc3 into tty-next
> 
> ..which usually indicate a dependency on work from another subsystem.
> Fortunately the test pass/fail boundary didn't land on one of those.
> 
> > --if things haven't completely fallen apart by that point, I'd be left
> >   with a relatively small set of subsystem specific commits from a single
> >   pull request which then could be bisected.
> 
> We did somehow manage to get here with just 60 commits to consider.
> 
> > 
> > Sometimes in the past, when I've got down to that small set of commits,
> > the culprit almost jumps out at you when scanning over the shortlogs.
> 
> While I didn't immediately spot the commit via the shortlog, extensive
> test iterations did give us confidence in pinpointing a single commit:
> 
> 84a9582fd203 serial: core: Start managing serial controllers to enable runtime PM       *******         <---- FAIL (33/1436)
> ae62c49c0cef serial: 8250: omap: convert to modern PM ops                                               <---- PASS (0/1593)
> 
> > Anyway, that is where my thoughts are at.  But of course we don't even
> > yet know if this "test" is even accurately reflecting the OEQA/AB issue!
> > So nobody should get their hopes up too early.
> 
> I think we can now say with a solid degree of confidence that the PM
> commit is responsible for the ttyS1 getty not flushing the login
> banner and prompt, which is why I did over 1000 runs on both the PASS
> and FAIL commit boundaries.
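As a back-of-envelope check on those counts: if the PASS commit had the same underlying failure rate as the 33/1436 observed at the FAIL commit, the chance of seeing 0 failures in 1593 runs would be negligible. A quick sketch, not part of the original analysis:

```shell
# probability of 0 failures in 1593 runs, assuming a per-run
# failure probability of 33/1436 (the rate seen at the FAIL commit)
awk 'BEGIN { p = 33/1436; printf "%.1e\n", (1 - p)^1593 }'
```

The result is on the order of 1e-16, so the PASS/FAIL boundary between these two commits is statistically very solid.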
> 
> Unfortunately, there is a lot in that commit; there is also a lot in the
> discussion and the twelve iterations before acceptance that I've not
> yet looked at:
> 
> https://lore.kernel.org/linux-serial/1431f5b4-fb39-483b-9314-ed2b7c118c11@gmail.com/T/#t
> 
> My "next steps" at this point would be to break down that relatively
> large commit into multiple "sub-commits" and use the AB test harness
> to narrow down which component of the change is the root cause of the
> ttyS1 getty stall/seizure.
> 
> I won't get to that until Saturday at the earliest - going AFK after
> sending this.  However, Mikko already opened a dialog with the
> linux-serial folks a few days ago:
> 
> https://lore.kernel.org/all/ZSPQY6UYg21Z0PnN@nuoska/
> 
> Mikko -- if you want to take our pseudo bisect information (below) and
> continue that thread; sharing what we've learned since, and also loop in
> Tony (author of v6.4-rc3-31-g84a9582fd203) and others on the CC there,
> that might be a good next step to do now, while I am AFK.
> 
> A quick technical note on testing - all my kernels have been built in
> the same devshell instance, so same path, same compiler, same ".config",
> same "test harness" from RP, same qemu, same underlying machine.  Both
> Richard and I have also found that load on the host machine running
> the qemu binary (Ubuntu 20.04 default in my case) influences the rate of
> incidence - an otherwise idle machine is less likely to show the issue
> than a machine that has other things going on -- indicating a qemu
> scheduling component influences the incident rate.
> 
> Paul.
> --
> 
> Note the asterisks "*****" marking the suspect commit.
> 
> v6.5.7													<---- FAIL (2/188)
> v6.5.5													<---- FAIL (13/1039)
> v6.5.0													<---- FAIL (6/422)
> v6.5-rc1												<---- FAIL (2/230)
> 
> v6.5-rc1~74:
> 868a9fd94807 Merge tag 'tty-6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
> 
> $ git log --oneline v6.5-rc1~74^2  ^v6.5-rc1~74^1
> e534756c9412 tty_audit: make data of tty_audit_log() const						<---- FAIL (3/266)
> e64ed44bce43 tty_audit: make tty pointers in exposed functions const
> 3e540a7bad85 tty_audit: make icanon a bool
> 94f94810efc4 tty_audit: invert the condition in tty_audit_log()
> e1488513abee tty_audit: use kzalloc() in tty_audit_buf_alloc()
> 777e456852f7 tty_audit: use TASK_COMM_LEN for task comm
> a82d62f70854 Revert "8250: add support for ASIX devices with a FIFO bug"
> 27a826837ec9 serial: atmel: don't enable IRQs prematurely
> 930cbf92db01 tty: serial: Add Nuvoton ma35d1 serial driver support
> e0edfdc15863 tty: serial: fsl_lpuart: add earlycon for imx8ulp platform
> 639949a7031e tty: serial: imx: fix rs485 rx after tx
> e8cc334847db selftests: tty: add selftest for tty timestamp updates
> 360c11e2258c tty: tty_io: update timestamps on all device nodes
> 4903fde8047a tty: fix hang on tty device with no_room set
> cef09673c35b serial: core: fix -EPROBE_DEFER handling in init
> 20a41a62618d serial: 8250_omap: Use force_suspend and resume for system suspend
> f3710f5e9e1a tty: serial: samsung_tty: Use abs() to simplify some code
> 832e231cff47 tty: serial: samsung_tty: Fix a memory leak in s3c24xx_serial_getclk() when iterating clk
> a9c09546e903 tty: serial: samsung_tty: Fix a memory leak in s3c24xx_serial_getclk() in case of error
> d9f59caf94a9 serial: 8250: Apply FSL workarounds also without SERIAL_8250_CONSOLE
> 1eea99c04555 powerpc/legacy_serial: Handle SERIAL_8250_FSL=n build failures
> 52861a3be8d6 serial: core: don't kfree device managed data
> b6c7ff2693dd serial: 8250_mtk: Simplify clock sequencing and runtime PM
> 643662d12bc2 serial: st-asc: fix typo in property name
> 66eff0ef528b powerpc/legacy_serial: Warn about 8250 devices operated without active FSL workarounds
> 2076b2a4a6b7 Merge 6.4-rc5 into tty-next								<---- FAIL (2/145)
> 49c80922ff81 serial: core: Fix error handling for serial_core_ctrl_device_add()				<---- FAIL (6/992)
> 539914240a01 serial: core: Fix probing serial_base_bus devices						<---- FAIL (3/145)
> d0a396083e91 serial: core: Don't drop port_mutex in serial_core_remove_one_port
> 84a9582fd203 serial: core: Start managing serial controllers to enable runtime PM	*******		<---- FAIL (33/1436)
> ae62c49c0cef serial: 8250: omap: convert to modern PM ops						<---- PASS (0/1593)
> c53aab207622 serial: 8250: omap: Move uart_write() inside PM section
> d2d4bd217ccd serial: 8250-fsl: Expand description of the MPC83xx UART's misbehaviour
> 1b997aef4f54 serial: Indicate fintek option may also be required for RS232 support
> 20ec397d694b tty: serial: fsl_lpuart: Check the return value of dmaengine_tx_status			<---- PASS (0/316)
> 0d07703be74f serial: Switch i2c drivers back to use .probe()
> d0b309a5d3f4 serial: 8250: synchronize and annotate UART_IER access
> 25614735a647 serial: 8250: lock port for UART_IER access in omap8250_irq()
> 8b45503776b6 serial: 8250: lock port for omap8250_restore_regs()
> 87660fb4041f serial: 8250: lock port for rx_dma() callback
> 51e45fba14bf serial: core: lock port for start_rx() in uart_resume_port()
> ca73a892c5be serial: 8250: lock port for stop_rx() in omap8250_irq()
> abcb0cf1f5b2 serial: core: lock port for stop_rx() in uart_suspend_port()
> b1207d86169d serial: 8250: lock port in startup() callbacks
> 0e4daea31d83 Merge 6.4-rc3 into tty-next
> d5b3d02d0b10 serial: Make uart_remove_one_port() return void
> 6bd6cd29c924 serial: stm32: Ignore return value of uart_remove_one_port() in .remove()
> 72fc578b220c serial: 8250_rt288x: Remove unnecessary UART_REG_UNMAPPED
> 33e3b0eb7550 serial: 8250_rt288x: Name non-standard divisor latch reg
> b334214ea08d serial: 8250: RT288x/Au1xxx code away from core
> 30c61f53fdf2 serial: 8250: Add dl_read/write, bugs and mapsize into plat_serial8250_port
> 98658ae8f392 serial: 8250: Document uart_8250_port's ->dl_read/write()
> b245aa0cc583 serial: 8250: Change dl_read/write to handle value as u32
> 9d86719f8769 serial: 8250: Allow using ports higher than SERIAL_8250_RUNTIME_UARTS
> db86bb6ed497 serial: 8250: omap: Shut down on remove for console uart
> fef4f600319e serial: 8250: omap: Fix life cycle issues for interrupt handlers
> 398cecc24846 serial: 8250: omap: Fix imprecise external abort for omap_8250_pm()
> b9ab22c2bc86 serial: 8250: omap: Fix freeing of resources on failed register
> 6d8c1fca0a6e serial: pl011: set UART011_CR_RXE in pl011_set_termios after port shutdown
> cf9aa72d2f91 tty: serial: fsl_lpuart: optimize the timer based EOP logic
> 
> Baseline of tty queue for v6.5 was:
> $ git show --oneline --stat cf9aa72d2f91~
> ac9a78681b92 (tag: v6.4-rc1) Linux 6.4-rc1
>  Makefile | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> Have also 1000+ runs on Yocto's v6.4.16 based kernel with no fails.

On Mon, Oct 09, 2023 at 01:05:23PM +0300, Mikko Rapeli via lists.openembedded.org wrote:
> Hi linux-serial and Greg,
> 
> Yocto Linux distro maintainer Richard Purdie is seeing a regression or behavior
> change after updating kernel from 6.4 to 6.5. Yocto runs a lot of automated tests with qemu
> where a python test framework configures and spawns qemu (version 8.1) with two
> serial ports and boots a newly built kernel and rootfs there. The python test framework
> tries to detect the getty login prompt on ttyS1 while boot time logging
> from qemu, kernel, userspace etc happens on serial console ttyS0. This has worked
> reliably for years and across multiple kernel versions. Now with the 6.5 kernel there are
> suddenly frequent failures where ttyS1 doesn't show the login prompt to the python code
> which listens on the qemu socket for the target system's ttyS1. The target system ttyS0 is working
> and boot time logs and the getty prompt are visible there. The ttyS1 logs from the test framework show
> that only two characters, CR and LF, make it through. The same happens on an x86_64 host
> running qemu machines for x86_64 and arm64 targets.
> 
> The kernel boot logs from the target system with 6.5.5 don't show any errors; the getty
> processes are started correctly and they don't see any errors either. Data from
> target system userspace is being written, but sometimes seems to hang inside kernel
> or qemu serial port buffers, and only on ttyS1 while ttyS0 works reliably all
> the time.
> 
> Example target logs with 6.5.5 via ttyS0:
> 
> https://autobuilder.yocto.io/pub/failed-builds-data/6.5%20kernel/j/qemu_boot_log.20231007084853
> or https://pastebin.com/raw/jRRa2CwW
> 
> ttyS1 logs from the same run:
> https://autobuilder.yocto.io/pub/failed-builds-data/6.5%20kernel/j/qemu_boot_log.20231007084853.2
> 
> Kernel config:
> https://pastebin.com/raw/Lv9Hfeuh
> 
> Do you have any ideas or hints how to debug or possibly fix this?
> 
> The tty layer has seen quite a few major changes between 6.4.14 and 6.5.5 and we're not able
> to pinpoint or bisect the issue, since it is so annoyingly rare.
> 
> Some more details from Richard below.
> 
> Cheers,
> 
> -Mikko
> 
> On Sun, Oct 08, 2023 at 09:23:57AM +0100, Richard Purdie wrote:
> > On Sat, 2023-10-07 at 23:05 +0100, Richard Purdie via
> > lists.openembedded.org wrote:
> > > I thought I'd summarise where things are at with the 6.5 kernel.
> > > 
> > > We've fixed:
> > > * the ARM LTP OOM lockup (kernel patch)
> > > * the locale ARM selftest failure which was OOM due to silly buffer
> > >   allocations in 6.5 (kernel commandline)
> > > * the ARM jitterentropy errors (kernel patch)
> > > * the cryptodev build failures (recipe updated)
> > > 
> > > We've also:
> > > * disabled the strace tests that fail with 6.5.
> > > * made sure the serial ports and getty counts match
> > > * added ttyrun which wraps serial consoles and avoids hacks
> > > * made the qemurunner logging save all the port logs
> > > * made the qemurunner write the binary data it is sent verbatim
> > > * made sure to use nodelay on qemu's tcpserial
> > > 
> > > This leaves an annoying serial console problem where ttyS1 never has
> > > the getty login prompt appear.
> > > 
> > > What we know:
> > > 
> > > * We've only seen this on x86 more recently (yesterday/today) but have
> > > seen it on ARM in the days before that.
> > > 
> > > * It affects both sysvinit and systemd images.
> > > 
> > > * Systemd does print that it started a getty on ttyS0 and ttyS1 when
> > > the failure occurs
> > > 
> > > * There is a getty running according to "ps" when the failure occurs
> > > 
> > > * There are only ever one or three characters received to ttyS1 in the
> > > failure case (0x0d and 0x0a chars, i.e. CR and LF)
> > > 
> > > * It can't be any kind of utf-8 conversion issue since the login prompt
> > > isn't visible in the binary log
> > > 
> > > * the kernel boot logs do show the serial port created with the same
> > > ioport and irq on x86.
> > > 
> > > Previously we did see some logs with timing issues on the ttyS0 port
> > > but the nodelay parameter may have helped with that.
> > > 
> > > There are debug patches in master-next against qemurunner which try and
> > > poke around to gather more debug when things fail using ttyS0.
> > > 
> > > The best failure log we have is now this one:
> > > 
> > > https://autobuilder.yoctoproject.org/typhoon/#/builders/79/builds/5874/steps/14/logs/stdio
> > > 
> > > where I've saved the logs:
> > > 
> > > https://autobuilder.yocto.io/pub/failed-builds-data/6.5%20kernel/j/qemu_boot_log.20231007084853
> > > and
> > > https://autobuilder.yocto.io/pub/failed-builds-data/6.5%20kernel/j/qemu_boot_log.20231007084853.2
> > > 
> > > You can see ttyS1 times out after 1000 seconds and the port only has a
> > > single byte (in the .2 file). The other log shows ps output showing the
> > > getty running for ttyS1.
> > > 
> > > Ideas welcome on where to go from here.
> > > 
> > > I've tweaked master-next to keep reading the ttyS1 port after we poke
> > > it from ttyS0 to see if that reveals anything next time it fails (build
> > > running).
> > 
> > Testing overnight with the new debug yielded:
> > 
> > https://autobuilder.yoctoproject.org/typhoon/#/builders/87/builds/5895/steps/14/logs/stdio
> > 
> > The interesting bit being:
> > 
> > """
> > WARNING: core-image-full-cmdline-1.0-r0 do_testimage: Extra read data: 
> > Poky (Yocto Project Reference Distro) 4.2+snapshot-
> > 7cb4ffbd8380b0509d7fac9191095379af321686 qemux86-64 ttyS1
> > 
> > qemux86-64 login: helloA
> > 
> > Poky (Yocto Project Reference Distro) 4.2+snapshot-
> > 7cb4ffbd8380b0509d7fac9191095379af321686 qemux86-64 ttyS1
> > qemux86-64 login: 
> > 
> > """
> > 
> > i.e. the getty prompt didn't appear in 1000s, but sometime during shutdown
> > the original prompt, the "helloA" and the new getty prompt did.
> > 
> > So the data *is* there but stuck in a buffer somehow. Kernel or qemu
> > side, I don't know.
> > 
> > Cheers,
> > 
> > Richard
> > 
> > 
> > 
> > 

> 
> -=-=-=-=-=-=-=-=-=-=-=-
> Links: You receive all messages sent to this group.
> View/Reply Online (#188831): https://lists.openembedded.org/g/openembedded-core/message/188831
> Mute This Topic: https://lists.openembedded.org/mt/101849211/7159507
> Group Owner: openembedded-core+owner@lists.openembedded.org
> Unsubscribe: https://lists.openembedded.org/g/openembedded-core/unsub [mikko.rapeli@linaro.org]
> -=-=-=-=-=-=-=-=-=-=-=-
> 



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Kernel 6.5 ttyS1 hang with qemu (was Re: [OE-core] Summary of the remaining 6.5 kernel serial issue (and 6.5 summary)
  2023-10-15 21:30             ` Richard Purdie
@ 2023-10-16  6:35               ` Tony Lindgren
  2023-10-16  7:16                 ` Mikko Rapeli
  0 siblings, 1 reply; 25+ messages in thread
From: Tony Lindgren @ 2023-10-16  6:35 UTC (permalink / raw)
  To: Richard Purdie
  Cc: Greg Kroah-Hartman, Mikko Rapeli, linux-serial,
	openembedded-core, Bruce Ashfield, Randy MacLeod, Paul Gortmaker

* Richard Purdie <richard.purdie@linuxfoundation.org> [231015 21:30]:
> On Sun, 2023-10-15 at 17:31 +0200, Greg Kroah-Hartman wrote:
> > Can you try the patch below?  I just sent it to Linus and it's from Tony
> > to resolve some other pm issues with the serial port code.
> 
> Thanks for the pointer to this. I've put it through some testing and
> had one failure so far so I suspect this isn't enough unfortunately.
>
> FWIW I was looping the testing on the complete removal of the
> conditions and didn't see any failures with that.

Care to clarify what's the failing test now?

Is the issue still the second port not always coming up after boot or
something else?

Regards,

Tony


* Re: Kernel 6.5 ttyS1 hang with qemu (was Re: [OE-core] Summary of the remaining 6.5 kernel serial issue (and 6.5 summary)
  2023-10-16  6:35               ` Tony Lindgren
@ 2023-10-16  7:16                 ` Mikko Rapeli
  2023-10-16  7:23                   ` Tony Lindgren
  0 siblings, 1 reply; 25+ messages in thread
From: Mikko Rapeli @ 2023-10-16  7:16 UTC (permalink / raw)
  To: Tony Lindgren
  Cc: Richard Purdie, Greg Kroah-Hartman, linux-serial,
	openembedded-core, Bruce Ashfield, Randy MacLeod, Paul Gortmaker

Hi,

On Mon, Oct 16, 2023 at 09:35:01AM +0300, Tony Lindgren wrote:
> * Richard Purdie <richard.purdie@linuxfoundation.org> [231015 21:30]:
> > On Sun, 2023-10-15 at 17:31 +0200, Greg Kroah-Hartman wrote:
> > > Can you try the patch below?  I just sent it to Linus and it's from Tony
> > > to resolve some other pm issues with the serial port code.
> > 
> > Thanks for the pointer to this. I've put it through some testing and
> > had one failure so far so I suspect this isn't enough unfortunately.
> >
> > FWIW I was looping the testing on the complete removal of the
> > conditions and didn't see any failures with that.
> 
> Care to clarify what's the failing test now?
> 
> Is the issue still the second port not always coming up after boot or
> something else?

Yes, data from the ttyS1 getty is not coming through from kernel and qemu to
the test framework looking for login prompt after qemu machine boot.
Workarounds like sending "\n\n" from the test framework through qemu to ttyS1
or "echo helloB > /dev/ttyS1" via working ttyS0 don't seem to help and wake
it up.

Cheers,

-Mikko


* Re: Kernel 6.5 ttyS1 hang with qemu (was Re: [OE-core] Summary of the remaining 6.5 kernel serial issue (and 6.5 summary)
  2023-10-16  7:16                 ` Mikko Rapeli
@ 2023-10-16  7:23                   ` Tony Lindgren
  2023-10-16  8:09                     ` Mikko Rapeli
  2023-10-16  8:10                     ` Richard Purdie
  0 siblings, 2 replies; 25+ messages in thread
From: Tony Lindgren @ 2023-10-16  7:23 UTC (permalink / raw)
  To: Mikko Rapeli
  Cc: Richard Purdie, Greg Kroah-Hartman, linux-serial,
	openembedded-core, Bruce Ashfield, Randy MacLeod, Paul Gortmaker

* Mikko Rapeli <mikko.rapeli@linaro.org> [231016 07:16]:
> Hi,
> 
> On Mon, Oct 16, 2023 at 09:35:01AM +0300, Tony Lindgren wrote:
> > * Richard Purdie <richard.purdie@linuxfoundation.org> [231015 21:30]:
> > > On Sun, 2023-10-15 at 17:31 +0200, Greg Kroah-Hartman wrote:
> > > > Can you try the patch below?  I just sent it to Linus and it's from Tony
> > > > to resolve some other pm issues with the serial port code.
> > > 
> > > Thanks for the pointer to this. I've put it through some testing and
> > > had one failure so far so I suspect this isn't enough unfortunately.
> > >
> > > FWIW I was looping the testing on the complete removal of the
> > > conditions and didn't see any failures with that.
> > 
> > Care to clarify what's the failing test now?
> > 
> > Is the issue still the second port not always coming up after boot or
> > something else?
> 
> Yes, data from the ttyS1 getty is not coming through from kernel and qemu to
> the test framework looking for login prompt after qemu machine boot.
> Workarounds like sending "\n\n" from the test framework through qemu to ttyS1
> or "echo helloB > /dev/ttyS1" via working ttyS0 don't seem to help and wake
> it up.

OK so for trying to reproduce this with qemu, is this with the default uarts
or with some -device pci-serial-2x type options?

Regards,

Tony


* Re: Kernel 6.5 ttyS1 hang with qemu (was Re: [OE-core] Summary of the remaining 6.5 kernel serial issue (and 6.5 summary)
  2023-10-16  7:23                   ` Tony Lindgren
@ 2023-10-16  8:09                     ` Mikko Rapeli
  2023-10-16  8:10                     ` Richard Purdie
  1 sibling, 0 replies; 25+ messages in thread
From: Mikko Rapeli @ 2023-10-16  8:09 UTC (permalink / raw)
  To: Tony Lindgren
  Cc: Richard Purdie, Greg Kroah-Hartman, linux-serial,
	openembedded-core, Bruce Ashfield, Randy MacLeod, Paul Gortmaker

Hi,

On Mon, Oct 16, 2023 at 10:23:52AM +0300, Tony Lindgren wrote:
> * Mikko Rapeli <mikko.rapeli@linaro.org> [231016 07:16]:
> > On Mon, Oct 16, 2023 at 09:35:01AM +0300, Tony Lindgren wrote:
> > > * Richard Purdie <richard.purdie@linuxfoundation.org> [231015 21:30]:
> > > > On Sun, 2023-10-15 at 17:31 +0200, Greg Kroah-Hartman wrote:
> > > > > Can you try the patch below?  I just sent it to Linus and it's from Tony
> > > > > to resolve some other pm issues with the serial port code.
> > > > 
> > > > Thanks for the pointer to this. I've put it through some testing and
> > > > had one failure so far so I suspect this isn't enough unfortunately.
> > > >
> > > > FWIW I was looping the testing on the complete removal of the
> > > > conditions and didn't see any failures with that.
> > > 
> > > Care to clarify what's the failing test now?
> > > 
> > > Is the issue still the second port not always coming up after boot or
> > > something else?
> > 
> > Yes, data from the ttyS1 getty is not coming through from kernel and qemu to
> > the test framework looking for login prompt after qemu machine boot.
> > Workarounds like sending "\n\n" from the test framework through qemu to ttyS1
> > or "echo helloB > /dev/ttyS1" via working ttyS0 don't seem to help and wake
> > it up.
> 
> OK so for trying to reproduce this with qemu, is this with the default uarts
> or with some -device pci-serial-2x type options?

Here is how the Yocto test framework starts qemu:

qemu-system-x86_64 -device virtio-net-pci,netdev=net0,mac=52:54:00:12:35:07 -netdev user,id=net0,hostfwd=tcp:127.0.0.1:2225-:22,hostfwd=tcp:127.0.0.1:2325-:23,tftp=/home/builder/src/base/build_test/tmp/deploy/images/qemux86-64 -object rng-random,filename=/dev/urandom,id=rng0 -device virtio-rng-pci,rng=rng0 -drive file=/home/builder/src/base/build_test/tmp/deploy/images/qemux86-64/core-image-ptest-openssh-qemux86-64.rootfs.ext4,if=virtio,format=raw -usb -device usb-tablet -usb -device usb-kbd   -cpu IvyBridge -machine q35,i8042=off -smp 4 -enable-kvm -m 1024 -serial tcp:127.0.0.1:46313 -serial tcp:127.0.0.1:53891  -pidfile /home/builder/src/base/build_test/pidfile_1926373  -S -qmp unix:./.lzkynxe1,server,wait -qmp unix:./.qcpdyz4j,server,nowait -snapshot -nographic  -kernel /home/builder/src/base/build_test/tmp/deploy/images/qemux86-64/bzImage -append 'root=/dev/vda rw  ip=dhcp console=ttyS0 console=ttyS1 oprofile.timer=1 tsc=reliable no_timer_check rcupdate.rcu_expedited=1 swiotlb=1  printk.time=1'

Example boot log when this ttyS1 failure happens is here:

https://pastebin.com/raw/jRRa2CwW

Cheers,

-Mikko


* Re: Kernel 6.5 ttyS1 hang with qemu (was Re: [OE-core] Summary of the remaining 6.5 kernel serial issue (and 6.5 summary)
  2023-10-16  7:23                   ` Tony Lindgren
  2023-10-16  8:09                     ` Mikko Rapeli
@ 2023-10-16  8:10                     ` Richard Purdie
  2023-10-17  6:56                       ` Tony Lindgren
  1 sibling, 1 reply; 25+ messages in thread
From: Richard Purdie @ 2023-10-16  8:10 UTC (permalink / raw)
  To: Tony Lindgren, Mikko Rapeli
  Cc: Greg Kroah-Hartman, linux-serial, openembedded-core,
	Bruce Ashfield, Randy MacLeod, Paul Gortmaker

On Mon, 2023-10-16 at 10:23 +0300, Tony Lindgren wrote:
> * Mikko Rapeli <mikko.rapeli@linaro.org> [231016 07:16]:
> > Hi,
> > 
> > On Mon, Oct 16, 2023 at 09:35:01AM +0300, Tony Lindgren wrote:
> > > * Richard Purdie <richard.purdie@linuxfoundation.org> [231015 21:30]:
> > > > On Sun, 2023-10-15 at 17:31 +0200, Greg Kroah-Hartman wrote:
> > > > > Can you try the patch below?  I just sent it to Linus and it's from Tony
> > > > > to resolve some other pm issues with the serial port code.
> > > > 
> > > > Thanks for the pointer to this. I've put it through some testing and
> > > > had one failure so far so I suspect this isn't enough unfortunately.
> > > > 
> > > > FWIW I was looping the testing on the complete removal of the
> > > > conditions and didn't see any failures with that.
> > > 
> > > Care to clarify what's the failing test now?

The failure is where the data never all makes it through ttyS1 and the
login prompt doesn't appear. In our CI it looks like this:

https://autobuilder.yoctoproject.org/typhoon/#/builders/145/builds/711/steps/12/logs/stdio

Click the magnifying glass to make the log searchable, then search for "Target
didn't reach login banner in 1000 seconds". You'll then see the echo
helloB to /dev/ttyS1, then the "Extra log data read:" containing the
getty for ttyS1 which woke up.

> > > 
> > > Is the issue still the second port not always coming up after boot or
> > > something else?
> > 
> > Yes, data from the ttyS1 getty is not coming through from kernel and qemu to
> > the test framework looking for login prompt after qemu machine boot.
> > Workarounds like sending "\n\n" from the test framework through qemu to ttyS1
> > or "echo helloB > /dev/ttyS1" via working ttyS0 don't seem to help and wake
> > it up.
> 
> OK so for trying to reproduce this with qemu, is this with the default uarts
> or with some -device pci-serial-2x type options?

The port sometimes doesn't come up properly at boot.

To be clear, the "\n\n" from the qemu side into the port doesn't seem
to help. The "echo helloB > /dev/ttyS1" inside the image does seem to
wake it up. 

The qemu command we're using is:

qemu-system-x86_64 -device virtio-net-pci,netdev=net0,mac=52:54:00:12:34:02 
    -serial tcp:127.0.0.1:50421 
    -serial tcp:127.0.0.1:46457  
    -netdev tap,id=net0,ifname=tap0,script=no,downscript=no 
    -object rng-random,filename=/dev/urandom,id=rng0 -device virtio-rng-pci,rng=rng0 
    -drive file=./core-image-minimal-qemux86-64.rootfs.ext4.19673,if=virtio,format=raw -usb -device usb-tablet -usb -device usb-kbd   
    -cpu IvyBridge -machine q35,i8042=off -smp 4 -enable-kvm -m 256 
    -pidfile /media/build/poky/aa/pidfile_19670  -S -qmp unix:./.rbmorp7r,server,wait -qmp unix:./.kf6y7yqg,server,nowait -nographic  
    -kernel /media/build/poky/build/tmp/deploy/images/qemux86-64/bzImage 
    -append 'root=/dev/vda rw  ip=192.168.7.2::192.168.7.1:255.255.255.0::eth0:off:8.8.8.8 net.ifnames=0 console=ttyS0 console=ttyS1 oprofile.timer=1 tsc=reliable no_timer_check rcupdate.rcu_expedited=1 swiotlb=0  printk.time=1'

This is with qemu 8.1.0. The image we're testing with usually doesn't
have an ssh server so we're using the serial ports for control/testing
and not the networking even if we configure it. 

We use the serial ports over the tcp connections to handle the multiple
ports and have python code for that. I did extract that code into a
more standalone form, https://www.rpsys.net/wp/rp/simpleqemu.tgz where
"./commands.py" will then run a boot and wait for the login prompt.
You'd need to set a path in commands.py to point at an images directory
of a qemux86-64 core-image-minimal OE image build.
Setting runqemuparams = 'nographic' gets rid of the graphics need.

If you don't have a build, I shared a prebuilt image and hacked config
which you could point it at:
https://www.rpsys.net/wp/rp/simpleqemu-images.tgz

I did have to remove some of the qemu cpu/machine options (the q35
machine) to make it work with the older qemu versions often found on
distros, which don't support them. It will dump log files into the current directory
and there will be a log for each serial port.

I hacked the script to append sys.argv[1] to the log filename, then
experimented with a command like:

for i in `seq 1 88`; do ./commands.py $i & done

which launches 88 qemus in parallel. Sometimes you can see them "hang"
via the size of the ".2" logfile:

$ ls *.2 -la | cut -d ' ' -f 5 | sort | uniq -c
      1 134
      1 249
      1 251
      2 254
      1 255
     51 273

273 is the correct size, the smaller ones are truncated. 
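A quicker way to flag the truncated runs than eyeballing the `ls` output (a sketch, assuming 273 bytes is the known-good size for these `.2` logs):

```shell
# list the ".2" serial logs that are smaller than the known-good 273 bytes
find . -maxdepth 1 -name '*.2' -size -273c
```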

Sadly, in some cases this test appears to work even when the issue is
still present, with the issue only showing up intermittently in our CI.
It does seem timing dependent.

Let me know if I can provide any other info.

Cheers,

Richard



* Re: Kernel 6.5 ttyS1 hang with qemu (was Re: [OE-core] Summary of the remaining 6.5 kernel serial issue (and 6.5 summary)
  2023-10-16  8:10                     ` Richard Purdie
@ 2023-10-17  6:56                       ` Tony Lindgren
  2023-10-17 22:14                         ` Richard Purdie
  0 siblings, 1 reply; 25+ messages in thread
From: Tony Lindgren @ 2023-10-17  6:56 UTC (permalink / raw)
  To: Richard Purdie
  Cc: Mikko Rapeli, Greg Kroah-Hartman, linux-serial,
	openembedded-core, Bruce Ashfield, Randy MacLeod, Paul Gortmaker

* Richard Purdie <richard.purdie@linuxfoundation.org> [231016 08:10]:
> The port sometimes doesn't come up properly at boot.
> 
> To be clear, the "\n\n" from the qemu side into the port doesn't seem
> to help. The "echo helloB > /dev/ttyS1" inside the image does seem to
> wake it up. 

So if I understand correctly, this issue still happens with a kernel patched
with commit 81a61051e0ce ("serial: core: Fix checks for tx runtime PM
state"), and the issue now only happens sometimes.

I wonder if the following additional change might help?

I have not yet tried to reproduce with the full test case you guys have,
just with qemu and two ports but no luck so far with a minimal test case.

Regards,

Tony

8< -------------------
diff --git a/drivers/tty/serial/serial_core.c b/drivers/tty/serial/serial_core.c
--- a/drivers/tty/serial/serial_core.c
+++ b/drivers/tty/serial/serial_core.c
@@ -146,7 +146,7 @@ static void __uart_start(struct uart_state *state)
 
 	/* Increment the runtime PM usage count for the active check below */
 	err = pm_runtime_get(&port_dev->dev);
-	if (err < 0) {
+	if (err < 0 && err != -EINPROGRESS) {
 		pm_runtime_put_noidle(&port_dev->dev);
 		return;
 	}


* Re: Kernel 6.5 ttyS1 hang with qemu (was Re: [OE-core] Summary of the remaining 6.5 kernel serial issue (and 6.5 summary)
  2023-10-17  6:56                       ` Tony Lindgren
@ 2023-10-17 22:14                         ` Richard Purdie
  2023-10-18  5:28                           ` Tony Lindgren
  0 siblings, 1 reply; 25+ messages in thread
From: Richard Purdie @ 2023-10-17 22:14 UTC (permalink / raw)
  To: Tony Lindgren
  Cc: Mikko Rapeli, Greg Kroah-Hartman, linux-serial,
	openembedded-core, Bruce Ashfield, Randy MacLeod, Paul Gortmaker

On Tue, 2023-10-17 at 09:56 +0300, Tony Lindgren wrote:
> * Richard Purdie <richard.purdie@linuxfoundation.org> [231016 08:10]:
> > The port sometimes doesn't come up properly at boot.
> > 
> > To be clear, the "\n\n" from the qemu side into the port doesn't seem
> > to help. The "echo helloB > /dev/ttyS1" inside the image does seem to
> > wake it up. 
> 
> So if I understand correctly, this issue still happens with kernel patched
> with commit 81a61051e0ce ("serial: core: Fix checks for tx runtime PM
> state"), and the issue now only happens sometimes.

The issue has always been intermittent and it appeared to happen less
frequently with 81a61051e0ce added but it was hard to know if I was
imagining that.

> I wonder if the following additional change might help?

I've added it into testing and have not reproduced the failure with it
applied yet, locally or on our autobuilder. We need to sort some
release pieces which have been delayed by these issues and we're going
with a workaround for that. Once that is built I can get back to
testing this change more extensively, see if we can still provoke the
issue or not. It may take a day or two of testing before we know with
any certainty if the issue is resolved or not.

Cheers,

Richard




* Re: Kernel 6.5 ttyS1 hang with qemu (was Re: [OE-core] Summary of the remaining 6.5 kernel serial issue (and 6.5 summary)
  2023-10-17 22:14                         ` Richard Purdie
@ 2023-10-18  5:28                           ` Tony Lindgren
  2023-10-19 12:44                             ` Richard Purdie
  0 siblings, 1 reply; 25+ messages in thread
From: Tony Lindgren @ 2023-10-18  5:28 UTC (permalink / raw)
  To: Richard Purdie
  Cc: Mikko Rapeli, Greg Kroah-Hartman, linux-serial,
	openembedded-core, Bruce Ashfield, Randy MacLeod, Paul Gortmaker

* Richard Purdie <richard.purdie@linuxfoundation.org> [231017 22:15]:
> On Tue, 2023-10-17 at 09:56 +0300, Tony Lindgren wrote:
> > * Richard Purdie <richard.purdie@linuxfoundation.org> [231016 08:10]:
> > > The port sometimes doesn't come up properly at boot.
> > > 
> > > To be clear, the "\n\n" from the qemu side into the port doesn't seem
> > > to help. The "echo helloB > /dev/ttyS1" inside the image does seem to
> > > wake it up. 
> > 
> > So if I understand correctly, this issue still happens with kernel patched
> > with commit 81a61051e0ce ("serial: core: Fix checks for tx runtime PM
> > state"), and the issue now only happens sometimes.
> 
> The issue has always been intermittent and it appeared to happen less
> frequently with 81a61051e0ce added but it was hard to know if I was
> imagining that.

Oh OK.

> > I wonder if the following additional change might help?
> 
> I've added it into testing and have not reproduced the failure with it
> applied yet, locally or on our autobuilder. We need to sort some
> release pieces which have been delayed by these issues and we're going
> with a workaround for that. Once that is built I can get back to
> testing this change more extensively, see if we can still provoke the
> issue or not. It may take a day or two of testing before we know with
> any certainty if the issue is resolved or not.

Thanks for the update, let's wait a few days and see then.

Regards,

Tony


* Re: Kernel 6.5 ttyS1 hang with qemu (was Re: [OE-core] Summary of the remaining 6.5 kernel serial issue (and 6.5 summary)
  2023-10-18  5:28                           ` Tony Lindgren
@ 2023-10-19 12:44                             ` Richard Purdie
  0 siblings, 0 replies; 25+ messages in thread
From: Richard Purdie @ 2023-10-19 12:44 UTC (permalink / raw)
  To: Tony Lindgren
  Cc: Mikko Rapeli, Greg Kroah-Hartman, linux-serial,
	openembedded-core, Bruce Ashfield, Randy MacLeod, Paul Gortmaker

On Wed, 2023-10-18 at 08:28 +0300, Tony Lindgren wrote:
> * Richard Purdie <richard.purdie@linuxfoundation.org> [231017 22:15]:
> > On Tue, 2023-10-17 at 09:56 +0300, Tony Lindgren wrote:
> > > * Richard Purdie <richard.purdie@linuxfoundation.org> [231016 08:10]:
> > > > The port sometimes doesn't come up properly at boot.
> > > > 
> > > > To be clear, the "\n\n" from the qemu side into the port doesn't seem
> > > > to help. The "echo helloB > /dev/ttyS1" inside the image does seem to
> > > > wake it up. 
> > > 
> > > So if I understand correctly, this issue still happens with kernel patched
> > > with commit 81a61051e0ce ("serial: core: Fix checks for tx runtime PM
> > > state"), and the issue now only happens sometimes.
> > 
> > The issue has always been intermittent and it appeared to happen less
> > frequently with 81a61051e0ce added but it was hard to know if I was
> > imagining that.
> 
> Oh OK.
> 
> > > I wonder if the following additional change might help?
> > 
> > I've added it into testing and have not reproduced the failure with it
> > applied yet, locally or on our autobuilder. We need to sort some
> > release pieces which have been delayed by these issues and we're going
> > with a workaround for that. Once that is built I can get back to
> > testing this change more extensively, see if we can still provoke the
> > issue or not. It may take a day or two of testing before we know with
> > any certainty if the issue is resolved or not.
> 
> Thanks for the update, let's wait a few days and see then.

Our release build failed with our workaround, probably because this piece
was missing.

I've not seen the failure occur in any build with it applied so far, so
I think we'll move over to this patch as it seems to be the better
option. Whether the issue is fully fixed I'm still not 100% sure, but it
is looking more likely. Thanks for the patch/fix!

Cheers,

Richard


end of thread, other threads:[~2023-10-19 12:44 UTC | newest]

Thread overview: 25+ messages
     [not found] <178BF2895FF685E6.5378@lists.openembedded.org>
2023-10-08  8:23 ` [OE-core] Summary of the remaining 6.5 kernel serial issue (and 6.5 summary) Richard Purdie
2023-10-09 10:05   ` Kernel 6.5 ttyS1 hang with qemu (was " Mikko Rapeli
2023-10-09 10:05     ` Mikko Rapeli
2023-10-14  9:41     ` Richard Purdie
2023-10-14 11:13       ` Richard Purdie
     [not found]       ` <178DF50519C11C84.8679@lists.openembedded.org>
2023-10-15 12:29         ` Richard Purdie
2023-10-15 15:31           ` Greg Kroah-Hartman
2023-10-15 21:30             ` Richard Purdie
2023-10-16  6:35               ` Tony Lindgren
2023-10-16  7:16                 ` Mikko Rapeli
2023-10-16  7:23                   ` Tony Lindgren
2023-10-16  8:09                     ` Mikko Rapeli
2023-10-16  8:10                     ` Richard Purdie
2023-10-17  6:56                       ` Tony Lindgren
2023-10-17 22:14                         ` Richard Purdie
2023-10-18  5:28                           ` Tony Lindgren
2023-10-19 12:44                             ` Richard Purdie
     [not found] ` <178C1443D3C8D91F.5378@lists.openembedded.org>
2023-10-09 12:30   ` Richard Purdie
     [not found]   ` <178C704E48D36FAB.24994@lists.openembedded.org>
2023-10-10 20:37     ` Richard Purdie
     [not found]     ` <178CD9772ED4DEB5.9719@lists.openembedded.org>
2023-10-10 20:53       ` Richard Purdie
2023-10-11  4:49         ` Paul Gortmaker
2023-10-13 12:00           ` Richard Purdie
2023-10-13 15:15           ` Paul Gortmaker
     [not found]             ` <178C6861D9B097E8.24994@lists.openembedded.org>
2023-10-16  5:59               ` Kernel 6.5 ttyS1 hang with qemu (was " Mikko Rapeli
2023-10-16  5:59                 ` Mikko Rapeli
