All of lore.kernel.org
 help / color / mirror / Atom feed
* rt_dev_send() stalls periodic task
@ 2019-04-15 17:28 C Smith
  2019-04-16  8:03 ` Jan Kiszka
  0 siblings, 1 reply; 21+ messages in thread
From: C Smith @ 2019-04-15 17:28 UTC (permalink / raw)
  To: Sumitabh Ghosh via Xenomai

My Xenomai periodic routine normally runs for days at a time on most
motherboards, but it is spontaneously getting stuck forever in
rt_dev_write(). This is a write to a xeno_16550A driver serial port.

I must use this brand of motherboard, where the first serial port (rtser0
0x3f8 irq 4) does not have a problem, but the other two serial ports have
the stalling problem (rtser1 0x2f8 irq 5, rtser2 0x2e8 irq 3). Three
motherboards of this brand have been tried with the same results. There are
no shared interrupts in this scenario.

The serial device is set up this way:

struct rtser_config serial_config = {
        .config_mask       = 0xFFFF,
        .baud_rate         = 115200,
        .parity            = RTSER_NO_PARITY,
        .data_bits         = RTSER_8_BITS,
        .stop_bits         = RTSER_1_STOPB,
        .handshake         = RTSER_NO_HAND,
        .fifo_depth        = RTSER_DEF_FIFO_DEPTH, //RTSER_FIFO_DEPTH_8,
    .reserved          = 0,
        .rx_timeout        = 500000,
        .tx_timeout        = RTSER_DEF_TIMEOUT,
        .event_timeout     = 5000000,
        .timestamp_history = RTSER_RX_TIMESTAMP_HISTORY,
        .event_mask        = RTSER_EVENT_RXPEND,
};
fd_tty[0] = rt_dev_open("rtser1", O_RDWR | O_NONBLOCK);
sret = rt_dev_ioctl(fd_tty[0], RTSER_RTIOC_SET_CONFIG, &serial_config);

The application transmits a packet of about 75 bytes repeatedly from a
xenomai periodic task that wakes up at 125Hz repeatedly. Note that there is
also a small RX serial packet arriving so there is some full-duplex
overlap.  On rtser0 this works fine, on the other serial ports the stall
happens after a few hours and my periodic xenomai task stops. There is no
xenomai watchdog message in dmesg. The code is repeatedly checking the
serial port status ioctl and there are no errors like framing errors etc.

The periodic task is just a typical xenomai while() loop:
  next += period_ns + adjust_ns;
    rt_task_sleep_until(next);

When my periodic task stops the kernel says the stack trace is:
[root@oyx ~]# cd /proc/1066/task/1075/
[root@oyx 1075]# cat stack
[<c112d058>] xnpod_suspend_thread+0x3d8/0x650
[<c1132f09>] xnsynch_sleep_on+0x139/0x320
[<c11a7f14>] rtdm_event_timedwait+0x2e4/0x390
[<e858ed3b>] rt_16550_write+0x35b/0x540 [xeno_16550A]
[<c11a1e23>] __rt_dev_write+0x63/0x110
[<c11a9374>] sys_rtdm_write+0x24/0x30
[<c113c2dc>] hisyscall_event+0x1ec/0x380
[<c10eb31a>] ipipe_syscall_hook+0x3a/0x50
[<c10ea220>] __ipipe_notify_syscall+0xb0/0x160
[<c16a73bb>] pipeline_syscall+0x7/0x18
[<ffffffff>] 0xffffffff

I can attach with a debugger, and when I do I think the debugger gets us
out of the stall, so can actually single step the code for a little while.
I can't see any suspicious variable values, only that the serial port
transmitted 40 of my 75 bytes, which is unusual. But I can only single step
until my task sleeps one more time. At the next wakeup if I step into the
rt_dev_write() the task stalls forever and I can no longer debug.

(gdb) thread 2
[Switching to thread 2 (Thread 0xb7797b40 (LWP 1336))]
#0  0xb77caa92 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
(gdb) where
#0  0xb77caa92 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0xb775d872 in rt_dev_write (fd=12, buf=0xa8eda001, nbyte=72) at
core.c:72
#2  0x08056515 in Process_serial (comm_p=0x810e644 <comm_object+4>,
portnum=1 '\001') at periodic_app.cpp:5404
#3  0x0804e0e4 in Periodic_routine (cookie=0x0) at periodic_app.cpp:1654
#4  0xb7764acd in rt_task_trampoline (cookie=0x0) at task.c:113
#5  0xb777a313 in start_thread () from /lib/libpthread.so.0
#6  0xb7528f2e in clone () from /lib/libc.so.6

I'm using an Intel I5 CPU, 32 bit kernel 3.18.20, Xenomai 2.6.5. I must be
on this Xenomai/kernel version to support tens of thousands of lines of
legacy code. I diffed the driver sources and the rtl_16550 driver did not
functionally change between Xenomai 2.6.5 and Xenomai 3.0.8.

I looked at the rt_dev_write() source code, but I don't see an obvious
infinite loop (though the assembly code is a bit beyond my understanding).
I'd like to detect the problem early and continue without stalling.
It seems the physical serial ports are misbehaving, sure. But what would
make rt_dev_write() stall forever?

thanks,
C Smith

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: rt_dev_send() stalls periodic task
  2019-04-15 17:28 rt_dev_send() stalls periodic task C Smith
@ 2019-04-16  8:03 ` Jan Kiszka
  2019-04-18  6:42   ` C Smith
  0 siblings, 1 reply; 21+ messages in thread
From: Jan Kiszka @ 2019-04-16  8:03 UTC (permalink / raw)
  To: C Smith, Sumitabh Ghosh via Xenomai

On 15.04.19 19:28, C Smith via Xenomai wrote:
> My Xenomai periodic routine normally runs for days at a time on most
> motherboards, but it is spontaneously getting stuck forever in
> rt_dev_write(). This is a write to a xeno_16550A driver serial port.
> 
> I must use this brand of motherboard, where the first serial port (rtser0
> 0x3f8 irq 4) does not have a problem, but the other two serial ports have
> the stalling problem (rtser1 0x2f8 irq 5, rtser2 0x2e8 irq 3). Three
> motherboards of this brand have been tried with the same results. There are
> no shared interrupts in this scenario.
> 
> The serial device is set up this way:
> 
> struct rtser_config serial_config = {
>          .config_mask       = 0xFFFF,
>          .baud_rate         = 115200,
>          .parity            = RTSER_NO_PARITY,
>          .data_bits         = RTSER_8_BITS,
>          .stop_bits         = RTSER_1_STOPB,
>          .handshake         = RTSER_NO_HAND,
>          .fifo_depth        = RTSER_DEF_FIFO_DEPTH, //RTSER_FIFO_DEPTH_8,
>      .reserved          = 0,
>          .rx_timeout        = 500000,
>          .tx_timeout        = RTSER_DEF_TIMEOUT,
>          .event_timeout     = 5000000,
>          .timestamp_history = RTSER_RX_TIMESTAMP_HISTORY,
>          .event_mask        = RTSER_EVENT_RXPEND,
> };
> fd_tty[0] = rt_dev_open("rtser1", O_RDWR | O_NONBLOCK);
> sret = rt_dev_ioctl(fd_tty[0], RTSER_RTIOC_SET_CONFIG, &serial_config);
> 
> The application transmits a packet of about 75 bytes repeatedly from a
> xenomai periodic task that wakes up at 125Hz repeatedly. Note that there is
> also a small RX serial packet arriving so there is some full-duplex
> overlap.  On rtser0 this works fine, on the other serial ports the stall
> happens after a few hours and my periodic xenomai task stops. There is no
> xenomai watchdog message in dmesg. The code is repeatedly checking the
> serial port status ioctl and there are no errors like framing errors etc.
> 
> The periodic task is just a typical xenomai while() loop:
>    next += period_ns + adjust_ns;
>      rt_task_sleep_until(next);
> 
> When my periodic task stops the kernel says the stack trace is:
> [root@oyx ~]# cd /proc/1066/task/1075/
> [root@oyx 1075]# cat stack
> [<c112d058>] xnpod_suspend_thread+0x3d8/0x650
> [<c1132f09>] xnsynch_sleep_on+0x139/0x320
> [<c11a7f14>] rtdm_event_timedwait+0x2e4/0x390
> [<e858ed3b>] rt_16550_write+0x35b/0x540 [xeno_16550A]

This means the driver is stuck while writing because there are no more free 
entries in the hardware TX FIFO. Do you have hardware flow control enabled? Are 
you sure the that the receiving side is playing nicely?

Jan

> [<c11a1e23>] __rt_dev_write+0x63/0x110
> [<c11a9374>] sys_rtdm_write+0x24/0x30
> [<c113c2dc>] hisyscall_event+0x1ec/0x380
> [<c10eb31a>] ipipe_syscall_hook+0x3a/0x50
> [<c10ea220>] __ipipe_notify_syscall+0xb0/0x160
> [<c16a73bb>] pipeline_syscall+0x7/0x18
> [<ffffffff>] 0xffffffff
> 
> I can attach with a debugger, and when I do I think the debugger gets us
> out of the stall, so can actually single step the code for a little while.
> I can't see any suspicious variable values, only that the serial port
> transmitted 40 of my 75 bytes, which is unusual. But I can only single step
> until my task sleeps one more time. At the next wakeup if I step into the
> rt_dev_write() the task stalls forever and I can no longer debug.
> 
> (gdb) thread 2
> [Switching to thread 2 (Thread 0xb7797b40 (LWP 1336))]
> #0  0xb77caa92 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
> (gdb) where
> #0  0xb77caa92 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
> #1  0xb775d872 in rt_dev_write (fd=12, buf=0xa8eda001, nbyte=72) at
> core.c:72
> #2  0x08056515 in Process_serial (comm_p=0x810e644 <comm_object+4>,
> portnum=1 '\001') at periodic_app.cpp:5404
> #3  0x0804e0e4 in Periodic_routine (cookie=0x0) at periodic_app.cpp:1654
> #4  0xb7764acd in rt_task_trampoline (cookie=0x0) at task.c:113
> #5  0xb777a313 in start_thread () from /lib/libpthread.so.0
> #6  0xb7528f2e in clone () from /lib/libc.so.6
> 
> I'm using an Intel I5 CPU, 32 bit kernel 3.18.20, Xenomai 2.6.5. I must be
> on this Xenomai/kernel version to support tens of thousands of lines of
> legacy code. I diffed the driver sources and the rtl_16550 driver did not
> functionally change between Xenomai 2.6.5 and Xenomai 3.0.8.
> 
> I looked at the rt_dev_write() source code, but I don't see an obvious
> infinite loop (though the assembly code is a bit beyond my understanding).
> I'd like to detect the problem early and continue without stalling.
> It seems the physical serial ports are misbehaving, sure. But what would
> make rt_dev_write() stall forever?
> 
> thanks,
> C Smith
> 

-- 
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: rt_dev_send() stalls periodic task
  2019-04-16  8:03 ` Jan Kiszka
@ 2019-04-18  6:42   ` C Smith
  2019-04-18  8:36     ` Jan Kiszka
  0 siblings, 1 reply; 21+ messages in thread
From: C Smith @ 2019-04-18  6:42 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Sumitabh Ghosh via Xenomai

On Tue, Apr 16, 2019 at 1:03 AM Jan Kiszka <jan.kiszka@siemens.com> wrote:

> > The serial device is set up this way:
> > struct rtser_config serial_config = {
> >          .config_mask       = 0xFFFF,
> >          .baud_rate         = 115200,
> >          .parity            = RTSER_NO_PARITY,
> >          .data_bits         = RTSER_8_BITS,
> >          .stop_bits         = RTSER_1_STOPB,
> >          .handshake         = RTSER_NO_HAND,
> >          .fifo_depth        = RTSER_DEF_FIFO_DEPTH, //RTSER_FIFO_DEPTH_8,
> >      .reserved          = 0,
> >          .rx_timeout        = 500000,
> >          .tx_timeout        = RTSER_DEF_TIMEOUT,
> >          .event_timeout     = 5000000,
> >          .timestamp_history = RTSER_RX_TIMESTAMP_HISTORY,
> >          .event_mask        = RTSER_EVENT_RXPEND,
> > };
> > fd_tty[0] = rt_dev_open("rtser1", O_RDWR | O_NONBLOCK);
> > sret = rt_dev_ioctl(fd_tty[0], RTSER_RTIOC_SET_CONFIG, &serial_config);
> >
> > The application transmits a packet of about 75 bytes repeatedly from a
> > xenomai periodic task that wakes up at 125Hz repeatedly. Note that there
> is
> > also a small RX serial packet arriving so there is some full-duplex
> > overlap.  On rtser0 this works fine, on the other serial ports the stall
> > happens after a few hours and my periodic xenomai task stops. There is no
> > xenomai watchdog message in dmesg. The code is repeatedly checking the
> > serial port status ioctl and there are no errors like framing errors etc.
> >
> > The periodic task is just a typical xenomai while() loop:
> >    next += period_ns + adjust_ns;
> >      rt_task_sleep_until(next);
> >
> > When my periodic task stops the kernel says the stack trace is:
> > [root@oyx ~]# cd /proc/1066/task/1075/
> > [root@oyx 1075]# cat stack
> > [<c112d058>] xnpod_suspend_thread+0x3d8/0x650
> > [<c1132f09>] xnsynch_sleep_on+0x139/0x320
> > [<c11a7f14>] rtdm_event_timedwait+0x2e4/0x390
> > [<e858ed3b>] rt_16550_write+0x35b/0x540 [xeno_16550A]
>
> This means the driver is stuck while writing because there are no more
> free
> entries in the hardware TX FIFO. Do you have hardware flow control
> enabled? Are
> you sure the that the receiving side is playing nicely?
> Jan
>

Well I'm not trying to use hardware flow control. Is hardware flow control
the xeno_16550A driver default?
I am using a 3-wire serial cable, no RTS/CTS wires at all, they are
floating.

At present my app only ever calls the ioctl RTSER_RTIOC_SET_CONFIG as
detailed above.
Are you saying I need to do an ioctl RTSER_RTIOC_SET_CONTROL and clear
control bit  *RTSER_MCR_RTS*   ?

I have another Xenomai app which has been running fine as a 3-wire serial
connection for about 4 years,
and it does not do an ioctl RTSER_RTIOC_SET_CONFIG either. Doesn't that
mean setting flow control is unnecessary?

-C Smith

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: rt_dev_send() stalls periodic task
  2019-04-18  6:42   ` C Smith
@ 2019-04-18  8:36     ` Jan Kiszka
  2019-04-21  4:33       ` C Smith
  0 siblings, 1 reply; 21+ messages in thread
From: Jan Kiszka @ 2019-04-18  8:36 UTC (permalink / raw)
  To: C Smith; +Cc: Sumitabh Ghosh via Xenomai

On 18.04.19 08:42, C Smith wrote:
> 
> 
> On Tue, Apr 16, 2019 at 1:03 AM Jan Kiszka <jan.kiszka@siemens.com 
> <mailto:jan.kiszka@siemens.com>> wrote:
> 
>      > The serial device is set up this way:
>      > struct rtser_config serial_config = {
>      >          .config_mask       = 0xFFFF,
>      >          .baud_rate         = 115200,
>      >          .parity            = RTSER_NO_PARITY,
>      >          .data_bits         = RTSER_8_BITS,
>      >          .stop_bits         = RTSER_1_STOPB,
>      >          .handshake         = RTSER_NO_HAND,
>      >          .fifo_depth        = RTSER_DEF_FIFO_DEPTH, //RTSER_FIFO_DEPTH_8,
>      >      .reserved          = 0,
>      >          .rx_timeout        = 500000,
>      >          .tx_timeout        = RTSER_DEF_TIMEOUT,
>      >          .event_timeout     = 5000000,
>      >          .timestamp_history = RTSER_RX_TIMESTAMP_HISTORY,
>      >          .event_mask        = RTSER_EVENT_RXPEND,
>      > };
>      > fd_tty[0] = rt_dev_open("rtser1", O_RDWR | O_NONBLOCK);
>      > sret = rt_dev_ioctl(fd_tty[0], RTSER_RTIOC_SET_CONFIG, &serial_config);
>      >
>      > The application transmits a packet of about 75 bytes repeatedly from a
>      > xenomai periodic task that wakes up at 125Hz repeatedly. Note that there is
>      > also a small RX serial packet arriving so there is some full-duplex
>      > overlap.  On rtser0 this works fine, on the other serial ports the stall
>      > happens after a few hours and my periodic xenomai task stops. There is no
>      > xenomai watchdog message in dmesg. The code is repeatedly checking the
>      > serial port status ioctl and there are no errors like framing errors etc.
>      >
>      > The periodic task is just a typical xenomai while() loop:
>      >    next += period_ns + adjust_ns;
>      >      rt_task_sleep_until(next);
>      >
>      > When my periodic task stops the kernel says the stack trace is:
>      > [root@oyx ~]# cd /proc/1066/task/1075/
>      > [root@oyx 1075]# cat stack
>      > [<c112d058>] xnpod_suspend_thread+0x3d8/0x650
>      > [<c1132f09>] xnsynch_sleep_on+0x139/0x320
>      > [<c11a7f14>] rtdm_event_timedwait+0x2e4/0x390
>      > [<e858ed3b>] rt_16550_write+0x35b/0x540 [xeno_16550A]
> 
>     This means the driver is stuck while writing because there are no more free
>     entries in the hardware TX FIFO. Do you have hardware flow control enabled? Are
>     you sure the that the receiving side is playing nicely?
>     Jan
> 
> 
> Well I'm not trying to use hardware flow control. Is hardware flow control the 
> xeno_16550A driver default?
> I am using a 3-wire serial cable, no RTS/CTS wires at all, they are floating.z
> 
> At present my app only ever calls the ioctl RTSER_RTIOC_SET_CONFIG as detailed 
> above.
> Are you saying I need to do an ioctl RTSER_RTIOC_SET_CONTROL and clear control 
> bit *RTSER_MCR_RTS*   ?
> 
> I have another Xenomai app which has been running fine as a 3-wire serial 
> connection for about 4 years,
> and it does not do an ioctl RTSER_RTIOC_SET_CONFIG either. Doesn't that mean 
> setting flow control is unnecessary?

The default handshake mode is

#define RTSER_DEF_HAND                  RTSER_NO_HAND

and your setting also does not change that. So that is not the reason. but we 
still see that the TX fifo does not get emptied at some point - or we miss the 
interrupt that informs us about this.

You could confirm that missed-interrupt theory by checking the line status 
register (RTSER_RTIOC_GET_STATUS -> RTSER_LSR_TRANSM_EMPTY) when TX got stuck.

Jan

-- 
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: rt_dev_send() stalls periodic task
  2019-04-18  8:36     ` Jan Kiszka
@ 2019-04-21  4:33       ` C Smith
  2019-04-21 20:10         ` Steve Freyder
  0 siblings, 1 reply; 21+ messages in thread
From: C Smith @ 2019-04-21  4:33 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Sumitabh Ghosh via Xenomai

Per your suggestion, I added code to call this ioctl, right after the
rt_dev_write() :
   rt_dev_ioctl(fd_tty[1], RTSER_RTIOC_GET_STATUS, &serial_status);
I let the transmit stall again, then attached with a gdb, which allows me
to step forward to the ioctl:
   serial_status.line_status was 96 decimal, or  0110 0000 binary
which means both transmit holding and transmit shift registers were empty,
thus nothing was queued up in the UART for transmission.
The return value of rt_dev_write() was only 8, after a 72 byte packet was
submitted to rt_dev_write().
So your theory that the TX interrupt got lost seems correct.

First, why does rt_dev_write() wait until all bytes are transmitted ?
Shouldn't it be effectively "non blocking" ?

Second, how might l generate another UART TX interrupt to keep the
transmission going?
Can we modify the serial driver at a low level to check the LSR vs the
bytes in the buffer, and force transmission until the buffer is empty?

thanks,
-C Smith

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: rt_dev_send() stalls periodic task
  2019-04-21  4:33       ` C Smith
@ 2019-04-21 20:10         ` Steve Freyder
  2019-04-22  6:40           ` C Smith
  0 siblings, 1 reply; 21+ messages in thread
From: Steve Freyder @ 2019-04-21 20:10 UTC (permalink / raw)
  Cc: xenomai, C Smith

On 4/20/2019 11:33 PM, C Smith via Xenomai wrote:
> Per your suggestion, I added code to call this ioctl, right after the
> rt_dev_write() :
>     rt_dev_ioctl(fd_tty[1], RTSER_RTIOC_GET_STATUS, &serial_status);
> I let the transmit stall again, then attached with a gdb, which allows me
> to step forward to the ioctl:
>     serial_status.line_status was 96 decimal, or  0110 0000 binary
> which means both transmit holding and transmit shift registers were empty,
> thus nothing was queued up in the UART for transmission.
> The return value of rt_dev_write() was only 8, after a 72 byte packet was
> submitted to rt_dev_write().
> So your theory that the TX interrupt got lost seems correct.
>
> First, why does rt_dev_write() wait until all bytes are transmitted ?
> Shouldn't it be effectively "non blocking" ?
>
> Second, how might l generate another UART TX interrupt to keep the
> transmission going?
> Can we modify the serial driver at a low level to check the LSR vs the
> bytes in the buffer, and force transmission until the buffer is empty?
>
> thanks,
> -C Smith
>
[ pls excuse the intrusion on your thread, I experienced this same
   problem years ago, 16550A hardware, bare metal,, perhaps I could
   add a couple of thoughts ]

I would point out that Phillipe made some changes to the 3.0.x iMX
UART driver circa 2019/04/01, in what sounds like the same functional
area.  Granted that's different hardware, but it appears to be a
descendant of this driver, so if those changes were good for the iMX
driver, maybe they're good for this one too.

I got curious about "tx_timeout", and why it doesn't help in this
situation, so I looked at the code.  The driver rightly assumes that
the hardware is going to produce a TX interrupt when the FIFO trigger
level is reached.  The TX interrupt handler will pull more bytes from
the (4K) software transmit buffer to fill the TX FIFO, set the
IER.THRE interrupt enable, and return.    If the TX interrupt doesn't
fire, that process of emptying the software FIFO into the hardware TX
FIFO stops, and there's no timeout-based provision for restoring the
flow of output, so it's only a matter of time before the software FIFO
overflows, and at that point your writes start to stall.

I might argue that since you are in nonblocking mode, the driver
write routine should be doing this check before attempting to put
anything in the software buffer:

     if (userwritelen > freebytesinsoftwarebuffer)  {
         return(-EWOULDBLOCK) ;
     }

With obvious issues for user buffers larger than 4K in NB mode.

That'd keep your task from hanging, but the output is still going to
stop shortly after losing a THRE interrupt.

BTW, if you truly *are* losing an interrupt, the IER.THRE bit should
be equal to 1 when you look at it in the debugger.  If the IER.THRE
bit is 0, then it means that the driver made the mistake, OR perhaps
that there's a timing problem where the CPU *tried* to set IER.THRE
but the chip wasn't ready and never heard the request.        As I
remember it there's a software copy of the last requested output
state of the IER kept in the per-port context structure, so you could
look there to see what the driver last attempted to write to the IER.

I remember you mentioned that you have three such UARTs in your
system, and that one (COM1?) is not having this problem, but the
other two *are*.  I think I would be interested in how the hardware
related to COM1 differs from that of the others.  Are they all on the
motherboard?  Maybe the IRQ assignments are what make the
difference.

Finally, you could run a test where you let the port be handled by
Linux, and exercise it with:

     strace dd if=/dev/zero bs=75 of=/dev/ttyXX

to see if you have the same problem with output stopping (eg, dropping
a THRE interrupt).  Keep an eye on your dmesg output while you're
running the test, Linux might have code to detect a dropped transmit
interrupt based on a timer, and if that happens, it should be logged
via printk() and show up in dmesg.

HTH

Regards,
Steve


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: rt_dev_send() stalls periodic task
  2019-04-21 20:10         ` Steve Freyder
@ 2019-04-22  6:40           ` C Smith
  2019-04-22  6:45             ` Jan Kiszka
  0 siblings, 1 reply; 21+ messages in thread
From: C Smith @ 2019-04-22  6:40 UTC (permalink / raw)
  To: Steve Freyder; +Cc: xenomai

Thanks for your insight, Steve. I didn't realize rt_dev_write() doesnt
actually stall until it is called many times and the 4K TX buffer gets
full. (is that right Jan?)
It that is the case, sure I could find a way to check the TX buffer fill
level to prevent my app from stalling.

I rewrote the xeno_16550A driver RTSER_RTIOC_GET_STATUS ioctl to return to
userspace the contents of the IIR and the IER too.
I'm getting IIR = 0b 0001 0100, so the source of the latest interrupt is a
RX (not surprising, as I'm doing full duplex) and there is no THRE
interrupt pending.
So regardless of the ultimate cause, this state will never empty the TX
buffer.

I think my only choice is to try something I had to do once before on a
similarly misbehaving serial port: I'll rewrite the xeno_16550A interrupt
handlers to redundantly check for data pending in the TX buffer whenever
any interrupt like an RX interrupt happens. I do have bidirectional traffic
after all, so the driver will wake up frequently and keep the TX data
transmitting.

Interesting enough, the stall problem did not occur when I used the sample
serial code provided by xenomai: cross-link.c . I also rewrote cross-link.c
to send a 72 byte packet and receive on the same port (I installed a
physical loopback device on the serial port). No stalls for 12+ hours with
packets streaming at 100 Hz.
The only difference in the serial configuration between that cross-link.c
app and my app was :
struct rtser_config :
        .rx_timeout        = RTSER_DEF_TIMEOUT  // infinite ,  no stall for
many hours in cross-link.c
versus:
        .rx_timeout        = 500000   // 500us, stalls within an hour in my
app
I don't know why an RX setting affects TX behavior. I also can't use
RTSER_DEF_TIMEOUT in my application or it dies when it starts up - no clue
why.  But I did try setting
  .rx_timeout        = 5000000   // 5 ms. my app doesnt stall for several
hours
and though that did not cause the serial to stall in my app for several
hours of testing, it is just open-loop finger-crossing, and not a real
solution.
I need the TX interrupts to fire reliably. So I think I must rewrite that
interrupt handler, as above.

-C Smith

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: rt_dev_send() stalls periodic task
  2019-04-22  6:40           ` C Smith
@ 2019-04-22  6:45             ` Jan Kiszka
  2019-04-22 19:51               ` Steve Freyder
  2019-04-23 12:15               ` Jan Kiszka
  0 siblings, 2 replies; 21+ messages in thread
From: Jan Kiszka @ 2019-04-22  6:45 UTC (permalink / raw)
  To: C Smith, Steve Freyder; +Cc: xenomai

On 22.04.19 08:40, C Smith via Xenomai wrote:
> Thanks for your insight, Steve. I didn't realize rt_dev_write() doesnt
> actually stall until it is called many times and the 4K TX buffer gets
> full. (is that right Jan?)
> It that is the case, sure I could find a way to check the TX buffer fill
> level to prevent my app from stalling.
>
> I rewrote the xeno_16550A driver RTSER_RTIOC_GET_STATUS ioctl to return to
> userspace the contents of the IIR and the IER too.
> I'm getting IIR = 0b 0001 0100, so the source of the latest interrupt is a
> RX (not surprising, as I'm doing full duplex) and there is no THRE
> interrupt pending.
> So regardless of the ultimate cause, this state will never empty the TX
> buffer.
>
> I think my only choice is to try something I had to do once before on a
> similarly misbehaving serial port: I'll rewrite the xeno_16550A interrupt
> handlers to redundantly check for data pending in the TX buffer whenever
> any interrupt like an RX interrupt happens. I do have bidirectional traffic
> after all, so the driver will wake up frequently and keep the TX data
> transmitting.
>
> Interesting enough, the stall problem did not occur when I used the sample
> serial code provided by xenomai: cross-link.c . I also rewrote cross-link.c
> to send a 72 byte packet and receive on the same port (I installed a
> physical loopback device on the serial port). No stalls for 12+ hours with
> packets streaming at 100 Hz.
> The only difference in the serial configuration between that cross-link.c
> app and my app was :
> struct rtser_config :
>          .rx_timeout        = RTSER_DEF_TIMEOUT  // infinite ,  no stall for
> many hours in cross-link.c
> versus:
>          .rx_timeout        = 500000   // 500us, stalls within an hour in my
> app
> I don't know why an RX setting affects TX behavior. I also can't use
> RTSER_DEF_TIMEOUT in my application or it dies when it starts up - no clue
> why.  But I did try setting
>    .rx_timeout        = 5000000   // 5 ms. my app doesnt stall for several
> hours
> and though that did not cause the serial to stall in my app for several
> hours of testing, it is just open-loop finger-crossing, and not a real
> solution.
> I need the TX interrupts to fire reliably. So I think I must rewrite that
> interrupt handler, as above.
>

I think we have a race between rt_16550_write filling the software queue that
the tx interrupt is supposed to write out and the latter already firing,
consuming that event without seeing the queue filled. I'll think about a better
algorithm tomorrow, one that can possibly get rid of some interrupt events as well.

Jan


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: rt_dev_send() stalls periodic task
  2019-04-22  6:45             ` Jan Kiszka
@ 2019-04-22 19:51               ` Steve Freyder
  2019-04-22 20:58                 ` Steve Freyder
  2019-04-23 12:15               ` Jan Kiszka
  1 sibling, 1 reply; 21+ messages in thread
From: Steve Freyder @ 2019-04-22 19:51 UTC (permalink / raw)
  Cc: xenomai

On 4/22/2019 1:45 AM, Jan Kiszka wrote:
> On 22.04.19 08:40, C Smith via Xenomai wrote:
>> Thanks for your insight, Steve. I didn't realize rt_dev_write() doesnt
>> actually stall until it is called many times and the 4K TX buffer gets
>> full. (is that right Jan?)
>> It that is the case, sure I could find a way to check the TX buffer fill
>> level to prevent my app from stalling.
>>
>> I rewrote the xeno_16550A driver RTSER_RTIOC_GET_STATUS ioctl to 
>> return to
>> userspace the contents of the IIR and the IER too.
>> I'm getting IIR = 0b 0001 0100, so the source of the latest interrupt 
>> is a
>> RX (not surprising, as I'm doing full duplex) and there is no THRE
>> interrupt pending.
>> So regardless of the ultimate cause, this state will never empty the TX
>> buffer.
>>
>> I think my only choice is to try something I had to do once before on a
>> similarly misbehaving serial port: I'll rewrite the xeno_16550A 
>> interrupt
>> handlers to redundantly check for data pending in the TX buffer whenever
>> any interrupt like an RX interrupt happens. I do have bidirectional 
>> traffic
>> after all, so the driver will wake up frequently and keep the TX data
>> transmitting.
>>
>> Interesting enough, the stall problem did not occur when I used the 
>> sample
>> serial code provided by xenomai: cross-link.c . I also rewrote 
>> cross-link.c
>> to send a 72 byte packet and receive on the same port (I installed a
>> physical loopback device on the serial port). No stalls for 12+ hours 
>> with
>> packets streaming at 100 Hz.
>> The only difference in the serial configuration between that 
>> cross-link.c
>> app and my app was :
>> struct rtser_config :
>>          .rx_timeout        = RTSER_DEF_TIMEOUT  // infinite , no 
>> stall for
>> many hours in cross-link.c
>> versus:
>>          .rx_timeout        = 500000   // 500us, stalls within an 
>> hour in my
>> app
>> I don't know why an RX setting affects TX behavior. I also can't use
>> RTSER_DEF_TIMEOUT in my application or it dies when it starts up - no 
>> clue
>> why.  But I did try setting
>>    .rx_timeout        = 5000000   // 5 ms. my app doesnt stall for 
>> several
>> hours
>> and though that did not cause the serial to stall in my app for several
>> hours of testing, it is just open-loop finger-crossing, and not a real
>> solution.
>> I need the TX interrupts to fire reliably. So I think I must rewrite 
>> that
>> interrupt handler, as above.
>>
>
> I think we have a race between rt_16550_write filling the software 
> queue that
> the tx interrupt is supposed to write out and the latter already firing,
> consuming that event without seeing the queue filled. I'll think about 
> a better
> algorithm tomorrow, one that can possibly get rid of some interrupt 
> events as well.
>
> Jan
>
Greetings again,

If cross-link.c is not stalling, but the CSmith application hangs on
startup when using similar settings to what cross-link.c is using, it
tells me that understanding why this "hang on startup" is happening
would be a good idea.  I know this has happened to me when I got an
event from a UART that my code did not handle, and because I did not
handle it, the event continued to fire over and over - a hang. I
theorized that perhaps there's an issue with there being stale data
or a data overrun condition that exists when the app starts up that's
causing this hang.  In either case, it sounds as though the difference
in settings between CSmith app and cross-link.c might be a key factor.

I went back to the previous email trail, and if I interpreted it
correctly, the overall data rate is only about 80% of 115Kbaud. This
suggests that every time there is a write, the 4K software buffer in
the driver should be completely empty - as should the TX FIFO. The
only time that won't be true is when the transmit processing got
stalled (by loss of interrupt, or whatever).

I would be interested to see what happens if the CSmith app
were to be modified to write one byte at a time, with no delay
between rt_dev_write() calls.

Finally, some searching shows that back when the original National
Semiconductor 16550[A] UARTs were first being "cloned" by other
vendors, National created a program called "COMTEST" that was
designed to reveal the "misbehaviour" of those competing chips by
doing extensive testing of the timing and other characteristics and
how it deviated from "the real thing".  I wonder if anyone in this
group knows where a copy of that program (or a more modern version)
might exist?

Regards,
Steve



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: rt_dev_send() stalls periodic task
  2019-04-22 19:51               ` Steve Freyder
@ 2019-04-22 20:58                 ` Steve Freyder
  2019-04-22 22:56                   ` C Smith
  0 siblings, 1 reply; 21+ messages in thread
From: Steve Freyder @ 2019-04-22 20:58 UTC (permalink / raw)
  Cc: xenomai

On 4/22/2019 2:51 PM, Steve Freyder via Xenomai wrote:
> On 4/22/2019 1:45 AM, Jan Kiszka wrote:
>> On 22.04.19 08:40, C Smith via Xenomai wrote:
>>> Thanks for your insight, Steve. I didn't realize rt_dev_write() doesnt
>>> actually stall until it is called many times and the 4K TX buffer gets
>>> full. (is that right Jan?)
>>> It that is the case, sure I could find a way to check the TX buffer 
>>> fill
>>> level to prevent my app from stalling.
>>>
>>> I rewrote the xeno_16550A driver RTSER_RTIOC_GET_STATUS ioctl to 
>>> return to
>>> userspace the contents of the IIR and the IER too.
>>> I'm getting IIR = 0b 0001 0100, so the source of the latest 
>>> interrupt is a
>>> RX (not surprising, as I'm doing full duplex) and there is no THRE
>>> interrupt pending.
>>> So regardless of the ultimate cause, this state will never empty the TX
>>> buffer.
>>>
>>> I think my only choice is to try something I had to do once before on a
>>> similarly misbehaving serial port: I'll rewrite the xeno_16550A 
>>> interrupt
>>> handlers to redundantly check for data pending in the TX buffer 
>>> whenever
>>> any interrupt like an RX interrupt happens. I do have bidirectional 
>>> traffic
>>> after all, so the driver will wake up frequently and keep the TX data
>>> transmitting.
>>>
>>> Interesting enough, the stall problem did not occur when I used the 
>>> sample
>>> serial code provided by xenomai: cross-link.c . I also rewrote 
>>> cross-link.c
>>> to send a 72 byte packet and receive on the same port (I installed a
>>> physical loopback device on the serial port). No stalls for 12+ 
>>> hours with
>>> packets streaming at 100 Hz.
>>> The only difference in the serial configuration between that 
>>> cross-link.c
>>> app and my app was :
>>> struct rtser_config :
>>>          .rx_timeout        = RTSER_DEF_TIMEOUT  // infinite , no 
>>> stall for
>>> many hours in cross-link.c
>>> versus:
>>>          .rx_timeout        = 500000   // 500us, stalls within an 
>>> hour in my
>>> app
>>> I don't know why an RX setting affects TX behavior. I also can't use
>>> RTSER_DEF_TIMEOUT in my application or it dies when it starts up - 
>>> no clue
>>> why.  But I did try setting
>>>    .rx_timeout        = 5000000   // 5 ms. my app doesnt stall for 
>>> several
>>> hours
>>> and though that did not cause the serial to stall in my app for several
>>> hours of testing, it is just open-loop finger-crossing, and not a real
>>> solution.
>>> I need the TX interrupts to fire reliably. So I think I must rewrite 
>>> that
>>> interrupt handler, as above.
>>>
>>
>> I think we have a race between rt_16550_write filling the software 
>> queue that
>> the tx interrupt is supposed to write out and the latter already firing,
>> consuming that event without seeing the queue filled. I'll think 
>> about a better
>> algorithm tomorrow, one that can possibly get rid of some interrupt 
>> events as well.
>>
>> Jan
>>
> Greetings again,
>
> If cross-link.c is not stalling, but the CSmith application hangs on
> startup when using similar settings to what cross-link.c is using, it
> tells me that understanding why this "hang on startup" is happening
> would be a good idea.  I know this has happened to me when I got an
> event from a UART that my code did not handle, and because I did not
> handle it, the event continued to fire over and over - a hang. I
> theorized that perhaps there's an issue with there being stale data
> or a data overrun condition that exists when the app starts up that's
> causing this hang.  In either case, it sounds as though the difference
> in settings between CSmith app and cross-link.c might be a key factor.
>
> I went back to the previous email trail, and if I interpreted it
> correctly, the overall data rate is only about 80% of 115Kbaud. This
> suggests that every time there is a write, the 4K software buffer in
> the driver should be completely empty - as should the TX FIFO. The
> only time that won't be true is when the transmit processing got
> stalled (by loss of interrupt, or whatever).
>
> I would be interested to see what happens if the CSmith app
> were to be modified to write one byte at a time, with no delay
> between rt_dev_write() calls.
>
> Finally, some searching shows that back when the original National
> Semiconductor 16550[A] UARTs were first being "cloned" by other
> vendors, National created a program called "COMTEST" that was
> designed to reveal the "misbehaviour" of those competing chips by
> doing extensive testing of the timing and other characteristics and
> how it deviated from "the real thing".  I wonder if anyone in this
> group knows where a copy of that program (or a more modern version)
> might exist?
>
> Regards,
> Steve
>
>
Apologies, I said "hangs on startup" but the original statement was
"dies on startup".  So the theory was that if that were fixed, and
the timeout was RTSER_DEF_TIMEOUT like it is in cross-link.c, that
this might solve the problem.



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: rt_dev_send() stalls periodic task
  2019-04-22 20:58                 ` Steve Freyder
@ 2019-04-22 22:56                   ` C Smith
  2019-04-22 23:44                     ` Steve Freyder
  0 siblings, 1 reply; 21+ messages in thread
From: C Smith @ 2019-04-22 22:56 UTC (permalink / raw)
  To: Steve Freyder; +Cc: xenomai

Please don't think the cross-link.c app config has the magic answer.
Changing RX timeouts to prevent TX stalls would be an open loop hack that
might fail the serial traffic jitters differently.   The most suspicious
difference between the two apps is that : cross-link.c behaves very
regularly in terms of timing, because it receives its own transmissions.
My app receives packets from another computer asynchronously (full duplex).
So there is likely a random timing anomaly causing the problem.  Any
properly working UART and driver should be able to handle this, National
Semiconductor would say. But I have a strange serial port, and it is
tricking the xeno_16550A driver.

Jan alluded to a buffer-filling race condition in his comment.
I also fear that Receive interrupt handlers are somehow clearing Transmit
interrupts?

I didn't look at the stack trace when my app hung at startup due to that
infinite RX config. BTW there was no serial traffic whatsover during this
hang. I could try it again and look at the stack...
Also, it would be too hard to rewrite my apps to send single bytes. Its two
computers, rigid packet protocols & CRCs etc. Lots and lots of code. Thats
why I did that test app.

-C Smith

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: rt_dev_send() stalls periodic task
  2019-04-22 22:56                   ` C Smith
@ 2019-04-22 23:44                     ` Steve Freyder
  0 siblings, 0 replies; 21+ messages in thread
From: Steve Freyder @ 2019-04-22 23:44 UTC (permalink / raw)
  To: C Smith; +Cc: xenomai

On 4/22/2019 5:56 PM, C Smith wrote:
> Please don't think the cross-link.c app config has the magic answer. 
> Changing RX timeouts to prevent TX stalls would be an open loop hack 
> that might fail the serial traffic jitters differently.   The most 
> suspicious difference between the two apps is that : cross-link.c 
> behaves very regularly in terms of timing, because it receives its own 
> transmissions.   My app receives packets from another computer 
> asynchronously (full duplex). So there is likely a random timing 
> anomaly causing the problem.  Any properly working UART and driver 
> should be able to handle this, National Semiconductor would say. But I 
> have a strange serial port, and it is tricking the xeno_16550A driver.
>
> Jan alluded to a buffer-filling race condition in his comment.
> I also fear that Receive interrupt handlers are somehow clearing 
> Transmit interrupts?
>
> I didn't look at the stack trace when my app hung at startup due to 
> that infinite RX config. BTW there was no serial traffic whatsover 
> during this hang. I could try it again and look at the stack...
> Also, it would be too hard to rewrite my apps to send single bytes. 
> Its two computers, rigid packet protocols & CRCs etc. Lots and lots of 
> code. Thats why I did that test app.
>
> -C Smith
Anything that can prove that there's no hardware problem causing
lost TX interrupts is good.  I remember you said that you had three
serial ports, and one never failed like this, the other two *do*
fail like this.  Isn't that true?  That seems important.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: rt_dev_send() stalls periodic task
  2019-04-22  6:45             ` Jan Kiszka
  2019-04-22 19:51               ` Steve Freyder
@ 2019-04-23 12:15               ` Jan Kiszka
  2019-04-24  6:53                 ` C Smith
  2019-04-25  7:15                 ` C Smith
  1 sibling, 2 replies; 21+ messages in thread
From: Jan Kiszka @ 2019-04-23 12:15 UTC (permalink / raw)
  To: C Smith, Steve Freyder; +Cc: xenomai

On 22.04.19 08:45, Jan Kiszka via Xenomai wrote:
> On 22.04.19 08:40, C Smith via Xenomai wrote:
>> Thanks for your insight, Steve. I didn't realize rt_dev_write() doesnt
>> actually stall until it is called many times and the 4K TX buffer gets
>> full. (is that right Jan?)
>> It that is the case, sure I could find a way to check the TX buffer fill
>> level to prevent my app from stalling.
>>
>> I rewrote the xeno_16550A driver RTSER_RTIOC_GET_STATUS ioctl to return to
>> userspace the contents of the IIR and the IER too.
>> I'm getting IIR = 0b 0001 0100, so the source of the latest interrupt is a
>> RX (not surprising, as I'm doing full duplex) and there is no THRE
>> interrupt pending.
>> So regardless of the ultimate cause, this state will never empty the TX
>> buffer.
>>
>> I think my only choice is to try something I had to do once before on a
>> similarly misbehaving serial port: I'll rewrite the xeno_16550A interrupt
>> handlers to redundantly check for data pending in the TX buffer whenever
>> any interrupt like an RX interrupt happens. I do have bidirectional traffic
>> after all, so the driver will wake up frequently and keep the TX data
>> transmitting.
>>
>> Interesting enough, the stall problem did not occur when I used the sample
>> serial code provided by xenomai: cross-link.c . I also rewrote cross-link.c
>> to send a 72 byte packet and receive on the same port (I installed a
>> physical loopback device on the serial port). No stalls for 12+ hours with
>> packets streaming at 100 Hz.
>> The only difference in the serial configuration between that cross-link.c
>> app and my app was :
>> struct rtser_config :
>>          .rx_timeout        = RTSER_DEF_TIMEOUT  // infinite ,  no stall for
>> many hours in cross-link.c
>> versus:
>>          .rx_timeout        = 500000   // 500us, stalls within an hour in my
>> app
>> I don't know why an RX setting affects TX behavior. I also can't use
>> RTSER_DEF_TIMEOUT in my application or it dies when it starts up - no clue
>> why.  But I did try setting
>>    .rx_timeout        = 5000000   // 5 ms. my app doesnt stall for several
>> hours
>> and though that did not cause the serial to stall in my app for several
>> hours of testing, it is just open-loop finger-crossing, and not a real
>> solution.
>> I need the TX interrupts to fire reliably. So I think I must rewrite that
>> interrupt handler, as above.
>>
> 
> I think we have a race between rt_16550_write filling the software queue that
> the tx interrupt is supposed to write out and the latter already firing,
> consuming that event without seeing the queue filled. I'll think about a better
> algorithm tomorrow, one that can possibly get rid of some interrupt events as well.
> 

Could you try this diff?

diff --git a/kernel/drivers/serial/16550A.c b/kernel/drivers/serial/16550A.c
index 24415cf67c..369d65be66 100644
--- a/kernel/drivers/serial/16550A.c
+++ b/kernel/drivers/serial/16550A.c
@@ -190,7 +190,7 @@ static inline int rt_16550_rx_interrupt(struct rt_16550_context *ctx,
 	return rbytes;
 }
 
-static inline void rt_16550_tx_interrupt(struct rt_16550_context *ctx)
+static void rt_16550_tx_fill(struct rt_16550_context *ctx)
 {
 	int c;
 	int count;
@@ -248,7 +248,7 @@ static int rt_16550_interrupt(rtdm_irq_t * irq_context)
 		} else if (iir == IIR_STAT)
 			rt_16550_stat_interrupt(ctx);
 		else if (iir == IIR_TX)
-			rt_16550_tx_interrupt(ctx);
+			rt_16550_tx_fill(ctx);
 		else if (iir == IIR_MODEM) {
 			modem = rt_16550_reg_in(mode, base, MSR);
 			if (modem & (modem << 4))
@@ -951,6 +951,7 @@ ssize_t rt_16550_write(struct rtdm_fd *fd, const void *buf, size_t nbyte)
 	int block;
 	int subblock;
 	int out_pos;
+	int lsr;
 	char *in_pos = (char *)buf;
 	rtdm_toseq_t timeout_seq;
 	ssize_t ret;
@@ -1027,11 +1028,18 @@ ssize_t rt_16550_write(struct rtdm_fd *fd, const void *buf, size_t nbyte)
 			    (ctx->out_tail + block) & (OUT_BUFFER_SIZE - 1);
 			ctx->out_npend += block;
 
-			/* unmask tx interrupt */
-			ctx->ier_status |= IER_TX;
-			rt_16550_reg_out(rt_16550_io_mode_from_ctx(ctx),
-					 ctx->base_addr, IER,
-					 ctx->ier_status);
+			lsr = rt_16550_reg_in(rt_16550_io_mode_from_ctx(ctx),
+					      ctx->base_addr, LSR);
+			if (lsr & RTSER_LSR_THR_EMTPY)
+				rt_16550_tx_fill(ctx);
+
+			if (ctx->out_npend > 0) {
+				/* unmask tx interrupt */
+				ctx->ier_status |= IER_TX;
+				rt_16550_reg_out(rt_16550_io_mode_from_ctx(ctx),
+						ctx->base_addr, IER,
+						ctx->ier_status);
+			}
 
 			rtdm_lock_put_irqrestore(&ctx->lock, lock_ctx);
 			continue;

Only compile-tested so far. In theory, it should plug that race and
avoid enabling the TX interrupt in case the FIFO can take the complete
write directly.

Thanks,
Jan

-- 
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: rt_dev_send() stalls periodic task
  2019-04-23 12:15               ` Jan Kiszka
@ 2019-04-24  6:53                 ` C Smith
  2019-04-25  7:15                 ` C Smith
  1 sibling, 0 replies; 21+ messages in thread
From: C Smith @ 2019-04-24  6:53 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Steve Freyder, xenomai

Thanks very much Jan, I'll try your patch tomorrow.

In the meantime it may confirm your suspicions to hear this :
I hacked the xeno_16550A.c driver to poll the transmit buffer every time
there was an RX interrupt (I have a bidirectional application). If there
was any data waiting just call the TX interrupt routine again. Essentially:
if (ctx->out_npend > 0)
   rt_16550_tx_interrupt(ctx)
With this modified driver I was able to stream data successfully without
any stalls for 8+ hours. An oscilloscope showed there were only occasional
delays in sending the data, like an extra 800us.
That seems to confirm that the UART is still functioning OK at the hardware
level during the problem, it just needs to be "reminded" to transmit. Your
patch may be all it needs...

-C Smith

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: rt_dev_send() stalls periodic task
  2019-04-23 12:15               ` Jan Kiszka
  2019-04-24  6:53                 ` C Smith
@ 2019-04-25  7:15                 ` C Smith
  2019-04-25  8:23                   ` Jan Kiszka
  1 sibling, 1 reply; 21+ messages in thread
From: C Smith @ 2019-04-25  7:15 UTC (permalink / raw)
  To: Jan Kiszka, Xenomai List; +Cc: Steve Freyder, w1

Hi Jan,

Your patch worked somewhat but not completely. It prevents my app from
stalling forever, but I caugh the serial transmission itself stalling on
the oscilloscope for quite a long time. My 72 byte TX packet from the
xenomai periodic task gets cut in half and there is no transmission for
7msec, then the transmission resumes. (I'll send you a screenshot)
(Note that I am on xeno 2.6.5 so I merged your 3.x patch above into
16550A.c manually.)

I'm currently doing a 12 hour test of your patch plus mine. In my patch I
check during every RX interrupt to see if I need to call rt_16550_tx_fill()
too. I know that doesn't work for others but my traffic is full duplex so
this test will tell us something if it works and maybe we can ultimately
get the same behavior without my hack.

Also, I made the *.rx_timeout/.tx_timeout *change Jeff found, and it had
the obvious effect. I can make a patch for xeno 2.6.5 if you want. But I'll
point out that this fix may break peoples code functionally, so it may be a
bad idea to fix it on 2.x. Older code was written with a dependence on a
truly different timeout. It broke my app to fix this because there was
suddenly a new unexpected timeout. What's your policy on this issue?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: rt_dev_send() stalls periodic task
  2019-04-25  7:15                 ` C Smith
@ 2019-04-25  8:23                   ` Jan Kiszka
  2019-04-26  0:59                     ` C Smith
  0 siblings, 1 reply; 21+ messages in thread
From: Jan Kiszka @ 2019-04-25  8:23 UTC (permalink / raw)
  To: C Smith, Xenomai List; +Cc: Steve Freyder, w1

On 25.04.19 09:15, C Smith wrote:
> Hi Jan,
> 
> Your patch worked somewhat but not completely. It prevents my app from stalling 
> forever, but I caugh the serial transmission itself stalling on the oscilloscope 
> for quite a long time. My 72 byte TX packet from the xenomai periodic task gets 
> cut in half and there is no transmission for 7msec, then the transmission 
> resumes. (I'll send you a screenshot)

What is driver and application state during that phase? Who is waiting on what? 
This will be the key to resolve that issue as I'm not yet seeing another mistake 
in the driver.

> (Note that I am on xeno 2.6.5 so I merged your 3.x patch above into 16550A.c 
> manually.)
> 
> I'm currently doing a 12 hour test of your patch plus mine. In my patch I check 
> during every RX interrupt to see if I need to call rt_16550_tx_fill() too. I 
> know that doesn't work for others but my traffic is full duplex so this test 
> will tell us something if it works and maybe we can ultimately get the same 
> behavior without my hack.
> 
> Also, I made the /.rx_timeout/.tx_timeout /change Jeff found, and it had the 
> obvious effect. I can make a patch for xeno 2.6.5 if you want. But I'll point 
> out that this fix may break peoples code functionally, so it may be a bad idea 
> to fix it on 2.x. Older code was written with a dependence on a truly different 
> timeout. It broke my app to fix this because there was suddenly a new unexpected 
> timeout. What's your policy on this issue?

The 2.6 repo won't be touched anymore, it's officially dead. If course, you can 
share your patch on the list in case there are other remaining users.

Jan

-- 
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: rt_dev_send() stalls periodic task
  2019-04-25  8:23                   ` Jan Kiszka
@ 2019-04-26  0:59                     ` C Smith
  2019-04-26 16:38                       ` Jan Kiszka
  0 siblings, 1 reply; 21+ messages in thread
From: C Smith @ 2019-04-26  0:59 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Xenomai List, Steve Freyder, w1

On Thu, Apr 25, 2019 at 1:23 AM Jan Kiszka <jan.kiszka@siemens.com> wrote:

> On 25.04.19 09:15, C Smith wrote:
> > Hi Jan,
> >
> > Your patch worked somewhat but not completely. It prevents my app from
> stalling
> > forever, but I caugh the serial transmission itself stalling on the
> oscilloscope
> > for quite a long time. My 72 byte TX packet from the xenomai periodic
> task gets
> > cut in half and there is no transmission for 7msec, then the
> transmission
> > resumes. (I'll send you a screenshot)
>
> What is driver and application state during that phase? Who is waiting on
> what?
> This will be the key to resolve that issue as I'm not yet seeing another
> mistake
> in the driver.
>

I don't think there is a bug in the serial driver, per se, but my strange
UART requires more from a driver to prevent stalls.
This is a BCM corp 'BCM87Q' industrial motherboard. They are still sold,
not yet EOL.

We do know a lot about the state the serial driver is in: It is just
waiting, thinking it doesn't have any more bytes to transmit. Remember in
previous tests the IIR indicated no pending bytes in the THR. I've
demonstrated how to get past this state with my TX "polling patch".  I ran
my latest test for 12+ hours where I was using your patch plus my polling
patch and there were no stalls whatsoever of the serial driver, as verified
by an Oscilloscope which triggers on a TX stall. The maximum inter-packet
jitter of my TX packet was also fairly low, at <= 450us. In my polling
patch, during a RX interrupt, the code redundantly checks the high level
transmit buffer to see if rt_16550_tx_fill() should be called. Sure, this
workaround only helps when you have full-duplex communications, it would
not help during simplex communications.

Since a device driver can't be reliably polled, I'd prefer some
self-correcting mechanism in the driver which set a callback when it thinks
it has transmitted the last byte, and wakes up and checks one more time
about 100us later to see if it needs to transmit anything else.

> Also, I made the /.rx_timeout/.tx_timeout /change Jeff found, and it had
> the
> > obvious effect. I can make a patch for xeno 2.6.5 if you want. But I'll
> point
> > out that this fix may break peoples code functionally, so it may be a
> bad idea
> > to fix it on 2.x. Older code was written with a dependence on a truly
> different
> > timeout. It broke my app to fix this because there was suddenly a new
> unexpected
> > timeout. What's your policy on this issue?
>
> The 2.6 repo won't be touched anymore, it's officially dead. If course,
> you can
> share your patch on the list in case there are other remaining users.
>

Oh your fine work in 2.6 is very much alive!
But I can agree that adding fixes to it is not appropriate.

-C Smith

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: rt_dev_send() stalls periodic task
  2019-04-26  0:59                     ` C Smith
@ 2019-04-26 16:38                       ` Jan Kiszka
  0 siblings, 0 replies; 21+ messages in thread
From: Jan Kiszka @ 2019-04-26 16:38 UTC (permalink / raw)
  To: C Smith; +Cc: Xenomai List, Steve Freyder, w1

On 26.04.19 02:59, C Smith wrote:
> On Thu, Apr 25, 2019 at 1:23 AM Jan Kiszka <jan.kiszka@siemens.com 
> <mailto:jan.kiszka@siemens.com>> wrote:
> 
>     On 25.04.19 09:15, C Smith wrote:
>      > Hi Jan,
>      >
>      > Your patch worked somewhat but not completely. It prevents my app from
>     stalling
>      > forever, but I caugh the serial transmission itself stalling on the
>     oscilloscope
>      > for quite a long time. My 72 byte TX packet from the xenomai periodic
>     task gets
>      > cut in half and there is no transmission for 7msec, then the transmission
>      > resumes. (I'll send you a screenshot)
> 
>     What is driver and application state during that phase? Who is waiting on what?
>     This will be the key to resolve that issue as I'm not yet seeing another
>     mistake
>     in the driver.
> 
> 
> I don't think there is a bug in the serial driver, per se, but my strange UART 
> requires more from a driver to prevent stalls.
> This is a BCM corp 'BCM87Q' industrial motherboard. They are still sold, not yet 
> EOL.
> 
> We do know a lot about the state the serial driver is in: It is just waiting, 
> thinking it doesn't have any more bytes to transmit. Remember in previous tests 
> the IIR indicated no pending bytes in the THR. I've demonstrated how to get past 
> this state with my TX "polling patch".  I ran my latest test for 12+ hours where 
> I was using your patch plus my polling patch and there were no stalls whatsoever 
> of the serial driver, as verified by an Oscilloscope which triggers on a TX 
> stall. The maximum inter-packet jitter of my TX packet was also fairly low, at 
> <= 450us. In my polling patch, during a RX interrupt, the code redundantly 
> checks the high level transmit buffer to see if rt_16550_tx_fill() should be 
> called. Sure, this workaround only helps when you have full-duplex 
> communications, it would not help during simplex communications.
> 
> Since a device driver can't be reliably polled, I'd prefer some self-correcting 
> mechanism in the driver which set a callback when it thinks it has transmitted 
> the last byte, and wakes up and checks one more time about 100us later to see if 
> it needs to transmit anything else.

I'd prefer to install any watchdog for potential hardware issues until
we really know they aren't software races.

If there a chance to either break a trace or record the full run when
the issue happens? Then you could try this instrumentation, together with ftrace (trace-cmd record -e "cobalt*"):

diff --git a/kernel/drivers/serial/16550A.c b/kernel/drivers/serial/16550A.c
index 81acc6344e..504d85ccbe 100644
--- a/kernel/drivers/serial/16550A.c
+++ b/kernel/drivers/serial/16550A.c
@@ -197,6 +197,7 @@ static void rt_16550_tx_fill(struct rt_16550_context *ctx)
 	unsigned long base = ctx->base_addr;
 	int mode = rt_16550_io_mode_from_ctx(ctx);
 
+	trace_printk("tx_fill, out_npend: %ld", ctx->out_npend);
 /*	if (uart->modem & MSR_CTS)*/
 	{
 		for (count = ctx->tx_fifo;
@@ -239,6 +240,7 @@ static int rt_16550_interrupt(rtdm_irq_t * irq_context)
 
 	while (1) {
 		iir = rt_16550_reg_in(mode, base, IIR) & IIR_MASK;
+		trace_printk("IIR: 0x%x", iir);
 		if (iir & IIR_PIRQ)
 			break;
 
@@ -284,6 +286,7 @@ static int rt_16550_interrupt(rtdm_irq_t * irq_context)
 	}
 
 	if ((ctx->ier_status & IER_TX) && (ctx->out_npend == 0)) {
+		trace_printk("IER_TX off");
 		/* mask transmitter empty interrupt */
 		ctx->ier_status &= ~IER_TX;
 
@@ -1030,10 +1033,12 @@ ssize_t rt_16550_write(struct rtdm_fd *fd, const void *buf, size_t nbyte)
 
 			lsr = rt_16550_reg_in(rt_16550_io_mode_from_ctx(ctx),
 					      ctx->base_addr, LSR);
+			trace_printk("LSR: 0x%x", lsr);
 			if (lsr & RTSER_LSR_THR_EMTPY)
 				rt_16550_tx_fill(ctx);
 
 			if (ctx->out_npend > 0 && !(ctx->ier_status & IER_TX)) {
+				trace_printk("IER_TX on, out_npend: %lu", ctx->out_npend);
 				/* unmask tx interrupt */
 				ctx->ier_status |= IER_TX;
 				rt_16550_reg_out(rt_16550_io_mode_from_ctx(ctx),


Jan

-- 
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: rt_dev_send() stalls periodic task
  2019-04-24 14:36 ` Jan Kiszka
@ 2019-04-26  0:41   ` Jeff Webb
  0 siblings, 0 replies; 21+ messages in thread
From: Jeff Webb @ 2019-04-26  0:41 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Wednesday, April 24, 2019 9:36 AM, Jan Kiszka <jan.kiszka@siemens.com> wrote:

> On 24.04.19 15:05, Jeff Webb via Xenomai wrote:
>
> > > The only difference in the serial configuration between that cross-link.c
> > > app and my app was :
> > > struct rtser_config :
> > > .rx_timeout = RTSER_DEF_TIMEOUT // infinite , no stall for
> > > many hours in cross-link.c
> > > versus:
> > > .rx_timeout = 500000 // 500us, stalls within an hour in my
> > > app
> > > I don't know why an RX setting affects TX behavior. I also can't use
> > > RTSER_DEF_TIMEOUT in my application or it dies when it starts up - no clue
> > > why. But I did try setting
> > > .rx_timeout = 5000000 // 5 ms. my app doesnt stall for several
> > > hours
> >
> > This may not be related, but this jogged my memory about a problem I had years ago regarding rx/tx timeouts affecting the other operation. I just took a look at rt_16550_write(), and I see:
> > rtdm_toseq_init(&timeout_seq, ctx->config.rx_timeout);
> > /* Make write operation atomic. */
> > ret = rtdm_mutex_timedlock(&ctx->out_lock, ctx->config.rx_timeout,
> > &timeout_seq);
> > This is the same code as in *_read(). I am wondering if this is a cut and paste error. It seems like these two lines should use tx_timeout instead, but I haven't looked into the code in detail. Maybe this is related, or maybe another bug?
>
> That is a good catch! Copy & paste would be my theory as well... Wanna write a
> patch?

I can't test this easily right now, but I'll try to post a patch at least.

-Jeff



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: rt_dev_send() stalls periodic task
  2019-04-24 13:05 Jeff Webb
@ 2019-04-24 14:36 ` Jan Kiszka
  2019-04-26  0:41   ` Jeff Webb
  0 siblings, 1 reply; 21+ messages in thread
From: Jan Kiszka @ 2019-04-24 14:36 UTC (permalink / raw)
  To: Jeff Webb, xenomai

On 24.04.19 15:05, Jeff Webb via Xenomai wrote:
>> The only difference in the serial configuration between that cross-link.c
>> app and my app was :
>> struct rtser_config :
>>           .rx_timeout        = RTSER_DEF_TIMEOUT  // infinite ,  no stall for
>> many hours in cross-link.c
>> versus:
>>           .rx_timeout        = 500000   // 500us, stalls within an hour in my
>> app
>> I don't know why an RX setting affects TX behavior. I also can't use
>> RTSER_DEF_TIMEOUT in my application or it dies when it starts up - no clue
>> why.  But I did try setting
>>     .rx_timeout        = 5000000   // 5 ms. my app doesnt stall for several
>> hours
> 
> This may not be related, but this jogged my memory about a problem I had years ago regarding rx/tx timeouts affecting the other operation.  I just took a look at rt_16550_write(), and I see:
> 
>    rtdm_toseq_init(&timeout_seq, ctx->config.rx_timeout);
> 
>    /* Make write operation atomic. */
>    ret = rtdm_mutex_timedlock(&ctx->out_lock, ctx->config.rx_timeout,
>       &timeout_seq);
> 
> This is the same code as in *_read().  I am wondering if this is a cut and paste error.  It seems like these two lines should use tx_timeout instead, but I haven't looked into the code in detail.  Maybe this is related, or maybe another bug?

That is a good catch! Copy & paste would be my theory as well... Wanna write a 
patch?

I don't think it will change the picture for the stalled tx issue, though.

Jan

-- 
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: rt_dev_send() stalls periodic task
@ 2019-04-24 13:05 Jeff Webb
  2019-04-24 14:36 ` Jan Kiszka
  0 siblings, 1 reply; 21+ messages in thread
From: Jeff Webb @ 2019-04-24 13:05 UTC (permalink / raw)
  To: xenomai

> The only difference in the serial configuration between that cross-link.c
> app and my app was :
> struct rtser_config :
>          .rx_timeout        = RTSER_DEF_TIMEOUT  // infinite ,  no stall for
> many hours in cross-link.c
> versus:
>          .rx_timeout        = 500000   // 500us, stalls within an hour in my
> app
> I don't know why an RX setting affects TX behavior. I also can't use
> RTSER_DEF_TIMEOUT in my application or it dies when it starts up - no clue
> why.  But I did try setting
>    .rx_timeout        = 5000000   // 5 ms. my app doesnt stall for several
> hours

This may not be related, but this jogged my memory about a problem I had years ago regarding rx/tx timeouts affecting the other operation.  I just took a look at rt_16550_write(), and I see:

  rtdm_toseq_init(&timeout_seq, ctx->config.rx_timeout);

  /* Make write operation atomic. */
  ret = rtdm_mutex_timedlock(&ctx->out_lock, ctx->config.rx_timeout,
     &timeout_seq);

This is the same code as in *_read().  I am wondering if this is a cut and paste error.  It seems like these two lines should use tx_timeout instead, but I haven't looked into the code in detail.  Maybe this is related, or maybe another bug?

Thanks,

-Jeff Webb



^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2019-04-26 16:38 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-15 17:28 rt_dev_send() stalls periodic task C Smith
2019-04-16  8:03 ` Jan Kiszka
2019-04-18  6:42   ` C Smith
2019-04-18  8:36     ` Jan Kiszka
2019-04-21  4:33       ` C Smith
2019-04-21 20:10         ` Steve Freyder
2019-04-22  6:40           ` C Smith
2019-04-22  6:45             ` Jan Kiszka
2019-04-22 19:51               ` Steve Freyder
2019-04-22 20:58                 ` Steve Freyder
2019-04-22 22:56                   ` C Smith
2019-04-22 23:44                     ` Steve Freyder
2019-04-23 12:15               ` Jan Kiszka
2019-04-24  6:53                 ` C Smith
2019-04-25  7:15                 ` C Smith
2019-04-25  8:23                   ` Jan Kiszka
2019-04-26  0:59                     ` C Smith
2019-04-26 16:38                       ` Jan Kiszka
2019-04-24 13:05 Jeff Webb
2019-04-24 14:36 ` Jan Kiszka
2019-04-26  0:41   ` Jeff Webb

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.