All of lore.kernel.org
 help / color / mirror / Atom feed
* Re: [Linuxptp-devel] strangeness
       [not found]       ` <20190228185758.6f37i2sfnr2mdrce@localhost>
@ 2019-03-01  6:24         ` Harini Katakam
  2019-03-06 23:06           ` Paul Thomas
  0 siblings, 1 reply; 13+ messages in thread
From: Harini Katakam @ 2019-03-01  6:24 UTC (permalink / raw)
  To: Richard Cochran, Paul Thomas; +Cc: linuxptp-devel, netdev

+netdev

Hi Paul,
On Fri, Mar 1, 2019 at 12:29 AM Richard Cochran
<richardcochran@gmail.com> wrote:
>
> On Thu, Feb 28, 2019 at 12:33:26PM -0500, Paul Thomas wrote:
> > Yes changing it to TSTAMP_ALL_PTP_FRAMES instead of TSTAMP_ALL_FRAMES
> > does seem to fix the ssh issue. My worry is that there is still a bug
> > somewhere in the network stack that this is just masking.

Ok thanks.
One place to check in the driver will be:
if (gem_ptp_do_txstamp(queue, skb, desc) == 0) {
/* skb now belongs to timestamp buffer
* and will be removed later
*/
tx_skb->skb = NULL;
}
When all TX packets are timestamped, the skb always belongs to the
timestamp buffer.

>
> Or the HW isn't sending the frames in the first place.
>
> Check that first!

To check this, the statistics registers in MAC will be one way.
But if there is no TX completion interrupt, then I wouldn't expect
these statistics to increase either. The used bit status in BD dump
might be of more use.

I will also try to reproduce (with TX timestamp ALL) and see if any of
the above gives some clue.

Regards,
Harini

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Linuxptp-devel] strangeness
  2019-03-01  6:24         ` [Linuxptp-devel] strangeness Harini Katakam
@ 2019-03-06 23:06           ` Paul Thomas
  2019-03-07  5:32             ` Harini Katakam
  0 siblings, 1 reply; 13+ messages in thread
From: Paul Thomas @ 2019-03-06 23:06 UTC (permalink / raw)
  To: Harini Katakam; +Cc: Richard Cochran, linuxptp-devel, netdev

On Fri, Mar 1, 2019 at 1:24 AM Harini Katakam <harinik@xilinx.com> wrote:
>
> +netdev
>
> Hi Paul,
> On Fri, Mar 1, 2019 at 12:29 AM Richard Cochran
> <richardcochran@gmail.com> wrote:
> >
> > On Thu, Feb 28, 2019 at 12:33:26PM -0500, Paul Thomas wrote:
> > > Yes changing it to TSTAMP_ALL_PTP_FRAMES instead of TSTAMP_ALL_FRAMES
> > > does seem to fix the ssh issue. My worry is that there is still a bug
> > > somewhere in the network stack that this is just masking.
>
> Ok thanks.
> One place to check in the driver will be:
> if (gem_ptp_do_txstamp(queue, skb, desc) == 0) {
> /* skb now belongs to timestamp buffer
> * and will be removed later
> */
> tx_skb->skb = NULL;
> }
> When all TX packets are timestamped, the skb always belongs to the
> timestamp buffer.
>
> >
> > Or the HW isn't sending the frames in the first place.
> >
> > Check that first!
>
> To check this, the statistics registers in MAC will be one way.
> But if there is no TX completion interrupt, then I wouldn't expect
> these statistics to increase either. The used bit status in BD dump
> might be of more use.
>
> I will also try to reproduce (with TX timestamp ALL) and see if any of
> the above gives some clue.
>
> Regards,
> Harini

Hi Harini, any luck looking at this?

I didn't get very far, even in the "broken" state I see plenty of tx_frames:
root@xu5:/opt/linuxptp# ethtool -S eth0
NIC statistics:
     ...
     tx_frames: 39763
     ...

When you said "registers in the MAC" is ethtool -S displaying that?

-Paul

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Linuxptp-devel] strangeness
  2019-03-06 23:06           ` Paul Thomas
@ 2019-03-07  5:32             ` Harini Katakam
  2019-03-07 19:02               ` Paul Thomas
  0 siblings, 1 reply; 13+ messages in thread
From: Harini Katakam @ 2019-03-07  5:32 UTC (permalink / raw)
  To: Paul Thomas; +Cc: Richard Cochran, linuxptp-devel, netdev

Hi Paul,
On Thu, Mar 7, 2019 at 4:38 AM Paul Thomas <pthomas8589@gmail.com> wrote:
>
> On Fri, Mar 1, 2019 at 1:24 AM Harini Katakam <harinik@xilinx.com> wrote:
> >
> > +netdev
> >
> > Hi Paul,
> > On Fri, Mar 1, 2019 at 12:29 AM Richard Cochran
> > <richardcochran@gmail.com> wrote:
> > >
> > > On Thu, Feb 28, 2019 at 12:33:26PM -0500, Paul Thomas wrote:
> > > > Yes changing it to TSTAMP_ALL_PTP_FRAMES instead of TSTAMP_ALL_FRAMES
> > > > does seem to fix the ssh issue. My worry is that there is still a bug
> > > > somewhere in the network stack that this is just masking.
> >
> > Ok thanks.
> > One place to check in the driver will be:
> > if (gem_ptp_do_txstamp(queue, skb, desc) == 0) {
> > /* skb now belongs to timestamp buffer
> > * and will be removed later
> > */
> > tx_skb->skb = NULL;
> > }
> > When all TX packets are timestamped, the skb always belongs to the
> > timestamp buffer.
> >
> > >
> > > Or the HW isn't sending the frames in the first place.
> > >
> > > Check that first!
> >
> > To check this, the statistics registers in MAC will be one way.
> > But if there is no TX completion interrupt, then I wouldn't expect
> > these statistics to increase either. The used bit status in BD dump
> > might be of more use.
> >
> > I will also try to reproduce (with TX timestamp ALL) and see if any of
> > the above gives some clue.
> >
> > Regards,
> > Harini
>
> Hi Harini, any luck looking at this?

I'm sorry, I was not able to debug this further.

>
> I didn't get very far, even in the "broken" state I see plenty of tx_frames:
> root@xu5:/opt/linuxptp# ethtool -S eth0
> NIC statistics:
>      ...
>      tx_frames: 39763
>      ...
>
> When you said "registers in the MAC" is ethtool -S displaying that?

Yes, ethtool does display these statistics.
I was referring to the registers starting offset 0xFF0B0108 (for GEM0) here:
https://www.xilinx.com/html_docs/registers/ug1087/ug1087-zynq-ultrascale-registers.html
If you see this value increasing, then the MAC is transmitting successfully.
Although, I realize it could be other traffic. To see if specific
packets (for the
failed SSH connection) are not being queued, a BD dump might help.

Regards,
Harini

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Linuxptp-devel] strangeness
  2019-03-07  5:32             ` Harini Katakam
@ 2019-03-07 19:02               ` Paul Thomas
  2019-03-08  6:08                 ` Harini Katakam
  0 siblings, 1 reply; 13+ messages in thread
From: Paul Thomas @ 2019-03-07 19:02 UTC (permalink / raw)
  To: Harini Katakam; +Cc: Richard Cochran, linuxptp-devel, netdev

On Thu, Mar 7, 2019 at 12:32 AM Harini Katakam <harinik@xilinx.com> wrote:
>
> Hi Paul,
> On Thu, Mar 7, 2019 at 4:38 AM Paul Thomas <pthomas8589@gmail.com> wrote:
> >
> > On Fri, Mar 1, 2019 at 1:24 AM Harini Katakam <harinik@xilinx.com> wrote:
> > >
> > > +netdev
> > >
> > > Hi Paul,
> > > On Fri, Mar 1, 2019 at 12:29 AM Richard Cochran
> > > <richardcochran@gmail.com> wrote:
> > > >
> > > > On Thu, Feb 28, 2019 at 12:33:26PM -0500, Paul Thomas wrote:
> > > > > Yes changing it to TSTAMP_ALL_PTP_FRAMES instead of TSTAMP_ALL_FRAMES
> > > > > does seem to fix the ssh issue. My worry is that there is still a bug
> > > > > somewhere in the network stack that this is just masking.
> > >
> > > Ok thanks.
> > > One place to check in the driver will be:
> > > if (gem_ptp_do_txstamp(queue, skb, desc) == 0) {
> > > /* skb now belongs to timestamp buffer
> > > * and will be removed later
> > > */
> > > tx_skb->skb = NULL;
> > > }
> > > When all TX packets are timestamped, the skb always belongs to the
> > > timestamp buffer.
> > >
> > > >
> > > > Or the HW isn't sending the frames in the first place.
> > > >
> > > > Check that first!
> > >
> > > To check this, the statistics registers in MAC will be one way.
> > > But if there is no TX completion interrupt, then I wouldn't expect
> > > these statistics to increase either. The used bit status in BD dump
> > > might be of more use.
> > >
> > > I will also try to reproduce (with TX timestamp ALL) and see if any of
> > > the above gives some clue.
> > >
> > > Regards,
> > > Harini
> >
> > Hi Harini, any luck looking at this?
>
> I'm sorry, I was not able to debug this further.
>
> >
> > I didn't get very far, even in the "broken" state I see plenty of tx_frames:
> > root@xu5:/opt/linuxptp# ethtool -S eth0
> > NIC statistics:
> >      ...
> >      tx_frames: 39763
> >      ...
> >
> > When you said "registers in the MAC" is ethtool -S displaying that?
>
> Yes, ethtool does display these statistics.
> I was referring to the registers starting offset 0xFF0B0108 (for GEM0) here:
> https://www.xilinx.com/html_docs/registers/ug1087/ug1087-zynq-ultrascale-registers.html
> If you see this value increasing, then the MAC is transmitting successfully.
> Although, I realize it could be other traffic. To see if specific
> packets (for the
> failed SSH connection) are not being queued, a BD dump might help.
>
> Regards,
> Harini

OK, I think things are becoming more clear. After just doing ioctl(fd,
SIOCSHWTSTAMP, &ifreq) from userspace (tx_bd_control =
TSTAMP_ALL_FRAMES in macb_ptp.c) then with the nc experiment some udp
transmits do not make it to macb_start_xmit() until receive traffic on
the nc connection comes in (one-to-one, one new rx packet means one
old tx packet goes out).

Working setup:
Before the tx_bd_control = TSTAMP_ALL_FRAMES.
Every time I hit "sN Enter" from nc I see a macb_start_xmit
print_hex_dump() and I see the packet on the nc client side:
# nc -l -u -p 9999
...
s11
[  347.517080] macb_start_xmit data: 00000000: 20 b0 f7 04 0a 29 20 b0
f7 04 0a 26 08 00 45 00   ....) ....&..E.
s12
[  348.964369] macb_start_xmit data: 00000000: 20 b0 f7 04 0a 29 20 b0
f7 04 0a 26 08 00 45 00   ....) ....&..E.
...

Broken setup:
After the tx_bd_control = TSTAMP_ALL_FRAMES.
Not the first nc packet, but many of the subsequent ones never make it
to macb_start_xmit()
# nc -l -u -p 9999
...
s3
s4
s5
...
Eventually after I send data from the client nc I do see the
macb_start_xmit() lines.

Hopefully this helps point us in the right direction, I would very
much like to be able to do timestamping with my zynqmp board.

thanks,
Paul

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Linuxptp-devel] strangeness
  2019-03-07 19:02               ` Paul Thomas
@ 2019-03-08  6:08                 ` Harini Katakam
  2019-03-08 18:07                   ` Paul Thomas
  0 siblings, 1 reply; 13+ messages in thread
From: Harini Katakam @ 2019-03-08  6:08 UTC (permalink / raw)
  To: Paul Thomas; +Cc: Richard Cochran, linuxptp-devel, netdev

Hi Paul,
On Fri, Mar 8, 2019 at 12:33 AM Paul Thomas <pthomas8589@gmail.com> wrote:
>
> On Thu, Mar 7, 2019 at 12:32 AM Harini Katakam <harinik@xilinx.com> wrote:
> >
> > Hi Paul,
> > On Thu, Mar 7, 2019 at 4:38 AM Paul Thomas <pthomas8589@gmail.com> wrote:
> > >
> > > On Fri, Mar 1, 2019 at 1:24 AM Harini Katakam <harinik@xilinx.com> wrote:
> > > >
> > > > +netdev
> > > >
> > > > Hi Paul,
> > > > On Fri, Mar 1, 2019 at 12:29 AM Richard Cochran
> > > > <richardcochran@gmail.com> wrote:
> > > > >
> > > > > On Thu, Feb 28, 2019 at 12:33:26PM -0500, Paul Thomas wrote:
> > > > > > Yes changing it to TSTAMP_ALL_PTP_FRAMES instead of TSTAMP_ALL_FRAMES
> > > > > > does seem to fix the ssh issue. My worry is that there is still a bug
> > > > > > somewhere in the network stack that this is just masking.
> > > >
> > > > Ok thanks.
> > > > One place to check in the driver will be:
> > > > if (gem_ptp_do_txstamp(queue, skb, desc) == 0) {
> > > > /* skb now belongs to timestamp buffer
> > > > * and will be removed later
> > > > */
> > > > tx_skb->skb = NULL;
> > > > }
> > > > When all TX packets are timestamped, the skb always belongs to the
> > > > timestamp buffer.
> > > >
> > > > >
> > > > > Or the HW isn't sending the frames in the first place.
> > > > >
> > > > > Check that first!
> > > >
> > > > To check this, the statistics registers in MAC will be one way.
> > > > But if there is no TX completion interrupt, then I wouldn't expect
> > > > these statistics to increase either. The used bit status in BD dump
> > > > might be of more use.
> > > >
> > > > I will also try to reproduce (with TX timestamp ALL) and see if any of
> > > > the above gives some clue.
> > > >
> > > > Regards,
> > > > Harini
> > >
> > > Hi Harini, any luck looking at this?
> >
> > I'm sorry, I was not able to debug this further.
> >
> > >
> > > I didn't get very far, even in the "broken" state I see plenty of tx_frames:
> > > root@xu5:/opt/linuxptp# ethtool -S eth0
> > > NIC statistics:
> > >      ...
> > >      tx_frames: 39763
> > >      ...
> > >
> > > When you said "registers in the MAC" is ethtool -S displaying that?
> >
> > Yes, ethtool does display these statistics.
> > I was referring to the registers starting offset 0xFF0B0108 (for GEM0) here:
> > https://www.xilinx.com/html_docs/registers/ug1087/ug1087-zynq-ultrascale-registers.html
> > If you see this value increasing, then the MAC is transmitting successfully.
> > Although, I realize it could be other traffic. To see if specific
> > packets (for the
> > failed SSH connection) are not being queued, a BD dump might help.
> >
> > Regards,
> > Harini
>
> OK, I think things are becoming more clear. After just doing ioctl(fd,
> SIOCSHWTSTAMP, &ifreq) from userspace (tx_bd_control =
> TSTAMP_ALL_FRAMES in macb_ptp.c) then with the nc experiment some udp
> transmits do not make it to macb_start_xmit() until receive traffic on
> the nc connection comes in (one-to-one, one new rx packet means one
> old tx packet goes out).

Could you please share any wireshark log or dump for what is being
received here?

>
> Working setup:
> Before the tx_bd_control = TSTAMP_ALL_FRAMES.
> Every time I hit "sN Enter" from nc I see a macb_start_xmit
> print_hex_dump() and I see the packet on the nc client side:
> # nc -l -u -p 9999
> ...
> s11
> [  347.517080] macb_start_xmit data: 00000000: 20 b0 f7 04 0a 29 20 b0
> f7 04 0a 26 08 00 45 00   ....) ....&..E.
> s12
> [  348.964369] macb_start_xmit data: 00000000: 20 b0 f7 04 0a 29 20 b0
> f7 04 0a 26 08 00 45 00   ....) ....&..E.
> ...
>
> Broken setup:
> After the tx_bd_control = TSTAMP_ALL_FRAMES.
> Not the first nc packet, but many of the subsequent ones never make it
> to macb_start_xmit()
> # nc -l -u -p 9999
> ...
> s3
> s4
> s5
> ...
> Eventually after I send data from the client nc I do see the
> macb_start_xmit() lines.

Thanks for this debug. If macb_start_xmit is never called, one of
the preceeding checks (such as if skb is present or if the TX queues
are off  etc)
should fail. I'm still tracing this but I'm not sure under what
circumstances only
some UDP packets will be prevented from being transmitted.
Just to be sure, could you please confirm you are not seeing any
"buffer exhausted" messaged from TX error tasks?

Regards,
Harini

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Linuxptp-devel] strangeness
  2019-03-08  6:08                 ` Harini Katakam
@ 2019-03-08 18:07                   ` Paul Thomas
  2019-03-08 21:41                     ` Paul Thomas
  0 siblings, 1 reply; 13+ messages in thread
From: Paul Thomas @ 2019-03-08 18:07 UTC (permalink / raw)
  To: Harini Katakam; +Cc: Richard Cochran, linuxptp-devel, netdev

[-- Attachment #1: Type: text/plain, Size: 5075 bytes --]

Hi Harini,

On Fri, Mar 8, 2019 at 1:08 AM Harini Katakam <harinik@xilinx.com> wrote:
>
> Hi Paul,
> On Fri, Mar 8, 2019 at 12:33 AM Paul Thomas <pthomas8589@gmail.com> wrote:
> >
> > On Thu, Mar 7, 2019 at 12:32 AM Harini Katakam <harinik@xilinx.com> wrote:
> > >
> > > Hi Paul,
> > > On Thu, Mar 7, 2019 at 4:38 AM Paul Thomas <pthomas8589@gmail.com> wrote:
> > > >
> > > > On Fri, Mar 1, 2019 at 1:24 AM Harini Katakam <harinik@xilinx.com> wrote:
> > > > >
> > > > > +netdev
> > > > >
> > > > > Hi Paul,
> > > > > On Fri, Mar 1, 2019 at 12:29 AM Richard Cochran
> > > > > <richardcochran@gmail.com> wrote:
> > > > > >
> > > > > > On Thu, Feb 28, 2019 at 12:33:26PM -0500, Paul Thomas wrote:
> > > > > > > Yes changing it to TSTAMP_ALL_PTP_FRAMES instead of TSTAMP_ALL_FRAMES
> > > > > > > does seem to fix the ssh issue. My worry is that there is still a bug
> > > > > > > somewhere in the network stack that this is just masking.
> > > > >
> > > > > Ok thanks.
> > > > > One place to check in the driver will be:
> > > > > if (gem_ptp_do_txstamp(queue, skb, desc) == 0) {
> > > > > /* skb now belongs to timestamp buffer
> > > > > * and will be removed later
> > > > > */
> > > > > tx_skb->skb = NULL;
> > > > > }
> > > > > When all TX packets are timestamped, the skb always belongs to the
> > > > > timestamp buffer.
> > > > >
> > > > > >
> > > > > > Or the HW isn't sending the frames in the first place.
> > > > > >
> > > > > > Check that first!
> > > > >
> > > > > To check this, the statistics registers in MAC will be one way.
> > > > > But if there is no TX completion interrupt, then I wouldn't expect
> > > > > these statistics to increase either. The used bit status in BD dump
> > > > > might be of more use.
> > > > >
> > > > > I will also try to reproduce (with TX timestamp ALL) and see if any of
> > > > > the above gives some clue.
> > > > >
> > > > > Regards,
> > > > > Harini
> > > >
> > > > Hi Harini, any luck looking at this?
> > >
> > > I'm sorry, I was not able to debug this further.
> > >
> > > >
> > > > I didn't get very far, even in the "broken" state I see plenty of tx_frames:
> > > > root@xu5:/opt/linuxptp# ethtool -S eth0
> > > > NIC statistics:
> > > >      ...
> > > >      tx_frames: 39763
> > > >      ...
> > > >
> > > > When you said "registers in the MAC" is ethtool -S displaying that?
> > >
> > > Yes, ethtool does display these statistics.
> > > I was referring to the registers starting offset 0xFF0B0108 (for GEM0) here:
> > > https://www.xilinx.com/html_docs/registers/ug1087/ug1087-zynq-ultrascale-registers.html
> > > If you see this value increasing, then the MAC is transmitting successfully.
> > > Although, I realize it could be other traffic. To see if specific
> > > packets (for the
> > > failed SSH connection) are not being queued, a BD dump might help.
> > >
> > > Regards,
> > > Harini
> >
> > OK, I think things are becoming more clear. After just doing ioctl(fd,
> > SIOCSHWTSTAMP, &ifreq) from userspace (tx_bd_control =
> > TSTAMP_ALL_FRAMES in macb_ptp.c) then with the nc experiment some udp
> > transmits do not make it to macb_start_xmit() until receive traffic on
> > the nc connection comes in (one-to-one, one new rx packet means one
> > old tx packet goes out).
>
> Could you please share any wireshark log or dump for what is being
> received here?

Here are two wireshark captures, the thing to note in the bad one is
that packets No. 5, 7, 9 from .102 to .103 were actually sent just
after packet No. 2 but they don't show up on the wire until the
packets the other way (one for one).

>
> >
> > Working setup:
> > Before the tx_bd_control = TSTAMP_ALL_FRAMES.
> > Every time I hit "sN Enter" from nc I see a macb_start_xmit
> > print_hex_dump() and I see the packet on the nc client side:
> > # nc -l -u -p 9999
> > ...
> > s11
> > [  347.517080] macb_start_xmit data: 00000000: 20 b0 f7 04 0a 29 20 b0
> > f7 04 0a 26 08 00 45 00   ....) ....&..E.
> > s12
> > [  348.964369] macb_start_xmit data: 00000000: 20 b0 f7 04 0a 29 20 b0
> > f7 04 0a 26 08 00 45 00   ....) ....&..E.
> > ...
> >
> > Broken setup:
> > After the tx_bd_control = TSTAMP_ALL_FRAMES.
> > Not the first nc packet, but many of the subsequent ones never make it
> > to macb_start_xmit()
> > # nc -l -u -p 9999
> > ...
> > s3
> > s4
> > s5
> > ...
> > Eventually after I send data from the client nc I do see the
> > macb_start_xmit() lines.
>
> Thanks for this debug. If macb_start_xmit is never called, one of
> the preceeding checks (such as if skb is present or if the TX queues
> are off  etc)
> should fail. I'm still tracing this but I'm not sure under what
> circumstances only
> some UDP packets will be prevented from being transmitted.
In this specific test the first tx packets always goes through, and
the subsequent ones don't until rx packets. So it's not random when
they go through, I could have been clearer about that.

> Just to be sure, could you please confirm you are not seeing any
> "buffer exhausted" messaged from TX error tasks?
Correct, I'm not seeing any "buffer exhausted" errors.

thanks,
Paul

[-- Attachment #2: badcapt20190308a.pcap --]
[-- Type: application/vnd.tcpdump.pcap, Size: 695 bytes --]

[-- Attachment #3: goodcapt20190308a.pcap --]
[-- Type: application/vnd.tcpdump.pcap, Size: 313 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Linuxptp-devel] strangeness
  2019-03-08 18:07                   ` Paul Thomas
@ 2019-03-08 21:41                     ` Paul Thomas
  2019-03-09  5:38                       ` Harini Katakam
  0 siblings, 1 reply; 13+ messages in thread
From: Paul Thomas @ 2019-03-08 21:41 UTC (permalink / raw)
  To: Harini Katakam; +Cc: linuxptp-devel, netdev

On Fri, Mar 8, 2019 at 1:07 PM Paul Thomas <pthomas8589@gmail.com> wrote:
>
> Hi Harini,
>
> On Fri, Mar 8, 2019 at 1:08 AM Harini Katakam <harinik@xilinx.com> wrote:
> >
> > Hi Paul,
> > On Fri, Mar 8, 2019 at 12:33 AM Paul Thomas <pthomas8589@gmail.com> wrote:
> > >
> > > On Thu, Mar 7, 2019 at 12:32 AM Harini Katakam <harinik@xilinx.com> wrote:
> > > >
> > > > Hi Paul,
> > > > On Thu, Mar 7, 2019 at 4:38 AM Paul Thomas <pthomas8589@gmail.com> wrote:
> > > > >
> > > > > On Fri, Mar 1, 2019 at 1:24 AM Harini Katakam <harinik@xilinx.com> wrote:
> > > > > >
> > > > > > +netdev
> > > > > >
> > > > > > Hi Paul,
> > > > > > On Fri, Mar 1, 2019 at 12:29 AM Richard Cochran
> > > > > > <richardcochran@gmail.com> wrote:
> > > > > > >
> > > > > > > On Thu, Feb 28, 2019 at 12:33:26PM -0500, Paul Thomas wrote:
> > > > > > > > Yes changing it to TSTAMP_ALL_PTP_FRAMES instead of TSTAMP_ALL_FRAMES
> > > > > > > > does seem to fix the ssh issue. My worry is that there is still a bug
> > > > > > > > somewhere in the network stack that this is just masking.
> > > > > >
> > > > > > Ok thanks.
> > > > > > One place to check in the driver will be:
> > > > > > if (gem_ptp_do_txstamp(queue, skb, desc) == 0) {
> > > > > > /* skb now belongs to timestamp buffer
> > > > > > * and will be removed later
> > > > > > */
> > > > > > tx_skb->skb = NULL;
> > > > > > }
> > > > > > When all TX packets are timestamped, the skb always belongs to the
> > > > > > timestamp buffer.
> > > > > >
> > > > > > >
> > > > > > > Or the HW isn't sending the frames in the first place.
> > > > > > >
> > > > > > > Check that first!
> > > > > >
> > > > > > To check this, the statistics registers in MAC will be one way.
> > > > > > But if there is no TX completion interrupt, then I wouldn't expect
> > > > > > these statistics to increase either. The used bit status in BD dump
> > > > > > might be of more use.
> > > > > >
> > > > > > I will also try to reproduce (with TX timestamp ALL) and see if any of
> > > > > > the above gives some clue.
> > > > > >
> > > > > > Regards,
> > > > > > Harini
> > > > >
> > > > > Hi Harini, any luck looking at this?
> > > >
> > > > I'm sorry, I was not able to debug this further.
> > > >
> > > > >
> > > > > I didn't get very far, even in the "broken" state I see plenty of tx_frames:
> > > > > root@xu5:/opt/linuxptp# ethtool -S eth0
> > > > > NIC statistics:
> > > > >      ...
> > > > >      tx_frames: 39763
> > > > >      ...
> > > > >
> > > > > When you said "registers in the MAC" is ethtool -S displaying that?
> > > >
> > > > Yes, ethtool does display these statistics.
> > > > I was referring to the registers starting offset 0xFF0B0108 (for GEM0) here:
> > > > https://www.xilinx.com/html_docs/registers/ug1087/ug1087-zynq-ultrascale-registers.html
> > > > If you see this value increasing, then the MAC is transmitting successfully.
> > > > Although, I realize it could be other traffic. To see if specific
> > > > packets (for the
> > > > failed SSH connection) are not being queued, a BD dump might help.
> > > >
> > > > Regards,
> > > > Harini
> > >
> > > OK, I think things are becoming more clear. After just doing ioctl(fd,
> > > SIOCSHWTSTAMP, &ifreq) from userspace (tx_bd_control =
> > > TSTAMP_ALL_FRAMES in macb_ptp.c) then with the nc experiment some udp
> > > transmits do not make it to macb_start_xmit() until receive traffic on
> > > the nc connection comes in (one-to-one, one new rx packet means one
> > > old tx packet goes out).
> >
> > Could you please share any wireshark log or dump for what is being
> > received here?
>
> Here are two wireshark captures, the thing to note in the bad one is
> that packets No. 5, 7, 9 from .102 to .103 were actually sent just
> after packet No. 2 but they don't show up on the wire until the
> packets the other way (one for one).
>
> >
> > >
> > > Working setup:
> > > Before the tx_bd_control = TSTAMP_ALL_FRAMES.
> > > Every time I hit "sN Enter" from nc I see a macb_start_xmit
> > > print_hex_dump() and I see the packet on the nc client side:
> > > # nc -l -u -p 9999
> > > ...
> > > s11
> > > [  347.517080] macb_start_xmit data: 00000000: 20 b0 f7 04 0a 29 20 b0
> > > f7 04 0a 26 08 00 45 00   ....) ....&..E.
> > > s12
> > > [  348.964369] macb_start_xmit data: 00000000: 20 b0 f7 04 0a 29 20 b0
> > > f7 04 0a 26 08 00 45 00   ....) ....&..E.
> > > ...
> > >
> > > Broken setup:
> > > After the tx_bd_control = TSTAMP_ALL_FRAMES.
> > > Not the first nc packet, but many of the subsequent ones never make it
> > > to macb_start_xmit()
> > > # nc -l -u -p 9999
> > > ...
> > > s3
> > > s4
> > > s5
> > > ...
> > > Eventually after I send data from the client nc I do see the
> > > macb_start_xmit() lines.
> >
> > Thanks for this debug. If macb_start_xmit is never called, one of
> > the preceeding checks (such as if skb is present or if the TX queues
> > are off  etc)
> > should fail. I'm still tracing this but I'm not sure under what
> > circumstances only
> > some UDP packets will be prevented from being transmitted.
> In this specific test the first tx packets always goes through, and
> the subsequent ones don't until rx packets. So it's not random when
> they go through, I could have been clearer about that.
>
> > Just to be sure, could you please confirm you are not seeing any
> > "buffer exhausted" messaged from TX error tasks?
> Correct, I'm not seeing any "buffer exhausted" errors.
>
> thanks,
> Paul

And one more piece that may be helpful. I think I narrowed down what's
happening in the receive that finally flushes out a pending tx packet.
It seems to be the netif_receive_skb(skb); line in gem_rx() (line
1067). I tested with an mdelay before and after this call:
        mdelay(1000);//mdelay here is slow to flush the pending tx
packet (as seen by nc client)
        netif_receive_skb(skb);
        //mdelay(1000);//mdelay here is fast to flush the pending tx
packet (as seen by nc client)
This seems very strange to me, I quickly glanced at what
netif_receive_skb() is doing and didn't see anything connected with
the TX path, but those are the symptoms.

thanks,
Paul

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Linuxptp-devel] strangeness
  2019-03-08 21:41                     ` Paul Thomas
@ 2019-03-09  5:38                       ` Harini Katakam
  2019-03-09 21:06                         ` Paul Thomas
  0 siblings, 1 reply; 13+ messages in thread
From: Harini Katakam @ 2019-03-09  5:38 UTC (permalink / raw)
  To: Paul Thomas; +Cc: linuxptp-devel, netdev

Hi Paul,
On Sat, Mar 9, 2019 at 3:13 AM Paul Thomas <pthomas8589@gmail.com> wrote:
>
> On Fri, Mar 8, 2019 at 1:07 PM Paul Thomas <pthomas8589@gmail.com> wrote:
> >
> > Hi Harini,
> >
> > On Fri, Mar 8, 2019 at 1:08 AM Harini Katakam <harinik@xilinx.com> wrote:
> > >
> > > Hi Paul,
> > > On Fri, Mar 8, 2019 at 12:33 AM Paul Thomas <pthomas8589@gmail.com> wrote:
> > > >
> > > > On Thu, Mar 7, 2019 at 12:32 AM Harini Katakam <harinik@xilinx.com> wrote:
> > > > >
> > > > > Hi Paul,
<snip>
> > > >
> > > > OK, I think things are becoming more clear. After just doing ioctl(fd,
> > > > SIOCSHWTSTAMP, &ifreq) from userspace (tx_bd_control =
> > > > TSTAMP_ALL_FRAMES in macb_ptp.c) then with the nc experiment some udp
> > > > transmits do not make it to macb_start_xmit() until receive traffic on
> > > > the nc connection comes in (one-to-one, one new rx packet means one
> > > > old tx packet goes out).
> > >
> > > Could you please share any wireshark log or dump for what is being
> > > received here?
> >
> > Here are two wireshark captures, the thing to note in the bad one is
> > that packets No. 5, 7, 9 from .102 to .103 were actually sent just
> > after packet No. 2 but they don't show up on the wire until the
> > packets the other way (one for one).
> >
> > >
> > > >
> > > > Working setup:
> > > > Before the tx_bd_control = TSTAMP_ALL_FRAMES.
> > > > Every time I hit "sN Enter" from nc I see a macb_start_xmit
> > > > print_hex_dump() and I see the packet on the nc client side:
> > > > # nc -l -u -p 9999
> > > > ...
> > > > s11
> > > > [  347.517080] macb_start_xmit data: 00000000: 20 b0 f7 04 0a 29 20 b0
> > > > f7 04 0a 26 08 00 45 00   ....) ....&..E.
> > > > s12
> > > > [  348.964369] macb_start_xmit data: 00000000: 20 b0 f7 04 0a 29 20 b0
> > > > f7 04 0a 26 08 00 45 00   ....) ....&..E.
> > > > ...
> > > >
> > > > Broken setup:
> > > > After the tx_bd_control = TSTAMP_ALL_FRAMES.
> > > > Not the first nc packet, but many of the subsequent ones never make it
> > > > to macb_start_xmit()
> > > > # nc -l -u -p 9999
> > > > ...
> > > > s3
> > > > s4
> > > > s5
> > > > ...
> > > > Eventually after I send data from the client nc I do see the
> > > > macb_start_xmit() lines.
> > >
> > > Thanks for this debug. If macb_start_xmit is never called, one of
> > > the preceeding checks (such as if skb is present or if the TX queues
> > > are off  etc)
> > > should fail. I'm still tracing this but I'm not sure under what
> > > circumstances only
> > > some UDP packets will be prevented from being transmitted.
> > In this specific test the first tx packets always goes through, and
> > the subsequent ones don't until rx packets. So it's not random when
> > they go through, I could have been clearer about that.
> >
> > > Just to be sure, could you please confirm you are not seeing any
> > > "buffer exhausted" messaged from TX error tasks?
> > Correct, I'm not seeing any "buffer exhausted" errors.
> >
> > thanks,
> > Paul
>
> And one more piece that may be helpful. I think I narrowed down what's
> happening in the receive that finally flushes out a pending tx packet.
> It seems to be the netif_receive_skb(skb); line in gem_rx() (line
> 1067). I tested with an mdelay before and after this call:
>         mdelay(1000);//mdelay here is slow to flush the pending tx
> packet (as seen by nc client)
>         netif_receive_skb(skb);
>         //mdelay(1000);//mdelay here is fast to flush the pending tx
> packet (as seen by nc client)
> This seems very strange to me, I quickly glanced at what
> netif_receive_skb() is doing and didn't see anything connected with
> the TX path, but those are the symptoms.

Thanks for the logs and debug.
I'm afraid I can't think of how this receive affects TX path.
Even if the IP somehow has any dependency between TX and RX path
(which is doesn't, on ZynqMP, to my knowledge), it wouldn't explain
why packets dont reach _xmit function at all.
Let me debug a little more.

Regards,
Harini

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Linuxptp-devel] strangeness
  2019-03-09  5:38                       ` Harini Katakam
@ 2019-03-09 21:06                         ` Paul Thomas
  2019-03-12  2:55                           ` Paul Thomas
  0 siblings, 1 reply; 13+ messages in thread
From: Paul Thomas @ 2019-03-09 21:06 UTC (permalink / raw)
  To: Harini Katakam; +Cc: linuxptp-devel, netdev

Hi Harini,

On Sat, Mar 9, 2019 at 12:38 AM Harini Katakam <harinik@xilinx.com> wrote:
>
> Hi Paul,
> On Sat, Mar 9, 2019 at 3:13 AM Paul Thomas <pthomas8589@gmail.com> wrote:
> >
> > On Fri, Mar 8, 2019 at 1:07 PM Paul Thomas <pthomas8589@gmail.com> wrote:
> > >
> > > Hi Harini,
> > >
> > > On Fri, Mar 8, 2019 at 1:08 AM Harini Katakam <harinik@xilinx.com> wrote:
> > > >
> > > > Hi Paul,
> > > > On Fri, Mar 8, 2019 at 12:33 AM Paul Thomas <pthomas8589@gmail.com> wrote:
> > > > >
> > > > > On Thu, Mar 7, 2019 at 12:32 AM Harini Katakam <harinik@xilinx.com> wrote:
> > > > > >
> > > > > > Hi Paul,
> <snip>
> > > > >
> > > > > OK, I think things are becoming more clear. After just doing ioctl(fd,
> > > > > SIOCSHWTSTAMP, &ifreq) from userspace (tx_bd_control =
> > > > > TSTAMP_ALL_FRAMES in macb_ptp.c) then with the nc experiment some udp
> > > > > transmits do not make it to macb_start_xmit() until receive traffic on
> > > > > the nc connection comes in (one-to-one, one new rx packet means one
> > > > > old tx packet goes out).
> > > >
> > > > Could you please share any wireshark log or dump for what is being
> > > > received here?
> > >
> > > Here are two wireshark captures, the thing to note in the bad one is
> > > that packets No. 5, 7, 9 from .102 to .103 were actually sent just
> > > after packet No. 2 but they don't show up on the wire until the
> > > packets the other way (one for one).
> > >
> > > >
> > > > >
> > > > > Working setup:
> > > > > Before the tx_bd_control = TSTAMP_ALL_FRAMES.
> > > > > Every time I hit "sN Enter" from nc I see a macb_start_xmit
> > > > > print_hex_dump() and I see the packet on the nc client side:
> > > > > # nc -l -u -p 9999
> > > > > ...
> > > > > s11
> > > > > [  347.517080] macb_start_xmit data: 00000000: 20 b0 f7 04 0a 29 20 b0
> > > > > f7 04 0a 26 08 00 45 00   ....) ....&..E.
> > > > > s12
> > > > > [  348.964369] macb_start_xmit data: 00000000: 20 b0 f7 04 0a 29 20 b0
> > > > > f7 04 0a 26 08 00 45 00   ....) ....&..E.
> > > > > ...
> > > > >
> > > > > Broken setup:
> > > > > After the tx_bd_control = TSTAMP_ALL_FRAMES.
> > > > > Not the first nc packet, but many of the subsequent ones never make it
> > > > > to macb_start_xmit()
> > > > > # nc -l -u -p 9999
> > > > > ...
> > > > > s3
> > > > > s4
> > > > > s5
> > > > > ...
> > > > > Eventually after I send data from the client nc I do see the
> > > > > macb_start_xmit() lines.
> > > >
> > > > Thanks for this debug. If macb_start_xmit is never called, one of
> > > > the preceeding checks (such as if skb is present or if the TX queues
> > > > are off  etc)
> > > > should fail. I'm still tracing this but I'm not sure under what
> > > > circumstances only
> > > > some UDP packets will be prevented from being transmitted.
> > > In this specific test the first tx packets always goes through, and
> > > the subsequent ones don't until rx packets. So it's not random when
> > > they go through, I could have been clearer about that.
> > >
> > > > Just to be sure, could you please confirm you are not seeing any
> > > > "buffer exhausted" messaged from TX error tasks?
> > > Correct, I'm not seeing any "buffer exhausted" errors.
> > >
> > > thanks,
> > > Paul
> >
> > And one more piece that may be helpful. I think I narrowed down what's
> > happening in the receive that finally flushes out a pending tx packet.
> > It seems to be the netif_receive_skb(skb); line in gem_rx() (line
> > 1067). I tested with an mdelay before and after this call:
> >         mdelay(1000);//mdelay here is slow to flush the pending tx
> > packet (as seen by nc client)
> >         netif_receive_skb(skb);
> >         //mdelay(1000);//mdelay here is fast to flush the pending tx
> > packet (as seen by nc client)
> > This seems very strange to me, I quickly glanced at what
> > netif_receive_skb() is doing and didn't see anything connected with
> > the TX path, but those are the symptoms.
>
> Thanks for the logs and debug.
> I'm afraid I can't think of how this receive affects TX path.
> Even if the IP somehow has any dependency between TX and RX path
> (which is doesn't, on ZynqMP, to my knowledge), it wouldn't explain
> why packets dont reach _xmit function at all.
> Let me debug a little more.
I looked into this a little bit more and maybe see something. Just to
get started I did a backtrace from macb_start_xmit(), which shows the
following (I'm sure sure this code path is is clear to lots of folks,
but it was helpful for me):
[  106.758748] dump_backtrace
(/home/share/kernels/linux-5.0-rc8/arch/arm64/kernel/traps.c:103)
[  106.762400] show_stack
(/home/share/kernels/linux-5.0-rc8/arch/arm64/kernel/traps.c:155)
[  106.765700] dump_stack
(/home/share/kernels/linux-5.0-rc8/lib/dump_stack.c:115)
[  106.769006] macb_start_xmit
(/home/share/kernels/linux-5.0-rc8/drivers/net/ethernet/cadence/macb_main.c:1809)
[  106.772912] dev_hard_start_xmit
(/home/share/kernels/linux-5.0-rc8/./include/linux/netdevice.h:4395
/home/share/kernels/linux-5.0-rc8/net/core/dev.c:3278
/home/share/kernels/linux-5.0-rc8/net/core/dev.c:3294)
[  106.777078] sch_direct_xmit
(/home/share/kernels/linux-5.0-rc8/net/sched/sch_generic.c:327)
[  106.780984] __qdisc_run
(/home/share/kernels/linux-5.0-rc8/net/sched/sch_generic.c:398)
[  106.784543] __dev_queue_xmit
(/home/share/kernels/linux-5.0-rc8/./include/linux/seqlock.h:234
/home/share/kernels/linux-5.0-rc8/./include/linux/seqlock.h:388
/home/share/kernels/linux-5.0-rc8/./include/net/sch_generic.h:160
/home/share/kernels/linux-5.0-rc8/./include/net/pkt_sched.h:122
/home/share/kernels/linux-5.0-rc8/net/core/dev.c:3473
/home/share/kernels/linux-5.0-rc8/net/core/dev.c:3832)
[  106.788535] dev_queue_xmit
(/home/share/kernels/linux-5.0-rc8/net/core/dev.c:3898)
[  106.792182] ip_finish_output2
(/home/share/kernels/linux-5.0-rc8/./include/net/neighbour.h:498
/home/share/kernels/linux-5.0-rc8/./include/net/neighbour.h:506
/home/share/kernels/linux-5.0-rc8/net/ipv4/ip_output.c:229)
[  106.796261] ip_finish_output
(/home/share/kernels/linux-5.0-rc8/net/ipv4/ip_output.c:317)
[  106.800253] ip_output
(/home/share/kernels/linux-5.0-rc8/net/ipv4/ip_output.c:409)
[  106.803639] ip_local_out
(/home/share/kernels/linux-5.0-rc8/net/ipv4/ip_output.c:127)
[  106.807110] ip_send_skb
(/home/share/kernels/linux-5.0-rc8/net/ipv4/ip_output.c:1466)
[  106.810496] udp_send_skb.isra.12
(/home/share/kernels/linux-5.0-rc8/net/ipv4/udp.c:902)
[  106.814836] udp_sendmsg
(/home/share/kernels/linux-5.0-rc8/net/ipv4/udp.c:1188)
[  106.818395] inet_sendmsg
(/home/share/kernels/linux-5.0-rc8/net/ipv4/af_inet.c:799)
[  106.821868] sock_sendmsg
(/home/share/kernels/linux-5.0-rc8/net/socket.c:622
/home/share/kernels/linux-5.0-rc8/net/socket.c:631)
[  106.825348] sock_write_iter
(/home/share/kernels/linux-5.0-rc8/net/socket.c:901)
[  106.829080] __vfs_write
(/home/share/kernels/linux-5.0-rc8/fs/read_write.c:475
/home/share/kernels/linux-5.0-rc8/fs/read_write.c:487)
[  106.832638] vfs_write (/home/share/kernels/linux-5.0-rc8/fs/read_write.c:549)
[  106.835937] ksys_write
(/home/share/kernels/linux-5.0-rc8/fs/read_write.c:598)
[  106.839236] __arm64_sys_write
(/home/share/kernels/linux-5.0-rc8/fs/read_write.c:607)
[  106.843143] el0_svc_common
(/home/share/kernels/linux-5.0-rc8/arch/arm64/kernel/syscall.c:52
/home/share/kernels/linux-5.0-rc8/arch/arm64/kernel/syscall.c:83)
[  106.846875] el0_svc_handler
(/home/share/kernels/linux-5.0-rc8/arch/arm64/kernel/syscall.c:130)
[  106.850607] el0_svc
(/home/share/kernels/linux-5.0-rc8/arch/arm64/kernel/entry.S:915)
I thought I could put some prints in to see where things are getting
hung up, but I traced this back all the way to sock_write_iter() and
this doesn't happen until right when the packets go out. So next I
turned to strace to actually see what nc was doing. In the normal
condition nc has a pselect6() that it is waiting on for the reads, but
in the broken case it seems to get through the pselect6() and is
blocking on the read(). Hopefully, for someone who understands how
pselect is handled from the networking stack and driver, this will be
very clear what is going on and what needs to be done to fix this.

-Paul

>
> Regards,
> Harini

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Linuxptp-devel] strangeness
  2019-03-09 21:06                         ` Paul Thomas
@ 2019-03-12  2:55                           ` Paul Thomas
  2019-03-12 10:10                             ` Harini Katakam
  0 siblings, 1 reply; 13+ messages in thread
From: Paul Thomas @ 2019-03-12  2:55 UTC (permalink / raw)
  To: Harini Katakam; +Cc: linuxptp-devel, netdev

Hi All,

Let me do a quick clean recap of this issue.

On a Debian arm64 system with a 5.0rc8 kernel using the macb driver on
zynqmp, enabling tx timestamping (1) breaks networking! The first and
most noticeable way is that you can no longer connect with ssh. This
is a serious bug somewhere and merits some attention.

Trying to debug ssh is a possibility, but I was trying to debug with
something easier and thus the netcat testing. The specific issue can
be seen in the following strace. In this setup nc just connects to a
server and tries to send two packets (2). The first packet goes
through fine, but the second doesn't because nc is stuck forever
trying to read from the socket.
pselect6(4, [0 3], NULL, NULL, NULL, NULL) = 1 (in [0]) <-- waiting on
stdin and UDP sock
read(0, "c1\n", 8192) = 3 <-- read three chars from stdin
write(3, "c1\n", 3) = 3 <-- write those out on the UDP sock
pselect6(4, [0 3], NULL, NULL, NULL, NULL) = 1 (in [3])  <-- waiting
on stdin and UDP sock
read(3, <-- waits forever here as there is no data to read

I've been reading more, an old patch and the timestamping.txt doc
helped me understand a little more of what's going on:
https://lore.kernel.org/netdev/20130328211925.7644.15781.stgit@jekeller-hub.jf.intel.com/
https://www.kernel.org/doc/Documentation/networking/timestamping.txt

So it is clear that if the SO_SELECT_ERR_QUEUE flag is set then in
fact the select should return, but it is not set in this case. I can
see everything that is going on in datagram_poll() in datagram.c. The
main difference being that in the broken case the mask is 0x30c and in
the working case it is 0x304. The difference is EPOLLERR, which is
there clearly in the code if !skb_queue_empty(&sk->sk_error_queue).

Then in select.c POLLIN_SET includes EPOLLERR. It almost looks as if
it's behaving as it should (except that things break). My first
question is should the sk_error_queue be empty if there is a tx
timestamp available (in datagram_poll() in datagram.c)? If it's not
empty I don't see what else SO_SELECT_ERR_QUEUE flag is doing for the
select() and I don't see what would be different about the macb/arm64
setup?

Any insight here would be very much appreciated.

thanks,
Paul

(1) hwstamp_ctl -i eth0 -t 1

(2) The actual script to be able to run nc and strace from a single
serial console is slightly clever:
(sleep 3; echo "c1"; sleep 1; echo "c2") | nc -u 10.1.155.100 9999 &
strace -p $(ps -A | grep nc | awk '{print $1}')

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Linuxptp-devel] strangeness
  2019-03-12  2:55                           ` Paul Thomas
@ 2019-03-12 10:10                             ` Harini Katakam
       [not found]                               ` <02874ECE860811409154E81DA85FBB5892324F9C@ORSMSX121.amr.corp.intel.com>
  0 siblings, 1 reply; 13+ messages in thread
From: Harini Katakam @ 2019-03-12 10:10 UTC (permalink / raw)
  To: Paul Thomas; +Cc: linuxptp-devel, netdev

Hi Paul,
On Tue, Mar 12, 2019 at 8:26 AM Paul Thomas <pthomas8589@gmail.com> wrote:
>
> Hi All,
>
> Let me do a quick clean recap of this issue.
>
> On a Debian arm64 system with a 5.0rc8 kernel using the macb driver on
> zynqmp, enabling tx timestamping (1) breaks networking! The first and
> most noticeable way is that you can no longer connect with ssh. This
> is a serious bug somewhere and merits some attention.
>
> Trying to debug ssh is a possibility, but I was trying to debug with
> something easier and thus the netcat testing. The specific issue can
> be seen in the following strace. In this setup nc just connects to a
> server and tries to send two packets (2). The first packet goes
> through fine, but the second doesn't because nc is stuck forever
> trying to read from the socket.
> pselect6(4, [0 3], NULL, NULL, NULL, NULL) = 1 (in [0]) <-- waiting on
> stdin and UDP sock
> read(0, "c1\n", 8192) = 3 <-- read three chars from stdin
> write(3, "c1\n", 3) = 3 <-- write those out on the UDP sock
> pselect6(4, [0 3], NULL, NULL, NULL, NULL) = 1 (in [3])  <-- waiting
> on stdin and UDP sock
> read(3, <-- waits forever here as there is no data to read
>
> I've been reading more, an old patch and the timestamping.txt doc
> helped me understand a little more of what's going on:
> https://lore.kernel.org/netdev/20130328211925.7644.15781.stgit@jekeller-hub.jf.intel.com/
> https://www.kernel.org/doc/Documentation/networking/timestamping.txt
>
> So it is clear that if the SO_SELECT_ERR_QUEUE flag is set then in
> fact the select should return, but it is not set in this case. I can
> see everything that is going on in datagram_poll() in datagram.c. The
> main difference being that in the broken case the mask is 0x30c and in
> the working case it is 0x304. The difference is EPOLLERR, which is
> there clearly in the code if !skb_queue_empty(&sk->sk_error_queue).
>
> Then in select.c POLLIN_SET includes EPOLLERR. It almost looks as if
> it's behaving as it should (except that things break). My first
> question is should the sk_error_queue be empty if there is a tx
> timestamp available (in datagram_poll() in datagram.c)? If it's not
> empty I don't see what else SO_SELECT_ERR_QUEUE flag is doing for the
> select() and I don't see what would be different about the macb/arm64
> setup?

Thanks for the summary.
I think sk_error_queue should be empty because packets are queued to
that via skb_complete_timestamp (sock_queue_err_skb) and this should
not be called in this flow. I'm sorry if I'm missing something - I'll let others
from netdev comment.
I'm not sure why EPOLLERR in being set in this case.

Regards,
Harini

>
> Any insight here would be very much appreciated.
>
> thanks,
> Paul
>
> (1) hwstamp_ctl -i eth0 -t 1
>
> (2) The actual script to be able to run nc and strace from a single
> serial console is slightly clever:
> (sleep 3; echo "c1"; sleep 1; echo "c2") | nc -u 10.1.155.100 9999 &
> strace -p $(ps -A | grep nc | awk '{print $1}')

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Linuxptp-devel] strangeness
       [not found]                                       ` <02874ECE860811409154E81DA85FBB58923251BB@ORSMSX121.amr.corp.intel.com>
@ 2019-03-12 20:00                                         ` Paul Thomas
  2019-03-12 21:40                                           ` Keller, Jacob E
  0 siblings, 1 reply; 13+ messages in thread
From: Paul Thomas @ 2019-03-12 20:00 UTC (permalink / raw)
  To: Keller, Jacob E, netdev, Harini Katakam; +Cc: linuxptp-devel

On Tue, Mar 12, 2019 at 3:22 PM Keller, Jacob E
<jacob.e.keller@intel.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Paul Thomas [mailto:pthomas8589@gmail.com]
> > Sent: Tuesday, March 12, 2019 10:50 AM
> > To: Keller, Jacob E <jacob.e.keller@intel.com>
> > Cc: linuxptp-devel@lists.sourceforge.net
> > Subject: Re: [Linuxptp-devel] strangeness
> >
> > On Tue, Mar 12, 2019 at 1:11 PM Keller, Jacob E
> > <jacob.e.keller@intel.com> wrote:
> > >
> > > > -----Original Message-----
> > > > From: Paul Thomas [mailto:pthomas8589@gmail.com]
> > > > Sent: Tuesday, March 12, 2019 10:00 AM
> > > > To: Keller, Jacob E <jacob.e.keller@intel.com>
> > > > Subject: Re: [Linuxptp-devel] strangeness
> > > >
> > > > On Tue, Mar 12, 2019 at 12:26 PM Keller, Jacob E
> > > > <jacob.e.keller@intel.com> wrote:
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org]
> > On
> > > > > > Behalf Of Harini Katakam
> > > > > > Sent: Tuesday, March 12, 2019 3:10 AM
> > > > > > To: Paul Thomas <pthomas8589@gmail.com>
> > > > > > Cc: linuxptp-devel@lists.sourceforge.net; netdev@vger.kernel.org
> > > > > > Subject: Re: [Linuxptp-devel] strangeness
> > > > > >
> > > > > > Hi Paul,
> > > > > > On Tue, Mar 12, 2019 at 8:26 AM Paul Thomas <pthomas8589@gmail.com>
> > > > wrote:
> > > > > > >
> > > > > > > Hi All,
> > > > > > >
> > > > > > > Let me do a quick clean recap of this issue.
> > > > > > >
> > > > > > > On a Debian arm64 system with a 5.0rc8 kernel using the macb driver on
> > > > > > > zynqmp, enabling tx timestamping (1) breaks networking! The first and
> > > > > > > most noticeable way is that you can no longer connect with ssh. This
> > > > > > > is a serious bug somewhere and merits some attention.
> > > > > > >
> > > > > > > Trying to debug ssh is a possibility, but I was trying to debug with
> > > > > > > something easier and thus the netcat testing. The specific issue can
> > > > > > > be seen in the following strace. In this setup nc just connects to a
> > > > > > > server and tries to send two packets (2). The first packet goes
> > > > > > > through fine, but the second doesn't because nc is stuck forever
> > > > > > > trying to read from the socket.
> > > > > > > pselect6(4, [0 3], NULL, NULL, NULL, NULL) = 1 (in [0]) <-- waiting on
> > > > > > > stdin and UDP sock
> > > > > > > read(0, "c1\n", 8192) = 3 <-- read three chars from stdin
> > > > > > > write(3, "c1\n", 3) = 3 <-- write those out on the UDP sock
> > > > > > > pselect6(4, [0 3], NULL, NULL, NULL, NULL) = 1 (in [3])  <-- waiting
> > > > > > > on stdin and UDP sock
> > > > > > > read(3, <-- waits forever here as there is no data to read
> > > > > > >
> > > > > > > I've been reading more, an old patch and the timestamping.txt doc
> > > > > > > helped me understand a little more of what's going on:
> > > > > > > https://lore.kernel.org/netdev/20130328211925.7644.15781.stgit@jekeller-
> > > > > > hub.jf.intel.com/
> > > > > > > https://www.kernel.org/doc/Documentation/networking/timestamping.txt
> > > > > > >
> > > > > > > So it is clear that if the SO_SELECT_ERR_QUEUE flag is set then in
> > > > > > > fact the select should return, but it is not set in this case. I can
> > > > > > > see everything that is going on in datagram_poll() in datagram.c. The
> > > > > > > main difference being that in the broken case the mask is 0x30c and in
> > > > > > > the working case it is 0x304. The difference is EPOLLERR, which is
> > > > > > > there clearly in the code if !skb_queue_empty(&sk->sk_error_queue).
> > > > > > >
> > > > > > > Then in select.c POLLIN_SET includes EPOLLERR. It almost looks as if
> > > > > > > it's behaving as it should (except that things break). My first
> > > > > > > question is should the sk_error_queue be empty if there is a tx
> > > > > > > timestamp available (in datagram_poll() in datagram.c)? If it's not
> > > > > > > empty I don't see what else SO_SELECT_ERR_QUEUE flag is doing for the
> > > > > > > select() and I don't see what would be different about the macb/arm64
> > > > > > > setup?
> > > > > >
> > > > > > Thanks for the summary.
> > > > > > I think sk_error_queue should be empty because packets are queued to
> > > > > > that via skb_complete_timestamp (sock_queue_err_skb) and this should
> > > > > > not be called in this flow. I'm sorry if I'm missing something - I'll let others
> > > > > > from netdev comment.
> > > > > > I'm not sure why EPOLLERR in being set in this case.
> > > > > >
> > > > >
> > > > > I believe at least historically that the Tx timestamps were notified to the
> > > > applications using the error queue.
> > > > The documentation (Documentation/networking/timestamping.txt) says
> > > > "recvmsg() with flag MSG_ERRQUEUE", so it's returned to userspace
> > > > through the error queue, but at datagram_poll() in datagram.c:
> > > > /* exceptional events? */
> > > > if (sk->sk_err || !skb_queue_empty(&sk->sk_error_queue))
> > > >
> > > > Should this already be non-empty and EPOLLERR be returned as part of
> > > > mask? Or should _only_ EPOLLPRI be returned when SOCK_SELECT_ERR_QUEUE
> > > > is true? Surly someone on this list knows what should be happening? It
> > > > will help direct where we are looking. Should we be looking at the
> > > > queues? Or should be looking at do_select() in fs/select.c or
> > > > somewhere else?
> > > >
> > >
> >
> > HI Jake,
> > Thanks for looking at this.
> >
> > > It's been quite some time since I looked at this. Hmm. Can you reproduce this with
> > another board? I haven't seen the issue before.
> > >
> > > I remember adding SO_SELECT_ERR_QUEUE as an option, and it made the socket
> > wake up when a Tx timestamp occurred. I believe this was eventually fixed in a more
> > direct way, and now the socket option doesn't really do much.
> > >
> > > See Linux commit 7d4c04fc1700 ("net: add option to enable error queue packets
> > waking select", 2013-03-31)
> > That looks like your netdev patch, and the text makes sense.
> > >
> > > Basically, the option now just adds POLLPRI along with POLLERR when the Tx
> > timestamp is ready, and the socket will wake up when there is a Timestamp.
> > >
> > > This was eventually fixed so that you could select on just POLLERR, and have it
> > actually wake up on the error messages, and the option was no longer necessary.
> > Ah, so is selecting on POLLERR a seperate flag?
> >
> > >
> > > That's what I recall about SO_SELECT_ERR_QUEUE at least.
> > >
> > > So.. to clarify the behavior you're seeing is that the socket wakes up on pselect, and
> > then you go to read it, but it never returns because there's no data available? And the
> > socket woke up because there was an error, but the error queue is also empty?
> > The tx timestams are enabled globally across the system with
> > hwstamp_ctl, there is just one option on or off. The issue seems to be
> > with applications that are not expecting select to return on POLLERR.
> > This certainly the case with netcat, and I suspect it is a similar
> > thing that breaks ssh. I haven't checked if there is actually a tx
> > timestamp in the error queue for the nc case (probably is). But maybe
> > this points us in the right direction for a specific socket if the
> > SOF_TIMESTAMPING_TX_HARDWARE option hasn't been requested should the
> > timestamps be pushed to the error queue in the first place?
> >
> > thanks,
> > Paul
>
> Hmm. I haven't been able to reproduce this using other device drivers, so I wonder if the specific device driver is doing something weird and indicating that it can Tx timestamp every packet, in its hard_xmit routine... But it should only be doing that for packets which are requested to timestamp from the socket.

Yes! I think this could be it. I just submitted this patch so that we
can talk about it:
https://lore.kernel.org/netdev/20190312195053.21741-1-pthomas8589@gmail.com/

This fixes the issue I'm having with nc and ssh.

-Paul

>
> Thanks,
> Jake
>
> >
> > >
> > > Thanks,
> > > Jake
> > >
> > > > thanks,
> > > > Paul
> > > >
> > > > > Thanks,
> > > > > Jake

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: [Linuxptp-devel] strangeness
  2019-03-12 20:00                                         ` Paul Thomas
@ 2019-03-12 21:40                                           ` Keller, Jacob E
  0 siblings, 0 replies; 13+ messages in thread
From: Keller, Jacob E @ 2019-03-12 21:40 UTC (permalink / raw)
  To: Paul Thomas, netdev, Harini Katakam; +Cc: linuxptp-devel



> -----Original Message-----
> From: Paul Thomas [mailto:pthomas8589@gmail.com]
> Sent: Tuesday, March 12, 2019 1:00 PM
> To: Keller, Jacob E <jacob.e.keller@intel.com>; netdev@vger.kernel.org; Harini
> Katakam <harinik@xilinx.com>
> Cc: linuxptp-devel@lists.sourceforge.net
> Subject: Re: [Linuxptp-devel] strangeness
> 
> On Tue, Mar 12, 2019 at 3:22 PM Keller, Jacob E
> <jacob.e.keller@intel.com> wrote:
> >
> >
> > Hmm. I haven't been able to reproduce this using other device drivers, so I wonder
> if the specific device driver is doing something weird and indicating that it can Tx
> timestamp every packet, in its hard_xmit routine... But it should only be doing that for
> packets which are requested to timestamp from the socket.
> 
> Yes! I think this could be it. I just submitted this patch so that we
> can talk about it:
> https://lore.kernel.org/netdev/20190312195053.21741-1-pthomas8589@gmail.com/
> 
> This fixes the issue I'm having with nc and ssh.
> 
> -Paul
> 

The patch looks good. I think it's correct, since only sockets which expect to get Tx timestamps should be waiting and checking for them. Other sockets won't be checking, but also won't get betting timestamps.

I'm not sure if there is some improvement we could do in the future for if a Tx timestamp is sent for a packet that didn't request it.

Maybe others know more about that flow?

Thanks,
Jake

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2019-03-12 21:41 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CAD56B7dA6jwPuTLjdRWJJhMpsbQBS8Bux3fo+e0ux5RZFGoHnQ@mail.gmail.com>
     [not found] ` <DM6PR02MB521169FFDB2902CA0AF7C27EC9750@DM6PR02MB5211.namprd02.prod.outlook.com>
     [not found]   ` <CAD56B7dOJfL9Y9sgTjDrHQXajDsjWAuF+fycvFL6ib3WX=-VDw@mail.gmail.com>
     [not found]     ` <CAD56B7eytgBherR7FNus1ZZ295c3+hMNp+KJCxBf71EVez998A@mail.gmail.com>
     [not found]       ` <20190228185758.6f37i2sfnr2mdrce@localhost>
2019-03-01  6:24         ` [Linuxptp-devel] strangeness Harini Katakam
2019-03-06 23:06           ` Paul Thomas
2019-03-07  5:32             ` Harini Katakam
2019-03-07 19:02               ` Paul Thomas
2019-03-08  6:08                 ` Harini Katakam
2019-03-08 18:07                   ` Paul Thomas
2019-03-08 21:41                     ` Paul Thomas
2019-03-09  5:38                       ` Harini Katakam
2019-03-09 21:06                         ` Paul Thomas
2019-03-12  2:55                           ` Paul Thomas
2019-03-12 10:10                             ` Harini Katakam
     [not found]                               ` <02874ECE860811409154E81DA85FBB5892324F9C@ORSMSX121.amr.corp.intel.com>
     [not found]                                 ` <CAD56B7cgf8zhBdPXV1wLQEMUaO1ZYR1iK7w5hY7zSfcty17Czg@mail.gmail.com>
     [not found]                                   ` <02874ECE860811409154E81DA85FBB5892325034@ORSMSX121.amr.corp.intel.com>
     [not found]                                     ` <CAD56B7cMYZP9vHijLQe0x=2mwnCFuromp4aAJf=np+FHEc=gZA@mail.gmail.com>
     [not found]                                       ` <02874ECE860811409154E81DA85FBB58923251BB@ORSMSX121.amr.corp.intel.com>
2019-03-12 20:00                                         ` Paul Thomas
2019-03-12 21:40                                           ` Keller, Jacob E

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.