* SCTP abort with T-bit set after handshake
@ 2018-03-16 9:33 David Neil
2018-03-16 15:14 ` Marcelo Ricardo Leitner
` (12 more replies)
0 siblings, 13 replies; 14+ messages in thread
From: David Neil @ 2018-03-16 9:33 UTC (permalink / raw)
To: linux-sctp
Hi,
I have been having intermittent problems with SCTP connection failures; I have finally caught one on Wireshark.
From the application code, I see task-A opening an SCTP connection to task-B and messages being exchanged in both directions. At some subsequent point task-A attempts to send a message and receives an SCTP_COMM_LOST followed by an SCTP_SEND_FAILED. Nothing is seen at task-B during this attempt to send.
When viewing the same event in wirehark, we see a successful handshake and an exchange of one data chunk in each direction during connection setup. Afew milliseconds later, task-A sends an SCTP message and receives an ABORT chunk from the other end with the T-bit set. In the seconds that follow task-B continues to send heartbeats for the association until finally the association times out at task-B.
Task A Task B
--SCTP-INIT-------------------------->
<---------------------------INIT_ACK--
--COOKIE-ECHO + Diameter msg--------->
<------------------COOKIE-ACK + SACK--
<-----------------------Diameter msg--
--SACK------------------------------->
--Diameter msg----------------------->
<---------------ABORT with T-bit set--
<--------------------------Heartbeat--
<--------------------------Heartbeat--
<--------------------------Heartbeat--
<--------------------------Heartbeat--
<----ABORT assoc max retransmissions--
I have put more detailed textual output from Wireshark at the bottom of the post.
If I understand the SCTP specs correctly, an ABORT with the T-bit set indicates that the receiver of the message did not recognize the verification tag. Any ideas how this can happen on an apparently healthy association?
I am running lksctp 1.0.17 on Debian 9, but I have also seen what I believe is the same problem on lksctp 1.0.16 on Debian 8.
The application code is made up of multiple tasks, each one running in its own thread and using SCTP to communicate with other tasks. The tasks all use one-to-many style SCTP sockets.
The failures occur while running an automated test suite. Each testcase instantiates an appropriate set of tasks (up to 50 or so), tests a sequence of events and then shuts down the tasks.
I am typically seeing one test failure every couple of hours which can be traced back to this problem.
Do you need any more information?
Regards,
Dave.
Frame 620: 84 bytes on wire (672 bits), 84 bytes captured (672 bits)
Linux cooked capture
Internet Protocol Version 4, Src: 127.0.0.12, Dst: 127.0.0.1
Stream Control Transmission Protocol, Src Port: 3868 (3868), Dst Port: 3868 (3868)
Source port: 3868
Destination port: 3868
Verification tag: 0x00000000
[Association index: 45]
Checksum: 0x00000000 [unverified]
[Checksum Status: Unverified]
INIT chunk (Outbound streams: 2, inbound streams: 32)
Chunk type: INIT (1)
Chunk flags: 0x00
Chunk length: 36
Initiate tag: 0xd2059903
Advertised receiver window credit (a_rwnd): 106496
Number of outbound streams: 2
Number of inbound streams: 32
Initial TSN: 4239369224
Supported address types parameter (Supported types: IPv4)
ECN parameter
Forward TSN supported parameter
Frame 621: 308 bytes on wire (2464 bits), 308 bytes captured (2464 bits)
Linux cooked capture
Internet Protocol Version 4, Src: 127.0.0.1, Dst: 127.0.0.12
Stream Control Transmission Protocol, Src Port: 3868 (3868), Dst Port: 3868 (3868)
Source port: 3868
Destination port: 3868
Verification tag: 0xd2059903
[Association index: 45]
Checksum: 0x00000000 [unverified]
[Checksum Status: Unverified]
INIT_ACK chunk (Outbound streams: 2, inbound streams: 2)
Chunk type: INIT_ACK (2)
Chunk flags: 0x00
Chunk length: 260
Initiate tag: 0x574885aa
Advertised receiver window credit (a_rwnd): 106496
Number of outbound streams: 2
Number of inbound streams: 2
Initial TSN: 3027575157
State cookie parameter (Cookie length: 228 bytes)
ECN parameter
Forward TSN supported parameter
Frame 622: 440 bytes on wire (3520 bits), 440 bytes captured (3520 bits)
Linux cooked capture
Internet Protocol Version 4, Src: 127.0.0.12, Dst: 127.0.0.1
Stream Control Transmission Protocol, Src Port: 3868 (3868), Dst Port: 3868 (3868)
Source port: 3868
Destination port: 3868
Verification tag: 0x574885aa
[Association index: 45]
Checksum: 0x00000000 [unverified]
[Checksum Status: Unverified]
COOKIE_ECHO chunk (Cookie length: 228 bytes)
Chunk type: COOKIE_ECHO (10)
Chunk flags: 0x00
Chunk length: 232
Cookie: 9a8b766df4a9cb098575172259c06bc90000000000000000...
DATA chunk(ordered, complete segment, TSN: 4239369224, SID: 0, SSN: 0, PPID: 46, payload length: 144 bytes)
Chunk type: DATA (0)
Chunk flags: 0x03
Chunk length: 160
Transmission sequence number: 4239369224
Stream identifier: 0x0000
Stream sequence number: 0
Payload protocol identifier: DIAMETER (46)
Diameter Protocol
Frame 623: 68 bytes on wire (544 bits), 68 bytes captured (544 bits)
Linux cooked capture
Internet Protocol Version 4, Src: 127.0.0.1, Dst: 127.0.0.12
Stream Control Transmission Protocol, Src Port: 3868 (3868), Dst Port: 3868 (3868)
Source port: 3868
Destination port: 3868
Verification tag: 0xd2059903
[Association index: 45]
Checksum: 0x00000000 [unverified]
[Checksum Status: Unverified]
COOKIE_ACK chunk
Chunk type: COOKIE_ACK (11)
Chunk flags: 0x00
Chunk length: 4
SACK chunk (Cumulative TSN: 4239369224, a_rwnd: 106352, gaps: 0, duplicate TSNs: 0)
Chunk type: SACK (3)
Chunk flags: 0x00
Chunk length: 16
Cumulative TSN ACK: 4239369224
Advertised receiver window credit (a_rwnd): 106352
Number of gap acknowledgement blocks: 0
Number of duplicated TSNs: 0
Frame 624: 216 bytes on wire (1728 bits), 216 bytes captured (1728 bits)
Linux cooked capture
Internet Protocol Version 4, Src: 127.0.0.1, Dst: 127.0.0.12
Stream Control Transmission Protocol, Src Port: 3868 (3868), Dst Port: 3868 (3868)
Source port: 3868
Destination port: 3868
Verification tag: 0xd2059903
[Association index: 45]
Checksum: 0x00000000 [unverified]
[Checksum Status: Unverified]
DATA chunk(ordered, complete segment, TSN: 3027575157, SID: 0, SSN: 0, PPID: 46, payload length: 152 bytes)
Chunk type: DATA (0)
Chunk flags: 0x03
Chunk length: 168
Transmission sequence number: 3027575157
Stream identifier: 0x0000
Stream sequence number: 0
Payload protocol identifier: DIAMETER (46)
Diameter Protocol
Frame 625: 64 bytes on wire (512 bits), 64 bytes captured (512 bits)
Linux cooked capture
Internet Protocol Version 4, Src: 127.0.0.12, Dst: 127.0.0.1
Stream Control Transmission Protocol, Src Port: 3868 (3868), Dst Port: 3868 (3868)
Source port: 3868
Destination port: 3868
Verification tag: 0x574885aa
[Association index: 45]
Checksum: 0x00000000 [unverified]
[Checksum Status: Unverified]
SACK chunk (Cumulative TSN: 3027575157, a_rwnd: 106344, gaps: 0, duplicate TSNs: 0)
Chunk type: SACK (3)
Chunk flags: 0x00
Chunk length: 16
Cumulative TSN ACK: 3027575157
Advertised receiver window credit (a_rwnd): 106344
Number of gap acknowledgement blocks: 0
Number of duplicated TSNs: 0
Frame 796: 288 bytes on wire (2304 bits), 288 bytes captured (2304 bits)
Linux cooked capture
Internet Protocol Version 4, Src: 127.0.0.12, Dst: 127.0.0.1
Stream Control Transmission Protocol, Src Port: 3868 (3868), Dst Port: 3868 (3868)
Source port: 3868
Destination port: 3868
Verification tag: 0x574885aa
[Association index: 45]
Checksum: 0x00000000 [unverified]
[Checksum Status: Unverified]
DATA chunk(ordered, complete segment, TSN: 4239369225, SID: 0, SSN: 1, PPID: 46, payload length: 224 bytes)
Chunk type: DATA (0)
Chunk flags: 0x03
Chunk length: 240
Transmission sequence number: 4239369225
Stream identifier: 0x0000
Stream sequence number: 1
Payload protocol identifier: DIAMETER (46)
Diameter Protocol
Frame 797: 52 bytes on wire (416 bits), 52 bytes captured (416 bits)
Linux cooked capture
Internet Protocol Version 4, Src: 127.0.0.1, Dst: 127.0.0.12
Stream Control Transmission Protocol, Src Port: 3868 (3868), Dst Port: 3868 (3868)
Source port: 3868
Destination port: 3868
Verification tag: 0x574885aa
[Association index: 45]
Checksum: 0x00000000 [unverified]
[Checksum Status: Unverified]
ABORT chunk
Chunk type: ABORT (6)
Chunk flags: 0x01
.... ...1 = T-Bit: Tag reflected
Chunk length: 4
Frame 1178: 100 bytes on wire (800 bits), 100 bytes captured (800 bits)
Linux cooked capture
Internet Protocol Version 4, Src: 127.0.0.1, Dst: 127.0.0.12
Stream Control Transmission Protocol, Src Port: 3868 (3868), Dst Port: 3868 (3868)
Source port: 3868
Destination port: 3868
Verification tag: 0xd2059903
[Association index: 45]
Checksum: 0x00000000 [unverified]
[Checksum Status: Unverified]
HEARTBEAT chunk (Information: 48 bytes)
Chunk type: HEARTBEAT (4)
Chunk flags: 0x00
Chunk length: 52
Heartbeat info parameter (Information: 44 bytes)
Frame 1492: 100 bytes on wire (800 bits), 100 bytes captured (800 bits)
Linux cooked capture
Internet Protocol Version 4, Src: 127.0.0.1, Dst: 127.0.0.12
Stream Control Transmission Protocol, Src Port: 3868 (3868), Dst Port: 3868 (3868)
Source port: 3868
Destination port: 3868
Verification tag: 0xd2059903
[Association index: 45]
Checksum: 0x00000000 [unverified]
[Checksum Status: Unverified]
HEARTBEAT chunk (Information: 48 bytes)
Chunk type: HEARTBEAT (4)
Chunk flags: 0x00
Chunk length: 52
Heartbeat info parameter (Information: 44 bytes)
Frame 1565: 100 bytes on wire (800 bits), 100 bytes captured (800 bits)
Linux cooked capture
Internet Protocol Version 4, Src: 127.0.0.1, Dst: 127.0.0.12
Stream Control Transmission Protocol, Src Port: 3868 (3868), Dst Port: 3868 (3868)
Source port: 3868
Destination port: 3868
Verification tag: 0xd2059903
[Association index: 45]
Checksum: 0x00000000 [unverified]
[Checksum Status: Unverified]
HEARTBEAT chunk (Information: 48 bytes)
Chunk type: HEARTBEAT (4)
Chunk flags: 0x00
Chunk length: 52
Heartbeat info parameter (Information: 44 bytes)
Frame 1960: 100 bytes on wire (800 bits), 100 bytes captured (800 bits)
Linux cooked capture
Internet Protocol Version 4, Src: 127.0.0.1, Dst: 127.0.0.12
Stream Control Transmission Protocol, Src Port: 3868 (3868), Dst Port: 3868 (3868)
Source port: 3868
Destination port: 3868
Verification tag: 0xd2059903
[Association index: 45]
Checksum: 0x00000000 [unverified]
[Checksum Status: Unverified]
HEARTBEAT chunk (Information: 48 bytes)
Chunk type: HEARTBEAT (4)
Chunk flags: 0x00
Chunk length: 52
Heartbeat info parameter (Information: 44 bytes)
Frame 2816: 100 bytes on wire (800 bits), 100 bytes captured (800 bits)
Linux cooked capture
Internet Protocol Version 4, Src: 127.0.0.1, Dst: 127.0.0.12
Stream Control Transmission Protocol, Src Port: 3868 (3868), Dst Port: 3868 (3868)
Source port: 3868
Destination port: 3868
Verification tag: 0xd2059903
[Association index: 45]
Checksum: 0x00000000 [unverified]
[Checksum Status: Unverified]
ABORT chunk
Chunk type: ABORT (6)
Chunk flags: 0x00
.... ...0 = T-Bit: Tag not reflected
Chunk length: 50
Protocol violation cause
Cause code: Protocol violation (0x000d)
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: SCTP abort with T-bit set after handshake
2018-03-16 9:33 SCTP abort with T-bit set after handshake David Neil
@ 2018-03-16 15:14 ` Marcelo Ricardo Leitner
2018-03-16 15:54 ` David Neil
` (11 subsequent siblings)
12 siblings, 0 replies; 14+ messages in thread
From: Marcelo Ricardo Leitner @ 2018-03-16 15:14 UTC (permalink / raw)
To: linux-sctp
On Fri, Mar 16, 2018 at 09:33:12AM +0000, David Neil wrote:
> Hi,
> I have been having intermittent problems with SCTP connection failures; I have finally caught one on Wireshark.
Hi David,
Two things. 1. Which kernel version are you using?
2. Not sure if debian has dynamic_printk enabled, I guess so.
Please run this command to turn on some debug printk's, which should
get activated in the case the asoc is not found.
As root:
# echo 'func __sctp_rcv_lookup +p' > /sys/kernel/debug/dynamic_debug/control
And then reproduce the issue again. If you get a printk on dmesg, it's
because the association is not being located when the packet came in,
and thus why the Abort.
This message is available on kernel >= 4.11.
M.
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: SCTP abort with T-bit set after handshake
2018-03-16 9:33 SCTP abort with T-bit set after handshake David Neil
2018-03-16 15:14 ` Marcelo Ricardo Leitner
@ 2018-03-16 15:54 ` David Neil
2018-03-16 17:36 ` Marcelo Ricardo Leitner
` (10 subsequent siblings)
12 siblings, 0 replies; 14+ messages in thread
From: David Neil @ 2018-03-16 15:54 UTC (permalink / raw)
To: linux-sctp
Marcelo,
Thanks for the quick response.
The output of uname -a is:
Linux li1074-63 4.15.7-x86_64-linode102 #1 SMP Wed Feb 28 16:39:31 UTC 2018 x86_64 GNU/Linux
This is a Linode pre-built Debian 9 kernel with CONFIG_DYNAMIC_DEBUG not set.
I assume it will be a non-trivial job to build my own kernel to run on Linode - I will look into this.
Are there any other options which do not rely on dynamic debug.
Regards,
Dave.
> On 16 Mar 2018, at 15:14, Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> wrote:
>
> On Fri, Mar 16, 2018 at 09:33:12AM +0000, David Neil wrote:
>> Hi,
>> I have been having intermittent problems with SCTP connection failures; I have finally caught one on Wireshark.
>
> Hi David,
>
> Two things. 1. Which kernel version are you using?
>
> 2. Not sure if debian has dynamic_printk enabled, I guess so.
> Please run this command to turn on some debug printk's, which should
> get activated in the case the asoc is not found.
>
> As root:
> # echo 'func __sctp_rcv_lookup +p' > /sys/kernel/debug/dynamic_debug/control
>
> And then reproduce the issue again. If you get a printk on dmesg, it's
> because the association is not being located when the packet came in,
> and thus why the Abort.
>
> This message is available on kernel >= 4.11.
>
> M.
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: SCTP abort with T-bit set after handshake
2018-03-16 9:33 SCTP abort with T-bit set after handshake David Neil
2018-03-16 15:14 ` Marcelo Ricardo Leitner
2018-03-16 15:54 ` David Neil
@ 2018-03-16 17:36 ` Marcelo Ricardo Leitner
2018-03-16 19:05 ` Neil Horman
` (9 subsequent siblings)
12 siblings, 0 replies; 14+ messages in thread
From: Marcelo Ricardo Leitner @ 2018-03-16 17:36 UTC (permalink / raw)
To: linux-sctp
On Fri, Mar 16, 2018 at 03:54:49PM +0000, David Neil wrote:
> Marcelo,
> Thanks for the quick response.
>
> The output of uname -a is:
> Linux li1074-63 4.15.7-x86_64-linode102 #1 SMP Wed Feb 28 16:39:31 UTC 2018 x86_64 GNU/Linux
Nice, a very updated one.
>
> This is a Linode pre-built Debian 9 kernel with CONFIG_DYNAMIC_DEBUG not set.
If it's off than that code isn't there, meaning we can't attach a perf
probe or so. The function is also inlined, so we can't attach to its
return point.
A few lines below it will check if an asoc couldn't be found and will
increment SCTP_MIB_OUTOFBLUES. There are more places that inc it, but
it's a start.
It should show up in netstat -s or /proc/net/sctp/snmp.
> I assume it will be a non-trivial job to build my own kernel to run on Linode - I will look into this.
The pr_debug's in there are nice because they will give you the
ips and ports involved AND as it can't be hit from other places
(unlike the counter above), it would make sure the issue is related to
that.
Btw, is this test public? Can I run it too? Or if you can create a
small reproducer, that would be great.
>
> Are there any other options which do not rely on dynamic debug.
> Regards,
> Dave.
>
>
>
> > On 16 Mar 2018, at 15:14, Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> wrote:
> >
> > On Fri, Mar 16, 2018 at 09:33:12AM +0000, David Neil wrote:
> >> Hi,
> >> I have been having intermittent problems with SCTP connection failures; I have finally caught one on Wireshark.
> >
> > Hi David,
> >
> > Two things. 1. Which kernel version are you using?
> >
> > 2. Not sure if debian has dynamic_printk enabled, I guess so.
> > Please run this command to turn on some debug printk's, which should
> > get activated in the case the asoc is not found.
> >
> > As root:
> > # echo 'func __sctp_rcv_lookup +p' > /sys/kernel/debug/dynamic_debug/control
> >
> > And then reproduce the issue again. If you get a printk on dmesg, it's
> > because the association is not being located when the packet came in,
> > and thus why the Abort.
> >
> > This message is available on kernel >= 4.11.
> >
> > M.
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: SCTP abort with T-bit set after handshake
2018-03-16 9:33 SCTP abort with T-bit set after handshake David Neil
` (2 preceding siblings ...)
2018-03-16 17:36 ` Marcelo Ricardo Leitner
@ 2018-03-16 19:05 ` Neil Horman
2018-03-19 17:06 ` David Neil
` (8 subsequent siblings)
12 siblings, 0 replies; 14+ messages in thread
From: Neil Horman @ 2018-03-16 19:05 UTC (permalink / raw)
To: linux-sctp
On Fri, Mar 16, 2018 at 02:36:14PM -0300, Marcelo Ricardo Leitner wrote:
> On Fri, Mar 16, 2018 at 03:54:49PM +0000, David Neil wrote:
> > Marcelo,
> > Thanks for the quick response.
> >
> > The output of uname -a is:
> > Linux li1074-63 4.15.7-x86_64-linode102 #1 SMP Wed Feb 28 16:39:31 UTC 2018 x86_64 GNU/Linux
>
> Nice, a very updated one.
>
> >
> > This is a Linode pre-built Debian 9 kernel with CONFIG_DYNAMIC_DEBUG not set.
>
> If it's off than that code isn't there, meaning we can't attach a perf
> probe or so. The function is also inlined, so we can't attach to its
> return point.
>
We could write an system tap script to trace an inline function, no?
Neil
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: SCTP abort with T-bit set after handshake
2018-03-16 9:33 SCTP abort with T-bit set after handshake David Neil
` (3 preceding siblings ...)
2018-03-16 19:05 ` Neil Horman
@ 2018-03-19 17:06 ` David Neil
2018-03-19 18:38 ` Marcelo Ricardo Leitner
` (7 subsequent siblings)
12 siblings, 0 replies; 14+ messages in thread
From: David Neil @ 2018-03-19 17:06 UTC (permalink / raw)
To: linux-sctp
Marcelo,
Sorry for the slow reply, have been away and then have been struggling to reproduce the problem.
>
> A few lines below it will check if an asoc couldn't be found and will
> increment SCTP_MIB_OUTOFBLUES. There are more places that inc it, but
> it's a start.
>
> It should show up in netstat -s or /proc/net/sctp/snmp.
>
Have finally caught another instance of the problem while monitoring the SCTP statistics.
This is not helped by the fact that the out-of-blue counter goes up in total by about 600 while running a complete set of tests (I assume this is mainly at the end of each test when conections are abruptly terminated).
I have therefore been capturing the stats every 100msec and looking at the counters at the moment when the problem occurred.
This shows the out-of-blue counter being incremented at the same time as the SCTP connection failure.
16:07:12.708
SctpCurrEstab 22
SctpActiveEstabs 64619
SctpPassiveEstabs 64618
SctpAborteds 108462
SctpShutdowns 24922
SctpOutOfBlues 3471
SctpChecksumErrors 0
SctpOutCtrlChunks 915680
SctpOutOrderChunks 708312
SctpOutUnorderChunks 0
SctpInCtrlChunks 1314834
SctpInOrderChunks 704751
SctpInUnorderChunks 0
SctpFragUsrMsgs 0
SctpReasmUsrMsgs 0
SctpOutSCTPPacks 1489904
SctpInSCTPPacks 1488886
SctpT1InitExpireds 108
SctpT1CookieExpireds 2
SctpT2ShutdownExpireds 80
SctpT3RtxExpireds 162
SctpT4RtoExpireds 0
SctpT5ShutdownGuardExpireds 0
SctpDelaySackExpireds 54915
SctpAutocloseExpireds 0
SctpT3Retransmits 157
SctpPmtudRetransmits 0
SctpFastRetransmits 0
SctpInPktSoftirq 1217809
SctpInPktBacklog 270941
SctpInPktDiscards 3483
SctpInDataChunkDiscards 0
16:07:12.810
SctpCurrEstab 38
SctpActiveEstabs 64627
SctpPassiveEstabs 64627
SctpAborteds 108463
SctpShutdowns 24922
SctpOutOfBlues 3472
SctpChecksumErrors 0
SctpOutCtrlChunks 915742
SctpOutOrderChunks 708342
SctpOutUnorderChunks 0
SctpInCtrlChunks 1314899
SctpInOrderChunks 704781
SctpInUnorderChunks 0
SctpFragUsrMsgs 0
SctpReasmUsrMsgs 0
SctpOutSCTPPacks 1489978
SctpInSCTPPacks 1488960
SctpT1InitExpireds 108
SctpT1CookieExpireds 2
SctpT2ShutdownExpireds 80
SctpT3RtxExpireds 162
SctpT4RtoExpireds 0
SctpT5ShutdownGuardExpireds 0
SctpDelaySackExpireds 54920
SctpAutocloseExpireds 0
SctpT3Retransmits 157
SctpPmtudRetransmits 0
SctpFastRetransmits 0
SctpInPktSoftirq 1217854
SctpInPktBacklog 270970
SctpInPktDiscards 3484
SctpInDataChunkDiscards 0
>
> Btw, is this test public? Can I run it too?
Unfortunately, it is private.
> Or if you can create a
> small reproducer, that would be great.
This would be great if I could figure out what the important elements are in what I am doing.
The tests are opening and closing and aborting large numbers of connections.
Some of the connections are used to exchange a lot of data, others hardly carry anything.
The connection that fails appears to be fairly random. The timing of when it fails appears to be fairly random.
The failure only occurs after an average of over an hour of running.
Any hints at the kind of behaviour that could trigger a failure like this?
Thanks,
Dave.
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: SCTP abort with T-bit set after handshake
2018-03-16 9:33 SCTP abort with T-bit set after handshake David Neil
` (4 preceding siblings ...)
2018-03-19 17:06 ` David Neil
@ 2018-03-19 18:38 ` Marcelo Ricardo Leitner
2018-03-19 20:28 ` Marcelo Ricardo Leitner
` (6 subsequent siblings)
12 siblings, 0 replies; 14+ messages in thread
From: Marcelo Ricardo Leitner @ 2018-03-19 18:38 UTC (permalink / raw)
To: linux-sctp
On Mon, Mar 19, 2018 at 05:06:05PM +0000, David Neil wrote:
> Marcelo,
> Sorry for the slow reply, have been away and then have been struggling to reproduce the problem.
No problem.
>
>
> >
> > A few lines below it will check if an asoc couldn't be found and will
> > increment SCTP_MIB_OUTOFBLUES. There are more places that inc it, but
> > it's a start.
> >
> > It should show up in netstat -s or /proc/net/sctp/snmp.
> >
>
> Have finally caught another instance of the problem while monitoring
> the SCTP statistics.
> This is not helped by the fact that the out-of-blue counter goes up
> in total by about 600 while running a complete set of tests (I
> assume this is mainly at the end of each test when conections are
> abruptly terminated).
Ouch. This will make it very hard to debug. Even with Neil's
idea of using systemtap, it will likely get too much noise with it.
> I have therefore been capturing the stats every 100msec and looking
> at the counters at the moment when the problem occurred.
>
> This shows the out-of-blue counter being incremented at the same
> time as the SCTP connection failure.
>
...
Ok. This didn't help much, sorry. Just the fact that the counter is
going up, on this situation of several tests going on, won't give us
much. It is a good info, it's just that now we have to remove all the
noise together with it.
>
> >
> > Btw, is this test public? Can I run it too?
>
> Unfortunately, it is private.
>
>
> > Or if you can create a
> > small reproducer, that would be great.
>
> This would be great if I could figure out what the important elements are in what I am doing.
> The tests are opening and closing and aborting large numbers of connections.
> Some of the connections are used to exchange a lot of data, others hardly carry anything.
> The connection that fails appears to be fairly random. The timing of when it fails appears to be fairly random.
> The failure only occurs after an average of over an hour of running.
> Any hints at the kind of behaviour that could trigger a failure like this?
I noticed that the association you referenced used the same port at
both hosts. You don't have a port re-use happening in there, do you?
I fear you won't have other choice other than trimming this down to a
more specific test.
We could, for example, trigger a Panic when the test fails, but then
it's probably too late for us to do any analysis in the vmcore. And we
can't trigger the panic on Abort generation because it will catch the
other expected failures.
One other idea is, if it takes ~1hr to reproduce, try reducing the
pool of tests that are executed in that window and see how it goes.
M.
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: SCTP abort with T-bit set after handshake
2018-03-16 9:33 SCTP abort with T-bit set after handshake David Neil
` (5 preceding siblings ...)
2018-03-19 18:38 ` Marcelo Ricardo Leitner
@ 2018-03-19 20:28 ` Marcelo Ricardo Leitner
2018-03-19 20:29 ` Marcelo Ricardo Leitner
` (5 subsequent siblings)
12 siblings, 0 replies; 14+ messages in thread
From: Marcelo Ricardo Leitner @ 2018-03-19 20:28 UTC (permalink / raw)
To: linux-sctp
On Mon, Mar 19, 2018 at 03:38:00PM -0300, Marcelo Ricardo Leitner wrote:
> > > Or if you can create a
> > > small reproducer, that would be great.
> >
> > This would be great if I could figure out what the important elements are in what I am doing.
> > The tests are opening and closing and aborting large numbers of connections.
> > Some of the connections are used to exchange a lot of data, others hardly carry anything.
> > The connection that fails appears to be fairly random. The timing of when it fails appears to be fairly random.
> > The failure only occurs after an average of over an hour of running.
> > Any hints at the kind of behaviour that could trigger a failure like this?
>
> I noticed that the association you referenced used the same port at
> both hosts. You don't have a port re-use happening in there, do you?
If you have several associations using the same (src ip, dst ip, dst
port) tuple, you may be facing an issue with rhlists.
(netdev patchset Subject rhashtable: Fix rhltable duplicates insertion)
We use rhltable for the transport list and their description of the
issue matches your situation too AFAICT.
M.
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: SCTP abort with T-bit set after handshake
2018-03-16 9:33 SCTP abort with T-bit set after handshake David Neil
` (6 preceding siblings ...)
2018-03-19 20:28 ` Marcelo Ricardo Leitner
@ 2018-03-19 20:29 ` Marcelo Ricardo Leitner
2018-03-19 22:05 ` David Neil
` (4 subsequent siblings)
12 siblings, 0 replies; 14+ messages in thread
From: Marcelo Ricardo Leitner @ 2018-03-19 20:29 UTC (permalink / raw)
To: linux-sctp
On Mon, Mar 19, 2018 at 05:28:13PM -0300, Marcelo Ricardo Leitner wrote:
> On Mon, Mar 19, 2018 at 03:38:00PM -0300, Marcelo Ricardo Leitner wrote:
> > > > Or if you can create a
> > > > small reproducer, that would be great.
> > >
> > > This would be great if I could figure out what the important elements are in what I am doing.
> > > The tests are opening and closing and aborting large numbers of connections.
> > > Some of the connections are used to exchange a lot of data, others hardly carry anything.
> > > The connection that fails appears to be fairly random. The timing of when it fails appears to be fairly random.
> > > The failure only occurs after an average of over an hour of running.
> > > Any hints at the kind of behaviour that could trigger a failure like this?
> >
> > I noticed that the association you referenced used the same port at
> > both hosts. You don't have a port re-use happening in there, do you?
>
> If you have several associations using the same (src ip, dst ip, dst
> port) tuple, you may be facing an issue with rhlists.
> (netdev patchset Subject rhashtable: Fix rhltable duplicates insertion)
https://www.mail-archive.com/netdev@vger.kernel.org/msg220650.html
>
> We use rhltable for the transport list and their description of the
> issue matches your situation too AFAICT.
>
> M.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: SCTP abort with T-bit set after handshake
2018-03-16 9:33 SCTP abort with T-bit set after handshake David Neil
` (7 preceding siblings ...)
2018-03-19 20:29 ` Marcelo Ricardo Leitner
@ 2018-03-19 22:05 ` David Neil
2018-03-19 22:24 ` Marcelo Ricardo Leitner
` (3 subsequent siblings)
12 siblings, 0 replies; 14+ messages in thread
From: David Neil @ 2018-03-19 22:05 UTC (permalink / raw)
To: linux-sctp
There are two patterns of SCTP connections that we use; I believe we have seen the SCTP connection failures on both types of connection.
1) Every task is assigned a unique SCTP port. All tasks then communicate with each other using the standard localhost address 127.0.0.1. Where TASKa and TASKb both connect to TASKc we would end in the situation where the src IP, dst IP and dst port are the same for two connections, the connections only differ by the src port.
2) Where we are using protocols with well known port numbers (e.g Diameter and S1AP), and have multiple tasks that want to use that port, then we separate the connections by using multiple loopback interfaces. For example with S1AP, we may have one connection with src IP\x127.0.0.4, src port6412, dst IP\x127.0.0.1, dst port6412, and a second connection with src IP\x127.0.0.3, src port6412, dst IP\x127.0.0.1, dst port6412. In this case the connections only differ by the src IP.
Can both these scenarios be explained by this issue with rhlists?
Thanks,
Dave.
> On 19 Mar 2018, at 20:29, Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> wrote:
>
> On Mon, Mar 19, 2018 at 05:28:13PM -0300, Marcelo Ricardo Leitner wrote:
>> On Mon, Mar 19, 2018 at 03:38:00PM -0300, Marcelo Ricardo Leitner wrote:
>>>>> Or if you can create a
>>>>> small reproducer, that would be great.
>>>>
>>>> This would be great if I could figure out what the important elements are in what I am doing.
>>>> The tests are opening and closing and aborting large numbers of connections.
>>>> Some of the connections are used to exchange a lot of data, others hardly carry anything.
>>>> The connection that fails appears to be fairly random. The timing of when it fails appears to be fairly random.
>>>> The failure only occurs after an average of over an hour of running.
>>>> Any hints at the kind of behaviour that could trigger a failure like this?
>>>
>>> I noticed that the association you referenced used the same port at
>>> both hosts. You don't have a port re-use happening in there, do you?
>>
>> If you have several associations using the same (src ip, dst ip, dst
>> port) tuple, you may be facing an issue with rhlists.
>> (netdev patchset Subject rhashtable: Fix rhltable duplicates insertion)
>
> https://www.mail-archive.com/netdev@vger.kernel.org/msg220650.html
>
>>
>> We use rhltable for the transport list and their description of the
>> issue matches your situation too AFAICT.
>>
>> M.
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: SCTP abort with T-bit set after handshake
2018-03-16 9:33 SCTP abort with T-bit set after handshake David Neil
` (8 preceding siblings ...)
2018-03-19 22:05 ` David Neil
@ 2018-03-19 22:24 ` Marcelo Ricardo Leitner
2018-03-21 16:09 ` David Neil
` (2 subsequent siblings)
12 siblings, 0 replies; 14+ messages in thread
From: Marcelo Ricardo Leitner @ 2018-03-19 22:24 UTC (permalink / raw)
To: linux-sctp
On Mon, Mar 19, 2018 at 10:05:56PM +0000, David Neil wrote:
> There are two patterns of SCTP connections that we use; I believe we have seen the SCTP connection failures on both types of connection.
>
> 1) Every task is assigned a unique SCTP port. All tasks then communicate with each other using the standard localhost address 127.0.0.1. Where TASKa and TASKb both connect to TASKc we would end in the situation where the src IP, dst IP and dst port are the same for two connections, the connections only differ by the src port.
>
> 2) Where we are using protocols with well known port numbers (e.g Diameter and S1AP), and have multiple tasks that want to use that port, then we separate the connections by using multiple loopback interfaces. For example with S1AP, we may have one connection with src IP\x127.0.0.4, src port6412, dst IP\x127.0.0.1, dst port6412, and a second connection with src IP\x127.0.0.3, src port6412, dst IP\x127.0.0.1, dst port6412. In this case the connections only differ by the src IP.
>
> Can both these scenarios be explained by this issue with rhlists?
AFAIU both situations, yes. At the very least, worth a try.
Maybe it's easier for you to add some randomness to the src port than
to test a new kernel? This would give a good hint I think.
>
> Thanks,
> Dave.
>
>
> > On 19 Mar 2018, at 20:29, Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> wrote:
> >
> > On Mon, Mar 19, 2018 at 05:28:13PM -0300, Marcelo Ricardo Leitner wrote:
> >> On Mon, Mar 19, 2018 at 03:38:00PM -0300, Marcelo Ricardo Leitner wrote:
> >>>>> Or if you can create a
> >>>>> small reproducer, that would be great.
> >>>>
> >>>> This would be great if I could figure out what the important elements are in what I am doing.
> >>>> The tests are opening and closing and aborting large numbers of connections.
> >>>> Some of the connections are used to exchange a lot of data, others hardly carry anything.
> >>>> The connection that fails appears to be fairly random. The timing of when it fails appears to be fairly random.
> >>>> The failure only occurs after an average of over an hour of running.
> >>>> Any hints at the kind of behaviour that could trigger a failure like this?
> >>>
> >>> I noticed that the association you referenced used the same port at
> >>> both hosts. You don't have a port re-use happening in there, do you?
> >>
> >> If you have several associations using the same (src ip, dst ip, dst
> >> port) tuple, you may be facing an issue with rhlists.
> >> (netdev patchset Subject rhashtable: Fix rhltable duplicates insertion)
> >
> > https://www.mail-archive.com/netdev@vger.kernel.org/msg220650.html
> >
> >>
> >> We use rhltable for the transport list and their description of the
> >> issue matches your situation too AFAICT.
> >>
> >> M.
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at http://vger.kernel.org/majordomo-info.html
> >>
>
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: SCTP abort with T-bit set after handshake
2018-03-16 9:33 SCTP abort with T-bit set after handshake David Neil
` (9 preceding siblings ...)
2018-03-19 22:24 ` Marcelo Ricardo Leitner
@ 2018-03-21 16:09 ` David Neil
2018-03-21 16:35 ` Marcelo Ricardo Leitner
2018-03-24 7:32 ` David Neil
12 siblings, 0 replies; 14+ messages in thread
From: David Neil @ 2018-03-21 16:09 UTC (permalink / raw)
To: linux-sctp
Marcelo,
It would not have been easy to fix the connections of the first type described below as this is a fundamental part of the design of the software.
But it was possible to change the second type of connection.
In all cases, where we had multiple SCTP connection differing only by the source IP address, I changed them so that they also had different source ports.
i.e. 127.0.0.3,36412 => 127.0.0.1,36412 and 127.0.0.4,36412 => 127.0.0.1,36412
became 127.0.0.3,2001 => 127.0.0.1,36412 and 127.0.0.4,2002 => 127.0.0.1,36412
Somewhat surprisingly, this seems to have fixed everything.
I have now been running the tests in a loop for nearly 36 hours and there have been no failures.
I was expecting this change to fix the failures for the second type of connection, but not expecting it to fix the failures for the first type of connection; but it appears that it has fixed both.
It appears that having multiple connections differing only in the source IP address could cause connection failures on other unrelated SCTP connections.
I am assuming this decription I have given still fits in with the theory that the failures were casued by the rhlists bug. Do you need any more info to confirm this?
From my point of view, this issue is now resolved.
Dave.
> On 19 Mar 2018, at 22:24, Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> wrote:
>
> On Mon, Mar 19, 2018 at 10:05:56PM +0000, David Neil wrote:
>> There are two patterns of SCTP connections that we use; I believe we have seen the SCTP connection failures on both types of connection.
>>
>> 1) Every task is assigned a unique SCTP port. All tasks then communicate with each other using the standard localhost address 127.0.0.1. Where TASKa and TASKb both connect to TASKc we would end in the situation where the src IP, dst IP and dst port are the same for two connections, the connections only differ by the src port.
>>
>> 2) Where we are using protocols with well known port numbers (e.g Diameter and S1AP), and have multiple tasks that want to use that port, then we separate the connections by using multiple loopback interfaces. For example with S1AP, we may have one connection with src IP\x127.0.0.4, src port6412, dst IP\x127.0.0.1, dst port6412, and a second connection with src IP\x127.0.0.3, src port6412, dst IP\x127.0.0.1, dst port6412. In this case the connections only differ by the src IP.
>>
>> Can both these scenarios be explained by this issue with rhlists?
>
> AFAIU both situations, yes. At the very least, worth a try.
>
> Maybe it's easier for you to add some randomness to the src port than
> to test a new kernel? This would give a good hint I think.
>
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: SCTP abort with T-bit set after handshake
2018-03-16 9:33 SCTP abort with T-bit set after handshake David Neil
` (10 preceding siblings ...)
2018-03-21 16:09 ` David Neil
@ 2018-03-21 16:35 ` Marcelo Ricardo Leitner
2018-03-24 7:32 ` David Neil
12 siblings, 0 replies; 14+ messages in thread
From: Marcelo Ricardo Leitner @ 2018-03-21 16:35 UTC (permalink / raw)
To: linux-sctp
Hi David,
On Wed, Mar 21, 2018 at 04:09:30PM +0000, David Neil wrote:
> Marcelo,
> It would not have been easy to fix the connections of the first type
> described below as this is a fundamental part of the design of the
> software.
>
> But it was possible to change the second type of connection.
> In all cases, where we had multiple SCTP connection differing only
> by the source IP address, I changed them so that they also had
> different source ports.
>
> i.e. 127.0.0.3,36412 => 127.0.0.1,36412 and 127.0.0.4,36412 => 127.0.0.1,36412
> became 127.0.0.3,2001 => 127.0.0.1,36412 and 127.0.0.4,2002 => 127.0.0.1,36412
>
> Somewhat surprisingly, this seems to have fixed everything.
> I have now been running the tests in a loop for nearly 36 hours and
> there have been no failures.
Yay, nice!
>
> I was expecting this change to fix the failures for the second type
> of connection, but not expecting it to fix the failures for the
> first type of connection; but it appears that it has fixed both.
> It appears that having multiple connections differing only in the
> source IP address could cause connection failures on other unrelated
> SCTP connections.
I understand that the first type already used a different src port for
each connection.
But yes, if the hashes of the first type ended up being the same as
the hashes of the second type, the bug could affect both. Consider
that for the rhashtable, once hashed, it doesn't matter what the
actual keys were.
>
> I am assuming this decription I have given still fits in with the
> theory that the failures were casued by the rhlists bug. Do you need
> any more info to confirm this?
If doable, you could apply those fixes and run the original tests
again. That would be very good to confirm it, but not really needed.
I'll try to write some test case for this. Will let you know once I
have it.
Note sure if you know about it, but we have a growing collection of
test cases @ https://github.com/sctp/sctp-tests
Pull requests are very welcomed. :-)
>
> From my point of view, this issue is now resolved.
Great! But note that the real fix is to apply the rhashtable patches.
Changing the src port is just a workaround and you may still hit the
issue if the stars align again.
Thanks,
M.
> Dave.
>
>
> > On 19 Mar 2018, at 22:24, Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> wrote:
> >
> > On Mon, Mar 19, 2018 at 10:05:56PM +0000, David Neil wrote:
> >> There are two patterns of SCTP connections that we use; I believe we have seen the SCTP connection failures on both types of connection.
> >>
> >> 1) Every task is assigned a unique SCTP port. All tasks then communicate with each other using the standard localhost address 127.0.0.1. Where TASKa and TASKb both connect to TASKc we would end in the situation where the src IP, dst IP and dst port are the same for two connections, the connections only differ by the src port.
> >>
> >> 2) Where we are using protocols with well known port numbers (e.g Diameter and S1AP), and have multiple tasks that want to use that port, then we separate the connections by using multiple loopback interfaces. For example with S1AP, we may have one connection with src IP\x127.0.0.4, src port6412, dst IP\x127.0.0.1, dst port6412, and a second connection with src IP\x127.0.0.3, src port6412, dst IP\x127.0.0.1, dst port6412. In this case the connections only differ by the src IP.
> >>
> >> Can both these scenarios be explained by this issue with rhlists?
> >
> > AFAIU both situations, yes. At the very least, worth a try.
> >
> > Maybe it's easier for you to add some randomness to the src port than
> > to test a new kernel? This would give a good hint I think.
> >
>
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: SCTP abort with T-bit set after handshake
2018-03-16 9:33 SCTP abort with T-bit set after handshake David Neil
` (11 preceding siblings ...)
2018-03-21 16:35 ` Marcelo Ricardo Leitner
@ 2018-03-24 7:32 ` David Neil
12 siblings, 0 replies; 14+ messages in thread
From: David Neil @ 2018-03-24 7:32 UTC (permalink / raw)
To: linux-sctp
>
>>
>> I am assuming this decription I have given still fits in with the
>> theory that the failures were casued by the rhlists bug. Do you need
>> any more info to confirm this?
>
> If doable, you could apply those fixes and run the original tests
> again. That would be very good to confirm it, but not really needed.
>
> I'll try to write some test case for this. Will let you know once I
> have it.
>
> Note sure if you know about it, but we have a growing collection of
> test cases @ https://github.com/sctp/sctp-tests
>
> Pull requests are very welcomed. :-)
>
The testing I have been doing has all been with the original test cases. These all work fine now.
I have been trying to put together a cut down standalone test that could recreate this problem, but I have not had any luck.
I guess if it was easy to reproduce then this linux bug would have been picked up ages ago!
Hopefully you have more luck.
Regards,
Dave.
^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2018-03-24 7:32 UTC | newest]
Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-03-16 9:33 SCTP abort with T-bit set after handshake David Neil
2018-03-16 15:14 ` Marcelo Ricardo Leitner
2018-03-16 15:54 ` David Neil
2018-03-16 17:36 ` Marcelo Ricardo Leitner
2018-03-16 19:05 ` Neil Horman
2018-03-19 17:06 ` David Neil
2018-03-19 18:38 ` Marcelo Ricardo Leitner
2018-03-19 20:28 ` Marcelo Ricardo Leitner
2018-03-19 20:29 ` Marcelo Ricardo Leitner
2018-03-19 22:05 ` David Neil
2018-03-19 22:24 ` Marcelo Ricardo Leitner
2018-03-21 16:09 ` David Neil
2018-03-21 16:35 ` Marcelo Ricardo Leitner
2018-03-24 7:32 ` David Neil
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.