From mboxrd@z Thu Jan  1 00:00:00 1970
From: Guillaume Nault <gnault@redhat.com>
Date: Fri, 2 Apr 2021 22:53:51 +0200
Subject: [Cluster-devel] [PATCHv3 dlm/next 7/8] fs: dlm: add reliable
 connection if reconnect
In-Reply-To: <20210326173337.44231-8-aahringo@redhat.com>
References: <20210326173337.44231-1-aahringo@redhat.com>
	<20210326173337.44231-8-aahringo@redhat.com>
Message-ID: <20210402205351.GA24027@linux-2.home>
List-Id: <cluster-devel.redhat.com>
To: cluster-devel.redhat.com
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

On Fri, Mar 26, 2021 at 01:33:36PM -0400, Alexander Aring wrote:
> This patch introduce to make a tcp lowcomms connection reliable even if
> reconnects occurs. This is done by an application layer re-transmission
> handling and sequence numbers in dlm protocols. There are three new dlm
> commands:
> 
> DLM_OPTS:
> 
> This will encapsulate an existing dlm message (and rcom message if they
> don't have an own application side re-transmission handling). As optional
> handling additional tlv's (type length fields) can be appended. This can
> be for example a sequence number field. However because in DLM_OPTS the
> lockspace field is unused and a sequence number is a mandatory field it
> isn't made as a tlv and we put the sequence number inside the lockspace
> id. The possibility to add optional options are still there for future
> purposes.
> 
> DLM_ACK:
> 
> Just a dlm header to acknowledge the receive of a DLM_OPTS message to
> it's sender.
> 
> DLM_FIN:
> 
> A new DLM message to synchronize pending message to the other dlm end if
> the node want to disconnects. Each side waits until it receives this
> message and disconnects. It's important that this message has nothing to
> do with the application logic because it might run in a timeout if
> acknowledge messages are dropped. The problem which we try to solve with
> DLM_FIN is that the node membership is handled by corosync and not the
> kernel dlm protocol itself, DLM_FIN tries to synchronize the kernel dlm
> protocol with corosync join/leave membership callbacks.

If I understand correctly, currently, when DLM faces a hard but
temporary connection failure (like receiving a stray TCP RST), it
automatically recreates a new connection. However, the in-flight data
of the previous connection are lost. The problem is that such data loss
can turn the participating nodes into inconsistent states.

Therefore this patch series implements sequence number tracking at the
application level, so that a new connection can retransmit
unacknowledged data from the previous, failed, connection.

Is that an accurate summary or is there anything I've missed?

I feel that this patch goes very far into re-implementing TCP-like
features. For example, why is fast-retransmit even needed? DLM seems to
always uses a transport protocol that guarantees reliable and in order
delivery. Therefore, duplicate DLM acknowledgements can't happen on an
established connection. Even when reconnecting, it'd be the sender's
responsibility to resend the last unackowledged data first. How could
this lead to holes in the DLM sequence numbers on the receiver's side?

It seems to me that the only time DLM might need to retransmit data, is
when recovering from a connection failure. So why can't we just resend
unacknowledged data at reconnection time? That'd probably simplify the
code a lot (no need to maintain a retransmission timeout on TX, no need
to handle sequence numbers that are in the future on RX).

Also, couldn't we set the DLM sequence numbers in
dlm_midcomms_commit_buffer_3_2() rather than using a callback function
in dlm_lowcomms_new_buffer()?

Finally, I wonder if we could avoid adding DLM sequence numbers
entirely. Have you considered using the TCP_REPAIR infrastructure to
retrieve the send queue of the failed socket and resend that data once
the new socket is connected?

Please correct me if the above suggestions don't make sense. I'm not
familiar with DLM so I can very well be missing important points.

Thanks,

> To explain the basic functionality take a look into the
> dlm_midcomms_receive_buffer() function. This function will take care
> that dlm messages are delivered according to their sequence numbers and
> request re-transmission via sending acknowledge messages. However there
> exists three cases:
> 
> 1. sequence number is the one which is expected. That means everything
>    is working fine. Additional there is always a check if the next
>    message was already queued for future, this will occur when there was
>    some messages drops before.
> 
> 2. A sequence number is in the future, in this case we queue it for might
>    future delivery, see case 1.
> 
> 3. A sequence number is in the past, in this case we drop this message
>    because it was already delivered.
> 
> To send acknowledge we always send the sequence number which is
> expected, if the other node sends multiple acknowledge for the same
> sequence numbers it will trigger a re-transmission. In case no acknowledge
> is send back, a timer with a timeout handling is running and will trigger
> a re-tranmission as well. Sending multiple ack's with the same sequence or
> messages with the same sequence should not have any effects that breaks
> dlm. Only messages in the far future can break dlm, that's why important
> that the closing connection is right synchronized with DLM_FIN which
> also resets the sequence numbers.
> 
> As RCOM_STATUS and RCOM_NAMES messages are the first messages which are
> exchanged and they have they own re-transmission handling, there exists
> logic that these messages must be first. If these messages arrives we
> store the dlm version field. This handling is on DLM 3.1 and after this
> patch 3.2 the same. A backwards compatibility handling has been added
> which seems to work on tests without tcpkill, however it's not recommended
> to use DLM 3.1 and 3.2 at the same time, because DLM 3.2 tries to fix long
> term bugs in the DLM protocol.
> 
> Signed-off-by: Alexander Aring <aahringo@redhat.com>