From mboxrd@z Thu Jan 1 00:00:00 1970
From: Alexander Ahring Oder Aring
Date: Sat, 3 Apr 2021 11:34:33 -0400
Subject: [Cluster-devel] [PATCHv3 dlm/next 7/8] fs: dlm: add reliable connection if reconnect
In-Reply-To: <20210402205351.GA24027@linux-2.home>
References: <20210326173337.44231-1-aahringo@redhat.com>
 <20210326173337.44231-8-aahringo@redhat.com>
 <20210402205351.GA24027@linux-2.home>
Message-ID: 
List-Id: 
To: cluster-devel.redhat.com
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

Hi Guillaume,

On Fri, Apr 2, 2021 at 4:54 PM Guillaume Nault wrote:
>
> On Fri, Mar 26, 2021 at 01:33:36PM -0400, Alexander Aring wrote:
> > This patch makes a TCP lowcomms connection reliable even if
> > reconnects occur. This is done by application-layer re-transmission
> > handling and sequence numbers in the dlm protocol. There are three
> > new dlm commands:
> >
> > DLM_OPTS:
> >
> > This encapsulates an existing dlm message (and rcom messages, if
> > they don't have their own application-side re-transmission
> > handling). Optional TLVs (type-length-value fields) can be
> > appended, for example a sequence number field. However, because the
> > lockspace field is unused in DLM_OPTS and the sequence number is a
> > mandatory field, it isn't carried as a TLV; instead we put the
> > sequence number inside the lockspace id. The possibility to add
> > optional fields is still there for future purposes.
> >
> > DLM_ACK:
> >
> > Just a dlm header that acknowledges the receipt of a DLM_OPTS
> > message to its sender.
> >
> > DLM_FIN:
> >
> > A new dlm message to synchronize pending messages with the other
> > dlm end when a node wants to disconnect. Each side waits until it
> > receives this message, then disconnects. It's important that this
> > message has nothing to do with the application logic, because it
> > might run into a timeout if acknowledge messages are dropped.
> > The problem which we try to solve with DLM_FIN is that the node
> > membership is handled by corosync and not by the kernel dlm
> > protocol itself; DLM_FIN tries to synchronize the kernel dlm
> > protocol with the corosync join/leave membership callbacks.
>
> If I understand correctly, currently, when DLM faces a hard but
> temporary connection failure (like receiving a stray TCP RST), it
> automatically recreates a new connection. However, the in-flight data
> of the previous connection is lost. The problem is that such data
> loss can put the participating nodes into inconsistent states.
>
> Therefore this patch series implements sequence number tracking at
> the application level, so that a new connection can retransmit
> unacknowledged data from the previous, failed, connection.
>
> Is that an accurate summary or is there anything I've missed?
>

Yes, there is one thing missing. I found a synchronization issue
because the node membership of a lockspace is handled by a different
protocol in user space. This protocol declares the actual peers of the
DLM protocol. In short, there are two protocols here: one handles the
membership of peers and the other handles the actual operations. There
was a missing synchronization between operations and membership, which
I solved with the DLM_FIN message.

> I feel that this patch goes very far into re-implementing TCP-like
> features. For example, why is fast-retransmit even needed? DLM seems
> to always use a transport protocol that guarantees reliable and
> in-order delivery. Therefore, duplicate DLM acknowledgements can't
> happen on an established connection. Even when reconnecting, it'd be
> the sender's responsibility to resend the last unacknowledged data
> first. How could this lead to holes in the DLM sequence numbers on
> the receiver's side?
>

I agree it's a TCP re-implementation at the application layer.
Fast-retransmit is not strictly needed, just as it isn't in TCP.
However, it resolves drops faster; without it, in DLM and GFS2 you
would experience hangs in some filesystem syscalls until the drop is
resolved.

> It seems to me that the only time DLM might need to retransmit data
> is when recovering from a connection failure. So why can't we just
> resend unacknowledged data at reconnection time? That'd probably
> simplify the code a lot (no need to maintain a retransmission timeout
> on TX, no need to handle sequence numbers that are in the future on
> RX).
>

I can try to remove the timer and timeout and use the above approach
of retransmitting at reconnect. Then I will test it again and report
back whether it works or what other problems show up.

> Also, couldn't we set the DLM sequence numbers in
> dlm_midcomms_commit_buffer_3_2() rather than using a callback
> function in dlm_lowcomms_new_buffer()?
>

That has to do with the internal buffer allocator. The ordering is
defined in dlm_lowcomms_new_buffer() and there is an internal lock
which needs to be held there. I can't do this without somehow making
the lock accessible in dlm_midcomms_commit_buffer_3_2(). The code is
also optimized so that the lock isn't held during kmalloc().

> Finally, I wonder if we could avoid adding DLM sequence numbers
> entirely. Have you considered using the TCP_REPAIR infrastructure to
> retrieve the send queue of the failed socket and resend that data
> once the new socket is connected?
>

Yes, I looked into TCP_REPAIR first and I agree it can be used to
solve this problem. However, TCP_REPAIR would have to be part of a
more generic solution; some additional handling would be needed, e.g.
additional socket options to let the application layer save state
before receiving errors. I am also concerned about how it would work
with haproxy, etc. It does have some advantages, such as restoring the
window size. I was thinking about submitting a proposal to the next
netdevconf after doing some experiments with it.
We decided to solve this problem at the application layer first and
then look into how to make a more generic solution. I am also not sure
how cifs deals with reconnects, or whether it just allows drops.

> Please correct me if the above suggestions don't make sense. I'm not
> familiar with DLM so I can very well be missing important points.
>

No problem, it all makes sense to me.

- Alex