[Cluster-devel] [PATCHv3 dlm/next 7/8] fs: dlm: add reliable connection if reconnect

From: Guillaume Nault <gnault@redhat.com>
To: cluster-devel.redhat.com
Subject: [Cluster-devel] [PATCHv3 dlm/next 7/8] fs: dlm: add reliable connection if reconnect
Date: Fri, 9 Apr 2021 22:32:21 +0200	[thread overview]
Message-ID: <20210409203221.GB30244@linux-2.home> (raw)
In-Reply-To: <CAK-6q+hnj94xQS+QceDF3GyDR78ns61-T1UVLs7o6kJsPzT=Fw@mail.gmail.com>

On Sat, Apr 03, 2021 at 11:34:33AM -0400, Alexander Ahring Oder Aring wrote:
> Hi Guillaume,
> 
> On Fri, Apr 2, 2021 at 4:54 PM Guillaume Nault <gnault@redhat.com> wrote:
> >
> > On Fri, Mar 26, 2021 at 01:33:36PM -0400, Alexander Aring wrote:
> > > This patch introduce to make a tcp lowcomms connection reliable even if
> > > reconnects occurs. This is done by an application layer re-transmission
> > > handling and sequence numbers in dlm protocols. There are three new dlm
> > > commands:
> > >
> > > DLM_OPTS:
> > >
> > > This will encapsulate an existing dlm message (and rcom message if they
> > > don't have an own application side re-transmission handling). As optional
> > > handling additional tlv's (type length fields) can be appended. This can
> > > be for example a sequence number field. However because in DLM_OPTS the
> > > lockspace field is unused and a sequence number is a mandatory field it
> > > isn't made as a tlv and we put the sequence number inside the lockspace
> > > id. The possibility to add optional options are still there for future
> > > purposes.
> > >
> > > DLM_ACK:
> > >
> > > Just a dlm header to acknowledge the receive of a DLM_OPTS message to
> > > it's sender.
> > >
> > > DLM_FIN:
> > >
> > > A new DLM message to synchronize pending message to the other dlm end if
> > > the node want to disconnects. Each side waits until it receives this
> > > message and disconnects. It's important that this message has nothing to
> > > do with the application logic because it might run in a timeout if
> > > acknowledge messages are dropped. The problem which we try to solve with
> > > DLM_FIN is that the node membership is handled by corosync and not the
> > > kernel dlm protocol itself, DLM_FIN tries to synchronize the kernel dlm
> > > protocol with corosync join/leave membership callbacks.
> >
> > If I understand correctly, currently, when DLM faces a hard but
> > temporary connection failure (like receiving a stray TCP RST), it
> > automatically recreates a new connection. However, the in-flight data
> > of the previous connection are lost. The problem is that such data loss
> > can turn the participating nodes into inconsistent states.
> >
> > Therefore this patch series implements sequence number tracking at the
> > application level, so that a new connection can retransmit
> > unacknowledged data from the previous, failed, connection.
> >
> > Is that an accurate summary or is there anything I've missed?
> >
> 
> Yes, there is one thing missing. I detected a synchronization issue
> because the node membership of a lockspace is done by a different
> protocol handled in user space. This protocol declares the actual peer
> of the DLM protocol. In short, there are two protocols here, whereas
> one handles the membership of peers and the other handles the actual
> operations. There is a missing synchronization between operations and
> membership which I solved with a DLM_FIN message.

Thanks, I'll re-read the parts about DLM_FIN with this explanation in
mind.

> > I feel that this patch goes very far into re-implementing TCP-like
> > features. For example, why is fast-retransmit even needed? DLM seems to
> > always uses a transport protocol that guarantees reliable and in order
> > delivery. Therefore, duplicate DLM acknowledgements can't happen on an
> > established connection. Even when reconnecting, it'd be the sender's
> > responsibility to resend the last unackowledged data first. How could
> > this lead to holes in the DLM sequence numbers on the receiver's side?
> >
> 
> I agree it's a TCP re-implementation on the application layer.
> 
> Fast-retransmit is not needed as it is not needed in TCP. However it
> solves drops faster, in DLM and GFS2 you would experience a stuck in
> some filesystem syscalls until the drop is resolved.

I was not talking about fast retransmits in general. My question was:
how could a DLM fast retransmit even be triggered? This can only happen
when some DLM messages get lost or are received out of order. But the
transport layer guarantees lossless and in order delivery. Even in case
of a reconnection, I can't see how the peer could get holes in its
receive queue. What am I missing?

> > It seems to me that the only time DLM might need to retransmit data, is
> > when recovering from a connection failure. So why can't we just resend
> > unacknowledged data at reconnection time? That'd probably simplify the
> > code a lot (no need to maintain a retransmission timeout on TX, no need
> > to handle sequence numbers that are in the future on RX).
> >
> 
> I can try to remove the timer, timeout and do the above approach to
> retransmit at reconnect. Then I test it again and I will report back
> to see if it works or why we have other problems.

I see you've posted a new version of the patch series. I'll look at it
soon.

> > Also, couldn't we set the DLM sequence numbers in
> > dlm_midcomms_commit_buffer_3_2() rather than using a callback function
> > in dlm_lowcomms_new_buffer()?
> >
> 
> That has something to do with the internal buffer allocator. The order
> is defined in dlm_lowcomms_new_buffer() and there is an internal lock
> which needs to be held there. I can't do this without somehow making
> the lock accessible in dlm_midcomms_commit_buffer_3_2(). The code is
> also optimized that the lock isn't held during kmalloc().

I missed that. Thanks for the explanation.

> > Finally, I wonder if we could avoid adding DLM sequence numbers
> > entirely. Have you considered using the TCP_REPAIR infrastructure to
> > retrieve the send queue of the failed socket and resend that data once
> > the new socket is connected?
> >
> 
> Yes, I looked into TCP_REPAIR at first and I agree it can be used to
> solve this problem. However TCP_REPAIR can be used as a part of a more
> generic solution, there needs to be something "additional handling"
> done e.g. additional socket options to let the application layer save
> states before receiving errors. I am also concerned how it would work
> with haproxy, etc. It also has some advantages to restore window size,
> etc. I was thinking about submitting a proposal to the next netdevconf
> after doing some experiments with it.

Well, I was thinking of using TCP_REPAIR only to retrieve the send
queue. I didn't mean to avoid the hand-check like in the CRIU use case.
A TCP connection break is a sign that something went wrong between the
peers, so a full reconnection is in order IMHO.

> We decided to solve this problem at the application layer first, then
> look into how to make a more generic solution. For me, I am not sure
> how cifs deals with reconnects or it just allows drops.

Indeed, it'd be interesting to know how other protocols handle this
case.

> > Please correct me if the above suggestions don't make sense. I'm not
> > familiar with DLM so I can very well be missing important points.
> >
> 
> No problem, it does make sense for me.
> 
> - Alex
>