From mboxrd@z Thu Jan 1 00:00:00 1970 From: Alexander Aring Date: Thu, 23 Jul 2020 10:49:07 -0400 Subject: [Cluster-devel] [PATCH dlm-next 3/4] fs: dlm: change handling of reconnects In-Reply-To: <20200723144908.271110-1-aahringo@redhat.com> References: <20200723144908.271110-1-aahringo@redhat.com> Message-ID: <20200723144908.271110-4-aahringo@redhat.com> List-Id: To: cluster-devel.redhat.com MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit This patch changes the handling of reconnects. At first we only close the connection related to the communication failure. If we get a new connection for an already existing connection we close the existing connection and take the new one. This patch improves significantly the stability of tcp connections while running "tcpkill -9 -i $IFACE port 21064" while generating a lot of dlm messages e.g. on a gfs2 mount and [0]. My test setup shows that a deadlock is "more" unlikely. Before this patch I wasn't able to get not a deadlock after 5 seconds. After this patch my observation is that it's more likely to survive after 5 seconds and more, but still a deadlock occurs after certain time. My guess is that there are still "segments" inside the tcp writequeue or retransmit queue which get dropped when receiving a tcp reset [1]. Hard to reproduce because the right message need to be inside these queues, which might even be in the 5 first seconds with this patch. [0] https://fedorapeople.org/cgit/teigland/public_git/dct-stuff.git/tree/fs/make_panic [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/ipv4/tcp_input.c?h=v5.8-rc6#n4122 Signed-off-by: Alexander Aring --- fs/dlm/lowcomms.c | 25 ++++++++++--------------- 1 file changed, 10 insertions(+), 15 deletions(-) diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c index 19b50d69babef..f5ac40db5e7c1 100644 --- a/fs/dlm/lowcomms.c +++ b/fs/dlm/lowcomms.c @@ -711,7 +711,7 @@ static int receive_from_sock(struct connection *con) out_close: mutex_unlock(&con->sock_mutex); if (ret != -EAGAIN) { - close_connection(con, true, true, false); + close_connection(con, false, true, false); /* Reconnect when there is something to send */ } /* Don't return success if we really got EOF */ @@ -802,21 +802,16 @@ static int accept_from_sock(struct connection *con) INIT_WORK(&othercon->swork, process_send_sockets); INIT_WORK(&othercon->rwork, process_recv_sockets); set_bit(CF_IS_OTHERCON, &othercon->flags); + } else { + /* close other sock con if we have something new */ + close_connection(othercon, false, true, false); } + mutex_lock_nested(&othercon->sock_mutex, 2); - if (!othercon->sock) { - newcon->othercon = othercon; - add_sock(newsock, othercon); - addcon = othercon; - mutex_unlock(&othercon->sock_mutex); - } - else { - printk("Extra connection from node %d attempted\n", nodeid); - result = -EAGAIN; - mutex_unlock(&othercon->sock_mutex); - mutex_unlock(&newcon->sock_mutex); - goto accept_err; - } + newcon->othercon = othercon; + add_sock(newsock, othercon); + addcon = othercon; + mutex_unlock(&othercon->sock_mutex); } else { newcon->rx_action = receive_from_sock; @@ -1413,7 +1408,7 @@ static void send_to_sock(struct connection *con) send_error: mutex_unlock(&con->sock_mutex); - close_connection(con, true, false, true); + close_connection(con, false, false, true); /* Requeue the send work. When the work daemon runs again, it will try a new connection, then call this function again. */ queue_work(send_workqueue, &con->swork); -- 2.26.2