All of lore.kernel.org
 help / color / mirror / Atom feed
* [regression 6.1.67] dlm: cannot start dlm midcomms -97 after backport of e9cdebbe23f1 ("dlm: use kernel_connect() and kernel_bind()")
       [not found] <38f51dbb-65aa-4ec2-bed2-e914aef27d25@vrvis.at>
@ 2024-02-07 10:39 ` Salvatore Bonaccorso
  2024-02-07 18:33   ` Jordan Rife
  0 siblings, 1 reply; 9+ messages in thread
From: Salvatore Bonaccorso @ 2024-02-07 10:39 UTC (permalink / raw)
  To: Valentin Kleibel, Jordan Rife, David Teigland, Alexander Aring
  Cc: 1063338, gfs2, linux-kernel, stable, gregkh, regressions

Hi Valentin, hi all

[This is about a regression reported in Debian for 6.1.67]

On Tue, Feb 06, 2024 at 01:00:11PM +0100, Valentin Kleibel wrote:
> Package: linux-image-amd64
> Version: 6.1.76+1
> Source: linux
> Source-Version: 6.1.76+1
> Severity: important
> Control: notfound -1 6.6.15-2
> 
> Dear Maintainers,
> 
> We discovered a bug affecting dlm that prevents any tcp communications by
> dlm when booted with debian kernel 6.1.76-1.
> 
> Dlm startup works (corosync-cpgtool shows the dlm:controld group with all
> expected nodes) but as soon as we try to add a lockspace dmesg shows:
> ```
> dlm: Using TCP for communications
> dlm: cannot start dlm midcomms -97
> ```
> 
> It seems that commit "dlm: use kernel_connect() and kernel_bind()"
> (e9cdebbe) was merged to 6.1.
> 
> Checking the code it seems that the changed function dlm_tcp_listen_bind()
> fails with exit code 97 (EAFNOSUPPORT)
> It is called from
> 
> dlm/lockspace.c: threads_start() -> dlm_midcomms_start()
> dlm/midcomms.c: dlm_midcomms_start() -> dlm_lowcomms_start()
> dlm/lowcomms.c: dlm_lowcomms_start() -> dlm_listen_for_all() ->
> dlm_proto_ops->listen_bind() = dlm_tcp_listen_bind()
> 
> The error code is returned all the way to threads_start() where the error
> message is emmitted.
> 
> Booting with the unsigned kernel from testing (6.6.15-2), which also
> contains this commit, works without issues.
> 
> I'm not sure what additional changes are required to get this working or if
> rolling back this change is an option.
> 
> We'd be happy to test patches that might fix this issue.

Thanks for your report. So we have a 6.1.76 specific regression for
the backport of e9cdebbe23f1 ("dlm: use kernel_connect() and
kernel_bind()") .

Let's loop in the upstream regression list for tracking and people
involved for the subsystem to see if the issue can be identified. As
it is working for 6.6.15 which includes the commit backport as well it
might be very well that a prerequisite is missing.

# annotate regression with 6.1.y specific commit
#regzbot ^introduced e11dea8f503341507018b60906c4a9e7332f3663
#regzbot link: https://bugs.debian.org/1063338

Any ideas?

Regards,
Salvatore

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [regression 6.1.67] dlm: cannot start dlm midcomms -97 after backport of e9cdebbe23f1 ("dlm: use kernel_connect() and kernel_bind()")
  2024-02-07 10:39 ` [regression 6.1.67] dlm: cannot start dlm midcomms -97 after backport of e9cdebbe23f1 ("dlm: use kernel_connect() and kernel_bind()") Salvatore Bonaccorso
@ 2024-02-07 18:33   ` Jordan Rife
  2024-02-07 21:27     ` Alexander Aring
  2024-02-08 11:37     ` Valentin Kleibel
  0 siblings, 2 replies; 9+ messages in thread
From: Jordan Rife @ 2024-02-07 18:33 UTC (permalink / raw)
  To: Salvatore Bonaccorso
  Cc: Valentin Kleibel, David Teigland, Alexander Aring, 1063338, gfs2,
	linux-kernel, stable, gregkh, regressions

On Wed, Feb 7, 2024 at 2:39 AM Salvatore Bonaccorso <carnil@debian.org> wrote:
>
> Hi Valentin, hi all
>
> [This is about a regression reported in Debian for 6.1.67]
>
> On Tue, Feb 06, 2024 at 01:00:11PM +0100, Valentin Kleibel wrote:
> > Package: linux-image-amd64
> > Version: 6.1.76+1
> > Source: linux
> > Source-Version: 6.1.76+1
> > Severity: important
> > Control: notfound -1 6.6.15-2
> >
> > Dear Maintainers,
> >
> > We discovered a bug affecting dlm that prevents any tcp communications by
> > dlm when booted with debian kernel 6.1.76-1.
> >
> > Dlm startup works (corosync-cpgtool shows the dlm:controld group with all
> > expected nodes) but as soon as we try to add a lockspace dmesg shows:
> > ```
> > dlm: Using TCP for communications
> > dlm: cannot start dlm midcomms -97
> > ```
> >
> > It seems that commit "dlm: use kernel_connect() and kernel_bind()"
> > (e9cdebbe) was merged to 6.1.
> >
> > Checking the code it seems that the changed function dlm_tcp_listen_bind()
> > fails with exit code 97 (EAFNOSUPPORT)
> > It is called from
> >
> > dlm/lockspace.c: threads_start() -> dlm_midcomms_start()
> > dlm/midcomms.c: dlm_midcomms_start() -> dlm_lowcomms_start()
> > dlm/lowcomms.c: dlm_lowcomms_start() -> dlm_listen_for_all() ->
> > dlm_proto_ops->listen_bind() = dlm_tcp_listen_bind()
> >
> > The error code is returned all the way to threads_start() where the error
> > message is emmitted.
> >
> > Booting with the unsigned kernel from testing (6.6.15-2), which also
> > contains this commit, works without issues.
> >
> > I'm not sure what additional changes are required to get this working or if
> > rolling back this change is an option.
> >
> > We'd be happy to test patches that might fix this issue.
>
> Thanks for your report. So we have a 6.1.76 specific regression for
> the backport of e9cdebbe23f1 ("dlm: use kernel_connect() and
> kernel_bind()") .
>
> Let's loop in the upstream regression list for tracking and people
> involved for the subsystem to see if the issue can be identified. As
> it is working for 6.6.15 which includes the commit backport as well it
> might be very well that a prerequisite is missing.
>
> # annotate regression with 6.1.y specific commit
> #regzbot ^introduced e11dea8f503341507018b60906c4a9e7332f3663
> #regzbot link: https://bugs.debian.org/1063338
>
> Any ideas?
>
> Regards,
> Salvatore


Just a quick look comparing dlm_tcp_listen_bind between the latest 6.1
and 6.6 stable branches,
it looks like there is a mismatch here with the dlm_local_addr[0] parameter.

6.1
----

static int dlm_tcp_listen_bind(struct socket *sock)
{
int addr_len;

/* Bind to our port */
make_sockaddr(dlm_local_addr[0], dlm_config.ci_tcp_port, &addr_len);
return kernel_bind(sock, (struct sockaddr *)&dlm_local_addr[0],
   addr_len);
}

6.6
----
static int dlm_tcp_listen_bind(struct socket *sock)
{
int addr_len;

/* Bind to our port */
make_sockaddr(&dlm_local_addr[0], dlm_config.ci_tcp_port, &addr_len);
return kernel_bind(sock, (struct sockaddr *)&dlm_local_addr[0],
   addr_len);
}

6.6 contains commit c51c9cd8 (fs: dlm: don't put dlm_local_addrs on heap) which
changed

static struct sockaddr_storage *dlm_local_addr[DLM_MAX_ADDR_COUNT];

to

static struct sockaddr_storage dlm_local_addr[DLM_MAX_ADDR_COUNT];

It looks like kernel_bind() in 6.1 needs to be modified to match.


-Jordan

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [regression 6.1.67] dlm: cannot start dlm midcomms -97 after backport of e9cdebbe23f1 ("dlm: use kernel_connect() and kernel_bind()")
  2024-02-07 18:33   ` Jordan Rife
@ 2024-02-07 21:27     ` Alexander Aring
  2024-02-07 21:39       ` [regression 6.1.76] " Salvatore Bonaccorso
  2024-02-08 11:37     ` Valentin Kleibel
  1 sibling, 1 reply; 9+ messages in thread
From: Alexander Aring @ 2024-02-07 21:27 UTC (permalink / raw)
  To: Jordan Rife
  Cc: Salvatore Bonaccorso, Valentin Kleibel, David Teigland, 1063338,
	gfs2, linux-kernel, stable, gregkh, regressions

Hi,

On Wed, Feb 7, 2024 at 1:33 PM Jordan Rife <jrife@google.com> wrote:
>
> On Wed, Feb 7, 2024 at 2:39 AM Salvatore Bonaccorso <carnil@debian.org> wrote:
> >
> > Hi Valentin, hi all
> >
> > [This is about a regression reported in Debian for 6.1.67]
> >
> > On Tue, Feb 06, 2024 at 01:00:11PM +0100, Valentin Kleibel wrote:
> > > Package: linux-image-amd64
> > > Version: 6.1.76+1
> > > Source: linux
> > > Source-Version: 6.1.76+1
> > > Severity: important
> > > Control: notfound -1 6.6.15-2
> > >
> > > Dear Maintainers,
> > >
> > > We discovered a bug affecting dlm that prevents any tcp communications by
> > > dlm when booted with debian kernel 6.1.76-1.
> > >
> > > Dlm startup works (corosync-cpgtool shows the dlm:controld group with all
> > > expected nodes) but as soon as we try to add a lockspace dmesg shows:
> > > ```
> > > dlm: Using TCP for communications
> > > dlm: cannot start dlm midcomms -97
> > > ```
> > >
> > > It seems that commit "dlm: use kernel_connect() and kernel_bind()"
> > > (e9cdebbe) was merged to 6.1.
> > >
> > > Checking the code it seems that the changed function dlm_tcp_listen_bind()
> > > fails with exit code 97 (EAFNOSUPPORT)
> > > It is called from
> > >
> > > dlm/lockspace.c: threads_start() -> dlm_midcomms_start()
> > > dlm/midcomms.c: dlm_midcomms_start() -> dlm_lowcomms_start()
> > > dlm/lowcomms.c: dlm_lowcomms_start() -> dlm_listen_for_all() ->
> > > dlm_proto_ops->listen_bind() = dlm_tcp_listen_bind()
> > >
> > > The error code is returned all the way to threads_start() where the error
> > > message is emmitted.
> > >
> > > Booting with the unsigned kernel from testing (6.6.15-2), which also
> > > contains this commit, works without issues.
> > >
> > > I'm not sure what additional changes are required to get this working or if
> > > rolling back this change is an option.
> > >
> > > We'd be happy to test patches that might fix this issue.
> >
> > Thanks for your report. So we have a 6.1.76 specific regression for
> > the backport of e9cdebbe23f1 ("dlm: use kernel_connect() and
> > kernel_bind()") .
> >
> > Let's loop in the upstream regression list for tracking and people
> > involved for the subsystem to see if the issue can be identified. As
> > it is working for 6.6.15 which includes the commit backport as well it
> > might be very well that a prerequisite is missing.
> >
> > # annotate regression with 6.1.y specific commit
> > #regzbot ^introduced e11dea8f503341507018b60906c4a9e7332f3663
> > #regzbot link: https://bugs.debian.org/1063338
> >
> > Any ideas?
> >
> > Regards,
> > Salvatore
>
>
> Just a quick look comparing dlm_tcp_listen_bind between the latest 6.1
> and 6.6 stable branches,
> it looks like there is a mismatch here with the dlm_local_addr[0] parameter.
>
> 6.1
> ----
>
> static int dlm_tcp_listen_bind(struct socket *sock)
> {
> int addr_len;
>
> /* Bind to our port */
> make_sockaddr(dlm_local_addr[0], dlm_config.ci_tcp_port, &addr_len);
> return kernel_bind(sock, (struct sockaddr *)&dlm_local_addr[0],
>    addr_len);
> }
>
> 6.6
> ----
> static int dlm_tcp_listen_bind(struct socket *sock)
> {
> int addr_len;
>
> /* Bind to our port */
> make_sockaddr(&dlm_local_addr[0], dlm_config.ci_tcp_port, &addr_len);
> return kernel_bind(sock, (struct sockaddr *)&dlm_local_addr[0],
>    addr_len);
> }
>
> 6.6 contains commit c51c9cd8 (fs: dlm: don't put dlm_local_addrs on heap) which
> changed
>
> static struct sockaddr_storage *dlm_local_addr[DLM_MAX_ADDR_COUNT];
>
> to
>
> static struct sockaddr_storage dlm_local_addr[DLM_MAX_ADDR_COUNT];
>
> It looks like kernel_bind() in 6.1 needs to be modified to match.
>

makes sense. I tried to cherry-pick e9cdebbe23f1 ("dlm: use
kernel_connect() and kernel_bind()") on v6.1.67 as I don't see it
there. It failed and does not apply cleanly.

Are we talking here about a debian kernel specific backport? If so,
maybe somebody missed to modify those parts you mentioned.

- Alex


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [regression 6.1.76] dlm: cannot start dlm midcomms -97 after backport of e9cdebbe23f1 ("dlm: use kernel_connect() and kernel_bind()")
  2024-02-07 21:27     ` Alexander Aring
@ 2024-02-07 21:39       ` Salvatore Bonaccorso
  0 siblings, 0 replies; 9+ messages in thread
From: Salvatore Bonaccorso @ 2024-02-07 21:39 UTC (permalink / raw)
  To: Alexander Aring
  Cc: Jordan Rife, Valentin Kleibel, David Teigland, 1063338, gfs2,
	linux-kernel, stable, gregkh, regressions

Hi Alexander,

On Wed, Feb 07, 2024 at 04:27:48PM -0500, Alexander Aring wrote:
> Hi,
> 
> On Wed, Feb 7, 2024 at 1:33 PM Jordan Rife <jrife@google.com> wrote:
> >
> > On Wed, Feb 7, 2024 at 2:39 AM Salvatore Bonaccorso <carnil@debian.org> wrote:
> > >
> > > Hi Valentin, hi all
> > >
> > > [This is about a regression reported in Debian for 6.1.67]
> > >
> > > On Tue, Feb 06, 2024 at 01:00:11PM +0100, Valentin Kleibel wrote:
> > > > Package: linux-image-amd64
> > > > Version: 6.1.76+1
> > > > Source: linux
> > > > Source-Version: 6.1.76+1
> > > > Severity: important
> > > > Control: notfound -1 6.6.15-2
> > > >
> > > > Dear Maintainers,
> > > >
> > > > We discovered a bug affecting dlm that prevents any tcp communications by
> > > > dlm when booted with debian kernel 6.1.76-1.
> > > >
> > > > Dlm startup works (corosync-cpgtool shows the dlm:controld group with all
> > > > expected nodes) but as soon as we try to add a lockspace dmesg shows:
> > > > ```
> > > > dlm: Using TCP for communications
> > > > dlm: cannot start dlm midcomms -97
> > > > ```
> > > >
> > > > It seems that commit "dlm: use kernel_connect() and kernel_bind()"
> > > > (e9cdebbe) was merged to 6.1.
> > > >
> > > > Checking the code it seems that the changed function dlm_tcp_listen_bind()
> > > > fails with exit code 97 (EAFNOSUPPORT)
> > > > It is called from
> > > >
> > > > dlm/lockspace.c: threads_start() -> dlm_midcomms_start()
> > > > dlm/midcomms.c: dlm_midcomms_start() -> dlm_lowcomms_start()
> > > > dlm/lowcomms.c: dlm_lowcomms_start() -> dlm_listen_for_all() ->
> > > > dlm_proto_ops->listen_bind() = dlm_tcp_listen_bind()
> > > >
> > > > The error code is returned all the way to threads_start() where the error
> > > > message is emmitted.
> > > >
> > > > Booting with the unsigned kernel from testing (6.6.15-2), which also
> > > > contains this commit, works without issues.
> > > >
> > > > I'm not sure what additional changes are required to get this working or if
> > > > rolling back this change is an option.
> > > >
> > > > We'd be happy to test patches that might fix this issue.
> > >
> > > Thanks for your report. So we have a 6.1.76 specific regression for
> > > the backport of e9cdebbe23f1 ("dlm: use kernel_connect() and
> > > kernel_bind()") .
> > >
> > > Let's loop in the upstream regression list for tracking and people
> > > involved for the subsystem to see if the issue can be identified. As
> > > it is working for 6.6.15 which includes the commit backport as well it
> > > might be very well that a prerequisite is missing.
> > >
> > > # annotate regression with 6.1.y specific commit
> > > #regzbot ^introduced e11dea8f503341507018b60906c4a9e7332f3663
> > > #regzbot link: https://bugs.debian.org/1063338
> > >
> > > Any ideas?
> > >
> > > Regards,
> > > Salvatore
> >
> >
> > Just a quick look comparing dlm_tcp_listen_bind between the latest 6.1
> > and 6.6 stable branches,
> > it looks like there is a mismatch here with the dlm_local_addr[0] parameter.
> >
> > 6.1
> > ----
> >
> > static int dlm_tcp_listen_bind(struct socket *sock)
> > {
> > int addr_len;
> >
> > /* Bind to our port */
> > make_sockaddr(dlm_local_addr[0], dlm_config.ci_tcp_port, &addr_len);
> > return kernel_bind(sock, (struct sockaddr *)&dlm_local_addr[0],
> >    addr_len);
> > }
> >
> > 6.6
> > ----
> > static int dlm_tcp_listen_bind(struct socket *sock)
> > {
> > int addr_len;
> >
> > /* Bind to our port */
> > make_sockaddr(&dlm_local_addr[0], dlm_config.ci_tcp_port, &addr_len);
> > return kernel_bind(sock, (struct sockaddr *)&dlm_local_addr[0],
> >    addr_len);
> > }
> >
> > 6.6 contains commit c51c9cd8 (fs: dlm: don't put dlm_local_addrs on heap) which
> > changed
> >
> > static struct sockaddr_storage *dlm_local_addr[DLM_MAX_ADDR_COUNT];
> >
> > to
> >
> > static struct sockaddr_storage dlm_local_addr[DLM_MAX_ADDR_COUNT];
> >
> > It looks like kernel_bind() in 6.1 needs to be modified to match.
> >
> 
> makes sense. I tried to cherry-pick e9cdebbe23f1 ("dlm: use
> kernel_connect() and kernel_bind()") on v6.1.67 as I don't see it
> there. It failed and does not apply cleanly.
> 
> Are we talking here about a debian kernel specific backport? If so,
> maybe somebody missed to modify those parts you mentioned.

Thanks all for looking into it.

No it's not a Debian specific backport, e9cdebbe23f1 ("dlm: use
kernel_connect() and kernel_bind()") got in fact backported upstream
in 6.1.76, 6.6.15 and 6.7.3. The respective commits are:

v6.1.76: e11dea8f503341507018b60906c4a9e7332f3663 dlm: use kernel_connect() and kernel_bind()
v6.6.15: c018ab3e31b16ff97b9b95b69904104c9fcca95b dlm: use kernel_connect() and kernel_bind()
v6.7.3: 4ecf1864f2076872b7aea29d463e785ef6fc9909 dlm: use kernel_connect() and kernel_bind()
v6.8-rc1: e9cdebbe23f1aa9a1caea169862f479ab3fa2773 dlm: use kernel_connect() and kernel_bind()

But for the 6.1.76 case there is the above regression (while it works
for 6.6.15 as confirmed by the reporter).

I'm very sorry I see where I have caused you confusion: The regression
is in 6.1.*76* not 6.1.*67* and I misstyped the version in two places.

Regards,
Salvatore

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [regression 6.1.76] dlm: cannot start dlm midcomms -97 after backport of e9cdebbe23f1 ("dlm: use kernel_connect() and kernel_bind()")
  2024-02-07 18:33   ` Jordan Rife
  2024-02-07 21:27     ` Alexander Aring
@ 2024-02-08 11:37     ` Valentin Kleibel
  2024-02-08 17:42       ` Jordan Rife
  1 sibling, 1 reply; 9+ messages in thread
From: Valentin Kleibel @ 2024-02-08 11:37 UTC (permalink / raw)
  To: Jordan Rife, Salvatore Bonaccorso
  Cc: David Teigland, Alexander Aring, 1063338, gfs2, linux-kernel,
	stable, gregkh, regressions

[-- Attachment #1: Type: text/plain, Size: 1459 bytes --]

Hi Jordan, hi all

> Just a quick look comparing dlm_tcp_listen_bind between the latest 6.1
> and 6.6 stable branches,
> it looks like there is a mismatch here with the dlm_local_addr[0] parameter.
> 
> 6.1
> ----
> 
> static int dlm_tcp_listen_bind(struct socket *sock)
> {
> int addr_len;
> 
> /* Bind to our port */
> make_sockaddr(dlm_local_addr[0], dlm_config.ci_tcp_port, &addr_len);
> return kernel_bind(sock, (struct sockaddr *)&dlm_local_addr[0],
>     addr_len);
> }
> 
> 6.6
> ----
> static int dlm_tcp_listen_bind(struct socket *sock)
> {
> int addr_len;
> 
> /* Bind to our port */
> make_sockaddr(&dlm_local_addr[0], dlm_config.ci_tcp_port, &addr_len);
> return kernel_bind(sock, (struct sockaddr *)&dlm_local_addr[0],
>     addr_len);
> }
> 
> 6.6 contains commit c51c9cd8 (fs: dlm: don't put dlm_local_addrs on heap) which
> changed
> 
> static struct sockaddr_storage *dlm_local_addr[DLM_MAX_ADDR_COUNT];
> 
> to
> 
> static struct sockaddr_storage dlm_local_addr[DLM_MAX_ADDR_COUNT];
> 
> It looks like kernel_bind() in 6.1 needs to be modified to match.

We tried to apply commit c51c9cd8 (fs: dlm: don't put dlm_local_addrs on 
heap) to the debian kernel 6.1.76 and came up with the attached patch. 
Besides the different offsets there is a slight change dlm_tcp_bind() 
where in 6.1.76 kernel_bind() is used instead of sock->ops->bind() in 
the original commit.

This patch solves the issue we experienced.

Thanks for your help,
Valentin

[-- Attachment #2: dlm_dont_put_dlm_local_addrs_on_heap.patch --]
[-- Type: text/x-patch, Size: 3717 bytes --]

--- a/fs/dlm/lowcomms.c	2024-02-08 10:42:19.328861479 +0100
+++ b/fs/dlm/lowcomms.c	2024-02-08 10:57:22.900463149 +0100
@@ -174,7 +174,7 @@
 static DEFINE_SPINLOCK(dlm_node_addrs_spin);
 
 static struct listen_connection listen_con;
-static struct sockaddr_storage *dlm_local_addr[DLM_MAX_ADDR_COUNT];
+static struct sockaddr_storage dlm_local_addr[DLM_MAX_ADDR_COUNT];
 static int dlm_local_count;
 int dlm_allow_conn;
 
@@ -398,7 +398,7 @@
 	if (!sa_out)
 		return 0;
 
-	if (dlm_local_addr[0]->ss_family == AF_INET) {
+	if (dlm_local_addr[0].ss_family == AF_INET) {
 		struct sockaddr_in *in4  = (struct sockaddr_in *) &sas;
 		struct sockaddr_in *ret4 = (struct sockaddr_in *) sa_out;
 		ret4->sin_addr.s_addr = in4->sin_addr.s_addr;
@@ -727,7 +727,7 @@
 static void make_sockaddr(struct sockaddr_storage *saddr, uint16_t port,
 			  int *addr_len)
 {
-	saddr->ss_family =  dlm_local_addr[0]->ss_family;
+	saddr->ss_family =  dlm_local_addr[0].ss_family;
 	if (saddr->ss_family == AF_INET) {
 		struct sockaddr_in *in4_addr = (struct sockaddr_in *)saddr;
 		in4_addr->sin_port = cpu_to_be16(port);
@@ -1167,7 +1167,7 @@
 	int i, addr_len, result = 0;
 
 	for (i = 0; i < dlm_local_count; i++) {
-		memcpy(&localaddr, dlm_local_addr[i], sizeof(localaddr));
+		memcpy(&localaddr, &dlm_local_addr[i], sizeof(localaddr));
 		make_sockaddr(&localaddr, port, &addr_len);
 
 		if (!i)
@@ -1187,7 +1187,7 @@
 /* Get local addresses */
 static void init_local(void)
 {
-	struct sockaddr_storage sas, *addr;
+	struct sockaddr_storage sas;
 	int i;
 
 	dlm_local_count = 0;
@@ -1195,21 +1195,10 @@
 		if (dlm_our_addr(&sas, i))
 			break;
 
-		addr = kmemdup(&sas, sizeof(*addr), GFP_NOFS);
-		if (!addr)
-			break;
-		dlm_local_addr[dlm_local_count++] = addr;
+		memcpy(&dlm_local_addr[dlm_local_count++], &sas, sizeof(sas));
 	}
 }
 
-static void deinit_local(void)
-{
-	int i;
-
-	for (i = 0; i < dlm_local_count; i++)
-		kfree(dlm_local_addr[i]);
-}
-
 static struct writequeue_entry *new_writequeue_entry(struct connection *con)
 {
 	struct writequeue_entry *entry;
@@ -1575,7 +1564,7 @@
 	}
 
 	/* Create a socket to communicate with */
-	result = sock_create_kern(&init_net, dlm_local_addr[0]->ss_family,
+	result = sock_create_kern(&init_net, dlm_local_addr[0].ss_family,
 				  SOCK_STREAM, dlm_proto_ops->proto, &sock);
 	if (result < 0)
 		goto socket_err;
@@ -1786,7 +1775,6 @@
 	foreach_conn(free_conn);
 	srcu_read_unlock(&connections_srcu, idx);
 	work_stop();
-	deinit_local();
 
 	dlm_proto_ops = NULL;
 }
@@ -1803,7 +1791,7 @@
 	if (result < 0)
 		return result;
 
-	result = sock_create_kern(&init_net, dlm_local_addr[0]->ss_family,
+	result = sock_create_kern(&init_net, dlm_local_addr[0].ss_family,
 				  SOCK_STREAM, dlm_proto_ops->proto, &sock);
 	if (result < 0) {
 		log_print("Can't create comms socket: %d", result);
@@ -1842,7 +1830,7 @@
 	/* Bind to our cluster-known address connecting to avoid
 	 * routing problems.
 	 */
-	memcpy(&src_addr, dlm_local_addr[0], sizeof(src_addr));
+	memcpy(&src_addr, &dlm_local_addr[0], sizeof(src_addr));
 	make_sockaddr(&src_addr, 0, &addr_len);
 
 	result = kernel_bind(sock, (struct sockaddr *)&src_addr,
@@ -1899,7 +1887,7 @@
 	int addr_len;
 
 	/* Bind to our port */
-	make_sockaddr(dlm_local_addr[0], dlm_config.ci_tcp_port, &addr_len);
+	make_sockaddr(&dlm_local_addr[0], dlm_config.ci_tcp_port, &addr_len);
 	return kernel_bind(sock, (struct sockaddr *)&dlm_local_addr[0],
 			   addr_len);
 }
@@ -1992,7 +1980,7 @@
 
 	error = work_start();
 	if (error)
-		goto fail_local;
+		goto fail;
 
 	dlm_allow_conn = 1;
 
@@ -2022,8 +2010,6 @@
 fail_proto_ops:
 	dlm_allow_conn = 0;
 	work_stop();
-fail_local:
-	deinit_local();
 fail:
 	return error;
 }

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [regression 6.1.76] dlm: cannot start dlm midcomms -97 after backport of e9cdebbe23f1 ("dlm: use kernel_connect() and kernel_bind()")
  2024-02-08 11:37     ` Valentin Kleibel
@ 2024-02-08 17:42       ` Jordan Rife
  2024-02-08 21:17         ` Jordan Rife
  0 siblings, 1 reply; 9+ messages in thread
From: Jordan Rife @ 2024-02-08 17:42 UTC (permalink / raw)
  To: Valentin Kleibel
  Cc: Salvatore Bonaccorso, David Teigland, Alexander Aring, 1063338,
	gfs2, linux-kernel, stable, gregkh, regressions

On Thu, Feb 8, 2024 at 3:37 AM Valentin Kleibel <valentin@vrvis.at> wrote:
>
> Hi Jordan, hi all
>
> > Just a quick look comparing dlm_tcp_listen_bind between the latest 6.1
> > and 6.6 stable branches,
> > it looks like there is a mismatch here with the dlm_local_addr[0] parameter.
> >
> > 6.1
> > ----
> >
> > static int dlm_tcp_listen_bind(struct socket *sock)
> > {
> > int addr_len;
> >
> > /* Bind to our port */
> > make_sockaddr(dlm_local_addr[0], dlm_config.ci_tcp_port, &addr_len);
> > return kernel_bind(sock, (struct sockaddr *)&dlm_local_addr[0],
> >     addr_len);
> > }
> >
> > 6.6
> > ----
> > static int dlm_tcp_listen_bind(struct socket *sock)
> > {
> > int addr_len;
> >
> > /* Bind to our port */
> > make_sockaddr(&dlm_local_addr[0], dlm_config.ci_tcp_port, &addr_len);
> > return kernel_bind(sock, (struct sockaddr *)&dlm_local_addr[0],
> >     addr_len);
> > }
> >
> > 6.6 contains commit c51c9cd8 (fs: dlm: don't put dlm_local_addrs on heap) which
> > changed
> >
> > static struct sockaddr_storage *dlm_local_addr[DLM_MAX_ADDR_COUNT];
> >
> > to
> >
> > static struct sockaddr_storage dlm_local_addr[DLM_MAX_ADDR_COUNT];
> >
> > It looks like kernel_bind() in 6.1 needs to be modified to match.
>
> We tried to apply commit c51c9cd8 (fs: dlm: don't put dlm_local_addrs on
> heap) to the debian kernel 6.1.76 and came up with the attached patch.
> Besides the different offsets there is a slight change dlm_tcp_bind()
> where in 6.1.76 kernel_bind() is used instead of sock->ops->bind() in
> the original commit.
>
> This patch solves the issue we experienced.
>
> Thanks for your help,
> Valentin

Good to hear that works for you! We should fix this in the 6.1 stable
kernel as well.

IMO it may be less risky and simpler to fix the backport of my patch
e9cdebbe23f1 ("dlm: use kernel_connect() and
kernel_bind()") and just switch (struct sockaddr *)&dlm_local_addr[0]
to (struct sockaddr *)dlm_local_addr[0]
in the call to kernel_bind() rather than backporting c51c9cd8 (fs:
dlm: don't put dlm_local_addrs on
heap) to 6.1.

I will have some time soon to fix the 6.1 backport, but it may make
sense just to revert in the meantime.

-Jordan

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [regression 6.1.76] dlm: cannot start dlm midcomms -97 after backport of e9cdebbe23f1 ("dlm: use kernel_connect() and kernel_bind()")
  2024-02-08 17:42       ` Jordan Rife
@ 2024-02-08 21:17         ` Jordan Rife
  2024-02-09 11:06           ` Valentin Kleibel
  0 siblings, 1 reply; 9+ messages in thread
From: Jordan Rife @ 2024-02-08 21:17 UTC (permalink / raw)
  To: Valentin Kleibel
  Cc: Salvatore Bonaccorso, David Teigland, Alexander Aring, 1063338,
	gfs2, linux-kernel, stable, gregkh, regressions

[-- Attachment #1: Type: text/plain, Size: 2558 bytes --]

Hi Valentin,

Would you be able to confirm that the attached patch fixes your issue as well?

-Jordan

On Thu, Feb 8, 2024 at 9:42 AM Jordan Rife <jrife@google.com> wrote:
>
> On Thu, Feb 8, 2024 at 3:37 AM Valentin Kleibel <valentin@vrvis.at> wrote:
> >
> > Hi Jordan, hi all
> >
> > > Just a quick look comparing dlm_tcp_listen_bind between the latest 6.1
> > > and 6.6 stable branches,
> > > it looks like there is a mismatch here with the dlm_local_addr[0] parameter.
> > >
> > > 6.1
> > > ----
> > >
> > > static int dlm_tcp_listen_bind(struct socket *sock)
> > > {
> > > int addr_len;
> > >
> > > /* Bind to our port */
> > > make_sockaddr(dlm_local_addr[0], dlm_config.ci_tcp_port, &addr_len);
> > > return kernel_bind(sock, (struct sockaddr *)&dlm_local_addr[0],
> > >     addr_len);
> > > }
> > >
> > > 6.6
> > > ----
> > > static int dlm_tcp_listen_bind(struct socket *sock)
> > > {
> > > int addr_len;
> > >
> > > /* Bind to our port */
> > > make_sockaddr(&dlm_local_addr[0], dlm_config.ci_tcp_port, &addr_len);
> > > return kernel_bind(sock, (struct sockaddr *)&dlm_local_addr[0],
> > >     addr_len);
> > > }
> > >
> > > 6.6 contains commit c51c9cd8 (fs: dlm: don't put dlm_local_addrs on heap) which
> > > changed
> > >
> > > static struct sockaddr_storage *dlm_local_addr[DLM_MAX_ADDR_COUNT];
> > >
> > > to
> > >
> > > static struct sockaddr_storage dlm_local_addr[DLM_MAX_ADDR_COUNT];
> > >
> > > It looks like kernel_bind() in 6.1 needs to be modified to match.
> >
> > We tried to apply commit c51c9cd8 (fs: dlm: don't put dlm_local_addrs on
> > heap) to the debian kernel 6.1.76 and came up with the attached patch.
> > Besides the different offsets there is a slight change dlm_tcp_bind()
> > where in 6.1.76 kernel_bind() is used instead of sock->ops->bind() in
> > the original commit.
> >
> > This patch solves the issue we experienced.
> >
> > Thanks for your help,
> > Valentin
>
> Good to hear that works for you! We should fix this in the 6.1 stable
> kernel as well.
>
> IMO it may be less risky and simpler to fix the backport of my patch
> e9cdebbe23f1 ("dlm: use kernel_connect() and
> kernel_bind()") and just switch (struct sockaddr *)&dlm_local_addr[0]
> to (struct sockaddr *)dlm_local_addr[0]
> in the call to kernel_bind() rather than backporting c51c9cd8 (fs:
> dlm: don't put dlm_local_addrs on
> heap) to 6.1.
>
> I will have some time soon to fix the 6.1 backport, but it may make
> sense just to revert in the meantime.
>
> -Jordan

[-- Attachment #2: 0001-dlm-Treat-dlm_local_addr-0-as-sockaddr_storage.patch --]
[-- Type: text/x-patch, Size: 1364 bytes --]

From dec5ffd309967e429b616a9d498037a5eb437c54 Mon Sep 17 00:00:00 2001
From: Jordan Rife <jrife@google.com>
Date: Thu, 8 Feb 2024 12:09:55 -0600
Subject: [PATCH] dlm: Treat dlm_local_addr[0] as sockaddr_storage *

Backport e11dea8 ("dlm: use kernel_connect() and kernel_bind()") to
Linux stable 6.1 caused a regression. The original patch expected
dlm_local_addrs[0] to be of type sockaddr_storage, because c51c9cd ("fs:
dlm: don't put dlm_local_addrs on heap") changed its type from
sockaddr_storage* to sockaddr_storage in Linux 6.5+ while in older Linux
versions this is still the original sockaddr_storage*.

Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1063338
Fixes: e11dea8 ("dlm: use kernel_connect() and kernel_bind()")
Signed-off-by: Jordan Rife <jrife@google.com>
---
 fs/dlm/lowcomms.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c
index 72f34f96d0155..8426073e73cf2 100644
--- a/fs/dlm/lowcomms.c
+++ b/fs/dlm/lowcomms.c
@@ -1900,7 +1900,7 @@ static int dlm_tcp_listen_bind(struct socket *sock)
 
 	/* Bind to our port */
 	make_sockaddr(dlm_local_addr[0], dlm_config.ci_tcp_port, &addr_len);
-	return kernel_bind(sock, (struct sockaddr *)&dlm_local_addr[0],
+	return kernel_bind(sock, (struct sockaddr *)dlm_local_addr[0],
 			   addr_len);
 }
 
-- 
2.43.0.687.g38aa6559b0-goog


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [regression 6.1.76] dlm: cannot start dlm midcomms -97 after backport of e9cdebbe23f1 ("dlm: use kernel_connect() and kernel_bind()")
  2024-02-08 21:17         ` Jordan Rife
@ 2024-02-09 11:06           ` Valentin Kleibel
  2024-02-09 16:28             ` Jordan Rife
  0 siblings, 1 reply; 9+ messages in thread
From: Valentin Kleibel @ 2024-02-09 11:06 UTC (permalink / raw)
  To: Jordan Rife
  Cc: Salvatore Bonaccorso, David Teigland, Alexander Aring, 1063338,
	gfs2, linux-kernel, stable, gregkh, regressions

Hi

> Would you be able to confirm that the attached patch fixes your issue as well?

Yes it does.

@debian maintainers: is it possible to include this patch in the next 
point release?

Thank you for your work,
Valentin

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [regression 6.1.76] dlm: cannot start dlm midcomms -97 after backport of e9cdebbe23f1 ("dlm: use kernel_connect() and kernel_bind()")
  2024-02-09 11:06           ` Valentin Kleibel
@ 2024-02-09 16:28             ` Jordan Rife
  0 siblings, 0 replies; 9+ messages in thread
From: Jordan Rife @ 2024-02-09 16:28 UTC (permalink / raw)
  To: Valentin Kleibel
  Cc: Salvatore Bonaccorso, David Teigland, Alexander Aring, 1063338,
	gfs2, linux-kernel, stable, gregkh, regressions

I sent this patch out to stable@vger.kernel.org. Everyone should be
CCd. Thanks for your help in confirming the fix works.

-Jordan

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2024-02-09 16:29 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <38f51dbb-65aa-4ec2-bed2-e914aef27d25@vrvis.at>
2024-02-07 10:39 ` [regression 6.1.67] dlm: cannot start dlm midcomms -97 after backport of e9cdebbe23f1 ("dlm: use kernel_connect() and kernel_bind()") Salvatore Bonaccorso
2024-02-07 18:33   ` Jordan Rife
2024-02-07 21:27     ` Alexander Aring
2024-02-07 21:39       ` [regression 6.1.76] " Salvatore Bonaccorso
2024-02-08 11:37     ` Valentin Kleibel
2024-02-08 17:42       ` Jordan Rife
2024-02-08 21:17         ` Jordan Rife
2024-02-09 11:06           ` Valentin Kleibel
2024-02-09 16:28             ` Jordan Rife

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.